Bug 137734 - Support of indexing in the filesystems
Support of indexing in the filesystems
Status: RESOLVED FIXED
Product: platform
Classification: Unclassified
Component: Filesystems
6.x
All All
: P2 (vote)
: 6.x
Assigned To: Jiri Skrivanek
issues@platform
: API_REVIEW_FAST
Depends on:
Blocks: 132388
  Show dependency treegraph
 
Reported: 2008-06-19 15:44 UTC by Jan Lahoda
Modified: 2008-12-22 11:43 UTC (History)
9 users (show)

See Also:
Issue Type: DEFECT
:


Attachments
Test module. (6.03 KB, application/octet-stream)
2008-06-19 15:46 UTC, Jan Lahoda
Details
Test data. (121.75 KB, application/octet-stream)
2008-06-19 15:46 UTC, Jan Lahoda
Details
Patch adding FileUtil.getMimeType(java.io.File). (13.79 KB, text/plain)
2008-06-27 12:05 UTC, Jiri Skrivanek
Details
Profiler snapshot of getMIMEType(FileObject). (38.01 KB, application/octet-stream)
2008-08-07 11:24 UTC, Jiri Skrivanek
Details
Profiler snapshot for getMIMEType(File). (38.69 KB, application/octet-stream)
2008-08-07 11:25 UTC, Jiri Skrivanek
Details
Patch implementing getMIMEType(File). (14.81 KB, text/plain)
2008-08-07 11:26 UTC, Jiri Skrivanek
Details
Patch implementing FileUtil.getMIMEType(FO, String...). (30.45 KB, text/plain)
2008-08-26 13:55 UTC, Jiri Skrivanek
Details
Patch implementing FileUtil.getMIMEType(FO, String...) - version2. (23.32 KB, text/plain)
2008-08-28 13:55 UTC, Jiri Skrivanek
Details
Patch implementing FileUtil.getMIMEType(FO, String...) - version 3. (23.60 KB, text/plain)
2008-08-28 16:44 UTC, Jiri Skrivanek
Details
Patch implementing FileUtil.getMIMEType(FO, String...) - version 4. (26.18 KB, text/plain)
2008-08-29 14:50 UTC, Jiri Skrivanek
Details
Patch implementing FileUtil.getMIMEType(FO, String...) - version 5. (26.29 KB, text/plain)
2008-08-29 18:07 UTC, Jiri Skrivanek
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Jan Lahoda 2008-06-19 15:44:28 UTC
When a source root is being indexed, for each file, it is necessary to find out which indexer should be indexing the
given file. It seems to be appropriate to base this decision primarily on mime type (further filtering may occur).

The problem is how to find out the mime type for a given file. FileUtil.getMIMEType requires a FileObject and is not
very fast (see below). Hardcoded map between extensions and mime type does not work very well, as the user can assign a
new extension to a mime type.

Please note, that it may not be required to find out the mime type exactly - if there are only indexers for text/x-java
and text/x-php5, it is irrelevant if the mime type is text/xml or text/x-ant+xml, the important thing is that it is
surely neither text/x-java nor text/x-php5.

So, the request is to provide an ability to find out the mime type of a file very quickly.

I did a quick test that scans through a few source roots and sorts the files according their extension/mime type.
Implementation through j.i.File took (having everything in OS caches) 17ms, implementation through FileObject and
FileUtil.getMIMEType took 155ms (subsequent attempts in the same instance of the IDE were faster, but this won't help
the initial scan usecase). There were 354 .java files (no other files were present), ~147 directories.

I am attaching:
-module41.zip - module that adds File/TEST menu item, which will run the test on "/tmp/outgoing" (will scan through
/tmp/outgoing/*/src) - there are different variants of tests in org.test.module41.TEST class.
-FileVSFileObject.zip - test data, unpack into /tmp
Comment 1 Jan Lahoda 2008-06-19 15:46:01 UTC
Created attachment 63111 [details]
Test module.
Comment 2 Jan Lahoda 2008-06-19 15:46:37 UTC
Created attachment 63112 [details]
Test data.
Comment 3 Petr Pisl 2008-06-19 21:37:35 UTC
This issue is very important from PHP point of view. There are many type of files, which can be preprocessed with PHP.
So   we are not able to predict, in which files user wants to treat as PHP. Also projects like php wiki, or php media
wiki has more then 1000 files so time of initial indexing is important for us as well. 
Comment 4 Jaroslav Tulach 2008-06-19 23:28:06 UTC
Let someone add FileUtil.getMimeType(java.io.File) that will just use the already premade CachedFileObject to guess 
the mime type. The CachedFileObject shall be modified to handle common operations during recognition (getExt, getName, 
getNameExt, getInputStream) just directly on the provided java.io.File, for the others it would really convert the 
File to FileObject, but as these operations are not used during mime type check, it should not matter if these are a 
bit slower.
Comment 5 Jiri Skrivanek 2008-06-26 09:20:27 UTC
I tried suggested solution with FileUtil.getMimeType(java.io.File) but it doesn't help in full IDE. Maybe it helps for
PHP IDE only where less resolvers are registered. Using slightly modified test by Honza Lahoda with php wiki project
(~450 php files and other files) I got 140 ms for File.getName(), 2016 ms for FileUtil.getMIMEType(FileObject) and 2234
ms for FileUtil.getMIMEType(File). It seems also other methods than (getExt, getName, getNameExt, getInputStream) are
used during recognition and FileObject is created anyway. Probably it might be better to consider some filtering of
acting resolvers as Honza mentioned.
Comment 6 Jaroslav Tulach 2008-06-27 10:05:37 UTC
Can I see results from profiler? E.g. profile it or show me the patch to try myself later.
Comment 7 Jiri Skrivanek 2008-06-27 12:05:37 UTC
Created attachment 63593 [details]
Patch adding FileUtil.getMimeType(java.io.File).
Comment 8 Jiri Skrivanek 2008-08-07 11:22:17 UTC
I spent some time on this. Firstly I made various performance improvements in MIMESupport and BaseFileObj to make
getMIMEType(FileObject) faster (http://hg.netbeans.org/core-main/rev/e1ae60dcb027). Times in miliseconds of resolving of
1706 various files in php wiki project in java IDE are as follows:

3 consequent runs - 9047, 7172, 7515
after improvements - 1922, 1172, 1250

It will never be as fast as using only java.io.File.getName because some resolvers read content of files. I have also
tried to better implement FileUtil.getMIMEType(java.io.File) as Jarda suggested. It is faster but the difference is not
very big (1328, 985, 984). I think it is not worth to integrate it.

Are these performance improvements enough for you or do you still need something extra fast? You can also look at 2
snapshots from profiler for getMIMEType(FileObject) and getMIMEType(File) but I think there is not too much to be
further improved.
Comment 9 Jiri Skrivanek 2008-08-07 11:24:50 UTC
Created attachment 66776 [details]
Profiler snapshot of getMIMEType(FileObject).
Comment 10 Jiri Skrivanek 2008-08-07 11:25:20 UTC
Created attachment 66777 [details]
Profiler snapshot for getMIMEType(File).
Comment 11 Jiri Skrivanek 2008-08-07 11:26:17 UTC
Created attachment 66779 [details]
Patch implementing getMIMEType(File).
Comment 12 Jesse Glick 2008-08-07 19:29:13 UTC
I am backing out #e1ae60dcb027 as it caused test failures. When changing basic infrastructure like this it is wise to
run all applicable unit tests locally.
Comment 13 Quality Engineering 2008-08-08 04:34:39 UTC
Integrated into 'main-golden', available in build *200808080201* on http://bits.netbeans.org/dev/nightly/
Changeset: http://hg.netbeans.org/main/rev/e1ae60dcb027
User: Jiri Skrivanek <jskrivanek@netbeans.org>
Log: #137734 - Various performance improvemts for MIME type resolution. Namely cache extension in MIMESupport.CachedFileObject, do not call unnecessarily lastModified and override getPath. In BaseFileObj replace File.getParentFile by faster getParent and cache FileObjectFactory.
Comment 14 Jiri Skrivanek 2008-08-08 07:38:02 UTC
Of course I run tests locally and they passed on my WindowsXP. I will try to investigate what is wrong on unix.
Comment 15 Jesse Glick 2008-08-08 17:46:09 UTC
I see. I can reproduce the test failures (in masterfs) on Linux.
Comment 16 Jaroslav Tulach 2008-08-12 14:41:11 UTC
"some resolvers read content of files" - you may skip such resolvers if you enhance the API a bit. First of all you 
would need to give Jan a method like:

public String getMimeType(FileObject fo, Set<String> onlyFromThese);

which would try to guess the mime type, and if it is not from onlyFromThese, it would return null. This might skip 
some of the resolvers, especially if you add new resolver constructor:

protected MIMEResolver(String[] mimeTypesICanRecognize)

then you do not need to talk to any sniffing resolver in case Jan is only interested in text/x-java. My 2 Kč.
Comment 17 Jiri Skrivanek 2008-08-20 10:18:43 UTC
Thank you Jesse for backing out. I found a problem which caused tests to fail and pushed my performance improvements
again (http://hg.netbeans.org/core-main/rev/db43e97030a0).
I agree with Jarda than something like 

    FileUtil.getMimeType(FileObject fo, Set<String> onlyFromThese);

is necessary to improve MIME resolving perfomance even more. Maybe better there should be 

    MIMEResolver.providesMIMETypes()

which could work also for declarative resolvers. I will look at it.
Comment 18 Quality Engineering 2008-08-21 06:07:21 UTC
Integrated into 'main-golden', available in build *200808210201* on http://bits.netbeans.org/dev/nightly/
Changeset: http://hg.netbeans.org/main/rev/db43e97030a0
User: Jiri Skrivanek <jskrivanek@netbeans.org>
Log: #137734 - Various performance improvemts for MIME type resolution. Namely cache
extension in MIMESupport.CachedFileObject, do not call unnecessarily lastModified and
override getPath. In BaseFileObj replace File.getParentFile by faster getParent.
Comment 19 Jiri Skrivanek 2008-08-26 13:53:56 UTC
As discussed in this issue I suggest the following API change. An extra parameter is added to method
FileUtil.getMIMEType(FileObject, String...). Parameter contains one or more MIME types which we are only interested in.
It means only MIMEResolver instances which supply at lease one of these MIME types are queried. That's why MIMEResolver
 abstract class was extended by method getMIMETypes(). It has to return a String array of MIME types which can be
resolved by this resolver.
Comment 20 Jiri Skrivanek 2008-08-26 13:55:30 UTC
Created attachment 68337 [details]
Patch implementing FileUtil.getMIMEType(FO, String...).
Comment 21 Jesse Glick 2008-08-27 19:28:15 UTC
[JG01] You cannot add an abstract method to MIMEResolver. That would be grossly incompatible. It must have a concrete
implementation, e.g. returning null to signify "unknown/any".


[JG02] Do not use equalsIgnoreCase unless there is a compelling reason to believe that case sensitivity would in fact be
a bug.


[JG03] The unit test passes withinMIMETypes == null yet this usage is not permitted by the Javadoc and should throw an
exception.


Side notes:


[JG04] The implementation in MIMEResolverImpl.java need not copy Javadoc. Javadoc is inherited and subclasses should not
duplicate the Javadoc of the parent.


[JG05] Do not make long, detailed entries in apichanges.xml. Javadoc is the place to precisely specify behaviors and
contracts. An apichanges entry should be a brief mention of what is different and why someone might care - focus on what
existing module writers might need to do to fix their code to be compatible or to take advantage of a new opportunity.


[JG06] Replace

 * It has to return not empty String array.
 * @return an array of MIME types which can be resolved by this resolver.

which is redundant in several ways with

 * @return a non-empty array of MIME types


[JG07] The Javadoc for the withinMIMETypes parameter is not very explicit. "which only should be considered" does not
really tell me anything. You probably meant to say something such as:

"Only resolvers whose {@link MIMEResolver#getMIMETypes} contain one or more of the requested MIME types will be asked if
they recognize the file. It is possible for the resulting MIME type to not be a member of this list."
Comment 22 Jaroslav Tulach 2008-08-28 08:05:12 UTC
Y01 Updated modules over the NetBeans code base shall require new version of openide.filesystems to use its new API
Y02 String[] getMIMETypes() shall not be public. It is not supposed to be called by clients of the API
Y03 To [JG01] better to return null from getMIMETypes() by default and treat is as "I do not know yet" value, imho.
Y04 To add more to Y02 but with lower priority: I believe it would be better to hide getMIMETypes() completely and 
instead have MimeResolver(String... supportedMIMETypes) constructor. That will ensure immutability of the result 
(which is what all current resolvers do anyway), simplify the amount of code people need to write - e.g. just call 
super(...). The only exception is .../declmime/MIMEResolverImpl.java but that class is in the same module and the code 
in canResolveMIMETypes can handle it in special way. Moreover it will allow us to speed things up if necessary and 
replace the "for (int i = 0; i < desiredMIMETypes.length; i++) { for (int j = 0; j < resolvableMIMETypes.length; j++) 
{...}" with something more effective, if needed.
Comment 23 Jiri Skrivanek 2008-08-28 13:54:08 UTC
Re: [JG01] Y03 I did it intentionally to force subclasses of MIMEResolver to implement it. But assertion in
MIMESupport.canResolveMIMETypes if getMIMETypes returns null is probably enough. In such a case it doesn't make sense to
return any not null value.

Re: Y04 Idea with constructor is not clear to me. Could you write an example? I think we can think about protected
String[](or Set<String>) MIMEResolver.resolvableMIMETypes field instead of MIMEResolver.getMIMETypes() method. This
field can be then initialized in constructor. I don't know whether field or method is better.

Other comments fixed in new patch. Thank you.
Comment 24 Jiri Skrivanek 2008-08-28 13:55:05 UTC
Created attachment 68531 [details]
Patch implementing FileUtil.getMIMEType(FO, String...) - version2.
Comment 25 Jesse Glick 2008-08-28 13:58:38 UTC
Y03 - yes, that sounds the same as JG01.

Adding an assertion in case a MIMEResolver does not specify an explicit MIME type list is just as incompatible as adding
an abstract method, and it is unacceptable in such a foundational class. You need to deal with old resolvers somehow,
probably by always checking them. A one-line WARNING printed to log (once per session x resolver impl) is acceptable.

I like Y04. JG07 would still apply - must be specified exactly what this means, e.g. whether the resolver is ever
permitted to return a type which is not in the list.

Be careful with varargs - since there is already a zero-arg constructor, having a MIMEResolver(String...) introduces a
potential ambiguity: is super() a call to MIMEResolver() or to MIMEResolver(new String[0])? A slightly ugly but
effective trick is to declare the constructor MIMEResolver(String mimeType1, String... otherMimeTypes) which also
enforces that the list is not empty.
Comment 26 Jiri Skrivanek 2008-08-28 16:44:16 UTC
Why should I deal with old resolvers? They should be fixed immediately to not slow dow IDE. My change in FileUtil is
anyway binary incompatible and modules using FileUtil.getMIMEType(fo) has to be recompiled.

JG07 I thought your sentence "It is possible for the resulting MIME type to not be a member of this list." clearly says
that the resolver IS permitted to return a type which is not in the list.

Y04 - I tried to implement in patch 3. Also I tried super() and if I explicitely add default MIMEResolver constructor,
super() calls it.
Comment 27 Jiri Skrivanek 2008-08-28 16:44:49 UTC
Created attachment 68551 [details]
Patch implementing FileUtil.getMIMEType(FO, String...) - version 3.
Comment 28 Jesse Glick 2008-08-28 17:10:45 UTC
JG01 - the API must continue to deal with any old resolvers floating around on the internet. An incompatible change here
is not acceptable. You cannot fix all of them, though you can and should fix all of them you find in nb.org modules.
(Don't forget contrib - though it can only be updated after the API change is in main.) Nor may you replace
FileUtil.getMIMEType(FO) with gMT(FO, String...) as this is binary-incompatible. (Even modules that were recompiled
would be broken: passing String[0] would logically mean that the return value must always be null!) You may only add a
new overload of the method.

JG07 - I am just asking for the exact specification to be documented. In the most recent patch, in the new MIMEResolver
constructor, it is not documented. The Javadoc says only "MIME types which can be resolved by this resolver". It does
not clearly specify whether findMIMEType is permitted to return some other MIME type, and if it is not, what (if
anything) will happen if it tries. The getMIMEType Javadoc has been updated satisfactorily.

Y04 is not yet reflected in apichanges.xml


JG08 - {@link #getMIMEType(FileObject, String[]) getMIMEType(FileObject)} makes no sense. Simplify to {@link
#getMIMEType(FileObject, String[]).


JG09 - use Parameters.notNull in getMIMEType.


JG10 - the MIMEResolver(String...) constructor does not check for a null or empty array, nor for nulls in the array.
Comment 29 Jiri Skrivanek 2008-08-29 14:50:13 UTC
I updated patch according to your comments. Also I changed assert in canResolveMIMETypes to warning as you suggested sooner.
Comment 30 Jiri Skrivanek 2008-08-29 14:50:55 UTC
Created attachment 68634 [details]
Patch implementing FileUtil.getMIMEType(FO, String...) - version 4.
Comment 31 Jesse Glick 2008-08-29 15:36:43 UTC
[JG11] apichanges.xml claims the change is incompatible, which I hope it no longer is.


[JG12] You could make the no-arg variant of the MIMEResolver constructor @Deprecated to help alert module developers to
the desirability of specifying MIME types.


[JG13] The comment

//the MIMEResolver(String...) constructor does not check for a null or
//empty array, nor for nulls in the array.

is strange since the code directly above it does exactly that.


[JG14] The warning should be printed in the MIMEResolver constructor, not in canResolveMIMETypes.
Comment 32 Jiri Skrivanek 2008-08-29 18:06:25 UTC
Patch is updated. 

Re: [JG14] The warning cannot be printed in the MIMEResolver() constructor because declarative resolvers are created
this way.
 
BTW, in contrib the following will be updated:

contrib/fortress.editing/src/org/netbeans/modules/fortress/editing/FortressMimeResolver.java
contrib/scala.editing/src/org/netbeans/modules/scala/editing/ScalaMimeResolver.java
Comment 33 Jiri Skrivanek 2008-08-29 18:07:22 UTC
Created attachment 68655 [details]
Patch implementing FileUtil.getMIMEType(FO, String...) - version 5.
Comment 34 Jesse Glick 2008-08-29 18:43:41 UTC
JG14 - declarative resolvers could use a different, private constructor which does not warn. E.g.

@Deprecated public MIMEResolver() {
 this(false);
 warn(....this.getClass().getName()...);
}
MIMEResolver(boolean ignored) {}
public MIMEResolver(String... mimeTypes) {
 this(false);
 ...
}

Anyway the essential point is that the warning should be printed at most once per MIMEResolver implementation class per
session, not once per resolver per call to getMIMEType, which would quickly fill up the log file with spam.
Comment 35 Jiri Skrivanek 2008-08-29 22:26:50 UTC
Re: JG14 - There is a condition which fulfils the requirement that warning is printed just once per MIMEResolver
instance which is in fact once per session. I think MIMEResolver class should not become overcomplicated.
Comment 36 Jiri Skrivanek 2008-09-02 08:31:06 UTC
I will integrate suggested patch.
Comment 37 Jiri Skrivanek 2008-09-03 08:18:37 UTC
Fixed.
http://hg.netbeans.org/core-main/rev/81c36fefab56
Comment 38 Quality Engineering 2008-09-04 06:20:21 UTC
Integrated into 'main-golden', will be available in build *200809040201* on http://bits.netbeans.org/dev/nightly/ (upload may still be in progress)
Changeset: http://hg.netbeans.org/main/rev/81c36fefab56
User: Jiri Skrivanek <jskrivanek@netbeans.org>
Log: #137734 - To speed up MIME type recognition it is added method FileUtil.getMIMEType(FileObject, String...). We can supply one or more MIME types which we are only interested in. Module writers have to override MIMEResolver default constructor and provide supported MIME types.
Comment 39 Quality Engineering 2008-09-05 06:22:23 UTC
Integrated into 'main-golden', will be available in build *200809050201* on http://bits.netbeans.org/dev/nightly/ (upload may still be in progress)
Changeset: http://hg.netbeans.org/main/rev/1ca8bd7dc952
User: Jesse Glick <jglick@netbeans.org>
Log: Added note to <compatibility> for #137734.


By use of this website, you agree to the NetBeans Policies and Terms of Use. © 2012, Oracle Corporation and/or its affiliates. Sponsored by Oracle logo