239310 – FileUtil::normalizeFileOnWindows() is slow due to IO especially on network drives

This Bugzilla instance is a read-only archive of historic NetBeans bug reports. To report a bug in NetBeans please follow the project's instructions for reporting issues.

Bug 239310 - FileUtil::normalizeFileOnWindows() is slow due to IO especially on network drives

Summary: FileUtil::normalizeFileOnWindows() is slow due to IO especially on network dr...

Status:	NEW

Alias:	None

Product:	platform
Classification:	Unclassified
Component:	Filesystems (show other bugs)
Version:	8.0
Hardware:	PC Windows 7

Importance:	P3 normal (vote)
Assignee:	Jaroslav Havlin

URL:
Keywords:

Depends on:
Blocks:

Reported:	2013-12-10 12:28 UTC by victork
Modified:	2013-12-12 11:06 UTC (History)
CC List:	0 users

See Also:
Issue Type:	ENHANCEMENT
Exception Reporter:

Attachments
normalizeFileOnWindows update to not use IO to the filesystem (22.54 KB, application/octet-stream) 2013-12-10 12:28 UTC, victork	Details
normalizeFileOnWindows update to not use IO to the filesystem. (23.10 KB, patch) 2013-12-10 23:23 UTC, victork	Details \| Diff
Show Obsolete (1) Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description victork 2013-12-10 12:28:43 UTC

Created attachment 143009 [details]
normalizeFileOnWindows update to not use IO to the filesystem

In the quest of IDE optimization for own purposes(Started at bug 239305) now i came to the second big bottleneck - the paths normalization on Windows(Code for Unix is fast and not using FS access(IO)). Those two bugs can lots of time come together within low level FileSystemFactoryCalls which "summarizes" the latency each one of those adds.

Current implementation relies on Java's getCanonicalPath/File() methods which use lower layer native int
wcanonicalize(WCHAR *orig_path, WCHAR *result, int size) C routine:

First part of is does exactly that does my special pure Java implementation here in the patch while the second handles the true "casing" discovering via FS access:

/* At this point we have copied either a drive specifier ("z:") or a UNC
       prefix ("\\\\host\\share") to the result buffer, and src points to the
       first byte of the remainder of the path.  We now scan through the rest
       of the path, looking up each prefix in order to find the true name of
       the last element of each prefix, thereby computing the full true name of
       the original path. */
    while (*src) {
        WCHAR *p = wnextsep(src + 1);    /* Find next separator */
        WCHAR c = *p;
        WCHAR *pathbuf;
        int pathlen;

        assert(*src == L'\\');        /* Invariant */
        *p = L'\0';            /* Temporarily clear separator */

        if ((pathlen = (int)wcslen(path)) > MAX_PATH - 1) {
            pathbuf = getPrefixed(path, pathlen);
            h = FindFirstFileW(pathbuf, &fd);    /* Look up prefix */
            free(pathbuf);
        } else
            h = FindFirstFileW(path, &fd);    /* Look up prefix */

Here we see slow FindFirstFile() function  is called - in a loop.


Reimplementation proposed in the supplied patch add new method getNormalizedPathOnWindows() (Currently public so can be used in unit testing if needed) which implements the core functionality of normalization(. and .. handling,\\ handling,drive letter uppercasing, /drive/ support - via getAbsolutePath) and it's usage increases performance of the IDE in the order of magnitude(Code completion now doesn't get above 1 sec even for files it never worked at all for(just stacked on thinking due to IO)).

Logic of C based wcanonicalize not implemented yet in Java based getNormalizedPathOnWindows():
Checking for wildcards(* and ?) and returning error if present - Can easily be added.
UNC support(Not needed as NetBeans API explicitly skipes normalization for UNC paths).
Retrieving the real case of names for existing path(IO) - Will not be supported.

The interesting part of those 3 now is the last one - Casing.
Platform has two tests in FileUtilTest.java:

testLowerAndCapitalNormalization() - Bug 199471(J. Tulach)
testNormalizePathChangeCase() - Bug 211038(J. Tulach)

which off course fail now because getNormalizedPathOnWindows() preserves cases of the original unnormalized "input" path.


Checks performed manually:
Bug 211038 - Checked using reproduce steps by the original reporter(Created file and renamed it from the IDE into same name with different case) - Succeeded on both NTFS drive(Case insensitive) and SMB mount ext3(Case sensitive - may loose this capability over SMB) drives(local/network).

Bug 199471 - surely will get back and even have unexpected behavior(original problem was about the caching of normalizations):
FileUtil.normalizeFile(A).getAbsolutePath()
will return the "C:\\A" with non-IO implementation.

Normalization should be done on "entry" points of the files - UI operations(Create,Rename,Drag&Drop,Copy&Paste,Open) - not on any access just like you sanitaze/filter the data before storing it in the DB and not store it dirty and then sanity millions of times on each read, right?

Reporter of bug 199471 expects "core" to do some stuff the caller should do (a.toLowerCase().equals(b.toLowerCase())) in his code.
Also reporter didn't specify that he is writing(Should be some module or NB Platform based project) - For second case access to slow implementation ca be given - It will be responsibility of plugin developer - Slow plugin - no users and popularity - instead of affecting the whole IDE and severly hurting it's reputation(having reputation of bloated and slow and acting like doesn't help any software).
There should be full distinguish between stuff which is executed in a few cases and stuff which should be always executing. First case should be performed by the caller, second by the calee and not vice-versa.

While the message can look like a little aggressive it's actually not so - Only full description of that happening and discovered is given.

I know chances such a change will pass to upstream are low(I'm sure about it) however having the bug and the patch here will allow users experiencing heavy performance issues to play with it and improve their overall experience of the IDE instead of looking for alternatives(Especially proprietary ones).

I really recommend some of the Windows users here to give this patch a test run
and share us with their experience and feeling especially those storing projects on a network drives(i"m interested hearing from those storing on local NTFS partitions as well). Do not forget to take patch from bug 239305 as well and enable special parameter via -J-D in a command line(Specified in a thread) as those slow calls come together lots of time(Within same stacktrace of 10 layers).

Then i have some spare time i'll add missing wildcard checking into the routine(needed) and possibly look into adding the UNC check which present in original one(However currently seems unneeded and just overhead).

Comment 1 victork 2013-12-10 23:23:20 UTC

Created attachment 143037 [details]
normalizeFileOnWindows update to not use IO to the filesystem.

Changes since previous patch:
Added wildcards checking in the path(From Java's internal C routine)
Added invalid dots checking in the path elements(From Java's internal C routine)
Replaced C-style "for" with short iterating "for" and update it's body accordingly.
Updated to use static methods of Character class without creating new object.
getNormalizedPathOnWindows() now throws exception on failure of the above checks which is catched by normalizeFileOnWindows().

Comment 2 Jaroslav Havlin 2013-12-11 14:33:24 UTC

It would be great if we could prevent any I/O in normalizeFile, but unfortunately we need it.

On Windows, files can still have 8.3 names, e.g. C:\PROGRA~1, which is just another name for C:\Program Files.
The 8.3 name is stored in the filesystem, as well as the long name. So, if we get the short (8.3) name, we need to call File.getCanonicalPath() to retrieve the long name.

If we get two paths for the same file, we should know (for several reasons, e.g. caching, filesystem operations) that we are working with a single file. That's why we should always work with canonical (unique) paths. If two files have different canonical name, they are surely different (not speaking about symbolic links now).

That's also why we need to know the correct case of file names. "C:\file.txt", "c:\FILE.TXT" and "c:\File.txt" are different paths for the same file. But what of these paths should be used by NetBeans as the unique path for this file? We could theoretically use the lower-case or the upper-case variant, but it's much better to use the name that was specified by the user (or operating system). This name is stored in the filesystem, so we need to call File.getCanonicalPath(), which performs some I/O to read the correct name.

Having some heuristics for avoiding disk access would be nice, but it is quite complicated. Mainly because of the fact that case of file names is quite important on Windows, although the file system is case insensitive.

Thank you very much for your great effort to fix this issue, but I cannot integrate this patch. It can work fine if you work with canonical paths only, but it cannot be guaranteed, so there is high risk of severe problems.

I'm changing the issue type to Enhancement.

Comment 3 victork 2013-12-11 19:54:26 UTC

I knew it can't get in(As i've wrote - i predicted it then i wrote it :)) - can be still useful for others.

Why we get canonical every time instead of converting to canonical then we store data about file(Open,Drag&Drop,etc). If user renamed a file to same with other case outside of the IDE then invalidate caches for old name and reindex.

"We could theoretically use the lower-case or the upper-case variant, but it's much better to use the name that was specified by the user (or operating system)"
Operating system shall provide canonical name already, then user provides good name then it's ok as well - problems arise then user provides bad name -> And here then user provides a name(Open dialog,etc) it should be canonicalized so core will always work with already canonicalized paths and will not need to recanonicalize stored invalid custom path with dots,diff case,etc over and over again or you say the system became so comlicated so there is no way to find all external "path input" spots in the code or the fear is about problems with indexes of projects relaying on bad absolute addresses?

Or am I missing something?

Solution should exist - it's just all about architecting/planning it cooperatively with devs managing simultaneous parts of the code(Possibly can take a lot of time up to coming to conclusion that writing new IDE from scratch is better).


Btw. i've heard Microsoft plans to remove legacy 8.3 names support in upcoming releases of Windows ;) and there is special registry key which can enable case sensitivity on Windows FS and is quite dangerous and ofc by default off(exists for years) :)

Comment 4 Jaroslav Havlin 2013-12-12 08:23:33 UTC

(In reply to victork from comment #3)
> I knew it can't get in(As i've wrote - i predicted it then i wrote it :)) -
> can be still useful for others.
I agree. Thus, the bug is still open for experimenting and discussion.

> Why we get canonical every time instead of converting to canonical then we
> store data about file (Open,Drag&Drop,etc).
It should not be used every time, but whenever a path is passed to the IDE from outside - a settings file, file chooser (it's not guaranteed that file choosers return canonical names), drag&drop, command-line argument, etc.

(We store data about the file even if we see the file, because we need e.g. detect MIME-Type, and some info is cached.)

If you find a place from which normalizeFile() is called for a file that is surely canonical, please let me know.

> If user renamed a file to same with other case outside of the IDE then 
> invalidate caches for old name and reindex.
It works this way currently.

> Operating system shall provide canonical name already, then user provides
> good name then it's ok as well - problems arise then user provides bad name
> -> And here then user provides a name (Open dialog,etc) it should be
> canonicalized so core will always work with already canonicalized paths and
> will not need to recanonicalize stored invalid custom path with dots,diff
> case,etc over and over again
I absolutely agree, but it should work this way already.

> Or am I missing something?
Surely not. See javadoc for FileUtil.html#normalizeFile(java.io.File), it contains the same recommendations as you are saying.

> Solution should exist - it's just all about architecting/planning it
> cooperatively with devs managing simultaneous parts of the code(Possibly can
> take a lot of time up to coming to conclusion that writing new IDE from
> scratch is better).
The IDE is already designed this way, the normalizedFile() should be called only if a path is passed from outside of the IDE.
Of course, there can be bugs that violates the design, which should be detected and fixed.(The IDE also try to make the normalization faster by using a cache for normalized paths.)

> Btw. i've heard Microsoft plans to remove legacy 8.3 names support in
> upcoming releases of Windows ;) and there is special registry key which can
> enable case sensitivity on Windows FS and is quite dangerous and ofc by
> default off(exists for years) :)
Thank you for the info! :-)

Comment 5 victork 2013-12-12 11:06:38 UTC

Thanks for the descriptive answer :)

""""
> Why we get canonical every time instead of converting to canonical then we
> store data about file (Open,Drag&Drop,etc).
It should not be used every time, but whenever a path is passed to the IDE from outside - a settings file, file chooser (it's not guaranteed that file choosers return canonical names), drag&drop, command-line argument, etc.

(We store data about the file even if we see the file, because we need e.g. detect MIME-Type, and some info is cached.)

If you find a place from which normalizeFile() is called for a file that is surely canonical, please let me know.
""""

In this specific case i'm not sure if its canonical. Need to open whole Netbeans tree and precisely check all flows/xrefs to be sure, and even if its ok no guarantee can be made that someone will not add "unsanitized" path in the future(Perfectly should be denied by code review process but not every patch gets enough reviews :().
There may be uses of normalize that came into some projects via good old "Copy/Paste" of existing code chunk handling the files then normalization was not actually needed there, because the path is already in its normal state.