This Bugzilla instance is a read-only archive of historic NetBeans bug reports. To report a bug in NetBeans please follow the project's instructions for reporting issues.

Bug 269229 - Background scan of not so large project runs for 40 minutes !!!
Summary: Background scan of not so large project runs for 40 minutes !!!
Status: NEW
Alias: None
Product: java
Classification: Unclassified
Component: Platform (show other bugs)
Version: Dev
Hardware: PC Windows 10 x64
: P2 normal with 1 vote (vote)
Assignee: Tomas Zezula
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2016-12-04 19:55 UTC by NukemBy
Modified: 2016-12-06 22:50 UTC (History)
1 user (show)

See Also:
Issue Type: DEFECT
Exception Reporter:


Attachments
RepositoryUpdater.worker-thread-dump (10.18 KB, text/plain)
2016-12-06 09:40 UTC, NukemBy
Details
chart with number of file operations (19.32 KB, image/png)
2016-12-06 22:47 UTC, NukemBy
Details
data for the chart (42.22 KB, application/vnd.ms-excel)
2016-12-06 22:50 UTC, NukemBy
Details

Note You need to log in before you can comment on or make changes to this bug.
Description NukemBy 2016-12-04 19:55:03 UTC
Recently I needed to dig into code of Gradle, therefore I've opened it in NetBeans and tried to look around. It was tough - NB was rather slow and unresponsive because background scan was running all the time. To measure exact time I did following:

1. downloaded latest dev build - 201612040001
2. installed Gradle plugin - I build it from sources from here https://github.com/kelemen/netbeans-gradle-project 
3. to measure what is happening via IDE log, i've used plugin you can download from here http://ww.planetnetbeans.org/ntopic62864.html 
4. to ensure a clean start - I've deleted 'cache' directory in workspace
5. open Gradle project which can be download from here - https://github.com/gradle/gradle

(note: recommend doing this on Windows computer - see the reasons below)

When scan was finished - the last record in log was at "+00:44:28.184" from start of NetBeans. Let's cut 2-3 minutes for initial startup and loading of the project - time for only a "Background scan" is roughly 40 minutes. This is faaaaar beyond my understanding of acceptable performance for project having just 10319 of Java/Groovy files (note: memory usage fits into 500MB out of allocated 1200MB).

...

The root cause for slowness appears to be the same as for this issue (https://netbeans.org/bugzilla/show_bug.cgi?id=250470 "Slow and Weird auto-suggest for Exception class names in New Breakpoint dialog"). The staff which is happening is like following:

1. source scanner tries to compile java/groovy sources via NetNeans compiler (outputs of that 'compiler' is cached object model of sources)
2. to resolve packages and classes, that compiler calls back into NetBeans' cache  ... 
3. ... and NetBeans just performes file-scan in 'workspace/cache/index' - there are queries like 'where is package com.company.libxxx', cache implementation goes through "100 cached libraries" and queries local file subtree for presence of needed file/folder.

Taking into account that number of such 'where' queries is counted in millions per large project for whole scan, this consequently generates millions of file query operations multiplied by number of cached libraries in dependencies (because each of them is scanned until 'first match' or 'no match at all'). File-query operations are rather slow, especially on Windows - this is the major bottlneck in that whole process.

To fix that performance issues - caching (via lucene) should be significantly refactored.

Another cause of slowness here, i guess (around 5-10%), is this issue - https://netbeans.org/bugzilla/show_bug.cgi?id=268310 (Excessive InapplicableMethodException causes slowness in bg-scan during initial load).
Comment 1 Tomas Zezula 2016-12-04 19:57:09 UTC
Gradle is 3rd party plugin.
Comment 2 NukemBy 2016-12-04 20:35:02 UTC
Yes, Gradle plugin is a 3rd party plugin, but ... please take a better look onto my description ... plugin's participation in that lengthy process ends on 1st-2nd minute. After plugin lists source roots into NetBeans - process goes within NetBeans-native areas. I was digging rather deep through NB sources in that area and pretty much sure in correctness of my 'diagnose'.

Absolutelly same slowness happens when I open NetBeans sources - 20 projects are scanned for 8+ minutes. None of 3rd party plugins participate here. I just fear to open whole set of NetBeans source modules simultaneously.
Comment 3 NukemBy 2016-12-06 09:40:42 UTC
Created attachment 163163 [details]
RepositoryUpdater.worker-thread-dump
Comment 4 NukemBy 2016-12-06 09:55:02 UTC
For more insights - I've attached a "RepositoryUpdater.worker" thread dump, which shows (according to my observations) the most CPU consuming part. Here are top few methods

"RepositoryUpdater.worker"
	...
	at org.openide.util.CachedFile.listFiles(CachedFile.java:605)
	at org.netbeans.modules.java.source.parsing.FolderArchive.getFiles(FolderArchive.java:145)
	at org.netbeans.modules.java.source.parsing.CachingFileManager.list(CachingFileManager.java:130)
	at org.netbeans.modules.java.source.parsing.ProxyFileManager.list(ProxyFileManager.java:153)
	at com.sun.tools.javac.code.ClassFinder.scanPlatformPath(ClassFinder.java:616)
	...
	at com.sun.tools.javac.model.JavacElements.getTypeElement(JavacElements.java:65)
	at org.netbeans.modules.groovy.editor.java.ElementSearch.getClass(ElementSearch.java:63)
	at org.netbeans.modules.groovy.editor.compiler.CompilationUnit$CompileUnit$1.run(CompilationUnit.java:142)
	...
	at org.netbeans.modules.parsing.api.ParserManager.parse(ParserManager.java:334)
	at org.netbeans.api.java.source.JavaSource.runUserActionTaskImpl(JavaSource.java:415)
	at org.netbeans.api.java.source.JavaSource.runUserActionTask(JavaSource.java:407)
	at org.netbeans.modules.groovy.editor.compiler.CompilationUnit$CompileUnit.getClass(CompilationUnit.java:156)
	at org.codehaus.groovy.control.ResolveVisitor.resolveFromCompileUnit(ResolveVisitor.java:477)
	...
	at org.codehaus.groovy.control.CompilationUnit.compile(CompilationUnit.java:542)
	at org.netbeans.modules.groovy.editor.api.parser.GroovyParser.parseBuffer(GroovyParser.java:477)
	at org.netbeans.modules.groovy.editor.api.parser.GroovyParser.parse(GroovyParser.java:157)
 
Notes:

- this call-stack is from my custom-built version of NB (as described here https://netbeans.org/bugzilla/show_bug.cgi?id=268744), therefore line number may not match any of public NB sources

- that particular project has rather many Groovy files and one of the lines in log says

   Indexing of: D:\Temp\git\gradle\subprojects\core\src\test\groovy took: 501,795 ms 
  
  This is ~8:20 minutes per 427 files, 1.2sec / file
  So, there may be some problem in Groovy part which increases slowness even more than for plain java files.
  (I believe Groovy is not a 3rd-party component in NetBeans, anyway - roots go to java.*.CachingFileManager)
Comment 5 Tomas Zezula 2016-12-06 09:59:11 UTC
In fact this is Windows problem.
The CachingFileManager just list packages as javac asks for (in the same was as command line javac). The Linux caches FS metadata and repeated readdir syscall is fast, the read of fs metadata on Windows is rather slow.
Comment 6 NukemBy 2016-12-06 15:36:09 UTC
Re: "In fact this is Windows problem."

I would not agree with this :) This is "the problem of NetBeans" running on Windows. 

Actually, relying on native file system as a fast-access-cache is bad by design. When someone introduces a component into software architecture - he/she assumes that component satisfies functional/performance requirements. In that case - native file system - is the dependency selected by developer and it obviously does not satisfy in terms of performance, but it is already hard-coded into implementation, though is out of developer's control (i.e. performance of that cached cannot be fixed or replaced). 

From the other hand - as I remember while digging through NB's code - everything within 'worskpace/cache/index' is managed solely by NetBeans and can be safely cached in memory and there will be just no need to query file system at all (in 95% of cases).

When talking about Windows vs *nux - I've conducted some perf-tests and yes, linux is faster, but only up to 10 times faster (depends on operations). In-memory cache-lookup will be 1000 faster. In numbers it will be something like 40min (windows) -> 4min (*nuix) -> 4sec (in-memory cache) ... i did not count actual parsing of files in that, just cache lookup.

One more thing - bg scan generated 20400 files and 14000 folders. 6900 folders are of 0 size - i.e. either empty or contain other empty folder. Creating and scanning of these folders is a pure waste of time.


So ... I think there is much to improve particularly in NetBeans, not in Windows :) (this definitely will not work).
Comment 7 Tomas Zezula 2016-12-06 16:04:37 UTC
In fact i've experimented in time of NB 7.2 with replacing the file based storage on windows by a  custom one (all files merged into a single one similar to tar). As far as I remember on the java project on windows the improvement was about 15% on Unix no improvement at all. On windows even the cache of package names which handles negative answers helps.
Keeping all the metadata in memory increases memory consumption we do it for archives but in smarter way CachingArchive, CachingPathArchive.

However first the high number of JavaFileManager::list calls from groovy should be evaluated.
Comment 8 NukemBy 2016-12-06 22:46:21 UTC
Probably this is the last thing I can add to this topic.

Since I have version of NB running mostly through the CachedFile, it was not great problem for me to add some counters there.
Whole process from startup of NetBeans to completion of bg scan has generated following (this is number of invocations):

 time	            
 File.exists()	           384 721
 isFile()	               321 563
 isDirectory()	         2 523 540
 readAttributes(total)	 9 383 990
 readAttributes(n/a)	 7 144 839
 listFiles(total)	    35 618 198
 listFiles(n/a)	        35 570 953
 listFiles(cache-miss)	    18 176

Comments:

- my configured Cache Size is 100k of File references - it includes both existing and non-existing files. It appears that number of unique file paths requested during scan is higher than 100k, therefore files are evicted rather frequently and cache just does not work (at least efficiently, 200k also were exhausted rather quickly - there can be plenty of variations of non-existing files). Self profiler says that 'readFileAttributes()' takes roughly 50% of time and parent listFiles() is around 65% time.

- where other CPU time may go (apart from parcing of sources)? ... at scale of 35 millions of invocation actually anything may cause a problem. For example that line ...

    http://hg.netbeans.org/main/file/tip/java.source.base/src/org/netbeans/modules/java/source/parsing/CachingArchiveProvider.java
    final URI rootURI = toURI(root);
    
    ... will cost around around 1 minute on 35mln because this conversion internally invokes massive string scan/substring/concatenate (JMH test should 2sec for 1 mln of invocations).
    I've already mentioned that case here: https://netbeans.org/bugzilla/show_bug.cgi?id=268292 
    
To summ up ...
--------------

- Compilation/introspection of sources invokes plenty of class/var/package lookups into index (this happens also while editing code in Editor and I reularly experience lag of few seconds in updates of intellisense, hints and high-lighting on huge project codebases with large number of dependencies). I think that high number of lookups is generally not avoidable (on enterprise-sized projects).

- current implementation cause plenty of further lookups into native OS file system, and 99.9% of them are lookup for non-existing file paths - and these 'false' lookups are the major reason of slowness here

- assuming average path within 'workspace/cache/index' is around 100 chars, dictionary for 100k of files will cost around (100 * 2byte/char + 100 bytes of overhead) * 100k = 30MB. This is rather acceptable to speed-up a process by 1000 times (I really expect that difference).

- taking into account huge number of invocations - various convertions of types on the way from 'client query' to 'actual lookup in the Map' should be minimal, better - none at all.

I think it would be possible to reduce number of folders in cache/index by flatting of java packages, for example by ...
    changing  cache\index\s10\java\15\classes\org\gradle\plugin\use\resolve\internal\CorePluginResolverTest.sig
    to        cache\index\s10\java\15\classes\org.gradle.plugin.use.resolve.internal\CorePluginResolverTest.sig
number of folders is reduced by 5 - this will require smaller number of cache entries.
Comment 9 NukemBy 2016-12-06 22:47:58 UTC
Created attachment 163172 [details]
chart with number of file operations
Comment 10 NukemBy 2016-12-06 22:50:12 UTC
Created attachment 163173 [details]
data for the chart