166340 – Files crawling is slow; mostly due to mime-type resolution

Status:	RESOLVED FIXED

Alias:	None

Product:	editor
Classification:	Unclassified
Component:	Parsing & Indexing
Version:	6.x
Hardware:	Sun All

Importance:	P2 blocker
Assignee:	Vitezslav Stejskal

URL:
Keywords:	PERFORMANCE

Duplicates (1):	170204
Depends on:
Blocks:

Reported:	2009-06-01 13:43 UTC by Vladimir Voskresensky
Modified:	2009-08-12 12:16 UTC
CC List:	4 users

See Also:
Issue Type:	DEFECT
Exception Reporter:

Attachments
no snapshot (1.07 KB, patch) 2009-07-13 16:27 UTC, Vladimir Voskresensky
no extra store (902 bytes, text/plain) 2009-07-13 16:27 UTC, Vladimir Voskresensky
jlahoda's patch for C++ ClassPathProvider implementation (5.01 KB, text/plain) 2009-07-15 16:25 UTC, Vitezslav Stejskal

Description Vladimir Voskresensky 2009-06-01 13:43:42 UTC

CND uses the ACE+TAO project as a benchmark for improving performance and memory consumption, and the Scanning Projects
task slows down the parse.

Here are some numbers:
I) Cold open of the project gives the following numbers:
-- CND parser stopwatch:    794082 ms
-- Indexer stops: INFO [org.netbeans.modules.parsing.impl.indexing.RepositoryUpdater]: Complete indexing of 2 source
roots took: 1852277 ms

II) Close the IDE, remove the CND cache, leave the indexer cache, start the IDE:
-- CND parser stopwatch:    717407 ms
-- Indexer stops: INFO [org.netbeans.modules.parsing.impl.indexing.RepositoryUpdater]: Complete indexing of 3 source
roots took: 291207 ms

Comment 1 Vladimir Voskresensky 2009-06-01 13:57:00 UTC

I see several issues here:
- CND is able to parse ACE+TAO faster than the indexer enumerates the files... not good, I would say.
- The indexer touches files and consumes I/O resources very intensively, which affects our lexing phase, which also
needs I/O operations


Some more details about the project used:
- the project dir has about 65 000 files; it's normal for a C/C++ project to mix object files and source files in the
same folder during a build
- only 14 500 of them are source files
- only the crawler task is doing its work, and it spends most of its time detecting mime types

What I can propose:
- provide IndexerVisibilityQueries for source roots, so that the C++ source roots provider can supply its own filter
saying which files to skip; there is no sense in detecting mime types of all the binaries in the directory, like object
files, intermediate files and so on.
-- this is different from VisibilityQuery, because such files still need to be shown in the Files view
- allow the crawler task to be postponed until CND finishes its own parse

and of course, please, speed it up :-)
1852277 ms (about 31 minutes) is unbelievable;
we need only 10 minutes to read the full DWARF info from all those files and construct the CND project
(3x faster than reading a few bytes at the beginning of each file...)
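A minimal sketch of the proposed skip filter, under stated assumptions: the class name IndexerSkipFilter, its methods and the extension list are all hypothetical and are not part of the NetBeans API. It only illustrates the idea of rejecting obvious build artifacts by name, before any file content is read for mime-type detection.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch: none of these names exist in the NetBeans API.
final class IndexerSkipFilter {
    // Assumed set of build-artifact extensions a C/C++ root provider might exclude.
    private static final Set<String> SKIPPED_EXTENSIONS = Set.of("o", "obj", "a", "so", "lo");

    /** Returns the lower-cased extension of a plain file name, or "" if it has none. */
    static String extensionOf(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot <= 0 || dot == fileName.length() - 1) {
            return "";
        }
        return fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
    }

    /** True if the crawler should pass the file on to mime-type resolution. */
    static boolean shouldIndex(String fileName) {
        return !SKIPPED_EXTENSIONS.contains(extensionOf(fileName));
    }

    /** Keeps only the files worth resolving; only names are inspected here. */
    static List<String> filter(List<String> fileNames) {
        List<String> kept = new ArrayList<>();
        for (String name : fileNames) {
            if (shouldIndex(name)) {
                kept.add(name);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        System.out.println(filter(List.of("ace.cpp", "ace.o", "Makefile", "libACE.so")));
    }
}
```

Unlike a visibility query, a filter of this kind would only affect what the indexer crawls; the skipped files would still be listed in the Files view.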

Comment 2 Vladimir Voskresensky 2009-06-01 13:58:26 UTC

(this is CND requirement for 6.8 planning)

Comment 3 Vladimir Voskresensky 2009-07-13 16:25:46 UTC

I have profiled scanning of ACE+TAO and I see two problems:
- a probably unnecessary flush of the Lucene index after every 2000 addDocument calls
- unnecessary creation of snapshots when there is no parser for the mime type

Please review and apply the proposed patches. Scanning time dropped from 97 sec to 38 sec after applying these fixes.
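The second point can be sketched roughly as follows. This is not the attached patch: SnapshotGuard and everything in it are hypothetical stand-ins for the Parsing API types, and a "snapshot" is reduced to the file text. The sketch only shows the ordering, that is, look up the parser first and load the content only if one is registered.

```java
import java.util.Map;
import java.util.Optional;
import java.util.function.Function;
import java.util.function.Supplier;

// Hypothetical sketch; these types are stand-ins, not the NetBeans Parsing API.
final class SnapshotGuard {
    private final Map<String, Function<String, String>> parsersByMimeType;
    private int snapshotsCreated;

    SnapshotGuard(Map<String, Function<String, String>> parsersByMimeType) {
        this.parsersByMimeType = parsersByMimeType;
    }

    /** Demo instance with a single fake parser registered for text/x-c++. */
    static SnapshotGuard withCppParser() {
        Function<String, String> cppParser = text -> "parsed:" + text.length();
        return new SnapshotGuard(Map.of("text/x-c++", cppParser));
    }

    /**
     * Creates the (expensive) snapshot only when a parser exists for the mime type.
     * The contentLoader stands for reading the file into memory.
     */
    Optional<String> process(String mimeType, Supplier<String> contentLoader) {
        Function<String, String> parser = parsersByMimeType.get(mimeType);
        if (parser == null) {
            return Optional.empty(); // no parser registered: skip the snapshot entirely
        }
        String snapshot = contentLoader.get();
        snapshotsCreated++;
        return Optional.of(parser.apply(snapshot));
    }

    int snapshotsCreated() {
        return snapshotsCreated;
    }

    public static void main(String[] args) {
        SnapshotGuard guard = withCppParser();
        guard.process("application/octet-stream", () -> "binary");
        guard.process("text/x-c++", () -> "int x;");
        System.out.println("snapshots created: " + guard.snapshotsCreated());
    }
}
```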

Comment 4 Vladimir Voskresensky 2009-07-13 16:27:01 UTC

Created attachment 84662 [details]
no snapshot

Comment 5 Vladimir Voskresensky 2009-07-13 16:27:44 UTC

Created attachment 84663 [details]
no extra store

Comment 6 Vitezslav Stejskal 2009-07-15 16:20:22 UTC

From jlahoda by private email:

Hi,
    I am sending a quick and dirty (esp. the ClassPathProvider part) patch to test the excluding idea for the indexing
that I talked about at the meeting today. You will need to adjust the "includes" method in "PathResourceImpl" to suit
your needs. Could you please test the patch to see how big the improvement is, if any? The excluding seems to work (I
did not see a .o file in Go to File when I used the patch).

Jan

Comment 7 Vitezslav Stejskal 2009-07-15 16:25:30 UTC

Created attachment 84793 [details]
jlahoda's patch for C++ ClassPathProvider implementation

Comment 8 Vladimir Voskresensky 2009-07-15 17:00:50 UTC

Hi, Vita.
I have applied this already. But it's not enough :-)
http://hg.netbeans.org/cnd-main?cmd=changeset;node=f49234c9860a

Comment 9 Vitezslav Stejskal 2009-07-17 09:46:56 UTC

I pushed changes that I believe fix this problem. The main change is the new crawling algorithm, which does not resolve
mime types until they are really needed. And even then it's done in a way that should minimize the number of disk
reads from the files (ie. using FileUtil.getMIMEType(FileObject f, String... mimeTypes) and preferring indexers for
mime types that are recognized by file extension, etc). I tested this change with ACE+TAO and the up-to-date check for
~50k files in the project takes around 20 seconds, which I think is acceptable. The cold start is still slow; the
indexing and C++ parsing run in parallel and C++ parsing is still faster than the indexing. I'll try to investigate
what exactly is done there. If this is important for the C++ folks, please file a separate issue. The second start is
much faster, with the indexing doing only its up-to-date check (~20 sec).
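The deferred resolution described above can be sketched along these lines. This is not the code from the changeset: LazyMimeType, its extension table and the contentSniffer callback are hypothetical, and the mime-type strings are only illustrative. The sketch assumes the sniffer never returns null.

```java
import java.util.Locale;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical sketch of lazy, extension-first mime-type resolution.
final class LazyMimeType {
    // Illustrative extension table; a real one would come from the registered resolvers.
    private static final Map<String, String> BY_EXTENSION = Map.of(
            "cpp", "text/x-c++",
            "c", "text/x-c",
            "java", "text/x-java");

    private final String fileName;
    private final Supplier<String> contentSniffer; // expensive: stands for reading the file
    private String resolved;                       // cached result, null until first use

    LazyMimeType(String fileName, Supplier<String> contentSniffer) {
        this.fileName = fileName;
        this.contentSniffer = contentSniffer;
    }

    private static String extensionOf(String name) {
        int dot = name.lastIndexOf('.');
        if (dot <= 0 || dot == name.length() - 1) {
            return "";
        }
        return name.substring(dot + 1).toLowerCase(Locale.ROOT);
    }

    /** Resolves on first call only; the file is read only if the extension is unknown. */
    String get() {
        if (resolved == null) {
            String byExtension = BY_EXTENSION.get(extensionOf(fileName));
            resolved = (byExtension != null) ? byExtension : contentSniffer.get();
        }
        return resolved;
    }

    public static void main(String[] args) {
        LazyMimeType source = new LazyMimeType("ace.cpp", () -> "content/unknown");
        System.out.println(source.get()); // resolved from the extension, sniffer not called
    }
}
```

Constructing a LazyMimeType costs nothing, so a crawler could create one per file and pay for resolution only for the files an indexer actually asks about.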

The other two changes are Vladimir's patches attached here earlier. I did not do any extensive measurements and so can't
say whether they improved the situation or not. Let's see what the performance guys have to say to that.

http://hg.netbeans.org/jet-main/rev/f22eac907102 - new crawler
http://hg.netbeans.org/jet-main/rev/9353b095ddda - no Snapshot patch
http://hg.netbeans.org/jet-main/rev/394302be6599 - no MAX_DOCS based lucene documents flush

Comment 10 Quality Engineering 2009-07-20 09:48:25 UTC

Integrated into 'main-golden', will be available in build *200907200201* on http://bits.netbeans.org/dev/nightly/ (upload may still be in progress)
Changeset: http://hg.netbeans.org/main-golden/rev/f22eac907102
User: Vita Stejskal <vstejskal@netbeans.org>
Log: #166340: more efficient files crawling

Comment 11 Vladimir Voskresensky 2009-07-20 14:40:39 UTC

initial scanning is very slow again. I have filed
http://www.netbeans.org/issues/show_bug.cgi?id=168817

Comment 12 Vitezslav Stejskal 2009-07-27 16:25:42 UTC

http://hg.netbeans.org/jet-main/rev/f2780b119b6b

Comment 13 Vitezslav Stejskal 2009-07-28 14:48:27 UTC

I had to back out f2780b119b6b. http://hg.netbeans.org/jet-main/rev/64a6bf12c424

Comment 14 Vladimir Voskresensky 2009-07-28 14:59:10 UTC

btw, I didn't understand what it was about. Did it have a good impact? How good was it?

Comment 15 Vitezslav Stejskal 2009-07-29 09:09:53 UTC

Another attempt on filtering roots - http://hg.netbeans.org/jet-main/rev/332df3a78ea8

Comment 16 Vitezslav Stejskal 2009-07-29 09:39:22 UTC

Vladimir, as a result of the files crawling fix, custom indexers (eg. JavaCustomIndexer) are now asked to index even
roots that never contain files that they are interested in (eg. text/x-java). They usually ignore the roots, but for
example JavaCustomIndexer prints warnings to the log file. These recent fixes attempt to improve the situation.
Unfortunately the first attempt was rather bad, my apologies. I am stashing the changesets here, because they are
related to the original fix.
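The root filtering could look roughly like this. It is a sketch under assumptions, not the pushed changeset: IndexerRootMatcher and the plain string registry are hypothetical, and "CppIndexer" in the demo is a made-up name. An indexer is offered a root only if the mime types it is registered for overlap with the mime types found in that root.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: selects which custom indexers should be asked to scan a root.
final class IndexerRootMatcher {

    /**
     * Returns the names of the indexers whose registered mime types
     * intersect the mime types present in the root, in registry order.
     */
    static List<String> indexersForRoot(Set<String> rootMimeTypes,
                                        Map<String, Set<String>> indexerMimeTypes) {
        List<String> selected = new ArrayList<>();
        for (Map.Entry<String, Set<String>> entry : indexerMimeTypes.entrySet()) {
            // disjoint() is true when the two sets share no element
            if (!Collections.disjoint(entry.getValue(), rootMimeTypes)) {
                selected.add(entry.getKey());
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> registry = new LinkedHashMap<>();
        registry.put("JavaCustomIndexer", Set.of("text/x-java"));
        registry.put("CppIndexer", Set.of("text/x-c++", "text/x-c")); // made-up indexer name
        System.out.println(indexersForRoot(Set.of("text/x-c++"), registry));
    }
}
```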

Comment 17 Quality Engineering 2009-07-29 17:42:53 UTC

Integrated into 'main-golden', will be available in build *200907291401* on http://bits.netbeans.org/dev/nightly/ (upload may still be in progress)
Changeset: http://hg.netbeans.org/main-golden/rev/332df3a78ea8
User: Vita Stejskal <vstejskal@netbeans.org>
Log: another attempt: #166340 (follow up): do not scan roots by CustomIndexers, which are registered for mime types different than the root's mimetypes

Comment 18 Vitezslav Stejskal 2009-08-12 12:16:50 UTC

*** Issue 170204 has been marked as a duplicate of this issue. ***