Bug 207232 - Allow Parsing API and VCSes to share information about modified files
Status: STARTED
Product: platform
Classification: Unclassified
Component: Filesystems
Version: 7.2
Hardware: Other Linux
Priority: P1
Target Milestone: 7.2
Assigned To: Tomas Zezula
QA Contact: issues@platform
Keywords: API_REVIEW_FAST, PLAN
Depends on:
Blocks:
Reported: 2012-01-12 17:05 UTC by Jaroslav Tulach
Modified: 2012-02-13 13:05 UTC (History)

See Also:
Issue Type: TASK

Attachments
Introducing TreeStamp (24.72 KB, patch)
2012-01-12 17:32 UTC, Jaroslav Tulach

Description Jaroslav Tulach 2012-01-12 17:05:30 UTC
Especially on remote file systems, but possibly on local ones as well, we would benefit from a more efficient up-to-date check than the one the Parsing API currently performs.

Design an API that lets the Parsing API record the status on disk and later (e.g. the next day) check for changes that happened in the meantime.
Comment 1 Jaroslav Tulach 2012-01-12 17:32:25 UTC
Created attachment 114841 [details]
Introducing TreeStamp
Comment 2 Jaroslav Tulach 2012-01-12 17:32:59 UTC
Tomáš, please confirm whether this API is usable from the Parsing API side.
Comment 3 Jesse Glick 2012-01-18 19:04:05 UTC
Interesting. I guess we need impls from versioning modules, and calls from parsing, to see how this will work. (Probably not really API_REVIEW_FAST before then.)


[JG01] The apichanges entry hints without saying outright that masterfs could be a provider on systems using ZFS (Mac OS X, Solaris). Is this planned?


[JG02] TreeStamp.series for Subversion would I guess record `svn info`?


[JG03] Last I remember, Eclipse has a generic API for tracking what changed in the workspace on disk between shutdown and subsequent startup, and features like the Java builder then query this API to determine what needs to be rechecked. This would roughly correspond to a default generic TreeStamp provider that simply collects a list of timestamps/sizes/checksums/etc. of all files under a root. The advantages over the current up-to-date check in parsing.impl would be that (1) parsing would have a single code path - via TreeStamp (*) - that would always be available, whether or not a checksum-based filesystem or VCS were in use; (2) any other code which needed to check for changes in a tree might be able to reuse the stamp, if it were somehow cached per root. Any plans to do something like this?


(*) Proposed FilteringPathResourceImplementation2 in bug #170231 seems related.
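
The generic fallback described in JG03 could look roughly like the sketch below. It is only an illustration: `FallbackStamp`, `snapshot` and this `findModified` are hypothetical names, not the TreeStamp API from the attached patch. It records size and last-modified time for every regular file under a root and compares two such snapshots.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;
import java.util.stream.Stream;

/** Hypothetical fallback stamp; NOT the TreeStamp API from the attached patch. */
final class FallbackStamp {
    /** Relative path -> "size:lastModifiedMillis" for each regular file. */
    private final Map<String, String> entries;

    private FallbackStamp(Map<String, String> entries) {
        this.entries = entries;
    }

    /** Walks the tree under root and records size and mtime of each regular file. */
    static FallbackStamp snapshot(Path root) throws IOException {
        Map<String, String> entries = new TreeMap<>();
        // Files.walk does not follow symbolic links unless asked to.
        try (Stream<Path> paths = Files.walk(root)) {
            for (Path p : (Iterable<Path>) paths::iterator) {
                if (Files.isRegularFile(p)) {
                    String key = root.relativize(p).toString();
                    entries.put(key, Files.size(p) + ":"
                            + Files.getLastModifiedTime(p).toMillis());
                }
            }
        }
        return new FallbackStamp(entries);
    }

    /** Paths added, removed, or changed between an older stamp and this one. */
    Set<String> findModified(FallbackStamp older) {
        Set<String> changed = new TreeSet<>();
        for (Map.Entry<String, String> e : entries.entrySet()) {
            if (!e.getValue().equals(older.entries.get(e.getKey()))) {
                changed.add(e.getKey());   // new file or different size/mtime
            }
        }
        for (String key : older.entries.keySet()) {
            if (!entries.containsKey(key)) {
                changed.add(key);          // file was deleted
            }
        }
        return changed;
    }
}
```

A stamp like this still costs a full tree walk, so it would only give parsing a single code path; it would not by itself be faster than the current check.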
Comment 4 Tomas Zezula 2012-01-19 09:18:49 UTC
Thanks Jardo!
The API seems good to me from parsing.api point of view.

It solves the switch-to-branch problem, as parsing.api can use the series as a cache identifier.
Parsing.api can also benefit from findModified() on VCSes that mark modified files in a transaction (ADE, ClearCase). On Mercurial without inotify, the IDE crawler is faster for checking the up-to-date state.

Unfortunately it does not help to solve the biggest problem of the up-to-date check: the slowness of the project system. The problem is that parsing.api knows only source roots, not project metadata, so it cannot check that project metadata did not change and cannot use the cached dependencies from the last IDE run. :-( But this API can be used by the project system to notify parsing.api which project metadata did not change.

Regarding JG03: +1 for moving the parsing.api FileObjectCrawler (or parts of it) into FS. It handles symlink cycles, etc. It would be perfect if someone else took care of it.
Comment 5 Jesse Glick 2012-01-19 13:22:44 UTC
(In reply to comment #4)
> the parsing.api [...] cannot check that project
> metadata did not change and it cannot use the cached dependencies from last
> IDE run.

Project metadata needs to be loaded for all sorts of reasons other than serving queries to parsing.api, and classpath computation is typically not much extra work once the basics are loaded.

Whether calculating this metadata is a major component of the UTD check depends on which projects you are referring to, and the warmth of disk caches. In the case of opening one or two nb.org projects and the resulting several-minute scan, the main problems are preferred source rather than binary dependencies (not solvable without massive changes), and the policy of scanning all transitive dependencies even when they could have no effect on open sources (filed separately). It could be the largest component in the case of a project with extremely complex metadata and a tiny source root.

> this api can be used by project system to notify parsing.api
> what project metadata did not change.

I cannot really imagine how that would work. In a Maven project for example, the classpaths could vary according to ~/.m2/settings.xml, or environment variables, etc.
Comment 6 Tomas Zezula 2012-01-19 13:40:57 UTC
> and the policy of scanning all transitive dependencies even when they could have no effect on open sources
Scanning just direct dependencies is not a good idea: navigation to a dependency will be very expensive and the raw index time is much worse, as explained in the issue. It will also increase the number of "unexpected scans", i.e. scans which happen during work rather than after IDE start, as also explained in the issue.
But this is unrelated to this review.

Right, it will not work for Maven, but it will solve the NetBeans API support project type, for which there are the most complaints.
Comment 7 Jesse Glick 2012-01-19 17:48:04 UTC
(In reply to comment #6)
> it will solve the netbeans api support
> project type for which there are most of complaints.

How so? These also rely on various external files, e.g. ~/.nbbuild.properties. And again you would still be loading the project anyway for other reasons, so I am not sure calculating the classpath directly from loaded project metadata would be any more work than loading some possibly unreliable cache.

Anyway apisupport users are a small minority of IDE users. We should not be wasting effort optimizing for them.
Comment 8 Tomas Zezula 2012-01-19 20:50:37 UTC
> am not sure calculating the classpath directly from loaded project metadata
> would be any more work than loading some possibly unreliable cache.
Which is true for most project types, but not for the API support project.
For the up-to-date check it takes the largest share of time, as seen in http://wiki.netbeans.org/IndexingMeasurement71.

Anyway, in the background scan branch it may not be a problem, as nearly all features work during the up-to-date check, so it is transparent. The only problem may be hiccups caused by the IO done by the project system. I have not yet had time to measure it.

I am not a promoter of the project deps cache, I just got a requirement to implement it. :-)
I personally prefer a slowness detector in project queries. If an SFBQImpl takes more than 500 ms, report it, and it should be fixed or removed.
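
A minimal sketch of such a slowness detector, assuming a hypothetical `SlowQueryDetector` wrapper (no such class is named in this issue) that times each query call and records a report when the 500 ms threshold mentioned above is exceeded:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

/** Hypothetical helper; the name and the reporting mechanism are assumptions. */
final class SlowQueryDetector {
    private final long thresholdMillis;
    private final List<String> reports = new ArrayList<>();

    SlowQueryDetector(long thresholdMillis) {
        this.thresholdMillis = thresholdMillis;
    }

    /** Runs the query and records a report if it took longer than the threshold. */
    <T> T timed(String queryName, Supplier<T> query) {
        long start = System.nanoTime();
        try {
            return query.get();
        } finally {
            long elapsedMillis = (System.nanoTime() - start) / 1_000_000;
            if (elapsedMillis > thresholdMillis) {
                reports.add(queryName + " took " + elapsedMillis + " ms");
            }
        }
    }

    /** Reports collected so far, one line per slow call. */
    List<String> reports() {
        return reports;
    }
}
```

A caller would wrap each query implementation, e.g. `detector.timed("SFBQImpl", () -> impl.query(file))`, where `impl.query` stands in for whatever the real query method is.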
Comment 9 Jesse Glick 2012-01-19 21:41:15 UTC
(In reply to comment #8)
> For up to date check it takes biggest amount of time as seen in
> IndexingMeasurement71.

Strange; this does not match my experience at all, e.g. for java.source with a lukewarm cache and a slightly outdated index:

Resolving dependencies took: 2,364 ms
Complete indexing of 20 binary roots took: 458 ms
Complete indexing of 71 source roots took: 26350 ms (New or modified files: 198, Deleted files: 0) [Adding listeners took: 238 ms]

or for a warm cache and a fully up-to-date index:

Resolving dependencies took: 321 ms
Complete indexing of 20 binary roots took: 21 ms
Complete indexing of 185 source roots took: 5043 ms (New or modified files: 0, Deleted files: 0) [Adding listeners took: 306 ms]

i.e. queries are below 10%. Probably depends a lot on disk speed, RAM, etc.

> I am not promoter of project deps cache, I just got a requirement to implement
> it.

And to maintain it with all the resulting bugs indefinitely? Enumerating every file or environmental factor that might possibly affect query results across IDE restarts seems pretty hard. In a functional language you could set up dataflow analysis but in Java it would be mostly trial and error.

I also doubt that the present API would be very useful for maintaining such a cache, since you are interested in particular individual files (nbproject/project.{xml,properties}, ~/.nbbuild.properties, etc.) scattered across the disk, rather than the set of files under one or two roots. Would be more straightforward to xor the paths of these files with their length ^ lastModified, adding in relevant environment variables and who knows what else, to determine if a cache might be stale. In other words, whether or not a project deps cache is a good idea, it probably is off topic here.
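
The xor-based staleness check suggested above might be sketched as follows. `MetadataFingerprint` is a hypothetical name, and passing the environment as a map is an assumption made so the example stays testable; xor is also a weak combiner, so this can only hint that a cache might be stale, never prove that it is fresh.

```java
import java.io.File;
import java.util.List;
import java.util.Map;

/** Hypothetical staleness fingerprint; not an existing NetBeans API. */
final class MetadataFingerprint {
    /**
     * Xors path hash, length and lastModified of each file, then mixes in
     * the given environment entries. Order of files does not matter.
     */
    static long compute(List<File> files, Map<String, String> environment) {
        long fingerprint = 0;
        for (File f : files) {
            // length() and lastModified() both return 0 for a missing file.
            fingerprint ^= f.getAbsolutePath().hashCode() ^ f.length() ^ f.lastModified();
        }
        for (Map.Entry<String, String> e : environment.entrySet()) {
            fingerprint ^= (e.getKey() + "=" + e.getValue()).hashCode();
        }
        return fingerprint;
    }
}
```

The caller would store the value next to the cache and recompute it on the next start; a different value means the cache should be discarded.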
Comment 10 Tomas Zezula 2012-01-19 22:03:14 UTC
>Probably depends a lot on disk speed, RAM, etc.
Probably yes.
Also, the time of "Resolving dependencies" is just a part of project queries; other significant time is spent in project queries inside indexers like CosSynchronizer. Jirka R. measured that a significant part of the TaskList indexer was project queries; he has disabled them in trunk and TL asks just for the root (not correct if there are excludes, but fast). But the cache will not help in these use cases, which is why I prefer the slowness detection.

>it probably is off topic here
Right.
Comment 11 Jesse Glick 2012-01-19 23:17:42 UTC
(In reply to comment #10)
> significant part of TaskList indexer was project queries

BTW the numbers above are with the Task List off.
Comment 12 Jaroslav Tulach 2012-02-06 16:37:12 UTC
I guess I can safely ignore comments 5-11 as unrelated. Next time consider discussing things like this in other places than an API review issue.

I've put my changes to branch:
http://hg.netbeans.org/core-main/rev/treestamp-207232

I've added support for a new 'isAncestor' query, i.e. the ability to find out whether a change is forward or backward. We discussed that in the perf team, and the parsing API could ask the user whether re-parsing backward-like changes is really intended.
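
To illustrate how a caller might use such a query, here is a sketch under stated assumptions: `StampHistory`, its parent map and the `Direction` enum are all hypothetical, and the real 'isAncestor' signature on the branch may differ. It classifies the move from a stored stamp to the current one as forward, backward, or diverged.

```java
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical revision history; the real isAncestor signature may differ. */
final class StampHistory {
    enum Direction { SAME, FORWARD, BACKWARD, DIVERGED }

    /** Revision id -> ids of its parent revisions. */
    private final Map<String, List<String>> parents;

    StampHistory(Map<String, List<String>> parents) {
        this.parents = parents;
    }

    /** True if candidate is a strict ancestor of descendant. */
    boolean isAncestor(String candidate, String descendant) {
        Deque<String> todo = new ArrayDeque<>();
        Set<String> seen = new HashSet<>();
        todo.push(descendant);
        while (!todo.isEmpty()) {
            String rev = todo.pop();
            if (!seen.add(rev)) {
                continue;   // already visited via another path
            }
            for (String parent : parents.getOrDefault(rev, Collections.emptyList())) {
                if (parent.equals(candidate)) {
                    return true;
                }
                todo.push(parent);
            }
        }
        return false;
    }

    /** Classifies the move from the stored revision to the current one. */
    Direction classify(String stored, String current) {
        if (stored.equals(current)) {
            return Direction.SAME;
        }
        if (isAncestor(stored, current)) {
            return Direction.FORWARD;
        }
        if (isAncestor(current, stored)) {
            return Direction.BACKWARD;
        }
        return Direction.DIVERGED;
    }
}
```

With this, a `BACKWARD` result is the case in which the parsing API could ask the user before re-parsing.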

Re. JG03 - yes, it would make sense to guarantee that there is a TreeStamp for every tree (e.g. have a default fallback). I just can't imagine we could encode the state of files in a single string. The parsing API would have to use serialization to store the TreeStamp. If that is OK, I can provide such an implementation.

Re. JG02 - it is up to versioning guys to comment. I don't know, I don't care.

Re. JG01 - Tomáš Zezula promised to provide the implementation for Mac OS X.
Comment 13 Jaroslav Tulach 2012-02-09 09:59:16 UTC
Ondra will implement the API in Hg. Tomáš will try to use it then.
Comment 14 Ondrej Vrabec 2012-02-10 15:55:07 UTC
Mercurial is able to create timestamps now: http://hg.netbeans.org/core-main/rev/e95952a931d2
Tomas, you can start using it. Let me know if it works at least somehow.
Comment 15 Ondrej Vrabec 2012-02-10 16:15:54 UTC
[OV01] I think using TreeStamp.Provider.class.getName() as the attribute name in getAttribute() to get an instance of the Provider class from the versioning subsystem breaks the contract between VCS and masterfs. VCS should probably respond only to attribute names starting with "ProvidedExtensions.". Adding Tomas to CC.
Comment 16 Tomas Stupka 2012-02-13 13:05:12 UTC
(In reply to comment #15)
> [OV01] i think TreeStamp.Provider.class.getName() used as the attribute name in
> getAttribute() to get an instance of the Provider class from versioning
> subsystem breaks the contract between VCS and masterfs. VCS should probably
> respond only to those starting with "ProvidedExtensions.". Adding tomas on CC.
Yes, masterfs delegates only getAttribute calls where the attribute name has the mentioned prefix.
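
The delegation rule can be pictured with the sketch below. It is not the masterfs source: `AttributeDelegation` and the `Function`-based VCS callback are assumptions, and only the "ProvidedExtensions." prefix comes from this discussion. It shows why an attribute named after `TreeStamp.Provider.class.getName()` would never reach the versioning subsystem.

```java
import java.util.function.Function;

/** Hypothetical illustration of the prefix rule; not the masterfs source. */
final class AttributeDelegation {
    /** Prefix quoted in this issue. */
    static final String PREFIX = "ProvidedExtensions.";

    /**
     * Forwards the lookup to the VCS callback only when the attribute name
     * carries the prefix; any other name is answered with null here.
     */
    static Object getAttribute(String name, Function<String, Object> vcs) {
        if (name != null && name.startsWith(PREFIX)) {
            return vcs.apply(name);
        }
        return null;
    }
}
```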

