Bug 126156 - If using hg in a shell, file badging/coloring is usually wrong
Product: versioncontrol
Classification: Unclassified
Component: Mercurial
All All
: P3 (vote)
: 6.x
Assigned To: issues@versioncontrol
Depends on:
Reported: 2008-01-28 23:43 UTC by _ tboudreau
Modified: 2009-12-08 02:29 UTC (History)

See Also:
Issue Type: DEFECT

All-Java program to read .hg/dirstate (13.88 KB, application/x-compressed)
2009-07-06 23:08 UTC, Jesse Glick
Java hg dirstate parser (15.87 KB, application/x-tar)
2009-07-07 00:46 UTC, _ tboudreau

Description _ tboudreau 2008-01-28 23:43:10 UTC
I've noticed invoking status from the mercurial menu will correct this;  but given that with mercurial we have as complete information as we could possibly 
want, quickly accessible, why are we keeping such a cache at all?  With CVS it makes sense because updating can involve network operation.  It doesn't make 
so much sense for mercurial integration.
Comment 1 Jesse Glick 2008-01-29 00:43:27 UTC
Seconded, I have noticed this as well and find it very annoying. There is no reason for the mercurial module to employ
any persistent cache. And if running 'hg stat' for individual files (e.g. after getting Filesystems notifications of new
timestamps) is too slow, then it would not be particularly hard to read .hg/dirstate directly.
Comment 2 John Rice 2008-01-29 18:46:59 UTC
Ah if only life were so simple :)

Mercurial does indeed have the info, but have you tried running a status command from the command line on the main
clone? It's taking me about 5 mins to complete. There is a need for a file status cache, as we have to make external hg
calls from the plugin to get that info, and every external hg call costs us a lot (0.5 sec at least).

The ideal solution would be to have some type of Inotify support so we could be notified of the file changes in a
reliable fashion and then the status could be reliably kept up to date, regardless of whether you are modifying things in the
IDE or on the command line. I believe this is coming into nevada soon and we'll use it when it's there.
Comment 3 Jesse Glick 2008-01-29 19:03:51 UTC
'hg stat' in a main is about 5 seconds for me on Ubuntu (on a laptop no less), but the point is that you can directly
access dirstate and check the status of an individual file far more quickly, since there is no need to do a statwalk.

(The inotify extension for Linux makes stat even faster, but unfortunately it has some serious bugs.)
Comment 4 John Rice 2008-01-29 21:40:37 UTC
Do you mean we should port the dirstate access routines from Mercurial, used by hg stat, into Java? Presumably there's a
parser that needs to be implemented, or do you mean something else?
Comment 5 John Rice 2008-01-29 21:52:05 UTC
Do you mean we should port the dirstate access routines from Mercurial, used by hg stat, into Java? Presumably there's a
parser that needs to be implemented, or do you mean something else?
Comment 6 Jesse Glick 2008-01-29 21:54:26 UTC
Sure, just parse it from Java. Pretty simple binary format, should not be a big deal.
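
For illustration, a minimal sketch of what such a parser could look like. It assumes the dirstate layout used by Mercurial at the time: two 20-byte parent hashes, then for each entry a one-byte state, four big-endian 32-bit ints (mode, size, mtime, filename length) and the filename bytes, optionally followed by a NUL and a copy source. The class and method names are hypothetical, and decoding filenames as ISO-8859-1 is only a lossless placeholder, not a statement about the real encoding.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch of a dirstate reader; not code from the Mercurial module. */
final class DirstateSketch {

    /** One tracked file as recorded in dirstate. */
    static final class Entry {
        final char state;        // 'n' normal, 'a' added, 'r' removed, 'm' merged
        final int mode;
        final int size;
        final int mtime;
        final String copiedFrom; // null unless a copy source was recorded

        Entry(char state, int mode, int size, int mtime, String copiedFrom) {
            this.state = state;
            this.mode = mode;
            this.size = size;
            this.mtime = mtime;
            this.copiedFrom = copiedFrom;
        }
    }

    /** Parses the full contents of .hg/dirstate into a filename -> entry map. */
    static Map<String, Entry> parse(byte[] data) {
        if (data.length < 40) {
            throw new IllegalArgumentException("dirstate shorter than its 40-byte header");
        }
        ByteBuffer buf = ByteBuffer.wrap(data); // ByteBuffer is big-endian by default
        buf.position(40);                       // skip the two 20-byte parent hashes
        Map<String, Entry> entries = new LinkedHashMap<>();
        while (buf.hasRemaining()) {
            if (buf.remaining() < 17) {
                throw new IllegalArgumentException("truncated entry header");
            }
            char state = (char) (buf.get() & 0xff);
            int mode = buf.getInt();
            int size = buf.getInt();
            int mtime = buf.getInt();
            int nameLength = buf.getInt();
            if (nameLength < 0 || nameLength > buf.remaining()) {
                throw new IllegalArgumentException("bad filename length " + nameLength);
            }
            byte[] raw = new byte[nameLength];
            buf.get(raw);
            // Assumption: ISO-8859-1 keeps every byte intact; the real encoding may differ.
            String name = new String(raw, StandardCharsets.ISO_8859_1);
            String copiedFrom = null;
            int nul = name.indexOf('\0');
            if (nul >= 0) { // "name\0source" marks a copied or renamed file
                copiedFrom = name.substring(nul + 1);
                name = name.substring(0, nul);
            }
            entries.put(name, new Entry(state, mode, size, mtime, copiedFrom));
        }
        return entries;
    }
}
```

Looking up one file is then a map access on the parsed result, with no statwalk involved; comparing the recorded size and mtime against the file on disk is left out of this sketch.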
Comment 7 John Rice 2008-01-29 21:59:10 UTC
Attaching some stats on hg status on my system on clones of hg.netbeans.org/main: first is with main_work that I've been
using today and has had plenty of updates and so on (11sec), second is cold clone main_test (250sec).

$ cd main_work/
$ hg stat --profile --time
? hg.prof
? main_work-64961-ca62dfc09ab7
         891874 function calls (889380 primitive calls) in 11.729 CPU seconds
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    8.628    8.628   10.041   10.041 dirstate.py:404(findfiles)
   190915    0.565    0.000    0.565    0.000 posixpath.py:56(join)
        1    0.475    0.475   11.728   11.728 dirstate.py:493(status)
        1    0.337    0.337    0.337    0.337 dirstate.py:124(_read)
    74729    0.269    0.000   10.909    0.000 dirstate.py:352(statwalk)
    20087    0.259    0.000    0.259    0.000 posixpath.py:373(normpath)
    773/1    0.209    0.000    0.389    0.389 sre_parse.py:374(_parse)
    74883    0.169    0.000    0.220    0.000 dirstate.py:376(imatch)
    74728    0.138    0.000    0.315    0.000 dirstate.py:333(_supported)
    74728    0.123    0.000    0.177    0.000 stat.py:54(S_ISREG)
    836/1    0.085    0.000    0.109    0.109 sre_compile.py:27(_compile)
    32649    0.076    0.000    0.076    0.000 sre_parse.py:182(__next)
    31874    0.059    0.000    0.133    0.000 sre_parse.py:201(get)
    95214    0.059    0.000    0.059    0.000 util.py:1135(pconvert)
    74729    0.054    0.000    0.054    0.000 stat.py:29(S_IFMT)
    74728    0.051    0.000    0.051    0.000 util.py:252(always)
   897/62    0.043    0.000    0.043    0.001 sre_parse.py:140(getwidth)
$ cd main_test/
$ hg stat --profile --time
? hg.prof
         889440 function calls (886945 primitive calls) in 250.259 CPU seconds

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
   189433    0.716    0.000    0.716    0.000 posixpath.py:56(join)
        1    0.551    0.551    0.552    0.552 dirstate.py:124(_read)
    20185    0.348    0.000    0.348    0.000 posixpath.py:373(normpath)
    74612    0.269    0.000  248.185    0.003 dirstate.py:352(statwalk)
    774/1    0.190    0.000    0.396    0.396 sre_parse.py:374(_parse)
    74611    0.177    0.000    0.230    0.000 dirstate.py:376(imatch)
    74611    0.143    0.000    0.343    0.000 dirstate.py:333(_supported)
    74611    0.142    0.000    0.200    0.000 stat.py:54(S_ISREG)
    836/1    0.086    0.000    0.112    0.112 sre_compile.py:27(_compile)
    32826    0.076    0.000    0.076    0.000 sre_parse.py:182(__next)
    32050    0.073    0.000    0.147    0.000 sre_parse.py:201(get)
    94625    0.060    0.000    0.060    0.000 util.py:1135(pconvert)
    74612    0.058    0.000    0.058    0.000 stat.py:29(S_IFMT)
    74611    0.053    0.000    0.053    0.000 util.py:252(always)
    31990    0.049    0.000    0.049    0.000 sre_parse.py:138(append)
   896/61    0.044    0.000    0.044    0.001 sre_parse.py:140(getwidth)
Comment 8 Jesse Glick 2008-01-29 23:09:54 UTC
Obviously stat with no args is much slower on a cold repo. The point is that by reading dirstate directly you should be
able to check status of files actually displayed in the IDE quite quickly and on demand, without trying to maintain a cache.
Comment 9 John Rice 2008-01-30 08:13:28 UTC
Given that disc access is always going to be by far the slowest operation, it is not at all clear that accessing the
dirstate directly will speed things up for us. Depends on how the cache is operating, how many specific stat calls we
are making and so on. Having said that, it's something we could experiment with. 

Currently the slowest operation for us is actually dealing with ignored files: because of the bug you reported in
Mercurial about not ignoring dirs, we are parsing the .hgignore files directly. We need to optimize this first before
looking to see if the cache is really a problem or not.

I'm reluctant to access the dirstate directly as this is an internal implementation detail of mercurial and not a
published spec, so it could change from under us at any time.
Comment 10 Maros Sandor 2008-01-30 09:53:07 UTC
IMHO we can NOT have decent badging/coloring performance in the IDE without any in-memory status cache, and running hg
anytime the IDE asks for coloring/badges is completely out of the question. For example, currently we are just parsing
CVS/Entries in AWT in response to annotator requests and already have a bug filed for blocking the event queue. 
I think that what Tim complains about is poor/nonexistent detection of file changes that happen outside the IDE. If Hg
detected such a change, it could update badges, no problem there. This is a problem that all versioning systems have to
live with currently.
Comment 11 _ tboudreau 2008-01-30 11:39:27 UTC
Well, the problem here is the badging being out-of-date - particularly if I've committed something from the command line, it will be marked modified even 
across restarts.

If you need an in-memory cache, why not memory map the dirstate file?  That should keep file access reasonable, NIO has reasonable locking, and there's 
your in-memory cache.  

Pattern matching the path you need should be reasonably cheap, and the OS's memory manager should do a pretty good job of keeping it available if it's 
being hit a lot (we've been using a similar approach in the output window for years and you can have hundreds of Mb of text in it without any performance 
problems scrolling up and down - not exactly the same problem, but the point is that NIO memory mapping works very well).
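
A rough sketch of the memory-mapping idea, assuming a Java 7+ runtime for java.nio.file. It maps the file read-only and reads the first parent hash straight out of the mapping, on the assumption that dirstate starts with two 20-byte parent hashes. MappedDirstate and its methods are hypothetical names, not existing IDE code, and locking is not shown.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Hypothetical helper showing a read-only mapping of .hg/dirstate. */
final class MappedDirstate {

    /** Maps the whole file; fails for files larger than Integer.MAX_VALUE bytes. */
    static MappedByteBuffer map(Path dirstate) throws IOException {
        try (FileChannel channel = FileChannel.open(dirstate, StandardOpenOption.READ)) {
            // A mapping stays valid after the channel that created it is closed.
            return channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
        }
    }

    /** Hex form of the first parent hash, read with absolute gets so the position is untouched. */
    static String firstParentHex(ByteBuffer buf) {
        if (buf.limit() < 20) {
            throw new IllegalArgumentException("buffer shorter than one parent hash");
        }
        StringBuilder hex = new StringBuilder(40);
        for (int i = 0; i < 20; i++) {
            hex.append(String.format("%02x", buf.get(i) & 0xff));
        }
        return hex.toString();
    }
}
```

Only the pages that are actually touched need to be resident, which is the property the output-window comparison relies on.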
Comment 12 Jesse Glick 2008-02-27 19:31:04 UTC
This is becoming increasingly annoying. In practice, I run commits about half the time from the shell, and do all other
write operations from the shell. Now, although I often have no outstanding modifications, most of the projects I work on
permanently display modification badges.

If you can't do any better, at least discard caches after an IDE restart, i.e. make them nonpersistent. For now I am
adding an entry to my NB wrapper script:

rm -rf $userdir/var/cache/mercurialcache
Comment 13 jonast 2008-10-30 20:13:59 UTC
I just removed 2000 unneeded files that I have in my repo to speed up compiling. After this NetBeans was pretty messed 
up. I had to update again (using the command line) and the cache was really screwed up. The solution was to remove it; however, 
it took me a while to realize what the cause was. The previous poster was the one that helped me out. 

I think this issue needs some more attention
Comment 14 _ tboudreau 2008-11-01 22:54:05 UTC
Any progress on this?  Specifically the suggestion of directly memory mapping the dirstate file from hg (and not using
any other cache)?  It seems like an impedance mismatch to try to shoehorn mercurial into the same shoe as CVS and SVN
(i.e. any general-purpose caching system [wasn't it called "turbo" when CVS support was rewritten?] simply should not be
used for hg - if it is impossible to use the API for integrating VCS's without getting caching, that should be fixed).

AFAIK this is a file which will only ever be *appended* to.  So testing whether the IDE's UI is in sync with the
repository's metadata is as simple as checking the last-modified date against a cached one.

The only objection I can see to doing this is:
> I'm reluctant to access the dirstate directly as this is an internal implementation detail of mercurial and not a
> published spec, so it could change from under us at any time.

This sort of thing should be fairly easy to detect and show an error message.  Jesse, you're involved in hg, I think. 
Any idea of the likelihood of this?  Has it happened in the past?  Would you find out ahead of time if the hg folks plan
to change it?
Comment 15 Jesse Glick 2008-11-03 19:13:26 UTC
dirstate is not appended to; you are thinking of repository files.

I do not know of historical changes to the dirstate format, but I doubt these would be more likely than other changes to
Hg that the module needs to track.
Comment 16 Jesse Glick 2009-07-06 23:08:12 UTC
Created attachment 84419 [details]
All-Java program to read .hg/dirstate
Comment 17 Jesse Glick 2009-07-06 23:17:08 UTC
The attached program demonstrates that it is very easy & fast to parse .hg/dirstate in Java. Printing the repo's status
a la 'hg st' given the combination of .hg/dirstate and the current disk stats (as the program also does) is trickier,
though this problem may be more general than what the IDE needs for its display. (The whole loop is roughly as fast as
the CPython-based Hg executable, and I have made no attempt to optimize it.)

The point to note is that you can cache the DirState object so long as .hg/dirstate does not change on disk (i.e.
between non-r/o calls to Hg), and then very quickly get status for a subdirectory of the repo without forking an Hg process.
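
The caching rule described here can be sketched roughly as follows. This is not the attached program: ReloadOnChangeCache is a hypothetical name, the parser is passed in as a function, and the staleness check (last-modified time plus length) is an assumption about what counts as "changed on disk".

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.FileTime;
import java.util.function.Function;

/** Hypothetical cache that re-parses a file only when its mtime or length changes. */
final class ReloadOnChangeCache<T> {
    private final Path file;
    private final Function<byte[], T> parser;
    private boolean loaded;
    private FileTime cachedMtime;
    private long cachedSize;
    private T cached;

    ReloadOnChangeCache(Path file, Function<byte[], T> parser) {
        this.file = file;
        this.parser = parser;
    }

    /** Returns the parsed value, re-reading the file first if it looks different on disk. */
    synchronized T get() throws IOException {
        FileTime mtime = Files.getLastModifiedTime(file);
        long size = Files.size(file);
        if (!loaded || size != cachedSize || !mtime.equals(cachedMtime)) {
            cached = parser.apply(Files.readAllBytes(file));
            cachedMtime = mtime;
            cachedSize = size;
            loaded = true;
        }
        return cached;
    }
}
```

One caveat with this check: a rewrite that keeps the same length and lands within the filesystem's timestamp granularity would go unnoticed, so a real implementation might also compare the parent hashes at the start of the file.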
Comment 18 _ tboudreau 2009-07-07 00:46:28 UTC
Created attachment 84422 [details]
Java hg dirstate parser
Comment 19 _ tboudreau 2009-07-07 01:03:46 UTC
I've attached a Java parser for hg dirstate files based on the format documentation here:

You can try it just with java -jar DirstateParser.jar /path/to/.hg/dirstate

The lookup time for any FileEntry (basically a pointer to an offset in dirstate that can parse its contents) on my
machine is in the nanosecond range.  Things it would need to make it production-worthy are few and doable:

1. Need to find out what encoding filenames in dirstate use - I'm assuming US-ASCII, but it could be system-specific or
UTF-8 - I didn't find any documentation of that

2. Currently it memory-maps the entire dirstate file.  Probably it should:
 - Memory-map recently requested sections rather than the whole file
 - Start a timer whenever the file is opened, and close the mapping if unused for a while

3. I've run into a few OOMEs running this over $NBSRC/.hg/dirstate.  Things to deal with:
 - It's not documented whether MappedByteBuffer.slice() returns a MappedByteBuffer, or whether it actually does a
heap-copy of the requested slice
 - Currently the byte buffers we hand back to FileEntry have a start point of the offset of the FileEntry, but are as
long as the remaining length of the file - if we are getting heap-copies, this is probably the source of the OOMEs. 
Dirstate.buf() could read the int at offset 15 to get the needed length and only return a ByteBuffer of the necessary size.

4. Currently we re-map if the file's last-modified time changes.  Probably we only need to remap if the *length* changes.

At any rate, it works, it's fast, and the design is solid.  It shouldn't take too much adaptation to make it work as an
always-accurate, pseudo-in-memory, high-speed cache for hg metadata.
Comment 20 _ tboudreau 2009-07-07 01:17:08 UTC
Great minds think alike :-)
Comment 21 Jesse Glick 2009-07-07 16:06:45 UTC
Thanks for the wiki link. I just read mercurial/pure/parsers.py - the Python code to read dirstate is only a few lines
long, so it was very easy to adapt to Java. BTW Tim you neglected to parse the "copied from" field.

Tim's version is surely more memory-efficient, though even my simplistic impl was very fast to scan for a clone of the
NB main repo. The best impl would probably use a memmap like Tim's, then create some kind of compact index (in heap)
to be able to quickly look up an entry's offset by filename, perhaps using CRC32 sums to save space. Such an index would
just get recomputed in case the timestamp changed. (Tim's impl is interesting but incorrect - could be fooled by other
fields in the file which happen to look like ASCII chars, or simply by "foo/bar" vs. "foo/bars" in the same repo. It
also requires a linear search over the file, which could be undesirable.)

As to #1, the answer is that the filename is encoded using whatever encoding the platform decides to use for filenames,
i.e. new String(byte[]) is probably correct.

To #2, mapping recently-requested sections would probably not work as the entries seem to be randomly mixed together.
You could close the file after a timeout if it seemed necessary.
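
A sketch of the CRC32 index idea, under the same layout assumption as before (40-byte header, 17-byte entry header with the filename length at offset 13 within the entry, optional NUL plus copy source after the name). DirstateIndex is a hypothetical name, and the boxed map is used only for brevity; a truly compact index would use primitive arrays. Each candidate is verified byte for byte, so neither checksum collisions nor "foo/bar" vs. "foo/bars" can produce a false match.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.zip.CRC32;

/** Hypothetical index from CRC32(filename) to entry offsets in a dirstate buffer. */
final class DirstateIndex {
    private static final int HEADER = 40;       // two 20-byte parent hashes
    private static final int ENTRY_HEADER = 17; // state byte + mode, size, mtime, name length

    private final ByteBuffer buf;
    private final Map<Integer, List<Integer>> offsetsByCrc = new HashMap<>();

    DirstateIndex(ByteBuffer dirstate) {
        // Share the content but not position/limit; only absolute gets are used below.
        buf = dirstate.duplicate().order(ByteOrder.BIG_ENDIAN);
        int pos = HEADER;
        while (pos + ENTRY_HEADER <= buf.limit()) {
            int fieldLength = buf.getInt(pos + 13);
            if (fieldLength < 0 || fieldLength > buf.limit() - pos - ENTRY_HEADER) {
                throw new IllegalArgumentException("corrupt entry at offset " + pos);
            }
            offsetsByCrc.computeIfAbsent(crc(nameAt(pos)), k -> new ArrayList<>()).add(pos);
            pos += ENTRY_HEADER + fieldLength;
        }
    }

    /** Offset of the entry for the given filename, or -1 if it is not tracked. */
    int offsetOf(String filename) {
        byte[] wanted = filename.getBytes(StandardCharsets.ISO_8859_1); // encoding is an assumption
        List<Integer> candidates =
                offsetsByCrc.getOrDefault(crc(wanted), Collections.<Integer>emptyList());
        for (int offset : candidates) {
            if (Arrays.equals(wanted, nameAt(offset))) {
                return offset;
            }
        }
        return -1;
    }

    /** Filename bytes of the entry at the given offset, without any copy source. */
    private byte[] nameAt(int entryOffset) {
        int fieldLength = buf.getInt(entryOffset + 13);
        int start = entryOffset + ENTRY_HEADER;
        int nameLength = 0;
        while (nameLength < fieldLength && buf.get(start + nameLength) != 0) {
            nameLength++;
        }
        byte[] name = new byte[nameLength];
        for (int i = 0; i < nameLength; i++) {
            name[i] = buf.get(start + i);
        }
        return name;
    }

    private static int crc(byte[] name) {
        CRC32 crc = new CRC32();
        crc.update(name, 0, name.length);
        return (int) crc.getValue();
    }
}
```

Building the index is still one linear pass over the file, but it only has to be repeated when dirstate changes; lookups afterwards touch just the candidate entries.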
Comment 22 Tomas Stupka 2009-07-08 20:12:40 UTC
thanks for the dirstate parsers.

Now back to the problems mentioned in this issue:
A) external changes aren't well reflected in the IDE
B) hg status is too slow - this was/is already addressed by several issues. Please note that a fix was 
introduced just recently which should noticeably improve the status refresh caused by file 
change events.

The proposed solution is to give up the existing hg st/nb cache infrastructure and to 
base the status coloring and badging on .hg/dirstate. As far as we understand the dirstate file, 
it holds an entry for each file managed by Mercurial, with its last known modification 
timestamp, size, state etc. Those values are eventually used by the hg st command to compute each file's 
status, and the idea suggested above is that NetBeans could do the same instead of calling a hg shell command.

Unfortunately, there are a few more things involved in how file annotations are 
computed than just finding out if a particular file is modified, deleted, added...

1) A clone might also contain files which aren't tracked by dirstate: 
1a) Files with the status "? = not tracked" - rendered in the IDE as locally new. 
1b) Files with a conflict.
The question in the case of a folder's icon annotation request is how to find out that there is 
such a file hidden somewhere deep in that folder's file structure. Scan the whole folder tree 
and check each file against dirstate (1a) or otherwise (1b) to learn if it's 1a or 1b? This might cause 
massive file I/O and, as already mentioned by msandor, a persistent cache seems to make more sense 
than doing this on the fly. 

1c) Ignored files - whether a file is ignored is given by .hgignore and by the sharability query; dirstate 
isn't of much help here, and a persistent cache also seems to make more sense at this point.

2) External modifications
2a) Externally modified, deleted or added files are usually recognized by the IDE and properly 
rendered in the relevant views. The only gap we know about is files which 
hadn't been introduced to the IDE's filesystem by the moment the user returns to the IDE. Therefore 
they aren't refreshed by the fs and no event is sent to the VCS modules. Again, dirstate doesn't seem to be of 
much help in such a scenario. The Mercurial module won't be notified about an externally changed file 
just by hanging on the dirstate file. The trigger must come from somewhere else. Not sure how to solve 
this, and it is a general problem in all VCS modules.

2b) External changes to a file's status without changing its contents and modification timestamp - 
e.g. a commit won't change a file's modification timestamp, the IDE won't recognize such a file as
externally changed, no event is sent to the Mercurial module and a wrong annotation is rendered in the 
IDE. This seems to be the most prominent cause of the reported coloring/badging problem. 
Dirstate could help us out of this one (thanks for the tip!). An hg commit changes the dirstate file's lastMod 
timestamp and the parent hashcodes, so this could be the trigger to invoke a status refresh. No need 
to parse the whole file. Then, as suggested by Jesse, even a hg st on the whole clone should return 
fast enough. Calling the command only on a few relevant subtrees given by the files known (cached) as 
modified at the moment would be even faster. Will be a bit tricky but a fix seems to be possible and
will follow.

To make the point: the fix in 2b should solve the most critical and fixable part of the reported 
problem. It also still seems to us that relying only on dirstate wouldn't be a sufficient solution and we still 
think we need a cache. Not to mention the effort needed for a big bang rewrite. There is a high chance that bringing 
it to a mature production stage in all its details won't be any less hairy than the actual implementation, and 
the probable gain isn't that evident yet.

Comment 23 Jesse Glick 2009-07-08 21:42:01 UTC
To 1a - right; as the demo program shows, these can be identified by their absence from both dirstate and .hgignore.
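
As a rough illustration of that rule (not the demo program itself): a path is tracked if dirstate lists it, ignored if it is untracked and an ignore pattern matches, and unknown otherwise. The sketch assumes the ignore rules have already been compiled from regexp-syntax .hgignore lines and are matched unrooted; glob sections and differences between Python and Java regexps are not handled. StatusSketch, Kind and classify are hypothetical names.

```java
import java.util.List;
import java.util.Set;
import java.util.regex.Pattern;

/** Hypothetical classification of a path from dirstate membership and ignore patterns. */
final class StatusSketch {

    enum Kind { TRACKED, IGNORED, UNKNOWN }

    /**
     * @param repoRelativePath path relative to the repository root, '/'-separated
     * @param trackedNames     filenames that have an entry in dirstate
     * @param ignorePatterns   patterns compiled from regexp-syntax .hgignore lines (assumption)
     */
    static Kind classify(String repoRelativePath,
                         Set<String> trackedNames,
                         List<Pattern> ignorePatterns) {
        if (trackedNames.contains(repoRelativePath)) {
            return Kind.TRACKED; // ignore rules only apply to untracked files
        }
        for (Pattern pattern : ignorePatterns) {
            if (pattern.matcher(repoRelativePath).find()) { // unrooted: match anywhere in the path
                return Kind.IGNORED;
            }
        }
        return Kind.UNKNOWN; // shown by 'hg st' as '?'
    }
}
```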

1b - not sure offhand, but I think dirstate has a status for these.

To status of folders - fine enough to remember status for a limited time, but better to recalculate if in doubt. Better
indeed to not mark folders at all (or only to a limited depth), than to mark them the way it is currently done, i.e.
often wrong.

1c - ignored files are indicated by .hgignore; no need for any caching.

The suggested changes in 2b would be helpful, I think. The current state is especially painful for MQ users, since the
IDE's status is wrong more often than not.

I will continue to recommend that no cache be persisted to disk by the IDE. If you really think you need cache
persistence (and I have been happier since I started deleting the cache before every IDE restart), much better to read
dirstate which gets updated by actual Hg commands.
Comment 24 _ tboudreau 2009-07-09 04:25:15 UTC
> 1c) ignored files - if a file is ignored is given by hgignore and by the sharability query, dirstat 
> isn't of much help here and a persistent cache also seems to make more sense at this point.

For new files and ignored files, the regexp approach I took (wrap the ByteBuffer of the whole file into a CharSequence
implementation and run a regexp with the literal flag over it) can, as Jesse rightly points out, be fooled by moved
files and missing subsequences; however, for files (not folders), it should be a very fast way to determine if a file is
unknown (i.e. should be marked as new) or not; then just parse it against the expressions in .hgignore (or do those
steps in reverse order if the performance is better).  As I mentioned, the CharSequence over ByteBuffer w/ regexp
approach performed lookups consistently in < 1ms (on an AMD64 quad core machine with 4gb - YMMV).

Re a persistent cache, Jesse mentioned the option of maintaining a map of SHA1 hash of filename -> entry offset in
dirstate.  Although if it could be made accurate, the regexp approach seems to be extremely fast even over files as
large as $nbsrc/.hg/dirstate.

> The mercurial module won't be notified about a externally changed file just by hanging on the dirstate file.
No, you would want to compare dirstate's lastModified with the File's, but this should be simple and fast.

> external changes to a files status without changing it's contents and modification timestamp

AFAIK Mercurial is smart about this - if you touch a file in mercurial, and then commit, it should not show up in the
list of the commits - if that's what you mean.  

If you mean an external Mercurial commit changing the status of a file, that's exactly where mapping dirstate should
help.  Cache the lastmodified time of dirstate itself.  Do it the same way we refresh filesystems when the IDE regains
focus - if dirstate has changed, re-fetch the status from dirstate for each live FileObject and see if it's changed, and
update the filesystem annotations appropriately.

> relying only on dirstate wouldn't be a sufficient solution and we still think we need a cache.
What is left that you would need to cache?

> effort needed to for a big bang rewrite
I took a quick look (and I mean quick) at the Mercurial module sources, and it looks as if most of the changes
could be done in o.n.m.mercurial.VersionsCache and o.n.m.mercurial.FileStatusCache - just eliminate the usage of
DiskMapTurboProvider in favor of direct use of dirstate.  It looks like all access goes through FileStatusCache, so
probably you could leave the signatures of that alone and end up with a fairly small changeset.  Like I said, I only did
a quick look and some usage searches, so I could be wrong about that.

Comment 25 _ tboudreau 2009-07-09 04:38:07 UTC
> To #2, mapping recently-requested sections would probably not work as the entries seem to be randomly mixed together.
> You could close the file after a timeout if it seemed necessary.

Well, it seems that the mapping is not the source of the OOMEs, so mapping the whole file is probably harmless.  The
OS's memory manager should be smart enough not to hold the whole thing in RAM.

The only case where you need multiple mappings would be dirstate files with size > Integer.MAX_VALUE (ByteBuffer is
indexed [unfortunately] in ints).  I doubt we will run into too many such dirstate files, so my suggestion would be to
simply not implement support for files of such sizes - just say "I can't read this" and have the HG filesystem do
nothing.  If there are real use cases for supporting it, we can address that when there's a request for it (in that
case, I suggest just wrapping multiple mappings in a signature-clone of ByteBuffer that is long-indexed, remove the
(int) casts and the rest of the code can remain unchanged).
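
If it ever came to that, the long-indexed wrapper might look roughly like this sketch. LongIndexedBuffer is a hypothetical class: it takes already-created chunks (in practice each would come from a separate FileChannel.map call over a consecutive region) and offers only the two read methods needed here, big-endian like ByteBuffer's default.

```java
import java.nio.ByteBuffer;

/** Hypothetical long-indexed, read-only view over several consecutive ByteBuffer chunks. */
final class LongIndexedBuffer {
    private final ByteBuffer[] chunks;
    private final int chunkSize;
    private final long size;

    /** All chunks except the last must hold exactly chunkSize bytes. */
    LongIndexedBuffer(int chunkSize, ByteBuffer... chunks) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        long total = 0;
        for (int i = 0; i < chunks.length; i++) {
            int length = chunks[i].limit();
            boolean last = i == chunks.length - 1;
            if (length > chunkSize || (!last && length != chunkSize)) {
                throw new IllegalArgumentException("only the last chunk may be short");
            }
            total += length;
        }
        this.chunks = chunks.clone();
        this.chunkSize = chunkSize;
        this.size = total;
    }

    long size() {
        return size;
    }

    /** Absolute read of one byte at a long index. */
    byte get(long index) {
        if (index < 0 || index >= size) {
            throw new IndexOutOfBoundsException("index " + index);
        }
        return chunks[(int) (index / chunkSize)].get((int) (index % chunkSize));
    }

    /** Absolute big-endian int read; works even when the four bytes straddle two chunks. */
    int getInt(long index) {
        return ((get(index) & 0xff) << 24)
                | ((get(index + 1) & 0xff) << 16)
                | ((get(index + 2) & 0xff) << 8)
                | (get(index + 3) & 0xff);
    }
}
```

Reading byte by byte keeps the boundary handling trivial at some cost in speed; a tuned version would delegate to the chunk's own getInt when the read does not cross a boundary.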
Comment 26 Ondrej Vrabec 2009-12-07 03:44:44 UTC
fix: http://hg.netbeans.org/cdev/rev/0cfd663e6e60
Comment 27 Jesse Glick 2009-12-07 10:29:34 UTC
I look forward to trying it.

By the way, the initialization of HANDLED_HGFOLDER_FILES could be accomplished more simply with

private static final String[] HANDLED_HGFOLDER_FILES = {"branch", ...};

and isHandledHgFolderFile could use Arrays.asList(HANDLED_HGFOLDER_FILES).contains(file.getName()). Could even make a one-liner:

return file.getName().matches("(undo[.])?(branch|dirstate)");

Similarly, REPOSITORY_NOMODIFICATION_COMMANDS could be initialized as a one-liner with

private static final Set<String> REPOSITORY_NOMODIFICATION_COMMANDS = new HashSet<String>(Arrays.asList(HG_ANNOTATE_CMD, ...));
Comment 28 Quality Engineering 2009-12-08 02:29:21 UTC
Integrated into 'main-golden', will be available in build *200912080200* on http://bits.netbeans.org/dev/nightly/ (upload may still be in progress)
Changeset: http://hg.netbeans.org/main/rev/0cfd663e6e60
User: Ondrej Vrabec <ovrabec@netbeans.org>
Log: Issue #126156 - If using hg in a shell, file badging/coloring is usually wrong
listening for FS changes on metadata in .hg
