
Bug 126156 - If using hg in a shell, file badging/coloring is usually wrong
Status: NEW
Product: versioncontrol
Component: Mercurial
Version: 6.1
Platform: All All
Priority: P3 (vote)
Target Milestone: 6.9
Reported: 2008-01-28 23:43 by
Modified: 2009-10-05 08:08 (History)
Issue Type: DEFECT

Attachments
All-Java program to read .hg/dirstate (13.88 KB, application/x-compressed)
2009-07-06 23:08, Jesse Glick
Java hg dirstate parser (15.87 KB, application/x-tar)
2009-07-07 00:46, _ tboudreau

Description From 2008-01-28 23:43:10
I've noticed invoking status from the mercurial menu will correct this;  but
given that with mercurial we have as complete information as we could possibly 
want, quickly accessible, why are we keeping such a cache at all?  With CVS it
makes sense because updating can involve network operation.  It doesn't make 
so much sense for mercurial integration.
------- Comment #1 From 2008-01-29 00:43:27 -------
Seconded, I have noticed this as well and find it very annoying. There is no
reason for the mercurial module to employ
any persistent cache. And if running 'hg stat' for individual files (e.g. after
getting Filesystems notifications of new
timestamps) is too slow, then it would not be particularly hard to read
.hg/dirstate directly.
------- Comment #2 From 2008-01-29 18:46:59 -------
Ah if only life were so simple :)

Mercurial does indeed have the info, but have you tried running a status
command from the command line on the main clone? It's taking me about 5 mins
to complete. There is a need for a file status cache, as we have to make
external hg calls to get that info from the plugin, and every external hg call
costs us a lot (0.5 sec at least).

The ideal solution would be to have some type of inotify support so we could be
notified of file changes in a reliable fashion; the status could then be kept
up to date regardless of whether you are modifying things in the IDE or on the
command line. I believe this is coming into Nevada soon, and we'll use it when
it's there.
------- Comment #3 From 2008-01-29 19:03:51 -------
'hg stat' in a main is about 5 seconds for me on Ubuntu (on a laptop no less),
but the point is that you can directly
access dirstate and check the status of an individual file far more quickly,
since there is no need to do a statwalk.

(The inotify extension for Linux makes stat even faster, but unfortunately it
has some serious bugs.)
------- Comment #4 From 2008-01-29 21:40:37 -------
Do you mean we should port the dirstate access routines used by hg stat from
Mercurial into Java? Presumably there's a parser that needs to be implemented,
or do you mean something else?
------- Comment #5 From 2008-01-29 21:52:05 -------
Do you mean we should port the dirstate access routines used by hg stat from
Mercurial into Java? Presumably there's a parser that needs to be implemented,
or do you mean something else?
------- Comment #6 From 2008-01-29 21:54:26 -------
Sure, just parse it from Java. Pretty simple binary format, should not be a big
deal.
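
A minimal sketch of what such a parser could look like, assuming the classic
(version 1) dirstate layout: two 20-byte parent hashes, then one record per
file consisting of a state byte, four big-endian 32-bit integers (mode, size,
mtime, name length) and the name bytes, where a NUL inside the name separates
the path from its copy source. The class and field names are hypothetical, and
decoding names as ISO-8859-1 is an assumption made only to keep every byte
intact.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical minimal reader for the version-1 .hg/dirstate layout. */
final class DirstateSketch {

    /** One tracked file as recorded in dirstate. */
    static final class Entry {
        final char state;        // 'n' normal, 'a' added, 'r' removed, 'm' merged
        final int mode;          // file mode bits recorded by hg
        final int size;          // recorded size in bytes
        final int mtime;         // recorded modification time (seconds)
        final String path;       // repository-relative path
        final String copiedFrom; // copy source, or null if none

        Entry(char state, int mode, int size, int mtime, String path, String copiedFrom) {
            this.state = state;
            this.mode = mode;
            this.size = size;
            this.mtime = mtime;
            this.path = path;
            this.copiedFrom = copiedFrom;
        }
    }

    /** Parses the complete contents of a dirstate file already read into memory. */
    static List<Entry> parse(byte[] data) {
        List<Entry> entries = new ArrayList<>();
        ByteBuffer buf = ByteBuffer.wrap(data); // big-endian by default, like the file
        if (buf.remaining() < 40) {
            return entries; // empty or truncated header: nothing tracked
        }
        buf.position(40); // skip the two 20-byte parent changeset hashes
        while (buf.remaining() >= 17) { // 1 state byte + 4 ints of fixed header
            int start = buf.position();
            char state = (char) (buf.get() & 0xFF);
            int mode = buf.getInt();
            int size = buf.getInt();
            int mtime = buf.getInt();
            int nameLength = buf.getInt();
            if (nameLength < 0 || nameLength > buf.remaining()) {
                throw new IllegalArgumentException("Corrupt entry at offset " + start);
            }
            byte[] raw = new byte[nameLength];
            buf.get(raw);
            // Assumption: ISO-8859-1 maps each byte to one char, so nothing is lost.
            String name = new String(raw, StandardCharsets.ISO_8859_1);
            int nul = name.indexOf('\0'); // "path\0source" marks a copy or rename
            String path = nul < 0 ? name : name.substring(0, nul);
            String copiedFrom = nul < 0 ? null : name.substring(nul + 1);
            entries.add(new Entry(state, mode, size, mtime, path, copiedFrom));
        }
        return entries;
    }
}
```

Something like DirstateSketch.parse(Files.readAllBytes(Paths.get(".hg/dirstate")))
would then yield the tracked entries; working out modified or untracked files
still requires comparing them against the disk.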
------- Comment #7 From 2008-01-29 21:59:10 -------
Attaching some stats on hg status on my system on clones of
hg.netbeans.org/main: first is with main_work that I've been
using today and has had plenty of updates and so on (11sec), second is cold
clone main_test (250sec).

$ cd main_work/
$ hg stat --profile --time
? hg.prof
? main_work-64961-ca62dfc09ab7
         891874 function calls (889380 primitive calls) in 11.729 CPU seconds
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    8.628    8.628   10.041   10.041 dirstate.py:404(findfiles)
   190915    0.565    0.000    0.565    0.000 posixpath.py:56(join)
        1    0.475    0.475   11.728   11.728 dirstate.py:493(status)
        1    0.337    0.337    0.337    0.337 dirstate.py:124(_read)
    74729    0.269    0.000   10.909    0.000 dirstate.py:352(statwalk)
    20087    0.259    0.000    0.259    0.000 posixpath.py:373(normpath)
    773/1    0.209    0.000    0.389    0.389 sre_parse.py:374(_parse)
    74883    0.169    0.000    0.220    0.000 dirstate.py:376(imatch)
    74728    0.138    0.000    0.315    0.000 dirstate.py:333(_supported)
    74728    0.123    0.000    0.177    0.000 stat.py:54(S_ISREG)
    836/1    0.085    0.000    0.109    0.109 sre_compile.py:27(_compile)
    32649    0.076    0.000    0.076    0.000 sre_parse.py:182(__next)
    31874    0.059    0.000    0.133    0.000 sre_parse.py:201(get)
    95214    0.059    0.000    0.059    0.000 util.py:1135(pconvert)
    74729    0.054    0.000    0.054    0.000 stat.py:29(S_IFMT)
    74728    0.051    0.000    0.051    0.000 util.py:252(always)
   897/62    0.043    0.000    0.043    0.001 sre_parse.py:140(getwidth)

$ cd main_test/
$ hg stat --profile --time
? hg.prof
         889440 function calls (886945 primitive calls) in 250.259 CPU seconds

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
   189433    0.716    0.000    0.716    0.000 posixpath.py:56(join)
        1    0.551    0.551    0.552    0.552 dirstate.py:124(_read)
    20185    0.348    0.000    0.348    0.000 posixpath.py:373(normpath)
    74612    0.269    0.000  248.185    0.003 dirstate.py:352(statwalk)
    774/1    0.190    0.000    0.396    0.396 sre_parse.py:374(_parse)
    74611    0.177    0.000    0.230    0.000 dirstate.py:376(imatch)
    74611    0.143    0.000    0.343    0.000 dirstate.py:333(_supported)
    74611    0.142    0.000    0.200    0.000 stat.py:54(S_ISREG)
    836/1    0.086    0.000    0.112    0.112 sre_compile.py:27(_compile)
    32826    0.076    0.000    0.076    0.000 sre_parse.py:182(__next)
    32050    0.073    0.000    0.147    0.000 sre_parse.py:201(get)
    94625    0.060    0.000    0.060    0.000 util.py:1135(pconvert)
    74612    0.058    0.000    0.058    0.000 stat.py:29(S_IFMT)
    74611    0.053    0.000    0.053    0.000 util.py:252(always)
    31990    0.049    0.000    0.049    0.000 sre_parse.py:138(append)
   896/61    0.044    0.000    0.044    0.001 sre_parse.py:140(getwidth)
------- Comment #8 From 2008-01-29 23:09:54 -------
Obviously stat with no args is much slower on a cold repo. The point is that by
reading dirstate directly you should be
able to check status of files actually displayed in the IDE quite quickly and
on demand, without trying to maintain a cache.
------- Comment #9 From 2008-01-30 08:13:28 -------
Given that disk access is always going to be by far the slowest operation, it
is not at all clear that accessing the dirstate directly will speed things up
for us. It depends on how the cache is operating, how many specific stat calls
we are making, and so on. Having said that, it's something we could experiment
with.

Currently the slowest operation for us is actually dealing with ignored files:
due to the bug you reported in Mercurial about not ignoring dirs, we are
parsing the .hgignore files directly. We need to optimize this first before
looking to see if the cache is really a problem or not.

I'm reluctant to access the dirstate directly as this is an internal
implementation detail of mercurial and not a
published spec, so it could change from under us at any time.
------- Comment #10 From 2008-01-30 09:53:07 -------
IMHO we can NOT have decent badging/coloring performance in the IDE without
an in-memory status cache, and running hg any time the IDE asks for
coloring/badges is completely out of the question. For example, currently we
are just parsing CVS/Entries in AWT in response to annotator requests and
already have a bug filed for blocking the event queue.
I think that what Tim is complaining about is poor/nonexistent detection of
file changes that happen outside the IDE. If Hg detected such a change, it
could update badges, no problem there. This is a problem that all versioning
systems have to live with currently.
------- Comment #11 From 2008-01-30 11:39:27 -------
Well, the problem here is the badging being out-of-date - particularly if I've
committed something from the command line, it will be marked modified even 
across restarts.

If you need an in-memory cache, why not memory map the dirstate file?  That
should keep file access reasonable, NIO has reasonable locking, and there's 
your in-memory cache.  

Pattern matching the path you need should be reasonably cheap, and the OS's
memory manager should do a pretty good job of keeping it available if it's 
being hit a lot (we've been using a similar approach in the output window for
years and you can have hundreds of Mb of text in it without any performance 
problems scrolling up and down - not exactly the same problem, but the point is
that NIO memory mapping works very well).
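
A rough sketch of the memory-mapping idea, under the same assumption about the
version-1 record layout (40-byte header, then state byte, four big-endian ints,
name bytes). Instead of pattern matching, this variant walks the
length-prefixed records, which is a swapped-in technique and not necessarily
what is being proposed here. MappedDirstate, map and stateOf are hypothetical
names, and paths are compared as raw ISO-8859-1 bytes by assumption.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Hypothetical helper: maps .hg/dirstate and looks up one path's state byte. */
final class MappedDirstate {

    /** Maps the whole file read-only; the mapping outlives the closed channel. */
    static ByteBuffer map(Path dirstate) throws IOException {
        try (FileChannel channel = FileChannel.open(dirstate, StandardOpenOption.READ)) {
            return channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
        }
    }

    /** Returns the recorded state ('n', 'a', 'r', 'm'), or '?' if the path has no entry. */
    static char stateOf(ByteBuffer map, String repoRelativePath) {
        byte[] wanted = repoRelativePath.getBytes(StandardCharsets.ISO_8859_1);
        int pos = 40; // records start after the two 20-byte parent hashes
        while (pos + 17 <= map.limit()) {
            int nameLength = map.getInt(pos + 13); // absolute read, big-endian
            int nameStart = pos + 17;
            if (nameLength < 0 || nameLength > map.limit() - nameStart) {
                break; // corrupt or truncated record: stop scanning
            }
            if (matches(map, nameStart, nameLength, wanted)) {
                return (char) (map.get(pos) & 0xFF);
            }
            pos = nameStart + nameLength;
        }
        return '?';
    }

    /** True if the record's path part (the bytes before any NUL) equals wanted. */
    private static boolean matches(ByteBuffer map, int start, int length, byte[] wanted) {
        if (length < wanted.length) {
            return false;
        }
        for (int i = 0; i < wanted.length; i++) {
            if (map.get(start + i) != wanted[i]) {
                return false;
            }
        }
        // Either the names have equal length, or a copy source follows after NUL.
        return length == wanted.length || map.get(start + wanted.length) == 0;
    }
}
```

A '?' result only means the path is absent from dirstate; whether the file is
new or ignored still has to be decided against .hgignore.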
------- Comment #12 From 2008-02-27 19:31:04 -------
This is becoming increasingly annoying. In practice, I run commits about half
the time from the shell, and do all other
write operations from the shell. Now, although I often have no outstanding
modifications, most of the projects I work on
permanently display modification badges.

If you can't do any better, at least discard caches after an IDE restart, i.e.
make them nonpersistent. For now I am
adding an entry to my NB wrapper script:

rm -rf $userdir/var/cache/mercurialcache
------- Comment #13 From 2008-10-30 20:13:59 -------
I just removed 2000 unneeded files that I had in my repo to speed up
compiling. After this NetBeans was pretty messed up. I had to update again
(using the command line) and the cache was really screwed up. The solution was
to remove it; however, it took me a while to realize what the problem was. The
previous poster was the one who helped me out.

I think this issue needs some more attention.
------- Comment #14 From 2008-11-01 22:54:05 -------
Any progress on this?  Specifically the suggestion of directly memory mapping
the dirstate file from hg (and not using
any other cache)?  It seems like an impedance mismatch to try to shoehorn
mercurial into the same shoe as CVS and SVN
(i.e. any general-purpose caching system [wasn't it called "turbo" when CVS
support was rewritten?] simply should not be
used for hg - if it is impossible to use the API for integrating VCS's without
getting caching, that should be fixed).

AFAIK this is a file which will only ever be *appended* to.  So testing whether
the IDE's UI is in sync with the
repository's metadata is as simple as checking the last-modified date against a
cached one.

The only objection I can see to doing this is:
> I'm reluctant to access the dirstate directly as this is an internal implementation detail of mercurial and not a
> published spec, so it could change from under us at any time.

This sort of thing should be fairly easy to detect and show an error message. 
Jesse, you're involved in hg, I think. 
Any idea of the likelihood of this?  Has it happened in the past?  Would you
find out ahead of time if the hg folks plan
to change it?
------- Comment #15 From 2008-11-03 19:13:26 -------
dirstate is not appended to; you are thinking of repository files.

I do not know of historical changes to the dirstate format, but I doubt these
would be more likely than other changes to
Hg that the module needs to track.
------- Comment #16 From 2009-07-06 23:08:12 -------
Created an attachment (id=84419) [details]
All-Java program to read .hg/dirstate
------- Comment #17 From 2009-07-06 23:17:08 -------
The attached program demonstrates that it is very easy & fast to parse
.hg/dirstate in Java. Printing the repo's status
a la 'hg st' given the combination of .hg/dirstate and the current disk stats
(as the program also does) is trickier,
though this problem may be more general than what the IDE needs for its
display. (The whole loop is roughly as fast as
the CPython-based Hg executable, and I have made no attempt to optimize it.)

The point to note is that you can cache the DirState object so long as
.hg/dirstate does not change on disk (i.e.
between non-r/o calls to Hg), and then very quickly get status for a
subdirectory of the repo without forking an Hg process.
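
The caching rule described here can be sketched roughly as follows. This is
not the attached program: DirstateCache is a hypothetical class, the parser is
passed in as a function so the sketch stays self-contained, and "has not
changed on disk" is approximated by comparing the file's last-modified time
and length, which is an assumption.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.function.Function;

/** Hypothetical cache: keeps a parsed dirstate until the file changes on disk. */
final class DirstateCache<T> {
    private final Path dirstate;
    private final Function<byte[], T> parser;
    private long cachedMtime = -1;
    private long cachedSize = -1;
    private T cached;

    DirstateCache(Path dirstate, Function<byte[], T> parser) {
        this.dirstate = dirstate;
        this.parser = parser;
    }

    /** Returns the parsed dirstate, re-reading only if mtime or length differ. */
    synchronized T get() throws IOException {
        BasicFileAttributes attrs = Files.readAttributes(dirstate, BasicFileAttributes.class);
        long mtime = attrs.lastModifiedTime().toMillis();
        long size = attrs.size();
        if (cached == null || mtime != cachedMtime || size != cachedSize) {
            // Assumption: a full re-read is cheap enough after each Hg write.
            cached = parser.apply(Files.readAllBytes(dirstate));
            cachedMtime = mtime;
            cachedSize = size;
        }
        return cached;
    }
}
```

Timestamp granularity is the obvious weak spot: two writes within one
filesystem tick that leave the length unchanged would go unnoticed.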
------- Comment #18 From 2009-07-07 00:46:28 -------
Created an attachment (id=84422) [details]
Java hg dirstate parser
------- Comment #19 From 2009-07-07 01:03:46 -------
I've attached a Java parser for hg dirstate files based on the format
documentation here:
http://mercurial.selenic.com/wiki/FileFormats#dirstate

You can try it just with java -jar DirstateParser.jar /path/to/.hg/dirstate

The lookup time for any FileEntry (basically a pointer to an offset in dirstate
that can parse its contents) on my machine is in the nanosecond range.  The
things it would need to make it production-worthy are few and doable:

1. Need to find out what encoding filenames in dirstate use - I'm assuming
US-ASCII, but it could be system-specific or UTF-8 - I didn't find any
documentation of that

2. Currently it memory-maps the entire dirstate file.  Probably it should:
 - Memory-map recently requested sections rather than the whole file
 - Start a timer whenever the file is opened, and close the mapping if unused
for a while

3. I've run into a few OOMEs running this over $NBSRC/.hg/dirstate.  Things to
deal with:
 - It's not documented whether MappedByteBuffer.slice() returns a
MappedByteBuffer, or whether it actually does a
heap-copy of the requested slice
 - Currently the byte buffers we hand back to FileEntry have a start point of
the offset of the FileEntry, but are as
long as the remaining length of the file - if we are getting heap-copies, this
is probably the source of the OOMEs. 
Dirstate.buf() could read the int at offset 15 to get the needed length and
only return a ByteBuffer of the necessary size.

4. Currently we re-map if the file's last-modified time changes.  Probably we
only need to remap if the *length* changes.

At any rate, it works, it's fast, and the design is solid.  It shouldn't take
too much adaptation to make it work as an always-accurate, pseudo-in-memory,
high-speed cache for hg metadata.
------- Comment #20 From 2009-07-07 01:17:08 -------
Great minds think alike :-)
------- Comment #21 From 2009-07-07 16:06:45 -------
Thanks for the wiki link. I just read mercurial/pure/parsers.py - the Python
code to read dirstate is only a few lines
long, so it was very easy to adapt to Java. BTW Tim you neglected to parse the
"copied from" field.

Tim's version is surely more memory-efficient, though even my simplistic impl
was very fast to scan for a clone of the NB main repo. The best impl would
probably use a memmap like Tim's, then create some kind of compact index (in
heap) in order to quickly look up an entry's offset by filename, perhaps using
CRC32 sums to save space. Such an index would just get recomputed in case the
timestamp changed. (Tim's impl is interesting but incorrect - it could be
fooled by other fields in the file which happen to look like ASCII chars, or
simply by "foo/bar" vs. "foo/bars" in the same repo. It also requires a linear
search over the file, which could be undesirable.)

As to #1, the answer is that the filename is encoded using whatever encoding
the platform decides to use for filenames,
i.e. new String(byte[]) is probably correct.

To #2, mapping recently-requested sections would probably not work as the
entries seem to be randomly mixed together.
You could close the file after a timeout if it seemed necessary.
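
One way the compact index could look, as a sketch only: a heap map from the
CRC32 of each path to the offsets of the records with that checksum, with the
candidate record's bytes compared on lookup so that checksum collisions and
prefix cases such as "foo/bar" vs. "foo/bars" cannot produce a false match.
DirstateIndex and its methods are hypothetical names; the record layout
(40-byte header, state byte, four big-endian ints, name) and the ISO-8859-1
byte handling are assumptions.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import java.util.zip.CRC32;

/** Hypothetical heap index: CRC32 of a path -> offsets of candidate records. */
final class DirstateIndex {
    private final ByteBuffer data;
    private final Map<Integer, int[]> offsetsByCrc = new HashMap<>();

    /** Scans the (mapped or wrapped) dirstate bytes once to build the index. */
    DirstateIndex(ByteBuffer data) {
        this.data = data;
        int pos = 40; // skip the two parent hashes
        while (pos + 17 <= data.limit()) {
            int nameLength = data.getInt(pos + 13);
            if (nameLength < 0 || nameLength > data.limit() - (pos + 17)) {
                break; // corrupt or truncated record
            }
            offsetsByCrc.merge(crcOf(pathAt(pos)), new int[] {pos}, DirstateIndex::concat);
            pos += 17 + nameLength;
        }
    }

    /** Returns the record offset for the path, or -1 if it has no entry. */
    int offsetOf(String path) {
        byte[] wanted = path.getBytes(StandardCharsets.ISO_8859_1);
        int[] candidates = offsetsByCrc.get(crcOf(wanted));
        if (candidates != null) {
            for (int pos : candidates) {
                if (Arrays.equals(wanted, pathAt(pos))) { // rule out collisions
                    return pos;
                }
            }
        }
        return -1;
    }

    /** Reads the state byte of the record starting at the given offset. */
    char stateAt(int offset) {
        return (char) (data.get(offset) & 0xFF);
    }

    /** Copies the path bytes of a record, dropping any "\0source" suffix. */
    private byte[] pathAt(int pos) {
        int nameLength = data.getInt(pos + 13);
        byte[] name = new byte[nameLength];
        for (int i = 0; i < nameLength; i++) {
            name[i] = data.get(pos + 17 + i);
        }
        for (int i = 0; i < nameLength; i++) {
            if (name[i] == 0) {
                return Arrays.copyOf(name, i);
            }
        }
        return name;
    }

    private static int crcOf(byte[] bytes) {
        CRC32 crc = new CRC32();
        crc.update(bytes);
        return (int) crc.getValue(); // CRC32 values fit in 32 bits
    }

    private static int[] concat(int[] a, int[] b) {
        int[] joined = Arrays.copyOf(a, a.length + b.length);
        System.arraycopy(b, 0, joined, a.length, b.length);
        return joined;
    }
}
```

Boxed Integer keys are used only for brevity; a primitive int-keyed table would
be closer to the "compact" goal.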
------- Comment #22 From 2009-07-08 20:12:40 -------
Thanks for the dirstate parsers.

Now back to the problems mentioned in this issue:
A) external changes aren't well reflected in the IDE
B) hg status is too slow - this was/is already addressed by several issues.
Please note that a fix was introduced just recently which should considerably
improve the status refresh caused by file change events.

A)
The proposed solution is to give up the existing hg st/NB cache
infrastructure and to base the status coloring and badging on .hg/dirstate.
As far as we understand the dirstate file, it holds an entry for each file
managed by Mercurial, together with its last known modification timestamp,
size, state etc. Those values are eventually used by the hg st command to
compute file status, and the idea suggested above is that NetBeans could do
the same instead of calling an hg shell command.

Unfortunately, there are a few more things involved in how file annotations
are computed than just finding out whether a particular file is modified,
deleted, added...

1) A clone might also contain files which aren't tracked by dirstate:
1a) Files with the status "? = not tracked" - rendered in the IDE as locally
new.
1b) Files with a conflict.
The question, in the case of a folder's icon annotation request, is how to find
out that there is such a file hidden somewhere deep in that folder's file
structure. Should we scan the whole folder tree and check each file against
dirstate (1a) or otherwise (1b) to learn whether it's 1a or 1b? This might
cause massive file I/O and, as already mentioned by msandor, a persistent cache
seems to make more sense than doing this on the fly.

1c) Ignored files - whether a file is ignored is given by .hgignore and by the
sharability query; dirstate isn't of much help here, and a persistent cache
also seems to make more sense at this point.

2) External modifications
2a) Externally modified, deleted or added files are usually recognized by the
IDE and properly rendered in the relevant views. The only gap we know about is
the case of files which hadn't been introduced to the IDE's filesystem by the
moment the user returns to the IDE. They therefore aren't refreshed by the
filesystem and no event is sent to the VCS modules. Again, dirstate doesn't
seem to be of much help in such a scenario. The Mercurial module won't be
notified about an externally changed file just by hanging on the dirstate file.
The trigger must come from somewhere else. Not sure how to solve this, and it
is a general problem in all VCS modules.

2b) External changes to a file's status without changing its contents and
modification timestamp - e.g. a commit won't change a file's modification
timestamp, so the IDE won't recognize such a file as externally changed, no
event is sent to the Mercurial module, and a wrong annotation is rendered in
the IDE. This seems to be the most prominent cause of the reported
coloring/badging problem.
Dirstate could help us out of this one (thanks for the tip!). An hg commit
changes the dirstate file's lastMod timestamp and the parent hashcodes, so this
could be the trigger to invoke a status refresh. No need to parse the whole
file. Then, as suggested by Jesse, even an hg st on the whole clone should
return fast enough. Calling the command only on the few relevant subtrees given
by the files known (cached) as modified at the moment would be even faster. It
will be a bit tricky, but a fix seems to be possible and will follow.
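
The trigger in 2b could be sketched like this: read only the first 40 bytes of
.hg/dirstate (assumed here to hold the two 20-byte parent hashes, as in the
version-1 layout) and compare them with the previously remembered value.
DirstateParents and its methods are hypothetical names, not part of the
Mercurial module.

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

/** Hypothetical helper: reads only the parent hashes at the start of dirstate. */
final class DirstateParents {

    /** Returns the first 40 bytes of the file: two 20-byte parent hashes. */
    static byte[] read(Path dirstate) throws IOException {
        byte[] parents = new byte[40];
        try (DataInputStream in = new DataInputStream(Files.newInputStream(dirstate))) {
            in.readFully(parents); // throws EOFException if the file is shorter
        }
        return parents;
    }

    /** Hex form of parent 0 or parent 1, for logging or display. */
    static String hex(byte[] parents, int index) {
        StringBuilder sb = new StringBuilder(40);
        for (int i = index * 20; i < index * 20 + 20; i++) {
            sb.append(String.format("%02x", parents[i] & 0xFF));
        }
        return sb.toString();
    }

    /** True if the parents differ, i.e. a status refresh should be scheduled. */
    static boolean changed(byte[] before, byte[] after) {
        return !Arrays.equals(before, after);
    }
}
```

Operations that rewrite dirstate without moving the parents (hg add, for
example) would not be caught by this check alone, so the lastMod timestamp
comparison is still needed alongside it.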

To sum up - the fix in 2b should solve the most critical and fixable part of
the reported problem. It also still seems to us that relying only on dirstate
wouldn't be a sufficient solution, and we still think we need a cache. Not to
mention the effort needed for a big-bang rewrite. There is a high chance that
bringing it to a mature production stage in all its details won't be any less
hairy than the current implementation, and the probable gain isn't that evident
yet.
------- Comment #23 From 2009-07-08 21:42:01 -------
To 1a - right; as the demo program shows, these can be identified by their
absence from both dirstate and .hgignore.

1b - not sure offhand, but I think dirstate has a status for these.

To status of folders - fine enough to remember status for a limited time, but
better to recalculate if in doubt. Better
indeed to not mark folders at all (or only to a limited depth), than to mark
them the way it is currently done, i.e.
often wrong.

1c - ignored files are indicated by .hgignore; no need for any caching.

The suggested changes in 2b would be helpful, I think. The current state is
especially painful for MQ users, since the
IDE's status is wrong more often than not.

I will continue to recommend that no cache be persisted to disk by the IDE. If
you really think you need cache
persistence (and I have been happier since I started deleting the cache before
every IDE restart), much better to read
dirstate which gets updated by actual Hg commands.
------- Comment #24 From 2009-07-09 04:25:15 -------
> 1c) ignored files - if a file is ignored is given by hgignore and by the sharability query, dirstate
> isn't of much help here and a persistent cache also seems to make more sense at this point.

For new files and ignored files: the regexp approach I took (wrap the
ByteBuffer of the whole file in a CharSequence implementation and run a regexp
with the literal flag over it) can, as Jesse rightly points out, be fooled by
moved files and missing subsequences. However, for files (not folders), it
should be a very fast way to determine if a file is unknown (i.e. should be
marked as new) or not;  then just check it against the expressions in
.hgignore (or do those steps in reverse order if the performance is better).
As I mentioned, the CharSequence over ByteBuffer w/ regexp approach performed
lookups consistently in < 1ms (on an AMD64 quad core machine with 4 GB - YMMV).

Re a persistent cache, Jesse mentioned the option of maintaining a map of SHA1
hash of filename -> entry offset in
dirstate.  Although if it could be made accurate, the regexp approach seems to
be extremely fast even over files as
large as $nbsrc/.hg/dirstate.

> The mercurial module won't be notified about an externally changed file just by hanging on the dirstate file.
No, you would want to compare dirstate's lastModified with the File's, but this
should be simple and fast.

> external changes to a file's status without changing its contents and modification timestamp

AFAIK Mercurial is smart about this - if you touch a file in mercurial, and
then commit, it should not show up in the
list of the commits - if that's what you mean.  

If you mean an external Mercurial commit changing the status of a file, that's
exactly where mapping dirstate should
help.  Cache the lastmodified time of dirstate itself.  Do it the same way we
refresh filesystems when the IDE regains
focus - if dirstate has changed, re-fetch the status from dirstate for each
live FileObject and see if it's changed, and
update the filesystem annotations appropriately.

> relying only on dirstate wouldn't be a sufficient solution and we still think we need a cache.
What is left that you would need to cache?

> effort needed for a big bang rewrite
I took a quick look (and I mean a quick look) at the Mercurial module sources,
and it looks as if most of the changes could be done in
o.n.m.mercurial.VersionsCache and o.n.m.mercurial.FileStatusCache - just
eliminate the usage of DiskMapTurboProvider in favor of direct use of dirstate.
It looks like all access goes through FileStatusCache, so you could probably
leave its signatures alone and end up with a fairly small changeset.  Like I
said, I only did a quick look and some usage searches, so I could be wrong
about that.
------- Comment #25 From 2009-07-09 04:38:07 -------
> To #2, mapping recently-requested sections would probably not work as the entries seem to be randomly mixed together.
> You could close the file after a timeout if it seemed necessary.

Well, it seems that the mapping is not the source of the OOMEs, so mapping the
whole file is probably harmless.  The
OS's memory manager should be smart enough not to hold the whole thing in RAM.

The only case where you need multiple mappings would be dirstate files with
size > Integer.MAX_VALUE (ByteBuffer is
indexed [unfortunately] in ints).  I doubt we will run into too many such
dirstate files, so my suggestion would be to
simply not implement support for files of such sizes - just say "I can't read
this" and have the HG filesystem do
nothing.  If there are real use cases for supporting it, we can address that
when there's a request for it (in that
case, I suggest just wrapping multiple mappings in a signature-clone of
ByteBuffer that is long-indexed, remove the
(int) casts and the rest of the code can remain unchanged).
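
The "just say I can't read this" option might be sketched as below.
DirstateMapping, fitsSingleMapping and tryMap are hypothetical names; the only
library fact relied on is that a single FileChannel.map call cannot cover more
than Integer.MAX_VALUE bytes.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Optional;

/** Hypothetical guard: map dirstate only if it fits one int-indexed buffer. */
final class DirstateMapping {

    /** A single mapping can cover at most Integer.MAX_VALUE bytes. */
    static boolean fitsSingleMapping(long size) {
        return size >= 0 && size <= Integer.MAX_VALUE;
    }

    /** Returns the mapping, or empty when the file is too large to support. */
    static Optional<ByteBuffer> tryMap(Path dirstate) throws IOException {
        try (FileChannel channel = FileChannel.open(dirstate, StandardOpenOption.READ)) {
            long size = channel.size();
            if (!fitsSingleMapping(size)) {
                return Optional.empty(); // caller shows no annotations at all
            }
            return Optional.<ByteBuffer>of(
                    channel.map(FileChannel.MapMode.READ_ONLY, 0, size));
        }
    }
}
```

An empty result leaves the decision to the caller, so long-indexed support over
several mappings could be added later without changing this signature.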