20168 – More effective storage of XML layers

Status:	VERIFIED FIXED

Alias:	None

Product:	platform
Classification:	Unclassified
Component:	Module System
Version:	3.x
Hardware:	PC Linux

Importance:	P3 blocker
Assignee:	Jesse Glick

URL:
Keywords:	PERFORMANCE

Depends on:	20628 20997 21036 21153
Blocks:	17722

Reported:	2002-02-05 15:28 UTC by Jaroslav Tulach
Modified:	2008-12-23 08:33 UTC
CC List:	3 users

See Also:
Issue Type:	ENHANCEMENT
Exception Reporter:

Attachments
The patch that demonstrates the 9% speed up (15.98 KB, patch) 2002-02-05 15:36 UTC, Jaroslav Tulach
A testing patch against XmlFS which adds formatted save (4.79 KB, patch) 2002-02-20 10:56 UTC, Petr Nejedly
Proposed patch (23.92 KB, patch) 2002-02-22 12:04 UTC, Jesse Glick

Description Jaroslav Tulach 2002-02-05 15:28:00 UTC

The goal is to not compute the merge of all XML layers at startup and not keep
the computed content in memory.

Comment 1 Jaroslav Tulach 2002-02-05 15:36:25 UTC

I implemented code that copies the content of the XML layers to the local
filesystem and received a 9% startup time improvement. I have not
measured the improvement in memory consumption.

Startup without the patch took 20.2s on my computer.
Startup with the patch took 18.4s on my computer.

Comment 2 Jaroslav Tulach 2002-02-05 15:36:35 UTC

Created attachment 4573
The patch that demonstrates the 9% speed up

Comment 3 Jaroslav Tulach 2002-02-05 15:39:41 UTC

To try the patch, start the IDE with these plain switches:

-J-Dusecache=false -J-Dorg.netbeans.log.startup=print

this should measure the time without any caching. Then start it with

-J-Dusecache=true -J-Dorg.netbeans.log.startup=print

this will create the cache in the /tmp/NB-CACHE directory and also take
longer; I have not tried to optimize it. Then start it once more with

-J-Dusecache=true -J-Dorg.netbeans.log.startup=print

and you should see the performance improvement.
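
For readers unfamiliar with the launcher syntax: the `-J` prefix hands the rest of the option to the JVM, so `-J-Dusecache=true` sets the system property `usecache`. A minimal sketch of how code could branch on such a switch is shown here. The class name is hypothetical, and how the attached patch actually reads the property is an assumption.

```java
// Hypothetical helper; the attached patch may read the switch differently.
class CacheSwitch {
    /** Returns true only when the JVM was started with -Dusecache=true. */
    static boolean cacheEnabled() {
        // Boolean.getBoolean looks up a system property and parses it as a boolean.
        return Boolean.getBoolean("usecache");
    }

    public static void main(String[] args) {
        System.out.println("layer cache enabled: " + cacheEnabled());
    }
}
```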

Comment 4 Jaroslav Tulach 2002-02-05 15:56:36 UTC

To implement the cache copying in a better way, it is necessary to
copy not the content of files in XMLFS but just their URLs, as described in
issue 20170.

Comment 5 _ ttran 2002-02-06 17:29:56 UTC

Folks I am tired of adding PERFORMANCE keywords for you :-)

Comment 6 Jesse Glick 2002-02-07 12:14:06 UTC

Why do you use a LocalFileSystem? I am not sure that is more efficient
on all platforms than a premerged XML file. LFS means more polling,
and some platforms are slow to access many files. Re. keeping the
whole content in memory: with LFS you will have softly held
FileObject's anyway. I suggest to keep an XML cache and to work on
optimizing XMLFileSystem for memory usage.

I assume the patch is supposed to be an experiment, not something to
really apply as is.

The patch does not seem to handle module
installation/uninstallation/upgrading, does it?

Comment 7 Petr Nejedly 2002-02-08 09:14:45 UTC

No, it was just an experiment to see whether *any* other storage can speed
it up. The problem with XMLFS is that it parses everything at once
although not needed yet (maybe 15% of the XMLFS content is used
during the startup?).
The other part of the problem is that it causes changes in SFS
during the startup and they are very ineffective in layered MFS.
(2/5 of the setXmlUrls time is parsing, 3/5 is the change propagation)
Having the XMLFS with the "right" content from the beginning
is itself quite an improvement.

Re. memory usage: It is not that bad (~300Kb
last time I've checked) but it grows as we're adding more modules.

Re. premerged XML file: Its performance will be comparable
to separate XML files, as most of the time is spent elsewhere
than changing the stream during parsing.

Comment 8 Jesse Glick 2002-02-08 12:06:53 UTC

OK, so it sounds like a LocalFileSystem with polling disabled will be
faster to load than a premerged XMLFileSystem, though consumes more
disk space and probably causes heavier OS-level disk usage. (Ideal
solution would be a compact binary format I guess.)

Next question: as far as change propagation goes, how does adding an
LFS to the MFS stack differ from adding an XMLFS to it? Seems to me
like it would be exactly the same situation. During startup, all
module layers are collected into one XMLFS and it is added to the
SystemFileSystem as a layer in one operation, right? With a cache, you
would add one LFS at the same time instead.

For handling module installation, uninstallation, and other changes, I
think we would need to annotate the cache dir with information about
what layers are installed, and what version of each, so we know when
to recreate the cache. Either module spec version (fast to compute but
not robust) or CRC of layer XML (slower to compute but much safer).
Maybe module spec version + layer file size would be reasonably
effective and fast.
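
The two cache-key options weighed above could look roughly like this. It is only a sketch: the class, method names and key format are hypothetical, and it assumes the layer XML is already available as a byte array.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Hypothetical sketch of the cache-key options; not code from any patch.
class LayerCacheKey {
    /** Cheap key: module spec version plus layer file size (fast, less robust). */
    static String cheapKey(String codeName, String specVersion, long layerSize) {
        return codeName + "/" + specVersion + "/" + layerSize;
    }

    /** Safer key: CRC-32 of the layer XML content (requires reading the whole layer). */
    static String crcKey(String codeName, byte[] layerXml) {
        CRC32 crc = new CRC32();
        crc.update(layerXml);
        return codeName + "/" + Long.toHexString(crc.getValue());
    }

    public static void main(String[] args) {
        byte[] xml = "<filesystem/>".getBytes(StandardCharsets.UTF_8);
        System.out.println(cheapKey("org.example.mod", "1.0", xml.length));
        System.out.println(crcKey("org.example.mod", xml));
    }
}
```

The cheap key cannot tell apart two layers of equal size under the same spec version, which is the robustness gap mentioned above; the CRC key can, at the price of reading every layer.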

Comment 9 Jaroslav Tulach 2002-02-08 12:47:57 UTC

1. Hrebejk suggested to use MDR's btree for the compact binary format.
2. Yes, this is only a demo. Real implementation needs to take care of
synchronization of cache & real data (time stamps & module state seem
enough for me).

Comment 10 Jesse Glick 2002-02-08 13:35:57 UTC

True, module JAR timestamp might be enough to serve as the layer
version component of the cache key, I did not think of that.

Reusing MDR's btree or in fact MDR sounds technically good, but I know
little about it, and of course it is problematic to reuse parts of MDR
in the core when the module is not even in standard builds yet. I
guess the btree lib could be copied to core, but that is a bit messy.

Comment 11 Jaroslav Tulach 2002-02-08 14:13:41 UTC

We can create an interface CacheLayers in core and then create a new module
that will implement it using MDR's btree. So everything works without
btree but will work much better with it. Moreover, the core will not
depend on an experimental module.
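
A rough sketch of that arrangement, for illustration only. The interface name comes from the comment above; its methods, the in-memory default and the chooser class are all assumptions.

```java
import java.util.HashMap;
import java.util.Map;

// "CacheLayers" is the name proposed above; the methods are hypothetical.
interface CacheLayers {
    void put(String resourcePath, byte[] content);
    byte[] get(String resourcePath); // null when the path is not cached
}

/** Default living in core: a plain in-memory map, no extra module required. */
class InMemoryCacheLayers implements CacheLayers {
    private final Map<String, byte[]> entries = new HashMap<>();

    @Override public void put(String resourcePath, byte[] content) {
        entries.put(resourcePath, content.clone());
    }

    @Override public byte[] get(String resourcePath) {
        byte[] content = entries.get(resourcePath);
        return content == null ? null : content.clone();
    }
}

class CacheLayersChooser {
    /** Prefers an implementation supplied by an optional module (e.g. btree-backed). */
    static CacheLayers choose(CacheLayers fromOptionalModule) {
        return fromOptionalModule != null ? fromOptionalModule : new InMemoryCacheLayers();
    }
}
```

With this shape the core compiles and runs against the interface alone, and a separate module can supply a disk-based implementation without the core depending on it.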

Comment 12 Petr Hrebejk 2002-02-08 14:31:42 UTC

I think Jarda's idea should be easy. When we were developing the MDR
we didn't have the btree implemented. So we designed a simple API
which was first implemented in memory using Hashtables and then we
just moved to the disk-based btree. The API is those few classes in
org.netbeans.mdr.persistence.

Comment 13 _ ttran 2002-02-08 16:13:49 UTC

> little about it, and of course it is problematic to reuse parts of MDR
> in the core when the module is not even in standard builds yet. I
> guess the btree lib could be copied to core, but that is a bit messy.

can someone lobby the mdr team to factor out their btree code into a
standalone library packaged under org.netbeans.lib.btree?  The lib
can live under mdr or under core or under btree.netbeans.org.  I think
it would increase the reuse opportunity for btree.  I suspect it would
be very easy, a small perl script will do.  The question of course is
if it has a sensible API or not.

[I do not say that core will use btree for caching purposes (yet)]

Comment 14 Petr Nejedly 2002-02-11 14:47:09 UTC

FYI: I've done some additional measurements on the XMLFS in SFS:
Parsing of all standard module layers: ~950ms
explicit refresh + firing changes: ~2050ms
Parsing of premerged layer: ~650ms
above w/o SFS.[loc*|icon]: ~420ms
above w/o ordering attrs: ~350ms

I don't have numbers for separate XMLs as it would be a lot of work
to apply the filters on all of them.

When I tried refresh+firing from a XMLFS plugged into a fully
populated MFS (new MFS(fullXFS,newXFS); newXFS.setXmlUrls()),
it took ~4000ms

Yarda's hack speeds it up because it places the LFS into
the SFS right from the beginning, so the refresh+firing
step is not there. IMHO we should try to "repair" the MFS
behaviour and then probably use the premerged layer file
(which can be there from the beginning as well, if we check
it properly).


BTW: I've generated the premerged layer by a hack inside the XMLFS
implementation. Similar scheme could be used for creating
the real merged layer.

Comment 15 Jesse Glick 2002-02-11 19:46:42 UTC

Petr's suggestion makes sense to me. However I'm not sure how easy
it's going to be to get the XMLFS in the SFS layer stack right at
startup: currently just finding out what modules are enabled requires
having a SFS to inspect (Modules/ folder). Probably this can be hacked
around: make a temporary SFS with just Modules/, then throw it out and
create a new one with the premerged layer all ready (assuming we have
a cache hit).

Sounds like some profiling & optimizations in MFS will be useful no
matter how we do this.

The 650ms -> 350ms differential we can make up with other enhancements
I think, which don't depend on this impl. Both possibilities of
removing SFS.* attrs and removing most explicit ordering attrs have
been separately filed. I think some backdoor cooperation between
XMLFileSystem and FolderList could also help reduce overhead of
ordering attributes.

Merging in the event of a cache miss might be effectively done by
direct parsing of the constituent layers. This would remove the
XMLFileSystem dependencies, and might anyway be faster & more reliable.

Comment 16 rmatous 2002-02-14 09:21:43 UTC

Look at RFE http://www.netbeans.org/issues/show_bug.cgi?id=20528, which
should make it possible to avoid the updateAll call in reaction to
fileDataCreated/fileFolderCreated. Then only fileDeleted must be handled using
updateAll, because fileDeleted is fired only for the FileObject that was
deleted but not for its children in the hierarchy.

Comment 17 Jesse Glick 2002-02-19 16:02:14 UTC

I'm starting to work on this. Proposed mechanism (please tell me if
this sounds stupid):

1. ModuleLayeredFileSystem will always have two delegates, #0 the
writable and #1 an XMLFileSystem.

2. With caching off, behavior as now: XML delegate initially empty,
and setURLs calls setXmlUrls on the XML delegate, done.

3. With caching on, a directory $userdir/cache/ or similar is
selected. Better than /tmp I think, since the cache is more or less
specific to the userdir's state.

3a. In the cache dir, a preparsed merged layer is kept, named layers.xml.
The initial XML delegate is loaded from this file.

3b. The cache dir also has a stamp.txt containing all the XML URLs
(jar:file:/..../foo.jar!/.../layer.xml) and also timestamps of the
associated JAR files. Or, just a hash (e.g. MD5) of the same. The hash
is a bit more compact, but the full data should be easier to inspect
when debugging.

3c. When setURLs is called, the new hash is computed.

3c1. If the new hash is in fact the one already loaded, nothing
happens. This should be the usual case.

3c2. If the stamp differs from the loaded one, layers.xml is
regenerated from the XML layers, setXmlUrls is called on the result
(triggering a slow refresh) and a new stamp.txt is written out.

3c3. If it turns out to be a problem to regenerate the cache whenever
modules are enabled or disabled - e.g. due to session support or
similar - it should be possible to keep multiple cache files, so that
an older cache can be returned to, and expire old cache files
according to an LRU mechanism. I don't see a particular need for this
now however.
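
Steps 3b to 3c2 can be sketched as follows. This is an illustration under assumptions, not the committed code: the class and method names are hypothetical, JAR timestamps are passed in as plain numbers, and sorting the entries by URL is an added assumption to make the stamp independent of the order in which layers are reported.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of the stamp check; names are illustrative only.
class LayerStamp {
    /** Builds the stamp text: one "url timestamp" line per layer, sorted by URL. */
    static String stampText(Map<String, Long> jarTimestampByLayerUrl) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, Long> e : new TreeMap<>(jarTimestampByLayerUrl).entrySet()) {
            sb.append(e.getKey()).append(' ').append(e.getValue()).append('\n');
        }
        return sb.toString();
    }

    /** Compact alternative: MD5 of the stamp text as 32 hex digits. */
    static String md5(String stampText) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(stampText.getBytes(StandardCharsets.UTF_8));
            return String.format("%032x", new BigInteger(1, digest));
        } catch (NoSuchAlgorithmException ex) {
            throw new IllegalStateException(ex); // MD5 is a required algorithm on the Java platform
        }
    }

    /** Cache hit when the stamp stored next to the cache equals the new one. */
    static boolean isCacheHit(String storedStamp, String newStamp) {
        return storedStamp != null && storedStamp.equals(newStamp);
    }
}
```

On a hit nothing needs to be reparsed; on a miss the caller would regenerate the merged layer and write the new stamp, as in step 3c2.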

The trick, if I understand correctly, is that most of the time (user
has not turned modules on or off etc.) the correct XML delegate will
already be there at startup time and no refresh will occur. The fact
that the system file system is full of module-supplied stuff before
you have actually turned on any modules should not much matter; lookup
has not been initialized, so no one is really inspecting the SFS much
yet, and since core is in the classpath and has a layer, it is
guaranteed that setURLs will be called pretty quickly.

The cache generation can I think be done by parsing the constituent
layers using SAX and generating an XML output tree. <file>s originally
having literal CDATA content will continue to; empty files will
continue to be empty; files with url= will get url= contents, though
the URL will be adjusted to not be relative. <attr>s will be copied
exactly as they appeared in the original XML. Such a merge process
should not be difficult to implement, should be faster than creating a
filesystem and doing FileObject operations on it, and should not
require any changes to openide code - so no dependency on issue #20169
nor on issue #20170.
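
One part of the merge described above, making relative url= values absolute while reading a layer with SAX, might look like this. It is a partial sketch under assumptions: the class is hypothetical, only <folder> and <file> elements are handled, and <attr> elements and inline CDATA content are ignored. It relies on `java.net.URL(URL, String)` to resolve against `jar:` locations; that constructor is deprecated in recent JDKs but still available.

```java
import java.io.StringReader;
import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch: collects <file> entries of one layer and absolutizes url=.
class LayerUrlResolver extends DefaultHandler {
    private final URL layerLocation;                        // where this layer.xml lives
    private final Deque<String> path = new ArrayDeque<>();  // current folder nesting
    final Map<String, String> fileUrls = new LinkedHashMap<>(); // file path -> absolute URL, "" if none

    LayerUrlResolver(URL layerLocation) {
        this.layerLocation = layerLocation;
    }

    @Override
    public void startElement(String uri, String local, String qName, Attributes attrs) throws SAXException {
        if (qName.equals("folder") || qName.equals("file")) {
            path.addLast(attrs.getValue("name"));
        }
        if (qName.equals("file")) {
            String url = attrs.getValue("url");
            try {
                // Resolve a relative url= against the location of the layer itself.
                String absolute = url == null ? "" : new URL(layerLocation, url).toString();
                fileUrls.put(String.join("/", path), absolute);
            } catch (MalformedURLException ex) {
                throw new SAXException(ex);
            }
        }
    }

    @Override
    public void endElement(String uri, String local, String qName) {
        if (qName.equals("folder") || qName.equals("file")) {
            path.removeLast();
        }
    }

    static Map<String, String> parse(String layerXml, URL layerLocation) throws Exception {
        LayerUrlResolver handler = new LayerUrlResolver(layerLocation);
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(layerXml)), handler);
        return handler.fileUrls;
    }
}
```

A real merger would run this over every constituent layer and write the combined tree back out as XML; that part is omitted here.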

One question I have for people: is there any particular reason why
NbInstaller dumps the layers for $userdir/modules/*.jar into the user
ModuleLayeredFileSystem? Why can't *all* module layers go into the
"installation layer"? I don't see the purpose of separating them;
functionally, user-installed modules should behave the same as
central-installed modules, I think.

Comment 18 Petr Nejedly 2002-02-19 16:21:27 UTC

I have a working XFS merger as part of XMLFS.setXmlUrls.
There is no need for manual creation of the cache by a special parser;
it should suffice to dump the state of ResourceElems
just after the standard cacheless parse.
Tomorrow I'll attach a diff against XMLFS.

Comment 19 Jaroslav Tulach 2002-02-19 16:31:41 UTC

Ad. implementation: Jesse, please try to separate the module's API
from the actual cache implementation by creating an interface that will
receive URLs of layers and will return/generate a cache filesystem. I
still do not want to give up the idea of btree storage - so please
create an interface for communication between SystemFileSystem and the
cache.

Ad. "Why can't *all* module layers go into the installation layer?"
Because it is readonly?

Comment 20 Petr Nejedly 2002-02-20 10:56:47 UTC

Created attachment 4758
A testing patch against XmlFS which adds formatted save

Comment 21 Jesse Glick 2002-02-20 13:04:52 UTC

Thanks Petr for merging code, will try to use it.

Re. separation of cache impl via interface: will try, but consider
that a secondary priority; can always rewrite later. We hardly need
this to be pluggable; the best option should be selected and used.

Re. installation layer being read-only: yeah; so what? All module
layers are read-only. My point is that currently we have (ignoring
projects module) an installation layer (r/o; $nbhome/system/ + layers
from $nbhome/modules/*.jar + layers from $nbhome/lib/*.jar) and a user
layer (r/w; $nbuser/system/ + layers from $nbuser/modules/*.jar +
layers from other *.jar). Why not change this to inst. layer (r/o;
$nbhome/system/ + layers from all *.jar) and user layer (r/w;
$nbuser/system/)? Why do certain module layers have to be in the user
ModuleLayeredFileSystem and not others? Makes no sense to me.

Comment 22 Jaroslav Tulach 2002-02-20 13:28:45 UTC

Layers: Currently the administrator may modify the shared installation and
those changes can be overridden by user modules and the user system dir. To
make it easier to satisfy this goal the installation and the user
layers are separated (including module jars).

Comment 23 Jesse Glick 2002-02-22 12:04:48 UTC

Created attachment 4793
Proposed patch

Comment 24 Jesse Glick 2002-02-22 12:13:46 UTC

See the latest patch - incorporates pieces of both Yarda's and Petr's
patches, as well as some new stuff and corrections. It seems to be
working pretty well; run with -J-Dnetbeans.cache.layers=true and
preferably also -J-Dorg.netbeans.core.projects=0 and perhaps
-J-Dnetbeans.cache.layers.prettyprint=true
-J-Dorg.netbeans.core.modules=0. Cache hit/miss logic is implemented
though not completely tested yet; at least enabling/disabling modules
works as usual.

(Note: you need to first correct a syntax error in the form layer:
gratuitous url= for a <file> with CDATA contents. Also due to a bug in
NbErrorManager which I will correct, [org.netbeans.core.projects] log
messages do not reliably appear in the log file.)

There are a few more things I want to do to it before committing,
including making exact measurements of what the improvement is.

Note to Yarda: there is not an interface for plugging in a different
cache impl, mainly because the current integration is dependent on
using XMLFileSystem as the second delegate. Choosing a different cache
mechanism should not be too hard but you would still need to redesign
the initialization of ModuleLayeredFileSystem a bit, I think. Anyway
in agreement with Petr's measurements, I see cache-load times around
1100ms for stable-with-apisupport (I assume Petr's 950 was from stable
config), which doesn't seem unreasonable considering the number of
files and attributes in the merged layer (~ 400K of XML without
whitespace). If we later want to reduce this time (and heap
consumption) a btree or similar impl should work, but of course then
core would have to do the merging work, not XMLFileSystem.

Comment 25 Jesse Glick 2002-02-24 18:45:05 UTC

Startup time improvement: Using moduleconfig 'stable', JDK 1.4.0,
Linux 2.2.12 on Toshiba Tecra 8000 w/ 256MB RAM, running X and Emacs
and Bash only. Methodology: create two fresh user dirs (one for no
cache, one for cache). Go through four priming runs and then ten
measured runs, each consisting of a startup (netbeans.close=true) with
caching off, then one with caching on (interleaved). First priming run
creates cache, of course; others just compensate for disk cache
differences. Avg. time w/o cache: 25.2sec; w/ cache: 23.8sec; delta:
1.4sec = 5.5% (percentage would be higher for JDK 1.3.1_02 because
total startup time is more like 20sec).

I am changing tomcatint/tomcat32/manifest.mf to use
[org.apache.tomcat.core.Response] rather than
[org.apache.tomcat.core.Context]. For some reason I have not been able
to determine, when the cache is in effect (but only then), the package
dependency check on Context takes some 375msec (!) rather than the
usual 25msec or so. [Observed with JDK 1.3.1_02, moduleconfig
stable-with-apisupport.] There is no such differential for Response,
an interface in the same package. No other module's package dependency
check is similarly affected. -J-verbose:class and grepping for tomcat or
jasper shows that only three or four such classes are being loaded,
whether or not the cache is on, and Context is always loaded anyway
because something else needs it. I have no idea why the time required
to do a dependency check would be affected by presence of the cache;
the only thing different when the cache is on is that the
SystemFileSystem from early on has a big XMLFileSystem in it rather
than an empty XMLFileSystem (with no URLs). Someone else can
investigate this, maybe.

Comment 26 Jesse Glick 2002-02-25 12:13:08 UTC

Done.

committed     Up-To-Date  1.43        form/src/org/netbeans/modules/form/resources/layer.xml
committed     Up-To-Date  1.5         tomcatint/tomcat32/manifest.mf

committed   * Up-To-Date  1.13        core/src/org/netbeans/core/projects/ModuleLayeredFileSystem.java
committed   * Up-To-Date  1.25        core/src/org/netbeans/core/projects/SystemFileSystem.java
committed   * Up-To-Date  1.53        openide/src/org/openide/filesystems/XMLFileSystem.java

Comment 27 Petr Nejedly 2002-02-25 13:00:39 UTC

Try -J-verbose:gc
The VM does two full-gc cycles during the startup
for no apparent reason but mostly in the same place each time,
if you keep the memory activity similar.
You've changed the memory activity which may have moved the full gc
to different place where you may not notice it.

Comment 28 Jesse Glick 2002-02-25 13:22:19 UTC

Ah, that makes sense re. GC timing. Didn't think of checking that.

Anyway, I still seem to be missing a few hundred milliseconds
somewhere, haven't figured out where yet. I.e. when you compare times
to call setURLs without cache, vs. time to read cache (~1000msec) plus
checking timestamps and misc (~75msec), the differential seems to be
somewhat more than the actual time improvement. So I am guessing that
something else is being slowed down by a few hundred msec, but not
sure what.

Comment 29 Petr Nejedly 2002-02-25 14:08:58 UTC

With the original way of parsing, MFS created the full tree of SFS;
now it does it incrementally and only the used part of the SFS is
reflected in the MFS structures. Probably part of the time saved
is re-spent during SFS lookups.

Comment 30 David Simonek 2002-02-27 16:28:52 UTC

I'm reporting problems with the cache:
if you break the IDE session by pressing Ctrl+Break, the cache is left
inconsistent and the IDE will not start anymore until you clean the cache
manually. This should be more robust IMO.
----------------------------
java.io.IOException: org.xml.sax.SAXException: Premature end of input.
: file:/c:/Netbeans/Configs/MainTrunk/cache/all-layers.xml
        at org.netbeans.core.projects.SystemFileSystem.create(SystemFileSystem.java:404)
        at org.netbeans.core.projects.SessionManager.create(SessionManager.java:72)
        at org.netbeans.core.NonGui.createDefaultFileSystem(NonGui.java:233)
        at org.netbeans.core.NbTopManager.getRepository(NbTopManager.java:316)
        at org.netbeans.core.NonGui.run(NonGui.java:447)
        at org.netbeans.core.Main.run(Main.java:213)
        at org.openide.TopManager.initializeTopManager(TopManager.java:120)
        at org.openide.TopManager.getDefault(TopManager.java:81)
        at org.netbeans.core.Main.main(Main.java:346)
        at org.netbeans.core.TopThreadGroup.run(TopThreadGroup.java:81)
        at java.lang.Thread.run(Thread.java:484)
Cannot add System filesystem: c:\Netbeans\Configs\MainTrunk\system,
exiting...
Press any key to continue . . .
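
One common way to make such a cache tolerate an interrupted session is to write it to a temporary file and rename it into place, and to discard a cache that fails to parse instead of aborting startup. The sketch below illustrates that idea only; it is not the fix applied in NetBeans, and the class and method names are hypothetical.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.AtomicMoveNotSupportedException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch; not the fix that was applied in NetBeans.
class SafeLayerCache {
    /** Writes to a sibling temp file first, then renames it over the cache file. */
    static void write(Path cacheFile, String xml) throws IOException {
        Path tmp = cacheFile.resolveSibling(cacheFile.getFileName() + ".tmp");
        Files.write(tmp, xml.getBytes(StandardCharsets.UTF_8));
        try {
            // Whether an existing target is replaced by an atomic move is platform-specific.
            Files.move(tmp, cacheFile, StandardCopyOption.ATOMIC_MOVE);
        } catch (AtomicMoveNotSupportedException ex) {
            Files.move(tmp, cacheFile, StandardCopyOption.REPLACE_EXISTING);
        }
    }

    /** Returns true if the cache parses as XML; otherwise deletes it so it can be rebuilt. */
    static boolean usable(Path cacheFile) throws IOException {
        if (!Files.isRegularFile(cacheFile)) {
            return false;
        }
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(cacheFile.toFile(), new DefaultHandler());
            return true;
        } catch (Exception ex) { // truncated or otherwise unreadable cache
            Files.deleteIfExists(cacheFile);
            return false;
        }
    }
}
```

A caller would check `usable(...)` for a file such as cache/all-layers.xml before loading it, and fall back to the uncached merge (followed by `write(...)`) when it returns false.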

Comment 31 Petr Nejedly 2003-07-09 13:02:32 UTC

It worked correctly until it was replaced by the binary cache ;-)