If I see this correctly:
Netbeans is still using the quite old Lucene 3.5.
Lucene 4.1 supports a compressed index, which would save a considerable amount of space for the maven index.
At the same time the amount of I/O might be reduced, so even though compression adds CPU overhead, the saved I/O could more than amortize the cost.
See e.g. http://blog.jpountz.net/post/33247161884/efficient-compressed-stored-fields-with-lucene
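To make the space argument concrete, here is a minimal sketch of compressing one chunk of repetitive stored-field data and comparing sizes. It is not Lucene code: DEFLATE from java.util.zip is used purely as a stand-in for whatever codec Lucene applies, and the class name, the record layout and the sample coordinates are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class StoredFieldCompressionSketch {

    // Compress one chunk of bytes with DEFLATE (a stand-in, not Lucene's codec).
    static byte[] compress(byte[] raw) {
        Deflater deflater = new Deflater(Deflater.BEST_SPEED);
        deflater.setInput(raw);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        while (!deflater.finished()) {
            int n = deflater.deflate(buffer);
            out.write(buffer, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    // Reverse of compress(); this is the CPU cost paid on every read.
    static byte[] decompress(byte[] compressed) {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buffer);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new IllegalArgumentException("truncated or invalid input");
                }
                out.write(buffer, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalArgumentException(e);
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }

    // Hypothetical records that mimic the repetitiveness of artifact coordinates.
    static byte[] sampleChunk(int records) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < records; i++) {
            sb.append("org.example.group|artifact-").append(i % 50)
              .append("|1.").append(i % 10).append(".0|jar\n");
        }
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] raw = sampleChunk(1000);
        byte[] packed = compress(raw);
        System.out.println("raw bytes: " + raw.length
                + ", compressed bytes: " + packed.length);
        System.out.println("round trip ok: " + Arrays.equals(raw, decompress(packed)));
    }
}
```

The fewer bytes that have to be read from disk, the more room there is to pay for decompress() on the CPU. Whether that trade actually pays off for the real maven index would have to be measured.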
I would assume that newer lucene versions are better optimized as well.
(I am not sure whether filing it for the maven project is correct, nor am I sure that Netbeans is really using the stone-aged lucene 3.5 / 3.6)
We're actually using 3.6.2 (shipping a separate Lucene with the indexer module). The reason is that we use the maven-indexer component/jar, which in turn uses this version of Lucene. Without the maven-indexer guys at Apache upgrading, we cannot upgrade either, I think.
I see. From my understanding the nexus index format is based on Lucene (the old one). I am therefore afraid that a change of the Lucene version in the indexer library itself is not possible, at least if the new format should be used.
The best alternative from my point of view is to use the new Lucene format and maintain a separate index inside Netbeans.
I did more background checking. It seems the new index format is in fact meant to be independent of the Lucene format, as described here:
So the options are to wait for a new maven-indexer release with a newer Lucene bundled, or Netbeans could use a newer Lucene with maven-indexer, provided the API did not break.
Btw, as part of another issue I've removed an OSGi-related processor that populated the Lucene documents with OSGi-related manifest entries. That effectively halved the size of the central index as a local Lucene index. The download size is unchanged, but the download is already compressed.
I added an issue on the maven indexer:
@everflux: nexus has its own format but also produces a Lucene index, which is the de facto standard. By the way, I think Lucene can upgrade indexes from previous versions.
AFAIK the ".zip" index is plain lucene (legacy version) index file and the ".gz" is a simple compressed text file format which is lucene agnostic. But when you use the ".gz" you have to build your lucene (or whatever you use for searching/indexing) index yourself.
IMHO this is happening in Netbeans, so I see multiple options to solve this
- get a new maven indexer release and use that (out of control of netbeans group)
- fork the maven indexer and upgrade it for newer Lucene + pull-request and use that in Netbeans (could be done by a NetCAT participant as well f.e.)
- Create an own Lucene index outside of the maven indexer, using latest Lucene version (this could lead to have two indices, one from the maven indexer, one from Netbeans, obviously not what we want if the goal is to save space, or we would need to delete the maven indexer index afterwards, but this would prevent incremental updates)
If someone would help me with the Netbeans integration part, I would volunteer to have a look into the maven-indexer and see if I can get it to work with newer Lucene smoothly.
(In reply to everflux from comment #6)
> AFAIK the ".zip" index is plain lucene (legacy version) index file and the
> ".gz" is a simple compressed text file format which is lucene agnostic. But
> when you use the ".gz" you have to build your lucene (or whatever you use
> for searching/indexing) index yourself.
right, but the zip (legacy) content is not really present at many remote locations anymore.
> IMHO this is happening in Netbeans, so I see multiple options to solve this
> - get a new maven indexer release and use that (out of control of netbeans
Preferable option. Please note that any external binary change needs to be approved by Oracle legal, so it's not an option for 8.0 anymore. (Yes, it takes a while, unfortunately.)
> - fork the maven indexer and upgrade it for newer Lucene + pull-request and
> use that in Netbeans (could be done by a NetCAT participant as well f.e.)
-1, forks have a maintenance price tag attached.
> - Create an own Lucene index outside of the maven indexer, using latest
> Lucene version (this could lead to have two indices, one from the maven
> indexer, one from Netbeans, obviously not what we want if the goal is to
> save space, or we would need to delete the maven indexer index afterwards,
> but this would prevent incremental updates)
-1 for the same reasons.
> If someone would help me with the Netbeans integration part, I would
> volunteer to have a look into the maven-indexer and see if I can get it to
> work with newer Lucene smoothly.
Sure, feel free to ask in this issue or me directly (firstname.lastname@example.org)
Created attachment 145169
Reflect lucene API changes
This patch is required due to Lucene API changes: "indexExists" has moved to DirectoryReader.
Upstream has merged my patch; it will be released with maven indexer 6.0.
Not sure about the process: leave this issue open to track the dependency upgrade, or close it and open a separate issue to update the Netbeans dependencies for the next release?
What happened to this issue? Was anything ever integrated into a release?
My changes were never merged into NB, unfortunately. Upstream (maven indexer) has not released a version with my changes (yet).