117450 – Provide unified LexerInput across multiple joined embedded sections

Summary: Provide unified LexerInput across multiple joined embedded sections

Status:	RESOLVED FIXED

Alias:	None

Product:	editor
Classification:	Unclassified
Component:	Lexer
Version:	6.x
Hardware:	PC All

Importance:	P2 blocker
Assignee:	apireviews

URL:
Keywords:	API, API_REVIEW_FAST, PERFORMANCE

Duplicates (1):	118892
Depends on:
Blocks:	133906 121881 131357

Reported:	2007-10-02 13:48 UTC by Miloslav Metelka
Modified:	2008-06-16 22:03 UTC
CC List:	3 users

See Also:
Issue Type:	ENHANCEMENT
Exception Reporter:

Attachments
Diff of the proposed change ~660kB (657.90 KB, patch) 2008-05-28 14:05 UTC, Miloslav Metelka

Description Miloslav Metelka 2007-10-02 13:48:29 UTC

The LexerInput for the embedded lexer should contain characters from all the joined embedded sections.
This is a continuation of the requests from issue 87014. The advantage will be that the embedded lexer will not need to
be sensitive to "soft-EOFs" at the boundaries of the embedded sections, so its internal state machine can be much
simpler. The lexer infrastructure will perform the token splitting (so that the token hierarchy remains a tree of tokens).
This request is also a must-have if there are tokens with larger lookaheads.

It's not yet clear whether some of the lexers would benefit from being aware of the soft-EOFs. By default we will not
add any API for that as it should be no problem to add it later.
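
The idea above (one continuous input for the lexer, with the infrastructure splitting tokens at section boundaries) can
be sketched roughly as follows. This is only an illustration under stated assumptions: JoinedInputSketch, TokenPart,
join() and split() are hypothetical names and are not part of the lexer module API.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch; none of these names come from the NetBeans lexer module. */
final class JoinedInputSketch {

    /** The piece of a token that lies within one section. */
    static final class TokenPart {
        final int sectionIndex;
        final int startInSection;
        final int length;

        TokenPart(int sectionIndex, int startInSection, int length) {
            this.sectionIndex = sectionIndex;
            this.startInSection = startInSection;
            this.length = length;
        }
    }

    /** Concatenates all sections so that a lexer sees one continuous input. */
    static String join(List<String> sections) {
        return String.join("", sections);
    }

    /**
     * Splits a token, given by its offsets in the joined text, into parts
     * so that no part crosses a section boundary.
     */
    static List<TokenPart> split(List<String> sections, int tokenStart, int tokenEnd) {
        List<TokenPart> parts = new ArrayList<>();
        int sectionStart = 0; // offset of the current section in the joined text
        for (int i = 0; i < sections.size(); i++) {
            int sectionEnd = sectionStart + sections.get(i).length();
            int from = Math.max(tokenStart, sectionStart);
            int to = Math.min(tokenEnd, sectionEnd);
            if (from < to) { // the token overlaps this section
                parts.add(new TokenPart(i, from - sectionStart, to - from));
            }
            sectionStart = sectionEnd;
        }
        return parts;
    }
}
```

With the sections "<b>he" and "llo</b>", a lexer working on the joined text can produce a single "hello" token without
knowing about the boundary, and split() then yields two parts, one per section.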

Comment 1 Miloslav Metelka 2007-10-02 13:50:26 UTC

This should certainly be done for NB7.0 since there are features that depend on this.

Comment 2 Jan Jancura 2007-10-16 13:09:30 UTC

*** Issue 118892 has been marked as a duplicate of this issue. ***

Comment 3 Miloslav Metelka 2007-11-22 23:14:26 UTC

I'm already working on this and expect to be finished within 6.1 M1.
I'm working on an implementation that will make it possible not only to join all the embedded sections (e.g. all html
sections in jsp) together but also to create multiple subsections. There is currently no API for this but we have
already agreed with Marek that it will be needed anyway, so the impl will be ahead.

Comment 4 Miloslav Metelka 2008-01-15 08:57:42 UTC

There is still some remaining work to finish so I've changed the TM to 6.1M2. I hope to have a first version ready in
about a week.

Comment 5 Miloslav Metelka 2008-05-28 14:02:40 UTC

I finally have the implementation complete. I would like to ask for an API review since the change adds extra methods
to Token that make it possible to find out whether a particular token is a join token or a token part.
Once the change gets integrated, the lexers for language embeddings with LanguageEmbedding.joinSections()==true will see
all the sections with the particular language path as one continuous input. If the produced tokens span section
boundaries then the infrastructure will break the tokens into parts automatically. There should be no changes necessary
in the lexers' implementations since the lexers should not care how the characters in the LexerInput get composed.
TokenSequence.embeddedJoined() provides a token sequence over the joined tokens.

I'm still adding some extra tests including a test with random modifications of a document with joined embeddings and
fixing some minor problems.

I apologize for significantly underestimating the necessary work. In the end I had to rewrite both the TokenListUpdater
and TokenHierarchyUpdate almost completely, separating the analysis for regular and joined token lists
(TLU.updateRegular() and TLU.updateJoined()) while keeping a common relexing part TLU.relex().
It was the biggest change in the lexer module since its introduction. OTOH there are some positive aspects:

1) The implementation of a JoinTokenList is rather lightweight. The tokens from individual EmbeddedTokenList members are
not copied into a joined token list. Instead just the token parts split among multiple ETLs point to a special JoinToken
and there is an extra EmbeddedJoinInfo that carries meta info necessary for efficient searching and iteration in a
JoinTokenList implementation.

2) The implementation can maintain either a single join token list (the current state) or multiple join token lists
(having multiple separated "section groups") across the document to support a potential use case expressed by the web team.

3) I have improved the performance of accessing the text of embedded tokens (originally raised by the Schliemann team).
For a token implementing the CharSequence there is a certain overhead in accessing each character, so for long tokens an
inputText-CharSequence.subSequence() gets used to have the most direct access to the characters.

4) Token.text().toString() caching is implemented (originally requested by Schliemann). Although some toString() usages
can be eliminated by using methods in TokenUtilities, there may be legitimate usages, e.g. for use in a HashMap. The
current impl starts to cache text strings for longer tokens sooner than for the short ones.

5) Initialization of embedded token lists should be more deterministic (lazy ETL.init() was eliminated). It could
possibly eliminate the infamous "ISE: Removing at index=xx but real index is yy ..." which can sometimes still be seen.
I also plan to turn the read/write-locking checks (that can now be turned on by using a logger) into assertions
(that would always be checked except in FCS builds, which run without assertions). But I first need to fix the
problematic usages and also migrate the lexer's tests.

6) Token.isRemoved() is implemented, which makes it possible to determine whether a particular token was already removed
from the token hierarchy. This is used by the infrastructure, but it is also useful externally when doing offset binary
searches in an array of tokens where some of them may already be removed. Since the removed tokens retain their original
offsets, the binary search could run into an infinite loop. Token.isRemoved() makes it possible to remedy the problem.
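
The situation in point 6 can be illustrated with a small standalone sketch. It assumes a hypothetical Tok class that
carries only an offset and a removed flag (it is not the real Token class), and it assumes that the tokens still present
in the hierarchy are sorted by offset. For simplicity it filters out the removed tokens before searching; a real client
would more likely drop them from its cache once.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch; Tok is a stand-in and not the lexer module's Token. */
final class RemovedTokenSearchSketch {

    static final class Tok {
        final int offset;      // a removed token keeps its original (stale) offset
        final boolean removed; // plays the role of Token.isRemoved()

        Tok(int offset, boolean removed) {
            this.offset = offset;
            this.removed = removed;
        }
    }

    /** Returns the live token with the greatest offset <= target, or null if there is none. */
    static Tok findAtOrBefore(List<Tok> cached, int target) {
        List<Tok> live = new ArrayList<>();
        for (Tok t : cached) {
            if (!t.removed) {
                live.add(t); // stale offsets would break the sort order the search relies on
            }
        }
        int low = 0;
        int high = live.size() - 1;
        Tok result = null;
        while (low <= high) {
            int mid = (low + high) >>> 1;
            if (live.get(mid).offset <= target) {
                result = live.get(mid);
                low = mid + 1;
            } else {
                high = mid - 1;
            }
        }
        return result;
    }
}
```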

Comment 6 Miloslav Metelka 2008-05-28 14:05:15 UTC

Created attachment 62062 [details]
Diff of the proposed change ~660kB

Comment 7 Miloslav Metelka 2008-06-04 23:37:54 UTC

I have added a JoinRandomTest to test the algorithm with random document modifications. It revealed >10 problems that I
have fixed.
I have tried opening and editing several file types including java and jsp. I hope that the change will not introduce
any significant regressions.
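
The general technique behind such a random test can be sketched as follows: apply random edits to a text, update some
state incrementally after each edit, and compare it with the state recomputed from scratch. The sketch is not the
JoinRandomTest code; RandomModifySketch and SpaceCounter are hypothetical, and counting spaces merely stands in for
incremental relexing.

```java
import java.util.Random;

/** Hypothetical sketch of randomized testing of an incremental algorithm. */
final class RandomModifySketch {

    /** Toy incremental structure: tracks the number of spaces in a text. */
    static final class SpaceCounter {
        final StringBuilder text = new StringBuilder();
        int spaces; // updated incrementally on each modification

        void insert(int offset, String s) {
            text.insert(offset, s);
            spaces += count(s);
        }

        void remove(int offset, int length) {
            spaces -= count(text.substring(offset, offset + length));
            text.delete(offset, offset + length);
        }
    }

    /** Batch computation used as the reference result. */
    static int count(CharSequence s) {
        int n = 0;
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) == ' ') {
                n++;
            }
        }
        return n;
    }

    /** Applies random edits and checks the incremental state after each one. */
    static boolean run(long seed, int steps) {
        Random random = new Random(seed); // a fixed seed makes a failure reproducible
        SpaceCounter counter = new SpaceCounter();
        for (int i = 0; i < steps; i++) {
            int length = counter.text.length();
            if (length == 0 || random.nextBoolean()) {
                String s = random.nextBoolean() ? " " : "ab c";
                counter.insert(random.nextInt(length + 1), s);
            } else {
                int offset = random.nextInt(length);
                counter.remove(offset, random.nextInt(length - offset) + 1);
            }
            if (counter.spaces != count(counter.text)) {
                return false; // incremental and batch results diverged
            }
        }
        return true;
    }
}
```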

changeset:   82742:f6b2e0181e07
tag:         tip
user:        Miloslav Metelka <mmetelka@netbeans.org>
date:        Thu Jun 05 00:37:02 2008 +0200
summary:     #117450 - Provide unified LexerInput across multiple joined embedded sections.

Comment 8 Jesse Glick 2008-06-16 18:48:45 UTC

You seem to have broken commit validation due to an incompatible change in lexer.

Comment 9 Jan Lahoda 2008-06-16 18:51:33 UTC

I have added default bodies to the newly added methods:
http://hg.netbeans.org/main/rev/5264dc709dd0

Comment 10 Jesse Glick 2008-06-16 19:17:28 UTC

Test compilation for many modules also got broken due to the change in constructor signatures in TestRandomModify.

Comment 11 Miloslav Metelka 2008-06-16 22:03:58 UTC

Apologies, since the constructor of Token is

    protected Token() {
        if (!(this instanceof org.netbeans.lib.lexer.token.AbstractToken)) {
            throw new IllegalStateException("Custom token implementations prohibited."); // NOI18N
        }
    }

I wrongly assumed that I could add abstract methods to it. Thanks to Honza for fixing the problem.
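
The difference between adding an abstract method and adding a method with a default body can be shown with a small
standalone sketch. BaseToken, OldToken and NewToken are hypothetical classes, and throwing
UnsupportedOperationException from the default body is only an assumption made for illustration; it is not necessarily
what the actual fix does.

```java
/** Hypothetical sketch; these classes are not the lexer module's Token classes. */
final class DefaultBodySketch {

    static abstract class BaseToken {
        abstract CharSequence text();

        // Added later. Declaring this method abstract would stop every existing
        // subclass from compiling; a default body keeps those subclasses valid.
        boolean isRemoved() {
            throw new UnsupportedOperationException("Not implemented by this token.");
        }
    }

    /** Written before isRemoved() existed; still compiles unchanged. */
    static final class OldToken extends BaseToken {
        @Override
        CharSequence text() {
            return "old";
        }
    }

    /** Written after the change; overrides the new method. */
    static final class NewToken extends BaseToken {
        @Override
        CharSequence text() {
            return "new";
        }

        @Override
        boolean isRemoved() {
            return false;
        }
    }
}
```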

I have fixed the TestRandomModify:
http://hg.netbeans.org/main/rev/ae1faae6f61b