87014 – Preserve lexer state between separate blocks of embedded language

Summary: Preserve lexer state between separate blocks of embedded language

Status:	RESOLVED FIXED

Alias:	None

Product:	editor
Classification:	Unclassified
Component:	Lexer
Version:	6.x
Hardware:	All All

Importance:	P1 blocker
Assignee:	Miloslav Metelka

URL:
Keywords:

Depends on:	91184
Blocks:	101119 102794 104242 104413 104958 105145 108549 110996 111546

Reported:	2006-10-12 15:39 UTC by Marek Fukala
Modified:	2007-08-30 22:19 UTC
CC List:	4 users

See Also:
Issue Type:	DEFECT
Exception Reporter:

Attachments
Diff adding LanguageEmbedding.joinSections() (1.30 KB, text/plain) 2006-10-27 15:31 UTC, Miloslav Metelka

Description Marek Fukala 2006-10-12 15:39:13 UTC

We have the following use case in the JSP editor:

<!-- 
<h1>
<%="hello"%>
</h1>
-->

JspLexer now returns JspTokenId.TEXT for the first part (up to <%=), then
JspTokenId.SCRIPTING, and then JspTokenId.TEXT again for the rest. The TEXT
tokens are then used for the HTMLLanguage embedding. We need to carry the end
lexer state of the first HTML block over to the second one (so that the text
below <%="hello"%> is marked as a comment). Currently the lexer starts lexing
each embedded block from the INIT state. It is very important for the JSP editor
(and probably not only for this one) to have such functionality.
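
To illustrate the requested behaviour, here is a minimal sketch, assuming a toy
two-state lexer. The class StateCarryOverDemo, its State enum and lexSection()
are hypothetical names made up for this illustration; they are not part of the
NetBeans Lexer API or of the real HTML lexer. The sketch lexes the HTML text
after the scriptlet twice: once restarted from INIT (the behaviour reported
here) and once with the end state of the first block carried over (the
behaviour asked for).

```java
import java.util.ArrayList;
import java.util.List;

final class StateCarryOverDemo {

    /** States of a toy HTML lexer (hypothetical; not the real NetBeans HTML lexer). */
    enum State { INIT, IN_COMMENT }

    /** Tokens of one section plus the lexer state reached at the section's end. */
    static final class Result {
        final List<String> tokens = new ArrayList<String>();
        State endState;
    }

    /** Lexes one embedded section, starting in the given state. */
    static Result lexSection(String text, State startState) {
        Result result = new Result();
        State state = startState;
        int pos = 0;
        int searchFrom = 0; // where to look for "-->"; skips a "<!--" opened in this section
        while (pos < text.length()) {
            if (state == State.INIT) {
                int open = text.indexOf("<!--", pos);
                int end = (open < 0) ? text.length() : open;
                if (end > pos) {
                    result.tokens.add("TEXT[" + text.substring(pos, end) + "]");
                }
                pos = end;
                if (open >= 0) {
                    state = State.IN_COMMENT;
                    searchFrom = pos + 4;
                }
            } else {
                int close = text.indexOf("-->", Math.max(searchFrom, pos));
                int end = (close < 0) ? text.length() : close + 3;
                result.tokens.add("COMMENT[" + text.substring(pos, end) + "]");
                pos = end;
                if (close >= 0) {
                    state = State.INIT;
                }
            }
        }
        result.endState = state;
        return result;
    }

    public static void main(String[] args) {
        String first = "<!-- <h1>";  // HTML text before <%="hello"%>
        String second = "</h1> -->"; // HTML text after it
        Result r1 = lexSection(first, State.INIT);
        System.out.println(r1.tokens + " ends in " + r1.endState);
        System.out.println("restart from INIT: " + lexSection(second, State.INIT).tokens);
        System.out.println("carry state over:  " + lexSection(second, r1.endState).tokens);
    }
}
```

Restarting from INIT classifies the second block as plain text, while passing
r1.endState classifies it as the remainder of the comment.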

Comment 1 Miloslav Metelka 2006-10-13 10:39:16 UTC

I understand this requirement. In fact, in some sense it's similar to the
handling of the embedding of javadoc comments in java.
The way that I have explored already: as the html comment is treated as a single
token, this requirement leads to the possibility of having a token consisting of
individual parts that may be located at different places in the document and
that need to be merged together prior to the actual lexing. I've spent
several months in the past trying to implement this model and it is a nightmare.
Not only is the implementation complicated but the usage is difficult too - the
fact that a single token's content is spread across the document complicates the
usage for the clients and also the notification about the changes in the token
list. So I would not like to go this way again.

Instead I would like to retain the token hierarchy as a regular tree of tokens,
but of course we will need to add additional information to the token so that
the user will be able to figure out that the token is a part of something
bigger. I propose to have an enum TokenPartInfo { COMPLETE, START, INNER, END }
and Token.getPartInfo() returning it. The TokenFactory will contain additional
methods to give the part info at token creation time. The thing that I
especially like about this solution is that we can conveniently use this
mechanism to express incomplete tokens - they are just the START part of the
regular tokens. This will allow us to remove *_INCOMPLETE token ids completely!
It's good because the incomplete token ids clutter the language definition (it's
necessary to check for them etc.).
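
A minimal sketch of the proposal above, assuming the enum and accessor names
from this comment (TokenPartInfo, getPartInfo()). The Token class, the
partInfo() helper and isUnterminated() are hypothetical and exist only for this
illustration; the API that was eventually committed may use different names.
The sketch shows how a client could detect an incomplete token through its part
info instead of through a separate *_INCOMPLETE token id.

```java
final class TokenPartInfoSketch {

    /** Enum as proposed in the comment; the final API may differ. */
    enum TokenPartInfo { COMPLETE, START, INNER, END }

    /** Hypothetical minimal token: an id, its text and its part info. */
    static final class Token {
        private final String id;
        private final String text;
        private final TokenPartInfo partInfo;

        Token(String id, String text, TokenPartInfo partInfo) {
            this.id = id;
            this.text = text;
            this.partInfo = partInfo;
        }

        String id() { return id; }
        String text() { return text; }
        TokenPartInfo getPartInfo() { return partInfo; }
    }

    /** Derives the part info from whether a piece contains the token's start and end. */
    static TokenPartInfo partInfo(boolean containsStart, boolean containsEnd) {
        if (containsStart && containsEnd) {
            return TokenPartInfo.COMPLETE;
        }
        if (containsStart) {
            return TokenPartInfo.START;
        }
        return containsEnd ? TokenPartInfo.END : TokenPartInfo.INNER;
    }

    /** True if the token's end has not been seen yet (START or INNER piece). */
    static boolean isUnterminated(Token token) {
        TokenPartInfo part = token.getPartInfo();
        return part == TokenPartInfo.START || part == TokenPartInfo.INNER;
    }

    public static void main(String[] args) {
        // An unclosed comment keeps the regular COMMENT id and is marked as START.
        Token unclosed = new Token("COMMENT", "<!-- never closed", partInfo(true, false));
        System.out.println(unclosed.id() + " " + unclosed.getPartInfo()
                + " unterminated=" + isUnterminated(unclosed));
    }
}
```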

Now the lexer will have to be able to restart with the information that it
continues an incomplete token. Although the lexer could express that by going
into a special state recorded after the last incomplete token in the lexed
embedded section, I don't like this solution much because many lexers only have
a null state and this extra state after the last token would be the only
non-null state that they produce. For lexers with a null state there is an
optimization that the state is simply not stored at all. Although I could
benefit from the fact that it's only after the last token, so I could treat the
last state specially, it's still complicated. Instead I propose to pass the last
incomplete token instance from the previous section to the lexer at the time of
its construction.
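
A minimal sketch of that idea, assuming a stateless toy lexer that only knows
TEXT and COMMENT tokens. All names here (RestartWithTokenSketch, Tok, Part,
CommentLexer, lexSection()) are hypothetical and are not the NetBeans SPI. The
lexer keeps no state object between sections; it only inspects the last token
of the previous section, handed to its constructor, to decide whether it
continues an unfinished comment.

```java
import java.util.ArrayList;
import java.util.List;

final class RestartWithTokenSketch {

    enum Part { COMPLETE, START, INNER, END }

    /** Hypothetical token: id, text and which part of a larger token it is. */
    static final class Tok {
        final String id;
        final String text;
        final Part part;

        Tok(String id, String text, Part part) {
            this.id = id;
            this.text = text;
            this.part = part;
        }

        @Override
        public String toString() {
            return id + "/" + part + "[" + text + "]";
        }
    }

    /** Toy lexer restarted from the previous section's last token (may be null). */
    static final class CommentLexer {
        private final String input;
        private int pos;
        private boolean inComment;

        CommentLexer(String input, Tok lastTokenOfPreviousSection) {
            this.input = input;
            Tok prev = lastTokenOfPreviousSection;
            // Continue a comment only if the previous section ended inside one.
            this.inComment = prev != null && prev.id.equals("COMMENT")
                    && (prev.part == Part.START || prev.part == Part.INNER);
        }

        /** Returns the next token, or null at the end of the section. */
        Tok nextToken() {
            if (pos >= input.length()) {
                return null;
            }
            int start = pos;
            if (inComment || input.startsWith("<!--", pos)) {
                boolean continued = inComment;
                int close = input.indexOf("-->", continued ? pos : pos + 4);
                pos = (close < 0) ? input.length() : close + 3;
                inComment = false;
                Part part = (close >= 0)
                        ? (continued ? Part.END : Part.COMPLETE)
                        : (continued ? Part.INNER : Part.START);
                return new Tok("COMMENT", input.substring(start, pos), part);
            }
            int open = input.indexOf("<!--", pos);
            pos = (open < 0) ? input.length() : open;
            return new Tok("TEXT", input.substring(start, pos), Part.COMPLETE);
        }
    }

    static List<Tok> lexSection(String input, Tok lastTokenOfPreviousSection) {
        CommentLexer lexer = new CommentLexer(input, lastTokenOfPreviousSection);
        List<Tok> tokens = new ArrayList<Tok>();
        for (Tok t = lexer.nextToken(); t != null; t = lexer.nextToken()) {
            tokens.add(t);
        }
        return tokens;
    }

    public static void main(String[] args) {
        List<Tok> first = lexSection("<!-- <h1>", null);
        Tok last = first.get(first.size() - 1);
        System.out.println(first);
        System.out.println(lexSection("</h1> --> tail", last));
    }
}
```

Reaching the end of a section inside a comment simply yields a START (or INNER)
piece, so no extra non-null lexer state has to be stored for the section.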

Now it's a question how to express that the sections should be connected and how
to extend the notification model to cover this requirement.

Comment 2 Marek Fukala 2006-10-13 11:02:42 UTC

>The way that I have explored already: As the html comment is treated as a single
>token this requirement leads to possibility to have token consisting of
>individual parts that may possibly be located at different places in the
>document and that need to be merged together prior actual lexing. I've spent
>several months in the past trying to implement this model and it is a nightmare.

Actually I do not require exactly what you described. I have no problem with
having the entire HTML comment split into several HTML comment tokens. So the
first HTML comment token would end just before the '<%...' characters and
another one would start just after '...%>'. The problem is that I need to start
lexing the second part of the comment in the same lexer state in which the lexer
stopped in the previous html comment part. Maybe implementing this would be
easier than what you described.

What would happen in the case you implement the 'continuous token' when the '<%'
symbol appears? The embedding for it will be null (or at least not -html) so the
lexer should somehow stop lexing. Would it receive EOL? or what? If it gets EOL,
it will create the HTML token anyway. 

Maybe I am too much usecase-oriented so I do not have the entire context, you
are the god of lexing so please do what is right ;-).

Comment 3 Miloslav Metelka 2006-10-13 13:51:43 UTC

> The problem is that I need to start lexing
> of the second part of the comment in the same lexer state when the lexer stopped
> in the previous html comment part. Maybe implementing this would be easier than
> what you described.
Yes, I understand that, and as I've described in the third paragraph the lexer
would be given the state (which would of course be the state after the last
token in the previous section) and the last token from the previous section (to
cover those stateless lexers efficiently). So this may be a bit of extra work
for the lexer, which has to check the token from the previous section, but it's
negligible.

> What would happen in the case you implement the 'continuous token' when the 
> '<%' symbol appears? The embedding for it will be null (or at least not -html)
> so the lexer should somehow stop lexing. Would it receive EOL? or what? If it 
> gets EOL, it will create the HTML token anyway.
The lexer will get EOF at the end of each embedded html section so it will be
pushed to create an incomplete html comment token. This is a difference from the
"The way that I have explored already:" - read this as "The way I don't want to
implement" :) where the lexer just would not notice the borders between the
embedded sections - it would just see one long character sequence consisting of
all the embedded sections that should be concatenated. That would be nice but
too hard to implement.

In practice the lexer must be able to return incomplete tokens anyway (to be
able to tokenize characters till the end of the input) so this should add no
extra burden to the lexer's complexity.

Comment 4 Miloslav Metelka 2006-10-27 15:29:53 UTC

Marek, we should first clarify whether it's enough to attempt to join ALL the
sections with the particular language path, i.e. "text/x-jsp/text/html", or
whether there are any external criteria that define additional conditions
regarding joining.
I only assume simple cases like this:

<!-- Comment start <% System.out.println("Nazdar"); %> still in comment -->

but are there any more complicated cases where the html sections would
eventually NOT be joined?

If there is a simple joining only then I would propose to add "boolean
joinSections()" into LanguageEmbedding (attaching patch) and modify the
infrastructure to comply. Not sure whether any changes are needed on the API
level but I would first like to clarify the SPI.
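
A minimal sketch of the simple joining rule discussed here, assuming every
section carries its language path and a joinSections flag. The Section class
and the predecessors() helper are hypothetical names for illustration only;
they do not reproduce the attached patch or the real LanguageEmbedding class.
For each section, the helper finds the previous section with the same language
path, which is the one whose end state (or last token) would be handed to the
lexer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class JoinSectionsSketch {

    /** Hypothetical embedded section: its language path, join flag and text. */
    static final class Section {
        final String languagePath;
        final boolean joinSections;
        final String text;

        Section(String languagePath, boolean joinSections, String text) {
            this.languagePath = languagePath;
            this.joinSections = joinSections;
            this.text = text;
        }
    }

    /**
     * For each section, returns the index of the previous section with the same
     * language path that it should continue from, or -1 if it starts fresh.
     */
    static int[] predecessors(List<Section> sections) {
        int[] prev = new int[sections.size()];
        Map<String, Integer> lastByPath = new HashMap<String, Integer>();
        for (int i = 0; i < sections.size(); i++) {
            Section s = sections.get(i);
            Integer last = lastByPath.get(s.languagePath);
            prev[i] = (s.joinSections && last != null) ? last : -1;
            lastByPath.put(s.languagePath, i);
        }
        return prev;
    }

    public static void main(String[] args) {
        List<Section> sections = new ArrayList<Section>();
        sections.add(new Section("text/x-jsp/text/html", true, "<!-- Comment start "));
        sections.add(new Section("text/x-jsp/text/x-java", false,
                " System.out.println(\"Nazdar\"); "));
        sections.add(new Section("text/x-jsp/text/html", true, " still in comment -->"));
        System.out.println(Arrays.toString(predecessors(sections)));
    }
}
```

With joinSections set to false for the HTML sections, every entry would be -1
and each section would be lexed from its initial state.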

Comment 5 Miloslav Metelka 2006-10-27 15:31:05 UTC

Created attachment 35602 [details]
Diff adding LanguageEmbedding.joinSections()

Comment 6 Marek Fukala 2007-04-05 12:30:35 UTC

IMHO all pieces of an embedded language with the same language path should be
joined, in terms of passing the lexer state between each two adjacent sections.
I do not see any use case which would require the joinSections() method, since
you can always return a complete token and set the lexer state to INIT at the
end of the section if you want. I am not sure about the stateless lexers, I do
not know how they work; I am just talking about my use case.

Comment 7 Marek Fukala 2007-04-25 16:44:32 UTC

The issue is neither a P1 nor a J1 stopper since I have partially fixed/worked
around issue #99526, which is affected by this issue - it is now just a P2.

We will need this issue fixed in M10 though.

Comment 8 Marek Fukala 2007-07-10 17:24:35 UTC

Making this issue a P1 problem - it really breaks many things - see the list of blocked issues.

Comment 10 Miloslav Metelka 2007-08-30 22:19:00 UTC

The committed change is the basic implementation, which lexes each section individually and only transfers the state
between the sections. As Marek, Hanz and I have already discussed, I will now work on a solution that will virtually
join the sections so that the lexer will not see the individual sections' EOFs.
I will also add some more tests and coordinate with Marek regarding possible problems with JSPs etc.
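
A minimal sketch of what "virtually joining" could look like, assuming the
sections are presented to the lexer as one character sequence. The class
JoinedSectionsView and its sectionIndex() method are hypothetical and are not
taken from the committed code. The view concatenates the section texts without
copying them into the document and maps a joined offset back to the section it
came from, so a lexer reading the view never meets an EOF between sections.

```java
final class JoinedSectionsView implements CharSequence {

    private final String[] sections;
    private final int[] starts; // starts[i] = joined offset at which section i begins
    private final int length;

    JoinedSectionsView(String... sections) {
        this.sections = sections.clone();
        this.starts = new int[sections.length];
        int offset = 0;
        for (int i = 0; i < sections.length; i++) {
            starts[i] = offset;
            offset += sections[i].length();
        }
        this.length = offset;
    }

    /** Maps an offset in the joined text to the index of the section containing it. */
    int sectionIndex(int index) {
        if (index < 0 || index >= length) {
            throw new IndexOutOfBoundsException("index " + index);
        }
        // Scan from the end so that empty sections are skipped correctly.
        for (int i = sections.length - 1; i >= 0; i--) {
            if (index >= starts[i]) {
                return i;
            }
        }
        throw new AssertionError("unreachable: the first section starts at offset 0");
    }

    @Override
    public int length() {
        return length;
    }

    @Override
    public char charAt(int index) {
        int section = sectionIndex(index);
        return sections[section].charAt(index - starts[section]);
    }

    @Override
    public CharSequence subSequence(int start, int end) {
        return toString().substring(start, end);
    }

    @Override
    public String toString() {
        StringBuilder sb = new StringBuilder(length);
        for (String section : sections) {
            sb.append(section);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        JoinedSectionsView view = new JoinedSectionsView("<!-- <h1>", "</h1> -->");
        int close = view.toString().indexOf("-->");
        System.out.println("comment ends in section " + view.sectionIndex(close));
    }
}
```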

Checking in src/org/netbeans/lib/lexer/TokenHierarchyOperation.java;
/cvs/lexer/src/org/netbeans/lib/lexer/TokenHierarchyOperation.java,v  <--  TokenHierarchyOperation.java
new revision: 1.18; previous revision: 1.17
done
Checking in src/org/netbeans/lib/lexer/LexerUtilsConstants.java;
/cvs/lexer/src/org/netbeans/lib/lexer/LexerUtilsConstants.java,v  <--  LexerUtilsConstants.java
new revision: 1.16; previous revision: 1.15
done
RCS file: /cvs/lexer/src/org/netbeans/lib/lexer/EmbeddedLexerInputOperation.java,v
done
Checking in src/org/netbeans/lib/lexer/EmbeddedLexerInputOperation.java;
/cvs/lexer/src/org/netbeans/lib/lexer/EmbeddedLexerInputOperation.java,v  <--  EmbeddedLexerInputOperation.java
initial revision: 1.1
done
Checking in src/org/netbeans/lib/lexer/TextLexerInputOperation.java;
/cvs/lexer/src/org/netbeans/lib/lexer/TextLexerInputOperation.java,v  <--  TextLexerInputOperation.java
new revision: 1.5; previous revision: 1.4
done
Checking in src/org/netbeans/lib/lexer/SubSequenceTokenList.java;
/cvs/lexer/src/org/netbeans/lib/lexer/SubSequenceTokenList.java,v  <--  SubSequenceTokenList.java
new revision: 1.11; previous revision: 1.10
done
Checking in src/org/netbeans/lib/lexer/LexerInputOperation.java;
/cvs/lexer/src/org/netbeans/lib/lexer/LexerInputOperation.java,v  <--  LexerInputOperation.java
new revision: 1.11; previous revision: 1.10
done
Checking in src/org/netbeans/lib/lexer/LAState.java;
/cvs/lexer/src/org/netbeans/lib/lexer/LAState.java,v  <--  LAState.java
new revision: 1.5; previous revision: 1.4
done
Checking in src/org/netbeans/lib/lexer/TokenListList.java;
/cvs/lexer/src/org/netbeans/lib/lexer/TokenListList.java,v  <--  TokenListList.java
new revision: 1.4; previous revision: 1.3
done
Checking in src/org/netbeans/lib/lexer/EmbeddedTokenList.java;
/cvs/lexer/src/org/netbeans/lib/lexer/EmbeddedTokenList.java,v  <--  EmbeddedTokenList.java
new revision: 1.10; previous revision: 1.9
done
Checking in src/org/netbeans/lib/lexer/EmbeddingContainer.java;
/cvs/lexer/src/org/netbeans/lib/lexer/EmbeddingContainer.java,v  <--  EmbeddingContainer.java
new revision: 1.10; previous revision: 1.9
done
RCS file: /cvs/lexer/src/org/netbeans/lib/lexer/TokenSequenceList.java,v
done
Checking in src/org/netbeans/lib/lexer/TokenSequenceList.java;
/cvs/lexer/src/org/netbeans/lib/lexer/TokenSequenceList.java,v  <--  TokenSequenceList.java
initial revision: 1.1
done
Checking in src/org/netbeans/lib/lexer/inc/TokenListUpdater.java;
/cvs/lexer/src/org/netbeans/lib/lexer/inc/TokenListUpdater.java,v  <--  TokenListUpdater.java
new revision: 1.17; previous revision: 1.16
done
Checking in src/org/netbeans/lib/lexer/inc/IncTokenList.java;
/cvs/lexer/src/org/netbeans/lib/lexer/inc/IncTokenList.java,v  <--  IncTokenList.java
new revision: 1.11; previous revision: 1.10
done
Checking in src/org/netbeans/lib/lexer/inc/MutableTokenList.java;
/cvs/lexer/src/org/netbeans/lib/lexer/inc/MutableTokenList.java,v  <--  MutableTokenList.java
new revision: 1.5; previous revision: 1.4
done
RCS file: /cvs/lexer/test/unit/src/org/netbeans/lib/lexer/lang/TestJoinSectionsTopTokenId.java,v
done
Checking in test/unit/src/org/netbeans/lib/lexer/lang/TestJoinSectionsTopTokenId.java;
/cvs/lexer/test/unit/src/org/netbeans/lib/lexer/lang/TestJoinSectionsTopTokenId.java,v  <--  TestJoinSectionsTopTokenId.java
initial revision: 1.1
done
RCS file: /cvs/lexer/test/unit/src/org/netbeans/lib/lexer/lang/TestJoinSectionsTextTokenId.java,v
done
Checking in test/unit/src/org/netbeans/lib/lexer/lang/TestJoinSectionsTextTokenId.java;
/cvs/lexer/test/unit/src/org/netbeans/lib/lexer/lang/TestJoinSectionsTextTokenId.java,v  <--  TestJoinSectionsTextTokenId.java
initial revision: 1.1
done
RCS file: /cvs/lexer/test/unit/src/org/netbeans/lib/lexer/lang/TestJoinSectionsTopLexer.java,v
done
Checking in test/unit/src/org/netbeans/lib/lexer/lang/TestJoinSectionsTopLexer.java;
/cvs/lexer/test/unit/src/org/netbeans/lib/lexer/lang/TestJoinSectionsTopLexer.java,v  <--  TestJoinSectionsTopLexer.java
initial revision: 1.1
done
RCS file: /cvs/lexer/test/unit/src/org/netbeans/lib/lexer/lang/TestJoinSectionsTextLexer.java,v
done
Checking in test/unit/src/org/netbeans/lib/lexer/lang/TestJoinSectionsTextLexer.java;
/cvs/lexer/test/unit/src/org/netbeans/lib/lexer/lang/TestJoinSectionsTextLexer.java,v  <--  TestJoinSectionsTextLexer.java
initial revision: 1.1
done
Checking in src/org/netbeans/api/lexer/TokenChange.java;
/cvs/lexer/src/org/netbeans/api/lexer/TokenChange.java,v  <--  TokenChange.java
new revision: 1.9; previous revision: 1.8
done
Checking in src/org/netbeans/api/lexer/LanguagePath.java;
/cvs/lexer/src/org/netbeans/api/lexer/LanguagePath.java,v  <--  LanguagePath.java
new revision: 1.9; previous revision: 1.8
done
Checking in src/org/netbeans/api/lexer/TokenSequence.java;
/cvs/lexer/src/org/netbeans/api/lexer/TokenSequence.java,v  <--  TokenSequence.java
new revision: 1.13; previous revision: 1.12
done
Checking in src/org/netbeans/api/lexer/TokenHierarchy.java;
/cvs/lexer/src/org/netbeans/api/lexer/TokenHierarchy.java,v  <--  TokenHierarchy.java
new revision: 1.10; previous revision: 1.9
done
Checking in src/org/netbeans/spi/lexer/LexerRestartInfo.java;
/cvs/lexer/src/org/netbeans/spi/lexer/LexerRestartInfo.java,v  <--  LexerRestartInfo.java
new revision: 1.3; previous revision: 1.2
done
Checking in src/org/netbeans/spi/lexer/LanguageEmbedding.java;
/cvs/lexer/src/org/netbeans/spi/lexer/LanguageEmbedding.java,v  <--  LanguageEmbedding.java
new revision: 1.9; previous revision: 1.8
done
Checking in src/org/netbeans/spi/lexer/LanguageHierarchy.java;
/cvs/lexer/src/org/netbeans/spi/lexer/LanguageHierarchy.java,v  <--  LanguageHierarchy.java
new revision: 1.14; previous revision: 1.13
done
Checking in test/unit/src/org/netbeans/lib/lexer/test/inc/TokenHierarchySnapshotTest.java;
/cvs/lexer/test/unit/src/org/netbeans/lib/lexer/test/inc/TokenHierarchySnapshotTest.java,v  <--  TokenHierarchySnapshotTest.java
new revision: 1.9; previous revision: 1.8
done
Checking in nbproject/project.properties;
/cvs/lexer/nbproject/project.properties,v  <--  project.properties
new revision: 1.15; previous revision: 1.14
done
RCS file: /cvs/lexer/test/unit/src/org/netbeans/lib/lexer/JoinSectionsTest.java,v
done
Checking in test/unit/src/org/netbeans/lib/lexer/JoinSectionsTest.java;
/cvs/lexer/test/unit/src/org/netbeans/lib/lexer/JoinSectionsTest.java,v  <--  JoinSectionsTest.java
initial revision: 1.1
done
Checking in test/unit/src/org/netbeans/lib/lexer/TokenSequenceListTest.java;
/cvs/lexer/test/unit/src/org/netbeans/lib/lexer/TokenSequenceListTest.java,v  <--  TokenSequenceListTest.java
new revision: 1.4; previous revision: 1.3
done
Checking in api/apichanges.xml;
/cvs/lexer/api/apichanges.xml,v  <--  apichanges.xml
new revision: 1.21; previous revision: 1.20
done