A lexer is a fairly self-contained piece of code, so it ought to be simple to
unit test. Writing a proper test is probably as easy as, or even easier than,
manually verifying the behavior once in a running IDE. To make this as easy as
possible, the Lexer API should include a test framework that individual
language lexers can reuse. For example:
assertLex(HTMLLanguage.description(), /*some config maybe*/...,
"[BLOCK_COMMENT]<!-- ... -->[WS]\n" +
The assertLex method would strip out any recognized token IDs inside square
brackets, parse the remaining text, and verify that its tokenization follows the
specified sequence. (Of course you could extend this to check information about
embedded languages, etc.) The idea is to make the unit tests as short, readable,
and intuitive as possible.
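
To make the proposal more concrete, here is a minimal sketch of the "strip
the bracketed IDs" step only. It is not part of the Lexer API: the class
AnnotatedSpec, its nested Expected type, and the methods parse() and
inputText() are hypothetical names invented for this illustration, and the
set of recognized token IDs is assumed to be supplied by the caller. A real
assertLex would then run the language's lexer over inputText() and compare
the resulting tokens with the expected list.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Hypothetical helper: splits an annotated spec into expected tokens. */
final class AnnotatedSpec {

    /** One expected token: its ID name and the text it should cover. */
    static final class Expected {
        final String id;
        final String text;

        Expected(String id, String text) {
            this.id = id;
            this.text = text;
        }
    }

    /**
     * Parses a spec such as "[BLOCK_COMMENT]<!-- x -->[WS]\n".
     * Only bracketed names contained in knownIds count as markers;
     * any other bracketed text is treated as ordinary input.
     */
    static List<Expected> parse(String spec, Set<String> knownIds) {
        List<Expected> expected = new ArrayList<>();
        String currentId = null;
        StringBuilder text = new StringBuilder();
        int i = 0;
        while (i < spec.length()) {
            int close = spec.charAt(i) == '[' ? spec.indexOf(']', i) : -1;
            String candidate = close > i ? spec.substring(i + 1, close) : null;
            if (candidate != null && knownIds.contains(candidate)) {
                // A recognized marker ends the previous token, if any.
                if (currentId != null) {
                    expected.add(new Expected(currentId, text.toString()));
                }
                currentId = candidate;
                text.setLength(0);
                i = close + 1;
            } else {
                if (currentId == null) {
                    throw new IllegalArgumentException(
                            "Text before the first token ID marker");
                }
                text.append(spec.charAt(i));
                i++;
            }
        }
        if (currentId != null) {
            expected.add(new Expected(currentId, text.toString()));
        }
        return expected;
    }

    /** Rebuilds the plain input text that would be handed to the lexer. */
    static String inputText(List<Expected> expected) {
        StringBuilder sb = new StringBuilder();
        for (Expected e : expected) {
            sb.append(e.text);
        }
        return sb.toString();
    }
}
```

Restricting markers to a known ID set is one possible way to let input text
contain literal square brackets; other escaping schemes would work as well.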
Currently there is only support for randomized testing, in
org.netbeans.lib.lexer.test.TestRandomModify. It lets you specify the
probability of specific chars and strings, but it only compares the
incrementally updated token list with a batch-lexed one. It does not check the
identity of the individual tokens.
It's true that with a declarative approach, tests such as JavaLexerBatchTest
could become terser and quicker to write, so I like this idea.
Some time ago I wrote LexerTestUtilities.checkTokenDump(), which takes an input
file as a parameter and produces text output describing the tokens created from
the input (please see the javadoc of the method). If no output file exists yet,
the output is written to a file in the same directory with "tokens.txt"
appended to the name, and the test fails, notifying the user that the file was
created (the user should check whether the produced tokens match expectations).
If the output file already exists, its content is compared against the produced
output, and the test fails if the two are not exactly the same.
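
The create-or-compare workflow described above can be sketched roughly as
follows. This is not the actual implementation of checkTokenDump(): the class
GoldenFileCheck and its check() method are hypothetical, the token dump is
assumed to have been produced already and is passed in as a string, and the
".tokens.txt" suffix is inferred from the file names mentioned in this issue.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical sketch of a create-or-compare ("golden file") check. */
final class GoldenFileCheck {

    /**
     * Compares dump with the file next to inputFile whose name is the
     * input file name plus ".tokens.txt". If that file does not exist,
     * it is created and the check fails so that a person reviews it.
     */
    static void check(Path inputFile, String dump) throws IOException {
        Path golden = inputFile.resolveSibling(
                inputFile.getFileName() + ".tokens.txt");
        if (!Files.exists(golden)) {
            Files.write(golden, dump.getBytes(StandardCharsets.UTF_8));
            throw new AssertionError("Created " + golden
                    + "; check that the tokens match expectations and rerun");
        }
        String expected = new String(
                Files.readAllBytes(golden), StandardCharsets.UTF_8);
        if (!expected.equals(dump)) {
            throw new AssertionError("Token dump differs from " + golden);
        }
    }
}
```

Failing on the first run is deliberate in this sketch: it keeps an unreviewed
output file from silently becoming the accepted baseline.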
While coding it I added some features such as multiple inputs (virtual
EOFs), test naming, and special chars. I use extra lines for control
sequences and interleave the directives with dots to distinguish them from
regular input. The input specification could be rewritten in XML if desirable.
Please see TokenDumpTest and TokenDumpTestFile.txt (for control directives) in
lexer, and JavaTokenDumpTest, testInput.java.txt, and
testInput.java.txt.tokens.txt in java/lexer, for an example of using it to
test Java lexer correctness.
If this suffices, I will close this issue.