A lexer is a fairly self-contained piece of code, so it ought to be simple to unit test. Writing a proper test is probably just as easy as, or even easier than, manually verifying the behavior once in a running IDE. To make this as easy as possible, the Lexer API should provide a test framework that particular language lexers can reuse. For example:

    assertLex(HTMLLanguage.description(), /*some config maybe*/...,
        "[BLOCK_COMMENT]<!-- ... -->[WS]\n" +
        "[TAG_OPEN]<[TAG_OPEN_SYMBOL]hello[TAG_OPEN]>" +
        "[TEXT]kitty[TAG_CLOSE]</[TAG_CLOSE_SYMBOL]hello[TAG_CLOSE]>");

The assertLex method would strip out any recognized token IDs inside square brackets, lex the remaining text, and verify that its tokenization follows the specified sequence. (Of course this could be extended to check information about embedded languages, etc.) The idea is to make the unit tests as short, readable, and intuitive as possible.
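To illustrate, here is a minimal sketch of the parsing half of such an assertLex helper: it splits an annotated string like "[WS] [IDENTIFIER]foo" into expected (tokenId, text) pairs and recovers the raw input by stripping the bracketed IDs. All names here (AnnotatedInput, Expected, parse, rawText) are hypothetical, not part of the real Lexer API, and the sketch assumes token text never contains a literal '['. A full assertLex would then lex the raw text and compare the actual token sequence against the expected pairs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * Hypothetical helper: parses an annotated test input such as
 * "[WS] [IDENTIFIER]foo" into the expected token sequence.
 */
public class AnnotatedInput {

    /** One expected token: its ID name and its exact text. */
    public static final class Expected {
        public final String id;
        public final String text;
        Expected(String id, String text) { this.id = id; this.text = text; }
    }

    // A token annotation is "[ID_NAME]" followed by the token's text,
    // which runs until the next '[' or the end of the string.
    private static final Pattern ANNOTATION =
            Pattern.compile("\\[([A-Z_0-9]+)\\]([^\\[]*)");

    /** Parses the annotated string into the expected token sequence. */
    public static List<Expected> parse(String annotated) {
        List<Expected> expected = new ArrayList<>();
        Matcher m = ANNOTATION.matcher(annotated);
        while (m.find()) {
            expected.add(new Expected(m.group(1), m.group(2)));
        }
        return expected;
    }

    /** Concatenates the token texts, i.e. the input with annotations stripped. */
    public static String rawText(List<Expected> expected) {
        StringBuilder sb = new StringBuilder();
        for (Expected e : expected) {
            sb.append(e.text);
        }
        return sb.toString();
    }
}
```

The same parsed pairs could also drive checks on embedded-language information later, since each Expected records exactly which characters a token is supposed to cover.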
Currently there is only support for randomized testing in org.netbeans.lib.lexer.test.TestRandomModify, which allows specifying the probability of particular chars and strings, but it only compares the incrementally updated token list with a batch-lexed one. It does not check the identity of the individual tokens. It's true that with the declarative approach, tests like JavaLexerBatchTest could become more terse and could be written more quickly, so I like this idea.
Some time ago I made LexerTestUtilities.checkTokenDump(), which takes an input file as a parameter and produces a text output describing the tokens created from the input (please see the javadoc of the method). If an output file with ".tokens.txt" appended does not exist yet in the same directory, the output is written to it and the test fails, notifying the user that the file was created (the user should check whether the produced tokens match expectations). If the output file already exists, its content is compared against the produced output, and the test fails if the contents are not exactly the same.

As I was coding it I added some features such as multiple inputs (virtual EOFs), test naming, and special chars. I use extra lines for control sequences and interleave the directives with dots to distinguish them from regular input. The input specification could be rewritten in XML if desirable.

Please see TokenDumpTest and TokenDumpTestFile.txt (for control directives) in lexer, and JavaTokenDumpTest with testInput.java.txt and testInput.java.txt.tokens.txt in java/lexer, for their use in testing java lexer correctness. If this suffices, I will then close this issue.
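The create-on-first-run, compare-on-later-runs behavior described above can be sketched as a small golden-file check. This is only an illustration of the scheme, not the real LexerTestUtilities code; the class and method names are made up.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/**
 * Minimal golden-file check, loosely modeled on the checkTokenDump()
 * behavior described above: on the first run the expected-output file is
 * created and the check fails so the user can review it; on later runs
 * the produced dump is compared against the stored file.
 */
public class GoldenFileCheck {

    /** Returns null on success, or a failure message describing the problem. */
    public static String check(Path goldenFile, String producedDump) throws IOException {
        if (!Files.exists(goldenFile)) {
            // First run: store the dump and fail so the user reviews it.
            Files.write(goldenFile, producedDump.getBytes(StandardCharsets.UTF_8));
            return "Golden file created: " + goldenFile + " -- please review it and re-run.";
        }
        String expected = new String(Files.readAllBytes(goldenFile), StandardCharsets.UTF_8);
        // Contents must match exactly, as in the described checkTokenDump().
        return expected.equals(producedDump) ? null : "Token dump differs from " + goldenFile;
    }
}
```

A test would call check() with the dump produced by lexing the input file and fail whenever a non-null message comes back.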