Warren Falk

ANTLR

I’m trying to write a parser to recognize Rich Text Format (RTF) and parse it into a meaningful syntax tree.  I’ve decided to use ANTLR which is a tool which is used to write a lexer/parser/tree parser combination.  It is supposed to make this task easier.  So far it has not been.  Either this is because I don’t know how to use the tool, or the tool is just not very useful.  I am still presuming the former, at this point.

ANTLR makes a distinction between “lexing” and “parsing”.  One uses a lexer to break the character stream of a file into “tokens” which are groups of one (actually zero) or more characters.  A parser, then, should be able to take the new stream of tokens and create a tree out of them.  This tree is known as the “abstract syntax tree” or AST.  Now the hard part here, if you’re new to this, as I am, is trying to figure out just what the difference is between a token and a node in the AST, because you have to create “lexer” rules to tease tokens out of the character stream, and then later use “parser” rules to further group/assemble those tokens into nodes in an AST.

The distinction is not obvious, and probably partially subjective, which makes figuring out how to accomplish this quite difficult using the standard try/fail/adjust/retry loop.

For example, here’s a very simple rtf document:

{\rtf\asc\lang1024 {\par This is a test}{\par of the \{emergency\} broadcast system}}

…which would be rendered, roughly, as follows:

This is a test
of the {emergency} broadcast system

Now the “\rtf” and “\lang1024″ are called “controls”.  The “This is a test” and “of the {emergency} broadcast system” are the regular text, and the whol document is in a hierarchy of “groups” delimited by curly braces.  This is a bit of an over simplification, but not by much.  The question is now, when the lexing is complete, where should the lines be drawn between tokens?

Eventually I began to think the tokens should be separated into something like the following:

{, \, rtf, \, asc, \, lang, 1024 , {, \, par, This is a test, }, {, \, par, of the \{emergency\} broadcast system, }

But then I finally got it to work in a reasonably useful manner by breaking it up like this:

{, \rtf, \asc, \lang1024 , {, \par, This is a test, }, {, \par, of the , \{, emergency, \}, broadcast system, }

But this was not obvious, and I’m not even sure whether this is the best way to break this up into tokens, but it ended up working the best.  Some interesting things arose here.  Such as that I expected to break up the “controls” into their constituent parts (the ‘\’ the name and the optional number) but it turns out that the whole control construct works out, nicely, as a single token.  Unfortunately, I’m not quite sure why.  Also, I tried to get the text to be one big token, but had to break it up into strings of text without control sequences and strings of text with control sequences.  I’m not sure why, in this case, either.

Leave a Reply

Not logged in: Log in

.
Entries (RSS) and Comments (RSS).