Design Patterns: Parsing A File

One of the more common problems we encounter when writing software is parsing a file. Generally that involves parsing a text file–that is, turning an array or stream of characters into a data structure that can then be operated on. While this routinely comes up when designing a compiler, we see text files, and the need to parse them, practically everywhere: in configuration files, in HTTP transactions where we receive a text response to an API query, even when parsing commands internally.

There are several tools out there for parsing a file, but the general design pattern I want to talk about is the two-phase parsing process we learn about in computer science when studying how to build a compiler.

Those two steps are “lexing” or “tokenizing” a stream–turning a raw sequence of characters into ‘tokens’ that indicate the type of thing we’re looking at (“word”, “number”, “string”, etc.)–and then parsing the stream of tokens to turn them into an internal data structure that we can then operate on directly.

This two-step process is performed when we compile a program. But we also see this two-step tokenization/parsing process happen when parsing a JSON file, when parsing XML, or when parsing our own configuration files.


As a side note, the idea that we’re employing here, of “chunking” big problems into little problems, is its own “design pattern” of sorts.

And “chunking” big problems into more manageable parts is a common idea we see everywhere: in the implementation of a protocol stack for network communications, where each “layer” implements just one piece of the puzzle and is only responsible for solving one problem. In a graphics pipeline, where each stage of computation is handled by a separate step responsible for only one part of the geometric transformations that result in pretty pictures on a screen. Even in the development of a computer operating system, which is generally assembled not as a single monolithic entity, but as a series of individual programs, each responsible for solving only one specialized problem.

This idea of chunking a problem into smaller bits is also referred to as separation of concerns, and is worth understanding in its own right. “Separation of concerns” not only describes chunking problems, but formalizes the idea–adding things like well-defined interfaces and building your code as a series of “black boxes” which can stand on their own.


Tokenization

The first step is tokenization–that is, taking the text file and discovering the “tokens” in that text file which represent our “language.”

Tokenization depends on what we’re parsing. For example, if we were interested in parsing JSON, the tokens we need to look for are specified in the JSON specification, which itself is a very small subset of Javascript. And we can use that specification to build a tokenizer by hand.

The JSON specification itself describes three fundamental object types beyond the incidental one-character punctuation marks that are used to define objects: strings (which are surrounded by double quotes), numbers, and tokens (such as “true” or “false”). As JSON is based on the Javascript specification, I assume “tokens” are all symbols that start with a letter or an underscore, followed by letters, underscores or digits. (So, for example, NULL would be a token, as would A134. Anything a modern programming language would consider a ‘variable’ would be a token.)

Now whenever we read a file for tokenization, we generally want to create a mechanism to “push back” read characters. The idea here is that sometimes we have to read the next character to know it’s not part of the current token. So, for example, if we see the following:

3.14abc

We know after reading ‘3’, ‘.’, ‘1’ and ‘4’ that ‘a’ is not part of the number. So we want to save ‘a’ so it can be read again as the start of the next token.

So we need routines which handle reading from our input stream and pushing back to our input stream. One way to handle this is with an array of pushed-back characters.

We define our tokenizer class in C++, omitting the parts we’re not talking about right now:

/*  JSONLexer
 *
 *      JSON Lexer engine
 */

class JSONLexer
{
    public:
                        JSONLexer(FILE *f);
                        ~JSONLexer();
...
    private:
        FILE            *file;
        uint32_t        line;

        uint8_t         pos;
        uint8_t         stack[8];

        int             readChar();
        void            pushChar(int ch);
...
};

That is, we create a lexer which tokenizes the contents of the file, and we define the push-back stack as the internal members pos and stack.

Our implementations of the internal readChar and pushChar methods–which follow our own notion of “chunking” from above, and thus are only responsible for reading and pushing back characters from our file–look like this:

/*  JSONLexer::readChar
 *
 *      Read the next character in the stream. Note this reads as 8-bit bytes,
 *  and does not validate unicode as UTF-8 characters.
 */

int JSONLexer::readChar()
{
    int ret;

    if (pos > 0) {
        ret = stack[--pos];
    } else {
        ret = fgetc(file);
    }

    if (ret == '\n') ++line;
    return ret;
}

/*  JSONLexer::pushChar
 *
 *      Push the read character back for re-reading again
 */

void JSONLexer::pushChar(int ch)
{
    stack[pos++] = (uint8_t)ch;

    if (ch == '\n') --line;
}

The push routine is easiest to explain: we simply add the character to our stack. (It’s limited to 8 characters, but we assume we won’t need to push back nearly that many characters during parsing.) We also keep track of the number of newlines we’ve read, for reporting purposes. (This isn’t strictly necessary, but becomes convenient later.)

Note that we’d write the same two routines regardless of the file we’re parsing.

That is because any text file we’re reading may require a few characters of “look-ahead” to determine the token we’re currently reading. That is true whether we’re parsing the C language or an XML file.
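
To make that look-ahead concrete, here is a small sketch of how a scanner might consume the digits of “3.14” and then hand the ‘a’ back for the next token. (The scanDigits helper is hypothetical and not part of the actual class; it only illustrates the readChar/pushChar pairing.)

/*  JSONLexer::scanDigits
 *
 *      Hypothetical helper, for illustration only: consume a run of digits
 *  and decimal points, then push back the first character which is not part
 *  of the number so the next read sees it again.
 */

void JSONLexer::scanDigits(std::string &out)
{
    int c;

    while (isdigit(c = readChar()) || (c == '.')) {
        out.push_back((char)c);         /* accumulates "3.14" */
    }

    if (c != -1) pushChar(c);           /* the 'a' is saved for the next token */
}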


Now that we have our push-back stack, we can build the tokenizer itself. Essentially our tokenizer reads as many characters as possible that fit the pattern for a “number”, “token” or “string”–and returns the text value of that object as well as the token ID for that object.

Basically, think of this as an overblown version of ‘readChar’, except we read an entire token and return the token type as well as the contents of the token. (So, for example, our routine readToken may read “null” and return the value TOKEN, with the token string value set to ‘null’.)
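
The token IDs themselves can be simple integer constants. Here is a minimal sketch of one way to define them (the names match the ones used below, but the values are an assumption; the constants in the actual source may differ). Because readToken returns single-character tokens such as ‘{’ or ‘:’ as their raw character value, the constants are chosen to sit above the 8-bit character range:

/*  Token identifiers returned by readToken. (Sketch only; the values in the
 *  actual source may differ.) Values above 0xFF avoid colliding with the
 *  single-character tokens, which are returned as themselves.
 */

#define STRING      0x100           /* a double-quoted string */
#define TOKEN       0x101           /* an identifier such as true, false or null */
#define NUMBER      0x102           /* a numeric value */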

Note that sometimes our parsers also need to read a token ahead. So we also need a flag that indicates we should return the last parsed token rather than read a new token.

The parts of our class that parse tokens look like this:

/*  JSONLexer
 *
 *      JSON Lexer engine
 */

class JSONLexer
{
    public:
                        JSONLexer(FILE *f);
                        ~JSONLexer();

        int             readToken();
        void            pushToken()
                            {
                                pushBack = true;
                            }

        std::string     token;
...
    private:
...
        bool            pushBack;
        int             lastToken;
};

That is, we track the last token we read, and we set a flag indicating whether we should return that token again rather than read a new one. The text value of the token is stored in the field token.

Our tokenizer routine looks like the following. First, we determine if we’re just returning the last read token. If so, clear the flag, and return the token.

/*  JSONLexer::readToken
 *
 *      Read the next token in the stream
 */

int JSONLexer::readToken()
{
    int c;

    /*
     *  If we need to, return last token
     */

    if (pushBack) {
        pushBack = false;
        return lastToken;
    }

Now we strip out whitespace.

Note that for most programming languages and configuration files, white space basically serves as a separator between tokens; it’s what makes ‘ABC DEF’ two tokens instead of one, from the parser’s perspective.

(Some languages, like Markdown, make white space semantically important, which means you’d need a tokenizer that tracks white space in those languages. But for JSON, white space is not semantically meaningful, so we can strip it out.)

If we hit the end of the file, we return the end of file marker value, -1.

    /*
     *  Skip whitespace
     */

    while (isspace(c = readChar())) ;
    if (c == -1) return -1;         /* At EOF */

Now comes the interesting part. Basically we read the next character and, based on that character, we determine the thing we’re reading. For most tokenizers the next character completely determines what we’re reading: whether it’s a token, a number, or a string. Some languages may require us to read a few characters ahead to figure out what we’re doing–but for JSON it’s far simpler.

Stripped of the logic of reading the rest of the token, our tokenizer looks like this:

    /*
     *  Parse strings
     */

    token.clear();
    if (c == '"') {
        ... Read the rest of the string ...
        return lastToken = STRING;
    }

    if (isalpha(c)) {
        ... Read the rest of the token ...
        return lastToken = TOKEN;
    }

    if (isdigit(c) || (c == '-')) {
        ... Read the rest of the number, or determine if the minus sign is
            just a minus sign ...
        return lastToken = NUMBER;
    }

    /*
     *  Return single character
     */

    token.push_back((char)c);
    return lastToken = c;
}
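
As a sketch of what one of those elided branches might look like, here is one way the string case could read the rest of a quoted string. (This is a simplified illustration, not the actual implementation; a production version would also need to handle the full set of escapes, including \u sequences.)

    /*
     *  Sketch of the string branch: read until the closing quote, handling
     *  a couple of the simple backslash escapes.
     */

    if (c == '"') {
        for (;;) {
            c = readChar();
            if ((c == -1) || (c == '"')) break;     /* EOF or closing quote */
            if (c == '\\') {                        /* escape: take the next character */
                c = readChar();
                if (c == -1) break;
                if (c == 'n') c = '\n';             /* translate a common escape */
                if (c == 't') c = '\t';
            }
            token.push_back((char)c);
        }
        return lastToken = STRING;
    }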

The full source code for the class (and the rest of the JSON parser and pretty printer) can be found on GitHub.


Parsing

Now that we have a stream of tokens, we can carry out the second half of the problem, parsing the tokens and making some sense of what those tokens mean.

There are several ways we can handle the parsing problem. One conceptually easy way is to hand-roll a recursive descent parser. The idea here is that at the top level our language defines a ‘thing’ which our parser’s top-level method is responsible for parsing. As we uncover each token we descend into a method which parses that sub-thing, and sometimes that requires recursively calling earlier routines to handle part of the parsing task.

For JSON, there are fundamentally three language constructs: a “value”, an “object” and an “array.” A value can be a string, a number, “true”, “false” or “null”, or it can be an object or an array; objects and arrays in turn contain values themselves.

Our JSON language parser defines three methods, one each to parse a “value”, an “object” and an “array.” (Note that during the parsing process an internal data structure for the JSON object is built inside our class–but for the purposes of this discussion that’s not as important right now.)

/*  JSONParser::parseValue
 *
 *      Parse the value. This assumes we are at the start of a value to parse,
 *  and assumes the next token is either a start of object, start of array,
 *  true, false, null, a number or a string
 */

bool JSONParser::parseValue()
{
    ... Parse the value. If we see a '{', call 'parseObject'. If we see a '[', call 'parseArray' ...
}

/*  JSONParser::parseObject
 *
 *      Parse the object. This is called after the '{' token is seen
 */

bool JSONParser::parseObject()
{
    ... Parse the object. When we need to parse a value, call 'parseValue' ...
}

/*  JSONParser::parseArray
 *
 *      Parse the array object
 */

bool JSONParser::parseArray()
{
    ... Parse the array. Values in the array are parsed by calling 'parseValue' ...
}
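
To make the recursion concrete, here is a sketch of how parseValue’s dispatch might look, assuming the parser keeps a pointer to the JSONLexer in a member called lexer (an assumption; the actual implementation also fires the parsing events described in the side note below, which is omitted here):

/*  Sketch of JSONParser::parseValue's dispatch (simplified, for illustration)
 */

bool JSONParser::parseValue()
{
    switch (lexer->readToken()) {
        case '{':                       /* start of an object */
            return parseObject();
        case '[':                       /* start of an array */
            return parseArray();
        case STRING:                    /* a quoted string */
        case NUMBER:                    /* a numeric value */
        case TOKEN:                     /* true, false or null */
            return true;
        default:                        /* anything else is a syntax error */
            return false;
    }
}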

Side note: One trick we use when building our parser here is to take a page from the playbook of the SAX parser for XML: instead of having our JSONParser object construct a data representation of the JSON object as we parse it, our parsing routines call several abstract methods indicating ‘events’, like “start object” and “start array.” This allows our parser to focus on just parsing the text, and defers the logic of what to do as we see different symbols to a separate class.

You can see this in the way we separate the concerns: the JSONParser class parses the file and calls 11 separate abstract virtual methods in a specific order as we encounter values or objects. The separate JSONRecordParser class extends our parser and implements those interface routines, using them to assemble a data structure which we can use elsewhere.

By separating the concerns in this fashion, in theory you could create your own class as a child class of JSONParser, intercepting the events and handling them in a different way.
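
As an illustration of the idea, such an event interface might look something like the sketch below. (These method names and signatures are hypothetical; the actual JSONParser class declares its own set of eleven methods, which JSONRecordParser then overrides.)

/*  Hypothetical sketch of a SAX-style event interface. The names here are
 *  illustrative only and do not match the actual JSONParser methods.
 */

class SAXStyleParserSketch
{
    public:
        virtual ~SAXStyleParserSketch() { }

        /* Events fired, in document order, as the parser walks the tokens */
        virtual void startObject() = 0;                     /* saw '{' */
        virtual void endObject() = 0;                       /* saw '}' */
        virtual void startArray() = 0;                      /* saw '[' */
        virtual void endArray() = 0;                        /* saw ']' */
        virtual void gotString(const std::string &str) = 0;
        virtual void gotNumber(double value) = 0;
        virtual void gotToken(const std::string &str) = 0;  /* true, false, null */
};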


Overall, then, our parsing process is invoked at the top level with the calls:

JSONLexer lexer(f);
JSONRecordParser parser;
JSONNode *node = parser.parse(&lexer);

And this is a hallmark of our overall two-step pattern: each step is represented as a separate class which handles its own operations. The second class then returns a data structure or object which represents what we parsed, and we can then use the contents of that item as we need to later in our code.


Conclusion

Whenever you see a text file that needs to be parsed, consider breaking the problem into a tokenization step where you break the text file into a series of tokens–and a separate parsing step which parses the tokens and makes sense of what the tokens represent. Usually the parser will wind up building an internal data structure: a parse tree, a document object model or an abstract syntax tree, though sometimes you may wind up employing other algorithms, such as the shunting yard algorithm to parse infix-notation mathematics, evaluating the expressions as they are parsed.

And there are tools that make building such parsers easier–though they are not strictly required in order to build a parser. Sometimes building the parser and tokenizer by hand gives you greater flexibility in catching and correcting syntax errors in the input text.
