Right To Left

9 Apr 2008

What do the following three problems have in common?

  • Expanding make-like variable references in text.
  • Parsing Apache-like configuration files using flex and bison.
  • Parsing Python-like program structure (again using flex and bison).

The answer: they can all be done with essentially the same algorithm.

Variable references

Let's say I have a block of text and an environment consisting of a mapping between (text) variable names and (text) values, and that I want to go through the block locating variable references and replacing them with the value of the variable from the environment. Further, let us say the variable references are similar to make's: $(variable). In other words, the variable name is surrounded by parentheses, with a leading dollar sign.

Now, it would be simple to scan through the text looking for strings like "$*" and when finding one, scan forward for the next ")". However, this fails on nested evaluations like:

some text $(a$(variable)) some text

with:

"variable" => "Var"
"aVar" => "some text"

(I am assuming that the values do not have references in them, which was correctly handled in my original problem. However, allowing them does not significantly distort my final conclusion.)

To handle that while scanning forward, I would need to have some bizarro, recursive scanning and evaluation scheme that gets ugly and complex quickly. As I was looking at this problem (there is no real "let's say" here; this is exactly what I was trying to do), I realized it became much simpler if I did the original scan from right to left, from the end of the text block towards the beginning. Then, finding the first "$(" leaves the state looking like:

some text $(a$(variable)) some text
             ^

Subsequently finding the first ")" results in:

some text $(a$(variable)) some text
             ^---------^

which is easily replaced:

some text $(aVar) some text

Repeating that process clearly and easily produces the correct evaluated text (assuming I do not have recursive variable definitions):

some text $(aVar) some text
          ^
          ^-----^
some text some text some text

Further, I realized that, if I kept track of the location of the "$(", I would not need to re-scan the text block repeatedly. Given that I have found the right-most "$(", and that the expansion does not contain "$(", I can just continue moving left from my original location:

some text $(a$(variable)) some text
             ^<--------------------
             ^-------->^
             replace!
          ^<--
          ^------------>^    // not to scale
          replace!
<----------
done!

The resulting algorithm in C++ looks something like:

string::size_type lpos = string::npos;
while (true) {
  lpos = text.rfind("$(", lpos);
  if (lpos == string::npos) {
    // done!
    return text;
  }
  string::size_type rpos = text.find(")"), lpos);
  if (rpos == string::npos) {
    // no closing ")"
    throw Error("Blargh!");
  }
  string key = text.substr( lpos + 2, rpos - lpos - 2 );
  // ...lookup key to produce result, using "" as a default
  string result = get(key, "");
  // replace!
  text.replace(lpos, rpos - lpos + 1, result);
}

To handle variable values which could include references (such as "aVar" mapping to "a$(variable)"), the algorithm would need to change so that the rfind("$(",...) started at the end of the replacement rather than the beginning of the replacement (i.e. lpos + result.length, rather than lpos). However, that algorithm would not catch non-terminating recursive variable references, which are handled elsewhere in this algorithm. (Specifically, get() evaluates variable references in a result it finds before returning the result. Circular references are its responsibility.)

Parsing Apache-like config files

Apache config files are (if I remember correctly; I'm too lazy to look it up) broken into scoped sections using pseudo-XML:

<Directory>
 myDirective
</Directory>

<IfModule ...>
 directive...
</IfModule>

The configuration files I needed to parse are similar, except that (unlike Apache config files, as I recall) the scoped sections can be nested:

<OneThing>
  directive_one = 42
  <OneThingsStuff>
    whoopie_cushion
    hand_buzzer
  </OneThingsStuff>
<OneThing>

This is all well and good, except that

  • Apache config files (and my extensions) are not valid XML.
  • I wanted to parse the dang thing with a framework based on flex and Bison. (Actually, a C++ translation of my famous and award-winning FlexBisonModule framework things.)
  • The XML taggy things make the configuration format context sensitive.

To see the context sensitivity, consider that <a> is closed by </a>; other tags can be nested in between, so the structure is determined by the "a" tag names, which would be hard to do using Bison given that there is no a priori list of tags. (Hard to do, as far as I could tell, anyway.)

But, I suddenly realized, I can punt! Instead of trying to recover the structure while in the Bison parser, I could simply treat "<...>" and "</...>" the same as any other directives, parse the file as a simple list of directive lines, and then recover the tree structure in a pass over the parse result.

I will pass over the parsing process here; suffice it to say that it results in a parse tree consisting of a top element which contains a sequence of directive elements:

lines
  directive
  directive
  ...
  open-tag-directive
  ...
  open-tag-directive
  directive
  ...
  close-tag-directive
  close-tag-directive
  ...

Then, when trying to determine how to recreate the nested, tree structure, I realized it was a variation on the same algorithm as variable expansion above: Go backward through the list to find the first open-tag-directive, then go forward to the first closing-tag-directive; it should match the open-tag. Then, take the directives in between and insert them as children of the open-tag and remove them (and the close-tag-directive) from the list. Repeat as needed.

The C++ code is:

list<Parsing::Symbol::ptr>::reverse_iterator i = sym_children.rbegin();
while (true) {
  i = find_if( i, sym_children.rend(), _is_unstructured_open() );
  if (i == sym_children.rend()) {
    // done!
    break;
  }
  Sect_Open::ptr open = dynamic_pointer_cast<Apache::Sect_Open>( *i );
  Symbol::child_iterator ibase = i.base();
  ++i;
  Symbol::child_iterator j = find_if( ibase, sym_children.end(), _is_close( open->name() ) );
  if (j == sym_children.end()) {
    // no section close!
    throw Nesting_Error("Blag!");
  }
  // Create a new Lines object to hold section
  Lines::ptr lines( new Lines(LINES) );
  // copy lines in section into new Lines object
  for_each( ibase, j, _append(lines) );
  // record new Lines under Sect_Open
  open->set_lines( lines );
  // Remove copied lines, plus Sect_Close
  sym_children.erase( ibase, ++j );
}

This code is part of a method of the Lines class, which is created and filled in by the parser; it contains all of the directive lines from the file. The sym_children attribute is part of the Symbol class in the parsing framework; all Symbols can have sym_children; Lines is a Symbol.

The code uses several STL algorithms with functors; the first is _is_unstructured_open:

struct _is_unstructured_open
{
  bool operator()(Parsing::Symbol::ptr s) {
    Apache::Sect_Open::ptr open = dynamic_pointer_cast<Apache::Sect_Open>(s);
    return (open && !open->structured());
  }
};

The parsing framework makes heavy use of TR1 shared_ptr reference-counted smart pointers; dynamic_pointer_cast is used to down-cast those shared_ptr's. I typically define a type synonym for shared_ptr's to a class in the class as "ptr".

The structured() method returns true if the Sect_Open object already has children; in other words, if its structure has already been recovered, if it has already been handled by an iteration of this code. This test is needed if the search for an open-tag begins at the end of the lines list every time in order to skip over already-examined open-tags; while the final algorithm presented here does not do that, it was useful during development.

The _is_structured_open functor thus returns true if the Symbol being examined is a Sect_Open, and if it has not already been processed.

The iterator ibase, in the original code, represents a forward iterator constructed from the backward iterator i and points to the next element after i.

The iterator j should then point to the close-tag matching open (and thus i) by name. To find j, the find_if algorithm is used again, along with the _is_close functor:

struct _is_close {
  _is_close(const string& name) : _name(name) { }
  string _name;

  bool operator()(Parsing::Symbol::ptr& s) {
    Apache::Sect_Close::ptr close = dynamic_pointer_cast<Apache::Sect_Close>(s);
    return (close && close->name() == _name);
  }
};

This functor returns true if the element s points to a Sect_Close that has the same name as its constructor's argument.

The next step (if the close-tag was found) is to create a new Lines object to represent the sub-structure and to tuck the enclosed directives into that Lines. That is the work of _append:

struct _append {
  _append( Apache::Lines::ptr& lines ) : _lines(lines) { }
  Apache::Lines::ptr _lines;

  bool operator()(Parsing::Symbol::ptr& s) { _lines->append(s); }
};

The append() method accepts a pointer to a symbol and adds the pointer to the Lines object's children.

At this point, there are two references to the child-directives: that in the new Lines object and that in the original overall Lines. To finish recovering the structure, I remove the contained directives from the original:

sym_children.erase( ibase, ++j );

The ibase iterator is the first element after the open-tag; j is the close-tag. By incrementing j, the argument to erase() is the element after the close-tag; by the magic of reference-counted shared_ptr's, this cleanly deletes the close tag along with removing the extraneous references to the contained directives that are cluttering up the structure.

In glorious ASCII art, this looks like:

lines
  directive
  directive
  open-tag-directive
  open-tag-directive  ^
  directive           |
  close-tag-directive |
  close-tag-directive |

Search up to the first open tag; then find the closing tag.

lines
  directive
  directive
  open-tag-directive
  open-tag-directive  < |
  directive             |
  close-tag-directive   V
  close-tag-directive

Create a new-lines, add the contained directive to it and insert it into position.

lines
| directive
| directive
| open-tag-directive
| open-tag-directive  <
|   new-lines           // picture this as both lines and new-lines
----> directive         // having a pointer to directive
  close-tag-directive <
  close-tag-directive

Then, finally, remove the extra pointers from the lines (which deletes the close-tag).

lines
  directive
  directive
  open-tag-directive
  open-tag-directive  <
    new-lines
      directive
  close-tag-directive

Lather, rinse, repeat.

lines
  directive
  directive
  open-tag-directive
    new-lines-2
      open-tag-directive
        new-lines
          directive

The final result should match the original text's structure:

directive
directive
<a>
  <b>
    directive
  </b>
</a>

Parsing Python-like text

Another task from the same project called for parsing a Python-like domain specific language; again I wanted to use flex and Bison, and again discovered the same algorithm.

Python (and the DSL) uses indentation to structure blocks in the program text. The Python interpreter (and my initial parsing idea) uses synthetic BEGIN and END tokens inserted when the indentation increases or decreases. These are hard to do, however, with flex. So, I once again decided to punt.

I am parsing the whole DSL text as a sequence of lines/statements. The grammar has a rule similar to:

element : assignment
        | ...
        | if_stmt
        | elif_stmt
        | else_stmt
        | for_stmt
        | while_stmt

where "assignment" is something like a traditional variable assignment, the "..."s are other statements, and if, elif, and else are:

if_stmt : IF expression ':' ;

elif_stmt : ELIF expression ':' ;

else_stmt : ELSE ':' ;

How to recover the blocks (or "suites", following Python)? My parsing toolkit includes location for every token read; this includes the file name, line number, and character position within the line, both for the beginning and ending of every token. It also includes the same information for symbols; the locations are updated when a token or symbol is added as a sub-tree of another symbol. For example, the code for the if_stmt rule is:

$$ = push_symbol($1, $2);

In this code, $1 is the IF token and $2 is the expression; push_symbol adds the expression to the list of child sub-trees of the IF token (tokens are symbols, too). This process updates the location for the IF symbol to include the expression; the result will be something like "file 'text-1.txt', line 50, char 4 - char 20".

The final result of Bison is a flat parse tree: the top-level Elements symbol contains a flat sequence of elements. Recovering the indentation-based block structure is the responsibility of the algorithm:

list<Parsing::Symbol::ptr>::reverse_iterator i = find_if( sym_children.rbegin(),
                                                          sym_children.rend(),
                                                          _accepts_suite() );
while (i != sym_children.rend())
{
  //
  // Create a synthetic Suite symbol for the new block
  //
  Suite::ptr suite = Suite::ptr( new Suite(SUITE) );
  //
  // Insert the suite into position
  //
  (*i)->append(suite);
  //
  // First possible statement follows the element i
  //
  Symbol::child_iterator start = ibase();
  //
  // The first statement after the suite is the first with a
  // "shallower" indentation level (or the end of the sequence)
  //
  Symbol::child_iterator end = find_if( start,
                                        sym_children.end(),
                                        _is_shallower(
                                          (*i)->position().beginning().character()
                                          )
                                      );
  //
  // Move the elements into the suite
  //
  for_each( start, end, _append(suite) );
  //
  // ...and remove them from the top-level sequence
  //
  sym_children.erase(start, end);
  //
  // find the next element which accepts a suite
  //
  i = find_if( ++i, sym_children.rend(), _accepts_suite() );
}

[Note to self: add typedefs so that "list<...>::reverse_iterator" isn't needed.]

Once again, there are a small stack of helping functors. In this case, _accepts_suite is a test of whether a symbol is one which accepts a sub-block (If, Elif, Else, etc.) or not (an Assignment, for example):

struct _accepts_suite
{
  bool operator()(const Parsing::Symbol::ptr& p) {
    Accepts_Suite_Predicate_I::ptr p1 =
      dynamic_pointer_cast<Accepts_Suite_Predicate_I>(p);
    return (p1 && p1->accepts_suite());
  }
};

This could be done by a method adaptor, if the accepts_suite predicate were accepted by all Symbols. It isn't; hence the downcast.

The _is_shallower functor is also a closure; it tests each element against a given indentation:

struct _is_shallower
{
  _is_shallower(int indent) : _indent(indent) { }
  int _indent;
  bool operator()( const Parsing::Symbol::ptr& p ) {
    return (p->position().beginning().character() <= _indent);
  }
};

Currently, tabs are treated as being the same width as spaces (i.e. one character). Other choices are possible, and just as bad.

Finally, a functor is used to append the suite's symbols to the synthetic Suite:

struct _append
{
  _append( Parsing::Symbol::ptr suite ) : _suite(suite) { }
  Parsing::Symbol::ptr _suite;
  void operator()( Parsing::Symbol::ptr& symbol ) { _suite->append(symbol); }
};

The result of using this algorithm is that sub-suites are first moved into a suite under their element, then that element is moved into a suite under its element, and so on. The final result is the appropriate parse tree.

One unfortunate result is that nonsensical text is accepted:

else:
 n = 1
elif (foo < 1):
 m = 4

This is handled by the higher-level interpreter, which when seeing garbage like that throws a syntax error. As an alternative, another pass could be made over the parse tree allowing only legitimate structures.

Generalization

All three of these functions could be unified in a higher-level template function, I believe. I have not yet done it, however

gloria i ad inferni
faciamus opus

Return to Top | About this site...
Last edited Sat Aug 8 03:29:10 2009.
Copyright © 2005-2016 Tommy M. McGuire