In the last weeks I’ve been implementing a test environment for the key/value storage I’m nowadays working on. Part of the task was to come up with a simple client-side SQL parser: the original Tarantool protocol is binary, whereas, going forward, I would like to use Random Query Generator  and sysbench for testing and benchmarks. Since I chose Python to program the test driver, the rich (erm, too rich) world of Python parser generators was there to explore.

I was looking for a parser that would:<ul><li>provide a BNF-like language to specify the grammar</li><li>allow me to keep the grammar separate from the code (parsers parse!)</li><li>have a minimal or at least moderate runtime</li><li>be pure python</li><li>be maintained</li></ul>Putting aside LL and LR dispute, parsing performance (for some reason parsing geeks are obsessed with generation performance or the breadth of supported grammars, whereas parser performance is what really matters), C/C++ world provides you with at least two very decent solutions (Bison and Antlr).

Not so in Python.There is more than a dozen of Python parsers, none of them matching all the criteria.


In the end I chose yapps2, a tool that does not give best performance, and (worst of all) is no longer maintained.
But at least the result is clean, and the runtime is moderate.

This is how the resulting grammar looks:
<div class="python" style="font-family: monospace;"><ol><li class="li1"><div class="de1">kostja@shmita:~/work/tarantool/test$ cat lib/sql.g                     </div></li> <li class="li2"><div class="de2">import sql_ast</div></li> <li class="li1"><div class="de1"> </div></li> <li class="li2"><div class="de2">%%</div></li> <li class="li1"><div class="de1"> </div></li> <li class="li2"><div class="de2">parser sql:</div></li> <li class="li1"><div class="de1"> </div></li> <li class="li2"><div class="de2">    ignore:           '\\s+'</div></li> <li class="li1"><div class="de1">    token NUM:        '[0-9]+'</div></li> <li class="li2"><div class="de2">    token ID:         '[a-z_]+[0-9]+' </div></li> <li class="li1"><div class="de1">    token STR:        '"([^\\"]+|\\\\.)*"'</div></li> <li class="li2"><div class="de2">    token PING:       'ping'</div></li> <li class="li1"><div class="de1">    token INSERT:     'insert'</div></li> <li class="li2"><div class="de2">    token UPDATE:     'update'</div></li> <li class="li1"><div class="de1">    token DELETE:     'delete'</div></li> <li class="li2"><div class="de2">    token SELECT:     'select'</div></li> <li class="li1"><div class="de1">    token INTO:       'into'</div></li> <li class="li2"><div class="de2">    token FROM:       'from'</div></li> <li class="li1"><div class="de1">    token WHERE:      'where'</div></li> <li class="li2"><div class="de2">    token VALUES:     'values'</div></li> <li class="li1"><div class="de1">    token SET:        'set'</div></li> <li class="li2"><div class="de2">    token END:        '\\s*$'</div></li> <li class="li1"><div class="de1"> </div></li> <li class="li2"><div class="de2">    rule sql:         (insert {{ stmt = insert }} |</div></li> <li class="li1"><div class="de1">                      update {{ stmt = update }} |</div></li> <li class="li2"><div class="de2">                      delete {{ stmt = delete }} |</div></li> <li class="li1"><div class="de1">                      select {{ stmt = select }} |</div></li> <li class="li2"><div class="de2">                      ping {{ stmt = ping }}) END {{ return stmt }}</div></li> <li class="li1"><div class="de1">                      </div></li> <li class="li2"><div class="de2">    rule insert:      INSERT [INTO] ID VALUES value_list</div></li> <li class="li1"><div class="de1">                      {{ return sql_ast.StatementInsert(ID, value_list) }}</div></li> <li class="li2"><div class="de2">    rule update:      UPDATE ID SET update_list opt_where </div></li> <li class="li1"><div class="de1">                      {{ return sql_ast.StatementUpdate(ID, update_list, opt_where) }}</div></li> <li class="li2"><div class="de2">    rule delete:      DELETE FROM ID  opt_where</div></li> <li class="li1"><div class="de1">                      {{ return sql_ast.StatementDelete(ID, opt_where) }}</div></li> <li class="li2"><div class="de2">    rule select:      SELECT '\*' FROM ID opt_where</div></li> <li class="li1"><div class="de1">                      {{ return sql_ast.StatementSelect(ID, opt_where) }}</div></li> <li class="li2"><div class="de2">    rule ping:        PING</div></li> <li class="li1"><div class="de1">                      {{ return sql_ast.StatementPing() }}</div></li> <li class="li2"><div class="de2">    rule predicate:   ID '=' constant</div></li> <li class="li1"><div class="de1">                      {{ return (ID, constant) }}</div></li> <li class="li2"><div class="de2">    rule opt_where:   {{ return None }}</div></li> <li class="li1"><div class="de1">                      | WHERE predicate</div></li> <li class="li2"><div class="de2">                      {{ return predicate }}</div></li> <li class="li1"><div class="de1">    rule value_list:  '\(' {{ value_list = [] }}</div></li> <li class="li2"><div class="de2">                        (expr {{ value_list.append(expr) }} )+</div></li> <li class="li1"><div class="de1">                      '\)' {{ return value_list }}</div></li> <li class="li2"><div class="de2">    rule update_list: predicate [(',' predicate)+]</div></li> <li class="li1"><div class="de1">    rule expr:        constant {{ return constant }}</div></li> <li class="li2"><div class="de2">    rule constant:    NUM {{ return NUM }} | STR {{ return STR }}</div></li> <li class="li1"><div class="de1">%%</div></li> <li class="li2"><div class="de2"> </div></li> <li class="li1"><div class="de1"># SQL is case-insensitive, but in yapps it's not possible to</div></li> <li class="li2"><div class="de2"># specify that a token must match in case-insensitive fashion.</div></li> <li class="li1"><div class="de1"># This is hack to add re.IGNORECASE flag to all regular</div></li> <li class="li2"><div class="de2"># expressions that represent tokens in the generated grammar.</div></li> <li class="li1"><div class="de1"> </div></li> <li class="li2"><div class="de2">sqlScanner.patterns = map(lambda tup:</div></li> <li class="li1"><div class="de1">                          (tup[0], re.compile(tup[1].pattern, re.IGNORECASE)),</div></li> <li class="li2"><div class="de2">                          sqlScanner.patterns)</div></li> <li class="li1"><div class="de1"> </div></li> <li class="li2"><div class="de2"># vim: nospell syntax=off ts=4 et</div></li></ol></div>