Up to now, I have been making an assumption for the Piggy transformation system that regular expressions and NFA recognizers can be used to match arbitrary ASTs. After all, a lot of research has been published for tree recognizers, including XPath, TreeRegex, and Tregex, which are NFA based. With all this, I lost sight of the problem of trying to match subtrees within an AST. Unfortunately, pattern matching of ASTs is not regular, and so cannot be recognized by an NFA. This blog entry goes over why NFAs cannot be used, and how Piggy patterns should work.
XPath is regular; AST grammars are not regular
At issue is the difference between a grammar that recognizes a path through a tree vs. a grammar that recognizes an entire tree.
XPath is a language that matches a particular path of a node through the tree from the root. A grammar that models the abbreviated form of an XPath expression is:
s -> ‘/’ ID s | ‘/’ ID ;
This grammar is regular because there is no production that has a non-terminal symbol in the middle of the right-hand side.
ASTs, as output from a Piggy serializer, is represented as a parenthesized expression. A grammar that recognizes an serialized AST is:
s -> ( ID s* ) ;
Ignoring the fact that this grammar uses the Kleene closure on a non-terminal, this grammar is not regular because the production contains a non-terminal in the middle of the right-hand side.
The conclusion is that Piggy expressions cannot be regular expressions, and they cannot be recognized by an NFA.
The “…” operator problem
In addition to the above problem, Piggy implicitly adds an operator to accept any tree nodes that are “irrelevant” to a matching pattern. This operator is used in Coccinelle, and it used to match arbitrary code [Padioleau 2006, 2007] when the pattern needs to refer to two or more non-adjacent code blocks separated by code that is “irrelevant“. Piggy offers a similar operator but differs in a number of ways.
In Piggy, the “…” operator is equivalent to “.*”, where “.” refers to an AstParserParser.NodeContext or AstParserParser.AttrContext, and “*” is the Kleene operator (see grammar for node and attr ASTs in Piggy). This contrasts with Coccinelle, where “…” can refer to any code path, including ones that are nested within another block. In Piggy, the irrelevant code operator is inserted automatically in the user pattern, before every
AstParserParser.NodeContext or AstParserParser.AttrContext. The reason for this is so the user doesn’t have to explicitly add it, which helps the readability of a pattern. In Coccinelle, the operator is explicit.
Unfortunately, adding the irrelevant code operator “…” to a pattern results in a pattern that is ambiguous, meaning there are multiple parse trees for a given input. How does the recognizer work and resolve the ambiguity in patterns? For an NFA, I ordered the edges of the NFA and performed a greedy pattern match. Unfortunately, this isn’t going to work for a CFG recognizer.
Examples with the irrelevant operator
Piggy pattern: ( classBodyDeclaration ( modifier )* ( memberDeclaration ( methodDeclaration ) ) )
Explicitly stated, the above pattern is equivalent to the following pattern:
( classBodyDeclaration … ( modifier … )* … ( memberDeclaration … ( methodDeclaration … ) … ) … )
For input “( classBodyDeclaration ( modifier ) (memberDeclaration ( methodDeclaration ) ) )”, the first “…” will match the empty string if a greedy pattern matcher is employed.
For input ” ( classBodyDeclaration (memberDeclaration ( methodDeclaration ) ) )”, the pattern … ( modifier … )* … will match the empty string because the “(memberDeclaration etc)” matches the pattern
( memberDeclaration … ( methodDeclaration … ) … ) …
In any case, the irrelevant operator is essential for Piggy. But, adding it makes the resulting pattern highly ambiguous.
What should patterns in Piggy look like?
At this point, I need to re-evaluate what a Piggy pattern looks like. There are really three different possibilities for a pattern:
- a collection of visitor or listener functions;
- a collection of grammar rules with integrated text and code blocks;
- a parse string with integrated text and code blocks.
Let’s examine each of these alternatives.
Piggy patterns as AST visitors and listeners
With this method, Piggy patterns are simply text and code blocks that are executed via an AST traversal. The pattern matcher traverses the tree using a pre- or post-order traversal, which can be specified for the pattern.
The main issue here is that the pattern is now specified as a code predicate that indicates a match. The LHS symbol of the pattern, “ast”, means that the code is executed when an “ast” node type is traversed.
// Pattern to recognize and operate on an enum definition from the AST.
ast -> {{ if (tree.ID == "enumspecifier"
&& tree.Child(0).ID == "enumhead"
&& tree.Child(0).Child(0).ID == "enumkey"
&& tree.Child(0).Child(1).ID == "TOKEN")
{
...
return true; // match
}
There are several other issues specifying patterns in this manner. One issue is that in order to note that “return true;” is used to note to the engine that the predicate matches. Another issue is that the runtime must provide appropriate routines to skip nodes for the irrelevant operator. In the above example, Child(0) means to match exactly the first child of a node. In all likelihood, the pattern will require for-loops, or Linq expressions, to scan the children. Lastly, it is difficult to visualize the subtree structure based on C# code. That was always the reason for a parenthesized expression in Piggy for patterns.
Piggy patterns integrated into the Antlr grammar
Another possibility of expressing Piggy patterns is to use “action”-like items in the grammar for the tree, as used in YACC, or Antlr. Here, we need to distinguish the grammar for ASTs and the grammar used in the serialized parse, e.g., C++, because we want to write patterns in the syntax of the serialized AST.
enumhead
: enumkey attributespecifierseq? Identifier?
{{ if (Identifer != null) { ... } }}
enumbase?
;
The advantage of this method is that it works in much the same was as YACC or Antlr actions. The main disadvantage is that the rules much match exactly the syntax within the original grammar, so must be carefully duplicated from the grammar.
Antlr Tree Patterns
Antlr version 4 has a pattern matching mechanism to parse input according to a grammar and fill in placeholders for nodes of the parse tree. This feature appears useful for matching Piggy patterns, but it seems to have several shortcomings.
For example, consider the following code that takes a pattern and a Piggy AST, then tries to match the pattern against the tree.
private void Fun(string pat, string ast_string)
{
var ast_stream = CharStreams.fromstring(ast_string);
var ast_lexer = new AstLexer(ast_stream);
var ast_tokens = new CommonTokenStream(ast_lexer);
var ast_parser = new AstParserParser(ast_tokens);
ast_parser.BuildParseTree = true;
var listener = new ErrorListener<IToken>();
ast_parser.AddErrorListener(listener);
IParseTree ast_tree = ast_parser.ast();
if (listener.had_error) throw new Exception();
_ast = ast_tree;
var lexer = new AstLexer(null);
var parser = new AstParserParser(new CommonTokenStream(lexer));
var ptpm = new ParseTreePatternMatcher(lexer, parser);
var re = ptpm.Compile(pat, AstParserParser.RULE_node);
var c = ast_tree.GetChild(0);
var b = re.Matches(c);
System.Console.WriteLine(b);
}
Antlr parse tree matching works well if the tree you are trying to match fits exactly in the number and types of children for each node, including patterns that have sub-trees.
Fun("( <ID> <attr> <node> )", @"( u a=""1"" ( v ) )"); => true
Fun("( <ID> )", @"( u )"); => true
Fun("( <ID> ( <ID> ) )", @"( u ( v ) )"); => true
However, Antlr tree patterns do not work for patterns that contain an irrelevant operator, since there is no way to specify EBNF within the pattern.
Fun("( <ID> <attr> )", @"( u a=""1"" ( v ) )"); => false
Tagged nodes must be a left-hand symbol of a rule. Again, there is no way to use EBNF in a parse string pattern.
Fun("( <ID> <node | empty> )", @"( u ( v ) )"); => throws exception.
Parse tree patterns might be more useful if tagged items could contain the irrelevant operator and BNF syntax.
Conclusions
Some lessons are hard to learn. Piggy patterns cannot be expressed in XPath-like syntax because they need to express a whole subtree. We have looked at a number of ways to express patterns. The most promising seems to be something like Antlr’s parse tree patterns, perhaps even in code pre-serialized. But, it’s unclear how these would look in Piggy.
References and Further Reading
Padioleau, Y., Hansen, R.R., Lawall, J.L. and Muller, G., 2006, October. Semantic patches for documenting and automating collateral evolutions in Linux device drivers. In Proceedings of the 3rd workshop on Programming languages and operating systems: linguistic support for modern operating systems (p. 10). ACM.
Padioleau, Y., Lawall, J.L. and Muller, G., 2007. SmPL: A domain-specific language for specifying collateral evolutions in Linux device drivers. Electronic Notes in Theoretical Computer Science, 166, pp.47-62.
https://regular-expressions.mobi/lookaround.html?wlr=1
https://www.regular-expressions.info/lookaround.html
https://nlp.stanford.edu/software/tregex.html
https://pdfs.semanticscholar.org/0779/c4686c66dbbf1d1ccf41c49cea4c75b73d05.pdf