{"id":2449,"date":"2020-01-06T10:06:02","date_gmt":"2020-01-06T15:06:02","guid":{"rendered":"http:\/\/codinggorilla.com\/?p=2449"},"modified":"2020-01-23T14:51:49","modified_gmt":"2020-01-23T19:51:49","slug":"parsing-with-augmented-transition-networks-and-computing-the-lookahead","status":"publish","type":"post","link":"http:\/\/165.227.223.229\/index.php\/2020\/01\/06\/parsing-with-augmented-transition-networks-and-computing-the-lookahead\/","title":{"rendered":"Parsing with Augmented Transition Networks and computing lookahead"},"content":{"rendered":"\n<p>The next step in the development of my LSP server for Antlr involves support for code completion. There is an <a href=\"https:\/\/github.com\/mike-lischke\/antlr4-c3\" class=\"ek-link\">API<\/a> written by Mike Lischke that computes the lookahead sets from the parser tables and input, but it turns out to have a <a href=\"https:\/\/github.com\/mike-lischke\/antlr4-c3\/issues\/37\" class=\"ek-link\">bug<\/a> in some situations. So, I&#8217;ll now describe the algorithms I wrote to compute the lookahead set for Antlr parsers. I&#8217;ll start first with a few definitions, then go over the three algorithms that do the computation. An implementation of this is <a href=\"https:\/\/github.com\/kaby76\/AntlrVSIX\/blob\/master\/LanguageServer\/LASets.cs\" class=\"ek-link\">available in C# in the Antlrvsix extension<\/a> for code completion and another is provided by default in the <a href=\"https:\/\/github.com\/kaby76\/Antlr4BuildTasks\/tree\/master\/Templates\" class=\"ek-link\">Antlr Net Core templates I wrote<\/a> for syntax error messages.<\/p>\n\n\n\n<p><!--more--><\/p>\n\n\n\n<h2>What is the lookahead set, and what is available currently to compute it?<\/h2>\n\n\n\n<p>The lookahead set is the set of token types that could be inserted at an index in the text that is being parsed. If you know this set, you can compute code completions in a text editor for the grammar or output meaningful error messages in a parse error.<\/p>\n\n\n\n<p>Let&#8217;s take a simple Antlr grammar, the well-known <a href=\"https:\/\/github.com\/mike-lischke\/antlr4-c3\/blob\/master\/ports\/c%23\/test\/Antlr4CodeCompletion.CoreUnitTest\/Grammar\/Expr.g4\" class=\"ek-link\">Expr<\/a> example.<\/p>\n\n\n\n\n\n\n\n<p>Now consider what could be added at the line &#8220;\/\/ Insert here&#8230;&#8221;. We could add:<\/p>\n\n\n\n<ul><li>a new parser rule (<a href=\"https:\/\/github.com\/kaby76\/AntlrExprExamples\/blob\/master\/E1.g4\" class=\"ek-link\">in git<\/a>), which starts with a RULE_REF;<\/li><li>a new lexer rule (<a href=\"https:\/\/github.com\/kaby76\/AntlrExprExamples\/blob\/master\/E5.g4\" class=\"ek-link\">in git<\/a>), which starts with a TOKEN_REF;<\/li><li>a new parser rule with a leading DOC_COMMENT (<a href=\"https:\/\/github.com\/kaby76\/AntlrExprExamples\/blob\/master\/E2.g4\" class=\"ek-link\">in git)<\/a>;<\/li><li>a catch clause (<a href=\"https:\/\/github.com\/kaby76\/AntlrExprExamples\/blob\/master\/E3.g4\" class=\"ek-link\">in git<\/a>), which starts with the keyword &#8216;catch&#8217;;<\/li><li>a finally clause (<a href=\"https:\/\/github.com\/kaby76\/AntlrExprExamples\/blob\/master\/E4.g4\" class=\"ek-link\">in git<\/a>), which starts with the keyword &#8216;finally&#8217;;<\/li><\/ul>\n\n\n\n<p>There are, of course, other possibilities. If you <a href=\"https:\/\/github.com\/kaby76\/AntlrExprExamples\/blob\/master\/E6.g4\" class=\"ek-link\">try to place a &#8220;.&#8221;<\/a> at the point of insertion, the Antlr Tool reports the following error message:<\/p>\n\n\n\n<p>error(50): C:\\Users\\kenne\\Documents\\expr\\E6.g4:26:0: syntax error: &#8216;.&#8217; came as a complete surprise to me<\/p>\n\n\n\n<p>This is fine, but it doesn&#8217;t give you an idea of what you should be adding at the insertion point. If you write a parser using the Antlr grammars itself and use the default error reporter in the Antlr runtime, you get these error messages:<\/p>\n\n\n\n<p>line 26:0 mismatched input &#8216;.&#8217; expecting {, &#8216;mode&#8217;}<\/p>\n\n\n\n<p>This is a little weird because there is no symbol before the comma in the set described in the error message. Beyond the fact that you <a href=\"https:\/\/github.com\/kaby76\/AntlrExprExamples\/blob\/master\/E7.g4\" class=\"ek-link\">cannot use &#8220;modes&#8221; in a unified parser\/lexer grammar<\/a>, &#8216;mode&#8217; is possible in a lexer grammar (<a href=\"https:\/\/github.com\/kaby76\/AntlrExprExamples\/blob\/master\/LexerE7.g4\" class=\"ek-link\">lexer<\/a> and <a href=\"https:\/\/github.com\/kaby76\/AntlrExprExamples\/blob\/master\/ParserE7.g4\" class=\"ek-link\">parser<\/a> grammars). To its credit, the documentation for the <a href=\"https:\/\/www.antlr.org\/api\/Java\/org\/antlr\/v4\/runtime\/ConsoleErrorListener.html\" class=\"ek-link\">default error reporter<\/a> in the Antlr runtime does note that this routine does not compute error messages correctly.<\/p>\n\n\n\n<p>However, Lischke&#8217;s Code Completion Core should report the range of possibilities I describe above. But it instead reports that it expects input of &#8216;-2&#8217; (epsilon), &#8216;catch&#8217;, and &#8216;finally&#8217;, which is only partly true.<\/p>\n\n\n\n<p>I now describe a way to compute the lookahead.<\/p>\n\n\n\n<h2>Augmented Transition Network <\/h2>\n\n\n\n<p>An Augmented Transition Network (ATN) is defined as follows. Given predicated grammar G = (N, T, P, S, \u00ce\u00a0, M),<\/p>\n\n\n\n<ul><li>N is the set of nonterminals (rule names)<\/li><li>T is the set of terminals (tokens) <\/li><li>P is the set of productions<\/li><li>S \u00e2\u0088\u0088 N is the start symbol<\/li><li>\u00ce\u00a0 is a set of side-effect-free semantic predicates<\/li><li>M is a set of actions (mutators)<\/li><\/ul>\n\n\n\n<p>the corresponding ATN M<sub>G<\/sub> = (Q, \u00ce\u00a3, \u00e2\u0088\u0086, E, F) has\nelements:<\/p>\n\n\n\n<ul><li>Q is the set of states<\/li><li>\u00ce\u00a3 is the transition alphabet N \u00e2\u0088\u00aa T \u00e2\u0088\u00aa \u00ce\u00a0 \u00e2\u0088\u00aa M<\/li><li>\u00e2\u0088\u0086 is the transition relation mapping Q \u00c3\u2014 (\u00ce\u00a3 \u00e2\u0088\u00aa \u00ce\u00b5) \u00e2\u0086\u2019 Q, denoted by the triple &lt;q<sub>from<\/sub>, \u00cf\u0083, q<sub>to<\/sub>&gt;<\/li><li>E = {p<sub>A<\/sub> | A \u00e2\u0088\u0088 N} is the set of submachine entry states<\/li><li>F = {p\u00e2\u20ac\u2122<sub>A<\/sub> | A \u00e2\u0088\u0088 N} is the set of submachine final states<\/li><\/ul>\n\n\n\n<p>An edge is a triple &lt;u,  \u00cf\u0083 , v&gt; \u00e2\u0088\u0088 \u00e2\u0088\u0086. For an input string t<sub>0<\/sub> t<sub>1<\/sub> t<sub>2<\/sub> \u00e2\u20ac\u00a6 t<sub>n<\/sub>, a path is a sequence of edges [ &lt;s<sub>0<\/sub>, \u00cf\u0083<sub>0<\/sub>, s<sub>1<\/sub>&gt; &lt; s<sub>1<\/sub>, \u00cf\u0083<sub>1<\/sub>, s<sub>2<\/sub>&gt; &lt; s<sub>2<\/sub>, \u00cf\u0083<sub>2<\/sub>, s<sub>3<\/sub>&gt; &lt;s<sub>3<\/sub>, \u00cf\u0083<sub>3<\/sub>, s<sub>4<\/sub> &gt; &#8230; ] where     \u00cf\u0083<sub>0<\/sub> \u00cf\u0083<sub>1<\/sub> \u00cf\u0083<sub>2<\/sub> &#8230; =  t<sub>0<\/sub> t<sub>1<\/sub> t<sub>2<\/sub> \u00e2\u20ac\u00a6 t<sub>n<\/sub>, and where the final edge of the path is not an epsilon transition. The input does not have to be a complete string for the grammar, so it is possible for the last state in the path to be a non-final state.<\/p>\n\n\n\n<h2>Parsing in an ATN<\/h2>\n\n\n\n<p>The ATN parse is computed using the parse() function described below. It requires the input string, the current position to consider in the input, and the edge that matched the last input. For a lack of information, I implemented a basic backtracking parser. The function calls itself recursively for each transition that it considers. The result of the parse() function is a set of paths that match the input from this point in the input and state, or null if there is no match. Each path in the returned set is an acceptable path through the ATN that matches the input string from the current position in the input. The parser consumes the entire input string, and if the input is unacceptable, null is returned.<\/p>\n\n\n\n\n\n\n\n<h2>Closure of an ATN<\/h2>\n\n\n\n<p>After computing an ATN parse, parses are represented as a set of paths. Each path terminates in a final state. At this point, we are interested in what valid token could occur. The closure() function computes the set of states that are reachable via an \u00ce\u00b5-transition in the ATN. If the final state is a stop-state of the ATN, it&#8217;s possible that the parse could continue from a &#8220;follow&#8221; state of a rule transition in the path.<\/p>\n\n\n\n\n\n\n\n<h2>Computation of lookahead sets<\/h2>\n\n\n\n<p>Finally, we now define the function LASet() to compute the lookahead set for a particular closure of a path. The idea here is to find all non-\u00ce\u00b5-transitions in the ATN from the closure() of the last state of a path. If the closure includes the final state of the submachine, then go to the &#8220;calling atn submachine&#8221;, and compute the closure of the follow-state of the rule transition. The reason why to use the closure of the calling submachine is because the called submachine can return from a final state and continue the parse from the follow state of a rule transition. For an Antlr grammar, after the semicolon, a catch or finally clause could be parsed. But, because the catch and finally rules could be empty, the rule could end in the final state for the rule submachine.<\/p>\n\n\n\n\n","protected":false},"excerpt":{"rendered":"<p>The next step in the development of my LSP server for Antlr involves support for code completion. There is an API written by Mike Lischke that computes the lookahead sets from the parser tables and input, but it turns out to have a bug in some situations. So, I&#8217;ll now describe the algorithms I wrote &hellip; <\/p>\n<p class=\"link-more\"><a href=\"http:\/\/165.227.223.229\/index.php\/2020\/01\/06\/parsing-with-augmented-transition-networks-and-computing-the-lookahead\/\" class=\"more-link\">Continue reading<span class=\"screen-reader-text\"> &#8220;Parsing with Augmented Transition Networks and computing lookahead&#8221;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[],"tags":[],"_links":{"self":[{"href":"http:\/\/165.227.223.229\/index.php\/wp-json\/wp\/v2\/posts\/2449"}],"collection":[{"href":"http:\/\/165.227.223.229\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/165.227.223.229\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/165.227.223.229\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"http:\/\/165.227.223.229\/index.php\/wp-json\/wp\/v2\/comments?post=2449"}],"version-history":[{"count":0,"href":"http:\/\/165.227.223.229\/index.php\/wp-json\/wp\/v2\/posts\/2449\/revisions"}],"wp:attachment":[{"href":"http:\/\/165.227.223.229\/index.php\/wp-json\/wp\/v2\/media?parent=2449"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/165.227.223.229\/index.php\/wp-json\/wp\/v2\/categories?post=2449"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/165.227.223.229\/index.php\/wp-json\/wp\/v2\/tags?post=2449"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}