Parsing with Augmented Transition Networks and computing lookahead

The next step in the development of my LSP server for Antlr involves support for code completion. There is an API written by Mike Lischke that computes the lookahead sets from the parser tables and input, but it turns out to have a bug in some situations. So, I’ll now describe the algorithms I wrote to compute the lookahead set for Antlr parsers. I’ll start first with a few definitions, then go over the three algorithms that do the computation. An implementation of this is available in C# in the Antlrvsix extension for code completion and another is provided by default in the Antlr Net Core templates I wrote for syntax error messages.

What is the lookahead set, and what is available currently to compute it?

The lookahead set is the set of token types that could be inserted at an index in the text that is being parsed. If you know this set, you can compute code completions in a text editor for the grammar or output meaningful error messages in a parse error.

Let’s take a simple Antlr grammar, the well-known Expr example.

Now consider what could be added at the line “// Insert here…”. We could add:

a new parser rule (in git), which starts with a RULE_REF;
a new lexer rule (in git), which starts with a TOKEN_REF;
a new parser rule with a leading DOC_COMMENT (in git);
a catch clause (in git), which starts with the keyword ‘catch’;
a finally clause (in git), which starts with the keyword ‘finally’;

There are, of course, other possibilities. If you try to place a “.” at the point of insertion, the Antlr Tool reports the following error message:

error(50): C:\Users\kenne\Documents\expr\E6.g4:26:0: syntax error: ‘.’ came as a complete surprise to me

This is fine, but it doesn’t give you an idea of what you should be adding at the insertion point. If you write a parser using the Antlr grammars itself and use the default error reporter in the Antlr runtime, you get these error messages:

line 26:0 mismatched input ‘.’ expecting {, ‘mode’}

This is a little weird because there is no symbol before the comma in the set described in the error message. Beyond the fact that you cannot use “modes” in a unified parser/lexer grammar, ‘mode’ is possible in a lexer grammar (lexer and parser grammars). To its credit, the documentation for the default error reporter in the Antlr runtime does note that this routine does not compute error messages correctly.

However, Lischke’s Code Completion Core should report the range of possibilities I describe above. But it instead reports that it expects input of ‘-2’ (epsilon), ‘catch’, and ‘finally’, which is only partly true.

I now describe a way to compute the lookahead.

Augmented Transition Network

An Augmented Transition Network (ATN) is defined as follows. Given predicated grammar G = (N, T, P, S, Î , M),

N is the set of nonterminals (rule names)
T is the set of terminals (tokens)
P is the set of productions
S â N is the start symbol
Î is a set of side-effect-free semantic predicates
M is a set of actions (mutators)

the corresponding ATN M_G = (Q, Î£, â, E, F) has elements:

Q is the set of states
Î£ is the transition alphabet N âª T âª Î âª M
â is the transition relation mapping Q Ã— (Î£ âª Îµ) â’ Q, denoted by the triple <q_from, Ï, q_to>
E = {p_A | A â N} is the set of submachine entry states
F = {pâ€™_A | A â N} is the set of submachine final states

An edge is a triple <u, Ï , v> â â. For an input string t₀ t₁ t₂ â€¦ t_n, a path is a sequence of edges [ <s₀, Ï₀, s₁> < s₁, Ï₁, s₂> < s₂, Ï₂, s₃> <s₃, Ï₃, s₄ > … ] where Ï₀ Ï₁ Ï₂ … = t₀ t₁ t₂ â€¦ t_n, and where the final edge of the path is not an epsilon transition. The input does not have to be a complete string for the grammar, so it is possible for the last state in the path to be a non-final state.

Parsing in an ATN

The ATN parse is computed using the parse() function described below. It requires the input string, the current position to consider in the input, and the edge that matched the last input. For a lack of information, I implemented a basic backtracking parser. The function calls itself recursively for each transition that it considers. The result of the parse() function is a set of paths that match the input from this point in the input and state, or null if there is no match. Each path in the returned set is an acceptable path through the ATN that matches the input string from the current position in the input. The parser consumes the entire input string, and if the input is unacceptable, null is returned.

Closure of an ATN

After computing an ATN parse, parses are represented as a set of paths. Each path terminates in a final state. At this point, we are interested in what valid token could occur. The closure() function computes the set of states that are reachable via an Îµ-transition in the ATN. If the final state is a stop-state of the ATN, it’s possible that the parse could continue from a “follow” state of a rule transition in the path.

Computation of lookahead sets

Finally, we now define the function LASet() to compute the lookahead set for a particular closure of a path. The idea here is to find all non-Îµ-transitions in the ATN from the closure() of the last state of a path. If the closure includes the final state of the submachine, then go to the “calling atn submachine”, and compute the closure of the follow-state of the rule transition. The reason why to use the closure of the calling submachine is because the called submachine can return from a final state and continue the parse from the follow state of a rule transition. For an Antlr grammar, after the semicolon, a catch or finally clause could be parsed. But, because the catch and finally rules could be empty, the rule could end in the final state for the rule submachine.