Scraping the grammar for C++ from the ISO/IEC 14882 Specification

This post is a draft. –Ken

This post describes the problem of extracting the C++ grammar from the ISO/IEC 14882 Specification.

“Where is the Spec?”

An official edition of the ISO/IEC 14822 is available for sale on the iso.org website as C++14 (Fourth Edition), C++17 (Fifth Edition), and C++20 (Sixth Edition). Drafts of the Spec can be found in the GitHub repo https://github.com/cplusplus/draft.git. While the drafts are public and freely available, the official editions for the Spec can only be purchased. I should note that while you can extract a grammar for C++ from a draft, there are subtle differences between the drafts and the official editions of the Spec. If you are writing a parser for ISO C++ 14, 17, or 20, you must use the official editions of the Spec. And, if you do not use an official edition, you cannot say that your parser is ISO compliant. In any case, the program I’ve written to scrape the grammar for C++ can be a draft or an official edition in PDF format.

Extracting the CFG from the Spec

In the “old days”, creating a grammar required you to type in by hand the grammar from a journal or book. But, nowadays that’s “old school”. Instead, you open a PDF version of the document in Adobe Reader, find the text of the grammar, and just copy and paste the text to a text file. You would then edit the text to correct the grammar and change whatever is needed to get it to work with a parser generator. Who needs those ancient software developers, right?

The problem is that most programmers are not experts in programming languages, and it is unethical to assume that you are. “Scraping” the grammar this way is slow, error-prone, and not reproducible. Most people don’t record which program was used to read the PDF, and what productions were changed. They don’t even bother to let you know what version of the document they are reading!

Scraping vs. Copy and Paste

The problem we are trying to solve is much harder than just a copy and paste problem. Scraping is defined as:

Identifying where the grammar is located.
Extracting the text of the formal context-free grammar.
Correcting issues in correctness due to OCR problems.
- Incorrect OCR.
- Run-ins.
Correcting errors in the spec.
- Duplicate rules.
- Missing rules.
- Mispellings.
Convert the grammar into the syntax of the parser generator.
Correcting the grammar to compile with the parser generator.
- Identifying the lexer/parser boundary.
- Testing the grammar.
Optimizing the grammar.

Problems with an unmodified scraped grammar

After correcting all the problems with run-ins and spurious newlines in the text, one has to convert rules of the grammar to a legal EBNF.

Some rules in the Spec are in prose:

q-char: any member of the source character set except new-line and "

For raw-string, the end delimiter must be handled with a semantic predicate.

r-char: any member of the source character set, except a right parenthesis ) followed by the initial d-char-sequence (which may be empty) followed by a double quote ".

(The current Cpp grammar does not have the semantic predicate.)

Annex A of the Spec does not define the boundary between the lexer and parser very well.

literal: integer-literal character-literal floating-literal string-literal boolean-literal pointer-literal user-defined-literal

Is literal itself a token from the lexer, or is literal a parser non-terminal? We know from the Antlr4 grammars-v4 cpp grammar where to draw the line, but it hasn’t been done for the preprocessor rules at all.

Optional left recursive rules are not handled by Antlr4

Some rules have to be modified because they are illegal in Antlr4. While left recursion is handled in parser rules in Antlr4, it is flagged as error easily with a few alterations such as parentheses or the ?-operator.

noptr_abstract_declarator : noptr_abstract_declarator ? parameters_and_qualifiers | noptr_abstract_declarator ? '[' constant_expression ? ']' attribute_specifier_seq ? | '(' ptr_abstract_declarator ')' ;

Instead, this must be written as

noptr_abstract_declarator : noptr_abstract_declarator parameters_and_qualifiers | noptr_abstract_declarator '[' constant_expression ? ']' attribute_specifier_seq ? | '(' ptr_abstract_declarator ')' | parameters_and_qualifiers | '[' constant_expression ? ']' attribute_specifier_seq ? ;

A similar problem occurs for attribute_specifier_seq.

Left recursion in lexer rules is not handled by Antlr4

But, after rewriting a rule to be lexer-specific, the rule must be rewritten because Antlr4 does not handle left-recursion in a lexer rule.

error(119): c_plus_plus_spec_draft.g4::: The following sets of rules are mutually left-recursive [Hexadecimal_escape_sequence] and [C_char_sequence] and [Digit_sequence]

Hexadecimal_escape_sequence : '\x' Hexadecimal_digit | Hexadecimal_escape_sequence Hexadecimal_digit ;

C_char_sequence : C_char | C_char_sequence C_char ;

Digit_sequence : Digit | Digit_sequence '\'' ? Digit ;

Antlr does not parse rules renamed to lexer names if RHS contains any parser names

Consider a: b ‘C’; b : ‘B’;

If I want to rename a to A, the resulting grammar will not compile. Renaming is order-dependent. The problem here is that the Antlr grammar differentiates between lexer and parser symbols at a lexer level, and that lexer symbols cannot contain parser symbols even if erroneous. This prevents editing the grammar using tools like Trash that use the official Antl4 grammar: RULE_REF symbols cannot be used on the RHS of of lexer rule since it won’t parse!

Reordering rule alts

This change affected how something is parsed.

Non-greedy decl-specifier-seq

See this change. declSpecifierSeq: declSpecifier+? attributeSpecifierSeq?;

Wrong parse tree after grouping refactoring

The statement rule was scraped from the specification and resulted in a rule with alts in one order, which happened to work. But, this rule was refactored with a grouping, and resulted in a misparse.

C++ and Preprocessor rules are shared

The start rule for the C++ parser is translation_unit, but for the preprocessor it is preprocessing_file. There are rules that are shared between the two start rules, but for Antlr, the lexer must be different. I can compensate by placing the rules for preprocessing in a separate mode, but most lexer rules have to be duplicated between the two modes.

Because they are shared, the lexer rules for the preprocessor need to return types that are identical to the C++ lexer.

C++ grammar is ambigous for ‘#’ non_directive

The grammar has an alt of group_part for ‘#’ non_directive. This alt slows the parser significantly. I have commented it out, but I could fix this rule with maybe an ‘~’-operator.

header_name is either a string literal with double quotes or angle brackets. It turns out that c++config.h in the GNU compiler has a constant_expression that passes an angle bracket to a function. The header_name needs to accept either as a String_literal.

Preprocessing

C++ is different from other languages in many ways. For C++, a preprocessor is part of the Spec. The Cpp grammar in the Antlr4 grammars repo handled preprocessing directives as though comments. But, some code in some of the GNU C++ headers require a preprocessor, otherwise many files fail to parse. The most basic, widely-used sequence that is not specified in the grammar is the backslash-newline sequence, which is specifically mentioned in Section 2.2, Paragraph 2 on how it is to be handled.

String concatenation is another, clearly defined by Spec, preprocessing issue. The grammar should not be changed to deal with this issue because it has introduced many issues in the past.

In the Cpp14.g4, several changes were made to handle this directly by the grammar, but it only added problems. First, primary-expression was changed:

primaryexpression : literal (literal)* | This | '(' expression ')' | idexpression | lambdaexpression ;

Later, it was changed again:

primaryExpression: literal+ | This | LeftParen expression RightParen | idExpression | lambdaExpression;

But, both “hacks” are wrong because it deviates from the spec And, it allows if (1 1) ... as valid input. The original hack was introduced here, and noted in this issue, and change here without explanation, with little review!

Some rules are incomplete. It says in the Spec that and can be used as an alternative to &&. But, logical-and-expression does not contain the alternative. (The Cpp14.g4 grammar was modified here to include the and alternative, but most of the other alternative operators in Table 6 of the Spec were not added.)

Adhoc, scantly-reviewed changes to Cpp14.g4

Over the years, there have been many changes to the grammar. These changes have been allowed for the most part with little review. And, consequently, problems have crept into the grammar.

‘/=’ removed from operators rule

When the grammar was split, the ‘/=’ operator was dropped from Spec rule operators (aka theOperator rule in the Antlr4 grammar). This was done here. You cannot split an Antlr4 grammar reliably by hand: you should use tools.

It was then dropped altogether here.

Duplicate GreaterAssign in theOperator rule added

In this change, GreaterAssign was added twice to the rule. It was subsequently removed here.

Regression of elaborated_type_specifier

In this change, elaborated_type_specifier was changed so that we can no longer parse “friend union” declarations.

union a {
    int b;
    float c;
};

class b
{
    friend union a;
    int d;
};

C++ Preprocessor

The preprocessor is complicated. It requires a bit of support code in the form of an Antlr visitor. The visitor interprets the parse tree from the preprocessor_file rule and converts it to the input for the main parser. The grammar uses C++ expressions, which can result in very large parse trees even for the simple example ‘#if 1‘, and requires a lot of methods in the visitor to be implemented.

The current implementation for the C++ grammar is here.

Changes to Cpp14.g4 over the years

Date	Change	How it fails
4 Jul 2015	Initial revision.
1 Sep 2015	Fix ‘-‘ operator in ‘unaryoperators’ for Issue#212. This commit changed unaryoperator : ‘\|’ \| ‘’ \| ‘&’ \| ‘+’ \| ‘!’ \| ‘~’ ;* to unaryoperator : ‘\|’ \| ‘’ \| ‘&’ \| ‘+’ \| ‘!’ \| ‘~’ \| ‘-‘ ;*, which added the ‘-‘ operator.	While the commit fixes the ‘-‘ operator, the unary-operator rule still contains ‘\|’. which is an error, from the initial revision, and it remains more or less to this day!
3 Oct 2015	Removed Java @header. This change removed a bogus @header from the grammar. @header makes the grammar target specific.
3 Oct 2015	Fixed the pom.xml <packageName> attribute. Package names should be empty in order to be target-independent.
29 Oct 2015	Issue#230. This issue is about whether the grammar handles preprocessing directives. It does not. In order to parse C++ files with preprocessor directives, the suggested solution is “cc -E t.c will get the actual C++ input :)”.	The suggested solution uses the C preprocessor, not the C++ preprocessor. They two are not the same. See this explanation.
16 Dec 2015	Change ‘purespecifier’ rule; Issue#249. In the Spec, we have pure-specifier -> = 0. In the initial revision, this was implemented as purespecifier : Assign val = Integerliteral {if($val.text.compareTo(“0”)!=0) throw new InputMismatchException(this);} because the string literal ‘0’ conflicted with Integerliteral. This commit modifies the rule to purespecifier : Assign val = Octalliteral {if($val.text.compareTo(“0”)!=0) throw new InputMismatchException(this);} in order to parse static member initializations.	The commit breaks parsing pure virtual function declarations because Octalliteral never matches ‘0’: Octalliteral occurs after Integerliteral. The change has remained even though it is faulty.
24 Jun 2016	CPP14.g4 only compatible with Java ANTLR #415.	The issue has never been fixed.
24 Jun 2016	“Issue with other Antlr targets and pure-specifier”; part of PR416, reported in #415. The commit only changes the comments in the grammar.	The change does not fix target-independence.
14 Sep 2016	“Add lexer support for multiline macros in C++” with this hack. This commit adds the lexer rule MultiLineMacro : ‘#’ (.? ‘\\’ [\r\n]+)+ ~[\r\n]+ -> skip;*.	Preprocessing is not really implemented by the grammar. But, in addition, multi-line macros are not ISO C++ 2014.
12 Nov 2016	CPP14.g4 bugfix – relational ops in template arguments. This change reorders alts from templateargument : constantexpression \| typeid \| idexpression; to templateargument : typeid \| constantexpression \| idexpression ;. in order to parse “void TestFunc(vector<ClassA>args, vector<ClassB> args2)” correctly.
9 Dec 2016	“bugfix – multiline macro parse consumes all remaining input”. This commit changes MultiLineMacro : ‘#’ (.? ‘\\’ [\r\n]+)+ ~[\r\n]+ -> skip; Directive : ‘#’ ~[\r\n]* -> skip;* changed to MultiLineMacro : ‘#’ (~[\n]? ‘\\’ ‘\r’? ‘\n’)+ ~[\n]+ -> channel(HIDDEN) ; Directive : ‘#’ ~[\n]* -> channel(HIDDEN) ;*.	On a Mac, the end of line is just a ‘\r’, so this change breaks parsing for Mac files.
23 Jun 2017	Fix Rule Names Conflicts. This commit is a symbol name change. E.g., typeid conflicts with one of the targets.
26 Aug 2018	Fix duplicated Typeid() function declarations problem. This commit is a symbol name change, Typeid => typeidofthetypeid.	Most name changes now just append an underscore ‘_’ to the end of the name. But, the Go target has been fixed, so the rename is no longer required.
26 Dec 2018	CPP grammar does not parse. This commit is a symbol name change for conflicts in the C++ target.	Huge changes in the grammar were for mostly reformatting! Those changes should have gone into a separate commit. It’s hard to verify whether additional changes beyond renaming were made.
26 Dec 2018	partial fix for CPP14.g4 Stringliteral concatenation. This commit changes literal : … \| Stringliteral \| … to literal : … \| Stringliteral+ \| ….	This change was probably ok, but it was later backed out without any explanation as part of a larger/unrelated commit. But, the Spec states clearly that string concatenation is performed by the preprocessor. This is the wrong place for implementing string concatenation.