Columbia University - COMS W4115 – Programming Languages & Translators
Mini-C Compiler Front-End in ANTLR
Build the front-end of a compiler for a small C-like language (Mini-C) using ANTLR (ANother Tool for Language Recognition). This involves defining the grammar, generating a lexer and parser, and building an Abstract Syntax Tree (AST). Optionally, implement a symbol table and basic semantic error reporting.
Our Suggested Approach
1. Define the Mini-C Language Subset:
- Specify the exact features: e.g., basic types (int, char), variable declarations, arithmetic/boolean expressions, if-else statements, while loops, function declarations and calls, basic I/O (printf-like).
- Keep it simple initially, then extend.
2. Write the ANTLR Grammar (.g4
file):
- Lexer Rules (Tokens): Define regular expressions for keywords (
if
,else
,while
,int
,char
), identifiers, integer literals, character literals, operators (+
,-
,*
,/
,=
,==
,<
,>
), punctuation (;
,(
,)
,{
,}
).Whitespace and comments are often skipped.
- Parser Rules (Syntax): Define context-free grammar rules for language constructs like
program
,declaration
,statement
,expression
,function_definition
, etc. These rules describe how tokens can be combined to form valid program structures.
- Resolve ambiguities (e.g., dangling else) if any, using ANTLR's mechanisms or grammar refactoring.
3. Generate Lexer and Parser with ANTLR:
- Use the ANTLR tool (JAR file or plugin) to process your
.g4
grammar file. This generates Java (or other target language) source files for the lexer and parser (e.g.,MiniCLexer.java
,MiniCParser.java
).
4. Build the Abstract Syntax Tree (AST):
- The parser generated by ANTLR creates a Parse Tree (Concrete Syntax Tree) by default.
- To build a more abstract and convenient AST:
- Visitor Pattern: Create a custom visitor class that extends the ANTLR-generated base visitor (e.g.,
MiniCBaseVisitor.java
).
- Override
visitRuleName()
methods (e.g.,visitExpression()
,visitIfStatement()
).
- In each overridden method, create custom AST node objects (e.g.,
ExpressionNode
,IfNode
) representing the language construct. Populate these AST nodes by visiting their children in the parse tree.
- The AST should capture the essential structure and meaning of the program, omitting redundant syntactic details (like parentheses if operator precedence is handled).
5. (Optional) Symbol Table:
- Purpose: Stores information about identifiers (variables, functions) like their type, scope, and other attributes.
- Implementation: Can be a hash map or a list of hash maps to handle nested scopes.
- Population: Populate the symbol table during an AST traversal (e.g., using another visitor). When a declaration is encountered, add the identifier to the symbol table for the current scope.
6. (Optional) Semantic Error Reporting (Basic Type Checking):
- Purpose: Check for errors that are syntactically correct but semantically wrong (e.g., type mismatches in expressions, use of undeclared variables).
- Implementation: Traverse the AST (and use the symbol table).
- For expressions: Check if operand types are compatible with the operator.
- For variable usage: Check if the variable has been declared in the current or an enclosing scope.
- For function calls: Check if the function is declared and if the number/types of arguments match.
- Report errors with meaningful messages and line numbers (ANTLR parse tree nodes contain location information).
7. Driver Program:
- Write a main program that takes a Mini-C source file as input.
- Creates an ANTLR input stream and lexer.
- Creates a token stream and parser.
- Invokes the parser's starting rule (e.g.,
program()
) to get the parse tree.
- Invokes your AST building visitor on the parse tree.
- (Optionally) Invokes semantic analysis visitor/pass on the AST.
- Prints the AST (for debugging) or reports semantic errors.
Illustrative Code Snippet
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69
// MiniC.g4 (Simplified ANTLR Grammar Example) grammar MiniC; // Parser Rules (starting with the entry point) program: declaration* statement* EOF; declaration: type IDENTIFIER ';' ; type: 'int' | 'char'; statement: expressionStatement | blockStatement | ifStatement | whileStatement ; expressionStatement: expression ';' ; blockStatement: '{' statement* '}' ; ifStatement: 'if' '(' expression ')' statement ('else' statement)? ; whileStatement: 'while' '(' expression ')' statement ; expression: assignmentExpr | additiveExpr ; // ... more expression rules for precedence (e.g., primary, multiplicative) ... assignmentExpr: IDENTIFIER '=' expression ; additiveExpr: primaryExpr (('+' | '-') primaryExpr)* ; // Simplified primaryExpr: IDENTIFIER | INT_LITERAL | '(' expression ')' ; // Lexer Rules IF: 'if'; ELSE: 'else'; WHILE: 'while'; INT: 'int'; CHAR: 'char'; IDENTIFIER: [a-zA-Z_] [a-zA-Z_0-9]*; INT_LITERAL: [0-9]+; // CHAR_LITERAL: '\'' . '\''; // Basic, needs more for escapes ASSIGN: '='; PLUS: '+'; // ... other operators ... LPAREN: '('; RPAREN: ')'; LBRACE: '{'; RBRACE: '}'; SEMICOLON: ';'; WS: [ \t\r\n]+ -> skip; // Skip whitespace COMMENT: '//' .*? '\n' -> skip; // Skip single-line comments // --- Conceptual Java AST Visitor (Not ANTLR grammar) --- /* class MyASTBuilder extends MiniCBaseVisitor<ASTNode> { @Override public ASTNode visitIfStatement(MiniCParser.IfStatementContext ctx) { ExpressionNode condition = (ExpressionNode) visit(ctx.expression()); StatementNode thenBranch = (StatementNode) visit(ctx.statement(0)); StatementNode elseBranch = (ctx.statement(1) != null) ? (StatementNode) visit(ctx.statement(1)) : null; return new IfNode(condition, thenBranch, elseBranch, ctx.start.getLine()); } // ... override other visit methods ... } */
Explanation & Key Points
ANTLR Grammar (.g4
file)
This is the heart of the front-end. It defines the language's syntax. Lexer rules (tokens) are defined using regular expressions for keywords, identifiers, literals, operators, etc. Parser rules define how these tokens combine to form valid language constructs (like statements, expressions, declarations) using a context-free grammar notation.
Lexer and Parser Generation
The ANTLR tool processes the .g4
grammar file and automatically generates Java (or another target language) code for a lexer (which breaks input text into tokens) and a parser (which checks if the token sequence conforms to the grammar rules and builds a parse tree).
Parse Tree vs. Abstract Syntax Tree (AST)
The parser initially produces a Parse Tree (or Concrete Syntax Tree), which directly reflects the grammar rules, including all punctuation and intermediate non-terminals. An AST is a more abstract, simplified tree representation that captures the essential structure and semantics of the program, omitting unnecessary syntactic details. ASTs are generally more convenient for later compiler phases like semantic analysis or code generation.
Visitor Pattern for AST Construction
ANTLR provides a Visitor pattern implementation. You create a custom visitor class by extending the base visitor generated by ANTLR. You then override methods like visitIfStatement
, visitExpression
, etc. Within these methods, you access parts of the parse tree (e.g., the condition and branches of an if-statement) and construct your custom AST nodes.
Symbol Table (Optional but Important)
A symbol table is a data structure used to store information about identifiers (variables, functions) such as their type, scope (where they are defined), and memory location. It's crucial for semantic checks like ensuring variables are declared before use and type compatibility.
Semantic Analysis (Optional)
This phase checks for meaning-related errors that the parser can't catch. Examples include type mismatches (e.g., adding an integer to a character array without a cast), using undeclared variables, or calling functions with incorrect arguments. This usually involves traversing the AST and consulting the symbol table.
Key Concepts to Master
Compilers
Front-End (Lexer, Parser, Semantic Analyzer)
ANTLR (ANother Tool for Language Recognition)
Grammars (Lexer & Parser Rules)
Tokens
Parse Trees (Concrete Syntax Trees - CST)
Abstract Syntax Trees (AST)
Visitor Pattern
Symbol Tables
Scope
Type Checking
Semantic Analysis
How Our Tutors Elevate Your Learning
- ANTLR Grammar Design: Helping you write correct and unambiguous lexer and parser rules in the
.g4
file for your Mini-C language subset. - Resolving Grammar Conflicts: Assisting with shift/reduce or reduce/reduce conflicts that might arise in your ANTLR grammar.
- AST Node Design: Discussing how to design your custom AST node classes to effectively represent your language constructs.
- Visitor Implementation: Guiding you through writing the visitor methods to correctly traverse the parse tree and build your AST nodes.
- Symbol Table Implementation: Helping you design and implement a symbol table that handles scopes correctly.
- Semantic Checking Logic: Working through specific semantic checks, like type checking for expressions or ensuring variables are declared.
- Debugging ANTLR and Generated Code: ANTLR can have a learning curve. A tutor can help debug issues with the grammar, the generated code, or your visitor logic.