Compiler Construction
Programming languages have become fundamental in creating powerful applications - however, the source code written in these languages has to be translated into a form that computers can execute. This is done either by a compiler - which translates human-readable source code into machine-readable, runnable object code - or by an interpreter, which only translates as much code as is needed to execute the next statement of the program.
Compilers translate source code written in a higher-level programming language into a lower-level language - often a machine language - to allow a computer to execute it. This process has numerous steps, and thus compilers can be hugely complex programs.
Stage 1 - Lexical Analysis
Token: A sequence of characters that is treated as a single unit in the grammar of a programming language - for example, a keyword or a type.
This is also known as scanning, and is the first phase of a compiler. It involves reading the source code and breaking it into a stream of tokens. It does this by reading the characters in the source code and grouping them into lexemes - sequences of related characters. Each lexeme corresponds to a token, and the token patterns are defined using regular expressions. The scanner also strips out input that later stages do not need - such as comments and whitespace - and reports lexical errors.
I was inspired to write this post after reading about finite automata in A.K. Dewdney's "The New Turing Omnibus", which lexical analysis is closely related to.
- Input Preprocessing: This stage involves 'cleaning up' the source code - for example by removing whitespace and comments.
- Tokenisation: This stage is where the source text is broken up into a sequence of tokens, by matching the text against the lexeme patterns defined in the compiler (expressed using regular expressions).
- Token Classification: Here, the lexer determines the type of each token - such as a keyword, operator or type.
- Token Validation: The lexer checks that each token is valid - such as a variable name being a valid and accepted name.
- Output Generation: The lexer emits the result of this process as a list of tokens, which is then passed on to the next stage of the compilation process.
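To make these steps concrete, here is a minimal lexer sketch in TypeScript. The token set and regular expressions are my own invented example, not taken from any particular compiler:

```typescript
// A minimal sketch of a hand-rolled lexer. Token names and patterns
// here are invented for illustration.
type TokenType = "KEYWORD" | "IDENTIFIER" | "NUMBER" | "OPERATOR";

interface Token {
  type: TokenType;
  lexeme: string;
}

// Each token type is defined by a regular expression. Order matters:
// keywords must be tried before the general identifier pattern.
const tokenPatterns: [TokenType, RegExp][] = [
  ["KEYWORD", /^(let|if|else|while)\b/],
  ["IDENTIFIER", /^[A-Za-z_][A-Za-z0-9_]*/],
  ["NUMBER", /^[0-9]+/],
  ["OPERATOR", /^(==|[+\-*/=<>])/],
];

function tokenise(source: string): Token[] {
  const tokens: Token[] = [];
  // Input preprocessing: strip comments, then skip whitespace as we scan.
  let input = source.replace(/\/\/.*$/gm, "");
  while ((input = input.trimStart()).length > 0) {
    let matched = false;
    for (const [type, pattern] of tokenPatterns) {
      const m = pattern.exec(input);
      if (m) {
        tokens.push({ type, lexeme: m[0] }); // classification + output
        input = input.slice(m[0].length);
        matched = true;
        break;
      }
    }
    // Token validation: anything unmatched is a lexical error.
    if (!matched) throw new Error(`Unexpected character: ${input[0]}`);
  }
  return tokens;
}

// Example: produces KEYWORD("let"), IDENTIFIER("x"), OPERATOR("="), NUMBER("42")
console.log(tokenise("let x = 42 // initialise x"));
```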
Stage 2 - Syntax Analysis
Also known as parsing - this stage ensures that the tokens obtained from lexical analysis are arranged in a valid pattern according to the language's grammar. If they are not, this stage reports a syntax error, preventing issues further down the compilation process.
This stage produces a parse tree or an abstract syntax tree - this is a hierarchical representation of the source code.
Different Algorithms Used in Syntax Analysis:
- LL Parsing - This is a top-down parsing algorithm that starts with the root of the parse tree and constructs the tree by successively expanding nonterminals downwards. It is one of the simplest algorithms to implement (a recursive-descent sketch follows this list).
- LR Parsing - This is a bottom-up parsing algorithm that starts with the leaves of the parse tree and constructs the tree by successively reducing sequences of symbols to nonterminals. It is more powerful than LL parsing.
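Here is a sketch of top-down (LL-style) recursive-descent parsing for a toy expression grammar. The grammar and the AST node shapes are invented for illustration:

```typescript
// Recursive-descent (top-down, LL) parsing for the toy grammar:
//   expr -> term (('+' | '-') term)*     term -> NUMBER
type Expr =
  | { kind: "number"; value: number }
  | { kind: "binary"; op: "+" | "-"; left: Expr; right: Expr };

function parseExpr(tokens: string[]): Expr {
  let pos = 0;

  // term -> NUMBER
  function parseTerm(): Expr {
    const tok = tokens[pos++];
    if (!/^[0-9]+$/.test(tok)) {
      // Syntax analysis catches tokens arranged in an invalid pattern.
      throw new Error(`Expected a number, got '${tok}'`);
    }
    return { kind: "number", value: Number(tok) };
  }

  // expr -> term (('+' | '-') term)* : expand from the root downwards
  let node = parseTerm();
  while (tokens[pos] === "+" || tokens[pos] === "-") {
    const op = tokens[pos++] as "+" | "-";
    node = { kind: "binary", op, left: node, right: parseTerm() };
  }
  if (pos !== tokens.length) throw new Error(`Unexpected token '${tokens[pos]}'`);
  return node;
}

// "1 + 2 - 3" parses to the tree ((1 + 2) - 3).
console.log(JSON.stringify(parseExpr(["1", "+", "2", "-", "3"]), null, 2));
```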
Once a parse tree has been constructed, the compiler moves on to the next stage - semantic analysis.
Stage 3 - Semantic Analysis
This stage involves ensuring that declarations and statements are semantically correct. The parse tree / syntax tree is used to check consistency.
Functions of Semantic Analysis:
- Type Checking - this is where the compiler verifies that each operator has matching operands (a small sketch follows the examples below).
- Label Checking - Ensures that every label referenced in a program actually exists.
- Flow Control Check - Ensures that control structures are used correctly, e.g. an `else if` statement not appearing without a preceding `if` statement.
Examples of Semantic Errors:
- Type mismatch
- Undeclared variables
- Reserved identifiers being misused
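To make type checking concrete, here is a minimal sketch in TypeScript; the AST shape and the single rule it enforces are invented for illustration:

```typescript
// A sketch of type checking over an expression AST: verifies that
// each operator has matching operands.
type Type = "number" | "string";

type TypedExpr =
  | { kind: "literal"; type: Type }
  | { kind: "binary"; op: "+"; left: TypedExpr; right: TypedExpr };

function checkTypes(expr: TypedExpr): Type {
  if (expr.kind === "literal") return expr.type;
  const left = checkTypes(expr.left);
  const right = checkTypes(expr.right);
  // Semantic error: type mismatch between the operands of '+'.
  if (left !== right) {
    throw new Error(`Type mismatch: ${left} + ${right}`);
  }
  return left;
}

// Passes: number + number. Would throw for number + string.
checkTypes({
  kind: "binary",
  op: "+",
  left: { kind: "literal", type: "number" },
  right: { kind: "literal", type: "number" },
});
```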
Stage 4 - Code Generation
In this stage, the compiler generates the final code - this may be code in another programming language, if the compiler is a cross-language compiler - but more often it is machine code.
This stage is quite large, and thus has several steps:
- The Abstract Syntax Tree (AST) built during the earlier stages is traversed, covering all code paths across the program's files.
- Register descriptors are created - these are structures which track which values are currently held in the registers used by the program.
- Address descriptors are created - these track the memory locations where the program's values are stored - much like pointers in the C programming language.
- A code generator is then used to generate the actual code.
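As a simplified illustration of that final step, here is a sketch of a code generator that walks the expression AST from the parsing example and emits instructions for an invented stack machine. Real generators targeting register machines would consult the register and address descriptors described above:

```typescript
// A sketch of code generation: walk the AST and emit instructions
// for an invented stack machine.
type Expr =
  | { kind: "number"; value: number }
  | { kind: "binary"; op: "+" | "-"; left: Expr; right: Expr };

function generate(expr: Expr, out: string[] = []): string[] {
  if (expr.kind === "number") {
    out.push(`PUSH ${expr.value}`);
  } else {
    generate(expr.left, out);  // code for the left operand first
    generate(expr.right, out); // then the right operand
    out.push(expr.op === "+" ? "ADD" : "SUB"); // combine the top two values
  }
  return out;
}

// (1 + 2) - 3 becomes: PUSH 1, PUSH 2, ADD, PUSH 3, SUB
const ast: Expr = {
  kind: "binary",
  op: "-",
  left: {
    kind: "binary",
    op: "+",
    left: { kind: "number", value: 1 },
    right: { kind: "number", value: 2 },
  },
  right: { kind: "number", value: 3 },
};
console.log(generate(ast).join("\n"));
```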
Stage 5 - Code Optimisation
This is the final stage of a compiler - and is not featured in some simple compilers, as it is not essential. The compiler analyses the code generated in the previous stage and optimises it - e.g. via loop unrolling and function inlining.
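As a simple illustration of loop unrolling, here is what the transformation does to a loop over an array - shown in TypeScript source for readability, although in practice the optimiser performs this on the compiler's intermediate representation rather than on source code:

```typescript
// Before: one loop-condition check and branch per element.
function sum(xs: number[]): number {
  let total = 0;
  for (let i = 0; i < xs.length; i++) {
    total += xs[i];
  }
  return total;
}

// After unrolling by a factor of 4: fewer checks and branches, at
// the cost of larger code. The trailing loop handles the leftovers
// when the length is not a multiple of 4.
function sumUnrolled(xs: number[]): number {
  let total = 0;
  let i = 0;
  for (; i + 3 < xs.length; i += 4) {
    total += xs[i] + xs[i + 1] + xs[i + 2] + xs[i + 3];
  }
  for (; i < xs.length; i++) {
    total += xs[i];
  }
  return total;
}
```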
Conclusion
Whilst writing this post I made use of other articles - for example on GeeksforGeeks (https://www.geeksforgeeks.org/introduction-of-compiler-design/?ref=header_outind).
I hope to look into creating my own compiler in the future - potentially making a cross-language compiler, due to my experience in JavaScript. However, I'd also really like to implement a compiler in C, due to the superior optimisation it offers and the greater control over memory management.