Compiler Construction
Programming languages have become fundamental in creating powerful applications - however, the source code written in these languages has to be translated into a form that computers can execute. This is done either by a compiler - which translates human-readable source code into machine-readable, runnable object code - or by an interpreter, which only translates as much code as is needed to execute the next statement of the program.
Compilers translate source code written in a higher-level programming language into a lower-level language - often a machine language - to allow a computer to execute it. This process has numerous steps, and thus compilers can be hugely complex programs.
Stage 1 - Lexical Analysis
Token: A sequence of characters that is treated as a single unit in the grammar of a programming language - for example, a keyword or a type.
This is also known as scanning, and is the first phase of a compiler. It involves reading the source code and breaking it into a stream of tokens. It does this by reading the characters in the source code and grouping them into lexemes - sequences of related characters. Each lexeme corresponds to a token, and the token patterns are defined using regular expressions. The scanner also strips out input that later stages do not need - such as comments and whitespace - and reports lexical errors.
I was inspired to write this post after reading about finite automata in A.K. Dewdney's "The New Turing Omnibus", which lexical analysis is closely related to.
- Input Preprocessing: This stage involves 'cleaning up' the source code - for example by removing whitespace and comments.
- Tokenisation: This stage is where the source text is broken up into a sequence of tokens, by matching the text against the lexeme patterns defined in the compiler (expressed using regular expressions).
- Token Classification: Here, the lexer determines the type of each token - such as a keyword, operator or type.
- Token Validation: The lexer checks that each token is valid - such as a variable name being a valid and accepted name.
- Output Generation: The lexer emits the result of this process as a list of tokens, which is then passed on to the next stage of the compilation process.
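To make these steps concrete, here is a minimal lexer sketch in TypeScript. The token set and regular expressions are my own invented example, not taken from any particular compiler:

```typescript
// A minimal sketch of a hand-rolled lexer. Token names and patterns
// here are invented for illustration.
type TokenType = "KEYWORD" | "IDENTIFIER" | "NUMBER" | "OPERATOR";

interface Token {
  type: TokenType;
  lexeme: string;
}

// Each token type is defined by a regular expression. Order matters:
// keywords must be tried before the general identifier pattern.
const tokenPatterns: [TokenType, RegExp][] = [
  ["KEYWORD", /^(let|if|else|while)\b/],
  ["IDENTIFIER", /^[A-Za-z_][A-Za-z0-9_]*/],
  ["NUMBER", /^[0-9]+/],
  ["OPERATOR", /^(==|[+\-*/=<>])/],
];

function tokenise(source: string): Token[] {
  const tokens: Token[] = [];
  // Input preprocessing: strip comments, then skip whitespace as we scan.
  let input = source.replace(/\/\/.*$/gm, "");
  while ((input = input.trimStart()).length > 0) {
    let matched = false;
    for (const [type, pattern] of tokenPatterns) {
      const m = pattern.exec(input);
      if (m) {
        tokens.push({ type, lexeme: m[0] }); // classification + output
        input = input.slice(m[0].length);
        matched = true;
        break;
      }
    }
    // Token validation: anything unmatched is a lexical error.
    if (!matched) throw new Error(`Unexpected character: ${input[0]}`);
  }
  return tokens;
}

// Example: produces KEYWORD("let"), IDENTIFIER("x"), OPERATOR("="), NUMBER("42")
console.log(tokenise("let x = 42 // initialise x"));
```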
Stage 2 - Syntax Analysis
Also known as parsing - this stage ensures that the tokens obtained from lexical analysis are arranged in a valid pattern according to the language's grammar. If they are not, this stage reports a syntax error, preventing issues further down the compilation process.
This stage produces a parse tree or an abstract syntax tree - this is a hierarchical representation of the source code.
Different Algorithms Used in Syntax Analysis:
- LL Parsing - This is a top-down parsing algorithm that starts with the root of the parse tree and constructs the tree by successively expanding nonterminals downwards. It is one of the simplest algorithms to implement (a recursive-descent sketch follows this list).
- LR Parsing - This is a bottom-up parsing algorithm that starts with the leaves of the parse tree and constructs the tree by successively reducing sequences of symbols to nonterminals. It is more powerful than LL parsing.
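Here is a sketch of top-down (LL-style) recursive-descent parsing for a toy expression grammar. The grammar and the AST node shapes are invented for illustration:

```typescript
// Recursive-descent (top-down, LL) parsing for the toy grammar:
//   expr -> term (('+' | '-') term)*     term -> NUMBER
type Expr =
  | { kind: "number"; value: number }
  | { kind: "binary"; op: "+" | "-"; left: Expr; right: Expr };

function parseExpr(tokens: string[]): Expr {
  let pos = 0;

  // term -> NUMBER
  function parseTerm(): Expr {
    const tok = tokens[pos++];
    if (!/^[0-9]+$/.test(tok)) {
      // Syntax analysis catches tokens arranged in an invalid pattern.
      throw new Error(`Expected a number, got '${tok}'`);
    }
    return { kind: "number", value: Number(tok) };
  }

  // expr -> term (('+' | '-') term)* : expand from the root downwards
  let node = parseTerm();
  while (tokens[pos] === "+" || tokens[pos] === "-") {
    const op = tokens[pos++] as "+" | "-";
    node = { kind: "binary", op, left: node, right: parseTerm() };
  }
  if (pos !== tokens.length) throw new Error(`Unexpected token '${tokens[pos]}'`);
  return node;
}

// "1 + 2 - 3" parses to the tree ((1 + 2) - 3).
console.log(JSON.stringify(parseExpr(["1", "+", "2", "-", "3"]), null, 2));
```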
Once a parse tree has been constructed, the compiler moves on to the next stage - semantic analysis.
Stage 3 - Semantic Analysis
This stage involves ensuring that declarations and statements are semantically correct. The parse tree / syntax tree is used to check consistency.
Functions of Semantic Analysis:
- Type Checking - this is where the compiler verifies that each operator has matching operands (a small sketch follows the examples below).
- Label Checking - Ensures that every label referenced in a program actually exists.
- Flow Control Check - Ensures that control structures are used correctly, e.g. an `else if` statement not appearing without a preceding `if` statement.
Examples of Semantic Errors:
- Type mismatch
- Undeclared variables
- Reserved identifiers being misused
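To make type checking concrete, here is a minimal sketch in TypeScript; the AST shape and the single rule it enforces are invented for illustration:

```typescript
// A sketch of type checking over an expression AST: verifies that
// each operator has matching operands.
type Type = "number" | "string";

type TypedExpr =
  | { kind: "literal"; type: Type }
  | { kind: "binary"; op: "+"; left: TypedExpr; right: TypedExpr };

function checkTypes(expr: TypedExpr): Type {
  if (expr.kind === "literal") return expr.type;
  const left = checkTypes(expr.left);
  const right = checkTypes(expr.right);
  // Semantic error: type mismatch between the operands of '+'.
  if (left !== right) {
    throw new Error(`Type mismatch: ${left} + ${right}`);
  }
  return left;
}

// Passes: number + number. Would throw for number + string.
checkTypes({
  kind: "binary",
  op: "+",
  left: { kind: "literal", type: "number" },
  right: { kind: "literal", type: "number" },
});
```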
Stage 4 - Code Generation
In this stage, the compiler generates the final code - this may be code in another programming language, if the compiler is a cross-language compiler - but more often it is machine code.
This stage is quite large, and thus has several steps:
- The Abstract Syntax Tree (AST) built during the earlier stages is traversed, covering all code paths across the program's files.
- Register descriptors are created - these are structures which track which values are currently held in the registers used by the program.
- Address descriptors are created - these track the memory locations where the program's values are stored - much like pointers in the C programming language.
- A code generator is then used to generate the actual code.
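As a simplified illustration of that final step, here is a sketch of a code generator that walks the expression AST from the parsing example and emits instructions for an invented stack machine. Real generators targeting register machines would consult the register and address descriptors described above:

```typescript
// A sketch of code generation: walk the AST and emit instructions
// for an invented stack machine.
type Expr =
  | { kind: "number"; value: number }
  | { kind: "binary"; op: "+" | "-"; left: Expr; right: Expr };

function generate(expr: Expr, out: string[] = []): string[] {
  if (expr.kind === "number") {
    out.push(`PUSH ${expr.value}`);
  } else {
    generate(expr.left, out);  // code for the left operand first
    generate(expr.right, out); // then the right operand
    out.push(expr.op === "+" ? "ADD" : "SUB"); // combine the top two values
  }
  return out;
}

// (1 + 2) - 3 becomes: PUSH 1, PUSH 2, ADD, PUSH 3, SUB
const ast: Expr = {
  kind: "binary",
  op: "-",
  left: {
    kind: "binary",
    op: "+",
    left: { kind: "number", value: 1 },
    right: { kind: "number", value: 2 },
  },
  right: { kind: "number", value: 3 },
};
console.log(generate(ast).join("\n"));
```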
Stage 5 - Code Optimisation
This is the final stage of a compiler - and is not featured in some simple compilers, as it is not essential. The compiler analyses the code generated in the previous stage and optimises it - e.g. via loop unrolling and function inlining.
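As a simple illustration of loop unrolling, here is what the transformation does to a loop over an array - shown in TypeScript source for readability, although in practice the optimiser performs this on the compiler's intermediate representation rather than on source code:

```typescript
// Before: one loop-condition check and branch per element.
function sum(xs: number[]): number {
  let total = 0;
  for (let i = 0; i < xs.length; i++) {
    total += xs[i];
  }
  return total;
}

// After unrolling by a factor of 4: fewer checks and branches, at
// the cost of larger code. The trailing loop handles the leftovers
// when the length is not a multiple of 4.
function sumUnrolled(xs: number[]): number {
  let total = 0;
  let i = 0;
  for (; i + 3 < xs.length; i += 4) {
    total += xs[i] + xs[i + 1] + xs[i + 2] + xs[i + 3];
  }
  for (; i < xs.length; i++) {
    total += xs[i];
  }
  return total;
}
```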
Conclusion
Whilst writing this post I made use of other articles - for example on GeeksforGeeks (https://www.geeksforgeeks.org/introduction-of-compiler-design/?ref=header_outind).
I hope to look into creating my own compiler in the future - potentially making a cross-language compiler, due to my experience in JavaScript. However, I'd also really like to implement a compiler in C, due to the superior optimisation it offers and the greater control over memory management.