Compiler Construction

Read Time:5 minutes

Programming languages have become fundamental in creating powerful applications - however the source code written in these languages has to be translated into a form that is executable by computers. This is done by either a compiler - which compiles source code (which is readable by a human) into object code (which is readable and runnable by a machine) - or by an interpreter, which only translates as much code as is needed to execute the next statement of the programme.

Compilers translate source code written in a higher-level programming language, into a lower-level programming language - often a machine language, to allow a computer to execute it. This process has numerous steps, and thus compilers can be hugely-complex programmes.

Stage 1 - Lexical Analysis

Token: A sequence of characters that are treated as one in the grammar of a programming language.

This is also known as scanning, and is the first phase of a compiler. It involves reading the source code, and breaking it into a stream of tokens. It does this by reading the characters in the source code, and grouping them into lexemes - a sequence of characters that are related. Each lexeme responds to a token, which are defined in the programme using regular expressions. These are also important for removing lexical errors - such as comments and white space.

  • For example a keyword or a type.

I was inspired to write this post after reading about finite Automata in A.K. Dewdney's "The New Turing Omnibus", which lexical analysis is closely related too.



  1. Input Preprocessing: This stage involves 'cleaning up' the source code - for example by removing whitespace and comments.
  2. Tokenisation: This stage is where the source text is broken up into a sequence of tokens - which it does by comparing the text to lexemes in the compiler (expressed using regular expressions).
  3. Token Classification: Here, the lexer determines the type of each token - such as a keyword, operator or type.
  4. Token Validation: The lexer checks that each token is valid - such as a variable name being a valid and accepted name.
  5. Ouput Generation: The lexer generates the output of this process as a list of tokens. This list is then passed onto the next stage of the compilation process.

Stage 2 - Syntax Analysis

Also known as parsing - this stage ensures that the tokens obtained from lexical analysis are arranged in a valid pattern. If the tokens are not in a valid pattern, this stage will detect this error, preventing issues further down the compilation process.

This stage produces a parse tree or an abstract syntax tree - this is a hierarchical representation of the source code.

Different Algorithms Used in Syntax Analysis:

  • LL Parsing - This is a top-down parsing algorithm, that starts with the root of the parse tree, and constructs the tree by successively expanding downwards. It is one of the simplest algorithms to implement.
  • LR Parsing - This is a bottom-up parsing algorithm, that starts with the leaves of the parse tree, and constructs the tree by successively reducing terminals. It is more powerful than LL parsing.

Once a parse tree has been constructed, the compiler moves onto the next stage - semantic analysis.

Stage 3 - Semantic Analysis

This stage involves ensuring that declarations and statements are semantically correct. The parse tree / syntax tree is used to check consistency.

Functions of Semantic Analysis:

  1. Type checking - this is where the compiler verifies that each operator has matching operands.
  2. Label Checking - Ensures a programme contains label references.
  3. Flow Control Check - Ensures that control structures are used correctly, e.g an else if statement not being used with an if statement.

Examples of Semantic Errors:

  • Type mismatch,
  • Undeclared variables,
  • Reserved identifiers being misused,

Stage 4 - Code Generation

In this stage, the compiler generates our final code - this may be code in another programming language, if the compiler is a cross-language compiler - or more often it is machine code.

This stage is quite large, and thus has several steps:

  • An Abstract Syntax Tree (AST) is constructed by traversing all possible paths in all files.
  • Register descriptors are created - these are structures which store information about the registers used in the programme.
  • Address descriptors are created - these store the memory locations used by a programme - much like pointers in the C programming language.
  • A code generator is then used to generate the actual code.

Stage 5 - Code Optimisation

This is the final stage of a compiler - and is not features in some, simple compilers, due to it not being essential. The compiler will analyse the code generated in the previous stage, and will optimise it - e.g via loop unrolling and function unrolling.

Conclusion

Whilst writing this post I made use of other articles - for example on geekforgeeks.com (https://www.geeksforgeeks.org/introduction-of-compiler-design/?ref=header_outind).

I hope to look into creating my own compiler in the future - potentially making a cross-language-compiler due to my experience in Javascript. However, I'd also really like to implement a compiler in C due to the superior optimisation it offers and the greater control over memory management.

Toby Chambers © 2024