About

This is an informational web page for the Compiler Principles course (NSWI098), which is being taught at the Faculty of Mathematics and Physics, Charles University. The lectures are being held in S4 from 15:40 to 17:10 every Monday. The seminars are in S5 from 14:00 to 15:30 according to schedule.

You can find here:

Contact

The course is currently led by Jakub Yaghob and David Bednárek. Please direct inqueries of a specific topic (or home assignment) to the teacher who presented the topic.

Latest Updates

26.9.2023 Academic year 2023/2024 initial preparation.
22.9.2022 Academic year 2022/2023 initial preparation.
23.9.2021 Academic year 2021/2022 initial preparation.
25.9.2020 Academic year 2020/2021 preparation.

Grading

The grading criteria for this course are summarized in the following. Please, read them carefully. In case any rules are not clear, contact Dr. Jakub Yaghob.

Home Assignments

There will be 5 home assignments which are designed to incrementally build a compiler for simplified C (called Cecko).

You may receive up to 100 points for assignments #1-#2 and more than 150 points for assignments #3-#5 (100 base points and more then 50 bonus points). You only get bonus points for assignments #3-#5 if you earn at least 50 base points in the same assignment. The sum of your points will determine your final grade as follows:

pointsmark
600 or more1 (excellent)
450-5992 (well done)
350-4493 (OK)
349 or lessfailed

Each assignment has a strict deadline. Once the deadline is reached, all assignments are checked and graded. You may submit your solution after the deadline and ask for (re)evaluation, but you will receive penalty 10 points per day for late submission, which is additionally subtracted from points obtained for the assignment. This penalty is applied as the last points adjustment, i.e. after the decision on the allocation of bonus points.

Lectures

The domain of compilers is relatively steady in the computer science, at least for the compiler front-end. If you have any questions or suggestions related to the lectures, please contact the teachers.

The presented schedule is only informational. Teaching time for a matter may vary slightly depending on the student's grasp.

2.10.2023 Compiler - an introduction PPTX File Icon 01-comp.pptx
9.10.2023 Lexical analysis PPTX File Icon 02-la.pptx
16.10.2023 Syntax analysis - an introduction PPTX File Icon 03-sxa.pptx
23.10.2023 Syntax analysis - top-down parsing PPTX File Icon 03-sxa.pptx
30.10.2023 Syntax analysis - bottom-up parsing PPTX File Icon 03-sxa.pptx
6.11.2023 Semantic analysis PPTX File Icon 05-saatr.pptx
13.11.2023 Intermediate code PPTX File Icon 04-ic.pptx
20.11.2023 Intermediate code generation PPTX File Icon 06-genic.pptx
27.11.2023 High-level optimizations PPTX File Icon 07-opt.pptx
4.12.2023 CPU architectures PPTX File Icon 08-cpu.pptx
11.12.2023 Code generation PPTX File Icon 08-cpu.pptx
18.12.2023 Runtime PPTX File Icon 09-rt.pptx
8.1.2024 Interpreted languages PPTX File Icon 10-interp.pptx

Assignments

Motto: Kde to žije, tam je Mlaskal. (Kate) (not valid anymore, left here as a reminder)

Global assignment

The Cecko (simplified C) language is defined by a set of grammar rules. You may find them in the file doc/selected-grammar.md in the skeleton repository. The grammar is exactly in the same format as in the C programming language specification (working draft) Annex A, it is just simplified as Cecko is a subset of C.

Please note some further remarks:

Developing and testing

Read the Using your Cecko repository section before cloning the skeleton. The framework and skeleton files may be updated for various reasons, therefore, you will need a method to merge the updates into your project.

The skeleton framework contains the following folders:

You will need to prepare your building environment as described in Prerequisites.

Refer to Building and running Cecko for instructions on building and running the code.

Assignment 1

For the first assignment, you will submit the following files:

You may use the C++ header/source files for C++ routines called from calexer.lex - in this case, add #include "casem.hpp" to calexer.lex.

Assignments 2 to 5

For the last four assignments, you will submit the following files:

Test files

The Cecko source files have the usual suffix .c, the corresponding expected outputs are in the .gold files.

The test files whose names end with "n" contain intentional semantic errors and are expected to fail during compilation.

Submitting and evaluation

You shall submit your solution into Recodex.

Note: For registering in Recodex, you need a working email address registered in CAS.

Recodex will compile it using GCC 12.2. Your solution should not use any 3rd party libraries or code (except those specified in the section Building and running Cecko). Only standard C++20 libraries should be used.

Recodex will compile your solution with the same library and will apply the same tests as in the framework. (However, Recodex will use a different build system.)

Compilation and tests in Recodex takes some time - please be patient.

Recodex will insist on strict equivalence of your solution outputs to the gold files. However, we will manually inspect results of all tests which can compile and run but fail in the Recodex judge phase. Therefore, you may be assigned some points even if Recodex assigns zero.

Each home assignment has a maximal achievable points (100 points for assignments #1-#2, 150+ points for assignments #3-#5). Any difference (including your debug messages) between your compiler output and a provided correct output is penalized. The penalization depends on the severity of the problem.

(1) Lexical analysis

  • Extend the calexer.lex file so that the resulting scanner is capable of recognizing all the lexical elements described in doc/selected-grammar.md and returns the objects which represent them at the interface between the lexer and the parser.
  • All objects representing tokens are created using a call to a function like cecko::parser::make_WHILE(...) whose declarations are generated in <build-folder>/fmwk/ckdumper.hpp by the build system for cecko1. All these functions have a parameter location_type l, pass ctx->line() here.
  • The list of token names (and additional parameter types) may be found in ckdumper.y (it has to be the same as in caparser.y which is not used when building cecko1)
  • Some tokens (like >,<,>,<=,>= or +,-) are grouped together. The object representing the token is shared across each group (it will become the same terminal in the grammar); individual tokens are distinguished using an additional parameter of the object (an enumeration type associated to the group, like cecko::gt_cmpo::LT - see ckgrptokens.hpp).
  • Identifiers and literals have an additional parameter of the object - the text or value - pass the value of the correct type here
  • String literals shall not contain end-of-line (non-escaped \n). You will need 'start conditions' feature of FLEX to implement escape sequences. The effective value (with quotes removed and escapes replaced) shall be passed to make_STRLIT as std::string (cecko::CIName is just an alias)
  • Character literals shall not contain end-of-line (non-escaped \n) and they represent always only one char in Cecko. Remember, characters have type int in C.
  • Multiline comments are enclosed in /* and */ (not inside a string or one-line comment). Comments can contain any sequence of any characters. Multi-line comments can be nested - starts (/*) and ends (*/) have to be paired correctly. You will certainly need 'start conditions' to handle this.
  • One-line comments start with // (not inside a string or multi-line comment) and end with end of the line. Comments can contain any sequence of any characters except the end-of-line. You will certainly need 'start conditions' to handle this.
  • Call ctx->incline() whenever you meet end-of-line - this increments ctx->line(). The first line has number 1 (ctx->line() is already initialized to 1). (You must not use %option yylineno to count lines.)
  • Detection of and recovery from lexical errors, like end-of-line or end-of-file in the middle of a string (report EOLINSTRCHR or EOFINSTRCHR and terminate the string as if the terminating quote was present), end-of-file in comment (EOFINCMT).
  • Detection of malformed lexical elements, like 12xy or 0xcxy (it is not an INTLIT followed by an IDF). You shall report an error (BADINT), consume the whole malformed lexical element and produce one INTLIT token (with the value of 12) in this case.
  • Detection of too large integer literals (larger than MAX_INT=0x7FFFFFFF; Cecko behavior is slightly different from the standard here). You shall report an error (INTOUTRANGE) and produce the INTLIT token with a value of MAX_INT. (Beware: the atoi function has undefined behavior for integers greater than MAX_INT, you will need strtol or stoi or a hand-made implementation.)
  • Detection of too large hexadecimal escape sequences in char and string literals (larger than 0xFF). You shall report an error (BADESCAPE) and produce the INTLIT or STRLIT with \xFF substituted for the overflowed escape sequence.
  • If, after a backslash, you encounter a character other than specified in simple-escape-sequence, report the error (BADESCAPE), ignore the backslash and use the character after the backslash.
  • Use code like ctx->message(cecko::errors::EOFINCMT, ctx->line()) to signal an error. All error codes are declared in fmwk/ckcontext.hpp, the message texts are in fmwk/ckcontext.cpp. The err_def_s messages have an additional argument of type std::string_view

Example input (*.c) and outputs (*.gold) can be found in the tests directory.

(2) Parsing

  • Rewrite the grammar in doc/selected-grammar.md to Bison rules in caparser.y.
  • The declarative part of the caparser.y file is already prepared and you should not change it.
  • Remove all shift/reduce and reduce/reduce conflicts. Warning: bison does not report conflicts to stdout. CMake projects instruct bison to report the conflicts in <build_folder>/stud-sol/caparser.y.output, but you have to examine this file manually.
  • Extend calexer.lex with TYPEIDF token detection. The lexical element identifier corresponds to two tokens IDF and TYPEIDF depending on the content of compiler tables. Use if(ctx->is_typedef(yytext)) for detecting type identifier using compiler tables. Specifically for this assignment, we have predefined FILE identifier as a type identifier.

(3) Declarations

Implement semantic analysis for declarations.

  • 100 points:
    • void, _Bool, char, int, TYPEIDF, const
    • pointer and function declarators
    • variable and function declarations and definitions
    • return statement containing an INTLIT
  • +10 points:
    • typedef definitions
  • +10 points:
    • array declarators (with an INTLIT in brackets)
  • +20 points:
    • enum declarations and definitions
  • +20 points:
    • struct declarations and definitions

You will not be awarded any bonus points if your solution receives less than 50 points from the 100-point base part.

(4) Expressions

Implement semantic analysis for expressions.

  • 100 points:
    • Integer, character, and string literals
    • Implicit conversions:
      • Array to pointer (string literals are of type char[N])
      • _Bool/char to int, int to char
    • Function calls (including variadic functions like printf)
      • Arguments and return values of types char/int/pointer
    • int operators:
      • Unary +, -
      • Binary +, -, *, /, %
    • Pointer operators:
      • Unary *, &
    • Assignment operator (into char/int/pointer L-values)
    • Return statement
  • +10 points:
    • sizeof operator
  • +10 points:
    • int ++, --, +=, -=, *=, /=, %= operators
    • pointer ++, --, +=, -= operators
  • +20 points:
    • [] operator
    • Pointer+int, int+pointer, pointer-int operators
  • +10 points:
    • .,-> operators
    • Assignment, function arguments and return values of type struct

You will not be awarded any bonus points if your solution receives less than 50 points from the 100-point base part.

(5) Control-flow statements

Implement semantic analysis for control-flow statements and Boolean expressions.

  • 100 points:
    • Implicit conversions:
      • char/int/pointer to _Bool
    • _Bool operations:
      • Unary !
      • Arguments and return values of type _Bool
      • Assignment operator into _Bool
    • int operators:
      • ==,!=,<,>,<=,>=
    • if and if-else statements
    • while and while-do statements
  • +10 points:
    • for statement
  • +10 points:
    • pointer operators:
      • ==,!=,<,>,<=,>=
      • pointer-pointer
  • +30 points:
    • &&, || operators with shortcut evaluation
  • +6 points:
    • Complex mixture of the elements above

You will not be awarded any bonus points if your solution receives less than 50 points from the 100-point base part.

Prerequisites

Platform

Recodex uses Linux and GCC to compile and run your submissions. It may trigger errors which escaped detection in other compile/run environments. For such cases, it is advantageous to have a GCC-based build environment prepared beforehand.

On the other hand, the Cecko sources are intended to be multi-platform and the skeleton is regularly tested with both GCC and Microsoft Visual C++. Therefore, development using MSVC is possible (and encouraged).

Windows Subsystem for Linux

For developers equipped with Windows 10, Windows Subsystem for Linux (WSL) is recommended as the best approach to multi-platform development. Compiling, running and debugging in the WSL-GCC environment is integrated in Visual Studio 2022 (as well as VS Code) and fully supported by Cecko projects.

Note: If you plan to build LLVM on your own (instead of downloading binaries), make sure that you use WSL 2 and that the working folder is at native Linux filesystem (i.e. not /mnt/...).

Git

Make yourself familiar with the principles of git.

In Linux, you will probably use command-line git which is present in all modern Linux distributions.

In Windows, TortoiseGit is the recommended option.

You may also try using the Git plugins in Visual Studio 2022 or VS Code; however, be aware that some advanced commands may not be available there.

To access gitlab.mff.cuni.cz, you will first need to login (using your SIS credentials) in its web interface and register your SSH keys there (if you have not done so yet).

Build tools

cmake

CMake 3.19 or newer is required. It is packaged in sufficiently new distributions of Linux (including Ubuntu 22.04 in WSL), installed by something like:

sudo apt install cmake

For Windows, cmake (and ninja) is installed with Visual Studio 2022 when C++ CMake tools for Windows is enabled in the Visual Studio Installer. C++ CMake tools for Linux is required when Visual Studio is used to build in WSL or remote Linux.

When using VS Code, cmake (as well as a C++ compiler) must be installed as an external tool.

ninja and zlib

If you use Visual Studio support for Linux, you may need to manually install ninja build system like:

sudo apt install ninja-build

In some Linux distributions, you may encounter linker complaining about -lz. The remedy is installing libz like:

sudo apt install zlib1g-dev

bison and flex

bison 3.4 and flex 2.6.4 or newer are required. They are present in any (sufficiently new) distribution of Linux (including WSL), installed by something like:

sudo apt install bison flex

For Windows, they may be downloaded from lexxmark/winflexbison. Unpack the .zip to a suitable location and add the folder to the Path variable using the Edit the system environment variables control panel dialog. This package contains win_bison.exe and win_flex.exe - these names are recognized by cmake.

If you are unable to install bison/flex into system folders or if cmake does not find (the correct version of) bison or flex, you may specify it explicitly when invoking cmake:

cmake -DBISON_ROOT=<bison_install_folder> -DFLEX_ROOT=<flex_install_folder> ...

Note that the bison executable must be located in <bison_install_folder>/bin etc.

Within Visual Studio, you may specify the BISON_ROOT and FLEX_ROOT in the CMake Settings.

Note: You may also use the winflexbison3 package from Chocolatey.

Note: You cannot use WSL installations of bison/flex for MSVC builds (and vice versa) because cmake would not invoke them properly. Therefore you will probably end with two bison/flex installations if you do both MSVC and GCC builds, similarly to having two installations of cmake in this case.

C++20 compiler

MSVC 19.36 (part of Visual Studio 2022 v17.6) and gcc 11.4.0 are verified to be sufficient for the compilation of Cecko.

LLVM

Cecko uses the intermediate code (IR) and JIT x86 code generator (MCJIT) from the LLVM framework. The required version is LLVM 15 or newer; the tests in Recodex run with LLVM 17.0.1.

Although the LLVM team recommends building LLVM from sources, it may be well beyond the capability and patience of most Cecko developers. Therefore, downloading a binary version is recommended.

Note: Cecko requires only headers and static libraries; however, all binary distributions of LLVM include also many executables and dynamic libraries. Some downloads/packages include only the executables which is insufficient for Cecko. This also applies to some downloads at LLVM Pre-Built Binaries

Linux (including Ubuntu 22.04 in WSL) - using system-wide installation

The simplest way to install the required LLVM modules is from a Linux distribution channel, like:

sudo apt install llvm-15

cmake will automatically find LLVM if installed in this way.

Note: Although these libraries are compatible with Debug mode, they do not contain full debug info, for example, stepping into LLVM procedures during debugging of Cecko may be impossible.

Linux - installing and using private LLVM copy

Pre-built LLVM 17.0.1 binaries for Cecko are available here:

Note: These are the same binaries as used in Recodex, compiled with g++ 13.2.0

When building Cecko with a private LLVM installation, cmake must be instructed to find LLVM by setting the LLVM_DIR cmake variable to the value <llvm-install-folder>/lib/cmake/llvm as follows:

cmake -DLLVM_DIR=<llvm-install-folder>/lib/cmake/llvm <...other-arguments...>

Windows binaries

Beware: The Windows versions of Pre-Built Binaries published by LLVM (as of October 2023) are not complete enough for Cecko.

Beware: vcpkg can now install LLVM; however, the installation is implemented by building from sources and may take several hours.

Pre-built LLVM binaries for Cecko are available here:

Visual Studio CMake Configuration CMAKE_BUILD_TYPE Download
x64-Debug x64-Release
(default) (incompatible) Debug llvm-17.0.1-install-x86_64-msvc-19.36-debug.zip (1.2 GB)
(incompatible) (default) RelWithDebugInfo llvm-17.0.1-install-x86_64-msvc-19.36-relwithdebinfo.zip (651 MB)
(incompatible) (works) Release llvm-17.0.1-install-x86_64-msvc-19.36-release.zip (332 MB)

Note: The above binaries are compatible with Visual Studio 2022 v17.2.0 (May 2022) or newer.

When building Cecko, cmake must be instructed to find LLVM by setting the LLVM_DIR cmake variable to the value <llvm-install-folder>/lib/cmake/llvm.

In Visual Studio 2022, the LLVM_DIR variable is set in the CMake variables and cache section of the CMake Settings (CMakeSettings.json); you may run Generate Cache to pre-create this variable before setting it to the correct value.

In VS Code, the LLVM_DIR variable is located in the cmake.configureSettings entry of .vscode/settings.json.

When cmake is invoked from command-line, it may look like:

cmake -GNinja -DLLVM_DIR=<llvm-install-folder>/lib/cmake/llvm <...other-arguments...>	

Note: Make sure you use ninja (-GNinja) as the cmake Generator. Modern versions of cmake for Windows generate Visual Studio projects by default and some commands described in this guide do not work correctly in this mode. If cmake is invoked by Visual Studio, Ninja is enforced by default.

Note: If you compile with Microsoft Visual C++ from command-line, make sure that the x64 version of cl is visible when cmake is invoked (e.g. by starting from the "x64 Native Tools Command Prompt for VS 2022"). The cecko system is incompatible with 32-bit x86 mode which could be the default elsewhere.

Building LLVM (advanced)

If you decide to ignore binary distributions and build LLVM on your own (beware, it may require several hours of downloading and building), the simplest approach is cloning llvm-project into a <llvm-project-folder>:

git clone https://github.com/llvm/llvm-project.git <llvm-project-folder>	

If you want to have exactly the same version as in recodex, checkout the tag llvmorg-17.0.1.

To build and install, create a <llvm-build-folder> and a <llvm-install-folder>, then configure, build and install using:

cd <llvm-build-folder>
cmake -GNinja -DLLVM_PARALLEL_COMPILE_JOBS=8 -DLLVM_PARALLEL_LINK_JOBS=1 \
  -DLLVM_TARGETS_TO_BUILD=X86 -DCMAKE_BUILD_TYPE=Debug \
  -DCMAKE_INSTALL_PREFIX=<llvm-install-folder> \
  <llvm-project-folder>/llvm
cmake --build .
cmake --build . --target install

When building Cecko with the home-built LLVM, define the LLVM_DIR variable as <llvm-install-folder>/lib/cmake/llvm.

See also Building LLVM with CMake.

Using your Cecko repository

For each student registered for this course, a repository was created in gitlab.mff.cuni.cz/teaching/nswi098/2023.

If you cannot see your repository there, it is probably because your login was not registered in gitlab when the repository was created. Ask your teachers in this case.

Your remote repository shall remain private even after your work is finished and points awarded, otherwise you may be deemed guilty if a plagiarism case appears.

You shall clone your repository to a local folder as usual:

git clone <your-repository-url> <local-folder>

Each repository was already initialized by a copy of the skeleton. However, it will not automatically import any changes made to the skeleton after the creation.

Changes to Cecko in 2023/24 wrt. 2022/23

Cecko has moved from LLVM 14 to LLVM 17; it remains compatible with LLVM 15 and 16. Solutions developed with LLVM 14 and older require minor reengineering - see Opaque Pointers.

Merging updates of the skeleton into your code

The actions described in this section will be needed if there is an update (by the teachers) to the skeleton repository after the creation of the students' repositories. There is no planned update in sight at this moment. You may probably safely ignore the actions described below until the moment when skeleton update is really needed (due to a bug or incompatibility with some build environment).

Configuration

At each local folder you use, you need to setup the upstream remote:

git remote add upstream git@gitlab.mff.cuni.cz:teaching/nswi098/cecko/skeleton.git

In TortoiseGit, the remote is created in the config accessible from the Git Sync… dialog through the Manage button. The resulting config shall look like this:

Merging

Whenever you think the skeleton repository may have been updated, execute a pull from upstream:

git pull upstream master

In TortoiseGit, the Pull command may be executed from the Git Sync… dialog after selecting the upstream remote as shown here:

If you encounter conflicts on an upstream pull, it probably means that you added your code to an unexpected place in the skeleton files. Try to resolve it carefully, if in doubts, ask your teachers.

Your work shall be pushed to/pulled from the default origin remote.

Building and running Cecko

Building

Command-line

Create a <build-folder>, by convention as <cecko-folder>/build. Then configure using these commands:

cd <build-folder>
cmake -DCMAKE_BUILD_TYPE=Debug -DLLVM_DIR=<llvm-install-folder>/lib/cmake/llvm <cecko-folder>

The -DLLVM_DIR option is usually not required if LLVM was installed as a system package.

Building is then done by invoking cmake --build . or make in <build-folder>.

The resulting executables are located in <build-folder>/stud-main and invoked like:

cd <build-folder>
<build-folder>/stud-main/cecko1 <cecko-folder>/test/test1.c

Visual Studio 2022

Open the <cecko-folder> with Visual Studio.

When opened for the first time, use PROJECT/CMake Settings to adjust the cmake configuration, the LLVM_DIR variable in particular (see Prerequisites). The default x64-Debug configuration is created automatically; to use WSL, add the WSL-GCC-Debug configuration, adjust it if necessary and switch to it.

See also the related Visual Studio Documentation.

Visual Studio Code

In Windows, install the Visual C++ compiler first. Then start an x64 Developer Command Prompt (which defines the environment where the compiler is visible), then run code from the command prompt. Then follow the documentation for the CMake Tools extension.

Running

Command-line

The resulting executables are located in <build-folder>/stud-main and invoked like:

cd <build-folder>
<build-folder>/stud-main/cecko1 <cecko-folder>/test/test1.c

Visual Studio 2022

Open DEBUG -> Debug and Launch Settings for … (launch.vs.json) and add the following settings into the x64 configuration(s) in use:

"currentDir": ".",
"args": [ "test\\test1.c" ]

For WSL configuration(s), use:

"cwd": ".",
"args": [ "test/test1.c" ]

For remote Linux configuration(s), adjust remoteCMakeListsRoot in CMakeSettings.json so that it does not use the $HOME variable, then use:

"cwd": "${cmake.remoteCMakeListsRoot}",
"args": [ "test/test1.c" ]

Note that WSL and remote Linux use the same debug config ("type": "cppgdb"); therefore, there is probably no setting that works simultaneously in both cases.

See also the related Visual Studio Documentation.

Visual Studio Code

T.B.D.