About

This is an informational web page for the Compiler Principles course (NSWI098), which is being taught at the Faculty of Mathematics and Physics, Charles University. The lectures are being held in S3, from 14:00 to 15:30 every Thursday. The seminars are in S4 after the lecture (from 15:40).

You can find here:

Contact

The course is currently led by Jakub Yaghob and David Bednárek. Please direct inqueries of a specific topic (or home assignment) to the teacher who presented the topic.

Latest Updates

18.9.2017 Academic year 2017/2018 preparation.
23.7.2016 A new web page was created.

Grading

The grading criteria for this course are summarized in the following. Please, read them carefully. In case any rules are not clear, contact Dr. Jakub Yaghob.

Home Assignments

There will be 6 home assignments which are designed to incrementally build a compiler for simplified Pascal (called Mlaskal).

You may receive up to 100 points for assignments #1-#3 and up to 150 points for assignments #4-#6. The sum of your points will determine your final mark as follows:

pointsmark
700 or more1 (excellent)
550-6992 (well done)
450-5493 (OK)
449 or lessfailed

Furthermore, 450 points are required to receive a credit for the seminar (which is also required). If you are not satisfied with your mark, but you have at least 450 points (an thus the credit), you may request a written examination where your mark will be determined. Once subscribing for the exam, your mark received from the assignments is no longer valid — i.e., you may fail the exam even if you have enough points from the assignments.

Each assignment has a strict deadline. Once the deadline is reached, all assignments are collected and graded. You may submit your solution after the deadline and ask for (re)evaluation, but you will receive penalty 10 points per day for late submission, which is additionally subtracted from points obtained for the assignment.

Lectures

The domain of compilers is relatively steady in the computer science, at least for the compiler front-end. If you have any questions or suggestions related to the lectures, please contact the teachers.

The presented schedule is only informational. Teaching time for a matter may vary slightly depending on the student's grasp.

4.10.2018 Compiler - an introduction PPTX File Icon 01-comp.pptx
11.10.2018 Lexical analysis PPTX File Icon 02-la.pptx
18.10.2018 Syntax analysis - an introduction PPTX File Icon 03-sxa.pptx
25.10.2018 Syntax analysis - top-down parsing PPTX File Icon 03-sxa.pptx
1.11.2018 Syntax analysis - bottom-up parsing PPTX File Icon 03-sxa.pptx
8.11.2018 Intermediate code PPTX File Icon 04-ic.pptx
15.11.2018 Semantic analysis PPTX File Icon 05-saatr.pptx
29.11.2018 Intermediate code generation PPTX File Icon 06-genic.pptx
6.12.2018 High-level optimizations PPTX File Icon 07-opt.pptx
13.12.2018 CPU architectures PPTX File Icon 08-cpu.pptx
20.12.2018 Code generation PPTX File Icon 08-cpu.pptx
3.1.2019 Runtime PPTX File Icon 09-rt.pptx
10.1.2019 Interpreted languages PPTX File Icon 10-interp.pptx

Assignments

Motto: Kde to žije, tam je Mlaskal. (Kate)

Global assignment

The Mlaskal (simplified Pascal) language is defined by a set of syntax diagrams:

Program JPEG File Icon mlaskal-prog.jpeg
Block JPEG File Icon mlaskal-block.jpeg
Procedure, function JPEG File Icon mlaskal-proc.jpeg
Type JPEG File Icon mlaskal-type.jpeg
Special types JPEG File Icon mlaskal-spectype.jpeg
Statement JPEG File Icon mlaskal-stmt.jpeg
Variable JPEG File Icon mlaskal-var.jpeg
Expression JPEG File Icon mlaskal-expr.jpeg
Constants JPEG File Icon mlaskal-const.jpeg
Numbers JPEG File Icon mlaskal-nums.jpeg
Identifiers JPEG File Icon mlaskal-id.jpeg

Please note some further remarks:

Links to the required software

Homepages:

Windows native ports:

Developing and testing

Framework+skeleton: ZIP File Icon mlaskal.zip (version 566 2018-10-18 new: LF-only delimiters)

The framework contains the following folders:

Assignments 1 and 2

For the first two assignments, you will submit the following files:

In the private-src folder, you will find compilable and runnable skeletons of these files. You shall edit them in place - the makefile and the projects refer to this location.

For Assignment 1, you will probably leave the C++ files empty. For Assignment 2, you may use them for C++ routines called from du12l.lex - in this case, add #include "du12sem.hpp" to du12l.lex.

Assignments 3 to 6

For the last four assignments, you will submit the following files:

When moving from Assignment 2 to 3, replace the skeletons du3456l.lex, du3456sem.hpp, and du3456sem.cpp by copies of your du12l.lex, du12sem.hpp, and du12sem.cpp files. After copying, change the following:

The makefile

For assignment 1, run make du1 to build and make test1 to run the respective tests. Use similarly for assignment 2 to 6.

Note that the du1 and du2 targets produce the same executable file tmp/du12, similarly, the du5 and du6 targets produce tmp/du56. The test targets are however different.

The Visual Studio projects

Start the Visual Studio with the mlaskal.sln file.

Edit only the files in folders marked as "STUDENTS". These folders are only in the projects du12, du12grm, du3456, and du3456grm. Don't touch the files in "READONLY" or "GENERATED" folders.

For Assignments 1 and 2, set the du12 project as "Start up". The du12 project depends on du12grm (which runs flex) and on mlaskallib.

For Assignment 3 use the du3 project as "Start up". It depends on du3456, du3456grm (runs flex and bison) and on mlaskallib.

For Assignment 4 use the du4 project as "Start up". Dependences are the same as for du3.

For Assignments 5 and 6 use the du56 project as "Start up". Dependences are the same.

Test files

The mlaskal source files have the suffix .mls, the corresponding expected outputs are in the .out files.

The test files whose names end with "r" contain records - ignore them this year.

The test files whose names end with "a" contain arrays, the files ending with "c" contain constant definitions - together they form the bonus 50 points (in Assignments 4 to 6).

The test files whose names end with "n" contain intentional semantic errors and are expected to fail during compilation.

Submitting and evaluation

You shall submit your solution into Recodex.

Note: For registering in Recodex, you need a working email address registered in CAS.

Recodex will compile it using GCC (probably 7.3). Your solution should not use any 3rd party libraries or code. Only standard C++17 libraries should be used.

Recodex will compile your solution with the same library and will apply the same tests as in the framework. (However, Recodex will use a different build system.)

Compilation and tests in Recodex takes about 4 minutes - please be patient (and don't ask why).

Recodex will insist on strict equivalence of your solution outputs to the gold files. However, we will manually inspect results of all tests which can compile and run but fail in the Recodex judge phase. Therefore, you may be assigned some points even if Recodex assigns zero.

Each home assignment has a maximal achievable points (100 points for assignments 1-3, 150 points for assignments 4-6). Any difference (including your debug messages) between your compiler output and a provided correct output is penalized. The penalization depends on the severity of the problem.

(1) Lexical analysis I

  • Look at the diagrams and determine, what is a lexical element and what is not.
  • Extend the du1l.lex file, so that the resulting scanner is capable of recognizing all correct lexical elements and returns the objects which represent them at the interface between the lexer and the parser.
  • In this task, you don't need to handle strings and comments and you don't have to handle values of identifiers and numeric literals.
  • Names of all tokens are listed in dummyg.y.
  • All objects representing tokens are created using a call to a function like yy::mlaskal_parser::make_BEGIN(...). These functions are generated from dummyg.y by bison. All these functions have a parameter - a line number - pass ctx->curline here.
  • Some tokens (like >,<,<>,<=.>= or +,-) are grouped together. The object representing the token is shared across each group (it will become the same terminal in the grammar); individual tokens are distinguished using an additional parameter of the object (an enumeration type associated to the group, like mlc::DUTOKGE_OPER_REL - see dutable.hpp).
  • Identifier and literals have an additional parameter of the object - the text or value - pass the default value of the type here, like mlc::ls_id_index()

Example input (*.mls) and outputs (*.out) can be found in the tests directory.

(2) Lexical analysis II

  • Recognize strings as lexical elements. Strings may contain two quotation marks ('') which should be returned as one quote character ' in the result. Strings can not contain end of the line (\n). You will probably need 'start conditions' feature of FLEX to complete this task.
  • Handling of comments. Comments are enclosed in { and }. Comments can contain any sequence of all characters. Comments can be nested (unlike traditional Pascal) - parentheses have to be paired correctly. You will certainly need 'start conditions' to handle this.
  • Maintain line numbers. You must not use %option yylineno. Use ctx->curline as the line counter, just ensure that it is incremented on every end-of-line and pass the value to every make_ function. The first line is number 1 (ctx->curline is already initialized to 1).
  • Store integral and real constants, strings and identifiers into the tables of the compiler. Functions like ctx->tab->ls_int().add() are used to add the values into the corresponding tables; there are four such tables for integers (ls_int), reals (ls_real), strings (ls_str), and identifiers (ls_id). The functions return a magic "index" that should be passed to the corresponding make_ functions. Identifiers must be converted to uppercase before add().
  • Detection and recovery of lexical errors, like end of line in the middle of a string, end of file in comment, incorrectly paired { }, too large integers (store only the lowest 31 bits of the number - note that behaviour of atoi function is undefined for integers greater than MAX_INT),...
  • Detection of malformed lexical elements, like 12xy, which is not UINT and IDENTIFIER, but incorrectly written UINT (value 12), because there is no space between 12 and xy. You should read the whole malformed lexical element, but the returned value is a convertible value for the given lexical element, in our case it is 12. The same situation goes for REAL. If an overflow of the convertible value occurs, report the malformed input error first, then overflow error.

(3) Parsing

  • Rewrite syntactic diagrams to Bison rules in du3g.y. Ignore the parts in blue rectangles (record type declaration and dot operator).
  • The declarative part of the du3g.y file is already prepared and you should not change it.
  • Remove all shift/reduce and reduce/reduce conflicts.

(4) Semantic analysis - declarations

  • Add C++ code fragments into du4g.y to decode declarations and store the extracted information to system tables.
  • 100 points: Declaration of procedures, functions, their parameters, variables (only simple types) and labels.
  • 150 points: All declarations, i.e., in addition, constant declarations, type declarations ( including range types and arrays ), and variables of range and array types .

(5) Semantic analysis - expressions and functions

    Implement semantic analysis for simple statements (see below) and generate stack oriented intermediate code from them.

  • 100 points:
    • Assignment statement (to a scalar variable or to the identifier of the current function)
    • Procedure call (with parameters passed by value)
    • Sequence of such statements
    These statements contain expressions built from these elements:
    • Name of a scalar variable (or function argument)
    • Boolean, number, or string constant
    • Function call (with parameters passed by value)
    • Unary + and - for int (instruction MINUSI) and real (instruction MINUSR)
    • Binary + for int-int (instruction ADDI), real-real (instruction ADDR), real-int, int-real (instructions CVRTIR and ADDR), and str-str (string concatenation - instruction ADDS)
    • Binary - and * for int-int, real-real, real-int, int-real (instructions SUBI, SUBR, MULI, MULR, possibly CVRTIR)
    • Binary / for int-int, real-real, real-int, int-real with real result (instruction DIVR, possibly CVRTIR)
    • Binary mod, div for int-int with int result (instructions MODI, DIVI)
  • 150 points: In addition, named constants in expressions and the operator [] for accessing array elements . Assignment statements, argument passing, and return values operate on scalars only (within assignment 5).

The compiler now produces intermediate code and runs it.

File suffix conventions:

  • .out - Compiler diagnostic output plus the output of the compiled program.

(6) Semantic analysis - control-flow statements, Boolean expressions, pass by reference

Complete the semantic analysis and code generation so that the language is completely covered.

  • 100 points:
    • Passing arguments by reference (using GREF, LREF, LLDP, XLD* and XST* instructions)
    • if-then[-else], while-do, repeat-until, and for-[to|downto]-do statements (using JF, JT, JMP instructions)
    • goto and labeled statements (using JMP)
    • Unary Boolean not (using NOT)
    • Binary Boolean and/or (using AND/OR - no short-circuit evaluation)
    • Comparison operators =, <>, <, >, <=, >= over two Boolean, int, real, and string arguments or one int and one real arguments (using EQ*, NE*, LF*, GT*, LE*, GE* and CVRTIR)
  • 150 points: In addition, passing arrays (including arrays which are parts of multi-dimensional arrays) in the following contexts:
    • Subprogram arguments passed by value
    • Subprogram arguments passed by reference
    • Assignment statement
    When copying an array, the compiler knows the size; therefore, it can emit the necessary number of instructions to copy the array (instead of generating a loop). (The instructions SLD* or SST* may help in some cases.)