Mlaskal compiler support classes
Assignment 4

The mlc::symbol_tables class holds all symbol tables. The table structure reflects directly the properties of the Mlaskal language. The table object is accessible in flex/bison C-code fragments via the pointer ctx->tab.

Lexical analysis

There are 4 lexical symbol tables which store the texts or binary values of respective lexical elements. All occurrences of a text/value are matched together, ignoring any scopes. Further values may be added during semantic analysis as results of constant expressions.

All four tables are instances of the mlaskal::lit_storage template. Table entries are inserted/matched using the mlaskal::lit_storage::add function which returns the type mlc::ls_id_index (or similar). This type is comparable (therefore it may serve as key for semantic tables) and the associated text/value may be accessed by the operator *.

Semantic analysis

Semantic tables are structured according to the naming and scoping rules of the Mlaskal language. Therefore, there are two parts: Entries denoted by numbers (i.e. labels) and entries denoted by identifiers (i.e. constants, types, variables, functions, and procedures). Each part is divided into two layers - global and local entries. Insert operations on these tables automatically report errors for identifiers declared twice in the same scope (therefore, each inserting function requires a line number corresponding to the position of the declaration). Search operations do not report errors but return a null pointer for undefined identifiers - corresponding error messages must be reported by the caller.

Scoping rules

When inserting entries into tables, the same functions are used for both global and local entries. Similarly, search functions first try to locate a local entry and continue to global level if local entry is not found. For correct functionality of these mechanisms, the following discipline must be maintained:

The symbol tables automatically deduce the following information:

Table of labels

Label declarations (label 123;) are saved using mlc::symbol_tables::add_label_entry which requires (besides a line number) the label name (i.e. the index of an integer constant stored in ls_int) and an unique code-position placeholder, obtained by calling mlc::symbol_tables::new_label .

Label entries are searched in symbol tables using mlc::symbol_tables::find_label.

Type descriptors and type symbols

All library types and all types constructed in the program being compiled are represented by type descriptors. Type descriptors do not contain the names of the types and if they refer one to another, the reference is a direct link which does not mention any name (although a name is typically used in the source code, e.g. in array[1..9]of T). Thus, the type descriptors denote the structure of the types, not their names. There is, however, no mechanism to match types by their structure - each time a function to create a type descriptor (e.g. mlc::symbol_tables::create_array_type) is called, a new, independent, type descriptor is created. This behavior matches the definition of the Mlaskal language: If a type construction (like array or record) is repeated twice in the source code, it generates two incompatible types even if they are identical by construction. Type descriptors for library types (boolean, integer, real, string) are pre-created during initialization of the symbol tables and no other descriptors for these built-in types shall be created.

Type symbols represent association of type names to type descriptors. Type symbols for the library types are pre-created and associated with their pre-created type descriptors. Additional type symbols are created (using mlc::symbol_tables::add_type) for each named type declaration in the source code. If such type declaration contains a type construction (like array or record), a type desciptor (or more than one) shall be also created and the type symbol shall refer to it. Otherwise (type X = Y;), the new type symbol will refer to a previously existing type descriptor, i.e. the type symbols for X and Y will refer to the same type descriptor.

Technically, type descriptors are objects of a class derived from mlc::abstract_type. The objects are owned by the symbol tables (which will arrange their destruction on exit) and are immutable. Other symbol table objects (e.g. type symbols or other type descriptors) link to the type descriptor objects using the type mlc::type_pointer.

A type symbol is logically a pair of an identifier (ls_id_index) and a link (type_pointer) to a type descriptor. However, the physical structure is more complex because an identifier may also denote a different language element (e.g. a variable). Therefore, when searching for the meaning of an identifier, symbol tables (mlc::symbol_tables::find_symbol) return a pointer (mlc::symbol_pointer) to the abstract class mlc::abstract_symbol. In the case of type identifier, the returned type will point to an object of the mlc::type_symbol class which contains the link (type_pointer) to the associated type descriptor.

The conversion of a symbol_pointer to a specific pointer to type_symbol is performed by a call to mlc::abstract_symbol::access_type. If the symbol_pointer points to a different object than a type_symbol, the access_type function will return null pointer.

As a convenience, all pointers to symbols, as well as all pointers to type descriptors, support dereferencing null pointers: Calling a method via a null pointer will invoke a dummy implementation of that method which will typically return another null pointer. Thus, the expression p->access_type()->type() correctly returns a null type_pointer if p is null or p->access_type() returns null.

For example, the bison rule

type_declaration: IDENT ASSIGN IDENT;

may be accompanied with the following semantic fragment

if (!sp || (sp->kind() != SKIND_TYPE)) { message(DUERR_NOTTYPE, @3, * $3); }
else {
mlc::type_symbol_reference tr = sp->access_type();
mlc::type_pointer tp = tr->type();
ctx->tab->add_type(@1, $1, tp);
}

Here, the mlc::abstract_symbol::kind function returns an enumeration value corresponding to the concrete class of the symbol object; thus, the code reports an error if either the identifier $3 was not found or it is not a type identifier.

The null behavior of symbol pointers allows a shorter implementation

auto ts = ctx->tab->find_symbol($3)->access_type();
if (! ts) { message(DUERR_NOTTYPE, @3, * $3); }
ctx->tab->add_type(@1, $1, ts->type());

In this case, the type symbol is inserted even in the case of an error - this is preferred behavior, because it will prevent error messages being triggered by subsequent references to the type declared here.

Note, however, that there is no rule for IDENT ASSIGN IDENT in the Mlaskal grammar. A type construction must be allowed on the right-hand side, like:

type_declaration: IDENT ASSIGN type;
type: IDENT
| RECORD elements END;

Because the non-terminal "type" may contain both an identifier of a type and a construction of a new type, the corresponding attribute passed along the non-terminal must be a type_pointer. Thus, the calls to find_symbol, access_type, type and error messaging shall be associated with the type:IDENT rule, while the add_type call will be the only action associated with the type_declaration rule (note that duplicity checks are done by the add_type function). Finally, the RECORD ... END rule will contain a call to create_record_type.

When creating record types, the list of record fields is created beforehand as type mlc::field_list_ptr and later passed to the create_record_type function. An empty list of fields is created by mlc::create_field_list and individual fields are appended by calls to mlc::field_list_body::append_field. Two field lists may be concatenated using mlc::field_list_body::append_and_kill - note that the field entries from the second operand are moved to the first operand, so the second operand becomes an empty field list. The field_list_ptr has shared-ownership semantics and it will properly destroy the field_list if the ownership is not passed to the symbol tables by a call to create_record_type.

For arrays, the type descriptors typically contain two layers, created by a call to mlc::symbol_tables::create_range_type followed by a call to mlc::symbol_tables::create_array_type. Note that the grammar also allows naming of the range type and using its name in the construction of the array, which will separate the two create_ calls into two different declarations.

Multi-dimensional arrays are represented by several array type descriptors linked together, array [T1, T2] of T3 is by definition equivalent to array [T1] of array [T2] of T3.

Named constants

Named constants are entered into symbol tables using the following functions

In the case of mlc::symbol_tables::add_const_bool the value is passed directly as a bool; otherwise, indexes to corresponding lexical tables are used.

To determine the value of a constant, its type must be determined first. After a call to mlc::symbol_tables::find symbol, all four kinds of constant symbols will identify themselves as mlc::SKIND_CONST. Then, a call to access_const()->type() will return a type_pointer which must be classified using mlc::abstract_type::cat. If it returns, e.g., mlc::TCAT_INT, subsequent call to access_const()->access_int_const()->int_value() on the symbol pointer will return the value of the integer constant (as a ls_int_index).

The following example illustrates the application of unary minus to an integer constant.

constant_declaration: IDENT = MINUS IDENT;
if ( sp->kind() != SKIND_CONST ) { message( DUERR_NOTCONST, @4, * $4); }
if ( sp->access_const()->type()->cat() == TCAT_INT )
{
mlc::ls_int_index val = sp->access_const()->access_int_const()->int_value();
mlc::ls_int_index nval = ctx->tab->ls_int().add( - *val);
ctx->tab->add_const_int( @1, $1, nval);
}

The real grammar is, however, more complex, and the code should also handle TCAT_REAL and signalize errors for TCAT_BOOL and TCAT_STR.

Variable declarations

Variables are declared using mlc::symbol_tables::add_var whose arguments include the type_pointer which must be distilled from the right-hand-side of the declaration similarly to the case of type declarations. Note that function/procedure formal arguments are not entered via add_var; instead, the symbol tables will automatically create their locally visible declarations during a call to enter, using the information from the function declaration.

Functions and procedures

Before the call to mlc::symbol_tables::enter, the declaration of the function/procedure shall be entered using

Functions are allowed to return only scalar types, i.e. the four built-in types and user-defined range types (it shall be tested using mlc::abstract_type::cat).

Argument lists are created before the call to add_fnc/add_proc, similarly to field lists for create_record_type. The type mlc::parameter_list_ptr is created by mlc::create_parameter_list and manipulated via

The order of arguments in the list must reflect the order in the source code. The call to add_fnc/add_proc passes the ownership of the list to the symbol tables; lists which were not consumed by add_fnc/add_proc will be automatically deallocated using shared_ptr semantics. (Note that unique_ptr semantics would be sufficient; unfortunately, bison cannot handle move-only types.)