Assignments - General information

Testing environment

Log in via ssh to parlab.ms.mff.cuni.cz:42222 (fingerprints) using your CAS (SIS) credentials.

Use the parlab machine only for file manipulation (git, vim, ...) and scripting. Never use the parlab machine directly for compiling or running - use the parlab workers instead.

The traffic at parlab workers is controlled by SLURM; you need to understand at least the basics described here: parlab & SLURM - crash course. You may ignore the parts specific to GPUs and gpulab.

All machines at parlab are equipped with cmake 3.27 and g++ 13.2, capable of C++23.

Parlab partitions relevant for this course

SLURM partitions | CPU | Vector instruction sets
mpi-homo-long, mpi-homo-short | Intel Xeon Gold 6130 | SSE4.2, AVX2, AVX512(CD,DQ,BW,VL)
phi-long, phi-short | Intel Xeon Phi 7230 | SSE4.2, AVX2, AVX512(CD,PF,ER)

Only users registered for this course will have access to the -long and -short partitions. Unregistered users can use the -ffa partitions.

Frameworks and skeletons

All homework assignments share the same git project. This makes it easy to share some files across the assignments.

Usage:

Handling different vector instruction sets

All your solutions shall be templated classes with a template argument named policy. This argument specifies the desired vector platform. In the skeleton, you will find the following four policy classes: policy_scalar, policy_sse, policy_avx, and policy_avx512. You may fill the policy classes with whatever contents you need, or you may leave them empty and use explicit specializations of your class.

In addition, the parts of source code which use AVX512 intrinsics must be enclosed inside "#ifdef USE_AVX512" directives to avoid compilation errors on platforms that do not support them. The same applies to AVX intrinsics and USE_AVX.

USE_AVX512 means that AVX512CD is also available. On the other hand, AVX512BW instructions may be used only if enclosed inside "#ifdef USE_AVX512BW".

Your solution shall support all the four policy classes; however, you are not required to actually use the respective vector instructions. For instance, the implementation for <policy_avx512> may be identical to the implementation with <policy_sse> or <policy_scalar>. (E.g., for the Macroprocessor, it is unlikely that vectorizing brings any measurable speed-up.)
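The rules above can be pictured with a minimal sketch. The class name sum_solver and its interface are hypothetical and are not part of the skeleton; only the four policy names and the USE_AVX512 macro come from the text. The generic template serves every policy, and an explicit specialization for policy_avx512 is compiled only when USE_AVX512 is defined.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#ifdef USE_AVX512
#include <immintrin.h>
#endif

// The four policy classes named in the text, left empty here.
struct policy_scalar {};
struct policy_sse {};
struct policy_avx {};
struct policy_avx512 {};

// Hypothetical solution class: the generic version is plain scalar code
// and is used for every policy that has no explicit specialization.
template <typename policy>
class sum_solver {
public:
    static std::string backend() { return "generic"; }

    // Wrapping 32-bit sum of n values.
    static std::uint32_t sum(const std::uint32_t* data, std::size_t n) {
        assert(data != nullptr || n == 0);
        std::uint32_t total = 0;
        for (std::size_t i = 0; i < n; ++i) total += data[i];
        return total;
    }
};

#ifdef USE_AVX512
// Explicit specialization; compiled only where AVX512 is enabled.
template <>
class sum_solver<policy_avx512> {
public:
    static std::string backend() { return "avx512"; }

    static std::uint32_t sum(const std::uint32_t* data, std::size_t n) {
        assert(data != nullptr || n == 0);
        __m512i acc = _mm512_setzero_si512();
        std::size_t i = 0;
        for (; i + 16 <= n; i += 16)   // 16 lanes of 32 bits per step
            acc = _mm512_add_epi32(acc, _mm512_loadu_si512(data + i));
        std::uint32_t total =
            static_cast<std::uint32_t>(_mm512_reduce_add_epi32(acc));
        for (; i < n; ++i) total += data[i];   // scalar tail
        return total;
    }
};
#endif
```

Without USE_AVX512, sum_solver<policy_avx512> silently falls back to the generic template, which matches the statement that the AVX512 implementation may be identical to the scalar one.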

The USE_AVX, USE_AVX512, and USE_AVX512BW flags are set by the cmake configuration files, consistently with the command-line compiler options that enable the corresponding instructions.

CPU generation | MSVC compiler options | GCC compiler options | Policies tested
SSE 4.2 | (none) | -msse4.2 | (policy_scalar), policy_sse
AVX2 | /arch:AVX2 -D USE_AVX | -mavx2 -D USE_AVX | (policy_scalar), policy_sse, policy_avx
AVX512 (Xeon Phi) | not supported | -mavx512cd -D USE_AVX -D USE_AVX512 | (policy_scalar), policy_sse, policy_avx, policy_avx512
AVX512 (Other) | /arch:AVX512 -D USE_AVX -D USE_AVX512 -D USE_AVX512BW | -mavx512cd -mavx512bw -D USE_AVX -D USE_AVX512 -D USE_AVX512BW | (policy_scalar), policy_sse, policy_avx, policy_avx512

policy_scalar is not tested in assignments which rely on vectorizing (levenstein, matrix)

On Linux machines, the cmake configuration files will automatically detect the CPU capability using /proc/cpuinfo.
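The actual detection lives in the cmake configuration files and is not reproduced here. As a rough illustration only, the following sketch shows how the three flags could be derived from the "flags" line of /proc/cpuinfo. The type vector_caps, both function names, and the exact mapping from CPU flags to USE_* settings are assumptions made for this example.

```cpp
#include <cassert>
#include <istream>
#include <set>
#include <sstream>
#include <string>

// Hypothetical result type mirroring the three USE_* flags.
struct vector_caps {
    bool use_avx = false;
    bool use_avx512 = false;
    bool use_avx512bw = false;
};

// Parses one "flags : ..." line in the format used by /proc/cpuinfo.
// Any other line yields all-false capabilities.
inline vector_caps caps_from_flags_line(const std::string& line) {
    vector_caps caps;
    const std::string::size_type colon = line.find(':');
    if (line.compare(0, 5, "flags") != 0 || colon == std::string::npos)
        return caps;

    std::istringstream words(line.substr(colon + 1));
    std::set<std::string> flags;
    for (std::string w; words >> w;) flags.insert(w);

    // Assumed mapping: USE_AVX needs avx2; USE_AVX512 needs avx512f and
    // avx512cd; USE_AVX512BW additionally needs avx512bw.
    caps.use_avx = flags.count("avx2") > 0;
    caps.use_avx512 = caps.use_avx && flags.count("avx512f") > 0 &&
                      flags.count("avx512cd") > 0;
    caps.use_avx512bw = caps.use_avx512 && flags.count("avx512bw") > 0;
    return caps;
}

// Scans a cpuinfo-like stream and evaluates the first "flags" line found.
inline vector_caps caps_from_cpuinfo(std::istream& cpuinfo) {
    for (std::string line; std::getline(cpuinfo, line);)
        if (line.compare(0, 5, "flags") == 0)
            return caps_from_flags_line(line);
    return {};
}
```

On a real machine, the stream would be a std::ifstream opened on /proc/cpuinfo. With this mapping, a Xeon Phi style flag list (avx512cd but no avx512bw) gives USE_AVX512 without USE_AVX512BW.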

On Windows, there is no reliable way of detection. The defaults (USE_AVX=True, USE_AVX512=False, USE_AVX512BW=False) are set in the cmake cache; they may be changed in Project/CMake Settings in Visual Studio or using cmake-gui. (They must be changed for very old CPUs which do not support AVX2.)

Output

The testing framework produces standard output in tab-separated format (without any header). The columns are the following:

The testing framework produces output in two phases:

Testing procedure

The output shows all the individual tests performed. The meaning of the parameters is described with each assignment.

The testing is performed in parallel, to simulate the workload on the other CPU cores during a typical parallel computation. Each thread performs the same sequence of tests; however, each thread uses a different seed for the random generation of input data (hence the different outputs/checksums). The threads run the tests in lockstep, i.e. there is a global rendez-vous after finishing each test. Due to different inputs and other variations, different threads may run for slightly different times.
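The lockstep scheme can be sketched as follows. This is not the framework's code: the class rendezvous, the function run_lockstep, the seed values, and the checksum loop standing in for a real test are all assumptions made for illustration. Each thread gets its own seed, runs the same sequence of tests, and waits for all other threads after every test.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <random>
#include <thread>
#include <vector>

// Minimal reusable rendez-vous point for a fixed number of threads.
class rendezvous {
public:
    explicit rendezvous(std::size_t parties) : parties_(parties) {
        assert(parties > 0);
    }

    // Blocks until all parties have arrived, then releases them together.
    void arrive_and_wait() {
        std::unique_lock<std::mutex> lock(mutex_);
        const std::size_t my_generation = generation_;
        if (++waiting_ == parties_) {
            waiting_ = 0;
            ++generation_;          // opens the gate for this round
            released_.notify_all();
        } else {
            released_.wait(lock, [&] { return generation_ != my_generation; });
        }
    }

private:
    std::mutex mutex_;
    std::condition_variable released_;
    std::size_t parties_;
    std::size_t waiting_ = 0;
    std::size_t generation_ = 0;
};

// Runs `tests` tests on `threads` threads in lockstep.
// Returns checksums[thread][test].
inline std::vector<std::vector<std::uint32_t>> run_lockstep(std::size_t threads,
                                                            std::size_t tests) {
    std::vector<std::vector<std::uint32_t>> checksums(
        threads, std::vector<std::uint32_t>(tests));
    rendezvous sync(threads);
    std::vector<std::thread> pool;
    for (std::size_t t = 0; t < threads; ++t) {
        pool.emplace_back([&, t] {
            // Different seed per thread, hence different data and checksums.
            std::mt19937 rng(static_cast<std::uint32_t>(1000 + t));
            for (std::size_t k = 0; k < tests; ++k) {
                std::uint32_t sum = 0;   // stand-in for a real test
                for (int i = 0; i < 1000; ++i)
                    sum += static_cast<std::uint32_t>(rng());
                checksums[t][k] = sum;
                sync.arrive_and_wait();  // global rendez-vous after each test
            }
        });
    }
    for (std::thread& th : pool) th.join();
    return checksums;
}
```

Because the seeds are fixed per thread index, two calls with the same arguments return identical checksums, while the rows for different threads are computed from different random data.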

To mitigate the effect of context-switching after the rendez-vous and of different run times, each test is actually run three times (with identical data) in each thread. Only the middle of the three runs is measured (and printed); the first and the last run are set to last only 20% of the measured run (by manipulating the auto-adjusting parameter).

The triple run is implemented together with the loop over the auto-adjust parameter, by a range-based for like this (from macromain.cpp):

for (auto&& ctx5 : auto_measurement(ctx4, 1024))

The loop itself is entered for every auto-adjusting group of tests; the number of iterations of the loop is three times the number of auto-adjusting parameter values used. The time measurement (implemented inside the * and ++ operators of the loop range iterators) is thus done on the 2nd, 5th, 8th, etc. iterations (only the last measurement is actually used in the final statistics). Note that this complex behavior is observable during debugging.
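The iteration pattern described above can be imitated by a small range class. The sketch below is hypothetical and does not reproduce auto_measurement or its arguments; the names triple_run_range and run_context, and the rounding of the 20% runs, are assumptions. It yields three iterations per parameter value, shortens the first and third run to one fifth, and takes the time between operator* and operator++ only on the middle run.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical per-iteration context handed to the loop body.
struct run_context {
    std::size_t param;   // auto-adjusting parameter for this run
    bool measured;       // true only for the middle run of each triple
};

class triple_run_range {
public:
    using timer_clock = std::chrono::steady_clock;

    explicit triple_run_range(std::vector<std::size_t> params)
        : params_(std::move(params)) {}

    class iterator {
    public:
        iterator(triple_run_range* owner, std::size_t step)
            : owner_(owner), step_(step) {}

        // Called at the start of each iteration: picks the parameter and
        // starts the timer.
        run_context operator*() {
            assert(step_ < 3 * owner_->params_.size());
            const std::size_t full = owner_->params_[step_ / 3];
            const bool middle = (step_ % 3 == 1);
            start_ = timer_clock::now();
            // Warm-up and cool-down runs get about 20% of the work.
            return run_context{
                middle ? full : std::max<std::size_t>(1, full / 5), middle};
        }

        // Called at the end of each iteration: stores the elapsed time,
        // but only for the middle run; later values overwrite earlier ones.
        iterator& operator++() {
            if (step_ % 3 == 1)
                owner_->last_measured_ = timer_clock::now() - start_;
            ++step_;
            return *this;
        }

        bool operator!=(const iterator& other) const {
            return step_ != other.step_;
        }

    private:
        triple_run_range* owner_;
        std::size_t step_;
        timer_clock::time_point start_{};
    };

    iterator begin() { return iterator(this, 0); }
    iterator end() { return iterator(this, 3 * params_.size()); }

    // Duration of the most recent measured (middle) run.
    timer_clock::duration last_measured() const { return last_measured_; }

private:
    std::vector<std::size_t> params_;
    timer_clock::duration last_measured_{};
};
```

A loop such as "for (auto&& ctx : range)" over two parameter values therefore executes six times, and the measured iterations are the 2nd and the 5th. Stepping through such a loop in a debugger shows the same repeated entries into the body.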

Command-line arguments

The testing framework compiled together with your code supports the following command-line arguments:

In addition, each assignment has its specific command-line arguments which influence the testing sequence or enable debugging outputs. See the respective assignment pages.

Advanced options:

The *gold*.cpp files were obtained using these options. Timing files for other machines may be produced by --code-time, compiled and linked with the project. The framework will then compare the timings against them whenever run on the same machine name.

The program may also be run via the command "make test". In this case, command-line arguments may be passed by adding them to the MAKE_TARGET call in "CMakeLists.txt", e.g.:

MAKE_TARGET("MACRO" "macro" "--dump=on")