Log in via ssh to parlab.ms.mff.cuni.cz:42222 (fingerprints) using your CAS (SIS) credentials.
Use the parlab machine only for file manipulation (git, vim, ...) and scripting. Never use the parlab machine directly for compiling or running - use the parlab workers instead.
The traffic at parlab workers is controlled by SLURM, so you need to understand at least the basics described here: parlab & SLURM - crash course. You may ignore the parts specific to GPUs and gpulab.
All machines at parlab are equipped with cmake 3.27 and g++ 13.2, capable of C++23.
Parlab partitions relevant for this course
SLURM partitions | CPU | vector instruction sets |
---|---|---|
mpi-homo-long, mpi-homo-short | Intel Xeon Gold 6130 | SSE4.2, AVX2, AVX512(CD,DQ,BW,VL) |
phi-long, phi-short | Intel Xeon Phi 7230 | SSE4.2, AVX2, AVX512(CD,PF,ER) |
Only users registered for this course will have access to the -long and -short partitions. Unregistered users can use the -ffa partitions.
All homework assignments share the same git project. This makes it easy to share files across the assignments.
Usage:
git remote add upstream git@gitlab.mff.cuni.cz:teaching/nprg054/asgn.git
git pull upstream
git pull upstream +master:<branch-name>
cmake <project-root-folder>
make <target-name>
./<target-name>
cmake -D CMAKE_BUILD_TYPE=Release <project-root-folder>
All your solutions shall be templated classes with the template argument policy. This argument is used to specify the desired vector platform. In the skeleton, you will find the following four policy classes: policy_scalar, policy_sse, policy_avx, and policy_avx512. You may fill the policy classes with whatever contents or you may leave them empty and use explicit specializations of your class.
In addition, the parts of source code which use AVX512 intrinsics must be enclosed inside "#ifdef USE_AVX512" directives to avoid compilation errors on platforms that do not support them; the same applies to AVX code and USE_AVX.
USE_AVX512 means that AVX512CD is also available. On the other hand, AVX512BW instructions may be used only if enclosed inside "#ifdef USE_AVX512BW".
Your solution shall support all the four policy classes; however, you are not required to actually use the respective vector instructions. For instance, the implementation for <policy_avx512> may be identical to the implementation with <policy_sse> or <policy_scalar>. (E.g., for the Macroprocessor, it is unlikely that vectorizing brings any measurable speed-up.)
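The policy rules above can be illustrated by a minimal sketch. It is not taken from the skeleton: the class name `sum_solution` and its `sum` function are hypothetical, and the four policy classes are redeclared here as empty stand-ins only to keep the example self-contained. The primary template is scalar and serves every policy; an explicit specialization for policy_avx512 exists only when USE_AVX512 is defined.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#ifdef USE_AVX512
#include <immintrin.h>
#endif

// Empty stand-ins for the policy classes provided by the skeleton.
struct policy_scalar {};
struct policy_sse {};
struct policy_avx {};
struct policy_avx512 {};

// Hypothetical solution class. The primary template is plain scalar code
// and is used for every policy that has no explicit specialization.
template <typename policy>
class sum_solution {
public:
    static std::int32_t sum(const std::int32_t* data, std::size_t n) {
        assert(data != nullptr || n == 0);
        std::int32_t acc = 0;
        for (std::size_t i = 0; i < n; ++i) acc += data[i];
        return acc;
    }
};

#ifdef USE_AVX512
// Compiled only where the cmake configuration defines USE_AVX512.
template <>
class sum_solution<policy_avx512> {
public:
    static std::int32_t sum(const std::int32_t* data, std::size_t n) {
        assert(data != nullptr || n == 0);
        __m512i acc = _mm512_setzero_si512();
        std::size_t i = 0;
        for (; i + 16 <= n; i += 16)  // 16 x 32-bit lanes per step
            acc = _mm512_add_epi32(acc, _mm512_loadu_si512(data + i));
        std::int32_t total = _mm512_reduce_add_epi32(acc);
        for (; i < n; ++i) total += data[i];  // scalar tail
        return total;
    }
};
#endif
```

Without USE_AVX512, `sum_solution<policy_avx512>` silently falls back to the scalar primary template, so all four policies remain usable on every platform.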
The USE_AVX, USE_AVX512, and USE_AVX512BW flags are set by the cmake configuration files, consistently with the command-line compiler options that enable the corresponding instructions.
CPU generation | MSVC compiler options | GCC compiler options | Policies tested |
---|---|---|---|
SSE 4.2 | (none) | -msse4.2 | (policy_scalar), policy_sse |
AVX2 | /arch:AVX2 -D USE_AVX | -mavx2 -D USE_AVX | (policy_scalar), policy_sse, policy_avx |
AVX512 (Xeon Phi) | not supported | -mavx512cd -D USE_AVX -D USE_AVX512 | (policy_scalar), policy_sse, policy_avx, policy_avx512 |
AVX512 (Other) | /arch:AVX512 -D USE_AVX -D USE_AVX512 -D USE_AVX512BW | -mavx512cd -mavx512bw -D USE_AVX -D USE_AVX512 -D USE_AVX512BW | (policy_scalar), policy_sse, policy_avx, policy_avx512 |
policy_scalar is not tested in assignments which rely on vectorizing (levenstein, matrix).
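As a rough illustration of how the flags in the table relate to each other, the sketch below turns them into compile-time constants. The namespace `vector_flags` and the helper `widest_tested_bits` are hypothetical names, not part of the framework. The two static_assert checks encode a pattern that holds in every row of the table: USE_AVX512BW never appears without USE_AVX512, and USE_AVX512 never appears without USE_AVX.

```cpp
// Compile-time view of the flags set by the cmake configuration.
namespace vector_flags {
#ifdef USE_AVX
constexpr bool avx = true;
#else
constexpr bool avx = false;
#endif
#ifdef USE_AVX512
constexpr bool avx512 = true;
#else
constexpr bool avx512 = false;
#endif
#ifdef USE_AVX512BW
constexpr bool avx512bw = true;
#else
constexpr bool avx512bw = false;
#endif
}  // namespace vector_flags

// Flag combinations that never occur in the table are rejected at compile time.
static_assert(!vector_flags::avx512bw || vector_flags::avx512,
              "USE_AVX512BW requires USE_AVX512");
static_assert(!vector_flags::avx512 || vector_flags::avx,
              "USE_AVX512 requires USE_AVX");

// Hypothetical helper: register width in bits of the widest policy tested
// in this build (policy_sse = 128, policy_avx = 256, policy_avx512 = 512).
constexpr int widest_tested_bits() {
    return vector_flags::avx512 ? 512 : (vector_flags::avx ? 256 : 128);
}
```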
On Linux machines, the cmake configuration files automatically detect the CPU capabilities using /proc/cpuinfo.
On Windows, there is no reliable way of detection. The defaults (USE_AVX=True, USE_AVX512=False, USE_AVX512BW=False) are set in the cmake cache; they may be changed in Project/CMake Settings in Visual Studio or using cmake-gui. (They must be changed for very old CPUs which do not support AVX2.)
The testing framework produces standard output in tab-separated format (without any header). The columns are the following:
--direct-print=no
option.)
OK/MISMATCH
indicating the correctness of the result/checksum
The output shows all the individual tests performed. The meaning of the parameters is described with each assignment.
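When post-processing the output, a small helper along these lines may be handy. It is only a sketch: the function names `split_tabs` and `has_mismatch` are hypothetical, and it assumes nothing about the column order beyond the format being tab-separated and a failed check appearing as a field that reads exactly MISMATCH.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Splits one line of the framework's output into its tab-separated fields.
inline std::vector<std::string> split_tabs(const std::string& line) {
    assert(line.find('\n') == std::string::npos);  // expects a single line
    std::vector<std::string> fields;
    std::string field;
    std::istringstream in(line);
    while (std::getline(in, field, '\t')) fields.push_back(field);
    return fields;
}

// Returns true if any field of the line is exactly "MISMATCH".
// The position of the status column is deliberately not assumed.
inline bool has_mismatch(const std::string& line) {
    for (const std::string& f : split_tabs(line))
        if (f == "MISMATCH") return true;
    return false;
}
```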
The testing is performed in parallel, to simulate the workload on the other CPU cores during a typical parallel computation. Each thread performs the same sequence of tests; however, each thread uses a different seed for the random generation of input data (therefore the different outputs/checksums). The threads run the tests in lockstep, i.e. there is a global rendez-vous after finishing each test. Due to different inputs and other variations, different threads may run for slightly different time.
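The lockstep scheme can be pictured with the following simplified sketch. It is not the framework's code: the `rendezvous` class, the `run_lockstep` function, the seed values, and the dummy workload are all made up for illustration. Each thread runs the same sequence of tests with its own seed and waits for all the others after every test.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <random>
#include <thread>
#include <vector>

// Reusable barrier standing in for the global rendez-vous after each test.
class rendezvous {
public:
    explicit rendezvous(std::size_t n) : total_(n) {}
    void arrive_and_wait() {
        std::unique_lock<std::mutex> lock(m_);
        const std::size_t gen = generation_;
        if (++waiting_ == total_) {  // last thread to arrive releases the rest
            waiting_ = 0;
            ++generation_;
            cv_.notify_all();
        } else {
            cv_.wait(lock, [&] { return generation_ != gen; });
        }
    }
private:
    std::mutex m_;
    std::condition_variable cv_;
    std::size_t total_;
    std::size_t waiting_ = 0;
    std::size_t generation_ = 0;
};

// Runs `tests` dummy tests on `threads` threads in lockstep and returns
// one checksum per (thread, test) pair.
inline std::vector<std::vector<std::uint32_t>> run_lockstep(std::size_t threads,
                                                            std::size_t tests) {
    std::vector<std::vector<std::uint32_t>> checksums(
        threads, std::vector<std::uint32_t>(tests));
    rendezvous barrier(threads);
    std::atomic<std::size_t> finished{0};
    std::vector<std::thread> pool;
    for (std::size_t t = 0; t < threads; ++t) {
        pool.emplace_back([&, t] {
            // Different seed per thread (the values are arbitrary).
            std::mt19937 gen(static_cast<std::uint32_t>(1000 + t));
            for (std::size_t k = 0; k < tests; ++k) {
                // Lockstep: every thread has finished all previous tests.
                assert(finished.load() >= k * threads);
                std::uint32_t sum = 0;
                for (int i = 0; i < 1000; ++i)  // stand-in for a real test
                    sum += static_cast<std::uint32_t>(gen());
                checksums[t][k] = sum;
                ++finished;
                barrier.arrive_and_wait();
            }
        });
    }
    for (std::thread& th : pool) th.join();
    return checksums;
}
```

Because every thread seeds its own generator, the checksums differ between threads but are reproducible from run to run.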
To mitigate the effect of context-switching after the rendez-vous and different run times,
each test is actually run three times (with identical data) in each thread.
Only the middle of the three runs is measured (and printed);
the first and the last run are set to last only 20% of the measured run
(by manipulating the auto-adjusting parameter).
The triple run is implemented together with the loop over the auto-adjust parameter,
by a range-based for like this (from macromain.cpp):
for (auto&& ctx5 : auto_measurement
The loop itself is entered for every auto-adjusting group of tests,
the number of iterations of the loop is three times the number of auto-adjusting parameter values used.
The time measurement (implemented inside the * and ++ operators of the loop range iterators)
is thus done on the 2nd, 5th, 8th, etc. iterations
(only the last measurement is actually used in the final statistics).
Note that this complex behavior is observable during debugging.
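The resulting iteration pattern can be sketched as a flat schedule. This is an illustration only, not the code behind auto_measurement: `planned_run` and `plan_triple_runs` are hypothetical names, and scaling the parameter by 0.2 assumes that run time is roughly proportional to the auto-adjusting parameter.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct planned_run {
    double size;    // value of the auto-adjusting parameter for this run
    bool measured;  // true only for the middle run of each triple
};

// Expands every auto-adjusting parameter value into three runs:
// a short unmeasured run, the measured run, and another short unmeasured run.
inline std::vector<planned_run> plan_triple_runs(const std::vector<double>& values) {
    std::vector<planned_run> plan;
    plan.reserve(3 * values.size());
    for (double v : values) {
        assert(v > 0.0);
        plan.push_back({0.2 * v, false});  // first run, about 20% of the measured one
        plan.push_back({v, true});         // measured run
        plan.push_back({0.2 * v, false});  // last run, about 20% of the measured one
    }
    return plan;
}
```

With three parameter values the schedule has nine entries, and the measured ones sit at positions 2, 5, and 8 when counting from one.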
The testing framework compiled together with your code supports the following command-line arguments:
srun -n 1 -c 8
to reproduce the official settings in parlab.)
In Debug mode, tests are always done in the main thread to simplify debugging.
In addition, each assignment has its specific command-line arguments which influence the testing sequence or enable debugging outputs. See the respective assignment pages.
Advanced options:
The *gold*.cpp files were obtained using these options. Timing files for other machines may be produced by --code-time, compiled and linked with the project. The framework will then compare the timings against them whenever run on the same machine name.
The program may also be run via the command "make test". In this case, command-line arguments may be supplied by adding them to the MAKE_TARGET call in "CMakeLists.txt", e.g.:
MAKE_TARGET("MACRO" "macro" "--dump=on")