NPRG054 Assignments - General information

Testing environment

Use the parlab machine only for file manipulation (git, vim, ...) and scripting. Never use the parlab machine directly for compiling or running - use the parlab workers instead.

The traffic at parlab workers is controlled by SLURM - you need to understand at least the basics described in the "parlab & SLURM - crash course". You may ignore the parts specific to GPUs and gpulab.

All machines at parlab are equipped with cmake 3.27 and g++ 13.2, capable of C++23.

Parlab partitions relevant for this course

SLURM partitions	CPU	vector instruction sets
mpi-homo-long, mpi-homo-short	Intel Xeon Gold 6130	SSE4.2, AVX2, AVX512(CD,DQ,BW,VL)
phi-long, phi-short	Intel Xeon Phi 7230	SSE4.2, AVX2, AVX512(CD,PF,ER)

Only users registered for this course will have access to the -long and -short partitions. Unregistered users can use the -ffa partitions.

Frameworks and skeletons

All homework assignments share the same git project, which makes it easy to share files across the assignments.

Usage:

At the machine you want to use for development:

Ensure that you have a working git client and that you understand git basics
If you want to use git over ssh (preferred), you have to register your public key at gitlab.mff.cuni.cz
Log in to gitlab.mff.cuni.cz (with your SIS credentials), locate the teaching/nprg054/2023/<your-login> project, and clone it into a local folder.
If you do not see your project in gitlab, contact your teacher.

Important: Updating your repository from teaching/nprg054/asgn:

Your project at gitlab has already been populated with a copy of the framework+skeleton. However, your copy may become obsolete, therefore:
In your local clone folder, execute (once):

                git remote add upstream git@gitlab.mff.cuni.cz:teaching/nprg054/asgn.git

Whenever you want to update the common framework from teaching/nprg054/asgn, execute:

                git pull upstream

Note: If you develop in a branch, you have to use git pull upstream +master:<branch-name>
The pull upstream command will update the framework and try to merge changes in the skeleton into your sources. Sometimes, the pull will produce merge conflicts, which you have to resolve by editing the affected files individually.
After pulling, recompile and test your program, then push to your gitlab repository (aka origin). You may then pull the update from your gitlab repository into your other local folders (e.g. at parlab).

At parlab:

If you want to use git over ssh (preferred), you have to generate a key pair here (ssh-keygen) and register the corresponding public key (~/.ssh/id_rsa.pub) at gitlab.mff.cuni.cz
Clone your project into a local folder
All machines in parlab share the same NFS tree with your home folder. Therefore, the git actions may be done on parlab itself, once for all machines.
For compiling, you need to create a separate build folder for every parlab environment (mpi-homo|phi); you may also want to distinguish Debug and Release builds.

Compiling and running with command-line:

Make yourself familiar with the purpose of cmake.
In an (initially empty) build folder, run this:

                cmake <project-root-folder>
                make <target-name>
                ./<target-name>

Target-name is defined in CMakeLists.txt ("macro" for the first assignment).
Important: if you want to test performance, you need to configure cmake to Release mode as follows:

                cmake -D CMAKE_BUILD_TYPE=Release <project-root-folder>

It may be a good idea to separate Debug and Release build folders.

Compiling and running using Microsoft Visual Studio 2022 or Visual Studio Code:

Install cmake support in Visual Studio
Open your local clone folder with Visual Studio (you can also create the clone directly with Visual Studio)
Wait until the "cache" is generated (i.e. cmake is run); this usually happens automatically
Select the target, build the project and run
Visual Studio also supports compiling/running/debugging with clang on Windows, g++/gdb in WSL, and clang or g++/gdb in Linux via ssh. Unfortunately, the ssh approach does not work well with SLURM.

When done:

Push everything into your gitlab repository (aka origin), branch "master"
If you finished the work before the deadline, no further action is required
After the deadline, you have to send me an email indicating that you want your submission evaluated (with a penalty for late submission)

Please be patient, I will usually check the new solutions in batches, approx. every two weeks

If you continue development after the deadline and after committing some working solution, use a branch other than master for incomplete versions; otherwise, you risk that an incomplete version will be evaluated instead of the working version. (If you share a header file across assignments, remember that updating this file during a second assignment may spoil the committed solution of the first one.)

Handling different vector instruction sets

All your solutions shall be templated classes with the template argument policy. This argument is used to specify the desired vector platform. In the skeleton, you will find the following four policy classes: policy_scalar, policy_sse, policy_avx, and policy_avx512. You may fill the policy classes with any contents you like, or you may leave them empty and use explicit specializations of your class.

In addition, the parts of source code which use AVX512 intrinsics must be enclosed inside "#ifdef USE_AVX512" directives to avoid compilation errors on platforms that do not support them; the same applies to AVX intrinsics and USE_AVX.

USE_AVX512 means that AVX512CD is also available. On the other hand, AVX512BW instructions may be used only if enclosed inside "#ifdef USE_AVX512BW".

Your solution shall support all four policy classes; however, you are not required to actually use the respective vector instructions. For instance, the implementation for <policy_avx512> may be identical to the implementation with <policy_sse> or <policy_scalar>. (E.g., for the Macroprocessor, it is unlikely that vectorizing brings any measurable speed-up.)

The USE_AVX, USE_AVX512, and USE_AVX512BW flags are set by the cmake configuration files, consistently with the command-line compiler options that enable the corresponding instructions.
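
A minimal sketch of how the policy argument and the USE_AVX512 guard might fit together. The class name adder and its sum function are hypothetical and serve only as illustration; the four empty structs stand in for the policy classes from the skeleton. The primary template acts as a scalar fallback for every policy, and an explicit specialization is compiled only when the build defines USE_AVX512.

```cpp
#include <cassert>
#include <cstddef>

// Stand-ins for the policy classes from the skeleton (left empty here).
struct policy_scalar {};
struct policy_sse {};
struct policy_avx {};
struct policy_avx512 {};

// Hypothetical solution class. The primary template is a scalar fallback,
// so all four policies compile even without a dedicated implementation.
template <typename policy>
struct adder {
    static long long sum(const int* data, std::size_t n) {
        assert(data != nullptr || n == 0);
        long long s = 0;
        for (std::size_t i = 0; i < n; ++i) s += data[i];
        return s;
    }
};

#ifdef USE_AVX512
#include <immintrin.h>

// Explicit specialization, seen by the compiler only when cmake
// defines USE_AVX512 (and enables the matching compiler options).
template <>
struct adder<policy_avx512> {
    static long long sum(const int* data, std::size_t n) {
        assert(data != nullptr || n == 0);
        __m512i acc = _mm512_setzero_si512();
        std::size_t i = 0;
        for (; i + 8 <= n; i += 8) {
            // Load 8 ints, widen them to 64 bits, accumulate per lane.
            __m256i v = _mm256_loadu_si256(
                reinterpret_cast<const __m256i*>(data + i));
            acc = _mm512_add_epi64(acc, _mm512_cvtepi32_epi64(v));
        }
        long long s = _mm512_reduce_add_epi64(acc);
        for (; i < n; ++i) s += data[i];  // scalar tail
        return s;
    }
};
#endif
```

On a build without USE_AVX512, adder<policy_avx512> silently resolves to the primary template, which matches the rule that every policy must be supported but need not use the respective instructions.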

CPU generation	MSVC compiler options	GCC compiler options	Policies tested
SSE 4.2	(none)	-msse4.2	(policy_scalar)	policy_sse
AVX2	/arch:AVX2 -D USE_AVX	-mavx2 -D USE_AVX	(policy_scalar)	policy_sse	policy_avx
AVX512 (Xeon Phi)	not supported	-mavx512cd -D USE_AVX -D USE_AVX512	(policy_scalar)	policy_sse	policy_avx	policy_avx512
AVX512 (Other)	/arch:AVX512 -D USE_AVX -D USE_AVX512 -D USE_AVX512BW	-mavx512cd -mavx512bw -D USE_AVX -D USE_AVX512 -D USE_AVX512BW	(policy_scalar)	policy_sse	policy_avx	policy_avx512
policy_scalar is not tested in assignments which rely on vectorizing (levenstein, matrix)

On Linux machines, the cmake configuration files will automatically detect the CPU capability using /proc/cpuinfo.
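
The cmake scripts perform this detection themselves; the following C++ sketch only illustrates the idea. It assumes the usual Linux layout of /proc/cpuinfo (a "flags" line listing feature names such as avx2, avx512cd and avx512bw) and maps them to the three USE_ flags in the way the compiler-option table above suggests. The names vector_flags and detect_flags are hypothetical.

```cpp
#include <istream>
#include <sstream>
#include <string>

// Hypothetical mirror of the three cmake flags.
struct vector_flags {
    bool use_avx = false;       // set when "avx2" is listed
    bool use_avx512 = false;    // set when "avx512cd" is listed
    bool use_avx512bw = false;  // set when "avx512bw" is listed
};

// Scans cpuinfo-style text and inspects the first "flags" line found;
// the first logical CPU is taken as representative for the machine.
vector_flags detect_flags(std::istream& cpuinfo) {
    vector_flags result;
    std::string line;
    while (std::getline(cpuinfo, line)) {
        if (line.rfind("flags", 0) != 0) continue;  // line must start with "flags"
        const auto colon = line.find(':');
        if (colon == std::string::npos) continue;
        std::istringstream words(line.substr(colon + 1));
        std::string flag;
        while (words >> flag) {
            if (flag == "avx2") result.use_avx = true;
            else if (flag == "avx512cd") result.use_avx512 = true;
            else if (flag == "avx512bw") result.use_avx512bw = true;
        }
        break;
    }
    return result;
}
```

To try it on a real machine, pass a std::ifstream opened on /proc/cpuinfo; if the file cannot be read, all three flags stay false.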

On Windows, there is no reliable way of detection. The defaults (USE_AVX=True, USE_AVX512=False, USE_AVX512BW=False) are set in the cmake cache; they may be changed in Project/CMake Settings in Visual Studio or using cmake-gui. (They must be changed for very old CPUs which do not support AVX2.)

Output

The testing framework produces standard output in tab-separated format (without any header). The columns are the following:

Machine name
Platform
Thread id (not present in Debug mode)
Assignment-specific parameters, usually affecting the size of a particular test. The last of these parameters is usually auto-adjusting, i.e. increased until the test time becomes reliably measurable (over 1 sec; 0.1 sec for Debug mode).
Time in nanoseconds, divided by the complexity of the test (derived from the parameters).
The result of the test, or a checksum of the results, depending on the assignment.
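
The auto-adjusting parameter and the normalized time column can be illustrated by the sketch below. It is not the framework's code: the function names are hypothetical, and doubling the parameter is an assumption (the text above only says that the parameter is increased until the time exceeds the threshold).

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>

// Grows the parameter until a single run takes at least `threshold`.
// `timed_run(n)` must run the test with parameter n and return its duration.
// Doubling is an assumed growth strategy.
template <typename TimedRun>
std::size_t auto_adjust(std::size_t start,
                        std::chrono::nanoseconds threshold,
                        TimedRun&& timed_run) {
    assert(start > 0);
    std::size_t n = start;
    while (timed_run(n) < threshold) n *= 2;
    return n;
}

// Time column: nanoseconds divided by the complexity of the test,
// where the complexity is derived from the test parameters.
double normalized_ns(std::chrono::nanoseconds elapsed, double complexity) {
    assert(complexity > 0.0);
    return static_cast<double>(elapsed.count()) / complexity;
}
```

Dividing by the complexity is what makes times for different parameter values comparable, which the later aggregation relies on.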

The testing framework produces output in two phases:

Immediately after finishing a particular test, an output line containing the columns described above is produced. (This output may be switched off by the --direct-print=no option.)
After finishing all tests, partial and full aggregates are printed. The first set of lines is based on aggregation over the auto-adjusting parameter values, where the last (i.e. the most precisely measured) time is selected while the results are taken from the first run. The remaining parameters, in right-to-left order, produce further line sets, where the description of parameter values is replaced by their ranges over which the aggregation was done; finally the threads and the platforms are aggregated too. For these results, time is aggregated by geometric mean. Besides the columns described above, every aggregated line contains two additional columns:
- A sign OK/MISMATCH indicating the correctness of the result/checksum
- The ratio between the measured and the reference timings (i.e. a number smaller than 1.0 means that the code is faster than the reference). Reference timings are available only at parlab.
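
As a small illustration of the aggregation arithmetic, the sketch below computes a geometric mean of (already normalized) times and the ratio against a reference timing. The function names are hypothetical; the framework's own implementation may differ.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Geometric mean computed through logarithms to avoid overflow
// when many values are multiplied together.
double geometric_mean(const std::vector<double>& times) {
    assert(!times.empty());
    double log_sum = 0.0;
    for (double t : times) {
        assert(t > 0.0);
        log_sum += std::log(t);
    }
    return std::exp(log_sum / static_cast<double>(times.size()));
}

// Ratio as in the last column: below 1.0 means faster than the reference.
double ratio_to_reference(double measured, double reference) {
    assert(reference > 0.0);
    return measured / reference;
}
```

Unlike the arithmetic mean, the geometric mean treats a 2x slowdown in one test and a 2x speed-up in another as cancelling out, regardless of the absolute times involved.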

Testing procedure

The output shows all the individual tests performed. The meaning of the parameters is described with each assignment.

The testing is performed in parallel, to simulate the workload on the other CPU cores during a typical parallel computation. Each thread performs the same sequence of tests; however, each thread uses a different seed for the random generation of input data (hence the different outputs/checksums). The threads run the tests in lockstep, i.e. there is a global rendez-vous after finishing each test. Due to different inputs and other variations, different threads may run for slightly different times.
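
The lockstep scheme can be sketched as follows. This is an illustration, not the framework's code: the names rendezvous and run_lockstep are hypothetical, the "test" is replaced by a single random number, and the barrier is hand-written so that the sketch does not depend on C++20 (std::barrier would serve the same purpose).

```cpp
#include <cassert>
#include <condition_variable>
#include <cstdint>
#include <mutex>
#include <random>
#include <thread>
#include <vector>

// Reusable barrier: every thread blocks until all `count` threads arrived.
class rendezvous {
public:
    explicit rendezvous(unsigned count) : count_(count) {}

    void arrive_and_wait() {
        std::unique_lock<std::mutex> lock(mutex_);
        const unsigned long my_generation = generation_;
        if (++waiting_ == count_) {
            waiting_ = 0;
            ++generation_;  // releases everybody waiting in this round
            cv_.notify_all();
        } else {
            cv_.wait(lock, [&] { return generation_ != my_generation; });
        }
    }

private:
    std::mutex mutex_;
    std::condition_variable cv_;
    const unsigned count_;
    unsigned waiting_ = 0;
    unsigned long generation_ = 0;
};

// Runs `tests` rounds on `threads` workers and returns one checksum per thread.
std::vector<std::uint64_t> run_lockstep(unsigned threads, unsigned tests) {
    assert(threads > 0);
    rendezvous sync(threads);
    std::vector<std::uint64_t> checksums(threads, 0);
    std::vector<std::thread> workers;
    for (unsigned id = 0; id < threads; ++id) {
        workers.emplace_back([&, id] {
            std::mt19937_64 rng(id + 1);  // a different seed in each thread
            for (unsigned t = 0; t < tests; ++t) {
                checksums[id] ^= rng();   // stand-in for one test on random input
                sync.arrive_and_wait();   // global rendez-vous after each test
            }
        });
    }
    for (auto& w : workers) w.join();
    return checksums;
}
```

Because every thread has its own fixed seed, the per-thread checksums differ from each other but are reproducible from run to run.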

To mitigate the effect of context-switching after the rendez-vous and of different run times, each test is actually run three times (with identical data) in each thread. Only the middle of the three runs is measured (and printed); the first and the last run are set to last only 20% of the measured run (by manipulating the auto-adjusting parameter). The triple run is implemented together with the loop over the auto-adjusting parameter, by a range-based for like this (from macromain.cpp):

                for (auto&& ctx5 : auto_measurement(ctx4, 1024))

The loop itself is entered for every auto-adjusting group of tests; the number of iterations of the loop is three times the number of auto-adjusting parameter values used. The time measurement (implemented inside the * and ++ operators of the loop range iterators) is thus done on the 2nd, 5th, 8th, etc. iterations (only the last measurement is actually used in the final statistics). Note that this complex behavior is observable during debugging.
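
The following simplified sketch shows how a range-based for can hide such a triple run; it is not the framework's auto_measurement. The names triple_run and run_info are hypothetical, and the list of parameter values is fixed up front here, whereas the framework derives it from the measured times.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <utility>
#include <vector>

struct run_info {
    std::size_t size;  // value of the auto-adjusting parameter for this run
    bool measured;     // true only for the middle run of each triple
};

class triple_run {
public:
    using clock = std::chrono::steady_clock;

    explicit triple_run(std::vector<std::size_t> sizes) : sizes_(std::move(sizes)) {}

    class iterator {
    public:
        iterator(triple_run* owner, std::size_t step) : owner_(owner), step_(step) {}

        // Dereferencing describes the upcoming run and starts the clock.
        run_info operator*() {
            assert(step_ / 3 < owner_->sizes_.size());
            const std::size_t full = owner_->sizes_[step_ / 3];
            const bool middle = (step_ % 3 == 1);
            start_ = clock::now();
            // Warm-up and cool-down runs use 20% of the measured size.
            return {middle ? full : std::max<std::size_t>(1, full / 5), middle};
        }

        // Advancing stops the clock; only middle runs are recorded.
        iterator& operator++() {
            if (step_ % 3 == 1) owner_->elapsed_.push_back(clock::now() - start_);
            ++step_;
            return *this;
        }

        bool operator!=(const iterator& other) const { return step_ != other.step_; }

    private:
        triple_run* owner_;
        std::size_t step_;
        clock::time_point start_{};
    };

    iterator begin() { return {this, 0}; }
    iterator end() { return {this, 3 * sizes_.size()}; }
    const std::vector<clock::duration>& elapsed() const { return elapsed_; }

private:
    std::vector<std::size_t> sizes_;
    std::vector<clock::duration> elapsed_;
};
```

A loop such as for (auto&& r : runs) then executes its body three times per parameter value, and the loop body does not need to know which of the iterations is being timed.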

Command-line arguments

The testing framework compiled together with your code supports the following command-line arguments:

--platform=(scalar|sse|avx|avx512) - run tests only in the specified platform
--(scalar|sse|avx|avx512)=no - disable specific platform
--threads=<number> - use the specified number of worker threads for testing. Defaults to the number of hardware threads in the CPU or allocated in SLURM. Official timings for most assignments are measured with 8 threads. (Use srun -n 1 -c 8 to reproduce the official settings in parlab.) In Debug mode, tests are always done in the main thread to simplify debugging.
--machine=<name> - override the autodetected machine name printed in the first column
--direct-print=no - disable printing individual results during testing, print only final stats
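
All of these options follow the --name=value pattern. The sketch below shows one generic way to split such arguments; it is an illustration under that assumption only, and the function name parse_options is hypothetical rather than part of the framework.

```cpp
#include <map>
#include <string>
#include <vector>

// Splits arguments of the form --name=value into a name -> value map.
// Arguments that do not start with "--" are ignored; an option without
// '=' is stored with an empty value.
std::map<std::string, std::string> parse_options(const std::vector<std::string>& args) {
    std::map<std::string, std::string> options;
    for (const std::string& arg : args) {
        if (arg.rfind("--", 0) != 0) continue;  // must start with "--"
        const auto eq = arg.find('=');
        if (eq == std::string::npos) {
            options[arg.substr(2)] = "";
        } else {
            options[arg.substr(2, eq - 2)] = arg.substr(eq + 1);
        }
    }
    return options;
}
```

With the arguments --platform=avx512 --threads=8, the map contains the keys platform and threads with the values avx512 and 8; converting 8 to a number (e.g. with std::stoul) is left to the caller.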

In addition, each assignment has its specific command-line arguments which influence the testing sequence or enable debugging outputs. See the respective assignment pages.

Advanced options:

--code-check=<filename.cpp> - produce a .cpp file containing checksums from this run (use with --platform to make results unique)
--code-time=<filename.cpp> - produce a .cpp file containing timings from this run

The *gold*.cpp files were obtained using these options. Timing files for other machines may be produced by --code-time, compiled and linked with the project. The framework will then compare the timings against them whenever it is run on a machine with the same name.

The program may also be run via the command "make test". In this case, command-line arguments may be passed by adding them to the MAKE_TARGET call in "CMakeLists.txt", e.g.:

MAKE_TARGET("MACRO" "macro" "--dump=on")