This is an informational web page for the Programming in Parallel Environment course ( NPRG042), which is being taught at the Faculty of Mathematics and Physics, Charles University in Prague. The lectures are being held in S5, from 12:20 to 13:50 every Monday. The seminars are in S1 from 12:20 to 13:50 once a fortnight on Wednesday starting the first week. The labs are every other week either at the same time in SU2 or in SU1 from 15:40 to 17:10.

You can find here:


The course is currently led by Jakub Yaghob and Martin Kruliš. Please direct inqueries of a specific topic (or home assignment) to the teacher who presented the topic.

Latest Updates

13.2.2019 The web page was updated for the new scholar year.
14.2.2018 The web page was updated for the new scholar year.


The grading criteria for this course are summarized in the following. Please, read them carefully. In case any rules are not clear, contact Dr. Jakub Yaghob.

Home Assignments

There will be 4 home assignments which are designed to explore several technologies and parallel platforms. The main criterium for grading is the speedup of your solution computed by measuring wall-time of your solution and reference serial solution.

You may recieve up to 4 points for each performance assigment. The sum of your points will determine your final mark as follows:

points mark
11 or more 1 (excellent)
8 to 10 2 (well done)
5 to 7 3 (OK)
4 or less failed

Furthermore, 4 points are required to recieve a credit for the seminar (which is also required) and at least 1 point is required from each assignment. In other words, you need to solve all assignments at least in a minimal way.

If you are not satisfied with your mark, but you have at least 4 points - 1 point from each assignment (an thus the credit), you may request an oral examination where your mark will be determined. Once subscribing for the exam, your mark recieved from the assignments is no longer valid — i.e., you may fail the exam even if you have enough points from the assignments.

Each assignment has a strict deadline. Once the deadline is reached, all assignments are collected and graded. You may submit your solution after the deadline and ask for (re)evaluation, but you will recieve reduced points. Submissions that would get 2 or more points normally will get only 1 point if submitted late. Late submissions which would not worth 2 points normally will receive no points when delivered late.


The domain of parallel architectures and parallel programming is perhaps the most intensively developing field in the computer science. Hence, the slides are being updated continuously. Furthermore, bare in mind that some information on the slides may become outdated despite our best efforts. If you have any questions or suggestions related to the lectures, please contact the teachers.

18.2.2019 Parallel programming theory PPTX File Icon 01-theory.ppt
25.2.2019 Intel Threading Building Blocks PPTX File Icon 12-tbb.pptx
4.3.2019 Finding parallelism PPTX File Icon 02-finding.ppt
11.3.2019 OpenMP PPTX File Icon 13-openmp.pptx
18.3.2019 Basic threading in C++ PPTX File Icon 11-cppthread.pptx
25.3.2019 MPI - basics PPTX File Icon 15-mpi.ppt
1.4.2019 MPI - advanced PPTX File Icon 15-mpi-adv.ppt
8.4.2019 GPGPU, CUDA PPTX File Icon 16-cuda.pptx
15.4.2019 GPGPU, CUDA (continuation) PPTX File Icon 16-cuda-adv.pptx
29.4.2019 Algorithm structures PPTX File Icon 03-algorithm.ppt
6.5.2019 Algorithm structures (continuation)
13.5.2019 Support structures PPTX File Icon 04-support.ppt
20.5.2019 Support structures (continuation)


The assignments will be tested on our parlab and gpulab clusters, which are also at your disposal for development, debugging, and performance testing. The clusters are managed by the SLURM resource manager. Additionally, the Charliecloud is available on all worker nodes.

Do not execute any of your code on the front-end nodes directly, use worker nodes instead! All unknown terms (front-end node, worker node) will be explained later. Front-end nodes do not have installed any development software.

SLURM Crash Course

All informations about SLURM can be found on its SLURM documentation page or on SLURM tutorials page. Anyway, we have provided a short description of SLURM and of our clusters.


A cluster is a bunch of nodes. Nodes are grouped together to partitions. Partitions may overlap, ie. one node can be in more partitions. Feature is a string describing a feature of a node, e.g. avx2 for a node capable of executing AVX2 instructions. Each node has two sets of features: current features and available features. Usually, they are same. But in some cases, the node is capable of changing current features set on demand.

SLURM manages resources. The most important resources are CPUs and memory, but it is able to manage other generic resources (GRES), like GPUs, etc. Moreover, the time is resource as well.

A user is identified by his/her login. Account is a billing entity (well, we won't charge you for using our clusters). Each user must have assigned an account. Moreover, user can be assigned to more accounts and use them depending on what he/she is doing. Accounts can only be allowed access to some partitions.

A user can launch a job. Job has a state, has some reserved and assigned resources, and returns an error code after completition. Job consistes from steps (usually 1 step). Each step is executed by a bunch of tasks (usually 1 task), where resources are equally assigned to the tasks of a step. Jobs are inserted to a scheduling queue, where you can find them.

Partition has a priority. A job submitted to a partition with higher priority can suspend an another job submitted to a partition with lower priority.

Important commands

There are many commands (see SLURM man pages or SLURM command summary). The most important commands are:

Job submission commands (srun, sbatch, salloc) have a common set of important options:

-A--acount= Charge resources used by the job to the specified account. If not sepcified, user's default account is charged
-B--extra-node-info= Select only nodes with at least specified number of sockets, cores per socket, and threads per core. This option does not specify the resource allocation, its just constraint.
-C--constraint=Select only nodes with matching features
-c--cpus-per-task=Number of CPUs per 1 task
-e--error=Standard error stream is redirected to the specified file
--gres= Specifies comma delimited list of GRES. Each entry on the list is in form "name[[:type]:count]"
-i--input=Standard input stream is redirected from the specified file
-J--job-name=Job name
-L--licenses= Specifies comma delimited list of licenses allocated to the job. Each entry on the list is in form "name[:count]"
-m--distribution= Select distribution method for tasks and resources. For more info see documentation
--mem=Specify the real memory required per node
--mem-per-cpu=Specify the memory required per allocated CPU
--mem-bind=Specify the memory binding. For more info see documentation
-N--nodes=A minimum of allocated nodes. Default is 1
-n--ntasks=Number of tasks. Default is 1
-o--output=Standard output stream is redirected to the specified file
-p--partition= Request a specific partition for the resource allocation. If not specified, default partition os chosen
-t--time= Set a limit on the total run time of a job. If not specified, default time for a selected partition is used. Acceptable time formats include "minutes", "minutes:seconds", "hours:minutes:seconds", "days-hours".

Description of available clusters

As mentioned above, we have two clusters parlab and gpulab. Access to the cluster is always through the front-end server using SSH on port 42222. Front-end servers have the same name as the cluster, i.e. parlab.ms.mff.cuni.cz and gpulab.ms.mff.cuni.cz.

Your student logins will be created before the first assignment. Your login is s_sislogin where sislogin is your login to our student infomation system (SIS). The generated password will be sent to your official e-mail. Both clusters use one common LDAP, so you will need to change the password only once. Each course tought on our clusters has its account. Any research group or project have their account as well. Student logins are assigned to the corresponding account depending on visiting relevant courses or working in a research group.

Both clusters have access to the same disk array using NFS. You may find your home mounted on /mnt/home. Moreover, research projects can have an additional space mounted on /mnt/research.

You may use an environment variable TMPDIR set to a private local temporary directory for a job. It is created on every node allocated to the job before the job starts and it is completely removed after the job finishes.

Parlab cluster specification

All nodes are interconected by InfiniBand FDR (56 Gb/s) for high-performance messaging using MPI. Moreover, they are interconnected by 10 GbE for all other traffic. The front-end server is connected by 10 GbE to the external world. The latest version of OpenMPI is installed on all nodes.

Parlab nodes

Node namesCPUSocketscoresHTRAMGRESAdditional info
w[401-404]Intel Xeon E7-4820482128 GB
w[201-208]Intel Xeon Gold 61302162128 GB
phi[01-02]Intel Xeon Phi 7230164496 GBhbm 16 GBcan change feature set after rebooting

Parlab partitions

NameNodesPriorityTimelimitIntended use
big-lpw[401-404]low1 daydefault, general or MPI debugging, long jobs
big-hpw[401-404]high1 hourexecuting short jobs on 4-socket system, MPI jobs
small-lpw[201-208]low1 daydebugging on newer CPUs, MPI debugging, long jobs
small-hpw[201-208]high30 minsexecutng short jobs on 2-socket system, MPI jobs
phi-lpphi[01-02]low1 dayKNL debugging, long jobs
phi-hpphi[01-02]high30 minsexecuting short jobs on KNL
allallhigh30 minsexecuting short jobs on all nodes, used primarily for testing heterogeneous MPI computing

Gpulab cluster specification

All nodes are interconnected by 10 GbE. The front-end server is connected by 10 GbE to the external world.

Gpulab nodes

Node namesCPUSocketscoresHTRAMGRESAdditional info
dw[01-02]Intel Xeon E545024132 GBDocker installed
dw03Intel Xeon E564024296 GBDocker installed
dw04Intel Xeon E5-2660v22102256 GBDocker installed
varjagIntel Xeon E7-4830482256 GB
volta[01-02]Intel Xeon Silver 4110282128 GBgpu volta [0-1]2x NVIDIA Tesla V100 PCIe 16 GB, latest CUDA
volta03Intel Xeon Silver 4110282192 GBgpu volta [0-1]2x NVIDIA Tesla V100 PCIe 16 GB, latest CUDA

Gpulab partitions

NameNodesPriorityTimelimitIntended use
debug-lpdw[01-04],varjaglow7 daysdefault, general debugging, long jobs, build Docker image
debug-hpdw[01-04],varjaghigh1 hourshort jobs, build Docker image
volta-lpvolta[01-03]low1 daydebugging GPU task, long GPU jobs
volta-hpvolta[01-03]high1 hourexecuting short GPU jobs

Useful examples


Starts a shell in a minimal environment (resources). Useful for making small builds or debugging in a restricted environment.

srun -p volta --gres=gpu:volta:1 mygpucode

Starts a job which will have one NVIDIA V100 card available.

sinfo -o "%P %L %l"

Prints info about default and maximum job time for partitions in a cluster.

sinfo -o "%n %b %f"

Prints info about current and available feature set for nodes in a cluster.

srun -p phi -C flat,snc2 --gres=hbm:8G myphicode

Selects a KNL node with required features and assigns 8 GB of HBM to the job. If there is no node with required features, one free node will be selected (may involve waiting for finishing all jobs on the selected node) and rebooted for changing current set of features. BE PATIENT, THE REBOOT TAKES LOOOOOONG TIME (~10 mins).

srun -p small -n 128 -N 8 --mem-per-cpu=2G mympijob

Starts a MPI job with 128 tasks/ranks spanning over 8 nodes in the small partition assigning 1 CPU and 2 GB RAM to each task.



Each assignment objective is to implement a parallel solution for described problem using specified technology. Let us emphasize, that the solution must use particular technology and must not use any other technology for parallelization. Please, read the assignment specification carefully and observe all details regarding the file naming conventions, framework utilization, compilation, and APIs.

The correct results are provided by the serial solution which is provided for each problem. The speedup is computed as the ratio of measured wall time of the serial solution and your parallel solution. When multiple datasets are provided, the speedup is determined for each dataset separately, and the final speedup is computed as a harmonic mean of the individual speedups.

The parallel solution must return the exact same results as the serial solution unless specified otherwise. A solution that fails to do so will not be accepted (i.e., no points will be granted). Furthermore, a solution is forbidden to produce any output to both stdout and stderr which is not required or allowed in the problem description. Debugging outputs may interfere with our testing framework and so your solution will be treated as if it does not provide a correct output.

The testing environment does not use any sand box or other means to prevent the submitted solutions from performing malicous tasks since we need to measure the performance without any interference. Please, take extra care to ensure that your solution is not performing any tasks that may violate the security of parlab. Any attempt for intentional breach will result into severe disciplinary (and possibly legal) actions against the participants.

The details regarding the testing infrastructure and the assignment submission can be found in the Infrastructure tab.

Please note that the parlab uses CentOS Linux, which has rather old version of g++ compiler. We have installed newer versions of g++ on all workers to let you fully appreciate new features of C++11/14/17. The new g++ compiler will be also used for testing your submits. It will be invoked with following flags CXXFLAGS=-Wall -O3 -std=c++17.

Additional materials are located at /mnt/home/_teaching . The subdirectory para holds the data related to the assignments (data, frameworks, ...) and the subdirectory examples holds the examples that will be presented during the lectures.

Assignment Submission

Unless specified otherwise, you are expected to submit your solution by copying appropriate files to the submit_box directory inside your home. The entire contents of this directory is copied atomically by our collection script when the deadline passes. The collection process also purges the submit box, so it is ready for another submission. Make sure you have copy the correct files which are requested by the assignment specification. Also make sure that you have your own copy of the files somewhere else.

#1 K-means Clustering

Write an implementation of k-means clustering algorithm using Intel Threading Building Blocks technology. The algorithm gets input points in a plane (R 2), the number of clusters K, and the number of iterations. It returns coordinates of the cluster centroids, and the assignment of the points after the last iteration.

The algorithm works as follows. As an initialization, the first K points are taken as the initial set of cluster centroids. After that, the algorithm performs prescribed number of iteration, whilst each iteration consists of the following steps:

  1. Each point is assigned to its neares cluster -- i.e., the cluster which has its centroid nearest by the Euclidean metrics. In case there are multiple clusters with centroids that are equidistant from the point being assigned, the cluster with the lowest index is considered the closest one.
  2. A new centroid is computed for each cluster. The centroid is in fact an average of all assigned points (computed per dimension). The task expects that integer arithmetics is used (i.e., all operations including the division is computed on integers). In case that a cluster has no points assigned to it (and thus computing new centroid will result in zero division error), the centroid should be preserved (no new centroid shall be computed).

The number of points are divisible by 1,024 and the number of clusters K will not exceed 256 (clusters are indexed from 0). All points will fit in RAM and the sum of all point coordinates in one dimension as well as the sum of squares of coordinate diferences (Euclidean distance) will fit into signed 64-bit integer.

Your solution must use a framework, which is available for you at /mnt/home/_teaching/para/01-kmeans (including the testing data, data generator, and serial implementation). Your solution is expected to modify only the implementation.hpp file of the framework while you have to preserve the name of the implementation class ( KMeans ) and it must implement the interface IKMeans . The modified file implementation.hpp is also the only file you are expect to submit into your submit_box directory. If you need to create additional headers ( *.hpp ), add them to you submit box as well, but do not create any subdirectories.

Use the attached Makefiles for compilation. When debugging, the KMeans class has a bool template parameter DEBUG . When the program is called with -debug argument, the debugging version of the class is used. You may use it to print out debugging information, but make sure that the regular (non-debug) version is not printing anything to the stdout.

Your solution will be tested on different datasets than you are provided. However, the evaluation data will be of the same size. The small partition (w2xx workers) will be used to measure the speedup.

#2 Levenshtein's Edit Distance

Design and implement a parallel algorithm for computing Levenshtein's Edit Distance using the OpenMP technology.

Your solution must use a framework, which is available for you at /mnt/home/_teaching/para/02-levenshtein (including the testing data, data generator, and serial implementation). Your solution is expected to modify only the implementation.hpp file of the framework while you have to preserve the name of the implementation class (EditDistance). The modified file implementation.hpp is also the only file you are expect to submit into your submit_box directory. If you need to create additional headers (*.hpp), add them to you submit box as well, but do not create any subdirectories.

Use the attached Makefiles for compilation. When debugging, the EditDistance class has a bool template parameter DEBUG. When the program is called with -debug argument, the debugging version of the class is used. You may use it to print out debugging information, but make sure that the regular (non-debug) verision is not printing anything to the stdout.

Your solution will be tested on diffrent datasets than you are provided. However, the evaluation data will use the same string lengths as the testing data. We recommend designing your performance-optimized solution to expect that the string lengths are divisible by 1,024 (but note that debugging data are much smaller).

#3 MPI Matrix Multiplication

Write a MPI program, which will perform a matrix multiplication. Source matrices can be too large to fit into a one node memory. Resulting matrix will fit into a one node memory. No other technology should be used, only vectorization instructions (SSE, AVX) are allowed. Files for source and result matrices are available only at rank #0. The program accepts three command line parameters, first two parameters are source matrices, the third one is the resulting matrix.

We have prepared some testing data and programs. They are placed in /mnt/home/_teaching/para/03-matrixmul-mpi/exec. generator generates a matrix. comparator compares two matrices. multiply is a serial version of matrix multiplication.

In this assignment, you are expected to submit an executable file named du3, which does not depend on any external library (like boost::mpi). Only the libraries which are generally available on all parlab workers can be used. The executable must be built using mpicxx/mpicc on workers.

Matrix file format

It is a binary format. First two 4-bytes integer numbers are C(olumns) and R(ows), followed by R*C floating point numbers in the IEEE 754 single precision format (float). The format is row-based, i.e. all elements from one row are in a sequence, followed by the next row.

Testing environment

The environment is slightly different from previous assignments. Testing matrices are placed in /mnt/home/_teaching/para/03-matrixmul-mpi/data. The program is executed using the following command line:

srun -n $N_PROC -N $N_HOSTS -p $PART -c 2 -m cyclic -t $TIMEOUT du3 $INPUT_FILE_A $INPUT_FILE_B $OUTPUT_FILE_R

where $N_PROC=256, $N_HOSTS=8, $PART=small-hp, and $TIMEOUT=20 (min). $INPUT_FILE_A, $INPUT_FILE_B, and $OUTPUT_FILE_R are replaced by corresponding matrices in data directory.

#4 Physical Simulation

The third assignment is to implement a physical simulation based on given model specification described in a separate document (Czech, English). The simulation is about simple particle mechanics based on forces and velocities. Unlike the other assignments, this one will be developed and tested on our GPU-lab (front server is gpulab.ms.mff.cuni.cz and it is accessible for you in the same manner as parlab).

Your solution must use a framework, which is available for you at /mnt/home/_teaching/para/04-potential (including the testing data and serial implementation). Your solution is expected to modify only the implementation.hpp, kernels.h, and kernels.cu. These are also the only files you are supposed to leave in your submit box. The implementation class (ProgramPotential) must be preserved and it must implement the interface IProgramPotential.

The compilation is a bit tricky as we need to combine pure-C++ code with CUDA code. You may use CUDA runtime functions in the implementation.hpp (e.g., for memory allocation and transfers), but kernels and their invocation has to be placed in kernels.cu. Each kernel should be provided with a pure-C function wrapper which invokes it (see my_kernel kernel and its run_my_kernel wrapper). We recommend not using templates (even though the interface is designed in that way). Your solution will be tested only with double precision floats (coordinates) and 32bit unsigned integers (indices).

Your solution is expected to perform the following changes in the implementation file:

  1. The virtual function initialize() is designed to initialize (allocate) the memory buffers. You may also copy input data (like edges) to the GPU or initialize the velocities to zero vectors.
  2. The virtual function iteration(points) should implement the computation performed in each iteration of the simulation. The function updates the velocities and moves the points according to them. This function is called as many times as many iterations the simulation has to perform. Furthermore, the API guarantees that every iteration call (starting from the second iteration) is given the points vector yielded by its previous call. In other words, you may cache the point positions (or any related data) at the GPU(s).
  3. The virtual function getVelocities() is expected to retrieve the internal copy of the point velocities. This function is invoked for verification only and it does not have to be efficient (its execution time will not be added to the overall measured time).

All the parameters of the model are set to the member variables of the ProgramPotential object before the initialize() function is called. The member variable mVerbose indicates whether the application was launched with "-verbose" argument. If so, you may print out debugging or testing output without any penalizations. Otherwise, your implementation must not perform any output to the stdio nor to the stderr. Criticall errors should be reported by throwing UserException. Cuda errors may be reported by throwing CudaError exception (you may conveniently use CUCH macro).

The framework tests the solution and prints out the measured times. The first value is the time of the initialization and the second value is an average time of an iteration (both in milliseconds). The initialization time is not considered for the speedup evaluation. Your solution will be tested on different data than you are provided, but we will use the same numbers of vertices and edges. The verification process will be performed separately to the time measurements; thus, it will not influence the performance. All tests will be pefromed using the initial simulation parameters (see potential.cpp).

Supplied makefile may be used for compilation. Do not forget, that the code has to be compiled at workers. The SLURM partitions are named volta-hp (high priority) and volta-lp (low priority) on GPU lab. For allocating the GPUs on the workers, you need to pass a general resources request parameter to srun:
$> srun -p volta-hp --gres=gpu:1
Use gpu:2 gres value if you want to implement your solution for two GPUs (be aware that no worker has currently more than 2 GPUs).


The work in labs focues on introducing thechnologies, which are subsequently used in home assignments. Therefore, it is widely recommended to attend the labs. Details about labs and seminars will be updated continuously.

date room details
20.2.2019 S1 Revising parallelism in operating systems and mainstream multicore CPUs seminar - slides
25/27.2.2019 SU1/SU2 Intel Threading Building Blocks Matrix Transposition
6.3.2019 S1 Explaining the first assignment assignment #1
11/13.3.2019 SU1/SU2 OpenMP Minimum and Matrix Mul
20.3.2019 S1 Discussing possible solutions for Ass#1. Explaining second assignment. assignment #2
25/27.3.2019 SU1/SU2 MPI Distributed reduction
3.4.2019 S1 Discussing possible solutions for Ass#2, explaining third assignment. assignment #3
8/10.4.2019 SU1/SU2 CUDA Blur stencil
17.4.2019 S1 Discussing possible solutions for Ass#3, explaining fourth assignment. assignment #4
24.4./6.5.2019 SU2/SU1 C++ native parallel primitives Rand Top-K
15.5.2019 S1 Discussing possible solutions for Ass#4 #4 analysis


Pages from the previous years.