This is the informational web page for the Programming in Parallel Environment course (NPRG042), taught at the Faculty of Mathematics and Physics, Charles University in Prague. The lectures are held in S4 from 12:20 to 13:50 every Tuesday. The seminars are held in S4 from 12:20 to 13:50 once a fortnight on Thursdays, starting the first week. The labs are held at the same time in SW2 every other week.

You can find here:


The course is currently led by Jakub Yaghob and Martin Kruliš. Please direct inquiries regarding a specific topic (or home assignment) to the teacher who presented that topic.

Latest Updates

22.2.2017 The web page was updated for the new academic year.


The grading criteria for this course are summarized below. Please read them carefully. If any rules are unclear, contact Dr. Jakub Yaghob.

Home Assignments

There will be 5 home assignments designed to explore several technologies and parallel platforms. The main grading criterion is the speedup of your solution, computed from the measured wall times of your solution and of the reference serial solution.

You may receive up to 4 points for each performance assignment. The sum of your points will determine your final mark as follows:

10 or more: 1 (excellent)
8 or 9: 2 (well done)
6 or 7: 3 (OK)
5 or less: failed

Furthermore, 5 points are required to receive the credit for the seminar (which is also required), and at least 1 point is required from each assignment. In other words, you need to solve every assignment at least in a minimal way.

If you are not satisfied with your mark but you have at least 5 points with 1 point from each assignment (and thus the credit), you may request an oral examination where your mark will be determined. Once you sign up for the exam, the mark received from the assignments is no longer valid; i.e., you may fail the exam even if you have enough points from the assignments.

Each assignment has a strict deadline. Once the deadline is reached, all assignments are collected and graded. You may submit your solution after the deadline and ask for (re)evaluation, but you will receive reduced points: a submission that would normally get 2 or more points will get only 1 point, and a submission that would not normally be worth 2 points will receive no points when delivered late.


The domain of parallel architectures and parallel programming is perhaps the most intensively developing field in computer science. Hence, the slides are updated continuously. Furthermore, bear in mind that some information on the slides may become outdated despite our best efforts. If you have any questions or suggestions related to the lectures, please contact the teachers.

21.2.2017 Parallel programming theory 01-theory.ppt
28.2.2017 Basic threading in C++ 11-cppthread.pptx
7.3.2017 Finding parallelism 02-finding.ppt
14.3.2017 Intel Threading Building Blocks 12-tbb.pptx
21.3.2017 Finding parallelism (continuation)
28.3.2017 OpenMP 13-openmp.ppt
4.4.2017 Algorithm structures 03-algorithm.ppt
11.4.2017 OpenCL and GPGPU 14-opencl.pptx
18.4.2017 Algorithm structures (continuation)
25.4.2017 MPI - basics 15-mpi.ppt
2.5.2017 MPI - advanced 15-mpi-adv.ppt
9.5.2017 Algorithm structures (continuation)
16.5.2017 Support structures 04-support.ppt
23.5.2017 Support structures (continuation)


The assignments will be tested on our parlab infrastructure, which is also at your disposal for development, debugging, and performance testing of your solutions. The front server is


and you may access it via the SSH protocol running on port 42222. Your student accounts will be created before the first assignment. Your login is s_sislogin, where sislogin is your login to our student information system (SIS). The generated password will be sent to your official e-mail. The parlab uses LDAP, so you will need to change the password only once. You will be granted access to

The workers (blade servers) are available via SSH (port 22) from the parlab server. The ubergrafik server is available directly (SSH running on port 42222). Your home directories are located in /mnt/home, which is an NFS folder mounted on all parlab servers.

Do not execute any of your code directly on the parlab server; use one of the workers instead.

Additional materials are located at /mnt/home/_teaching. The subdirectory para holds the data related to the assignments (data, frameworks, ...) and the subdirectory examples holds the examples that will be presented during the lectures.

Assignment Submission

Unless specified otherwise, you are expected to submit your solution by copying the appropriate files to the submit_box directory inside your home directory. The entire contents of this directory are copied atomically by our collection script when the deadline passes. The collection process also purges the submit box, so it is ready for another submission. Make sure you copy the correct files requested by the assignment specification, and keep your own copy of the files somewhere else.


The objective of each assignment is to implement a parallel solution for the described problem using the specified technology. Let us emphasize that the solution must use that particular technology and must not use any other technology for parallelization. Please read the assignment specification carefully and observe all details regarding file naming conventions, framework utilization, compilation, and APIs.

The correct results are defined by the serial solution provided for each problem. The speedup is computed as the ratio of the measured wall time of the serial solution to that of your parallel solution. When multiple datasets are provided, the speedup is determined for each dataset separately, and the final speedup is computed as the harmonic mean of the individual speedups.

The parallel solution must return exactly the same results as the serial solution unless specified otherwise. A solution that fails to do so will not be accepted (i.e., no points will be granted). Furthermore, a solution must not produce any output to either stdout or stderr that is not required or allowed by the problem description. Debugging outputs may interfere with our testing framework, in which case your solution will be treated as if it did not produce correct output.

The testing environment does not use any sandbox or other means to prevent the submitted solutions from performing malicious tasks, since we need to measure the performance without any interference. Please take extra care to ensure that your solution does not perform any tasks that may violate the security of parlab. Any attempt at an intentional breach will result in severe disciplinary (and possibly legal) actions against the participants.

The details regarding the testing infrastructure and the assignment submission can be found in the Parlab tab.

Please note that the parlab uses CentOS Linux, which ships a rather old version of the g++ compiler. We have installed a newer version of g++ on all workers to let you fully appreciate the new features of C++11/14. Use the g++14 command for compiling with the newer version. Moreover, you must add /usr/local/lib64 to your LD_LIBRARY_PATH before executing any program compiled by g++14. The g++14 compiler will also be used for testing your submissions. It will be invoked with the following flags: CXXFLAGS=-Wall -O3 -msse4.2 -std=c++14.

#1 Parallel Sorting

Design and implement a parallel algorithm that sorts an array of integers using C++11/14 threads and native synchronization tools (mutexes, atomics, condition variables, ...).

Your solution must use the framework available at /mnt/home/_teaching/para/01-sort-cpp (including a serial implementation). Your solution is expected to modify only the implementation.cpp file of the framework, while preserving the interface function declared in sort-cpp.hpp. The file implementation.cpp is also the only file you are expected to submit into your submit_box directory. If you need to create additional headers (*.hpp), add them to your submit box as well, but do not create any subdirectories.

Use the attached Makefiles for compilation.

The provided framework has several optional parameters and one mandatory parameter. The mandatory parameter sets the output file name; this file will be used for comparing results with the serial version. The optional parameter -nthrs gives a hint about the number of threads your solution should use. Your implementation may use a different number of threads, but the parameter must not be completely ignored. The number of threads need not be a power of two.

The interface function value_type* sort_impl(value_type *arr, std::size_t size, value_type *temp_arr, int nthrs); receives two allocated buffers of appropriate size. The arr is the source array, which is filled with random numbers. The temp_arr is an array for temporary values, pre-allocated for your convenience (to remove the allocation from the time measurement). You may use both arrays at your discretion, and the function must return either the arr or the temp_arr pointer, depending on where your sorted data reside at the end. The parameter nthrs holds the value of the -nthrs command line argument.
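One possible shape of such an implementation is sketched below (assuming value_type is int; the framework defines the actual type): each thread sorts one chunk with std::sort, then the sorted runs are merged pairwise while ping-ponging between the two provided buffers. This is illustrative only, not a tuned solution:

```cpp
#include <algorithm>
#include <cstddef>
#include <thread>
#include <utility>
#include <vector>

using value_type = int;  // assumption: the framework defines the real type

// Illustrative sketch of the sort_impl interface: sort chunks concurrently,
// then merge the sorted runs pairwise between the two buffers.
value_type* sort_impl(value_type* arr, std::size_t size,
                      value_type* temp_arr, int nthrs)
{
    if (nthrs < 1) nthrs = 1;
    const std::size_t chunk = (size + nthrs - 1) / nthrs;

    // Phase 1: sort each chunk in its own thread.
    std::vector<std::thread> threads;
    std::vector<std::size_t> bounds;  // run boundaries: 0, chunk, ..., size
    for (std::size_t begin = 0; begin < size; begin += chunk) {
        const std::size_t end = std::min(begin + chunk, size);
        bounds.push_back(begin);
        threads.emplace_back([arr, begin, end] { std::sort(arr + begin, arr + end); });
    }
    bounds.push_back(size);
    for (auto& t : threads) t.join();

    // Phase 2: merge runs pairwise until a single sorted run remains.
    value_type* src = arr;
    value_type* dst = temp_arr;
    while (bounds.size() > 2) {
        std::vector<std::size_t> next{0};
        std::size_t i = 0;
        for (; i + 2 < bounds.size(); i += 2) {
            std::merge(src + bounds[i], src + bounds[i + 1],
                       src + bounds[i + 1], src + bounds[i + 2],
                       dst + bounds[i]);
            next.push_back(bounds[i + 2]);
        }
        if (i + 1 < bounds.size()) {  // odd run left over: copy it through
            std::copy(src + bounds[i], src + bounds[i + 1], dst + bounds[i]);
            next.push_back(bounds[i + 1]);
        }
        bounds = std::move(next);
        std::swap(src, dst);
    }
    return src;  // whichever buffer holds the fully sorted data
}
```

Note how the returned pointer may be either buffer, depending on the number of merge passes, which is exactly what the interface permits.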

Your solution will be tested with various parameters. You may expect -nthrs to vary between 8 and 64. The array will contain between 10M and 1G items, and its size does not have to be a power of two.

#2 K-means Clustering

Write an implementation of the k-means clustering algorithm using the Intel Threading Building Blocks technology. The algorithm gets input points in a plane (R2), the number of clusters K, and the number of iterations. It returns the coordinates of the cluster centroids and the assignment of the points after the last iteration.

The algorithm works as follows. As an initialization, the first K points are taken as the initial set of cluster centroids. After that, the algorithm performs the prescribed number of iterations, where each iteration consists of the following steps:

  1. Each point is assigned to its nearest cluster -- i.e., the cluster whose centroid is nearest by the Euclidean metric. In case there are multiple clusters with centroids equidistant from the point being assigned, the cluster with the lowest index is considered the closest one.
  2. A new centroid is computed for each cluster. The centroid is in fact the average of all assigned points (computed per dimension). The task expects that integer arithmetic is used (i.e., all operations including the division are computed on integers). In case a cluster has no points assigned to it (and thus computing a new centroid would result in a division-by-zero error), the centroid is preserved (no new centroid is computed).

The number of points is divisible by 1,024 and the number of clusters K will not exceed 256 (clusters are indexed from 0). All points will fit in RAM, and both the sum of all point coordinates in one dimension and the sum of squares of coordinate differences (Euclidean distance) will fit into a signed 64-bit integer.
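The two iteration steps above can be sketched serially as follows (an illustrative sketch with assumed types and names; the actual solution must use TBB and the framework's KMeans/IKMeans interface):

```cpp
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

struct Point { std::int64_t x, y; };  // assumed point representation

// Serial reference logic for one k-means run; names and signature are
// illustrative, not the framework interface.
void kmeans_serial(const std::vector<Point>& points, std::size_t k,
                   std::size_t iters, std::vector<Point>& centroids,
                   std::vector<std::size_t>& assignment)
{
    centroids.assign(points.begin(), points.begin() + k);  // first K points
    assignment.assign(points.size(), 0);

    for (std::size_t it = 0; it < iters; ++it) {
        // Step 1: assign each point to the nearest centroid; the strict '<'
        // comparison keeps the lowest cluster index on ties.
        for (std::size_t p = 0; p < points.size(); ++p) {
            std::int64_t best = std::numeric_limits<std::int64_t>::max();
            for (std::size_t c = 0; c < k; ++c) {
                const std::int64_t dx = points[p].x - centroids[c].x;
                const std::int64_t dy = points[p].y - centroids[c].y;
                const std::int64_t dist = dx * dx + dy * dy;
                if (dist < best) { best = dist; assignment[p] = c; }
            }
        }
        // Step 2: recompute centroids as integer averages; empty clusters
        // keep their previous centroid.
        std::vector<std::int64_t> sx(k, 0), sy(k, 0), cnt(k, 0);
        for (std::size_t p = 0; p < points.size(); ++p) {
            sx[assignment[p]] += points[p].x;
            sy[assignment[p]] += points[p].y;
            ++cnt[assignment[p]];
        }
        for (std::size_t c = 0; c < k; ++c)
            if (cnt[c] > 0)
                centroids[c] = { sx[c] / cnt[c], sy[c] / cnt[c] };
    }
}
```

A TBB solution would typically parallelize step 1 with tbb::parallel_for and accumulate the per-cluster sums of step 2 with a parallel reduction.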

Your solution must use the framework available at /mnt/home/_teaching/para/02-kmeans (including the testing data, data generator, and serial implementation). Your solution is expected to modify only the implementation.hpp file of the framework, while preserving the name of the implementation class (KMeans), which must implement the interface IKMeans. The modified file implementation.hpp is also the only file you are expected to submit into your submit_box directory. If you need to create additional headers (*.hpp), add them to your submit box as well, but do not create any subdirectories.

Use the attached Makefiles for compilation. For debugging, the KMeans class has a bool template parameter DEBUG. When the program is called with the -debug argument, the debugging version of the class is used. You may use it to print out debugging information, but make sure that the regular (non-debug) version does not print anything to stdout.
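Such a compile-time debug switch can work, for instance, like this (an illustrative fragment only; the class name, method, and iteration body are assumptions, and the real KMeans must implement IKMeans):

```cpp
#include <cstddef>
#include <cstdio>

// Illustrative sketch: a bool template parameter selects the debug variant
// at compile time. In KMeansSketch<false> the condition is a compile-time
// constant, so the non-debug instantiation prints nothing at all.
template<bool DEBUG>
class KMeansSketch {
public:
    std::size_t run(std::size_t iters) {
        for (std::size_t it = 0; it < iters; ++it) {
            // ... one k-means iteration would go here ...
            if (DEBUG)  // debug-only progress output
                std::printf("iteration %zu done\n", it);
        }
        return iters;  // number of iterations performed
    }
};
```

The framework then instantiates the debug variant only when -debug is given, so ordinary runs stay silent as required.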

Your solution will be tested on different datasets than the ones you are provided. However, the evaluation data will be of the same size.


The work in the labs focuses on introducing the technologies that are subsequently used in the home assignments. Therefore, attending the labs is strongly recommended. Details about labs and seminars will be updated continuously.

date room details
23.2.2017 S4 Revising parallelism in operating systems and mainstream multicore CPUs seminar
2.3.2017 SW2 C++ native parallel primitives Rand Top-K
9.3.2017 S4 Explaining the first assignment assignment #1
16.3.2017 SW2 Intel Threading Building Blocks Matrix Transposition
23.3.2017 S4 assignment #2
30.3.2017 SW2 OpenMP
6.4.2017 S4 assignment #3
13.4.2017 SW2 OpenCL
20.4.2017 S4 assignment #4
27.4.2017 SW2 MPI
4.5.2017 S4 assignment #5
11.5.2017 SW2
18.5.2017 S4 #5 analysis
25.5.2017 SW2


Pages from the previous years.