About

This is an informational web page for the Programming in Parallel Environment course (NPRG042), taught at the Faculty of Mathematics and Physics, Charles University in Prague. The lectures are held in S4 every Thursday from 12:20 to 13:50. The seminars take place in S8 after the lecture (from 14:00) once a fortnight.

Contact

The course is currently led by Jakub Yaghob and Martin Kruliš. Please direct inquiries about a specific topic (or home assignment) to the teacher who presented the topic.

Latest Updates

17.2.2016 A new web page was created.

Grading

The grading criteria for this course are summarized below. Please read them carefully. In case any rules are not clear, contact Dr. Jakub Yaghob.

Home Assignments

There will be 5 home assignments designed to explore several technologies and parallel platforms. The main grading criterion is the speedup of your solution (except for assignment #4, which is graded only as passed or failed).

You may receive up to 4 points for each performance assignment and 1 point for assignment #4. The sum of your points determines your final mark as follows:

points      mark
9 or more   1 (excellent)
7 or 8      2 (well done)
5 or 6      3 (OK)
4 or less   failed

Furthermore, 5 points are required to receive credit for the seminar (which is also required). If you are not satisfied with your mark but have at least 5 points (and thus the credit), you may request an oral examination where your mark will be determined. Once you sign up for the exam, the mark received from the assignments is no longer valid, i.e., you may fail the exam even if you have enough points from the assignments.

Each assignment has a strict deadline. Once the deadline is reached, all assignments are collected and graded. You may submit your solution after the deadline and ask for (re)evaluation, but a late submission may receive at most 1 point regardless of its actual speedup.

Lectures

The domain of parallel architectures and parallel programming is perhaps the most intensively developing field in computer science. Hence, the slides are updated continuously. Furthermore, bear in mind that some information on the slides may become outdated despite our best efforts. If you have any questions or suggestions related to the lectures, please contact the teachers.

25.2.2016 Parallel programming theory (01-theory.ppt)
29.10.2015 Finding parallelism (02-finding.ppt)
10.3.2016 Basic threading in C++ (11-cppthread.pptx)
17.3.2016 Finding parallelism (continuation)
24.3.2016 Intel Threading Building Blocks (12-tbb.pptx)
31.3.2016 Algorithm structures (03-algorithm.ppt)
7.4.2016 OpenCL and GPGPU (13-opencl.pptx)
14.4.2016 Algorithm structures (continuation)
21.4.2016 MPI - basics (14-mpi.ppt)
28.4.2016 Algorithm structures (continuation)
5.5.2016 MPI - advanced (15-mpi-adv.ppt)
12.5.2016 Support structures (04-support.ppt)
19.5.2016 Support structures (continuation)
26.5.2016 Support structures (continuation)

Unused Lectures from Previous Years

OpenMP (unused-openmp.ppt)

Parlab

The assignments will be tested on our parlab infrastructure, which is also at your disposal for development, debugging, and performance testing. The front server is

parlab.ms.mff.cuni.cz

and you may access it via the SSH protocol running on port 42222. Your student accounts will be created before the first assignment. Your login is s_sislogin where sislogin is your login to our student information system (SIS). The generated password will be sent to your official e-mail. The parlab uses LDAP, so you will need to change the password only once. You will be granted access to the worker (blade) servers and to the ubergrafik GPU server.

The workers (blade servers) are available via SSH (port 22) from the parlab server. The ubergrafik is accessible directly (SSH running on port 42222). Your home directories are located in /mnt/home, which is an NFS folder mounted on all parlab servers.
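
For example, to connect to the front server (substitute your own SIS login):

ssh -p 42222 s_sislogin@parlab.ms.mff.cuni.cz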

Do not execute any of your code on the parlab server directly; use one of the workers instead.

Additional materials are located at /mnt/home/_teaching. The subdirectory para holds the data related to the assignments (data, frameworks, ...) and the subdirectory examples holds the examples that will be presented during the lectures.

Assignment Submission

Unless specified otherwise, you are expected to submit your solution by copying the appropriate files to the submit_box directory inside your home. The entire content of this directory is copied atomically by our collection script when the deadline passes. The collection process also purges the submit box, so it is ready for another submission. Make sure you copy the exact files requested by the assignment specification. Also make sure that you keep your own copy of the files somewhere else.

MPI Crash Course

You must perform the following steps before you attempt to launch your first MPI application:

  1. Make sure you can use SSH without a password (i.e., using authentication keys) between all workers and the parlab master. Since your home directory is shared among all servers, this task should be trivial (see the example commands after this list).
  2. The list of worker servers on which an MPI application is spawned has to be prepared in a file inside your directory (typically your home). Copy the initial /mnt/home/mpd.hosts file, which contains the list of all servers. You may edit the file to restrict the nodes used by the MPI processes.
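
For illustration, a standard way to set up passwordless SSH when the home directory is shared (the key type and paths below are the usual defaults; adjust them if your setup differs):

ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys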

The code must be compiled with the generic tool mpicc or with a language-specific wrapper (e.g., mpigxx is the MPI wrapper for g++).
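
A minimal sanity-check program (hello.cpp is a hypothetical name) that can be compiled on a worker with mpigxx -O2 hello.cpp -o hello and launched as described below:

#include <mpi.h>
#include <cstdio>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   // this process' ID
    MPI_Comm_size(MPI_COMM_WORLD, &size);   // total number of processes
    std::printf("Hello from rank %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}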

Executing an MPI application

mpiexec.hydra -n <N> -f <path>/mpd.hosts -rr -env I_MPI_FABRICS <iconn>:<econn> -env I_MPI_FALLBACK_DEVICE 0 <exac> <params>

<N> is the number of processes to be spawned. Other parameters (e.g., -rr) affect the distribution of the processes among available nodes.

<iconn> is the type of communication interface used between processes on the same host (e.g., shm for shared memory).

<econn> is the type of communication interface used between individual hosts (e.g., ofa for the OpenFabrics/InfiniBand stack, or tcp).

The fastest option is shm:ofa. It is also possible to run the application without distinguishing between internal and external interfaces (e.g., all processes may communicate only via ofa).

You may execute your MPI applications from any worker, but not from parlab. The spawning worker must be present in the mpd.hosts list.

Intel MPI Getting Started (PDF), Intel MPI Reference Manual (PDF)

Tracing MPI Jobs

MPI tracing records a time log of all messages sent between nodes. Tracing is activated by passing the additional parameter -trace to the mpiexec launcher. When the process terminates, the log is stored in <exac>.stf (<exac> is the name of your executable).

The log can be analysed by running traceanalyzer <exac>.stf. The trace analyser is a graphical tool, so you will need to use X forwarding over SSH.
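
For example, assuming a (hypothetical) executable named app and the default mpd.hosts location in your home:

mpiexec.hydra -trace -n 4 -f ~/mpd.hosts ./app    (run from a worker; produces app.stf)
traceanalyzer app.stf                             (requires X forwarding, e.g., ssh -X)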

Assignments

The objective of each assignment is to implement a parallel solution for the described problem using the specified technology. Let us emphasize that the solution must use that particular technology and must not use any other technology for parallelization. Please read the assignment specification carefully and observe all details regarding the file naming conventions, framework utilization, compilation, and APIs.

The correct results are defined by the serial solution that is provided for each problem. The speedup is computed as the ratio of the measured wall time of the serial solution to that of your parallel solution. When multiple datasets are provided, the speedup is determined for each dataset separately, and the final speedup is computed as the harmonic mean of the individual speedups.
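
For example, if your solution achieves speedups of 2 and 4 on two datasets, the final speedup is 2 / (1/2 + 1/4) ≈ 2.67 rather than the arithmetic mean 3, so a poor result on one dataset cannot be fully compensated by the others.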

The parallel solution must return exactly the same results as the serial solution unless specified otherwise. A solution that fails to do so will not be accepted (i.e., no points will be granted). Furthermore, a solution must not produce any output to either stdout or stderr that is not required or allowed by the problem description. Debugging outputs may interfere with our testing framework, in which case your solution will be treated as if it did not produce correct output.

The testing environment does not use any sandbox or other means to prevent the submitted solutions from performing malicious tasks, since we need to measure performance without any interference. Please take extra care to ensure that your solution does not perform any tasks that may violate the security of parlab. Any attempt at an intentional breach will result in severe disciplinary (and possibly legal) actions against the participants.

The details regarding the testing infrastructure and the assignment submission can be found in the Parlab tab.

Please note that the parlab uses Red Hat Linux, which has a rather old version of the g++ compiler. We have installed a newer version of g++ on all workers to let you fully appreciate the new features of C++11. Check out the /mnt/home/_teaching/use-new-gcc.sh script. The new g++ compiler will also be used for testing your submissions.

#1 Levenshtein's Edit Distance

Design and implement a parallel algorithm for computing the Levenshtein edit distance using C++11 threads and native synchronization tools (mutexes, atomics, condition variables, ...).
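
To illustrate one possible decomposition (not the required one): cells on the same anti-diagonal of the dynamic-programming matrix are mutually independent, so each diagonal can be split among threads. The standalone sketch below is illustrative only; the real solution must live in the EditDistance class of the framework, and spawning fresh threads per diagonal, as done here for brevity, would be far too slow for a good speedup.

#include <algorithm>
#include <cstddef>
#include <string>
#include <thread>
#include <vector>

std::size_t levenshtein(const std::string &a, const std::string &b, unsigned workers)
{
    const std::size_t n = a.size(), m = b.size();
    if (n == 0) return m;
    if (m == 0) return n;
    if (workers == 0) workers = 1;
    // prev2, prev1, cur hold three consecutive anti-diagonals;
    // the DP cell dist[i][j] (with i + j == d) is stored at index i.
    std::vector<std::size_t> prev2(n + 1), prev1(n + 1), cur(n + 1);
    prev2[0] = 0;                 // diagonal 0: dist[0][0]
    prev1[0] = 1; prev1[1] = 1;   // diagonal 1: dist[0][1] and dist[1][0]
    for (std::size_t d = 2; d <= n + m; ++d) {
        const std::size_t lo = (d > m) ? d - m : 0, hi = std::min(n, d);
        auto compute = [&](std::size_t from, std::size_t to) {
            for (std::size_t i = from; i < to; ++i) {
                const std::size_t j = d - i;
                if (i == 0) { cur[i] = j; continue; }   // first row
                if (j == 0) { cur[i] = i; continue; }   // first column
                cur[i] = std::min({prev1[i - 1] + 1,    // deletion
                                   prev1[i] + 1,        // insertion
                                   prev2[i - 1] + (a[i - 1] == b[j - 1] ? 0 : 1)});
            }
        };
        // split the current diagonal into roughly equal chunks
        const std::size_t cells = hi - lo + 1;
        const std::size_t chunk = (cells + workers - 1) / workers;
        std::vector<std::thread> threads;
        for (std::size_t from = lo; from <= hi; from += chunk)
            threads.emplace_back(compute, from, std::min(from + chunk, hi + 1));
        for (auto &t : threads) t.join();
        std::swap(prev2, prev1);
        std::swap(prev1, cur);
    }
    return prev1[n];   // dist[n][m] lies on diagonal n + m
}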

Your solution must use the framework, which is available for you at /mnt/home/_teaching/para/01-levenshtein (including the testing data, data generator, and serial implementation). You are expected to modify only the implementation.hpp file of the framework and to preserve the name of the implementation class (EditDistance). The modified implementation.hpp is also the only file you are expected to submit into your submit_box directory. If you need to create additional headers (*.hpp), add them to your submit box as well, but do not create any subdirectories.

Use the attached Makefiles for compilation. The EditDistance class has a bool template parameter DEBUG. When the program is called with the -debug argument, the debugging version of the class is used. You may use it to print out debugging information, but make sure that the regular (non-debug) version does not print anything to stdout.

Your solution will be tested on different datasets than those you are provided. However, the evaluation data will use the same string lengths as the testing data. We recommend designing your performance-optimized solution to expect string lengths divisible by 1,024 (but note that the debugging data are much smaller).

#2 K-means Clustering

Write an implementation of the k-means clustering algorithm using the Intel Threading Building Blocks technology. The algorithm gets input points in the plane (R2), the number of clusters K, and the number of iterations. It returns the coordinates of the cluster centroids and the assignment of the points after the last iteration.

The algorithm works as follows. As an initialization, the first K points are taken as the initial set of cluster centroids. After that, the algorithm performs the prescribed number of iterations, where each iteration consists of the following steps:

  1. Each point is assigned to its nearest cluster, i.e., the cluster whose centroid is nearest by the Euclidean metric. In case there are multiple clusters with centroids equidistant from the point being assigned, the cluster with the lowest index is considered the closest one.
  2. A new centroid is computed for each cluster as the average of all assigned points (computed per dimension). The task expects integer arithmetic (i.e., all operations, including the division, are computed on integers). In case a cluster has no points assigned to it (so computing a new centroid would cause a division by zero), the centroid is preserved (no new centroid is computed).

The number of points is divisible by 1,024 and the number of clusters K will not exceed 256 (clusters are indexed from 0). All points will fit in RAM, and both the sum of all point coordinates in one dimension and the sum of squares of coordinate differences (Euclidean distance) will fit into a signed 64-bit integer.
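
For illustration, one iteration could be structured as follows. This is a rough standalone sketch under assumed types (the Point struct and the function names are made up; the real interface is dictated by IKMeans), and atomic accumulators are just one of several ways to aggregate the per-cluster sums in TBB:

#include <tbb/blocked_range.h>
#include <tbb/parallel_for.h>
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Point { std::int64_t x, y; };   // assumed point representation

static std::int64_t dist2(const Point &p, const Point &c)
{
    const std::int64_t dx = p.x - c.x, dy = p.y - c.y;
    return dx * dx + dy * dy;   // fits into a signed 64-bit integer
}

// One k-means iteration: assign each point to the nearest centroid
// (ties resolved towards the lowest cluster index), then recompute
// the centroids in integer arithmetic; empty clusters keep theirs.
void kmeansIteration(const std::vector<Point> &points,
                     std::vector<Point> &centroids,
                     std::vector<std::size_t> &assignment)
{
    const std::size_t k = centroids.size();
    std::vector<std::atomic<std::int64_t>> sumX(k), sumY(k), count(k);
    for (std::size_t c = 0; c < k; ++c) { sumX[c] = 0; sumY[c] = 0; count[c] = 0; }

    tbb::parallel_for(tbb::blocked_range<std::size_t>(0, points.size()),
        [&](const tbb::blocked_range<std::size_t> &range) {
            for (std::size_t i = range.begin(); i != range.end(); ++i) {
                std::size_t best = 0;
                std::int64_t bestDist = dist2(points[i], centroids[0]);
                for (std::size_t c = 1; c < k; ++c) {
                    const std::int64_t d = dist2(points[i], centroids[c]);
                    if (d < bestDist) { bestDist = d; best = c; }  // strict <, lowest index wins
                }
                assignment[i] = best;
                sumX[best].fetch_add(points[i].x, std::memory_order_relaxed);
                sumY[best].fetch_add(points[i].y, std::memory_order_relaxed);
                count[best].fetch_add(1, std::memory_order_relaxed);
            }
        });

    for (std::size_t c = 0; c < k; ++c)
        if (count[c] > 0) {   // empty clusters preserve their centroid
            centroids[c].x = sumX[c] / count[c];
            centroids[c].y = sumY[c] / count[c];
        }
}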

Your solution must use the framework, which is available for you at /mnt/home/_teaching/para/02-kmeans (including the testing data, data generator, and serial implementation). You are expected to modify only the implementation.hpp file of the framework; the name of the implementation class (KMeans) must be preserved and it must implement the interface IKMeans. The modified implementation.hpp is also the only file you are expected to submit into your submit_box directory. If you need to create additional headers (*.hpp), add them to your submit box as well, but do not create any subdirectories.

Use the attached Makefiles for compilation. The KMeans class has a bool template parameter DEBUG. When the program is called with the -debug argument, the debugging version of the class is used. You may use it to print out debugging information, but make sure that the regular (non-debug) version does not print anything to stdout.

Your solution will be tested on different datasets than those you are provided. However, the evaluation data will be of the same size.

#3 Physical Simulation

The third assignment is to implement a physical simulation based on the model specification described in a separate document (Czech, English). The simulation covers simple particle mechanics based on forces and velocities. Unlike the other assignments, this one will be developed and tested on a dedicated machine with GPUs (the ubergrafik server). Please note that you can log in with the same account as for the parlab master.

Your solution must use the framework, which is available for you at /mnt/home/_teaching/para/03-potential (including the testing data and serial implementation). You are expected to modify only the implementation.hpp file of the framework; the name of the implementation class (ProgramOCL) must be preserved and it must implement the interface IProgramOCL. The modified implementation.hpp is also the only file you need to submit into your submit_box directory. If you need to create additional headers (*.hpp), add them to your submit box as well, but do not create any subdirectories.

Your GPU kernels may be placed as strings in the *.hpp files, or (more conveniently) you may save them to separate files with the *.cl extension in the same directory as implementation.hpp. Do not forget to copy all your *.cl files into the submit box along with the implementation.
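
The framework hides most of the OpenCL boilerplate from you. Purely for orientation, here is a tiny self-contained toy example (independent of the framework; the kernel and all names are made up, and error checking is omitted for brevity) that builds a kernel from a source string and advances points by their velocities using the OpenCL C++ bindings (cl.hpp):

#include <CL/cl.hpp>
#include <cstring>
#include <iostream>
#include <vector>

static const char *kernelSrc =
    "__kernel void move(__global float2 *pts,\n"
    "                   __global const float2 *vel, float dt) {\n"
    "    size_t i = get_global_id(0);\n"
    "    pts[i] += vel[i] * dt;\n"
    "}\n";

int main()
{
    const std::size_t n = 1024;
    std::vector<cl_float2> pts(n), vel(n);   // zero-filled demo data

    std::vector<cl::Platform> platforms;
    cl::Platform::get(&platforms);
    std::vector<cl::Device> devices;
    platforms[0].getDevices(CL_DEVICE_TYPE_GPU, &devices);
    cl::Context context(devices);
    cl::CommandQueue queue(context, devices[0]);

    // compile the kernel; the name passed to cl::Kernel must match
    // the __kernel function name (cf. the framework's registration)
    cl::Program::Sources sources;
    sources.push_back(std::make_pair(kernelSrc, std::strlen(kernelSrc)));
    cl::Program program(context, sources);
    program.build(devices);
    cl::Kernel kernel(program, "move");

    cl::Buffer bufPts(context, CL_MEM_READ_WRITE, n * sizeof(cl_float2));
    cl::Buffer bufVel(context, CL_MEM_READ_ONLY, n * sizeof(cl_float2));
    queue.enqueueWriteBuffer(bufPts, CL_TRUE, 0, n * sizeof(cl_float2), pts.data());
    queue.enqueueWriteBuffer(bufVel, CL_TRUE, 0, n * sizeof(cl_float2), vel.data());

    kernel.setArg(0, bufPts);
    kernel.setArg(1, bufVel);
    kernel.setArg(2, 0.01f);   // dt
    queue.enqueueNDRangeKernel(kernel, cl::NullRange, cl::NDRange(n), cl::NullRange);
    queue.enqueueReadBuffer(bufPts, CL_TRUE, 0, n * sizeof(cl_float2), pts.data());

    std::cout << "first point: " << pts[0].s[0] << ", " << pts[0].s[1] << std::endl;
    return 0;
}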

Your solution is expected to perform the following changes in the implementation file:

  1. The ProgramOCL constructor is expected to register the source files or source codes of the kernels, which are then compiled by OpenCL; this way, the compilation time can be excluded from the execution time. The framework provides a compiled cl::Kernel object for each registered kernel to the main execution routine. Beware that the kernel identifier used in the ProgramOCL constructor must match the kernel function name.
  2. The virtual function initialize() is designed to initialize (allocate) the memory buffers. You may also copy input data to the GPU or initialize the velocities to zero vectors.
  3. The virtual function iteration(points) should implement the computation performed in each iteration of the simulation. The function updates the velocities and moves the points accordingly. It is called once per simulation iteration. Furthermore, the API guarantees that the second and subsequent calls receive exactly the point positions yielded by the previous iteration; in other words, you may cache the point positions (or any related data) on the GPU(s).
  4. The virtual function getVelocities() is expected to retrieve the internal copy of the point velocities. This function is invoked for verification only and does not have to be efficient (its execution time will not be reflected in the performance evaluation).

All the parameters of the model and the necessary OpenCL entities are initialized and set to the member variables of the ProgramOCL object before the initialize() function is called. The member variable mVerbose indicates whether the application was launched with the "-verbose" argument. If so, you may print debugging or testing output at your discretion. Otherwise, your implementation must not produce any output to stdout or stderr. Critical errors should be reported by throwing UserException.

The framework tests the solution and prints out the measured times. The first value is the initialization time and the second value is the average time of one iteration (both in milliseconds). The initialization time is not considered in the speedup evaluation. Your solution will be tested on different data than you are provided, but with the same numbers of vertices and edges. The verification process is performed separately from the time measurements, so it will not influence the performance. All tests will be performed using the initial simulation parameters (see potential.cpp).

#4 Simple MPI

Write an MPI program which performs a trivial sum of multiple integers. The program reads a file name as a string from the standard input at rank #0. The given file is opened on all ranks and a single integer (in decimal format) is read from the file. All values and their sum will fit into a 32-bit signed integer. The final sum of all values has to be printed to the standard output at rank #0 (also in decimal format).

If the given file is missing on any rank, the program must terminate correctly on all ranks, and an error message containing the IDs of the failed ranks has to be printed to the standard output of rank #0 instead of the result.

In this assignment, you are expected to submit an executable file named du4, which does not depend on any external library (like boost::mpi). Only the libraries which are generally available on all parlab workers can be used.

Let us emphasize that the file name has to be read from the standard input at rank #0, not from the program parameters.
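
A rough sketch of the required communication pattern follows (not a complete submission; the buffer sizes and the exact wording of the error message are simplified placeholders):

#include <mpi.h>
#include <cstdio>
#include <cstring>
#include <string>
#include <vector>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    // rank #0 reads the file name from stdin and broadcasts it
    char fileName[1024] = { 0 };
    if (rank == 0 && std::fgets(fileName, sizeof(fileName), stdin))
        fileName[std::strcspn(fileName, "\r\n")] = '\0';   // strip the newline
    MPI_Bcast(fileName, sizeof(fileName), MPI_CHAR, 0, MPI_COMM_WORLD);

    // every rank tries to read one integer from the file
    int value = 0, ok = 0;
    if (std::FILE *f = std::fopen(fileName, "r")) {
        ok = (std::fscanf(f, "%d", &value) == 1);
        std::fclose(f);
    }
    if (!ok) value = 0;   // failed ranks contribute nothing to the sum

    // collect the partial values and the per-rank success flags
    int sum = 0;
    std::vector<int> flags(size);
    MPI_Reduce(&value, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
    MPI_Gather(&ok, 1, MPI_INT, flags.data(), 1, MPI_INT, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        std::string failed;
        for (int r = 0; r < size; ++r)
            if (!flags[r]) failed += " " + std::to_string(r);
        if (failed.empty())
            std::printf("%d\n", sum);
        else
            std::printf("Error: file missing on ranks%s\n", failed.c_str());
    }
    MPI_Finalize();
    return 0;
}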

For your convenience, we have prepared a short MPI Crash Course, which will help you compile and start your MPI applications.

#5 MPI Matrix Multiplication

Write an MPI program which performs matrix multiplication. The source matrices can be too large to fit into a single node's memory; the resulting matrix will fit into a single node's memory. No other parallelization technology may be used; only vectorization instructions (SSE, AVX) are allowed. The source and result matrices are available only at rank #0. The program accepts three parameters: the first two are the file names of the source matrices, and the third is the file name of the resulting matrix.

We have prepared testing data and programs, placed in /mnt/home/_teaching/para/05-matrixmul-mpi. The generator generates a matrix, the comparator compares two matrices, and multiply is a serial version of matrix multiplication.

In this assignment, you are expected to submit an executable file named du5, which does not depend on any external library (like boost::mpi). Only the libraries which are generally available on all parlab workers can be used.

Matrix file format

It is a binary format. The first two 4-byte integers are R (rows) and C (columns), followed by R*C floating point numbers in IEEE 754 single precision format (float). The format is row-major, i.e., all elements of one row are stored in sequence, followed by the next row.
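
A minimal reader for this format might look as follows (a sketch assuming a little-endian host, which matches the parlab machines; the Matrix struct and readMatrix are hypothetical names and error handling is reduced to a bool):

#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <vector>

struct Matrix {
    std::int32_t rows, cols;
    std::vector<float> data;   // row-major: element (i, j) is data[i * cols + j]
};

bool readMatrix(const char *path, Matrix &m)
{
    std::FILE *f = std::fopen(path, "rb");
    if (!f) return false;
    // header: two 4-byte integers (rows, columns)
    bool ok = std::fread(&m.rows, sizeof(m.rows), 1, f) == 1
           && std::fread(&m.cols, sizeof(m.cols), 1, f) == 1;
    if (ok) {
        // payload: rows * cols single-precision floats, row by row
        m.data.resize((std::size_t)m.rows * m.cols);
        ok = std::fread(m.data.data(), sizeof(float), m.data.size(), f) == m.data.size();
    }
    std::fclose(f);
    return ok;
}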