About

This is the informational web page for the Programming in Parallel Environment course (NPRG042), taught at the Faculty of Mathematics and Physics, Charles University in Prague. Lectures are held in S4 every Monday from 12:20 to 13:50. Seminars are held in S5 from 10:40 to 12:10 once a fortnight on Thursday, starting the first week. Labs take place every other week in SW2, either at the same time or from 14:00 to 15:30 on Thursday.

On this page you will find information about grading, lectures, home assignments, and labs.

You may also want to check out information regarding our KSI clusters.

Contact

The course is currently led by Jakub Yaghob and Martin Kruliš. Please direct inquiries about a specific topic (or home assignment) to the teacher who presented that topic.

Latest Updates

24.2.2021 The web page was updated for the new academic year.
17.2.2020 The web page was updated for the new academic year.
13.2.2019 The web page was updated for the new academic year.
14.2.2018 The web page was updated for the new academic year.

Grading

The grading criteria for this course are summarized below. Please read them carefully. If any rules are unclear, contact Dr. Jakub Yaghob.

Home Assignments

There will be 5 home assignments designed to explore several technologies and parallel platforms. The main criterion for grading is the speedup of your solution, computed by measuring the wall time of your solution and of the reference serial solution.

You may receive up to 4 points for each performance assignment. The sum of your points determines your final mark as follows:

points        mark
12 or more    1 (excellent)
9 to 11       2 (well done)
6 to 8        3 (OK)
5 or less     failed

Furthermore, 5 points are required to receive the credit for the seminar (which is also required), and at least 1 point is required from each assignment. In other words, you need to solve every assignment at least in a minimal way.

If you are not satisfied with your mark but have at least 5 points, including 1 point from each assignment (and thus the credit), you may request an oral examination where your mark will be determined. Once you sign up for the exam, the mark received from the assignments is no longer valid, i.e., you may fail the exam even if you have enough points from the assignments.

Each assignment has a strict deadline. Once the deadline is reached, all assignments are collected and graded. You may submit your solution after the deadline and ask for (re)evaluation, but you will receive reduced points: a late submission that would normally get 2 or more points gets only 1 point, and a late submission that would normally be worth less than 2 points receives no points.

Lectures

The domain of parallel architectures and parallel programming is perhaps the most intensively developing field in computer science. Hence, the slides are updated continuously. Furthermore, bear in mind that some information on the slides may become outdated despite our best efforts. If you have any questions or suggestions related to the lectures, please contact the teachers.

Lectures will be available on the course video channel. Use a login in the form xxx@cuni.cz, which will lead to the standard CAS authentication.

Date Topic Materials ZOOM Recording
1.3.2021 Parallel programming theory 01-theory.ppt 963 5693 3313 lecture recording
8.3.2021 C# and .NET 17-cs.pptx 929 5609 6564 lecture recording
15.3.2021 Finding parallelism 02-finding.ppt 963 5693 3313 lecture recording
22.3.2021 Intel Threading Building Blocks 12-tbb.pptx 950 4571 7195 lecture recording
29.3.2021 Algorithm structures 03-algorithm.ppt 963 5693 3313 lecture recording
12.4.2021 OpenMP 13-openmp.pptx 963 5693 3313 lecture recording
19.4.2021 Algorithm structures (continuation) 03-algorithm.ppt 963 5693 3313 lecture recording
26.4.2021 GPGPU, CUDA 16-cuda.pptx 926 5122 7627 lecture recording
3.5.2021 GPGPU, CUDA (continuation) 16-cuda-adv.pptx 926 5122 7627 lecture recording
10.5.2021 Support structures 04-support.ppt 963 5693 3313 lecture recording
17.5.2021 Apache Spark 15-spark.pptx 963 5693 3313 lecture recording
24.5.2021 Support structures (continuation) 04-support.ppt 963 5693 3313 lecture recording
31.5.2021 Support structures (continuation) 04-support.ppt 963 5693 3313 lecture recording

Assignments

The objective of each assignment is to implement a parallel solution for the described problem using the specified technology. Let us emphasize that the solution must use the particular technology and must not use any other technology for parallelization. Please read the assignment specification carefully and observe all details regarding the file naming conventions, framework utilization, compilation, and APIs.

The correct results are defined by the serial solution, which is provided for each problem. The speedup is computed as the ratio of the measured wall time of the serial solution to that of your parallel solution. When multiple datasets are provided, the speedup is determined for each dataset separately, and the final speedup is computed as the harmonic mean of the individual speedups.
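For illustration, the final speedup over several datasets can be computed as follows (a minimal sketch; the function name is ours, not part of any course framework):

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Harmonic mean of per-dataset speedups, where each speedup is
// the serial wall time divided by the parallel wall time.
double harmonicMean(const std::vector<double>& speedups) {
    double invSum = 0.0;
    for (double s : speedups) invSum += 1.0 / s;
    return static_cast<double>(speedups.size()) / invSum;
}
```

Note that the harmonic mean penalizes uneven results: a solution that is fast on one dataset but slow on another scores closer to its worse result.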

The parallel solution must return exactly the same results as the serial solution unless specified otherwise. A solution that fails to do so will not be accepted (i.e., no points will be granted). Furthermore, a solution must not produce any output to stdout or stderr that is not required or allowed by the problem description. Debugging outputs may interfere with our testing framework, in which case your solution will be treated as if it did not provide correct output.

The testing environment does not use any sandbox or other means to prevent the submitted solutions from performing malicious tasks, since we need to measure the performance without any interference. Please take extra care to ensure that your solution does not perform any tasks that may violate the security of parlab. Any attempt at an intentional breach will result in severe disciplinary (and possibly legal) actions against the participants.

The details regarding the testing infrastructure and the assignment submission can be found on the KSI clusters page.

Please note that parlab uses CentOS Linux, which ships with a rather old version of the g++ compiler. We have installed newer versions of g++ on all workers to let you fully appreciate the new features of C++11/14/17. The new g++ compiler will also be used for testing your submissions. It will be invoked with the following flags: CXXFLAGS=-Wall -O3 -std=c++17.

Additional materials are located at /mnt/home/_teaching. The subdirectory para holds the data related to the assignments (data, frameworks, ...) and the subdirectory examples holds the examples that will be presented during the lectures.

Assignment Submission

Unless specified otherwise, you are expected to submit your solution by copying the appropriate files to the submit_box directory inside your home directory. The entire contents of this directory are copied atomically by our collection script when the deadline passes. The collection process also purges the submit box, so it is ready for another submission. Make sure you copy the correct files requested by the assignment specification. Also make sure that you keep your own copy of the files somewhere else.

#1 C# .NET Async Programming

The objective is to write a recursive DNS resolver in C# .NET Core using the already implemented methods for individual DNS queries. Of course, all network operations will be simulated by mocks. Since the DNS problem is rather complex, we have simplified it significantly for the purposes of this assignment:

  • Each known domain has exactly 1 IPv4 address. Different domains have different addresses.
  • There is no difference between the server and its nameserver (we do not distinguish any record types A, NS, CNAME...).
  • The server holding domain X is the only authority that can answer queries for immediate sub-domains of X (i.e., if one needs to resolve mff.cuni.cz, one must first resolve cuni.cz and then ask it for the address of mff).
  • Root servers are well known (you will get a list of IPs).
  • There is no additional trickery (like CNAMEs); domains are simple identifiers separated by dots.
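To make the authority rule concrete, here is a small sketch (in C++ for brevity; the actual assignment is in C#, and the function name is ours) that turns a full domain into the order in which its labels must be resolved, from the root down:

```cpp
#include <cassert>
#include <string>
#include <vector>

// "mff.cuni.cz" must be resolved label by label from the right:
// ask a root server for "cz", then that server for "cuni", then
// that server for "mff". This returns the labels in that order.
std::vector<std::string> resolutionOrder(const std::string& domain) {
    std::vector<std::string> labels;
    std::size_t end = domain.size();
    while (end > 0) {
        std::size_t dot = domain.rfind('.', end - 1);
        std::size_t start = (dot == std::string::npos) ? 0 : dot + 1;
        labels.push_back(domain.substr(start, end - start));
        if (dot == std::string::npos) break;
        end = dot;
    }
    return labels;
}
```

A serial resolver simply walks this list, asking each returned server for the next label; the point of the assignment is to overlap and share this work across many concurrent queries.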

Your implementation will get an object implementing IDNSClient interface. The interface will provide the following methods:

  • IReadOnlyList<IP4Addr> GetRootServers() - returns the list of root server IPs
  • Task<IP4Addr> Resolve(IP4Addr server, string subDomain) - performs a single resolution step (you ask a specific server for an immediate subdomain)
  • Task<string> Reverse(IP4Addr server) - performs reverse translation (returns the full domain name of an IP address); this might be handy if you implement some sort of cache and need to quickly find out whether an IP address is still valid

Your job is to implement a class with the IRecursiveResolver interface, which has only one method - Task<IP4Addr> ResolveRecursive(string domain). This method gets a full domain name and yields its address. The file Serial.cs holds a trivial implementation of the resolver. Your code goes into the Solution.cs file, which is also the only file you need to submit.

All the interfaces yield Tasks, which might suggest your course of action. Besides low-level tasks, you may also employ the async/await approach. The main objective is to use parallelism to lower the (average) query latency and increase the throughput.

We would suggest implementing the following optimizations. First, when two similar queries are executed simultaneously (e.g., mff.cuni.cz and ff.cuni.cz), the common part may be resolved only once. This saves resources and may reduce the latency of the query that was started later. Second, you may use whatever caching you deem necessary, but beware that the addresses may change over time. Therefore, you need to verify data from the cache (e.g., by reverse translation) before yielding the result.

On the other hand, it is expected that you will not employ any hacking techniques or aggressive methods that might work in the real world. E.g., you are forbidden from guessing server IP addresses or scanning the entire IP range.

All parlab workers have .NET Core 5.0 installed. The initial solution is in /mnt/home/_teaching/para/01-dns-netcore. You may compile it using the dotnet command:
$> dotnet build -c Release ./dns-netcore.sln
The compiled binary will appear in the ./bin subdirectory. Remember that you need to start both the compilation and the job itself using SLURM on one of the workers.

Final remarks: This assignment is still experimental, and it is somewhat different from the other assignments. Therefore, we have not set any official scoring tables or rules yet. We will tune these things based on your results to ensure fair evaluation.

#2 K-means Clustering

Write an implementation of the k-means clustering algorithm using the Intel Threading Building Blocks technology. The algorithm gets input points in a plane (R²), the number of clusters K, and the number of iterations. It returns the coordinates of the cluster centroids and the assignment of the points after the last iteration.

The algorithm works as follows. As an initialization, the first K points are taken as the initial set of cluster centroids. After that, the algorithm performs the prescribed number of iterations, where each iteration consists of the following steps:

  1. Each point is assigned to its nearest cluster, i.e., the cluster whose centroid is nearest by the Euclidean metric. In case there are multiple clusters with centroids that are equidistant from the point being assigned, the cluster with the lowest index is considered the closest one.
  2. A new centroid is computed for each cluster. The centroid is in fact the average of all assigned points (computed per dimension). The task expects that integer arithmetic is used (i.e., all operations, including the division, are computed on integers). In case a cluster has no points assigned to it (and thus computing a new centroid would result in a division by zero), the centroid should be preserved (no new centroid shall be computed).

The number of points is divisible by 1,024 and the number of clusters K will not exceed 256 (clusters are indexed from 0). All points will fit in RAM, and the sum of all point coordinates in one dimension, as well as the sum of squares of coordinate differences (Euclidean distance), will fit into a signed 64-bit integer.
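The two iteration steps above can be sketched serially as follows (a plain C++ sketch with illustrative names, not the framework's interface); a TBB solution parallelizes the assignment loop and the per-cluster reductions:

```cpp
#include <cstdint>
#include <vector>

struct Point { std::int64_t x, y; };

// One k-means iteration with integer arithmetic: assign each point to
// the nearest centroid (lowest index wins ties), then recompute each
// centroid as the per-dimension integer average of its points.
// Empty clusters keep their old centroid.
void kmeansIteration(const std::vector<Point>& points,
                     std::vector<Point>& centroids,
                     std::vector<std::size_t>& assignment) {
    const std::size_t k = centroids.size();
    std::vector<Point> sums(k, Point{0, 0});
    std::vector<std::int64_t> counts(k, 0);
    for (std::size_t i = 0; i < points.size(); ++i) {
        std::size_t best = 0;
        std::int64_t bestDist = INT64_MAX;
        for (std::size_t c = 0; c < k; ++c) {
            std::int64_t dx = points[i].x - centroids[c].x;
            std::int64_t dy = points[i].y - centroids[c].y;
            std::int64_t dist = dx * dx + dy * dy;
            if (dist < bestDist) {  // strict < keeps the lowest index on ties
                bestDist = dist;
                best = c;
            }
        }
        assignment[i] = best;
        sums[best].x += points[i].x;
        sums[best].y += points[i].y;
        ++counts[best];
    }
    for (std::size_t c = 0; c < k; ++c)
        if (counts[c] > 0)  // empty cluster: preserve the old centroid
            centroids[c] = Point{sums[c].x / counts[c], sums[c].y / counts[c]};
}
```

The sums and counts form a natural reduction, so tbb::parallel_reduce (or per-thread partial sums merged at the end) is one way to map this onto the framework.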

Your solution must use a framework, which is available for you at /mnt/home/_teaching/para/02-kmeans (including the testing data, data generator, and serial implementation). Your solution is expected to modify only the implementation.hpp file of the framework, while you have to preserve the name of the implementation class (KMeans), and it must implement the interface IKMeans. The modified file implementation.hpp is also the only file you are expected to submit into your submit_box directory. If you need to create additional headers (*.hpp), add them to your submit box as well, but do not create any subdirectories.

Use the attached Makefiles for compilation. When debugging, the KMeans class has a bool template parameter DEBUG. When the program is called with the -debug argument, the debugging version of the class is used. You may use it to print out debugging information, but make sure that the regular (non-debug) version is not printing anything to stdout.

Your solution will be tested on different datasets than the ones you are provided. However, the evaluation data will be of the same size. The small partition (w2xx workers) will be used to measure the speedup.

#3 Levenshtein's Edit Distance

Design and implement a parallel algorithm for computing Levenshtein's Edit Distance using the OpenMP technology.
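For reference, the classic serial dynamic-programming algorithm looks as follows (a sketch with an illustrative name, not the framework interface). The challenge of the assignment is that each cell depends on its left, upper, and upper-left neighbors, so an OpenMP solution typically parallelizes along anti-diagonals or processes the matrix in blocks:

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Classic O(n*m) edit distance using two rolling rows; this is the
// serial baseline whose results a parallel solution must reproduce.
std::size_t editDistance(const std::string& a, const std::string& b) {
    std::vector<std::size_t> prev(b.size() + 1), curr(b.size() + 1);
    for (std::size_t j = 0; j <= b.size(); ++j) prev[j] = j;
    for (std::size_t i = 1; i <= a.size(); ++i) {
        curr[0] = i;
        for (std::size_t j = 1; j <= b.size(); ++j) {
            std::size_t sub = prev[j - 1] + (a[i - 1] == b[j - 1] ? 0 : 1);
            curr[j] = std::min({prev[j] + 1, curr[j - 1] + 1, sub});
        }
        std::swap(prev, curr);
    }
    return prev[b.size()];
}
```

Note that the rolling-row trick keeps memory at O(m), which matters for the large strings used in evaluation.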

Your solution must use a framework, which is available for you at /mnt/home/_teaching/para/03-levenshtein (including the testing data, data generator, and serial implementation). Your solution is expected to modify only the implementation.hpp file of the framework, while you have to preserve the name of the implementation class (EditDistance). The modified file implementation.hpp is also the only file you are expected to submit into your submit_box directory. If you need to create additional headers (*.hpp), add them to your submit box as well, but do not create any subdirectories.

Use the attached Makefiles for compilation. When debugging, the EditDistance class has a bool template parameter DEBUG. When the program is called with the -debug argument, the debugging version of the class is used. You may use it to print out debugging information, but make sure that the regular (non-debug) version is not printing anything to stdout.

Your solution will be tested on different datasets than the ones you are provided. However, the evaluation data will use the same string lengths as the testing data. We recommend designing your performance-optimized solution to expect that the string lengths are divisible by 1,024 (but note that the debugging data are much smaller).

#4 Physical Simulation in CUDA

The assignment is to implement a physical simulation based on a given model specification described in a separate document (Czech, English). The simulation covers simple particle mechanics based on forces and velocities. Unlike the other assignments, this one will be developed and tested on our GPU lab (the front server is gpulab.ms.mff.cuni.cz, and it is accessible for you in the same manner as parlab).

Your solution must use a framework, which is available for you at /mnt/home/_teaching/para/04-potential (including the testing data and serial implementation). Your solution is expected to modify only implementation.hpp, kernels.h, and kernels.cu. These are also the only files you are supposed to leave in your submit box. The implementation class (ProgramPotential) must be preserved, and it must implement the interface IProgramPotential.

The compilation is a bit tricky, as we need to combine pure C++ code with CUDA code. You may use CUDA runtime functions in implementation.hpp (e.g., for memory allocation and transfers), but kernels and their invocations have to be placed in kernels.cu. Each kernel should be provided with a pure-C function wrapper which invokes it (see the my_kernel kernel and its run_my_kernel wrapper). We recommend not using templates (even though the interface is designed that way). Your solution will be tested only with double-precision floats (coordinates) and 32-bit unsigned integers (indices).

Your solution is expected to perform the following changes in the implementation file:

  1. The virtual function initialize() is designed to initialize (allocate) the memory buffers. You may also copy input data (like edges) to the GPU or initialize the velocities to zero vectors.
  2. The virtual function iteration(points) should implement the computation performed in each iteration of the simulation. The function updates the velocities and moves the points according to them. This function is called once for every iteration the simulation has to perform. Furthermore, the API guarantees that every iteration call (starting from the second iteration) is given the points vector yielded by the previous call. In other words, you may cache the point positions (or any related data) on the GPU(s).
  3. The virtual function getVelocities() is expected to retrieve the internal copy of the point velocities. This function is invoked for verification only and it does not have to be efficient (its execution time will not be added to the overall measured time).

All the parameters of the model are set in the member variables of the ProgramPotential object before the initialize() function is called. The member variable mVerbose indicates whether the application was launched with the "-verbose" argument. If so, you may print out debugging or testing output without any penalties. Otherwise, your implementation must not perform any output to stdout or stderr. Critical errors should be reported by throwing UserException. CUDA errors may be reported by throwing the CudaError exception (you may conveniently use the CUCH macro).

The framework tests the solution and prints out the measured times. The first value is the time of the initialization, and the second value is the average time of an iteration (both in milliseconds). The initialization time is not considered for the speedup evaluation. Your solution will be tested on different data than you are provided, but we will use the same numbers of vertices and edges. The verification process will be performed separately from the time measurements; thus, it will not influence the performance. All tests will be performed using the initial simulation parameters (see potential.cpp).

The supplied Makefile may be used for compilation. Do not forget that the code has to be compiled on the workers. The SLURM partitions on the GPU lab are named volta-hp (high priority) and volta-lp (low priority). For allocating the GPUs on the workers, you need to pass a general resources request parameter to srun:
$> srun -p volta-hp --gres=gpu:1
Use the gpu:2 gres value if you want to implement your solution for two GPUs (but be aware that the reference testing will take place on volta01, which has only 2 GPUs).

#5 Apache Spark

The task is to count the number of people from a given population list who have the same first and last name and live in the same county. The pre-generated list can be found in the directory /mnt/home/_teaching/para/05-spark. Use containerized Spark on our SLURM parlab cluster.

In this case we will not measure the speedup, only the correctness of the output, because the Spark cluster startup takes too long and its duration varies widely.

There is also a spark-slurm.sh shell script in the directory, which you can customize to suit your solution. This script will be run using the sbatch command, so you can use the corresponding #SBATCH directives in the script.

In addition to your solution, also submit the above shell script to submit_box. It must have the name spark-slurm.sh, because this name will be used in automatic testing. Your solution can have any name you want, you just need to modify the script accordingly.

Output format

The output file must be named output.csv. It is in CSV format: one record per line, no header, LF line endings only. The first column is the "region number", i.e., the highest-order digit of the ZIP code. This column is sorted in ascending order. The second column is the number of collisions in that county.
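For illustration only (the counts here are made up), the first lines of output.csv might look like this:

```
1,1042
2,987
3,1311
```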

Testing

Everything in the submit_box is copied to a target test directory, and seznam.csv is added to it. SLURM is then run with your spark-slurm.sh from the target directory. The script will be run on the small-hp partition; this will be the only parameter passed to SLURM when running it. The parameters of the script will then be, in this order: the path to the Spark image, the network interface, and the target test directory (which will be mounted in the container as /mnt/1). The script has to run the correct application, i.e., the application is specified by the script (currently the script accepts it as the 4th parameter). The application path is already relative to the container, so it is something like /mnt/1/app.py. You need to set other SLURM parameters inside the script using #SBATCH (see the commented example in the script). The application writes its output to the file /mnt/1/output.csv, from where it is taken by the test script in the target test directory and compared with the correct result.

Labs

The work in the labs focuses on introducing the technologies which are subsequently used in the home assignments. Therefore, it is strongly recommended to attend the labs. Details about the labs and seminars will be updated continuously.

Both the labs and the home assignments will require that you familiarize yourselves with our parlab and gpulab clusters and the SLURM management system. See the KSI clusters readme.

Date ZOOM Topics Details Recordings
4.3.2021 927 5572 5683 Revising parallelism in operating systems and mainstream multicore CPUs seminar - slides lab record
11.3.2021 966 0472 4968 C# tasks, assignment #1 assignment #1 lab record
18.3.2021 966 0472 4968 Consultation for the assignment #1 assignment #1 -
25.3.2021 966 0472 4968 Intel Threading Building Blocks, assignment #2 assignment #2 Matrix Transposition lab record
1.4.2021 966 0472 4968 Consultation for the assignment #2, discussing possible solutions for assignment #1 assignment #2 lab record
8.4.2021 966 0472 4968 Consultation for the assignment #2. assignment #2 -
15.4.2021 927 5572 5683 OpenMP, assignment #3 Minimum and Matrix Mul assignment #3 lab record
22.4.2021 966 0472 4968 Consultation for the assignment #3, discussing possible solutions for assignment #2. assignment #3 lab record
29.4.2021 966 0472 4968 CUDA blur stencil, assignment #4 Instructions assignment #4 lab record
6.5.2021 927 5572 5683 Consultation for the assignment #4, discussing possible solutions for assignment #3. assignment #4 lab record
13.5.2021 966 0472 4968 Consultation for the assignment #4. assignment #4
20.5.2021 927 5572 5683 Apache Spark, assignment #5 assignment #5 lab record
27.5.2021 966 0472 4968 Consultation for the assignment #5, discussing possible solutions for assignment #4. assignment #5 lab record
3.6.2021 927 5572 5683 Discussing possible solutions for assignment #5. lab record

Archive

Pages from the previous years.
