NPRG042 Programming in Parallel Environment [2021/22]

Assignment 5


Physical Simulation
CUDA
Assigned: 2.5.2022
Deadline: 14.5.2022 23:59 (CEST)
Supervisor: Martin Kruliš
Submission: in submit_box directory in your home
Results: volta01 (2 GPUs)
speedup         points
20× or less     0
20× to 75×      1
75× to 150×     2
150× to 300×    3
300× or more    4

The assignment is to implement a physical simulation based on the model specification described in a separate document (Czech, English). The simulation performs simple particle mechanics based on forces and velocities. Unlike the other assignments, this one will be developed and tested in our GPU lab (the front server is gpulab.ms.mff.cuni.cz, and it is accessible to you in the same manner as parlab).

Your solution must use the framework available at /home/_teaching/para/05-potential (including the testing data and a serial implementation). Your solution is expected to modify only implementation.hpp, kernels.h, and kernels.cu; these are also the only files you are supposed to submit for evaluation. The implementation class (ProgramPotential) must be preserved, and it must implement the interface IProgramPotential.

The compilation is a bit tricky, as we need to combine pure C++ code with CUDA code. You may use CUDA runtime functions in implementation.hpp (e.g., for memory allocation and transfers), but kernels and their invocations have to be placed in kernels.cu. Each kernel should be provided with a pure-C function wrapper that invokes it (see the my_kernel kernel and its run_my_kernel wrapper). We recommend not using templates (even though the interface is designed that way). Your solution will be tested only with double-precision floats (coordinates) and 32-bit unsigned integers (indices).
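As a hedged illustration (only the my_kernel/run_my_kernel naming pattern comes from the framework; the parameter lists and per-element work below are assumptions), such a kernel-plus-wrapper pair in kernels.cu might look like:

```cuda
#include <cstddef>

// Illustrative sketch -- the actual parameters your kernels need depend
// on your simulation data layout (an assumption here).
__global__ void my_kernel(double *data, std::size_t count)
{
	std::size_t idx = blockIdx.x * (std::size_t)blockDim.x + threadIdx.x;
	if (idx < count) {
		data[idx] *= 2.0;  // placeholder per-element work
	}
}

// Pure-C wrapper callable from implementation.hpp, which is compiled by
// a regular C++ compiler and cannot use the <<<...>>> launch syntax itself.
void run_my_kernel(double *data, std::size_t count)
{
	constexpr unsigned blockSize = 256;
	unsigned blocks = (unsigned)((count + blockSize - 1) / blockSize);
	my_kernel<<<blocks, blockSize>>>(data, count);
}
```

The wrapper keeps all CUDA-specific syntax inside the file compiled by nvcc, so implementation.hpp only needs an ordinary function declaration from kernels.h.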

Your solution is expected to perform the following changes in the implementation file:

  1. The virtual function initialize() is designed to initialize (allocate) the memory buffers. You may also copy input data (like edges) to the GPU or initialize the velocities to zero vectors.
  2. The virtual function iteration(points) should implement the computation performed in each iteration of the simulation. The function updates the velocities and moves the points accordingly. It is called once per iteration of the simulation. Furthermore, the API guarantees that every iteration call (starting from the second) receives the points vector yielded by the previous call; in other words, you may cache the point positions (or any related data) in GPU memory.
  3. The virtual function getVelocities() is expected to retrieve the internal copy of the point velocities. This function is invoked for verification only and it does not have to be efficient (its execution time will not be added to the overall measured time).
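The three functions above might be wired together roughly as in the following sketch. All signatures, types, and member names (other than ProgramPotential, initialize, iteration, getVelocities, and CUCH, which the assignment mentions) are assumptions; in particular, the flat x,y point layout is purely illustrative.

```cuda
// implementation.hpp -- sketch only; signatures and types are assumptions
#include <cstddef>
#include <vector>

class ProgramPotential /* : public IProgramPotential<double, ...> */ {
	double *mCuPoints = nullptr;      // point positions cached on the GPU
	double *mCuVelocities = nullptr;  // velocities cached on the GPU
	std::size_t mPointCount = 0;
	bool mPointsUploaded = false;

public:
	void initialize(std::size_t points /*, edges, ... */)
	{
		mPointCount = points;
		std::size_t bytes = 2 * points * sizeof(double);  // assuming 2D points
		CUCH(cudaMalloc(&mCuPoints, bytes));
		CUCH(cudaMalloc(&mCuVelocities, bytes));
		CUCH(cudaMemset(mCuVelocities, 0, bytes));  // zero initial velocities
	}

	void iteration(std::vector<double> &points)  // flattened x,y pairs (assumption)
	{
		std::size_t bytes = 2 * mPointCount * sizeof(double);
		if (!mPointsUploaded) {  // first call: upload the initial positions
			CUCH(cudaMemcpy(mCuPoints, points.data(), bytes, cudaMemcpyHostToDevice));
			mPointsUploaded = true;
		}
		// Thanks to the API guarantee, the cached positions stay valid
		// between calls -- no further host-to-device copies are needed.
		run_my_kernel(mCuPoints, 2 * mPointCount);  // update velocities & points
		CUCH(cudaMemcpy(points.data(), mCuPoints, bytes, cudaMemcpyDeviceToHost));
	}

	void getVelocities(std::vector<double> &velocities)
	{
		velocities.resize(2 * mPointCount);
		CUCH(cudaMemcpy(velocities.data(), mCuVelocities,
			2 * mPointCount * sizeof(double), cudaMemcpyDeviceToHost));
	}
};
```

Since getVelocities() is excluded from the time measurements, its device-to-host copy does not need any optimization.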

All the parameters of the model are filled in the member variables of the ProgramPotential object before the initialize() function is called. The member variable mVerbose indicates whether the application was launched with the -verbose argument. If so, you may print out debugging or testing output without any penalization. Otherwise, your implementation must not write anything to the output (neither stdout nor stderr). Critical errors should be reported by throwing UserException. CUDA errors may be reported by throwing a CudaError exception (you may conveniently use the CUCH macro).
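For example, assuming CUCH wraps a CUDA runtime call and throws CudaError when the returned status is not cudaSuccess (its exact definition lives in the framework headers, as does UserException's constructor), a guarded allocation could look like:

```cuda
#include <cstddef>

// Hedged sketch: CUCH and UserException are provided by the framework;
// their exact interfaces are assumptions here.
double *allocateBuffer(std::size_t pointCount)
{
	if (pointCount == 0)
		throw UserException("no points to simulate");  // critical (non-CUDA) error

	double *buffer = nullptr;
	CUCH(cudaMalloc(&buffer, pointCount * sizeof(double)));  // throws CudaError on failure
	return buffer;
}
```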

The framework tests the solution and prints out the measured times. The first value is the time of the initialization, and the second value is the average time of an iteration (both in milliseconds). The initialization time is not considered for the speedup evaluation. Your solution will be tested on different data than you are provided, but we will use the same numbers of vertices and edges. The verification process will be performed separately from the time measurements; thus, it will not influence the performance. All tests will be performed using the initial simulation parameters (see potential.cpp).

The supplied Makefile may be used for compilation. Do not forget that the code has to be compiled on the workers. The SLURM partitions in the GPU lab are named gpu-short (high priority) and gpu-long (low priority). To allocate GPUs on the workers, you need to pass a general resources request parameter to srun:

$> srun -p gpu-short -A nprg042s --gres=gpu:1

Use --gres=gpu:2 if you want to implement your solution for two GPUs (be aware that reference testing will take place on volta01, which has only 2 GPUs).