NPRG042 Programming in Parallel Environment
Labs 03 - OpenMP
Matrix multiplication
The first task is to accelerate matrix multiplication using OpenMP. The initial source code with the serial solution is ready in the /home/_teaching/para/labs/omp/matrix-mul directory on parlab. Open matrix-mul.cpp and parallelize the code in the para_matrix_mul() function.
Hint: Let's use parallel for! 🎉
Adjust the N constant in the main() function to select the matrix size: 256 is fine for debugging, 1024 is a good size for performance testing, and 2048 takes 28 seconds to compute sequentially on w201. Do not use larger matrices; they stall the worker for too long.
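The hint above can be illustrated with a minimal sketch. It assumes square matrices of floats stored in row-major std::vector buffers; the function name para_matrix_mul_sketch and its signature are hypothetical, and the actual para_matrix_mul() in matrix-mul.cpp may use a different data layout.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical signature; the real para_matrix_mul() in matrix-mul.cpp may differ.
// Computes c = a * b for square n x n matrices stored in row-major order.
void para_matrix_mul_sketch(const std::vector<float>& a,
                            const std::vector<float>& b,
                            std::vector<float>& c,
                            std::size_t n)
{
    // Each iteration of the outer loop writes a distinct row of c,
    // so the rows can be divided among threads without synchronization.
    #pragma omp parallel for
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            float sum = 0.0f;  // private to the thread: declared inside the loop
            for (std::size_t k = 0; k < n; ++k) {
                sum += a[i * n + k] * b[k * n + j];
            }
            c[i * n + j] = sum;
        }
    }
}
```

With GCC or Clang, the pragma takes effect only when the code is compiled with -fopenmp; without that flag the loop simply runs serially and produces the same result.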
Searching for minimum
The second task is to parallelize the search for the minimum in an array of numbers. The initial source code with the serial solution is ready in the /home/_teaching/para/labs/omp/minimum directory; edit the para_min() function in the minimum.cpp file.
You may adjust the value passed to generate_data() in the main() function to experiment with different array sizes. The default value is 1024*1024 (1M numbers).
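One possible approach to the minimum search is an OpenMP reduction, sketched below. The function name para_min_sketch, the int element type, and the std::vector argument are assumptions made for illustration; the real para_min() in minimum.cpp may look different.

```cpp
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical signature; the real para_min() in minimum.cpp may differ.
// Returns the smallest element, or the largest int value for an empty array.
int para_min_sketch(const std::vector<int>& data)
{
    const std::size_t n = data.size();
    int result = std::numeric_limits<int>::max();

    // Every thread keeps a private copy of result; OpenMP combines
    // the private copies with min once the loop is finished.
    #pragma omp parallel for reduction(min : result)
    for (std::size_t i = 0; i < n; ++i) {
        if (data[i] < result) {
            result = data[i];
        }
    }
    return result;
}
```

The min reduction operator for C and C++ requires OpenMP 3.1 or newer. A manual alternative is to keep a per-thread minimum and merge the partial results in a critical section.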
Stretch goal: Try to parallelize the search for the k smallest values in the array, where k is much smaller than the size of the array (tens or hundreds at most).
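A minimal sketch of one way to approach the stretch goal: each thread keeps its own max-heap of at most k candidates, and the per-thread candidates are merged at the end. The name para_k_smallest_sketch, the int element type, and the choice of heaps are all assumptions; other strategies (for example partial sorting of chunks) are equally valid.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical helper, not part of the provided lab sources.
// Returns the k smallest values of data in ascending order.
std::vector<int> para_k_smallest_sketch(const std::vector<int>& data, std::size_t k)
{
    const std::size_t n = data.size();
    std::vector<int> best;  // candidates gathered from all threads

    #pragma omp parallel
    {
        // Thread-local max-heap holding the k smallest values seen so far.
        std::vector<int> local;

        #pragma omp for nowait
        for (std::size_t i = 0; i < n; ++i) {
            if (local.size() < k) {
                local.push_back(data[i]);
                std::push_heap(local.begin(), local.end());
            } else if (k > 0 && data[i] < local.front()) {
                // Replace the largest candidate with the smaller new value.
                std::pop_heap(local.begin(), local.end());
                local.back() = data[i];
                std::push_heap(local.begin(), local.end());
            }
        }

        // Merging is cheap because each thread contributes at most k values.
        #pragma omp critical
        best.insert(best.end(), local.begin(), local.end());
    }

    std::sort(best.begin(), best.end());
    if (best.size() > k) {
        best.resize(k);
    }
    return best;
}
```

Since k is assumed to be much smaller than the array, the serial merge and final sort touch at most k values per thread, so the parallel scan dominates the running time.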
Compilation and execution on parlab
For compilation, you may use the provided Makefile and, optionally, the build.sh sbatch script. This is not an ordinary shell script; it needs to be submitted via Slurm as
$> sbatch ./build.sh
It will enqueue the job and execute it once the resources become available. The job ID is returned by the sbatch command. You can check the job status using the squeue command or the sacct -j <jobID> command. Once the job is finished, the output is stored in the ./build.log file.
If you prefer to run make directly as a terminal command, use srun:
$> srun -A nprg042s -p mpi-homo-short -c 1 -n 1 make
Similarly, you can run the compiled job using
$> sbatch ./run.sh
It writes the output (measured times) into run.log. Alternatively, you can run it interactively as:
$> srun -A nprg042s -p mpi-homo-short -c 32 -n 1 ./<executable-name> [ args... ]
Feel free to experiment with the number of allocated processors (the -c argument) and the size of the input data. For debugging, it is recommended to keep the size small (up to a few thousand) and use only 4-8 CPUs so you can better share resources in the class. 32 cores are a reasonable maximum (64 cores are likely to exhibit no improvement since there are only 32 physical cores).
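When experimenting with the -c argument, it can help to check how many threads OpenMP will actually use. The helper below is a hypothetical sketch (available_threads is not part of the lab sources). Whether the value follows the Slurm allocation depends on the cluster configuration; the standard OMP_NUM_THREADS environment variable can be used to set the thread count explicitly.

```cpp
#ifdef _OPENMP
#include <omp.h>
#endif

// Hypothetical helper, not part of the provided lab sources.
// Returns the number of threads a parallel region would use by default,
// or 1 when the code was compiled without OpenMP support.
int available_threads()
{
#ifdef _OPENMP
    return omp_get_max_threads();
#else
    return 1;
#endif
}
```

Printing this value at the start of main() makes it easier to relate the measured times to the number of threads that were really used.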