NPRG042 Programming in Parallel Environment

Labs 02 - TBB matrix transpose

Take the starter pack tbb-matrix-tran.zip, unzip it, and open it in Visual Studio (or your favorite IDE). Alternatively, you can copy the same starter pack from the /home/_teaching/para/labs/tbb-matrix-tran NFS directory on parlab.

The objective is to parallelize the matrix transpose operation using Intel TBB. The matrix is stored in the Matrix class, which takes a policy class that defines its memory organization (currently, only the MatrixOrganizationLinear policy, representing the traditional row-wise layout, is available). The transposition must be performed in place. The matrix is always square, and its side length is taken as the only command-line argument (the default is 16 384). The matrix is filled with a simple pattern to facilitate validation.
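To make the row-wise layout and the in-place requirement concrete, here is a minimal serial sketch on a plain std::vector. All names in it (index_of, fill_pattern, transpose_in_place, is_transposed_pattern) and the fill pattern itself are illustrative assumptions; the Matrix class and the MatrixOrganizationLinear policy in the starter pack may expose a different interface and use a different pattern.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helpers for illustration only; they are not the starter pack API.

// Row-wise (row-major) layout: position of element (row, col) in an n x n matrix.
inline std::size_t index_of(std::size_t row, std::size_t col, std::size_t n) {
    return row * n + col;
}

// Assumed pattern: every element stores its own linear index,
// so a misplaced element is easy to detect after the transposition.
inline void fill_pattern(std::vector<std::size_t>& data, std::size_t n) {
    for (std::size_t r = 0; r < n; ++r)
        for (std::size_t c = 0; c < n; ++c)
            data[index_of(r, c, n)] = r * n + c;
}

// In-place transposition: swap each element above the diagonal with its
// mirror below the diagonal. Every pair is visited exactly once, and the
// diagonal itself stays untouched.
inline void transpose_in_place(std::vector<std::size_t>& data, std::size_t n) {
    for (std::size_t r = 0; r < n; ++r)
        for (std::size_t c = r + 1; c < n; ++c)
            std::swap(data[index_of(r, c, n)], data[index_of(c, r, n)]);
}

// Validation: after the transposition, element (r, c) must hold the value
// that fill_pattern originally wrote to (c, r).
inline bool is_transposed_pattern(const std::vector<std::size_t>& data, std::size_t n) {
    for (std::size_t r = 0; r < n; ++r)
        for (std::size_t c = 0; c < n; ++c)
            if (data[index_of(r, c, n)] != c * n + r)
                return false;
    return true;
}
```

Note that the work per row is uneven in this form (row r touches n - r - 1 elements), and that each swap accesses one element along a row and one along a column, which is worth keeping in mind when thinking about work distribution and cache behavior.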

The serial_transpose() function is provided as a reference serial implementation. The parallel_transpose() function is the one you should implement. The main() function is already prepared to measure the execution time of both functions and to validate the result of the parallel computation.
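For readers who want to take additional measurements of their own, the following is a minimal sketch of a wall-clock timing helper built on std::chrono. The names measure_seconds and speedup are hypothetical and are not taken from the starter pack, whose main() may measure time differently.

```cpp
#include <chrono>
#include <utility>

// Hypothetical helper (not part of the starter pack): runs `work` once and
// returns the elapsed wall-clock time in seconds.
template <typename Work>
double measure_seconds(Work&& work) {
    // steady_clock is monotonic, so the difference is never negative.
    const auto start = std::chrono::steady_clock::now();
    std::forward<Work>(work)();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}

// Speedup of the parallel run relative to the serial one
// (a value above 1 means the parallel version was faster).
inline double speedup(double serial_seconds, double parallel_seconds) {
    return serial_seconds / parallel_seconds;
}
```

A call such as measure_seconds([&] { serial_transpose(matrix); }) would time one run, assuming serial_transpose() accepts the matrix as its argument. Since a second transposition restores the original matrix, remember which state the data is in before validating.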

Compilation and execution on parlab

For compilation, you may use the provided Makefile and, optionally, the build.sh sbatch script. The latter is not an ordinary shell script; it must be submitted via Slurm as

$> sbatch ./build.sh

The sbatch command enqueues the job, which is executed once the resources become available, and prints the job ID. You can check the job status using the squeue command or the sacct -j <jobID> command. Once the job is finished, its output is stored in the ./build.log file.

If you prefer to run make directly from the terminal, use srun:

$> srun -A nprg042s -p mpi-homo-short -c 1 -n 1 make

Similarly, you can run the compiled program using

$> sbatch ./run.sh [ <matrix-side-size> ]

The script takes one optional argument (the matrix side size) and writes the output (measured times) to run.log. Alternatively, you can run the program interactively as:

$> srun -A nprg042s -p mpi-homo-short -c 32 -n 1 ./tbb-matrix-tran [ <matrix-side-size> ]

Feel free to experiment with the number of allocated CPUs (the -c argument) and the size of the matrix. For debugging, it is recommended to keep the size small (up to a few thousand) and to use only 4-8 CPUs so that resources can be shared fairly within the class. A reasonable maximum is 32 CPUs (64 are unlikely to bring any further improvement since there are only 32 physical cores).

After implementing the solution on your own, you may consult the result notes.