NPRG058 - AdvPara

Assignment 3 - MPI Matrix multiplication

Write an MPI program that performs matrix multiplication. The source matrices may be too large to fit into the memory of a single node, and they certainly do not fit into the memory of a single MPI process. The resulting matrix will fit into the memory of one MPI process. No technology other than MPI may be used; the only exception is vectorization instructions (SSE, AVX). The files for the source and result matrices are available to all ranks (the NFS filesystem is shared by all nodes).
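As an illustration of the per-process work, here is a minimal sketch of a local multiplication kernel. It assumes each rank already holds row-major blocks of the operands in memory; the function name multiply_block and its parameters are hypothetical and not part of the assignment. The loop order (i, p, j) keeps the innermost loop at unit stride, which compilers can typically auto-vectorize with SSE/AVX at a suitable optimization level.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Computes C += A * B for row-major blocks held in local memory.
// A is m x k, B is k x n, C is m x n. All names are illustrative.
void multiply_block(const float* a, const float* b, float* c,
                    std::size_t m, std::size_t k, std::size_t n)
{
    assert(a != nullptr && b != nullptr && c != nullptr);
    for (std::size_t i = 0; i < m; ++i) {
        float* c_row = c + i * n;
        for (std::size_t p = 0; p < k; ++p) {
            const float a_ip = a[i * k + p];   // one scalar reused for a whole row of B
            const float* b_row = b + p * n;
            // Unit-stride inner loop over contiguous rows of B and C.
            for (std::size_t j = 0; j < n; ++j) {
                c_row[j] += a_ip * b_row[j];
            }
        }
    }
}
```

Because the kernel accumulates into C, the caller is expected to zero the result block before the first call; this also allows partial products from successive blocks of A and B to be summed into the same C.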

The program accepts three command-line parameters: the first two are paths to the source matrices, and the third is the path to the resulting matrix.

We have prepared some testing data and programs. They are placed in /home/_teaching/advpara/ha3-mpi-matmul.

exec/generator generates a matrix
exec/comparator compares two matrices
serial/serial-mulmatrix is a serial version of matrix multiplication
data folder contains pre-generated matrices for products of the form A x B = R (see the corresponding file extensions). The prefixes h and l refer to the size of the resulting matrix.

Matrix file format

It is a binary format in which the first two 4-byte integers are C (columns) and R (rows), in that order, followed by R*C floating-point numbers in the IEEE 754 single-precision format (float). The matrix uses a row-major layout, i.e. all elements of one row form a contiguous sequence and the matrix is a sequence of rows.
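The format described above can be read and written with plain C++ streams. The following is a minimal sketch, assuming the two header integers are signed 32-bit values in the machine's native byte order (the description does not state signedness or endianness). The names MatrixHeader, read_header, read_rows and write_matrix are hypothetical helpers, not part of the provided tools.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <stdexcept>
#include <string>
#include <vector>

struct MatrixHeader {
    std::int32_t cols;  // C is stored first in the file
    std::int32_t rows;  // R is stored second
};

// Reads the 8-byte header from the current stream position.
MatrixHeader read_header(std::istream& in)
{
    MatrixHeader h{};
    in.read(reinterpret_cast<char*>(&h.cols), sizeof h.cols);
    in.read(reinterpret_cast<char*>(&h.rows), sizeof h.rows);
    if (!in) throw std::runtime_error("cannot read matrix header");
    return h;
}

// Reads rows [first_row, first_row + count) without loading the whole matrix.
std::vector<float> read_rows(const std::string& path,
                             std::size_t first_row, std::size_t count)
{
    std::ifstream in(path, std::ios::binary);
    if (!in) throw std::runtime_error("cannot open " + path);
    const MatrixHeader h = read_header(in);
    const std::size_t cols = static_cast<std::size_t>(h.cols);
    if (first_row + count > static_cast<std::size_t>(h.rows))
        throw std::out_of_range("row range exceeds matrix");
    // Skip the header (2 x 4 bytes) and all rows before first_row.
    const std::streamoff offset =
        8 + static_cast<std::streamoff>(first_row * cols * sizeof(float));
    in.seekg(offset, std::ios::beg);
    std::vector<float> data(count * cols);
    in.read(reinterpret_cast<char*>(data.data()),
            static_cast<std::streamsize>(data.size() * sizeof(float)));
    if (!in) throw std::runtime_error("cannot read matrix data");
    return data;
}

// Writes a whole row-major matrix: C, then R, then R*C floats.
void write_matrix(const std::string& path, std::int32_t rows,
                  std::int32_t cols, const std::vector<float>& data)
{
    assert(data.size() == static_cast<std::size_t>(rows) *
                              static_cast<std::size_t>(cols));
    std::ofstream out(path, std::ios::binary);
    if (!out) throw std::runtime_error("cannot open " + path);
    out.write(reinterpret_cast<const char*>(&cols), sizeof cols);
    out.write(reinterpret_cast<const char*>(&rows), sizeof rows);
    out.write(reinterpret_cast<const char*>(data.data()),
              static_cast<std::streamsize>(data.size() * sizeof(float)));
}
```

Since the layout is row-major, a contiguous range of rows maps to one contiguous byte range in the file, so a process can seek directly to its own rows. Reading a range of columns, by contrast, would need one seek per row.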

Implementation variants

The implementation should focus on homogeneous clusters, so that it can make good use of collective operations. Use the SLURM partition mpi-homo-short and 256 processes for performance measurements. Note that it is wise to use smaller matrices and fewer processes (and especially fewer nodes) for debugging.
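On a homogeneous cluster, one straightforward way to balance the work is to give every rank an almost equal share of the rows. The sketch below is only one possible decomposition, not a prescribed one; the names Partition and partition_rows are hypothetical. It produces element counts and displacements as int arrays, which is the shape that collectives such as MPI_Gatherv take for their counts and displacements arguments. It assumes that all counts fit into an int, which may not hold for very large matrices.

```cpp
#include <cassert>
#include <vector>

struct Partition {
    std::vector<int> counts;  // number of matrix elements owned by each rank
    std::vector<int> displs;  // element offset of each rank's first row
};

// Splits `rows` rows of `cols` elements among `ranks` processes as evenly
// as possible; the first (rows % ranks) ranks get one extra row.
Partition partition_rows(int rows, int cols, int ranks)
{
    assert(rows >= 0 && cols >= 0 && ranks > 0);
    Partition p;
    p.counts.resize(ranks);
    p.displs.resize(ranks);
    const int base = rows / ranks;
    const int extra = rows % ranks;
    int offset = 0;
    for (int r = 0; r < ranks; ++r) {
        const int my_rows = base + (r < extra ? 1 : 0);
        p.counts[r] = my_rows * cols;
        p.displs[r] = offset;
        offset += p.counts[r];
    }
    return p;
}
```

For example, 10 rows of 4 elements split among 3 ranks gives 4, 3 and 3 rows, i.e. counts of 16, 12 and 12 elements at displacements 0, 16 and 28.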

Stretch goal: Also try to devise a solution for heterogeneous clusters (then use the partition mpi-hetero-short and 512 processes).

Testing

Use the following parameters for SLURM MPI startup:

homogeneous implementation: -p mpi-homo-short -n 256 -N 8 -m cyclic -c 2 --mem-per-cpu=1G
heterogeneous implementation: -p mpi-hetero-short -n 512 -N 14 -c 2 --mem-per-cpu=512M

To make our results directly comparable, speedup must be measured against the serial implementation. Beware: the serial computation takes a really long time, so the serial times are already provided in a text file.
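Speedup here is taken in the usual sense: the serial time divided by the parallel time. The sketch below shows that calculation; the names Scaling and scaling are hypothetical. It also reports parallel efficiency (speedup divided by the number of processes), which the assignment does not ask for but which is a common companion metric.

```cpp
#include <cassert>

struct Scaling {
    double speedup;     // serial time / parallel time
    double efficiency;  // speedup / number of processes (extra metric, an assumption)
};

// Both times must be in the same unit, e.g. seconds.
Scaling scaling(double serial_time, double parallel_time, int processes)
{
    assert(serial_time > 0.0 && parallel_time > 0.0 && processes > 0);
    const double s = serial_time / parallel_time;
    return Scaling{s, s / processes};
}
```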

As part of your solution, create a table (in a readme file) with the measured times. If you implement the heterogeneous stretch goal, describe in the readme how to run each version properly and how the implementations differ.

NPRG058: Advanced Programming in Parallel Environment