Write an MPI program that performs matrix multiplication. The source matrices may be too large to fit into the memory of a single node, and they certainly do not fit into the memory of a single MPI process. The resulting matrix will fit into the memory of one MPI process. No other technology should be used; only vectorization instructions (SSE, AVX) are allowed. Files for the source and result matrices are available to all ranks (the NFS filesystem is shared by all nodes).
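To illustrate the vectorization allowance, here is a minimal sketch of a local multiplication kernel for row-major blocks. The function names `axpy` and `multiply_block` are illustrative and not part of the assignment, and the i-k-j loop order is just one possible choice. The AVX path is compiled only when the compiler targets AVX (for example with `-mavx` on GCC or Clang); otherwise the plain loop is used.

```cpp
#include <cassert>
#include <cstddef>
#if defined(__AVX__)
#include <immintrin.h>
#endif

// y[j] += a * x[j] for j in [0, n).
// Processes 8 floats per step with AVX when available; the scalar loop
// handles the remainder (or everything, if AVX is not enabled).
inline void axpy(float a, const float* x, float* y, std::size_t n) {
    std::size_t j = 0;
#if defined(__AVX__)
    const __m256 va = _mm256_set1_ps(a);
    for (; j + 8 <= n; j += 8) {
        __m256 vx = _mm256_loadu_ps(x + j);  // unaligned loads are safe here
        __m256 vy = _mm256_loadu_ps(y + j);
        vy = _mm256_add_ps(vy, _mm256_mul_ps(va, vx));
        _mm256_storeu_ps(y + j, vy);
    }
#endif
    for (; j < n; ++j) y[j] += a * x[j];
}

// C (m x n) += A (m x k) * B (k x n); all blocks row-major and contiguous.
// The i-k-j order walks rows of B and C sequentially, which suits axpy.
inline void multiply_block(const float* A, const float* B, float* C,
                           std::size_t m, std::size_t k, std::size_t n) {
    assert(A != nullptr && B != nullptr && C != nullptr);
    for (std::size_t i = 0; i < m; ++i)
        for (std::size_t p = 0; p < k; ++p)
            axpy(A[i * k + p], B + p * n, C + i * n, n);
}
```

Note that the kernel accumulates into C, so the caller is expected to zero the output block before the first call.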
The program accepts three command-line parameters: the first two are paths to the source matrices, and the third is the path to the resulting matrix.
We have prepared some testing data and programs. They are located in /home/_teaching/advpara/ha3-mpi-matmul:
exec/generator generates a matrix
exec/comparator compares two matrices
serial/serial-mulmatrix is a serial version of matrix multiplication
data is a folder with pre-generated matrices for multiplications of the form A x B = R (see the corresponding file extensions); the prefixes h and l refer to the size of the resulting matrix
The matrices are stored in a binary format: the first two 4-byte integers are C (columns) and R (rows), followed by R*C floating-point numbers in IEEE 754 single precision (float). The matrix uses row-major layout, i.e. all elements of one row form a contiguous sequence and the matrix is a sequence of rows.
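The file layout described above could be read and written roughly as follows. This is a sketch under stated assumptions: the header integers are treated as signed 32-bit values in the machine's native byte order (the text does not specify signedness or endianness), and the names `MatrixHeader`, `read_header`, `read_rows` and `write_matrix` are hypothetical helpers, not part of the provided tools.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <string>
#include <vector>

struct MatrixHeader {
    std::int32_t cols;  // first 4-byte integer in the file (C)
    std::int32_t rows;  // second 4-byte integer in the file (R)
};

// Reads the 8-byte header. Assumption: native byte order, signed integers.
inline MatrixHeader read_header(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    MatrixHeader h{};
    in.read(reinterpret_cast<char*>(&h.cols), sizeof h.cols);
    in.read(reinterpret_cast<char*>(&h.rows), sizeof h.rows);
    assert(!in.fail());
    return h;
}

// Reads `count` consecutive rows starting at row `first`. Because the layout
// is row-major, these rows are one contiguous byte range after the header.
inline std::vector<float> read_rows(const std::string& path,
                                    const MatrixHeader& h,
                                    std::int64_t first, std::int64_t count) {
    assert(first >= 0 && count >= 0 && first + count <= h.rows);
    std::ifstream in(path, std::ios::binary);
    const std::int64_t offset =
        8 + first * h.cols * static_cast<std::int64_t>(sizeof(float));
    in.seekg(offset, std::ios::beg);
    std::vector<float> buf(static_cast<std::size_t>(count * h.cols));
    in.read(reinterpret_cast<char*>(buf.data()),
            static_cast<std::streamsize>(buf.size() * sizeof(float)));
    assert(!in.fail());
    return buf;
}

// Writes a whole matrix: columns, rows, then rows*cols floats.
inline void write_matrix(const std::string& path, std::int32_t cols,
                         std::int32_t rows, const float* data) {
    std::ofstream out(path, std::ios::binary);
    out.write(reinterpret_cast<const char*>(&cols), sizeof cols);
    out.write(reinterpret_cast<const char*>(&rows), sizeof rows);
    out.write(reinterpret_cast<const char*>(data),
              static_cast<std::streamsize>(sizeof(float)) * rows * cols);
    assert(out.good());
}
```

Since the files are visible to all ranks, each process can call something like `read_rows` for its own row range instead of receiving the data from a single rank.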
The implementation should focus on homogeneous clusters, so it can make good use of collective operations. Use the SLURM partition mpi-homo-short and 256 processes for performance measurements. Note that it is wise to use smaller matrices and fewer processes (and especially fewer nodes) for debugging.
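On a homogeneous cluster, one natural starting point is to give every rank an (almost) equal block of result rows and collect the blocks with a collective such as `MPI_Gatherv`. The sketch below only computes the per-rank element counts and displacements that such a call expects. The decomposition by rows and the names `RowSplit` and `split_rows` are assumptions made for illustration, not a prescribed solution.

```cpp
#include <cassert>
#include <vector>

struct RowSplit {
    std::vector<int> counts;  // number of floats owned by each rank
    std::vector<int> displs;  // offset of each rank's block, in floats
};

// Splits `rows` rows of `cols` floats as evenly as possible over `nprocs`
// ranks; the first (rows % nprocs) ranks receive one extra row.
// Counts are plain int, as in the classic MPI_Gatherv/MPI_Scatterv
// interface, so rows*cols must stay below INT_MAX.
inline RowSplit split_rows(int rows, int cols, int nprocs) {
    assert(rows >= 0 && cols >= 0 && nprocs > 0);
    RowSplit s{std::vector<int>(nprocs), std::vector<int>(nprocs)};
    const int base = rows / nprocs;
    const int extra = rows % nprocs;
    int offset = 0;
    for (int r = 0; r < nprocs; ++r) {
        const int my_rows = base + (r < extra ? 1 : 0);
        s.counts[r] = my_rows * cols;
        s.displs[r] = offset;
        offset += s.counts[r];
    }
    return s;
}
```

For example, 10 rows of 4 columns over 3 ranks yields 4, 3 and 3 rows, i.e. counts of 16, 12 and 12 floats.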
Stretch goal: Also try to devise a solution for heterogeneous clusters (then use the partition mpi-hetero-short and 512 processes).
Use the following parameters for SLURM MPI startup:
-p mpi-homo-short -n 256 -N 8 -m cyclic -c 2 --mem-per-cpu=1G
-p mpi-hetero-short -n 512 -N 14 -c 2 --mem-per-cpu=512M
To make our results directly comparable, the speedup measurements should be performed against the serial implementation. Beware, the serial computation takes a really loooong time, so the serial times are already provided in a text file.
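For the table, speedup is conventionally the serial time divided by the parallel time, and parallel efficiency is the speedup divided by the number of processes. A small helper might look like this; the name `scaling` and the struct are illustrative, and the numbers in the usage note are made up, not measured.

```cpp
#include <cassert>

struct Scaling {
    double speedup;     // T_serial / T_parallel
    double efficiency;  // speedup / number of processes
};

// Both times must be in the same unit (e.g. seconds).
inline Scaling scaling(double serial_time, double parallel_time, int nprocs) {
    assert(serial_time > 0.0 && parallel_time > 0.0 && nprocs > 0);
    const double s = serial_time / parallel_time;
    return {s, s / nprocs};
}
```

With hypothetical times of 1024 s serial and 8 s on 256 processes, `scaling(1024.0, 8.0, 256)` gives a speedup of 128 and an efficiency of 0.5.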
As a part of your solution, create a table (in a readme file) with the measured times. If you implement the heterogeneous stretch goal, describe (in the readme) how to run each version properly and how the versions differ in implementation.