NPRG058: Advanced Programming in Parallel Environment

Assignment 3 - MPI Matrix multiplication

Write an MPI program that performs matrix multiplication. The source matrices may be too large to fit into the memory of a single node, and they certainly do not fit into the memory of a single MPI process. The resulting matrix will fit into the memory of one MPI process. No other parallelization technology should be used; only vectorization instructions (SSE, AVX) are allowed. The files for the source and result matrices are accessible to all ranks (the NFS filesystem is shared by all nodes).

The program accepts three command-line parameters: the first two are paths to the source matrices, and the third is the path to the resulting matrix.

We have prepared some testing data and programs. They are located in /home/_teaching/advpara/ha3-mpi-matmul.

Matrix file format

It is a binary format where the first two 4-byte integers are C(olumns) and R(ows), followed by R*C floating-point numbers in IEEE 754 single precision (float). The matrix uses row-major layout, i.e., all elements of one row are stored contiguously and the matrix is a sequence of rows.

Implementation variants

The implementation should focus on homogeneous clusters, so it can make good use of collective operations. Use the SLURM partition mpi-homo-short and 256 processes for performance measurements. Note that it is wise to use smaller matrices and fewer processes (and especially fewer nodes) for debugging.

Stretch goal: Try to also devise a solution for heterogeneous clusters (then use the partition mpi-hetero-short and 512 processes).

Testing

Use the following parameters for the SLURM MPI startup:

To make our results directly comparable, the speedup measurements should be performed against the serial implementation. Beware, the serial computation takes a very long time, so the serial times are already provided in a text file.

As part of your solution, create a table (in a readme file) with the measured times. If you implement the heterogeneous stretch goal, describe in the readme how to run each version properly and how the implementations differ.