NPRG042 Programming in Parallel Environment

Labs 02 - TBB matrix transpose (Result Notes)

Perhaps the simplest way to parallelize the matrix transpose operation is to use the parallel_for() construct:

template<typename T, class O>
void parallel_for_transpose(Matrix<T, O>& matrix)
{
    tbb::parallel_for<std::size_t>(0, matrix.m(), [&](std::size_t i) {
        for (std::size_t j = i + 1; j < matrix.n(); ++j) {
            std::swap(matrix.at(i, j), matrix.at(j, i));
 }
 });
}

Note that we are iterating only over the upper triangle of the matrix, as the lower triangle is already transposed. The std::swap() function is used to perform the in-place transposition. Also, note that only the outer loop is parallelized. That creates a workload imbalance as each row has a different number of elements to transpose.

A more elaborate solution could be achieved by dividing the matrix into tiles and transposing each tile separately. This way, the workload is more balanced, and the cache locality is improved.

Perhaps the best solution is to employ the divide-and-conquer approach (especially if the matrix is square). The matrix is divided into four quadrants. The two quadrants lying on the main diagonal are transposed recursively. The remaining two quadrants are traversed together as they also need to be swapped, but also following a recursive division until small-enough tiles are reached. This way, the workload is balanced, and the cache locality is preserved.