Assignment 2 - CUDA Histogram II
We continue the Histogram saga (this assignment assumes you have completed the first one).
In this episode, we will try to improve the performance of the atomic-updates kernel with three optimizations:
- Privatization: In the case of a small histogram, collisions of atomic updates are frequent. Try making multiple copies of the histogram and aggregating them at the end. It is not feasible to have a copy for every thread, but we can create a fixed number of copies and assign threads (or thread blocks) to them in a round-robin manner.
- Shared Memory Caching: Using shared memory is yet another specialized case of privatization. If the histogram fits in shared memory, we can aggregate the updates of one thread block there and merge the local copy of the histogram with the global copy at the end. For very small histograms, it may even be beneficial to create multiple copies inside shared memory (e.g., one copy per memory bank).
- Work Aggregation: To fully utilize shared memory, it is better to aggregate as many updates as possible; i.e., one CUDA thread should process more than one input item. Make this amount (items per thread) a tuning parameter and try to find the right value empirically.
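A minimal sketch of global-memory privatization, assuming the input is a byte array and the histogram counters are `unsigned` (the kernel and parameter names are illustrative, not part of the assignment skeleton). Private copies are laid out consecutively in one buffer; each thread block is assigned a copy round-robin by its block index, and a second kernel merges the copies:

```cuda
// histCopies holds numCopies consecutive histograms of numBins counters
// (zero-initialized before launch). Blocks pick a copy round-robin, which
// spreads atomic collisions across the copies.
__global__ void histogramPrivatized(const unsigned char *input, unsigned n,
                                    unsigned *histCopies, unsigned numBins,
                                    unsigned numCopies)
{
    unsigned idx = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned *myCopy = histCopies + (blockIdx.x % numCopies) * numBins;
    if (idx < n)
        atomicAdd(&myCopy[input[idx]], 1u);
}

// Aggregate the private copies into the final histogram, one thread per bin.
__global__ void mergeCopies(const unsigned *histCopies, unsigned *hist,
                            unsigned numBins, unsigned numCopies)
{
    unsigned bin = blockIdx.x * blockDim.x + threadIdx.x;
    if (bin >= numBins) return;
    unsigned sum = 0;
    for (unsigned c = 0; c < numCopies; ++c)
        sum += histCopies[c * numBins + bin];
    hist[bin] = sum;
}
```

The number of copies trades memory (and merge cost) for fewer collisions, so it is worth sweeping alongside your other parameters.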
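Shared-memory caching and work aggregation combine naturally into one kernel. The sketch below (illustrative names; assumes byte input and that `numBins * sizeof(unsigned)` fits in shared memory) keeps one block-local histogram in shared memory and makes the items-per-thread count a compile-time tuning parameter:

```cuda
// ITEMS_PER_THREAD is the tuning parameter from the assignment text.
// Launch with dynamic shared memory of numBins * sizeof(unsigned), e.g.:
//   histogramShared<8><<<grid, block, numBins * sizeof(unsigned)>>>(...);
template <unsigned ITEMS_PER_THREAD>
__global__ void histogramShared(const unsigned char *input, unsigned n,
                                unsigned *hist, unsigned numBins)
{
    extern __shared__ unsigned localHist[];   // one copy per thread block

    // Zero the block-local histogram cooperatively.
    for (unsigned b = threadIdx.x; b < numBins; b += blockDim.x)
        localHist[b] = 0;
    __syncthreads();

    // Each thread handles ITEMS_PER_THREAD items, strided by blockDim.x
    // so that consecutive threads read consecutive items (coalesced loads).
    unsigned base = blockIdx.x * blockDim.x * ITEMS_PER_THREAD + threadIdx.x;
    for (unsigned i = 0; i < ITEMS_PER_THREAD; ++i) {
        unsigned idx = base + i * blockDim.x;
        if (idx < n)
            atomicAdd(&localHist[input[idx]], 1u);  // cheap shared-mem atomic
    }
    __syncthreads();

    // Merge the local copy into the global histogram once per block.
    for (unsigned b = threadIdx.x; b < numBins; b += blockDim.x)
        atomicAdd(&hist[b], localHist[b]);
}
```

Note that the grid size must shrink accordingly (roughly `n / (blockDim.x * ITEMS_PER_THREAD)` blocks, rounded up), so larger items-per-thread values also reduce the total number of global merges.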
Experiment! Implement all the approaches (or their combinations) and empirically determine which approaches are better and when (i.e., test them on different inputs and different histogram sizes).