NPRG058 - AdvPara

Assignment 2 - CUDA Histogram I

Write a CUDA-accelerated implementation that computes a histogram (frequency analysis) of a text file. The input is plain text in which each character is one byte (we do not bother with encoding details). The histogram holds the number of occurrences of each character. The range of characters covered by the histogram can be adjusted by algorithm parameters; the range 0-127 is computed by default.
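For orientation, the following host-side sketch shows the result a GPU version should reproduce. The function name referenceHistogram and its parameters fromChar and toChar are hypothetical and not part of the provided framework. The sketch assumes that bin 0 corresponds to fromChar and that bytes outside the range are ignored.

```cuda
#include <cstddef>
#include <cstdint>
#include <vector>

// Host-side reference (hypothetical helper, not part of the assignment framework).
// Counts occurrences of byte values in the inclusive range [fromChar, toChar].
// Assumption: bin 0 corresponds to fromChar; bytes outside the range are skipped.
std::vector<std::uint32_t> referenceHistogram(const std::uint8_t* data, std::size_t n,
                                              std::uint8_t fromChar = 0,
                                              std::uint8_t toChar = 127)
{
    std::vector<std::uint32_t> bins(static_cast<std::size_t>(toChar) - fromChar + 1, 0);
    for (std::size_t i = 0; i < n; ++i) {
        const std::uint8_t c = data[i];
        if (c >= fromChar && c <= toChar) {
            ++bins[c - fromChar];
        }
    }
    return bins;
}
```

A reference like this is handy for checking GPU results on small inputs before measuring anything.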

The initial source code is available in /home/_teaching/advpara/ha2-cuda-histogram. CUDA kernels, along with the C functions that invoke them (runners), should be in kernels/kernels.cu. You also need to copy the runner header into headers/kernels.cuh, and do not forget to explicitly instantiate its C++ template with appropriate parameters (it is sufficient if the kernel is executable with uint8 as the input element type and uint32 as the result type).

Each algorithm has to be wrapped in a class that inherits from IHistogramAlgorithm. Place your CUDA algorithm wrappers into headers/cuda.hpp. You will also need to register your algorithm wrapper (create a unique pointer instance) in the getAlgorithm function of histogram.cpp. See how the serial algorithm is registered and add your algorithms analogously.

In the first episode of the Histogram saga, you need to implement two simple approaches:

A trivial lock-free approach in which one thread computes one bin of the final histogram (iterating over the entire input). This way, no explicit synchronization is required; however, a low number of bins will not saturate the GPU.
A straightforward atomic solution in which one thread handles one input character and increments the corresponding histogram bin (in global memory) using atomic instructions.
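The two approaches might look roughly like the sketch below. All kernel and runner names (binPerThreadKernel, atomicKernel, runBinPerThread, runAtomic), the parameter lists, and the block sizes are assumptions made for illustration; the signatures expected by the provided framework may differ. The sketch also assumes that bin 0 corresponds to fromChar, that devData and devBins already reside in device memory, and it omits error checking. The explicit instantiations at the end use uint8 input and uint32 results.

```cuda
#include <cstddef>
#include <cstdint>
#include <cuda_runtime.h>

// Approach 1: one thread per bin. Each thread scans the whole input and
// counts only "its" character, so no two threads write to the same bin.
template<typename T, typename RES>
__global__ void binPerThreadKernel(const T* data, std::size_t n, RES* bins, T fromChar, T toChar)
{
    const unsigned bin = blockIdx.x * blockDim.x + threadIdx.x;
    const unsigned binCount = static_cast<unsigned>(toChar) - fromChar + 1u;
    if (bin >= binCount) return;
    const T myChar = static_cast<T>(fromChar + bin);
    RES count = 0;
    for (std::size_t i = 0; i < n; ++i) {
        if (data[i] == myChar) ++count;
    }
    bins[bin] = count; // this thread is the only writer of the bin, a plain store suffices
}

// Approach 2: one thread per input character, atomic increment in global memory.
template<typename T, typename RES>
__global__ void atomicKernel(const T* data, std::size_t n, RES* bins, T fromChar, T toChar)
{
    const std::size_t i = static_cast<std::size_t>(blockIdx.x) * blockDim.x + threadIdx.x;
    if (i >= n) return;
    const T c = data[i];
    if (c >= fromChar && c <= toChar) {
        atomicAdd(&bins[c - fromChar], static_cast<RES>(1));
    }
}

// Hypothetical runners: plain functions that launch the kernels.
template<typename T, typename RES>
void runBinPerThread(const T* devData, std::size_t n, RES* devBins, T fromChar, T toChar,
                     unsigned blockSize = 128)
{
    const unsigned binCount = static_cast<unsigned>(toChar) - fromChar + 1u;
    const unsigned blocks = (binCount + blockSize - 1) / blockSize;
    binPerThreadKernel<T, RES><<<blocks, blockSize>>>(devData, n, devBins, fromChar, toChar);
}

template<typename T, typename RES>
void runAtomic(const T* devData, std::size_t n, RES* devBins, T fromChar, T toChar,
               unsigned blockSize = 256)
{
    const std::size_t binCount = static_cast<std::size_t>(toChar) - fromChar + 1;
    cudaMemset(devBins, 0, binCount * sizeof(RES)); // atomics accumulate, so start from zero
    if (n == 0) return;                             // a launch with zero blocks is an error
    const unsigned blocks = static_cast<unsigned>((n + blockSize - 1) / blockSize);
    atomicKernel<T, RES><<<blocks, blockSize>>>(devData, n, devBins, fromChar, toChar);
}

// Explicit instantiations, so code that only sees the declarations can link.
template void runBinPerThread<std::uint8_t, std::uint32_t>(
    const std::uint8_t*, std::size_t, std::uint32_t*, std::uint8_t, std::uint8_t, unsigned);
template void runAtomic<std::uint8_t, std::uint32_t>(
    const std::uint8_t*, std::size_t, std::uint32_t*, std::uint8_t, std::uint8_t, unsigned);
```

With the default range, the first kernel launches only 128 useful threads, which illustrates why it cannot saturate the GPU; the second launches one thread per input byte but serializes updates that hit the same bin.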

Check out all the input files and compare the performance of the two aforementioned approaches.
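If the provided framework does not already report kernel times, one possible way to measure them is with CUDA events, as in the sketch below. The helper name timeOnGpuMs is hypothetical; it assumes all work is enqueued on the default stream and omits error checking.

```cuda
#include <cuda_runtime.h>

// Hypothetical helper: returns the time in milliseconds that the work
// enqueued by `launch` takes on the default stream.
template<typename F>
float timeOnGpuMs(F&& launch)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);     // marker placed before the measured work
    launch();                   // e.g. a lambda that calls one of your runners
    cudaEventRecord(stop);      // marker placed after the measured work
    cudaEventSynchronize(stop); // wait until the GPU has passed the second marker

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}
```

The first launch in a process typically includes one-time initialization overhead, so it is reasonable to run each variant several times and discard the first measurement.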

Stretch Goal: Try to use a warp-opportunistic approach to aggregate atomic updates within one warp (see the lectures for how to achieve that).
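One common way to aggregate updates within a warp is sketched below using __match_any_sync; it is not necessarily the variant presented in the lectures. Threads of a warp that hit the same bin elect a leader, and only the leader issues a single atomicAdd with the size of the group. The names warpAggregatedKernel and runWarpAggregated are hypothetical. The sketch assumes a one-dimensional block whose size is a multiple of 32, and __match_any_sync requires compute capability 7.0 or newer (e.g. compiling with -arch=sm_70); on older targets the code falls back to plain per-thread atomics.

```cuda
#include <cstddef>
#include <cstdint>
#include <cuda_runtime.h>

__global__ void warpAggregatedKernel(const std::uint8_t* data, std::size_t n,
                                     std::uint32_t* bins,
                                     std::uint8_t fromChar, std::uint8_t toChar)
{
    const std::size_t i = static_cast<std::size_t>(blockIdx.x) * blockDim.x + threadIdx.x;
    // No early return here: every thread of the warp has to reach the ballot below.
    const std::uint8_t c = (i < n) ? data[i] : 0;
    const bool valid = (i < n) && c >= fromChar && c <= toChar;

#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ < 700
    // Fallback for GPUs without __match_any_sync: one atomic per thread.
    if (valid) atomicAdd(&bins[c - fromChar], 1u);
#else
    const unsigned lane = threadIdx.x % 32u;
    // Bitmask of lanes in this warp that have something to count.
    const unsigned active = __ballot_sync(0xffffffffu, valid);
    if (valid) {
        const unsigned bin = c - fromChar;
        // Lanes (among the active ones) that want to update the same bin.
        const unsigned peers = __match_any_sync(active, bin);
        // The lowest such lane acts as leader and adds the whole group at once.
        const unsigned leader = static_cast<unsigned>(__ffs(static_cast<int>(peers)) - 1);
        if (lane == leader) {
            atomicAdd(&bins[bin], static_cast<std::uint32_t>(__popc(peers)));
        }
    }
#endif
}

// Hypothetical runner; zeroes the bins and launches one thread per input byte.
void runWarpAggregated(const std::uint8_t* devData, std::size_t n, std::uint32_t* devBins,
                       std::uint8_t fromChar, std::uint8_t toChar)
{
    const unsigned blockSize = 256; // multiple of 32, as the kernel assumes
    const std::size_t binCount = static_cast<std::size_t>(toChar) - fromChar + 1;
    cudaMemset(devBins, 0, binCount * sizeof(std::uint32_t));
    if (n == 0) return;
    const unsigned blocks = static_cast<unsigned>((n + blockSize - 1) / blockSize);
    warpAggregatedKernel<<<blocks, blockSize>>>(devData, n, devBins, fromChar, toChar);
}
```

The benefit depends on the input: text in which neighboring characters often repeat produces large groups and few atomics, whereas uniformly random data gains little and still pays for the match operation.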

NPRG058: Advanced Programming in Parallel Environment
