NPRG058: Advanced Programming in Parallel Environment

Assignment 2 - CUDA Histogram III (Official assignment)

Complete the histogram algorithm using what you learnt in the last lecture; namely, improve the data transfers between the host and the GPU. Besides the baseline (a single memcpy in each direction), implement an iterative approach in which data transfers overlap with computation (optionally using pinned memory). As a second variant, choose either the memory-mapping approach or the unified-memory approach (with explicit prefetching).
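The iterative approach can be sketched as follows. This is only a minimal illustration, not the reference solution: the function name histogramOverlapped, the 256-bin layout (one bin per byte value), the two-stream double buffering, and the naive global-atomics kernel are all assumptions made for the example, and error checking is omitted for brevity. The input is cut into chunks; each chunk is staged in a pinned buffer, copied asynchronously, and processed in its own stream, so the copy of one chunk can overlap with the kernel working on the previous one.

```cuda
#include <cuda_runtime.h>
#include <algorithm>
#include <cstddef>
#include <cstring>
#include <vector>

constexpr int BINS = 256;   // assumption: one bin per possible byte value
constexpr int STREAMS = 2;  // assumption: double buffering with two streams

// Naive kernel: grid-stride loop, one global atomic per input byte.
__global__ void histogramKernel(const unsigned char* data, size_t n, unsigned int* bins)
{
    size_t stride = (size_t)gridDim.x * blockDim.x;
    for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        atomicAdd(&bins[data[i]], 1u);
}

// Hypothetical helper: processes the input chunk by chunk in alternating streams.
std::vector<unsigned int> histogramOverlapped(const std::vector<unsigned char>& input,
                                              size_t chunkSize, int blockSize)
{
    std::vector<unsigned int> result(BINS, 0);
    if (input.empty() || chunkSize == 0 || blockSize <= 0) return result;

    unsigned int* dBins = nullptr;
    cudaMalloc((void**)&dBins, BINS * sizeof(unsigned int));
    cudaMemset(dBins, 0, BINS * sizeof(unsigned int));
    cudaDeviceSynchronize();  // bins are zeroed before any stream starts working

    cudaStream_t stream[STREAMS];
    unsigned char* hStage[STREAMS];  // pinned staging buffers (needed for async copies)
    unsigned char* dChunk[STREAMS];  // device buffers, one per stream
    for (int s = 0; s < STREAMS; ++s) {
        cudaStreamCreate(&stream[s]);
        cudaMallocHost((void**)&hStage[s], chunkSize);
        cudaMalloc((void**)&dChunk[s], chunkSize);
    }

    size_t chunk = 0;
    for (size_t offset = 0; offset < input.size(); offset += chunkSize, ++chunk) {
        int s = (int)(chunk % STREAMS);
        size_t n = std::min(chunkSize, input.size() - offset);

        // Wait until the previous chunk in this slot is done before reusing its buffers.
        cudaStreamSynchronize(stream[s]);
        std::memcpy(hStage[s], input.data() + offset, n);

        // Copy and kernel are ordered within stream s, but may overlap with the other stream.
        cudaMemcpyAsync(dChunk[s], hStage[s], n, cudaMemcpyHostToDevice, stream[s]);
        int blocks = (int)std::min<size_t>((n + blockSize - 1) / blockSize, 1024);
        histogramKernel<<<blocks, blockSize, 0, stream[s]>>>(dChunk[s], n, dBins);
    }

    cudaDeviceSynchronize();
    cudaMemcpy(result.data(), dBins, BINS * sizeof(unsigned int), cudaMemcpyDeviceToHost);

    for (int s = 0; s < STREAMS; ++s) {
        cudaStreamDestroy(stream[s]);
        cudaFreeHost(hStage[s]);
        cudaFree(dChunk[s]);
    }
    cudaFree(dBins);
    return result;
}
```

Calling histogramOverlapped(data, 1 << 20, 256) would process the input in 1 MiB chunks. In a real solution the whole input could be loaded directly into pinned memory, which would avoid the extra host-side memcpy into the staging buffers used here.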

Do not be disappointed that the data transfers take much longer than the actual kernel. (In fact, the kernel itself takes much longer to execute when memory mapping or unified memory is used, since the transfers then take place during its execution and are included in its time.) It is also part of the experience to remember that the cost of data transfers is high, so we need to choose carefully which tasks are suitable for the GPU and which are not.
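To illustrate why the kernel time absorbs the transfers, here is a minimal sketch of the memory-mapping variant. The function name histogramMapped, the 256-bin layout, and the naive kernel are assumptions made for the example, and error checking is omitted. There is no explicit copy of the input: the kernel reads the pinned host buffer through a device pointer, so every access travels over the bus while the kernel is running.

```cuda
#include <cuda_runtime.h>
#include <algorithm>
#include <cstddef>
#include <cstring>
#include <vector>

constexpr int BINS = 256;  // assumption: one bin per possible byte value

// Naive kernel: grid-stride loop, one global atomic per input byte.
__global__ void histogramKernel(const unsigned char* data, size_t n, unsigned int* bins)
{
    size_t stride = (size_t)gridDim.x * blockDim.x;
    for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        atomicAdd(&bins[data[i]], 1u);
}

// Hypothetical helper: the input stays in mapped (zero-copy) host memory.
std::vector<unsigned int> histogramMapped(const std::vector<unsigned char>& input, int blockSize)
{
    std::vector<unsigned int> result(BINS, 0);
    if (input.empty() || blockSize <= 0) return result;

    // Pinned host buffer that is also mapped into the device address space.
    unsigned char* hData = nullptr;
    cudaHostAlloc((void**)&hData, input.size(), cudaHostAllocMapped);
    std::memcpy(hData, input.data(), input.size());

    // Device-side alias of the same host memory.
    unsigned char* dData = nullptr;
    cudaHostGetDevicePointer((void**)&dData, hData, 0);

    // The bins stay in ordinary device memory so that the atomics remain cheap.
    unsigned int* dBins = nullptr;
    cudaMalloc((void**)&dBins, BINS * sizeof(unsigned int));
    cudaMemset(dBins, 0, BINS * sizeof(unsigned int));

    int blocks = (int)std::min<size_t>((input.size() + blockSize - 1) / blockSize, 1024);
    histogramKernel<<<blocks, blockSize>>>(dData, input.size(), dBins);
    cudaDeviceSynchronize();

    cudaMemcpy(result.data(), dBins, BINS * sizeof(unsigned int), cudaMemcpyDeviceToHost);
    cudaFree(dBins);
    cudaFreeHost(hData);
    return result;
}
```

Older CUDA setups may additionally require cudaSetDeviceFlags(cudaDeviceMapHost) before the first CUDA call; on current systems with unified addressing this is usually not needed.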

A stretch goal is to utilize multiple GPUs in your solution. However, the final solution should use only 1 GPU (for instance, you may add a CLI argument --gpus that defaults to 1 when not present).

Complete Solution

The finalized solution should be submitted to the Git repositories as the official assignment. Your solution needs to implement the following algorithms (please use the designated names in your CLI arguments):

All algorithms should also respect the --blockSize argument, which sets the number of threads in each CUDA block.
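As a rough sketch of how --blockSize (and the optional --gpus argument) might reach the kernel launch, consider the baseline variant below. The helpers intArg, gridSizeFor, and histogramBaseline are hypothetical names, the default of 256 threads is only a placeholder, and the course framework may already provide its own argument parsing. Error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <cstddef>
#include <cstdlib>
#include <cstring>
#include <vector>

constexpr int BINS = 256;  // assumption: one bin per possible byte value

// Hypothetical helper: returns the positive integer following `name`, or `fallback`.
int intArg(int argc, const char* const* argv, const char* name, int fallback)
{
    for (int i = 1; i + 1 < argc; ++i) {
        if (std::strcmp(argv[i], name) == 0) {
            int value = std::atoi(argv[i + 1]);
            return value > 0 ? value : fallback;
        }
    }
    return fallback;
}

// Number of blocks needed so that every input byte gets its own thread.
unsigned int gridSizeFor(size_t n, int blockSize)
{
    return (unsigned int)((n + blockSize - 1) / blockSize);
}

// One thread per input byte, one global atomic per thread.
__global__ void histogramKernel(const unsigned char* data, size_t n, unsigned int* bins)
{
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) atomicAdd(&bins[data[i]], 1u);
}

// Baseline: a single memcpy in each direction, launched with the requested block size.
std::vector<unsigned int> histogramBaseline(const std::vector<unsigned char>& input, int blockSize)
{
    std::vector<unsigned int> result(BINS, 0);
    if (input.empty() || blockSize <= 0) return result;

    unsigned char* dData = nullptr;
    unsigned int* dBins = nullptr;
    cudaMalloc((void**)&dData, input.size());
    cudaMalloc((void**)&dBins, BINS * sizeof(unsigned int));
    cudaMemcpy(dData, input.data(), input.size(), cudaMemcpyHostToDevice);
    cudaMemset(dBins, 0, BINS * sizeof(unsigned int));

    histogramKernel<<<gridSizeFor(input.size(), blockSize), blockSize>>>(dData, input.size(), dBins);

    // This blocking copy also waits for the kernel to finish.
    cudaMemcpy(result.data(), dBins, BINS * sizeof(unsigned int), cudaMemcpyDeviceToHost);
    cudaFree(dData);
    cudaFree(dBins);
    return result;
}
```

In main, something like intArg(argc, argv, "--blockSize", 256) and intArg(argc, argv, "--gpus", 1) would then supply the values. A block size above the device limit (typically 1024 threads) makes the launch fail, so a real solution should validate it.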

Testing

To make our results directly comparable, the main measurements should be performed using the lorem_small.txt input file with the --repeatInput 32k option, which makes the input 2 GiB large. Keep the size of the histogram at its default.

Experiment with the other parameters mentioned above (block sizes, number of private copies, ...). Update your CLI arguments so that the best values are set as the defaults.

Measure all algorithms on volta01 and create a table (in the readme file) with the measured times for all algorithms mentioned above (show the kernel time and the preparation time separately).
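One possible way to obtain the two numbers separately is sketched below for the baseline variant. It assumes that "preparation time" means device allocation plus the host-to-device copy, which may differ from how the course framework measures it; the names Timings and measureBaseline are hypothetical, and error checking is omitted. The preparation is timed on the host with std::chrono, the kernel with CUDA events.

```cuda
#include <cuda_runtime.h>
#include <chrono>
#include <cstddef>
#include <vector>

constexpr int BINS = 256;  // assumption: one bin per possible byte value

struct Timings {
    double prepMs;   // allocation + host-to-device copy (assumed meaning of "preparation")
    float kernelMs;  // kernel execution only
};

// One thread per input byte, one global atomic per thread.
__global__ void histogramKernel(const unsigned char* data, size_t n, unsigned int* bins)
{
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) atomicAdd(&bins[data[i]], 1u);
}

Timings measureBaseline(const std::vector<unsigned char>& input, int blockSize)
{
    Timings t{0.0, 0.0f};
    if (input.empty() || blockSize <= 0) return t;

    // Preparation: measured with a host clock, closed by a device synchronization.
    auto prepStart = std::chrono::steady_clock::now();
    unsigned char* dData = nullptr;
    unsigned int* dBins = nullptr;
    cudaMalloc((void**)&dData, input.size());
    cudaMalloc((void**)&dBins, BINS * sizeof(unsigned int));
    cudaMemcpy(dData, input.data(), input.size(), cudaMemcpyHostToDevice);
    cudaMemset(dBins, 0, BINS * sizeof(unsigned int));
    cudaDeviceSynchronize();
    auto prepEnd = std::chrono::steady_clock::now();
    t.prepMs = std::chrono::duration<double, std::milli>(prepEnd - prepStart).count();

    // Kernel: measured with events recorded in the same stream as the launch.
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    unsigned int blocks = (unsigned int)((input.size() + blockSize - 1) / blockSize);
    cudaEventRecord(start);
    histogramKernel<<<blocks, blockSize>>>(dData, input.size(), dBins);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&t.kernelMs, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(dData);
    cudaFree(dBins);
    return t;
}
```

The first CUDA call in a process also pays for context initialization, so a warm-up call (for example cudaFree(0)) before measureBaseline keeps that cost out of the preparation time. For the mapped and unified-memory variants the same event pair around the kernel will include the transfers.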