Complete the histogram algorithm using the techniques you learned in the last lecture; namely, improve the data transfers between host and GPU. Besides the baseline (a single memcpy in each direction), implement an iterative approach where data transfers overlap with computations (optionally using pinned memory). As a second attempt, choose either the memory-mapping approach or the unified-memory approach (with explicit prefetching).
Do not be disappointed that the data transfers take much longer than the actual kernel (in fact, the kernel takes much longer to execute when mapped or unified memory is used, since the transfer cost is amortized into it). It is also part of the experience to remember that the data-transfer cost is high, so we need to choose carefully which tasks are suitable for the GPU and which are not.
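One possible shape for the overlapped variant is a double-buffered loop: two streams and two staging buffers, so the copy of chunk i+1 runs while the kernel for chunk i executes. The following is only a sketch under assumptions (a `histogramKernel` defined elsewhere, `dBins` already allocated and zeroed); error checking is omitted and the names are illustrative, not the assignment's exact interface.

```cuda
#include <cuda_runtime.h>
#include <cstring>

void runOverlapped(const char *hostInput, size_t total, size_t chunkSize,
                   unsigned int *dBins, int blockSize)
{
    // Two streams + two pinned staging buffers: while one chunk is being
    // processed, the next one can already be staged and copied.
    cudaStream_t streams[2];
    char *staging[2], *dChunks[2];
    for (int i = 0; i < 2; ++i) {
        cudaStreamCreate(&streams[i]);
        cudaHostAlloc(&staging[i], chunkSize, cudaHostAllocDefault); // pinned
        cudaMalloc(&dChunks[i], chunkSize);
    }

    for (size_t offset = 0, c = 0; offset < total; offset += chunkSize, ++c) {
        int s = c % 2;
        size_t len = (total - offset < chunkSize) ? total - offset : chunkSize;
        cudaStreamSynchronize(streams[s]);           // staging buffer reusable?
        memcpy(staging[s], hostInput + offset, len); // stage into pinned memory
        cudaMemcpyAsync(dChunks[s], staging[s], len,
                        cudaMemcpyHostToDevice, streams[s]);
        size_t blocks = (len + blockSize - 1) / blockSize;
        histogramKernel<<<blocks, blockSize, 0, streams[s]>>>(dChunks[s], len,
                                                              dBins);
    }
    cudaDeviceSynchronize();

    for (int i = 0; i < 2; ++i) {
        cudaStreamDestroy(streams[i]);
        cudaFreeHost(staging[i]);
        cudaFree(dChunks[i]);
    }
}
```

If the whole input buffer is itself allocated as pinned memory, the extra `memcpy` staging step can be dropped and `cudaMemcpyAsync` can read straight from `hostInput`; asynchronous host-to-device copies only overlap with kernels when the source is page-locked.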
A stretch goal is to utilize multiple GPUs in your solution. However, the final solution should use only 1 GPU (you may add a CLI argument --gpus that defaults to 1 if not present, for instance).
The finalized solution should be submitted to your Git repository as an official assignment. Your solution needs to implement the following algorithms (please use the designated names in your CLI arguments):

serial
- CPU sequential baseline (already implemented in the given source codes)

naive
- solution without any synchronization, where one thread computes one histogram bin (should have been implemented in Episode I)

atomic
- simple atomic updates, no privatization, no shared memory (should have been implemented in Episode I)

atomic_shm
- atomic updates with a private copy in shared memory (no privatization in global memory; --itemsPerThread indicates how many input chars are processed by one thread, and --privCopies indicates how many histogram copies are placed in shared memory)

overlap
- same as atomic_shm, but the input data are transferred in chunks that overlap with computations (add CLI arguments: --chunkSize = number of chars in one chunk, and --pinned = bool flag that indicates whether the chunk buffers have to be allocated as portable (pinned) memory)

final
- same as atomic_shm, but using either memory mapping or unified memory (your choice should be mentioned in the readme file in your repository)

All algorithms should also reflect --blockSize, which sets the number of threads in each CUDA block.
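The atomic_shm scheme described above could be sketched roughly as follows, assuming the default histogram is the 256 bins of a char; the kernel and parameter names are hypothetical, not the assignment's actual interface.

```cuda
// Each block keeps `privCopies` private histograms in shared memory; each
// thread consumes `itemsPerThread` input chars and updates one of the copies,
// which spreads atomic traffic and reduces intra-block contention.
__global__ void histAtomicShm(const char *input, size_t n,
                              unsigned int *globalBins,
                              int itemsPerThread, int privCopies)
{
    extern __shared__ unsigned int shmBins[];   // privCopies * 256 counters

    // Zero the shared copies cooperatively.
    for (int i = threadIdx.x; i < privCopies * 256; i += blockDim.x)
        shmBins[i] = 0;
    __syncthreads();

    // Assign this thread to one of the private copies.
    unsigned int *myBins = shmBins + (threadIdx.x % privCopies) * 256;

    size_t start = ((size_t)blockIdx.x * blockDim.x + threadIdx.x)
                   * itemsPerThread;
    for (int i = 0; i < itemsPerThread; ++i) {
        size_t idx = start + i;
        if (idx < n)
            atomicAdd(&myBins[(unsigned char)input[idx]], 1u);
    }
    __syncthreads();

    // Merge the private copies and flush each bin to global memory with a
    // single atomic per block.
    for (int b = threadIdx.x; b < 256; b += blockDim.x) {
        unsigned int sum = 0;
        for (int c = 0; c < privCopies; ++c)
            sum += shmBins[c * 256 + b];
        atomicAdd(&globalBins[b], sum);
    }
}
```

The launch must then request `privCopies * 256 * sizeof(unsigned int)` bytes of dynamic shared memory as the third launch parameter, which also caps how large --privCopies can usefully get on a given GPU.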
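For the final variant, if you pick unified memory, the host setup might look like the sketch below (illustrative only; `histogramKernel` and the surrounding structure are assumptions, and error checking is omitted). The point of the explicit prefetch is that a 2 GiB input touched for the first time inside the kernel would otherwise migrate via page faults, which is exactly the amortized transfer cost mentioned earlier.

```cuda
#include <cuda_runtime.h>

void runUnified(size_t n, int blockSize)
{
    int dev = 0;
    cudaGetDevice(&dev);

    char *input;
    unsigned int *bins;
    cudaMallocManaged(&input, n);
    cudaMallocManaged(&bins, 256 * sizeof(unsigned int));

    // ... fill `input` and zero `bins` on the host here ...

    // Explicitly prefetch so the first kernel access does not pull the
    // whole input over one page fault at a time.
    cudaMemPrefetchAsync(input, n, dev);
    cudaMemPrefetchAsync(bins, 256 * sizeof(unsigned int), dev);

    size_t blocks = (n + blockSize - 1) / blockSize;
    histogramKernel<<<blocks, blockSize>>>(input, n, bins);

    // Prefetch the result back before reading it on the CPU.
    cudaMemPrefetchAsync(bins, 256 * sizeof(unsigned int), cudaCpuDeviceId);
    cudaDeviceSynchronize();

    cudaFree(input);
    cudaFree(bins);
}
```

The memory-mapping alternative would instead allocate the input with `cudaHostAlloc(..., cudaHostAllocMapped)` and pass the device pointer obtained via `cudaHostGetDevicePointer` to the kernel; either choice should be stated in the readme as required above.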
To make our results directly comparable, the main measurements should be performed using the lorem_small.txt input file with the --repeatInput 32k option, which makes the input 2 GiB large. Keep the histogram size at its defaults.
Experiment with the other parameters mentioned above (block sizes, number of private copies, ...). Update your CLI arguments so the best values are set as defaults.
Measure all algorithms on volta01 and create a table (in the readme file) with the measured times for all the algorithms mentioned above (show kernel time and preparation time separately).