NPRG058: Advanced Programming in Parallel Environment

Lab practice 03 - CUDA blur stencil

The objective is to implement a blur stencil using CUDA. The stencil computes a blurred image from a regular image by simple averaging. Each new pixel is a weighted average of all pixels within its radius r, i.e., the window from [x-r, y-r] to [x+r, y+r], which covers (2r+1)^2 pixels. The weight of a pixel is 1/d, where d is its Manhattan distance from the center (x,y). The center pixel uses an explicit weight of 5 (to avoid division by zero).

Use the starter pack at /home/_teaching/advpara/labs/03-cuda-blur. Copy it to your home directory and modify the cuda_blur function (you will also need to create a new kernel for the computation). Read the serial_blur function (the reference implementation) first. Use the attached Makefile to build the code and the run.sh script for baseline measurements. Sample images are included in the starter pack (in the data subdirectory).

Note the CUCH macro and how it is used to wrap CUDA calls (for better error handling).

Evaluation: You may have noticed that the CUDA implementation running on the GPU is somewhat slower than the implementation running on the CPU. Find out why. Print out the times of the individual steps/CUDA calls (do not forget that a kernel launch is asynchronous).
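One possible way to time the individual steps is sketched below. It is a standalone illustration rather than part of the starter pack: CHECK_CUDA is a hypothetical stand-in for the CUCH macro, placeholder_kernel only copies data and stands in for the blur kernel, and the buffer size is arbitrary. The kernel launch is timed twice, once for the launch call alone and once for the wait on cudaDeviceSynchronize, because the launch returns before the kernel has finished. The initial cudaFree(nullptr) is a commonly used idiom that is assumed here to trigger CUDA initialization, so that this cost shows up as its own step.

```cuda
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical stand-in for the starter pack's CUCH macro.
#define CHECK_CUDA(call)                                                \
    do {                                                                \
        cudaError_t err_ = (call);                                      \
        if (err_ != cudaSuccess) {                                      \
            std::fprintf(stderr, "CUDA error at %s:%d: %s\n",           \
                         __FILE__, __LINE__, cudaGetErrorString(err_)); \
            std::exit(EXIT_FAILURE);                                    \
        }                                                               \
    } while (0)

// Placeholder kernel (just copies data); the blur kernel would go here.
__global__ void placeholder_kernel(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

using Clock = std::chrono::steady_clock;

// Prints the host time elapsed since 'start' and returns a new start point.
static Clock::time_point report(const char* step, Clock::time_point start)
{
    double ms = std::chrono::duration<double, std::milli>(Clock::now() - start).count();
    std::printf("%-22s %10.3f ms\n", step, ms);
    return Clock::now();
}

int main()
{
    const int n = 1 << 22; // arbitrary buffer size for the illustration
    const size_t bytes = n * sizeof(float);
    std::vector<float> host_in(n, 1.0f), host_out(n);
    float* dev_in = nullptr;
    float* dev_out = nullptr;

    auto t = Clock::now();
    CHECK_CUDA(cudaFree(nullptr)); // assumed to trigger CUDA initialization
    t = report("initialization", t);

    CHECK_CUDA(cudaMalloc(&dev_in, bytes));
    CHECK_CUDA(cudaMalloc(&dev_out, bytes));
    t = report("cudaMalloc", t);

    CHECK_CUDA(cudaMemcpy(dev_in, host_in.data(), bytes, cudaMemcpyHostToDevice));
    t = report("copy host to device", t);

    placeholder_kernel<<<(n + 255) / 256, 256>>>(dev_in, dev_out, n);
    CHECK_CUDA(cudaGetLastError()); // reports launch errors only
    t = report("kernel launch (async)", t);

    CHECK_CUDA(cudaDeviceSynchronize()); // waits until the kernel has finished
    t = report("kernel execution", t);

    CHECK_CUDA(cudaMemcpy(host_out.data(), dev_out, bytes, cudaMemcpyDeviceToHost));
    t = report("copy device to host", t);

    CHECK_CUDA(cudaFree(dev_in));
    CHECK_CUDA(cudaFree(dev_out));
    report("cudaFree", t);
    return 0;
}
```

Without the cudaDeviceSynchronize call, the kernel's running time would be attributed to the following cudaMemcpy, which waits for the kernel to finish before copying.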

Stretch goal: One of the issues with this stencil is that it is quite data-bound. Consider how to improve data caching and reuse (try to recall what you learned in the basic parallel programming course).