NPRG042 Programming in Parallel Environment

Assignment 5


Duplicate citizens
Apache Spark
Assigned: 5.5.2025
Deadline: 18.5.2025 23:59 (CEST)
ReCodEx: assignment
Results: Homogeneous cluster (max 8 nodes)
speedup       points
2× or less    0
2× to 3.5×    1
3.5× to 5×    2
5× to 7×      3
7× or more    4
Speedup baseline run: 180 s

Finally, we do have support for Spark in ReCodEx! Please note that the submission process has changed. Your solutions should be submitted to ReCodEx the same way as the other assignments.


The task is to count, for each region, the number of people in a given population list who share identical first and last names with someone else living in the same region. The region is identified by the most significant (first) digit of the ZIP code.

Use the Spark instance in /home/_teaching/para/spark (DO NOT COPY this directory). It is a self-contained Spark with Hadoop, so you do not need to install anything else. It can be started using the spark-slurm.sh sbatch script (to run the cluster under SLURM) or by a local invocation (on one node) like:

$> srun -p mpi-homo-short --mem=50G -c 4 /home/_teaching/para/spark/bin/spark-submit --master local[*] \
   ./your-solution.py /home/_teaching/para/05-spark/data/debug/small.csv

(adjust the memory and number of cores for your task accordingly).

The data and the initial scripts are available in /home/_teaching/para/05-spark. There are multiple datasets, several GiB in size, so please do not copy them. The measurements will use the large.csv dataset only; the other datasets in the debug subdirectory are for debugging and additional experimentation. The path to the input data file is given as the only command-line argument to your solution.

The input file has 4 columns: the first two are the first name and the last name, respectively, and the last column is the area (ZIP) code (the third column is not needed for this task).
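
For illustration, loading such a file in PySpark might look like the sketch below (the column names are illustrative labels of my own, not prescribed by the assignment):

import sys
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("duplicate-citizens").getOrCreate()

# The input path is passed as the only command-line argument.
input_path = sys.argv[1]

# The CSV has no header; rename the default columns _c0.._c3 to readable labels.
df = spark.read.csv(input_path).toDF("first_name", "last_name", "person_id", "zip_code")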

Output

The result must be stored in the ./output.csv file in the current working directory. As the name suggests, the file is in CSV format without a header, where the columns are separated by a comma (,) and lines are separated by a simple newline (LF). The first column is the region number (i.e., the most significant digit of the postcode). The second column is the number of collisions (duplicates found) in that region. The file must be sorted by the region number in ascending order.
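
Note that DataFrame.write.csv produces a directory of part files rather than a single file. One possible approach (a sketch, not a required one) is therefore to collect the aggregated result, which is tiny, on the driver and write ./output.csv with the standard csv module:

import csv

# 'result' is assumed to be a small DataFrame with columns (region, duplicates),
# already sorted by region in ascending order.
with open("output.csv", "w", newline="") as f:
    writer = csv.writer(f)
    for row in result.collect():
        writer.writerow([row["region"], row["duplicates"]])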

Example

Let us have the following input file:

Jan,Novák,269235752,12345
Jakub,Yaghob,996583646,12345
Jan,Novák,786940673,28167
Vladan,Majerech,302273606,29876
Jan,Novák,217963156,13456
Jakub,Yaghob,778990574,12345
Jan,Novák,804096211,28002
Jan,Novák,252349303,15678

Upon closer inspection, there are 2 areas (ZIP groups) -- starting with 1 and 2. In the first area, there are 3 duplicates of Jan Novák and 2 duplicates of Jakub Yaghob (i.e., 5 people in total). In the second area, there are only two people named Jan Novák. So the output file should look like this (first column is the area number, the second is the number of duplicates):

1,5
2,2
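
A minimal PySpark sketch of this logic (illustrative only, assuming the columns are named as in the loading sketch above and that ZIP codes have no leading zeros, so their first character is the region digit):

from pyspark.sql import functions as F

# Region = most significant digit of the ZIP code.
with_region = df.withColumn("region", F.substring("zip_code", 1, 1).cast("int"))

# Count people sharing the same full name within a region.
name_counts = with_region.groupBy("region", "first_name", "last_name").count()

# Keep only duplicated names (2+ occurrences) and sum the people per region.
result = (name_counts.filter(F.col("count") >= 2)
          .groupBy("region")
          .agg(F.sum("count").alias("duplicates"))
          .orderBy("region"))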

Submission (CHANGED!)

Submit your solution to ReCodEx. You are expected to submit a single Python script (with the .py extension). For practical reasons, ReCodEx uses only the tiny.csv input dataset to perform a smoke test of your solution. Furthermore, the ReCodEx environment uses only spark-submit --master local[1] to evaluate the solutions, which works slightly differently than the spark-slurm.sh script; most notably, it executes only a single worker.
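
Because of this difference, it is probably safest not to hardcode a Spark master inside the script and let spark-submit supply it, e.g.:

from pyspark.sql import SparkSession

# No explicit .master(...) call: the master comes from spark-submit,
# so the same script runs under local[1] in ReCodEx and on the SLURM cluster.
spark = SparkSession.builder.appName("duplicate-citizens").getOrCreate()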

You are expected to perform your own thorough testing using the large.csv dataset on the entire parlab cluster.

Final evaluation

Your solution will be tested using the spark-slurm.sh script with identical #SBATCH parameters (i.e., using 8 nodes and 16 workers). We use a slightly modified version of the script that saves the measured time to a file instead of printing it out, but that should not affect the measurements. You can easily test your solution as follows:

$> sbatch /home/_teaching/para/05-spark/spark-slurm.sh ./your-solution.py /home/_teaching/para/05-spark/data/large.csv

(use the datasets in the debug subdirectory for debugging and additional testing).

The baseline is a reference solution executed on a single node with a single worker. Since the measured times have a larger variance in this assignment, we performed several evaluations and set the baseline time to 180 s (a rounded average of multiple runs after discarding outliers).