Duplicate citizens

Apache Spark

Assigned: 11.4.2022
Deadline: 30.4.2022 23:59 (CEST)
Supervisor: Jakub Yaghob
ReCodEx: Not available
Results: Homogeneous cluster (max 8 nodes)
| speedup | points |
|---|---|
| 2× or less | 0 |
| 2× to 4× | 1 |
| 4× to 6× | 2 |
| 6× to 8× | 3 |
| 8× or more | 4 |
The task is to count the number of people in a given population list who have the same first and last name and live in the same region. The region is identified by the highest-order digit of the postcode.
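As an illustration of the region rule, here is a minimal sketch in Python, assuming Czech-style five-digit postcodes whose leading (highest-order) digit is the region number; verify this assumption against seznam.csv:

```python
def region_of(postcode: str) -> int:
    """Region number, assumed to be the highest-order digit of the
    postcode, e.g. '46001' -> region 4. (Assumption, not spec.)"""
    return int(postcode.strip()[0])
```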
Use containerized Spark on our SLURM parlab cluster. The external path to the Spark container image is /home/_teaching/para/04-spark/spark. The pre-generated list seznam.csv can be found in the external path /home/_teaching/para/04-spark. The internal path to the list is /opt/data/seznam.csv.
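A minimal sketch of loading the list from inside the container with PySpark; the column names below (name, surname, postcode) are an assumption, so check the actual layout of seznam.csv first:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("duplicate-citizens").getOrCreate()

# Assumed column layout -- inspect /opt/data/seznam.csv for the real one.
df = (spark.read.csv("/opt/data/seznam.csv", header=False)
      .toDF("name", "surname", "postcode"))
```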
We will not measure speedup, only the correctness of the output, because Spark cluster startup takes too long and its duration varies widely.
You may find a spark-slurm.sh startup shell script in the assignment directory; you can customize it to suit your solution. The script will be run using the sbatch SLURM command, therefore you can use the corresponding #SBATCH directives in the script.
In addition to your solution, place the above-mentioned shell script in your ${HOME}/submit_box directory. It must be named spark-slurm.sh, as it will be used in automatic testing.
The script will be called with three parameters:

- the external path to the Spark container image, i.e. /home/_teaching/para/04-spark/spark;
- the network interface, eno1 for the mpi-homo-short SLURM partition;
- the R/W directory mounted as /mnt/1 inside the container.

Your solution can have any name you want; you just need to modify the script accordingly.
The solution will be placed in the above-mentioned R/W directory, therefore it will have an internal path such as /mnt/1/mysolution.py.
The output file must be named output.csv. It is in CSV format: one record per line, no header, LF line endings only. The first column is the region number, i.e. just the highest-order digit of the postcode; the rows are sorted by this column in ascending order. The second column is the number of collisions in that region. Write the output file to the above-mentioned R/W directory inside the container, i.e. use the internal name /mnt/1/output.csv.
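Note that Spark's DataFrame writer emits a directory of part files rather than a single file, so one simple way to produce a single output.csv with LF-only line endings is to collect the small per-region result on the driver and write it with plain Python. A sketch, assuming the result is a DataFrame named result with columns region and count:

```python
# 'result' is assumed to hold one row per region: (region, count).
rows = result.orderBy("region").collect()  # per-region result is tiny

with open("/mnt/1/output.csv", "w", newline="\n") as f:
    for r in rows:
        f.write(f"{r['region']},{r['count']}\n")
```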
The automatic test proceeds as follows:

- Everything in the submit_box is copied to a target test directory.
- seznam.csv is added to the target test directory (only as a symlink from the internal path /mnt/1/seznam.csv to the internal path /opt/data/seznam.csv).
- sbatch is executed with your spark-slurm.sh from the target test directory with the three above-mentioned parameters.
- The script is executed on the mpi-homo-short partition.
- The script has to run your application, i.e. the application is hardcoded in the script (currently the script accepts the 4th parameter).
- The application writes the output to the file /mnt/1/output.csv, where it is picked up by the test script in the target test directory and compared with the correct result.
In the first step, find the number of collisions for all pairs { name, surname } in a region, i.e. how many citizens with the same pair are in the same region.
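A minimal PySpark sketch of this first step, assuming a DataFrame df with columns name, surname, and region, and treating any { name, surname } group with more than one member within a region as colliding citizens (whether singleton groups count is an assumption to check):

```python
from pyspark.sql import functions as F

# Count citizens per (region, name, surname) and keep only the
# groups that actually collide, i.e. have more than one member.
collisions = (
    df.groupBy("region", "name", "surname")
      .agg(F.count("*").alias("cnt"))
      .filter(F.col("cnt") > 1)
)
```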
In the second step, compute the resulting number for the region. You can use two methods: