Duplicate citizens

Apache Spark

Assigned: 11.4.2022
Deadline: 30.4.2022 23:59 (CEST)
Supervisor: Jakub Yaghob
ReCodEx: Not available
Results: Homogeneous cluster (max 8 nodes)
| speedup | points |
|---|---|
| 2× or less | 0 |
| 2× to 4× | 1 |
| 4× to 6× | 2 |
| 6× to 8× | 3 |
| 8× or more | 4 |
The task is to count the number of people in a given population list who have the same first and last name and live in the same region. The region is identified by the highest-order digit of the postcode.
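As an illustration of the region rule, here is a minimal sketch in Python, assuming Czech-style five-digit postcodes whose leading (highest-order) digit is the region number; verify this assumption against seznam.csv:

```python
def region_of(postcode: str) -> int:
    """Region number, assumed to be the highest-order digit of the
    postcode, e.g. '46001' -> region 4. (Assumption, not spec.)"""
    return int(postcode.strip()[0])
```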
Use containerized Spark on our SLURM parlab cluster. The external path to the Spark container image is /home/_teaching/para/04-spark/spark. The pre-generated list seznam.csv can be found in the external path /home/_teaching/para/04-spark. The internal path to the list is /opt/data/seznam.csv.
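A minimal sketch of loading the list from inside the container with PySpark; the column names below (name, surname, postcode) are an assumption, so check the actual layout of seznam.csv first:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("duplicate-citizens").getOrCreate()

# Assumed column layout -- inspect /opt/data/seznam.csv for the real one.
df = (spark.read.csv("/opt/data/seznam.csv", header=False)
      .toDF("name", "surname", "postcode"))
```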
We will not measure speedup, only the correctness of the output, because Spark cluster startup takes too long and its duration varies widely.
You may find a spark-slurm.sh startup shell script in the assignment directory; you can customize it to suit your solution. The script will be run using the sbatch SLURM command, therefore you can use the corresponding #SBATCH directives in the script.
In addition to your solution, place the above-mentioned shell script in your ${HOME}/submit_box directory. It must be named spark-slurm.sh, as it will be used in automatic testing.
The script will be called with three parameters:

- the external path to the Spark container image, i.e. /home/_teaching/para/04-spark/spark;
- the network interface, eno1 for the mpi-homo-short SLURM partition;
- the R/W directory mounted as /mnt/1 inside the container.

Your solution can have any name you want; you just need to modify the script accordingly.
The solution will be placed in the above-mentioned R/W directory, therefore it will have an internal path such as /mnt/1/mysolution.py.
The output file must be named output.csv. It is in CSV format: one record per line, no header, LF line endings only. The first column is the region number, i.e. just the highest-order digit of the postcode; the rows are sorted by this column in ascending order. The second column is the number of collisions in that region. Write the output file to the above-mentioned R/W directory inside the container, i.e. use the internal name /mnt/1/output.csv.
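Note that Spark's DataFrame writer emits a directory of part files rather than a single file, so one simple way to produce a single output.csv with LF-only line endings is to collect the small per-region result on the driver and write it with plain Python. A sketch, assuming the result is a DataFrame named result with columns region and count:

```python
# 'result' is assumed to hold one row per region: (region, count).
rows = result.orderBy("region").collect()  # per-region result is tiny

with open("/mnt/1/output.csv", "w", newline="\n") as f:
    for r in rows:
        f.write(f"{r['region']},{r['count']}\n")
```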
The automatic test proceeds as follows:

- Everything in the submit_box is copied to a target test directory.
- seznam.csv is added to the target test directory (only as a symlink from the internal path /mnt/1/seznam.csv to the internal path /opt/data/seznam.csv).
- sbatch is executed with your spark-slurm.sh from the target test directory with the three above-mentioned parameters.
- The script is executed on the mpi-homo-short partition.
- The script has to run your application, i.e. the application is hardcoded in the script (currently the script accepts the 4th parameter).
- The application writes the output to the file /mnt/1/output.csv, where it is picked up by the test script in the target test directory and compared with the correct result.
In the first step, find the number of collisions for all pairs { name, surname } in a region, i.e. how many citizens with the same pair are in the same region.
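A minimal PySpark sketch of this first step, assuming a DataFrame df with columns name, surname, and region, and treating any { name, surname } group with more than one member within a region as colliding citizens (whether singleton groups count is an assumption to check):

```python
from pyspark.sql import functions as F

# Count citizens per (region, name, surname) and keep only the
# groups that actually collide, i.e. have more than one member.
collisions = (
    df.groupBy("region", "name", "surname")
      .agg(F.count("*").alias("cnt"))
      .filter(F.col("cnt") > 1)
)
```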
In the second step, compute the resulting number for the region. You can use two methods: