NPRG042 Programming in Parallel Environment

Labs 05 - Spark

Getting started

Let's start with the example from the slides -- counting the number of words in a text file. The initial scripts and testing data are in /home/_teaching/para/labs/spark. Please do not copy the data; access it in place and copy only the scripts. Furthermore, test your solution on the smallest file first, before running it on the whole wiki dataset.

Implement your solution in the wiki.py file. You may run it either locally (on a single node, via srun in interactive mode) or with the sbatch script, which starts a Spark cluster under SLURM. Spark is self-contained (bundled with Hadoop) in /home/_teaching/para/spark/
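For reference, a minimal word-count sketch in PySpark might look like the following (the input path is a placeholder for one of the provided files, and the slide example may of course differ in details):

from pyspark import SparkContext

sc = SparkContext(appName="wiki-wordcount")

# Placeholder path -- substitute one of the provided data files.
lines = sc.textFile("/home/_teaching/para/labs/spark/<input-file>")
words = lines.flatMap(lambda line: line.split())

# The total number of words...
print(words.count())

# ...or the classic variant: the number of occurrences of each distinct word.
counts = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)
print(counts.take(10))

sc.stop()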

Running the script locally:

$> srun -p mpi-homo-short --mem=50G -c 4 /home/_teaching/para/spark/bin/spark-submit --master local[*] ./wiki.py

The --master local[*] option instructs Spark to use all available cores, so you may experiment with the -c parameter of srun to allocate different numbers of cores. Furthermore, if your script prints anything out, it may be better to redirect stderr into a file, since Spark writes its log messages there and they would otherwise mix with your output.
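For example, assuming bash-style redirection:

$> srun -p mpi-homo-short --mem=50G -c 4 /home/_teaching/para/spark/bin/spark-submit --master local[*] ./wiki.py 2>spark.log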

For your convenience, there is a run.sh script which executes srun and Spark with appropriate parameters and measures the wall time.

To execute the script on multiple nodes, use the spark-slurm.sh script:

$> sbatch ./spark-slurm.sh ./wiki.py

By default, the script uses two nodes (-N 2). To measure speedup, try increasing this to 8, but beware that you will occupy the whole cluster for quite some time (so keep the node count at 2, at least while debugging). Currently, the script launches 2 tasks/workers per node (i.e., 1 worker per socket); empirically, this seems to be the best configuration for most jobs.
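Note that sbatch command-line options take precedence over the #SBATCH directives inside a script, so you can presumably change the node count without editing spark-slurm.sh (assuming the script derives its worker layout from the SLURM environment):

$> sbatch -N 8 ./spark-slurm.sh ./wiki.py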

More practice with RDDs

  1. Count only words that have at most 3 characters (the RDD operations useful for these exercises are sketched after the list).

  2. Find out how many words there are for each possible starting letter (character).

  3. Print out the 10 most frequent words with at least 3 characters. Beware that this may take a serious amount of memory (when executed on the larger datasets), so it is better to use sbatch for this one.

  4. Get a list of all words that appear in both the Czech and the English wiki and print the top 10 of them. Use a suitable ranking metric that combines the occurrence counts from both languages.
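The exercises above can be composed from a handful of RDD operations. The following sketch shows the building blocks on toy in-memory data (the variable names, data, and the combined metric in the last step are illustrative assumptions, not the expected solution):

from pyspark import SparkContext

sc = SparkContext(appName="rdd-practice")

# Toy (word, count) pairs standing in for the output of a word count.
en = sc.parallelize([("the", 120), ("cat", 15), ("dog", 12), ("a", 90)])
cs = sc.parallelize([("pes", 20), ("cat", 3), ("dog", 5)])

# Ex. 1: keep only short words (filter).
short = en.filter(lambda wc: len(wc[0]) <= 3)

# Ex. 2: re-key by the first character and sum the counts (map + reduceByKey).
by_letter = en.map(lambda wc: (wc[0][0], wc[1])).reduceByKey(lambda a, b: a + b)

# Ex. 3: take the 10 most frequent words without a full sort (takeOrdered).
top10 = en.takeOrdered(10, key=lambda wc: -wc[1])

# Ex. 4: words present in both datasets (join), ranked by a combined metric --
# here the minimum of the two counts, which is just one possible choice.
common = en.join(cs).map(lambda kv: (kv[0], min(kv[1][0], kv[1][1])))
print(common.takeOrdered(10, key=lambda wc: -wc[1]))

sc.stop()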

After implementing the solutions on your own, you may consult the result notes.