Task 1: Dataset overview

Goal: prepare a report (.html) that briefly describes the Titanic dataset.

  • Hint: you can use last lesson's output as a source .ipynb file (or download one from the GitHub repository).

  • subtask 1 - Convert the document using jupyter nbconvert, e.g. jupyter nbconvert --to html my_notebook.ipynb (see the documentation for reference).

    • Hint: rerender the document after each step to see your progress.
    • Hint: keep your original document so you can compare it with the future versions. (You can do that using the --output my_old_nb option.)
  • subtask 2 - Have the notebook re-executed while it is being rendered. Use the --execute option.

  • subtask 3 - Get rid of the code chunks in the converted document. Use the --no-input option.

  • subtask 4 - Add some structure to the document

    • Use Markdown tags for headers (e.g. Goal, Data, Columns, ..., Summary). You may add some text as well (e.g. try to describe how to read a graph you've included).
    • Be sure to insert a (top-level) header # Columns and add several subsections (## Age, ## Sex, ...). This will be necessary for solving the following tasks. Note: it would be great to include a plot or table here (e.g. to describe the outcome ~ column relation), but any lorem ipsum is enough to test things out.
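A possible skeleton for the structure described above (the section names just follow the hints; replace the placeholder text with your own descriptions, plots and tables):

```markdown
# Goal
A short description of what the report shows.

# Data
Where the data comes from and what one row represents.

# Columns

## Age
Lorem ipsum (later: a plot/table describing the Survived ~ Age relation).

## Sex
Lorem ipsum.

# Summary
Key takeaways from the report.
```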

Task 2: Dataset overview using the pretty-jupyter

Goal: use the pretty-jupyter Python package to produce prettier reports (see the documentation for reference).

  • subtask 1 - Add --template pj to the nbconvert command. (You need to have pretty-jupyter installed!)
    • Hint: you can now get rid of the --no-input option; pretty-jupyter deals with that for you.
    • Hint: among other things, it gives you a table of contents, a prettier font, and more.
  • subtask 2 - Add report metadata
    • Add a raw-type cell as the first cell of your document and specify the metadata there, e.g.:
author: Me
title: Document name
date: 2022-01-01
    • You may check out this page's metadata.
  • subtask 3 - Use tabsets to convert the column-description sections into panels.
    • Add <span class='pj-token' style='display: none;'>.tabset</span> just after the # Columns header (see the docs for reference).
  • subtask 4 - Use dynamically rendered text; e.g. try to describe the dataset's dimensions without hardcoding them.
    • Add the magic %load_ext pretty_jupyter in some cell (it can be in the same cell where you load your packages).
    • Store the desired values in some variables, e.g. n_rows = data.shape[0].
    • Insert a code cell and add %%jmd (or %%jinja markdown) on the very first line of the cell.
    • Write your text in the cell. Add {{ n_rows }} to insert the variable's value.
    • Optionally, you may try to add a data frame (hint: you need to convert it to HTML first, use {{ df.to_html() }}) or a plot (use the matplotlib_fig_to_html() function from pretty_jupyter.helpers: {{ matplotlib_fig_to_html(plt) }}). See the docs for reference.
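Conceptually, a %%jmd cell is just Markdown rendered through the Jinja templating engine. A standalone sketch of that substitution step, using jinja2 directly and a hypothetical n_rows value (in the notebook you would write only the template text, inside a %%jmd cell):

```python
from jinja2 import Template

# A value you would normally compute in the notebook,
# e.g. n_rows = data.shape[0]; 891 is a made-up stand-in here.
n_rows = 891

# In the notebook this text lives in a %%jmd cell; pretty-jupyter
# renders it through Jinja, substituting the {{ ... }} expressions.
text = "The training dataset contains {{ n_rows }} passengers."
print(Template(text).render(n_rows=n_rows))
```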

Task 3 - Dataset overview using pandas-profiling

Goal: do the dataset profiling using the pandas-profiling Python package.

  • subtask 1 - Use pandas-profiling CMD tool to profile titanic_train.csv dataset.
    • Run pandas_profiling --title "Můj Titanic Train" data/titanic_train.csv. (You can find the resulting file in the same directory as the input data.)
  • subtask 2 - Use pandas-profiling package to profile titanic_train.csv dataset.
import pandas as pd
from pandas_profiling import ProfileReport

titanic_train = pd.read_csv("../data/titanic2/titanic_train.csv")
titanic_profile = ProfileReport(titanic_train, title="Titanic Profiling Report")
titanic_profile.to_file("my_titanic_report.html")
  • subtask 3 - Get rid of the correlations section (it may be time-consuming to compute), or keep just one of them:
titanic_profile = ProfileReport(
    titanic_train,
    title="Titanic Profiling Report",
    correlations={
        "auto": {"calculate": False},
        "pearson": {"calculate": False},
        "spearman": {"calculate": False},
        "kendall": {"calculate": False},
        "phi_k": {"calculate": False},
        "cramers": {"calculate": False},
    },
)
  • subtask 4 - Customize the report a bit. E.g., you can try:
    • change theme - html.style.theme (either flatly or united)
    • base color - html.style.primary_color (enter a hex code)
    • ... (see docs for more details).

Task 4 - Dataset overview using Quarto

Goal: produce the dataset overview using Quarto. (Of course, you need to install it first; see the download section for reference.)

  • subtask 1 - Convert the .ipynb into .html using Quarto.
    • Use the CMD utility quarto render <my_document.ipynb> --to html.
  • subtask 2 - (Optional; best to use VSCode + Quarto extension to do so)
    • Create a brand-new .qmd file.
    • You can mock-up the content using e.g. hello-world example. (You may need plotly package as well.)
    • Try to convert the document (use ctrl + shift + k shortcut in VSCode).
    • Try to tweak the metadata a bit (it's similar to pretty-jupyter's raw cell). Set e.g.
      • theme
      • toc (incl. toc position, toc title, etc.)
      • abstract
      • ...
      • Hint: docs for reference.
      • Hint: VSCode supports intellisense to help you with that.
    • Try to add cell metadata (see docs for reference).
      • Add a figure label (e.g. label: fig-age) and reference it in the text (e.g. with @fig-age).
      • Add eval: false to disable evaluation.
      • Add code-line-numbers to add (surprise, surprise) line numbers to the code chunk.
      • ...
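A sketch of what the metadata block at the top of a .qmd file might look like (the theme name and all values here are placeholders; see the Quarto docs for the full option list):

```yaml
---
title: "Titanic Dataset Overview"
author: "Me"
date: "2022-01-01"
abstract: "A brief exploratory look at the Titanic training data."
format:
  html:
    theme: cosmo
    toc: true
    toc-location: left
    toc-title: "Contents"
---
```

Cell metadata goes into special comments on the first lines of a code cell, e.g. #| eval: false or #| code-line-numbers: true.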

Task 5 - Exploratory analysis

Goal: prepare a brief exploratory analysis of the Titanic (train) dataset. Cover the following questions:

  • What is the outcome distribution (survived)?
  • What is the relation between the outcome and each individual column? (Visualise it!)
  • Are the missing values random, or do they (somehow) correspond to the outcome?
  • ... (add your own questions.)
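For instance, the first two questions boil down to a couple of pandas calls (sketched here on a tiny made-up sample; run the same calls on the real titanic_train data):

```python
import pandas as pd

# Tiny hypothetical sample standing in for the Titanic training data
data = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 0],
    "Sex": ["male", "female", "female", "male", "male"],
})

# Outcome distribution (relative frequencies of Survived)
print(data["Survived"].value_counts(normalize=True))

# Outcome ~ column relation: survival rate per group
print(data.groupby("Sex")["Survived"].mean())
```

For the report itself, turn these tables into charts (e.g. bar plots) and describe them in the text.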

Do not forget to structure the report, describe the input data, and summarise your insights. Follow the tips covered in the lecture (e.g. use of colors, textual descriptions of charts, chart annotations, ...).