Task 1: Dataset overview
Goal: prepare a report (.html) that briefly describes the Titanic dataset.
Hint: you can use last lesson's output as a source .ipynb file (or download one from the GitHub repository).
subtask 1 - Convert the document using jupyter nbconvert (see documentation for reference).
- Hint: re-render the document after each step to see the progress.
- Hint: keep your original document so that you can compare it with later versions. (You can do this with the --output my_old_nb option.)
subtask 2 - Have the document recalculated while it is being rendered. Use the --execute option.
subtask 3 - Get rid of the code chunks in the converted document. Use the --no-input option.
subtask 4 - Add some structure to the document
- Use Markdown headers (e.g. Goal, Data, Columns, ..., Summary). You may add some text as well (e.g. try to describe how to read a graph you've included).
- Be sure to insert a (top-level) header # Columns and add several subsections (## Age, ## Sex, ...). You will need them to solve the following tasks. Note: it would be great to include a plot or table here (e.g. to describe the outcome ~ column relation), but any lorem ipsum is enough for testing.
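The nbconvert options from subtasks 1 to 3 can be combined into a single command. The sketch below only assembles that command as a Python list; the helper name nbconvert_command and the notebook name titanic.ipynb are hypothetical, and the command is not executed here.

```python
import shlex


def nbconvert_command(notebook, output=None, execute=False, hide_input=False):
    """Assemble a `jupyter nbconvert` call that renders a notebook to HTML."""
    cmd = ["jupyter", "nbconvert", "--to", "html"]
    if execute:
        cmd.append("--execute")      # subtask 2: re-run the notebook while rendering
    if hide_input:
        cmd.append("--no-input")     # subtask 3: hide the code chunks
    if output:
        cmd += ["--output", output]  # hint: write to a separate file name
    cmd.append(notebook)
    return cmd


# "titanic.ipynb" is a placeholder name; replace it with your own notebook.
cmd = nbconvert_command("titanic.ipynb", output="my_old_nb", execute=True, hide_input=True)
print(shlex.join(cmd))
```

To actually render the report, pass the list to subprocess.run(cmd, check=True) or paste the printed line into a terminal.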
Task 2: Dataset overview using pretty-jupyter
Goal: use the pretty-jupyter Python package to produce prettier reports (see documentation for reference).
- subtask 1 - Add --template pj to the nbconvert command. (You need to have pretty-jupyter installed!)
  - Hint: now you can drop the --no-input option; pretty-jupyter deals with that for you.
  - Hint: among other things, it gives you a table of contents and a prettier font.
- subtask 2 - Add report metadata
author: Me
title: Document name
date: 2022-01-01
- subtask 3 - Use tabsets to convert the column descriptions into panels.
  - Add [//]: # (-.- .tabset) just after the # Columns header (see docs for reference).
- subtask 4 - Use dynamically rendered text, e.g. try to describe the dataset's dimensions without hardcoding them.
  - Add the magic %load_ext pretty_jupyter in some cell (it can be the same cell where you load packages).
  - Store the desired values in variables, e.g. n_rows = data.shape[0].
  - Insert a code cell and put %%jmd (or %%jinja markdown) on the very first line of the cell.
  - Write your text into the cell. Add {{ n_rows }} to insert the variable's value.
  - Optionally, try adding a data frame (hint: you need to convert it to HTML first, use {{ df.to_html() }}) or a plot (use the matplotlib_fig_to_html() function from pretty_jupyter.helpers: {{ matplotlib_fig_to_html(plt) }}). See docs for reference.
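To see how the pieces from subtasks 2 to 4 fit together, the sketch below generates a small test notebook as plain JSON using only the standard library. It rests on several assumptions: the metadata goes into a raw cell at the top of the notebook, the tabset token is written as [//]: # (-.- .tabset), and the CSV path matches your data folder. The helper names and the variable n_cols are hypothetical. Check the pretty-jupyter docs before relying on any of this.

```python
import json


def code_cell(source):
    """Minimal nbformat v4 code cell."""
    return {"cell_type": "code", "execution_count": None,
            "metadata": {}, "outputs": [], "source": source}


def text_cell(cell_type, source):
    """Minimal nbformat v4 raw or markdown cell."""
    return {"cell_type": cell_type, "metadata": {}, "source": source}


cells = [
    # Subtask 2: report metadata (assumed to live in a raw cell at the top).
    text_cell("raw", "title: Document name\nauthor: Me\ndate: 2022-01-01"),
    # Subtask 4: load the extension and store the values in variables.
    code_cell("%load_ext pretty_jupyter\n"
              "import pandas as pd\n\n"
              "data = pd.read_csv('data/titanic_train.csv')  # assumed path\n"
              "n_rows, n_cols = data.shape"),
    # Subtask 4: Jinja Markdown cell; the values are filled in at render time.
    code_cell("%%jmd\n"
              "## Data\n"
              "The dataset has {{ n_rows }} rows and {{ n_cols }} columns."),
    # Subtask 3: tabset token right after the top-level header (assumed syntax).
    text_cell("markdown", "# Columns\n[//]: # (-.- .tabset)"),
    text_cell("markdown", "## Age\nLorem ipsum."),
    text_cell("markdown", "## Sex\nLorem ipsum."),
]

notebook = {"cells": cells, "metadata": {}, "nbformat": 4, "nbformat_minor": 4}
notebook_json = json.dumps(notebook, indent=1)
```

Writing notebook_json to a file such as pj_demo.ipynb (a placeholder name) and rendering it with jupyter nbconvert --to html --template pj --execute should give a report with a title block, a computed sentence and two tabs, provided the assumptions above hold.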
Task 3 - Dataset overview using pandas-profiling
Goal: profile the dataset using the pandas-profiling Python package.
- subtask 1 - Use the pandas-profiling command-line tool to profile the titanic_train.csv dataset.
  - Run pandas_profiling --title "Můj Titanic Train" data/titanic_train.csv. (You will find the resulting report in the same directory as the input data.)
- subtask 2 - Use the pandas-profiling package to profile the titanic_train.csv dataset.
import pandas as pd
from pandas_profiling import ProfileReport
titanic_train = pd.read_csv("../data/titanic2/titanic_train.csv")
titanic_profile = ProfileReport(titanic_train, title="Titanic Profiling Report")
titanic_profile.to_file("my_titanic_report.html")
- subtask 3 - Get rid of the correlation section (as it may be time-consuming), or keep just one correlation.
ProfileReport(titanic_train, title="Titanic Profiling Report", correlations={
"auto": {"calculate": False},
"pearson": {"calculate": False},
"spearman": {"calculate": False},
"kendall": {"calculate": False},
"phi_k": {"calculate": False},
"cramers": {"calculate": False},
})
- subtask 4 - Customize the report a bit. E.g., you can try to:
  - change the theme - html.style.theme (either flatly or united)
  - change the base color - html.style.primary_color (enter a hex code)
  - ... (see docs for more details).
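Subtasks 3 and 4 both come down to passing nested settings to ProfileReport. The sketch below builds those settings with two hypothetical helpers (correlation_settings, style_settings) and a hypothetical build_report function. It assumes that html.style.theme and html.style.primary_color can be passed as a nested html dictionary and that your installed version accepts the correlation names listed above; the hex color is an arbitrary example.

```python
CORRELATIONS = ("auto", "pearson", "spearman", "kendall", "phi_k", "cramers")


def correlation_settings(keep=()):
    """Switch off every correlation except the ones named in `keep`."""
    return {name: {"calculate": name in keep} for name in CORRELATIONS}


def style_settings(theme="flatly", primary_color="#337ab7"):
    """Nested form of html.style.theme and html.style.primary_color."""
    return {"style": {"theme": theme, "primary_color": primary_color}}


def build_report(csv_path, output_path):
    """Profile a CSV file; imports are local so the helpers work without pandas."""
    import pandas as pd
    from pandas_profiling import ProfileReport

    df = pd.read_csv(csv_path)
    profile = ProfileReport(
        df,
        title="Titanic Profiling Report",
        correlations=correlation_settings(keep=("spearman",)),  # keep just one
        html=style_settings(theme="united"),
    )
    profile.to_file(output_path)


options = correlation_settings(keep=("spearman",))
```

Calling build_report("../data/titanic2/titanic_train.csv", "my_titanic_report.html") would then write the customized report, assuming pandas and pandas-profiling are installed.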
Task 4 - Dataset overview using Quarto
Goal: do the dataset profiling using Quarto. (Of course, you need to install it first; see the download section for reference.)
- subtask 1 - Convert the .ipynb into .html using Quarto.
  - Use the command-line utility: quarto render <my_document.ipynb> --to html.
- subtask 2 - (Optional; best done in VSCode with the Quarto extension)
  - Create a brand-new .qmd file.
  - You can mock up the content using e.g. the hello-world example. (You may need the plotly package as well.)
  - Try to convert the document (use the ctrl + shift + k shortcut in VSCode).
  - Try to tweak the metadata a bit (it's similar to the pretty-jupyter raw cell). Set e.g.
    - theme
    - toc (incl. toc position, toc title, etc.)
    - abstract
    - ...
    - Hint: docs for reference.
    - Hint: VSCode supports IntelliSense to help you with that.
  - Try to add cell metadata (see docs for reference).
    - Add fig-label and use it in text.
    - Add eval: false to disable evaluation.
    - Add code-line-numbers to add (surprise, surprise) line numbers to the code chunk.
    - ...
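As a rough illustration of cell metadata, Quarto reads options from #| comment lines at the top of a code cell. The sketch below shows the body of one Python cell as it might appear inside a .qmd file; the label survival-counts and the outcome values are made up for illustration, and a figure label is left out because this cell draws no plot.

```python
#| label: survival-counts
#| eval: false
#| code-line-numbers: true
from collections import Counter

# Toy outcome values, made up for illustration (1 = survived, 0 = did not).
survived = [1, 0, 0, 1, 0]

# Count how often each outcome occurs.
counts = Counter(survived)
print(counts)
```

With eval: false the cell is shown in the rendered document but not run; removing that line lets Quarto execute it and include the printed counts.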
Task 5 - Exploratory analysis
Goal: prepare a brief exploratory analysis of the Titanic (train) dataset. Cover the following questions:
- What is the outcome distribution (survived)?
- What is the relation between the outcome and each individual column? (Visualise it!)
- Are values missing at random, or does missingness (somehow) correspond to the outcome?
- ... (add your own questions.)
Do not forget to structure the report, describe the input data and summarise your insights. Follow the tips covered in the lecture (e.g. use of colors, text descriptions of charts, chart annotations, ...).
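One possible starting point for the three questions is sketched below with pandas. The tiny data frame is made up so the snippet runs on its own, and the lowercase column names (survived, sex, age) are an assumption; adapt them to the columns in your copy of the dataset.

```python
import pandas as pd

# Toy rows, made up for illustration; replace with pd.read_csv(...) on the real file.
toy = pd.DataFrame({
    "survived": [1, 0, 1, 0, 0, 1],
    "sex": ["female", "male", "female", "male", "male", "female"],
    "age": [29.0, float("nan"), 35.0, 54.0, float("nan"), 4.0],
})

# 1) Outcome distribution: share of each outcome value.
outcome_share = toy["survived"].value_counts(normalize=True)

# 2) Outcome vs. one column: survival rate per category.
survival_by_sex = toy.groupby("sex")["survived"].mean()

# 3) Missingness vs. outcome: survival rate where age is missing or present.
age_status = (
    toy["age"].isna().map({True: "missing", False: "present"}).rename("age_status")
)
survival_by_age_status = toy.groupby(age_status)["survived"].mean()

print(outcome_share, survival_by_sex, survival_by_age_status, sep="\n\n")
```

Each result is a small Series that can be plotted directly (e.g. with .plot.bar()) and then annotated and described in text as the lecture suggests.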