Storytelling

or Effective DS Reporting

Dominik Matula

Outline

Intro
Theory
- Report types
- Report structure
- Communicating data
- Report as a code
Examples & Discussion

Motivation - Doing science

Motivation - Writing Science

As a scientist,
you’re a professional writer.

Joshua Schimel

Reports are an efficient way to communicate your DS work.
- They bring together your thoughts, your results, and the methods used to obtain them.

Motivation - Pseudosocial networks

TAČR Research project, Profinit + MFF UK

Example reports

DS Report types

Based on PURPOSE, AUDIENCE and RESOURCES

1) Based on PURPOSE

Data description
- DU + DP
Exploratory analysis
- DP
Explanatory analysis
- M + DP
Monitoring report
- E + M + D

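Relates to the phases of CRISP-DM (Cross-Industry Standard Process for Data Mining): DU = Data Understanding, DP = Data Preparation, M = Modeling, E = Evaluation, D = Deployment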

1a) Data description report

General dataset overview
- Statistical unit
- No. observations and variables
Individual variables overview
- Distribution, moments, extremes, missings, etc.
Dataset metainformation
- Do we have data docs?
- Do we know the gathering process?
- Fixed vs. periodically updated
Goal:
- Find out what we’re facing …
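
The per-variable overview above can be sketched in a few lines. This is a minimal illustration using only the Python standard library; the column name and values are made up, and `None` is assumed to mark a missing value.

```python
import statistics


def describe_variable(values):
    """Summarise one numeric variable; None marks a missing value."""
    present = [v for v in values if v is not None]
    return {
        "n": len(values),
        "missing": len(values) - len(present),
        "mean": statistics.mean(present),
        "sd": statistics.stdev(present),
        "min": min(present),
        "max": max(present),
    }


# Hypothetical toy column: body mass in grams, one value missing.
body_mass_g = [3750, 3800, None, 3450, 3650]
summary = describe_variable(body_mass_g)
print(summary)
```

In a real report a profiling tool or a data-frame library would produce the same numbers for every column at once; the point here is only which quantities belong in the overview.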

Dataset problems
- Missings (and way of missing)
- Outliers (and source of outlierness)
Tip: YData profiling (docs) from the command line: ydata_profiling data.csv report.html
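
One common way to flag candidate outliers is the 1.5 × IQR rule. The sketch below assumes that rule and a made-up list of amounts; it only flags values, and finding the source of the outlierness remains manual work.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]


# Hypothetical data with one suspicious value.
amounts = [12, 14, 15, 15, 16, 18, 19, 250]
flagged = iqr_outliers(amounts)
print(flagged)
```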

1b) Exploratory analysis report

Most creative part
- You need to find out what’s in the data.
- There are no strict guidelines.
Relationships between variables
- Charts / Tables / Summary stats.
- Explaining outcome variability (if any).
Goal:
- Familiarize yourself with the data.
- Generate hypotheses (stories).

The First Landing of Christopher Columbus in America Painting by Dioscoro Teofilo Puebla Tolin

1c) Explanatory analysis report

Model building phase
- Selecting a subpopulation
- Variable transformation
- Feature engineering
- Modelling
Model explanation
- Model performance evaluation
- Main drivers (e.g., regression coefficients, SHAP values, …)
Goal:
- Reveal what’s going on
- Collect evidence

Christopher Columbus being given the sailing commission by King Ferdinand and Queen Isabella for his Enterprise of the Indies in Santa Fe, Spain, on April 30, 1492

1d) Monitoring report

Goal: Ensure everything is fine and the model is behaving well.
To be run repeatedly
- New version of model
- New version of data
- Tip: Automate it!
Model characteristics:
- Performance metrics
- Calibration plots
- Feature importances, SHAP values, …

Further analyses
- Depend on business demand
- E.g., what value does the model bring, A/B testing comparisons
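
The numbers behind a calibration plot can be sketched as follows: predictions are grouped into equal-width probability bins, and the mean predicted probability is compared with the observed positive rate in each bin. The scores and outcomes are invented, and equal-width binning is an assumption (quantile bins are a common alternative).

```python
def calibration_table(y_true, y_prob, n_bins=5):
    """Compare mean predicted probability with observed rate per bin."""
    bins = [[] for _ in range(n_bins)]
    for label, prob in zip(y_true, y_prob):
        # A probability of exactly 1.0 falls into the last bin.
        index = min(int(prob * n_bins), n_bins - 1)
        bins[index].append((label, prob))
    rows = []
    for index, members in enumerate(bins):
        if not members:
            continue  # empty bins are left out of the table
        labels, probs = zip(*members)
        rows.append({
            "bin": index,
            "n": len(members),
            "mean_predicted": sum(probs) / len(probs),
            "observed_rate": sum(labels) / len(labels),
        })
    return rows


# Hypothetical outcomes and model scores.
y_true = [0, 0, 1, 0, 1, 1, 1, 0]
y_prob = [0.05, 0.15, 0.30, 0.35, 0.60, 0.70, 0.90, 0.95]
table = calibration_table(y_true, y_prob, n_bins=2)
for row in table:
    print(row)
```

For a well-calibrated model the two columns stay close to each other in every bin; a monitoring report would recompute this table for each new version of the model or data.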

2) Based on AUDIENCE

Knowledge report

Audience
- Co-worker / future me.
- To be studied on their own.
Tips
- High detail
- Include methodology
- Open questions / paths not followed

Business report

Audience
- Management, general public, customer
- To be presented / walked through
Tips
- Strong story line
- Fancy & straightforward graphics
- Low detail
- Call to action

3) Based on RESOURCES

Amount of prior knowledge
- Future me / Coworker / Business guy / General public
  - Data journalism (iRozhlas, …)
Amount of interaction
- Paper on its own / Paper walked through / Presentation / …
- One man show / Open discussion / Interactive exploration
- Online / Offline

Report structure

Academic paper

Abstract
- A brief synopsis of the paper – what did I do?
Introduction
- Context + purpose of the study – what’s the problem?
Methods
- How did I solve the problem?
Results
- What did I find out?
Discussion
- What does it mean? (interpretation in a context)
References, Acknowledgements, …

Data Science report

Motivation / Introduction
- What do we do and why?
Data
- What data do we use, source of data, docs reference
- Incl. some basic stats & sanity checks
Methodology
- This describes your work (data analyses, problems, findings, model building, …)
Summary / Results
- Lessons learned in a brief structured form.
Next steps
- Call to action, open questions

Example report.

Developing a storyline - OCAR

Storyline is a key concept
- Makes your audience absorb your thoughts
4 elements behind all stories (incl. DS)
- O - Opening
- C - Challenge
- A - Action
- R - Resolution

Narrative arc

A story is a set of nested arcs.

OCAR in reporting

Introduction
- Opening - first paragraph, describes the context/larger problem
- Background - further details, incl. what the reader should know, etc.
- Challenge - what are the specific hypotheses/questions/goals of the report
Data
- Background - further details, extends Opening (Introduction)
Methodology
- Action - what did we do?

Results / Summary
- Resolution - what did it all mean, what have we learned?
Next steps
- Cliffhanger, teaser for your future work

Developing the storyline faster

OCAR
- Opening – Challenge – Action – Resolution
- slowest, take your time working into the story.
ABDCE
- Action – Background – Development – Climax – Ending
- faster, get right into the action.
LDR
- Lead – Development – Resolution
- faster yet — but people will read to the end.
LD
- Lead – Development
- fastest — the whole story is up front.

Story follows data, not vice versa!

Communicating data

Choice of a channel

Choice of a channel has a big impact!
Text
- Good at describing a background/context
- Slow & tedious to retrieve information from
Tables
- Structured information, precise, side information
- Hard to spot trends
Charts
- High signal-to-noise ratio, a quick glance at trends
- May be misleading (deliberately or not), imprecise

Graphs in report - 4 basic rules

Pick relevant graphs only
- Include all graphs that support your story.
- Neither fewer, nor more.
- A report is not evidence of your hard work!

Use familiar graph types
- What does your audience already know?
- Prefer a high signal-to-noise ratio
- But do not overengineer your charts.
- Explain unfamiliar types.

Be consistent
- Use scales to tell the story (color, fill, shape, linetype, …)
- Follow general conventions
- Beware of default color palettes: category order depends on the label language, e.g., M<Ž (cze) while F<M (eng)
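
The M<Ž versus F<M pitfall can be reproduced without any plotting library. The sketch below assumes a library that hands out palette colors in sorted label order (one common default, not a universal one); the palette, codes, and label tables are illustrative.

```python
PALETTE = ["#1f77b4", "#ff7f0e"]  # assumed two-color default palette


def default_colors(labels):
    """Mimic a default: colors are handed out in sorted label order."""
    return dict(zip(sorted(set(labels)), PALETTE))


czech = default_colors(["Muž", "Žena"])
english = default_colors(["Female", "Male"])
# Men get the first color in the Czech chart, women get it in the English one.
print(czech)
print(english)

# Fix: pin colors to a language-independent code, then translate the labels.
COLOR_BY_CODE = {"F": "#ff7f0e", "M": "#1f77b4"}
LABELS = {
    "cze": {"F": "Žena", "M": "Muž"},
    "eng": {"F": "Female", "M": "Male"},
}


def pinned_colors(language):
    """Label-to-color mapping that is stable across languages."""
    return {LABELS[language][code]: color for code, color in COLOR_BY_CODE.items()}


print(pinned_colors("cze"))
print(pinned_colors("eng"))
```

A mapping like the one returned by `pinned_colors` can then be passed to the plotting call, for instance through the `color_discrete_map` argument of Plotly Express functions.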

Annotate, describe and link
- Add title, axis labels etc.
- Sum up the chart message.

And stay tuned, we’ll cover them in the next session.

Interactive charts

… are useful. Sometimes.
- Let readers explore on their own.
- Visualise 3D data, add pop-up labels, animation…
… are dangerous. Sometimes.
- Put the burden of storytelling on readers.
- May be time-consuming to consume.
- Often misused just because they’re cool.
- Lots of charts will slow down your UI/report.
Quite easy to create with the plotly package.
- Default visualization tool in Databricks

import plotly.express as px
import seaborn as sns
penguins = sns.load_dataset("penguins")

px.scatter(penguins, x="body_mass_g", y="bill_length_mm", color="sex")

Report as a code

Why ‘implement’ your reports

Manual work massively reduced
- Speed up
- Reproducible research
- Less error-prone
- Future-proof
IDE features available, e.g.
- Version control (Git)
- Code completion (and text completion using AI)
- …
Tools
- Python: Jupyter, Quarto, JupyterLab, DataSpell, …
- R: RMarkdown, Quarto, Sweave (outdated), Distill, Jupyter, …

Where it began

1984, Donald Knuth
Literate programming
Programming paradigm combining
- Program code (for computer)
- Narratives (for humans)

Where it began: Sweave

By Friedrich Leisch in 2002
LaTeX integration for the S programming language
Code chunks (incl. meta info)
Inline expressions
Limitations
- LaTeX focused
- No easy way to control image sizes
- Hard to extend (pgfSweave for TikZ, R2HTML, …; BUT only one at a time!)

Sweave example

# example.Snw
\documentclass[a4paper]{article}

\begin{document}

<<echo=false, results=hide>>=
library(lattice)
library(xtable)
data(cats, package="MASS")
@

\section*{The Cats Data}

Consider the \texttt{cats} regression example from
Venables \& Ripley (1997). The data frame contains
measurements of heart and body weight
of \Sexpr{nrow(cats)} cats (\Sexpr{sum(cats$Sex=="F")}
female, \Sexpr{sum(cats$Sex=="M")} male).

A linear regression model of heart weight by body weight
and sex can be fitted in R using the command
<<>>=
lm1 = lm(Hwt~Bwt*Sex, data=cats)
lm1
@
Tests for significance of the coefficients are shown in
Table~\ref{tab:coef}, a scatter plot including the
regression lines is shown in Figure~\ref{fig:cats}.

\SweaveOpts{echo=false}

<<results=tex>>=
xtable(
    lm1,
    caption="Linear regression model for cats data.",
    label="tab:coef"
)
@

\begin{figure}
    \centering
<<fig=true, width=12, height=6>>=
lset(col.whitebg())
print(xyplot(Hwt~Bwt|Sex, data=cats, type=c("p", "r")))
@
    \caption{The cats data from package MASS.}
    \label{fig:cats}
\end{figure}

\end{document}

# R script
library(tools)
Sweave("example.Snw")

Jupyter recap

Report-oriented open-source web IDE
No. 1 choice for Python-based research
- Other kernels available, too (e.g., R)
Various extensions & flavours
- jupyter lab – IDE, wrapper around notebooks, extensions, etc.
- jupyter nbconvert – converts .ipynb to .html etc.
- pretty-jupyter – prettier outputs, dynamic text, ToC, tabsets, etc.
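
The nbconvert export can also be scripted, which helps when a report has to be regenerated repeatedly. The sketch below only builds the command and runs it if Jupyter and the notebook are actually present; `report.ipynb` is a hypothetical file name.

```python
import shutil
import subprocess
from pathlib import Path


def nbconvert_command(notebook, hide_code=True):
    """Build the argument list for exporting a notebook to HTML."""
    command = ["jupyter", "nbconvert", "--to", "html", str(notebook)]
    if hide_code:
        command.append("--no-input")  # leave code cells out of the export
    return command


command = nbconvert_command("report.ipynb")  # hypothetical notebook
print(" ".join(command))

# Run only when Jupyter is installed and the notebook exists.
if shutil.which("jupyter") and Path("report.ipynb").exists():
    subprocess.run(command, check=True)
```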

Quarto intro

Next-gen Jupyter/RMarkdown
- Works with Python, R, Julia and Observable.
- Converts reports to various output formats (using Pandoc)
Standalone tool by Posit (RStudio)
- Install it from Quarto site
- CMD-line utility + VSCode extension
Native file format: .qmd
- Text file, design based on RMarkdown
- Natively supports .Rmd and .ipynb

Quarto output example

Quarto vs Jupyter

Notebook wars by Yihui Xie
Text file (Q) vs nested JSON (J)
- Git, code reviews
Not stored output (Q) vs Stored output (J)
- Out-of-order execution
- Fast enough rendering?
Quarto extras
- Customization, multiple output formats, prettier outputs
- Cross referencing, bibl., ToC, Tabsets, …
- Can render ipynb, too 👍
- Quarto projects
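
To see why stored output matters for Git and code reviews, it helps to look at a notebook as the JSON it is. The sketch below hand-writes a minimal notebook in the nbformat 4 layout and strips the stored outputs, which is roughly what output-clearing tools do before a commit; the notebook content is made up.

```python
import json


def strip_outputs(notebook):
    """Return a copy of a notebook dict without stored outputs."""
    cleaned = json.loads(json.dumps(notebook))  # cheap deep copy
    for cell in cleaned["cells"]:
        if cell["cell_type"] == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return cleaned


# Minimal hand-written notebook (nbformat 4 layout).
notebook = {
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Title"]},
        {
            "cell_type": "code",
            "execution_count": 3,
            "metadata": {},
            "outputs": [
                {"output_type": "stream", "name": "stdout", "text": ["42\n"]}
            ],
            "source": ["print(42)"],
        },
    ],
    "metadata": {},
    "nbformat": 4,
    "nbformat_minor": 4,
}

cleaned = strip_outputs(notebook)
print(json.dumps(cleaned, indent=1))
```

Even after stripping, the diff of a notebook is a diff of nested JSON, whereas a .qmd file diffs like any other text file.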

This is what Quarto looks like - a text file

Quarto in VSCode

Render document in terminal (.qmd, .ipynb or .rmd):
- quarto render path/to/notebook.qmd --to html
Extension that eases your work
- Autocompletion (yaml header, cell metadata)
- Live preview
- Shortcuts
- Outline
- Tip: Install quarto first. Then install this extension.
Useful shortcuts (defaults):
- ctrl + shift + k : knit (render) the document
- ctrl + alt + i : insert a code block
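
The same render command can be wrapped in a small script to rebuild several reports in one go. This is a sketch under assumptions: the file paths are hypothetical, and rendering is attempted only if the `quarto` executable and the source file are found.

```python
import shutil
import subprocess
from pathlib import Path


def quarto_render_command(source, output_format="html"):
    """Build the argument list for rendering one document with Quarto."""
    return ["quarto", "render", Path(source).as_posix(), "--to", output_format]


# Hypothetical report sources.
reports = ["explorations/clickers.qmd", "explorations/monitoring.ipynb"]
commands = [quarto_render_command(path) for path in reports]

for path, command in zip(reports, commands):
    print(" ".join(command))
    # Render only when Quarto is installed and the source file exists.
    if shutil.which("quarto") and Path(path).exists():
        subprocess.run(command, check=True)
```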

Quarto metainformation

YAML header to rule them all:

---
title: "Clickers vs. Nonclickers"
author: "Dominik Matula"
date: "2024-10-04"
abstract: "Some text ..."
format:
    html:
        toc: true
        embed-resources: true
        css: styles.css 
        ...
    pdf:
        ...
---

See docs for more details
- Tip: VSCode extension gives you autocompletion! (ctrl + space).
- Tip: You can share the config within Quarto project.

Quarto document structure

---
title: "My new report"
author: "[Dominik Matula](dmatula@profinit.eu)"
format: html
...
---

# Header

## Sub header
Some text

```{python}
#| eval: false
#| fig-cap: "A line plot"
import seaborn as sns
...
```

::: {.panel-tabset}

## Sub header 1

## Sub header 2

:::

Quarto presentation

With just a small adjustment, you can turn a report into a presentation
- Set format: revealjs in the header
- h2 headers become slides
- h1 headers become section dividers
- You can use ::: {} sections to set up behaviour, e.g.:
  - ::: {.callout-tip} gives you an info box
  - ::: {.fragment} reveals things on the next click
  - …
Details in documentation.

Best practices

Have the reports in the project Git repository
- E.g., a top-level folder named explorations
- Tip: have a subfolder per topic
Commit both the source files (.qmd/.ipynb) and rendered files (.html)
- You never know whether you’ll be able to render it in the future
- Tip: use Git LFS to reduce repo size
Follow the PR procedure for reports as well.
Tip: GitHub Pages is a great way to share your reports.
- Tip: Quarto blog.
Example project.

Literature

Joshua Schimel: Writing science
Cole N. Knaflic: Storytelling with data
Alberto Cairo: How charts lie
Claus O. Wilke: Fundamentals of data visualization
Yihui Xie: Rmarkdown
Yihui Xie: Notebook wars
Quarto guide

Exercises

#TidyTuesday

Weekly data project
- Source data available on GitHub
- Every week new data (since 2018)!
Datasets on various topics
- Mostly ready to use
Results shared & discussed on X.com
- #tidytuesday or this viewer

Exercises 1/2 - data exploration

Your task:

Download data
- Use either Coffee [2024-05-14], FeederWatch [2023-01-10], or any other TidyTuesday data.
- You can use the hazard data as well.
Profile data using ydata-profiling.
- Explore the report.
- Q: what is the statistical unit? Shouldn’t it be transformed beforehand (melt, pivot, ..)?
Exploratory report structure
- Create a new jupyter notebook & propose the report structure using text chunks with headers.
- You can add data loading cell & some data overview.
- Note: You should start with a research question. Pick your own or use one of the following:
  - Coffee: How do the coffee samples differ from each other (coffee_x_bitterness, coffee_x_acidity)?
  - Coffee: What are the main drivers of coffee taste (e.g., prefer_overall column vs. age, gender, number_of_children, political_affiliation, …)?
  - FeederWatch: Observed species (species_code) overview – how many of them have been spotted? Which ones are rare/most common?
  - FeederWatch: Are there species tightly coupled with a specific place? (Use either latitude & longitude etc., or another dataset to study area characteristics vs. specific species occurrences.)
  - Hazard: Does everybody have a favourite position (machine)? (Is player behaviour in the favourite position different from the other positions he/she visits?)
  - Hazard: How long does a typical player last? (You’ll need to join consecutive entries separated by small gaps to get a session.)
Export the report to html using jupyter nbconvert
- Try to use --no-input to turn off code cells.
- Tip: use pretty-jupyter package & --template pj to make it better looking
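
For the Hazard question on how long a typical player lasts, joining consecutive entries into sessions might look like the sketch below. The (start, end) layout of the entries, the 5-minute gap threshold, and the timestamps are all assumptions; the real hazard data may be structured differently.

```python
from datetime import datetime, timedelta


def build_sessions(entries, max_gap=timedelta(minutes=5)):
    """Merge consecutive (start, end) entries separated by at most max_gap."""
    sessions = []
    for start, end in sorted(entries):
        if sessions and start - sessions[-1][1] <= max_gap:
            # Small gap: extend the current session.
            sessions[-1] = (sessions[-1][0], max(sessions[-1][1], end))
        else:
            # Large gap (or first entry): open a new session.
            sessions.append((start, end))
    return sessions


def at(hour, minute):
    """Helper for readable timestamps on one hypothetical day."""
    return datetime(2024, 1, 1, hour, minute)


# Hypothetical entries of a single player.
entries = [
    (at(10, 0), at(10, 20)),
    (at(10, 22), at(10, 50)),
    (at(13, 0), at(13, 15)),
]
sessions = build_sessions(entries)
durations = [end - start for start, end in sessions]
print(durations)
```

The choice of `max_gap` drives the result, so it is worth stating in the report and checking how sensitive the session lengths are to it.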

Exercises 2/2 - Quarto first steps