Storytelling

or Effective DS Reporting

Outline

  • Intro
  • Theory
    • Report types
    • Report structure
    • Communicating data
    • Report as code
  • Examples & Discussion

Motivation - Doing science

 

Motivation - Writing Science

As a scientist,
you’re a professional writer.

Joshua Schimel

  • Reports are an efficient way to communicate your DS work.
    • They put together your thoughts, your results and the methods used to achieve them.

Motivation - Pseudosocial networks

TAČR Research project, Profinit + MFF UK


  • Example reports

DS Report types

Based on PURPOSE, AUDIENCE and RESOURCES

1) Based on PURPOSE

  1. Data description
    • DU + DP
  2. Exploratory analysis
    • DP
  3. Explanatory analysis
    • M + DP
  4. Monitoring report
    • E + M + D

Relates to CRISP-DM (Cross-Industry Standard Process for Data Mining) phases: DU = Data Understanding, DP = Data Preparation, M = Modeling, E = Evaluation, D = Deployment

1a) Data description report

  • General dataset overview
    • Statistical unit
    • No. observations and variables
  • Individual variables overview
    • Distribution, moments, extremes, missings, etc.
  • Dataset metainformation
    • Do we have data docs?
    • Do we know the gathering process?
    • Fixed vs. periodically updated
  • Goal:
    • Find out what we’re facing …
  • Dataset problems
    • Missings (and way of missing)
    • Outliers (and source of outlierness)
  • Tip: YData profiling (docs), run as: ydata_profiling data.csv report.html
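
The per-variable overview listed above (counts, missings, extremes, moments) can be sketched with the standard library alone. This is only an illustration of what a profiling tool computes, not a replacement for one; the toy dataset and the `describe` helper are hypothetical.

```python
import csv
import io
import statistics

# Hypothetical toy dataset; in practice you would read your own CSV file.
RAW = """age,income,city
34,52000,Prague
,48000,Brno
51,,Prague
29,61000,
"""


def describe(rows, column):
    """Return basic descriptive stats for one column (empty string = missing)."""
    values = [r[column] for r in rows]
    present = [v for v in values if v != ""]
    summary = {"n": len(values), "missing": len(values) - len(present)}
    try:
        numbers = [float(v) for v in present]
    except ValueError:
        # Non-numeric column: report the number of distinct values instead.
        summary["distinct"] = len(set(present))
        return summary
    summary.update(min=min(numbers), max=max(numbers), mean=statistics.mean(numbers))
    return summary


rows = list(csv.DictReader(io.StringIO(RAW)))
overview = {col: describe(rows, col) for col in rows[0]}
for col, stats in overview.items():
    print(col, stats)
```

Treating the empty string as "missing" is an assumption of this sketch; real datasets often encode missing values in several ways, which is exactly what the data description report should uncover.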

1b) Exploratory analysis report

  • Most creative part
    • You need to find out what’s in data.
    • There are no strict guidelines.
  • Variables relationships
    • Charts / Tables / Summary stats.
    • Explaining outcome variability (if any).
  • Goal:
    • Familiarize yourself with the data.
    • Generate hypotheses (stories).

The First Landing of Christopher Columbus in America Painting by Dioscoro Teofilo Puebla Tolin

1c) Explanatory analysis report

  • Model building phase
    • Selecting a subpopulation
    • Variables transformation
    • Feature engineering
    • Modelling
  • Model explanation
    • Model performance evaluation
    • Main drivers (e.g., regression coefficients, SHAP values, …)
  • Goal:
    • Reveal what’s going on
    • Collect evidence

Christopher Columbus being given the sailing commission by King Ferdinand and Queen Isabella for his Enterprise of the Indies in Santa Fe, Spain, on April 30, 1492

1d) Monitoring report

  • Goal: Make sure everything is fine and the model is behaving well.
  • To be run repeatedly
    • New version of model
    • New version of data
    • Tip: Automate it!
  • Model characteristics:
    • Performance metrics
    • Calibration plots
    • Feature importances, SHAP values, …
  • Further analyses
    • Depend on business demand
    • E.g., what value does the model bring, A/B testing comparisons
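
The numbers behind a calibration plot can be sketched as follows: predictions are binned, and the mean predicted probability in each bin is compared with the observed positive rate. This is a minimal sketch with equal-width bins; the scores, labels and the `calibration_table` helper are hypothetical.

```python
from collections import defaultdict


def calibration_table(y_true, y_prob, n_bins=5):
    """Bin predictions into equal-width bins and compare the mean predicted
    probability with the observed positive rate in each bin."""
    bins = defaultdict(list)
    for label, prob in zip(y_true, y_prob):
        # Clamp so that prob == 1.0 falls into the last bin.
        index = min(int(prob * n_bins), n_bins - 1)
        bins[index].append((label, prob))
    table = []
    for index in sorted(bins):
        labels, probs = zip(*bins[index])
        table.append({
            "bin": index,
            "count": len(labels),
            "mean_predicted": sum(probs) / len(probs),
            "observed_rate": sum(labels) / len(labels),
        })
    return table


# Hypothetical scores from a binary classifier and the true outcomes.
y_prob = [0.05, 0.10, 0.30, 0.35, 0.55, 0.60, 0.90, 0.95]
y_true = [0, 0, 0, 1, 1, 0, 1, 1]

table = calibration_table(y_true, y_prob, n_bins=2)
for row in table:
    print(row)
```

For a well-calibrated model the two columns stay close to each other in every bin; in a monitoring report the same table would be recomputed for each new model or data version.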

2) Based on AUDIENCE

Knowledge report

  • Audience
    • Co-worker / future me.
    • To be studied on its own.
  • Tips
    • High detail
    • Include methodology
    • Open questions / not followed paths

Business report

  • Audience
    • Management, general public, customer
    • To be presented / walked through
  • Tips
    • Strong story line
    • Fancy & straightforward graphics
    • Low detail
    • Call to action

3) Based on RESOURCES

  • Amount of prior knowledge
    • Future me / Co-worker / Business guy / General public - data journalism (iRozhlas, …)
  • Amount of interaction
    • Paper on its own / Paper walked through / Presentation /
    • One man show / Open discussion / Interactive exploration
    • Online / Offline

XKCD 365

Report structure

Academic paper

  • Abstract
    • A brief synopsis of the paper – what did I do?
  • Introduction
    • Context + purpose of the study – what’s the problem?
  • Methods
    • How did I solve the problem?
  • Results
    • What did I find out?
  • Discussion
    • What does it mean? (interpretation in a context)
  • References, Acknowledgements, …

DataScience report

  • Motivation / Introduction
    • What do we do and why?
  • Data
    • What data do we use, source of data, docs reference
    • Incl. some basic stats & sanity checks
  • Methodology
    • This describes your work (data analyses, problems, findings, model building, …)
  • Summary / Results
    • Lessons learned in a brief structured form.
  • Next steps
    • Call to action, open questions

Example report.

Developing a storyline - OCAR

  • Storyline is a key concept
    • Makes your audience absorb your thoughts
  • 4 elements behind all stories (incl. DS)
    • O - Opening
    • C - Challenge
    • A - Action
    • R - Resolution
  • Narrative arc

  • A story is a set of nested arcs.

OCAR in reporting

  • Introduction
    • Opening - first paragraph, describes the context/larger problem
    • Background - further details, incl. what the reader should know etc.
    • Challenge - what are the specific hypotheses/questions/goals of the report
  • Data
    • Background - further details, extends Opening (Introduction)
  • Methodology
    • Action - what did we do?
  • Results / Summary
    • Resolution - what did it all mean, what have we learned?
  • Next steps
    • Cliff hanger, teaser for your future work

Developing the storyline faster

  • OCAR
    • Opening – Challenge – Action – Resolution
    • slowest, take your time working into the story.
  • ABDCE
    • Action – Background – Development – Climax – Ending
    • faster, get right into the action.
  • LDR
    • Lead – Development – Resolution
    • faster yet — but people will read to the end.
  • LD
    • Lead – Development
    • fastest — the whole story is up front.

Story follows data, not vice versa!

Communicating data

Choice of a channel

  • Choice of a channel has a big impact!
  • Text
    • Good at describing a background/context
    • Slow & tedious to retrieve information from
  • Tables
    • Structured information, precise, side information
    • Hard to spot trends
  • Charts
    • High signal-to-noise ratio, a quick glance at trends
    • May be misleading (deliberately or not), imprecise

Graphs in report - 4 basic rules

  1. Pick relevant graphs only
    • Include all graphs that support your story.
    • Neither fewer, nor more.
    • A report is not evidence of your hard work!
  2. Use familiar graph types
    • How much does your audience know?
    • Prefer a high signal-to-noise ratio.
    • But do not overengineer your charts.
    • Explain unfamiliar types.
  3. Be consistent
    • Use scales to tell the story
      (color, fill, shape, linetype, …)
    • Follow general conventions
    • Beware of default color palettes that follow category order,
      e.g., M < Ž (Czech) while F < M (English)
  4. Annotate, describe and link
    • Add title, axis labels etc.
    • Sum up the chart message.
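
The palette pitfall from the "Be consistent" rule can be sketched as below. This assumes a plotting library that assigns default colors in sorted category order; the two-color palette and both helper functions are hypothetical.

```python
PALETTE = ["blue", "orange"]  # hypothetical two-colour default palette


def default_colors(categories):
    """Mimic a library that assigns palette colours in sorted category order."""
    return dict(zip(sorted(set(categories)), PALETTE))


english = default_colors(["M", "F", "F", "M"])  # F sorts before M
czech = default_colors(["M", "Ž", "Ž", "M"])    # M (muž) sorts before Ž (žena)

print(english)
print(czech)

# An explicit mapping keyed by meaning keeps colours stable across reports.
FIXED = {"female": "orange", "male": "blue"}
LABELS = {"F": "female", "Ž": "female", "M": "male"}


def fixed_colors(categories):
    """Assign colours by meaning, independent of label language or order."""
    return {c: FIXED[LABELS[c]] for c in set(categories)}


print(fixed_colors(["M", "F"]))
print(fixed_colors(["M", "Ž"]))
```

With the default assignment the same group silently switches color between the two label languages; the explicit mapping avoids that at the cost of one extra dictionary.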

And stay tuned, we’ll cover them in the next session.

Interactive charts

  • … are useful. Sometimes.
    • Let readers explore on their own.
    • Visualise 3D data, add pop-up labels, animation…
  • … are dangerous. Sometimes.
    • Put the burden of storytelling on readers.
    • May be time-consuming to consume.
    • Tempting to misuse just because they look cool.
    • Lots of charts will slow down your UI/report.
  • Quite easy to create with the plotly package.
    • Default visualization tool in Databricks
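import plotly.express as px
import seaborn as sns

# Load the penguins sample dataset (seaborn downloads it on first use)
penguins = sns.load_dataset("penguins")

# Interactive scatter plot of bill length vs. body mass, coloured by sex
fig = px.scatter(penguins, x="body_mass_g", y="bill_length_mm", color="sex")
fig.show()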

Report as code

Why ‘implement’ your reports

  • Manual work massively reduced
    • Speed up
    • Reproducible research
    • Less error-prone
    • Future-proof
  • IDE features available, e.g.
    • Version control (Git)
    • Code completion (and text completion using AI)
  • Tools
    • Python: Jupyter, Quarto, JupyterLab, DataSpell, …
    • R: RMarkdown, Quarto, Sweave (outdated), Distill, Jupyter, …

Where it began

  • 1984, Donald Knuth
  • Literate programming
  • Programming paradigm combining
    • Program code (for computer)
    • Narratives (for humans)

Where it began: Sweave

  • By Friedrich Leisch in 2002
  • LaTeX integration for the S programming language
  • Code chunks (incl. meta info)
  • Inline expressions
  • Limitations
    • LaTeX focused
    • No easy way to control image sizes
    • Hard to extend (pgfSweave for TikZ, R2HTML, …; BUT only one at a time!)

Sweave example

# example.Snw
\documentclass[a4paper]{article}

\begin{document}

<<echo=false, results=hide>>=
library(lattice)
library(xtable)
data(cats, package="MASS")
@

\section*{The Cats Data}

Consider the \texttt{cats} regression example from
Venables \& Ripley (1997). The data frame contains
measurements of heart and body weight
of \Sexpr{nrow(cats)} cats (\Sexpr{sum(cats$Sex=="F")}
female, \Sexpr{sum(cats$Sex=="M")} male).

A linear regression model of heart weight by body weight
and sex can be fitted in R using the command
<<>>=
lm1 = lm(Hwt~Bwt*Sex, data=cats)
lm1
@
Tests for significance of the coefficients are shown in
Table~\ref{tab:coef}, a scatter plot including the
regression lines is shown in Figure~\ref{fig:cats}.

\SweaveOpts{echo=false}

<<results=tex>>=
xtable(
    lm1,
    caption="Linear regression model for cats data.",
    label="tab:coef"
)
@

\begin{figure}
    \centering
<<fig=true, width=12, height=6>>=
trellis.par.set(theme = col.whitebg())
print(xyplot(Hwt~Bwt|Sex, data=cats, type=c("p", "r")))
@
    \caption{The cats data from package MASS.}
    \label{fig:cats}
\end{figure}

\end{document}
# R script
library(tools)
Sweave("example.Snw")

Jupyter recap

  • Report-oriented open-source web IDE
  • No. 1 choice for Python-based research
    • Other kernels available, too (e.g., R)
  • Various extensions & flavours
    • jupyter lab – IDE, wrapper around notebooks, extensions, etc.
    • jupyter nbconvert – converts .ipynb to .html etc.
    • pretty-jupyter – prettier outputs, dynamic text, ToC, tabsets, etc.

Sharing Jupyter notebooks

  • .ipynb is not a format to be shared with business / general public 🙂
  • Convert it using nbconvert (based on jinja package)
    • Various output formats, incl. .html and .pdf.
  • How to use:
    • CMD: jupyter nbconvert --to html path/to/notebook.ipynb
    • IDE:
  • Tips:
    • Hide code via --no-input (you can always share the .ipynb alongside)
    • Consider --execute to reevaluate cells (to ensure reproducibility)
    • Use pretty-jupyter extension to get ToC, tabsets, prettier output, …
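
When several notebooks are exported regularly, the command above can be wrapped in a small script. This is a sketch only: it uses the `--to`, `--no-input` and `--execute` flags mentioned in the tips, while the helper names and the notebook path are hypothetical.

```python
import subprocess


def nbconvert_command(notebook, to="html", hide_code=True, execute=True):
    """Build the `jupyter nbconvert` command line for one notebook."""
    command = ["jupyter", "nbconvert", "--to", to]
    if hide_code:
        command.append("--no-input")  # drop code cells from the output
    if execute:
        command.append("--execute")   # re-run all cells before converting
    command.append(notebook)
    return command


def export_notebook(notebook, **options):
    """Run the conversion; raises CalledProcessError if nbconvert fails."""
    subprocess.run(nbconvert_command(notebook, **options), check=True)


# Hypothetical path, matching the placeholder used in the command above.
print(" ".join(nbconvert_command("path/to/notebook.ipynb")))
```

Calling `export_notebook(...)` requires Jupyter to be installed and on the PATH; building the command separately keeps it easy to inspect and log.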

Quarto intro

  • Next-gen Jupyter/RMarkdown
    • Works with Python, R, Julia and Observable.
    • Converts reports to various output formats (using Pandoc)
  • Standalone tool by Posit (formerly RStudio)
    • Install it from Quarto site
    • CMD-line utility + VSCode extension
  • Native file format: .qmd
    • Text file, design based on RMarkdown
    • Natively supports .Rmd and .ipynb

Quarto output example

Quarto vs Jupyter

  • Notebook wars by Yihui Xie
  • Text file (Q) vs nested JSON (J)
    • GIT, code reviews
  • Not stored output (Q) vs Stored output (J)
    • Out-of-order execution
    • Fast enough rendering?
  • Quarto extras
    • Customization, multiple output formats, prettier outputs
    • Cross referencing, bibl., ToC, Tabsets, …
    • Can render ipynb, too 👍
    • Quarto projects
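
The "nested JSON with stored output" point can be illustrated with a hand-written notebook in the nbformat 4 layout. The sketch below strips outputs and execution counters, which is roughly what output-stripping tools do before a commit; the notebook content and the `strip_outputs` helper are illustrative only.

```python
import copy
import json

# A minimal hand-written notebook in the nbformat 4 layout (illustrative only).
notebook = {
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Report"]},
        {
            "cell_type": "code",
            "execution_count": 7,
            "metadata": {},
            "source": ["1 + 1"],
            "outputs": [{
                "output_type": "execute_result",
                "execution_count": 7,
                "metadata": {},
                "data": {"text/plain": ["2"]},
            }],
        },
    ],
}


def strip_outputs(nb):
    """Return a copy with outputs and execution counts removed from code cells."""
    clean = copy.deepcopy(nb)
    for cell in clean["cells"]:
        if cell["cell_type"] == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return clean


stripped = strip_outputs(notebook)
print(len(json.dumps(notebook)), "->", len(json.dumps(stripped)), "characters")
```

Even for a one-line cell, most of the file is output and bookkeeping rather than source, which is why notebook diffs in Git are noisy and why a plain-text .qmd reviews more cleanly.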

This is what Jupyter really looks like

This is what Quarto looks like - a text file

Quarto in VSCode

  • Render document in terminal (.qmd, .ipynb or .Rmd):
    • quarto render path/to/notebook.qmd --to html
  • Extension that eases your work
    • Autocompletion (yaml header, cell metadata)
    • Live preview
    • Shortcuts
    • Outline
    • Tip: Install quarto first. Then install this extension.
  • Useful shortcuts (defaults):
    • ctrl + shift + k : knit (render) the document
    • ctrl + alt + i : insert a code block

Quarto metainformation

  • YAML header to rule them all:
---
title: "Clickers vs. Nonclickers"
author: "Dominik Matula"
date: "2024-10-04"
abstract: "Some text ..."
format:
    html:
        toc: true
        embed-resources: true
        css: styles.css 
        ...
    pdf:
        ...
---
  • See docs for more details
    • Tip: VSCode extension gives you autocompletion! (ctrl + space).
    • Tip: You can share the config within Quarto project.

Quarto document structure

---
title: "My new report"
author: "[Dominik Matula](dmatula@profinit.eu)"
format: html
...
---

# Header

## Sub header
Some text

```{python}
#| eval: false
#| fig-cap: "A line plot"
import seaborn as sns
...
```

::: {.panel-tabset}

## Sub header 1

## Sub header 2

:::
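
Because a .qmd is just a text file, such a document can also be generated from a script, for example when the same report skeleton is reused for several topics. This is a minimal sketch; the `qmd_document` helper, the file name and the temporary folder are hypothetical, and rendering is still done with quarto render afterwards.

```python
import tempfile
from pathlib import Path


def qmd_document(title, author, body, fmt="html"):
    """Assemble a Quarto document: YAML header followed by the markdown body."""
    header = "\n".join([
        "---",
        f'title: "{title}"',
        f'author: "{author}"',
        f"format: {fmt}",
        "---",
    ])
    return header + "\n\n" + body + "\n"


body = "# Header\n\n## Sub header\n\nSome text"
text = qmd_document("My new report", "Dominik Matula", body)

# Write the document to a temporary folder (hypothetical location).
path = Path(tempfile.mkdtemp()) / "report.qmd"
path.write_text(text, encoding="utf-8")
print(text)
```

Note that the header is delimited by exactly three dashes on each side; anything else is not recognised as a YAML header.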

Quarto presentation

  • With just a small adjustment, you can turn a report into a presentation
    • Adjust format: revealjs in the header
    • h2 headings become slides
    • h1 headings become section dividers
    • You can use ::: {} sections to set up behaviour, e.g.:
      • ::: {.callout-tip} gives you info box
      • ::: {.fragment} reveals things on next click
  • Details in documentation.

Sharing your reports

Best practices

  • Have the reports in the project Git repository
    • E.g., top-level folder explorations etc.
    • Tip: have a subfolder per topic
  • Commit both the source files (.qmd/.ipynb) and rendered files (.html)
    • You never know whether you’ll be able to render it in the future
    • Tip: use Git LFS to reduce repo size
  • Follow the PR procedure for reports as well.
  • Tip: GitHub Pages is a great way to share your reports.
    • Tip: Quarto blog.
  • Example project.

Literature

Exercises

#TidyTuesday

  • Weekly data project
  • Datasets on various topics
    • Mostly ready to use
  • Results shared & discussed on X.com
    • #tidytuesday or this viewer

Exercises 1/2 - data exploration

Your task:

  • Download data
    • Use either Coffee [2024-05-14], FeederWatch [2023-01-10], or any other TidyTuesday data.
    • You can use the hazard data as well.
  • Profile data using ydata-profiling.
    • Explore the report.
    • Q: what is the statistical unit? Shouldn’t it be transformed beforehand (melt, pivot, ..)?
  • Exploratory report structure
    • Create a new jupyter notebook & propose the report structure using text chunks with headers.
    • You can add data loading cell & some data overview.
    • Note: You should start with a research question. Pick your own or use one of the following:
      • Coffee: How do the coffee samples differ from each other (coffee_x_bitterness, coffee_x_acidity)?
      • Coffee: What are the main drivers of coffee taste (e.g., prefer_overall column vs. age, gender, number_of_children, political_affiliation, …)?
      • FeederWatch: Observed species (species_code) overview – how many of them have been spotted? Which ones are rare/most common?
      • FeederWatch: Are there species tightly coupled with a specific place? (Use either latitude & longitude etc., or another dataset to study area characteristics vs. specific species occurrences.)
      • Hazard: Does everybody have a favourite position (machine)? (Is player behaviour in the favourite position different from the other positions he/she visits?)
      • Hazard: How long does a typical player last? (You’ll need to join consecutive entries with some small gaps to get a session.)
  • Export the report to html using jupyter nbconvert
    • Try to use --no-input to turn off code cells.
    • Tip: use pretty-jupyter package & --template pj to make it better looking
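
The Hazard hint about joining consecutive entries into sessions can be sketched as an interval merge with a gap tolerance. This is a data-preparation sketch only; the entry format, the sample timestamps and the 5-minute threshold are assumptions, not properties of the actual dataset.

```python
from datetime import datetime, timedelta


def merge_sessions(entries, max_gap=timedelta(minutes=5)):
    """Merge (start, end) entries into sessions when the gap between the end
    of one entry and the start of the next is at most `max_gap`."""
    sessions = []
    for start, end in sorted(entries):
        if sessions and start - sessions[-1][1] <= max_gap:
            # Close enough to the previous entry: extend the current session.
            sessions[-1][1] = max(sessions[-1][1], end)
        else:
            sessions.append([start, end])
    return [tuple(s) for s in sessions]


def at(hhmm):
    """Helper: a timestamp on one hypothetical day."""
    return datetime.strptime("2024-01-01 " + hhmm, "%Y-%m-%d %H:%M")


# Hypothetical entries of a single player.
entries = [
    (at("10:00"), at("10:20")),
    (at("10:22"), at("10:40")),  # 2-minute gap: same session
    (at("12:00"), at("12:30")),  # 80-minute gap: new session
]
sessions = merge_sessions(entries)
durations = [end - start for start, end in sessions]
print(durations)
```

In the real data the merge would be applied per player, and the choice of gap threshold is itself worth discussing in the report.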

Exercises 2/2 - Quarto first steps

Your task:

  • Install Quarto + VSCode on your computer. Install the Quarto extension as well.
  • Create a new Quarto document (.qmd)
  • Set up the YAML header of the document - author, title, date
  • Add some content - text paragraphs (lorem ipsum) and code blocks.
  • Render the document (shortcut: ctrl + shift + k)
    • Note: you can even open & render your .ipynb notebook!
  • Change format to revealjs in the report header (yaml). Rerender the document.
    • h2 sections become slides, h1 sections become section separators.