DS Visualisation

A picture is worth a thousand words

Outline

  • Intro
  • Charts ZOO
  • Plotting tools
  • Grammar of graphics
  • Plotnine
  • Plotting geo data

Motivation

Datasaurus dozen

Motivation

Datasaurus dozen, incl. summary stats

Project usecase

  • Not so uncommon scenario:
    • You have a binary target
    • You have (engineered) tons of features
    • You need to find out how much information they contain
  • Examples:
    • Dynamic pricing project (CZ) – see example report
    • MCF Hackathon (UK)
    • Next best offer (CZ)

Charts ZOO

How to decide what chart to use

  • Sometimes it’s obvious. Sometimes it’s a hard problem…
  • Think about what story you would like to tell.
    • What is the key message you need to continue the story with?
    • Is there a standard plot type that matches your needs?
  • Note:
    • It’s like belles-lettres. The more you read, the better you know what’s really good.
    • The best way is to practice. (Data journalism, feedback from others, …)
    • Of course, you need to know your tools.

I prefer to start with a pencil and a sheet of paper…

Plotting tools intro

Optional part of the lecture

Most common plotting tools

  • matplotlib (2003)
  • seaborn (2013)
  • plotly (2013)
  • plotnine (2017)
  • seaborn.objects (2022)

Matplotlib

  • Library for making 2D plots in Python
  • Built on top of NumPy
  • By John D. Hunter (1968-2012), neurobiologist
    • Built as a patch to IPython to enable MATLAB-style plotting
import matplotlib as mpl
import matplotlib.pyplot as plt

Matplotlib: Figure x Axes

  • Axes (canvas) vs Figure (whole picture)
    • Figure contains one or more axes
  • You can get them via plt.figure() (returns Figure), plt.subplots() (returns Figure and Axes tuple)
  • But we won’t be studying plotting in matplotlib
    • Although, some details will be covered later.
    • See the docs for more details.

Matplotlib figure - example

import matplotlib.pyplot as plt

fig, ax = plt.subplots()

fig, axes = plt.subplots(nrows=2, ncols=2)

fig = plt.figure()
fig.add_axes([0,0,0.5,0.5])    # xmin, ymin, dx, dy
fig.add_axes([0.5,0.5,0.5,0.5]) 

axes = fig.axes
axes[0].set_title("1st")
axes[1].set_title("2nd")

fig.show()

Seaborn 101

  • Seaborn is a Python library for making statistical graphics
    • Built on top of matplotlib
    • Closely related to pandas DataFrames
    • De facto standard tool for Python DS visualizations.
  • Three figure-level functions:
    • relplot - x~y relation plots
      (scatterplot, linechart, ..)
    • displot - distribution plots
      (histogram, KDE, ..)
    • catplot - categorical plots
      (barplot, boxplot, ..)
  • A lot of params – the docs are your friend!
    • Tip: kind = change plot type, col = create subplots (facets)

Seaborn 101 - examples

import seaborn as sns

tips = sns.load_dataset("tips")
tips.head()
   total_bill   tip     sex smoker  day    time  size
0       16.99  1.01  Female     No  Sun  Dinner     2
1       10.34  1.66    Male     No  Sun  Dinner     3
2       21.01  3.50    Male     No  Sun  Dinner     3
3       23.68  3.31    Male     No  Sun  Dinner     2
4       24.59  3.61  Female     No  Sun  Dinner     4
sns.displot(data = tips, x="total_bill")

sns.displot(data = tips, x="total_bill", kind="kde")

sns.relplot(data = tips, x="total_bill", y="tip", color="black")

sns.relplot(data = tips, x="total_bill", y="tip", hue="sex")

sns.catplot(data = tips, x="sex", y="tip", kind="box")

sns.displot(data = tips, x="tip", hue="time", col="sex", kind="kde")

Seaborn 102

  • There are other functions to produce plots in Seaborn
    • sns.scatterplot, sns.lineplot, …
    • You plot them directly on a canvas (ax)
      • The last active ax is used if you do not provide one (ax=ax)
      • A new one is created if none exists yet
  • Plus you can customize default outputs using matplotlib (add lines, texts, etc.)
  • Usual approach
    • Create canvas (figure, ax/es)
    • Plot charts on ax/es
    • Customize (add labels etc.)
    • And .show the figure (if needed)

Seaborn 102 - example

import seaborn as sns
import matplotlib.pyplot as plt

fig, axes = plt.subplots(nrows=1, ncols=2)
sns.scatterplot(data=tips, x="total_bill", y="tip", hue="sex", ax=axes[0])
sns.scatterplot(data=tips, x="total_bill", y="tip", hue="smoker", ax=axes[1])

fig.suptitle("Tipping")
for ax in axes:
    ax.set_xlabel("Total bill [$]")
    ax.set_ylabel("Tip [$]")

fig.show()

Seaborn 103

  • We can use matplotlib to customize the plot
    • Use the ax’s methods to draw extra elements
    • Plus there are other ways (e.g., plt itself)
  • Some of the most handy artefacts to add
    • ax.axvline, ax.axhline – vertical/horizontal line
    • ax.axline – arbitrary line (xy1, xy2 or slope)
    • ax.annotate – add arbitrary text (incl. arrow if needed)
    • ax.arrow – add only the arrow
  • You can customize labels etc. as well
    • ax.set_title
    • ax.set_xlabel, ax.set_ylabel

Seaborn 103 - example

fig, ax = plt.subplots(1, 1)
sns.scatterplot(data=tips, x="total_bill", y="tip", ax=ax)
ax.set_title("Title")
ax.axhline(y=2, color="red")
ax.text(x=35, y=1.6, s="Critical threshold", color="red")
ax.annotate(
    text="Pepa",
    xy=[7.3, 5.5],
    xytext=[7.3, 7],
    arrowprops=dict(facecolor="black"),
    horizontalalignment="center",
)

fig.show()                 
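
The helpers from the list that the example above does not use (ax.axvline, ax.axline, ax.arrow) can be sketched on plain matplotlib as follows. This is only an illustration: the data points and the 15 % reference slope are made up, not taken from the tips dataset.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; drop it in a notebook
import matplotlib.pyplot as plt

# Made-up bills and tips (illustrative values only)
bills = [10.0, 20.0, 30.0, 40.0]
tip_values = [1.5, 3.5, 4.0, 6.5]

fig, ax = plt.subplots()
ax.scatter(bills, tip_values)

# Vertical line at the mean bill
mean_bill = sum(bills) / len(bills)
ax.axvline(x=mean_bill, color="gray", linestyle="dashed")

# Arbitrary line given by a point and a slope: a hypothetical "15 % tip" reference
ax.axline(xy1=(0, 0), slope=0.15, color="red")

# Bare arrow (no text) pointing at the last observation
ax.arrow(x=35, y=3.0, dx=4, dy=3, head_width=0.5, length_includes_head=True)

ax.set_xlabel("Total bill [$]")
ax.set_ylabel("Tip [$]")
```

Note that ax.axline extends across the whole axes, whereas ax.plot would stop at the first and last data point.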

Imperative x Declarative plotting

Leonardo’s Task

Francesco del Giocondo, 1465-1538

Leonardo da Vinci, 1452-1519

Mona Lisa (Midjourney)

Imperative plotting

  • Providing detailed instructions step by step
    • Draw a yellow circle in the middle.
    • Draw two small black circles side by side..
    • (…)
    • Paint a small white dot on position x=37, y=59.
  • Pros:
    • Full control.
    • Get whatever you want
  • Cons:
    • Tedious.
    • Leonardos are rare.
  • matplotlib (to some extent)

Mona Lisa (DeviantArt)

Declarative plotting

  • Asking a friend/artist to draw a picture according to your needs
    • Can you draw me a young lady with dark hair sitting alone and having a mysterious smile on her face?
  • Pros
    • Nice-looking results out of the box.
  • Cons
    • Lower level of control.
    • You need to know the language to express your needs.
  • Functional API: seaborn
  • Object API: plotnine, seaborn.objects

Elizabeth Campel, 1812

Bonus: Declarative + AI

Midjourney - UI example

Sorcery plotting

  • Like Declarative, but instead of explaining your needs you just wave your magic wand and scream a magic formula..
    • Mona Lisa!
  • Pros
    • Negligible effort.
    • Masterpiece?
  • Cons:
    • Need to know (a lot of) magic words.
    • Might not correspond to your needs.
    • It usually needs to be customized afterwards.
      But sometimes it’s not possible…
  • Databricks, Excel (and seaborn..)

Ex: Imperative - matplotlib

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
tips = sns.load_dataset("tips")

# Create a figure and axes:
fig, ax = plt.subplots()

# Plot the data:
ax.scatter(x=tips['total_bill'], y=tips['tip'])

# Fit and plot a linear regression line:
m, b = np.polyfit(x=tips['total_bill'], y=tips['tip'], deg=1)
ax.plot(tips['total_bill'], m*tips['total_bill'] + b, color='red')

# Set the labels, title, and legend:
ax.set_xlabel('Total bill')
ax.set_ylabel('Tip')
ax.set_title('Imperative Example - Matplotlib (fig, ax)')
ax.legend(['Data', 'Linear fit'])

Ex: Declarative - seaborn

import matplotlib.pyplot as plt
import seaborn as sns
tips = sns.load_dataset("tips")

# Declare the variables -> elements relationship
sns.regplot(data=tips, x='total_bill', y='tip', line_kws={'label': 'Linear Fit', 'color': 'red'})

# Add non-data ink
plt.title('Declarative Example - Seaborn')
plt.legend(labels=['Data', 'Linear Fit']);

Ex: Declarative - seaborn.objects

import seaborn.objects as so
import seaborn as sns
tips = sns.load_dataset("tips")

(
    tips
    .pipe(so.Plot, x="total_bill", y="tip", color="smoker")
    .add(so.Dot())
    .add(so.Line(color="red"), so.PolyFit())
    .label(
        title="Declarative using seaborn.objects"
    )
)

Ex: Sorcery - Databricks

Grammar of graphics

How to communicate our plotting desires?

What is it?

Grammar is the system and structure of a language.

Grammar of Graphics is a framework that allows us to explain the components of any visual in a straightforward manner.

By Leland Wilkinson, 1999

Layered Grammar of Graphics

  • By Hadley Wickham, 2010
  • Journal of Computational and Graphical Statistics
  • Extensions and refinements developed while working on ggplot2

A Layered Grammar of graphics, available online

7 Components of GG

  1. Theme = Adds all non-data ink
  2. Coordinates = How do we position the visual?
  3. Statistics = How do we preprocess the data?
  4. Facets = Do we split the visual into subplots?
  5. Geometric objects = What marks are we using?
  6. Aesthetics = How do we show it?
  7. Data = What do we want to show?

Important

This is not how seaborn works. Its author is implementing the grammar in the seaborn.objects module (seaborn >= 0.12).
Nevertheless, it is worth knowing how to think about plots this way. We cover only the basics relevant for seaborn.

GG in Python

  • seaborn.objects
    • Object API on top of seaborn
    • Available since seaborn 0.12 (2022)
    • Big step forward (wrt. seaborn), but still limited & slow development
  • plotnine
    • Implementation of Grammar of graphics in Python by Hassan Kibirige
    • Based on ggplot2 (R) (Layered Grammar of Graphics, Hadley Wickham)
    • Built with: matplotlib, pandas, mizani (scales), statsmodels, scipy
  • Follows ggplot2’s unusual syntax: layers are glued together with the + operator
import plotnine as gg  # p9, pn, ...

gg.ggplot().show()

1) Data

  • This makes the difference - Painting vs Plotting
  • Data format
    • wide vs. long vs. tidy data
    • Use pandas .pivot, .melt, pd.wide_to_long to switch formats
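
Switching between the wide and long formats can be sketched like this. The table is made up for illustration, and the column names (day, time, total_bill) only mimic the tips dataset.

```python
import pandas as pd

# Made-up wide table: one row per day, one column per meal time
wide = pd.DataFrame({
    "day": ["Sat", "Sun"],
    "Lunch": [10.0, 12.0],
    "Dinner": [20.0, 25.0],
})

# wide -> long: one row per (day, time) observation
long_df = wide.melt(id_vars="day", var_name="time", value_name="total_bill")

# long -> wide again: one column per distinct value of "time"
wide_again = (
    long_df
    .pivot(index="day", columns="time", values="total_bill")
    .reset_index()
)
```

The long form is usually the one plotting libraries expect, because every aesthetic can then be mapped to a single column.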

Ex: data - Hazard

  • Adjusted Hazard - log-ins
    • sessions clipped to 2.5h (cut end)
    • overlaps resolved (cut first)
    • deduplicated
  • Added a few characteristics
    • Hour / Day of week / Day of month / Week of year / Month
    • Place characteristics
    • Riskiness of the player based on our model
df_v_konto_key df_v_misto_key df_v_herni_pozice_key prihlasenicas odhlasenicas doba_hrani doba_pauzy typmisto sidlokodruian jtsk_y ... obec psc ulice cp skore hodina den_v_tydnu den_v_mesici mesic tyden
0 31347675069 127018397 156812556 2023-03-31 17:27:24 2023-03-31 17:31:33 249.0 NaN H 18167829.0 464086.17 ... Lysá nad Labem 28922 Smetanova 789 0.0 17 Friday 31 March 13
1 31347675069 127018397 156812550 2023-03-31 17:34:21 2023-03-31 17:27:24 -417.0 0.0 H 18167829.0 464086.17 ... Lysá nad Labem 28922 Smetanova 789 0.0 17 Friday 31 March 13
2 31347675069 127018397 156812571 2023-03-31 18:09:20 2023-03-31 17:34:21 -2099.0 0.0 H 18167829.0 464086.17 ... Lysá nad Labem 28922 Smetanova 789 0.0 18 Friday 31 March 13
3 31347675069 127018397 156812571 2023-03-31 18:09:37 2023-03-31 18:09:20 -17.0 0.0 H 18167829.0 464086.17 ... Lysá nad Labem 28922 Smetanova 789 0.0 18 Friday 31 March 13
4 31347675069 127018397 160811749 2023-03-31 18:29:13 2023-03-31 18:09:37 -1176.0 0.0 H 18167829.0 464086.17 ... Lysá nad Labem 28922 Smetanova 789 0.0 18 Friday 31 March 13

5 rows × 22 columns

Ex: plotnine

  • Goal: What is the distribution of log-ins in time? (hodina)
    • Note: you can specify data globally in ggplot() and locally in a given layer (stay tuned)
gg.ggplot(data = prihl).show()

2) Aesthetics

sns.scatterplot(data=tips, x="total_bill", y="tip", hue="smoker")  # using mapping

2) Aesthetics

  • Mapping of variable values to chart properties
    • sns.relplot(data=tips, x="total_bill", y="tip")
    • It’s called Aesthetics (ggplot2) x (Mark) properties (seaborn.objects)
  • How to:
    • use mapping = gg.aes() to assign column names to aesthetics
      • you can set it globally (ggplot() param) or locally (on geometry-level, stay tuned)
      • e.g.: ggplot(data=tips, mapping=gg.aes(x = "total_bill", y="tip", color="sex"))
    • use scale_<scalename>_<scaletype>() to customize aesthetics mapping
      • e.g.: scale_x_continuous(breaks = np.arange(0, 100, 10))
      • e.g.: scale_color_manual(values = {"Male": "blue", "Female": "orangered"})

Important

Reminder: not fully integrated concept in bare seaborn!

Non-aesthetics way

pocty_her = (
    prihl
    .groupby(["den_v_tydnu"], as_index=False)
    .agg(n = ("df_v_konto_key", "count"))
)

x_vals = pocty_her["den_v_tydnu"]
y_vals = pocty_her["n"]

sns.barplot(x=y_vals, y=x_vals)

Common aesthetics

  • Position: x, y
  • Size: pointsize, linewidth
  • Color: hue (single color provided by color)
  • Transparency: alpha
  • Style: marker, fill, linestyle, edgestyle
  • Text properties: text, halign, valign, fontsize

Ex: plotnine

  • Goal: What is the distribution of log-ins in time? (hodina)
  • How do we visualize variable distribution? histogram
  • What aesthetics does the plot have? x
gg.ggplot(data = prihl, mapping = gg.aes(x="hodina"))

(
    gg.ggplot(data = prihl, mapping = gg.aes(x="hodina"))
    + gg.scale_x_continuous(breaks = np.arange(0, 24, 2))
)

3) Statistics

  • Sometimes it’s necessary to aggregate data before mapping
    • Barplot (count)
    • Histogram, KDE (cut & count)
    • Regression line (model fit)
    • Boxplot (quantiles, IQR)
  • Either part of the plot or aggregated beforehand
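
As a rough sketch of what "aggregated beforehand" means, the four statistics listed above can be computed by hand with pandas and NumPy. The data is made up, and the bin edges are an arbitrary choice for illustration.

```python
import numpy as np
import pandas as pd

# Made-up observations (illustrative values only)
df = pd.DataFrame({
    "day": ["Sat", "Sun", "Sat", "Sun", "Sun", "Sat"],
    "total_bill": [10.0, 20.0, 30.0, 40.0, 50.0, 60.0],
    "tip": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
})

# Barplot: count rows per category
counts = df.groupby("day").size()

# Histogram: cut the range into bins, then count
bin_counts, bin_edges = np.histogram(df["total_bill"], bins=[0, 20, 40, 60])

# Regression line: fit a model (here a degree-1 polynomial)
slope, intercept = np.polyfit(df["total_bill"], df["tip"], deg=1)

# Boxplot: quartiles and the interquartile range
q1, median, q3 = df["tip"].quantile([0.25, 0.5, 0.75])
iqr = q3 - q1
```

A plotting library that computes the statistic itself does the equivalent of one of these steps before drawing.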

Ex: plotnine

  • Goal: What is the distribution of log-ins in time? (hodina)
  • How do we visualize variable distribution? histogram
  • What aesthetics does the plot have? x
  • What statistic do we need? count
(
    gg.ggplot(data = prihl, mapping = gg.aes(x="hodina"))
    + gg.stat_count()
    + gg.scale_x_continuous(breaks=np.arange(0, 24, 2))
)

4) Geometric objects

  • How to actually visualize the data (data ink)
    • Relies on aesthetics
    • The list of supported aesthetics differs among geoms
  • Basic ones
    • points, lines, bars, tiles, text, …
  • Custom and compounded
    • histogram, boxplot, area, contour, …

Ex: plotnine

  • Goal: What is the distribution of log-ins in time? (hodina)
  • How do we visualize variable distribution? histogram
  • What aesthetics does the plot have? x
  • What statistic do we need? count
  • What geom will we use to display the statistic?
(
    gg.ggplot(data = prihl, mapping = gg.aes(x="hodina"))
    + gg.stat_count(geom="bar")
    + gg.scale_x_continuous(breaks=np.arange(0, 24, 2))
)

(
    gg.ggplot(data = prihl, mapping = gg.aes(x="hodina"))
    + gg.stat_count(geom="line")
    + gg.scale_x_continuous(breaks=np.arange(0, 24, 2))
)

Ex: plotnine

  • Goal: What is the distribution of log-ins in time? (hodina)
  • How do we visualize variable distribution? histogram
  • What aesthetics does the plot have? x, group
  • What statistic do we need? count
  • What geom will we use to display the statistic? line
(
    gg.ggplot(data = prihl, mapping = gg.aes(x="hodina", group="tyden"))
    + gg.stat_count(geom="line")
    + gg.scale_x_continuous(breaks=np.arange(0, 24, 2))
)

(
    gg.ggplot(data = prihl, mapping = gg.aes(x="hodina", group="tyden"))
    + gg.geom_line(stat="count", alpha=.3)
    + gg.geom_vline(xintercept=[1, 8], color="firebrick", linetype="dashed")
    + gg.scale_x_continuous(breaks=np.arange(0, 24, 2))
)

5) Facets

  • A way to split visualization into subplots,
    • while keeping other components untouched
  • It’s like another aesthetic (maps discrete variables)
  • Figure-level functions
    • add param col="<variable>"
    • (relplot, displot, catplot)
  • Axes-level functions
    • Prepare canvas with two or more axes
      • E.g., using matplotlib.pyplot’s subplots function
    • Draw the plots on each of them
    • Note: you can use sns.FacetGrid().map() etc. as well
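
The axes-level route and the FacetGrid route can be sketched side by side. The data frame is a made-up stand-in for the tips dataset, and splitting by time is just an example.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop it in a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up stand-in for the tips dataset (illustrative values only)
tips_demo = pd.DataFrame({
    "total_bill": [10.0, 20.0, 30.0, 40.0],
    "tip": [1.5, 3.5, 4.0, 6.5],
    "time": ["Lunch", "Dinner", "Lunch", "Dinner"],
})

# Axes-level route: prepare the canvas, then draw one subset per ax
groups = list(tips_demo.groupby("time"))
fig, axes = plt.subplots(nrows=1, ncols=len(groups), sharey=True)
for ax, (label, subset) in zip(axes, groups):
    sns.scatterplot(data=subset, x="total_bill", y="tip", ax=ax)
    ax.set_title(label)

# FacetGrid route: seaborn creates and manages the grid of axes itself
grid = sns.FacetGrid(tips_demo, col="time")
grid.map_dataframe(sns.scatterplot, x="total_bill", y="tip")
```

The manual loop gives full control over each ax, while FacetGrid keeps axis sharing and titles consistent with less code.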

Ex: plotnine

  • Goal: What is the distribution of log-ins in time? (hodina)
  • How do we visualize variable distribution? histogram
  • What aesthetics does the plot have? x, group
  • What statistic do we need? count
  • What geom will we use to display the statistic? line, vline
  • Will we use any facets? (mesic)
(
    gg.ggplot(data = prihl, mapping = gg.aes(x="hodina", group="tyden"))
    + gg.geom_line(stat="count", alpha=.3)
    + gg.geom_vline(xintercept=[1, 8], color="firebrick", linetype="dashed", alpha=.1)
    + gg.scale_x_continuous(breaks=np.arange(0, 24, 6))
    + gg.facet_wrap("mesic")
)

6) Coordinates

  • The way the graph is projected onto the screen
  • Common coordinate systems
    • Cartesian
    • Polar
  • Adjust point-of-view
    • Zooming-in
    • Flipping
    • Fixing aspect ratio

Example:

Ex: plotnine - coord_flip

(
    prihl
    .assign(mesic = lambda d: pd.Categorical(d["mesic"], categories = list_mesicu, ordered=True))
    .pipe(gg.ggplot, mapping = gg.aes(x="mesic"))
    + gg.geom_histogram(stat="count")
)

(
    prihl
    .assign(mesic = lambda d: pd.Categorical(d["mesic"], categories = list_mesicu[::-1], ordered=True))
    .pipe(gg.ggplot, mapping = gg.aes(x="mesic"))
    + gg.geom_histogram(stat="count")
    + gg.coord_flip()
)

7) Theme

  • Represents all non-data ink
    • Labels, titles, …
    • Comments
    • Grid lines
    • Highlights
    • Background
  • Same Data->Aes->Stats->Geoms…, different feeling

Tip

Postprocessing might involve 3rd party tools as well.

Ex: plotnine - resolving night time

vsechny_casy = pd.DataFrame(
    [(tyden, hodina) for tyden in range(0, 52) for hodina in range(0, 24)], 
    columns=["tyden", "hodina"]
)

pocty_prihlaseni = (
    prihl
    .groupby(["hodina", "tyden"], as_index=False)
    .agg(n = ("df_v_konto_key", "count"))
    .merge(vsechny_casy, on=["hodina", "tyden"], how="right")
    .fillna({"n": 0})
)
(
    gg.ggplot(pocty_prihlaseni, mapping = gg.aes(x="hodina", y="n", group="tyden"))
    + gg.geom_line(stat="identity", alpha=.3)
    + gg.geom_vline(xintercept=[1, 8], color="firebrick", linetype="dashed")
    + gg.scale_x_continuous(breaks=np.arange(0, 24, 2))
)

Ex: plotnine - labels & comments

(
    gg.ggplot(pocty_prihlaseni, mapping = gg.aes(x="hodina", y="n", group="tyden"))
    + gg.geom_line(stat="identity", alpha=.3)
    + gg.geom_rect(data = lambda d: d.head(3), xmin=1, xmax=8, ymin=0, ymax=600, alpha=.1)
    + gg.geom_vline(xintercept=[1, 8], color="firebrick", linetype="dashed")
    + gg.annotate(label="Mandatory curfew", x=1.5, y=400, geom="text", angle=90, size=7, color="firebrick")
    + gg.scale_x_continuous(breaks=np.arange(0, 24, 2))
    + gg.labs(
        x = "Hour",
        y = "Number of games",
        title = "Distribution of weekly game counts",
        subtitle = "",
        caption = f"Hazard dataset, {prihl.shape[0]} records"
    )
)

Ex: plotnine - themes

  • Use theme_... to set up predefined styles
    • e.g., theme_light(), theme_dark(), theme_void(), ..
    • Tip: _ = gg.theme_set(gg.theme_light()) to set it globally
  • Use theme() to customize plot appearance
    • Syntax: <param> = gg.element_...()
    • Example: plot_title = gg.element_text(...)
(
    gg.ggplot(pocty_prihlaseni, mapping = gg.aes(x="hodina", y="n", group="tyden"))
    + gg.geom_line(stat="identity", alpha=.3)
    + gg.geom_rect(data = lambda d: d.head(3), xmin=1, xmax=8, ymin=0, ymax=600, alpha=.1)
    + gg.geom_vline(xintercept=[1, 8], color="firebrick", linetype="dashed")
    + gg.annotate(label="Mandatory curfew", x=1.5, y=400, geom="text", angle=90, size=7, color="firebrick")
    + gg.scale_x_continuous(breaks=np.arange(0, 24, 2))
    + gg.labs(
        x = "Hour",
        y = "Number of games",
        title = "Distribution of weekly game counts",
        subtitle = "",
        caption = f"Hazard dataset, {prihl.shape[0]} records"
    )
    + gg.theme_light()
    + gg.theme(
        plot_title=gg.element_text(face="bold"),
        plot_caption=gg.element_text(color="gray", size=7),
        panel_border=gg.element_rect(color="white"),
        panel_grid_minor_x= gg.element_blank()
    )
)

Ex: plotnine - final polishing

(
    gg.ggplot(pocty_prihlaseni, mapping = gg.aes(x="hodina", y="n", group="tyden"))
    + gg.geom_smooth(mapping=gg.aes(group=1), method="gpr", color="firebrick")
    + gg.geom_rect(data = lambda d: d.head(3), xmin=1, xmax=8, ymin=0, ymax=600, alpha=.05)
    + gg.geom_line(stat="identity", alpha=.2)
    + gg.annotate(label="Mandatory curfew", x=4.5, y=500, geom="text", size=7, color="gray")
    + gg.scale_x_continuous(breaks=np.arange(0, 24, 1))
    + gg.labs(
        x = "Game start hour [UTC]",
        y = "Number of games in the given week",
        title = "Distribution of weekly game counts",
        subtitle = "",
        caption = f"Hazard dataset - {prihl.shape[0]:,.0f} records".replace(",", " ")
    )
    + gg.theme_light()
    + gg.theme(
        plot_title=gg.element_text(face="bold"),
        plot_caption=gg.element_text(color="gray", size=7),
        panel_border=gg.element_rect(color="white"),
        panel_grid_minor_x= gg.element_blank()
    )
)

Ex: Theme - MS Excel

Plotnine recap - (reference)

  • ggplot() - constructor creating a new ggplot object
  • aes() - controls aesthetics mapping on global / layer level
    • scale_...() - controls aes appearance (e.g., transformation, breaks, labels)
      • e.g., scale_x_continuous, scale_color_manual
  • geom_...() - geometric layer
    • e.g., geom_point, geom_line, geom_histogram, …
  • stat_...() - controls variable -> aesthetic aggregation
    • e.g., stat_identity, stat_count, stat_density, …
  • facet_...() - variable -> facets (subplots)
    • e.g., facet_grid, facet_wrap, ..
  • theme() - controls overall appearance
    • theme_light(), theme_dark(), … - predefined themes
    • theme_set() - global setup
    • use element_...() to set up a given element
  • coord_...() - controls plotting coordinates

Reading GEO data

  • gpd.read_file()
    • ShapeFile: .shp
    • GeoJson: .geojson
    • GeoPackage: .gpkg
    • Geo databases engine
  • Input filters
    • bbox – bounding box (read only elements in a box)
    • mask – intersection filter (read only CZ elements)
    • rows – read only specific rows (e.g., rows=10, rows=slice(5,10))
    • columns – read only specific columns
    • ignore_geometry
  • Note: can handle various projections (.crs), too.

Reading GEO data – example

  • Where to get CZ GEO data?
    • CUZK ⬅️ we’ll be using this one.
    • OpenStreetMap via Geofabrik.de – POIs etc.
    • Other sources …
import geopandas as gpd

kraje = gpd.read_file(kraje_shp_file).set_index("NAZEV")
kraje.head(3)
KOD REGS_KOD NUTS3_KOD geometry
NAZEV
Hlavní město Praha 19 19 CZ010 POLYGON ((-746760.67 -1058142.71, -746757.65 -...
Středočeský kraj 27 27 CZ020 POLYGON ((-700345.18 -989088.81, -700328.64 -9...
Jihočeský kraj 35 35 CZ031 POLYGON ((-711379.92 -1136266.41, -711363.97 -...

geopandas’s cool tools

  • Work with gpd.GeoDataFrame as you work with pd.DataFrame
  • Besides, the geometry can provide (gdf = gpd.GeoDataFrame()):
    • Area measurements – gdf.area
    • Centroid location – gdf.centroid
    • Boundary polygon – gdf.boundary
    • Distance measurements – gdf.distance(other_gdf.geometry.iloc[0])
    • Checking geometry intersections – gdf.intersects(other_gdf)
    • Get convex hull geometry – gdf.convex_hull
  • Note: these work on the active geometry column
    • gdf.set_geometry(col_name)

geopandas’s cool tools – examples

kraje.centroid.head()
NAZEV
Hlavní město Praha    POINT (-739952.022 -1045726.481)
Středočeský kraj       POINT (-734798.177 -1053881.92)
Jihočeský kraj        POINT (-757095.136 -1152712.059)
Plzeňský kraj         POINT (-835551.971 -1084574.549)
Karlovarský kraj      POINT (-859350.657 -1015909.505)
dtype: geometry
kraje["rozloha"] = kraje.area/1e6
kraje["rozloha"]
NAZEV
Hlavní město Praha        496.176193
Středočeský kraj        10928.395151
Jihočeský kraj          10058.126861
Plzeňský kraj            7648.609450
Karlovarský kraj         3310.111472
Ústecký kraj             5338.749064
Liberecký kraj           3163.580958
Královéhradecký kraj     4759.092545
Pardubický kraj          4519.556466
Kraj Vysočina            6795.087814
Jihomoravský kraj        7185.859183
Olomoucký kraj           5271.469023
Moravskoslezský kraj     5430.528592
Zlínský kraj             3961.499796
Name: rozloha, dtype: float64
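
The remaining tools (distance, intersects, convex_hull, set_geometry) can be sketched on made-up squares instead of the regions. The coordinates live in an arbitrary planar system and the names A, B, C are hypothetical.

```python
import geopandas as gpd
from shapely.geometry import Point, box

# Made-up squares in an arbitrary planar coordinate system (illustrative only)
gdf = gpd.GeoDataFrame(
    {"name": ["A", "B", "C"]},
    geometry=[box(0, 0, 2, 2), box(1, 1, 3, 3), box(10, 0, 11, 1)],
).set_index("name")

areas = gdf.area                                   # area of each polygon
dist_to_origin = gdf.distance(Point(0, 0))         # distance to a single geometry
touches_b = gdf.intersects(gdf.geometry.loc["B"])  # element-wise intersection test
hulls = gdf.convex_hull                            # convex hull of each geometry

# All of the above work on the active geometry column; switch it when needed
gdf["center"] = gdf.centroid
by_center = gdf.set_geometry("center")
```

After set_geometry("center"), the same properties are computed from the points, so by_center.area is zero everywhere.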

Plotting

  • Alas, Seaborn can’t work with geodata directly 😞
  • Fortunately, GeoPandas provides a direct way to plot geometry
  • Use gdf.plot() (same as pandas’s df.plot())
    • Note: kind="geo" plots the map. There are other options as well.
    • See Docs for more details
kraje.plot()

Plotting

  • Map any variable to color via column

Warning

Watch out, it’s not hue anymore!

kraje.plot(column="rozloha")

Adjusting plots via matplotlib

  • You can plot on matplotlib’s canvas via ax param:
fig, axes = plt.subplots(1, 2, figsize=(15, 35))

kraje.plot(ax=axes[0])
axes[0].set_axis_off()
axes[0].set_title("Czech Republic")

kraje.plot(column="rozloha", ax=axes[1])
axes[1].set_axis_off()
axes[1].set_title("Area of Czech regions")

fig.show()

Adding content on geo plots

  • Just add other layers on the ax
  • Note: use zorder to control the layer order!
centroids = kraje.centroid
borders = kraje.boundary

fig, ax = plt.subplots(1, 1)

# geometry plot
kraje.plot(
    column="rozloha",
    alpha=0.3,
    cmap="OrRd",
    ax=ax,
    legend=True,
    legend_kwds={"label": "Area"},
    zorder=5,
)

# extra geometries
centroids.plot(ax=ax, color="red", markersize=2, zorder=2)
borders.plot(ax=ax, color="red", linewidth=0.1, zorder=3)
ax.set_axis_off()
ax.set_title("Regions and their centroids")

# extra matplotlib elements
prague_x = centroids["Hlavní město Praha"].x
prague_y = centroids["Hlavní město Praha"].y
ax.axvline(prague_x, linestyle="dotted", color="black")
ax.axhline(prague_y, linestyle="dotted", color="black")

fig.show()

Dealing with missings

  • Use the missing_kwds dict param to set color, edgecolor, label etc.
fig, axes = plt.subplots(1, 2, figsize=(15, 35))

kraje_wo_vysocina = kraje.assign(rozloha = lambda d: np.where(d.index == 'Kraj Vysočina', np.nan, d["rozloha"]))
kraje_wo_vysocina.plot(ax=axes[0])
kraje_wo_vysocina.plot(column="rozloha", missing_kwds=dict(color="red"), ax=axes[1])

for ax in axes:
    ax.set_axis_off()

fig.show()

Exploring

  • Even more, you can have interactive maps out-of-the-box!
  • How to:
    • Install Python dependencies – folium and mapclassify in my case
    • Use .explore() instead of .plot()
kraje.explore(column="rozloha")

Plotting with plotnine

  • It’s simple: gpd + geom_map()
(
    gg.ggplot(data=kraje) 
    + gg.geom_map(mapping=gg.aes(fill="rozloha")) 
    + gg.coord_equal()
    + gg.theme_void()
    + gg.labs(title = "Czech regions by area")
)

Exercise

Your turn

  • Load hazard dataset provided by @jhucin
  • Prepare charts to explore & answer the following questions:
    • Does the time distribution differ among week/weekend days?
    • Who are the 10 most active users?
    • Can we distinguish between regular and occasional players?
    • Does the time distribution differ among occasional & regular players?
    • How does individual [player/place] play time vary?
    • How does typical play time relate to the frequency of play?
    • How many places does a typical player visit?
    • What is the distribution of break times?
      • (Note: what about filtering out last break time in a session?)
  • [Optional] Extra tasks
    • Where in the Czech Republic are the players most active?
    • Does the play length differ among regions?
    • Are the players taking mandatory break times?
    • Are risky players more concentrated somewhere? (place/city/region)
    • Do the players have any favourite place?
    • Do the risky players behave differently? (play time, break time, frequency of play, …)
    • Riskiness of a place based on the riskiness of regular players at that place – are there any suspicious places? (avg / quantile of high-risk players / …)