# Visualisation in examples (Python – seaborn)

The examples below show how to draw graphs in Python with the pandas and seaborn packages. They require seaborn 0.12 or later.

Complete tutorials for pandas and seaborn are available at:

* [Pandas](https://pandas.pydata.org)
* [Seaborn general](https://seaborn.pydata.org)
* [Seaborn Objects](https://seaborn.pydata.org/tutorial/objects_interface.html)

First we import the packages, set up the environment, and read and adjust the data.

In [None]:
### Setup
%matplotlib inline
# render plots inline, without an explicit call to plt.show()

# Import libraries
import pandas as pd
import numpy as np
import seaborn as sns
import seaborn.objects as so
import matplotlib.pyplot as plt

# classes for special types
from pandas.api.types import CategoricalDtype

# Apply the default seaborn theme (pass e.g. font_scale=1.2 for a bigger font)
sns.set_theme()

# Reading and adjusting data
K = pd.read_csv("application_train.csv")

# random sample of 2000 rows
K = K.sample(n=2000, axis=0)

# sample of columns
K.columns = K.columns.str.lower() # column names to lowercase
K = K[['sk_id_curr', 'target', 'name_contract_type', 'code_gender', 'cnt_children', 'amt_credit',
 'occupation_type', 'name_education_type', 'days_birth', 'own_car_age', 'ext_source_1']]

## 1. General analysis

This was covered in the EDA practice.

## 2. Distributions of individual variables

We did an exploratory analysis by numerical means: frequencies, means, standard deviations, etc. Now we support it with visualisation.

The basic seaborn method for plotting the distribution of an individual variable is [displot](https://seaborn.pydata.org/tutorial/distributions.html). It can make plots for both categorical and numeric variables.

### 2.1 Categorical variable

Before we make a graph, let's make a frequency table (we combine absolute and relative frequencies).

In [None]:
# absolute and relative frequency table
freqtab = K.groupby("name_education_type").agg(count=("sk_id_curr", "count")) # absolute frequencies (counts)
freqtab["count_rel"] = freqtab["count"] / freqtab["count"].sum() # relative frequencies
freqtab

It is reasonable to make the variable `name_education_type` ordered. Then we can compute cumulative frequencies.

In [None]:
# making a new categorical ordered variable
cat_type = CategoricalDtype(categories=["Lower secondary", "Secondary / secondary special",
 "Incomplete higher", "Higher education"],
 ordered=True)
K["education"] = K["name_education_type"].astype(cat_type) # values not listed above become NaN

# frequency table
# observed=False keeps every listed category, even one with zero rows in the sample
freqtab = K.groupby("education", observed=False).agg(count=("sk_id_curr", "count")) # absolute frequencies (counts)
freqtab["count_cum"] = freqtab["count"].cumsum() # cumulative frequencies
freqtab["count_rel"] = freqtab["count"] / freqtab["count"].sum() # relative frequencies
freqtab["count_relcum"] = freqtab["count_rel"].cumsum() # cumulative relative frequencies
freqtab

The visualisation of frequencies is simple: we use a bar plot, either standard (bars side by side) or stacked (useful for cumulative frequencies). The universal method for a univariate distribution is *displot*. The variable name is assigned either to the *x* or to the *y* parameter; the bars are then vertical or horizontal, respectively.

In [None]:
# graphs for absolute and relative frequencies
# done directly from the DataFrame, no need to compute a frequency table
# horizontal bars because of long labels, but vertical bars are more common
g = sns.displot(data=K, y="education") # absolute freqs
g = sns.displot(data=K, y="education", stat='probability') # relative freqs - only the scale of the frequency axis differs

In [None]:
# countplot can be used for absolute freqs
g = sns.countplot(data=K, y='education') # absolute freqs

In [None]:
# for stacked barplot, we use frequency table computed above
freqtab["hlp"] = [""] * len(freqtab) # dummy variable, just for filling the seaborn parameter
# "education" is an alternative name for the index here
g = sns.displot(data=freqtab, x="hlp", hue="education", multiple="stack", weights="count_rel")

# for stacked absolute frequencies, use "count" instead of "count_rel"

To annotate the graph, we can use the *set* methods. For more fine-tuning (colors etc.), see the seaborn tutorial.

In [None]:
g = sns.displot(data=freqtab, x="hlp", hue="education", multiple="stack", weights="count_rel") \
 .set_axis_labels("Education", "Relative frequency") \
 .set(title="Distribution of education")

In [None]:
# similar graphs as above, but by Seaborn Objects
# a notebook cell renders only its last expression, so each plot is drawn explicitly with .show()
so.Plot(K, y="education").add(so.Bar(), so.Hist()).show()
so.Plot(K, y="education").add(so.Bar(), so.Hist(stat='probability')).show()
so.Plot(K, y="education", color='education').add(so.Bar(), so.Hist()).show()
so.Plot(freqtab, x="hlp", y="count_rel", color="education").add(so.Bar(), so.Stack()).show()
so.Plot(freqtab, x="hlp", y="count_rel", color="education") \
 .add(so.Bar(), so.Stack()) \
 .label(x="Education", y="Relative frequency", title="Distribution of education") \
 .show()

### 2.2 Numerical variable

Again, before plotting, let's compute some statistical characteristics of the variable `days_birth` (the client's age in days, stored with a minus sign, presumably because the days are counted backwards from a reference date).

In [None]:
# computing statistical characteristics of distribution
print("Min and max: ", "%.1f" % K["days_birth"].min(), "|", "%.1f" % K["days_birth"].max())
print("Mean: ", "%.1f" % K["days_birth"].mean())
print("Median: ", "%.1f" % K["days_birth"].median())
print("Std. dev.: ", "%.1f" % K["days_birth"].std())
# deciles
hlp_10s = [i/10.0 for i in range(0, 11)]
hlp_qs = K["days_birth"].quantile(hlp_10s)
print("Deciles:\n", hlp_qs)
# IQR
hlp_qs = K["days_birth"].quantile([0.25, 0.75])
print("IQR: ", hlp_qs.iloc[1]-hlp_qs.iloc[0])
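
The characteristics printed one by one above can also be obtained in a single call. A minimal sketch, assuming a synthetic series that stands in for `K["days_birth"]` (the random data and the chosen percentiles are illustrative assumptions, not part of the original dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for K["days_birth"]: negative ages in days
rng = np.random.default_rng(0)
days_birth = pd.Series(-rng.integers(7500, 25000, size=500), name="days_birth")

# describe() returns count, mean, std, min, the requested percentiles and max
summary = days_birth.describe(percentiles=[0.1, 0.25, 0.5, 0.75, 0.9])
print(summary.round(1))

# the IQR can be read off the same summary
iqr = summary["75%"] - summary["25%"]
print("IQR:", iqr)
```

With the real data, the same call would be `K["days_birth"].describe(percentiles=[...])`.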

We make a bunch of graphs with different levels of detail and smoothing. Many of them use the *displot* method, where the parameter *kind* changes the type of graph (ecdf, density, etc.) from the default, which is a histogram. Some graphs use the *catplot* method, because stripplot and swarmplot belong to that method rather than to displot.

If a variable is numeric but has few unique values, we can treat it as categorical. Note the *discrete* parameter, which adjusts the bar positions in the histogram.

In [None]:
### Numerical discrete variable
# treated as categorical
g = sns.displot(data=K, x="cnt_children") # not so pretty
g = sns.displot(data=K, x="cnt_children", discrete=True) # better adjusted bars

A continuous numeric variable can be plotted in many ways, depending on how much detail is required.

In [None]:
# histogram
g = sns.displot(data=K, x="days_birth", bins=5)
# for less smoothing, use bigger number of bins:
g = sns.displot(data=K, x="days_birth", bins=20)

In [None]:
# density with rug
g = sns.displot(data=K, x="days_birth", kind="kde", rug=True, fill=True, bw_adjust=1.5)
# for less smoothing, use a smaller value of bw_adjust:
g = sns.displot(data=K, x="days_birth", kind="kde", rug=True, fill=True, bw_adjust=0.5)

In [None]:
# ecdf with rug
g = sns.displot(data=K, x="days_birth", kind="ecdf", rug=True)
# one may plot an additional line to the graph
g = sns.displot(data=K, x="days_birth", kind="ecdf", rug=True) \
 .refline(y=0.25)

In [None]:
# the rug itself can be displayed via stripplot (directly or through catplot) or swarmplot,
# but these plots are unsuitable for big samples
g = sns.stripplot(data=K, x="days_birth", jitter=False, size=1)
# for less or no overlapping, use
g = sns.catplot(data=K, x="days_birth")
g = sns.catplot(data=K, x="days_birth", kind="swarm")
# solution - make a smaller sample
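
The "smaller sample" remedy mentioned in the cell above can be sketched as follows. This is only an illustration on synthetic data; the frame `toy`, the subsample size of 200 and the fixed `random_state` are assumptions chosen for the example:

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Hypothetical stand-in for K: 2000 negative ages in days
rng = np.random.default_rng(0)
toy = pd.DataFrame({"days_birth": -rng.integers(7500, 25000, size=2000)})

# draw a reproducible subsample so that the swarm has room to spread the points
small = toy.sample(n=200, random_state=0)

g = sns.catplot(data=small, x="days_birth", kind="swarm")
```

With the real data, `K.sample(n=200)` would play the role of `small`.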

In [None]:
# boxplot
g = sns.catplot(data=K, y="days_birth", kind="box")

## 3. Relationships of variables

The methods of analysis and plotting differ depending on the type (categorical or numerical) of both variables.

* If one of the variables is categorical, the basic strategy is to split the data into categories by this variable and to study the distribution of the other variable within each category (and to compare the distributions among the categories).
* If both variables are numerical, we use bivariate plots and compute statistics such as correlation.
* A numerical variable can also be binned, which turns it into a categorical variable.
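
Binning a numerical variable can be sketched with the pandas functions `qcut` (bins with equal frequencies) and `cut` (bins with given edges). The example below uses synthetic data; the column values, the bin edges and the labels are illustrative assumptions:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for K["amt_credit"]
rng = np.random.default_rng(0)
df = pd.DataFrame({"amt_credit": rng.uniform(50_000, 2_000_000, size=300)})

# equal-frequency bins: each category holds roughly a third of the rows
df["credit_band"] = pd.qcut(df["amt_credit"], q=3, labels=["low", "medium", "high"])

# bins with explicit (hypothetical) edges
edges = [0, 500_000, 1_000_000, np.inf]
df["credit_range"] = pd.cut(df["amt_credit"], bins=edges, labels=["<0.5M", "0.5-1M", ">1M"])

print(df["credit_band"].value_counts(sort=False))
print(df["credit_range"].value_counts(sort=False))
```

Both functions return an ordered categorical column, so the result can be used in the same way as `education` above.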

### 3.1 Categorical vs. categorical

In this case we usually compute a contingency table (2-D frequency table).

In [None]:
# contingency table with absolute frequencies
pd.crosstab(K["education"], K["code_gender"])

In [None]:
# for relative frequencies in contingency table, use parameter normalize:
pd.crosstab(K["education"], K["code_gender"], normalize="columns") # relative by columns

A contingency table, like a frequency table, can be visualised with some kind of bar plot. Bars can be:

* placed side by side
* stacked within each category as absolute counts
* stacked within each category as relative counts (each stacked bar sums up to 1)

In [None]:
# barplot with bars side by side
g = sns.displot(data=K, x="code_gender", hue="education", multiple="dodge")\
 .refline(x=0.5) # auxiliary line to split categories

In [None]:
# barplot with stacked bars as absolute counts
g = sns.displot(data=K, x="code_gender", hue="education", multiple="stack")

In [None]:
# barplot stacked as relative counts (sums up to 1)
# needs data preparation
hlp_df = pd.crosstab(K["education"], K["code_gender"], normalize="columns")
print(hlp_df)
# for plotting stacked barplot, we need to transform this "wide" format to "long" format
hlp_df.reset_index(inplace=True)
hlp_df = pd.melt(hlp_df, id_vars="education", var_name="code_gender", value_name="prop")
print(hlp_df)

g = sns.displot(data=hlp_df, x="code_gender", hue="education", multiple="stack", weights="prop")

Another idea is to make a *heatmap*: each cell of the contingency table is replaced by a color tone according to the value in the cell. This works well for absolute frequencies but may be confusing for relative ones.

In [None]:
# discrete heatmap
g = sns.displot(data=K, x="code_gender", y="education", cbar=True)
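
A heatmap can also be drawn from the contingency table itself, which makes it possible to print the counts inside the cells. A minimal sketch on a small made-up table (the frame `toy` and its values are assumptions for illustration only):

```python
import pandas as pd
import seaborn as sns

# Hypothetical toy data with two categorical columns
toy = pd.DataFrame({
    "education": ["Secondary", "Higher", "Secondary", "Higher", "Secondary", "Secondary"],
    "code_gender": ["F", "F", "M", "M", "F", "M"],
})

# contingency table with absolute frequencies
ct = pd.crosstab(toy["education"], toy["code_gender"])
print(ct)

# annot=True writes the value into each cell, fmt="d" formats it as an integer
ax = sns.heatmap(ct, annot=True, fmt="d", cbar=True)
```

With the real data, `ct` would be `pd.crosstab(K["education"], K["code_gender"])`.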

### 3.2 Categorical vs. numerical

We can split the data by the categorical variable, compute statistics by categories (e.g. mean, median or SD) and compare them. This is easy to compute with the pandas *groupby* and *agg* methods.

In [None]:
# statistics by categories
# it is possible to use user defined functions
def quartileH(x): # upper quartile
 return x.quantile(q=0.75)

K.groupby("code_gender").agg({"ext_source_1": ["mean", "median", "std", "max", "min", quartileH]})

There are many ways to split by categories when plotting:

* multiple lines (curves), possibly overlapping
* one axis used for categories (sections inside one graph), with a distribution graph drawn separately in each section
* the figure split into separate graphs

We can use either *displot* with the parameter *hue* or *col*, or *catplot* with the category variable as *x* (or *y*, if we want to split the graph horizontally). For plotting aggregated statistics we can use a *barplot*, which is available through *catplot* with `kind="bar"`.

In [None]:
# numeric vs. category as overlapping lines/curves
g = sns.displot(data=K, x="ext_source_1", hue="code_gender") # overlapping histograms
g = sns.displot(data=K, x="ext_source_1", hue="code_gender", kind="kde") # overlapping KDE

In [None]:
# numeric vs. category as separate graphs
g = sns.displot(data=K, x="ext_source_1", col="code_gender",
 stat="probability", common_norm=False) # separate histograms

In [None]:
# numeric vs. category as sections of one graph
g = sns.catplot(data=K, x="code_gender", y="ext_source_1") # stripplot
g = sns.catplot(data=K, x="code_gender", y="ext_source_1", kind="violin") # violinplot
g = sns.catplot(data=K.assign(temp=""), x="temp", y="ext_source_1", hue="code_gender", kind="violin", split=True)

In [None]:
# barplots with estimator by categories, boxplots
g = sns.catplot(data=K, x="code_gender", y="ext_source_1", kind="bar")
g = sns.catplot(data=K, x="code_gender", y="ext_source_1", kind="box")

### 3.3 Numerical vs. numerical

The basic statistic for this case is the correlation coefficient. For plotting we use the *relplot* or *displot* method, with two basic cases:

* for each x value there can be several observations – *scatterplot* (a cloud of points), heatmap, contourplot
* for each x value there is only one observation, or we want to aggregate over the y axis – *lineplot* (time series)

In [None]:
# correlation between two columns
print("Correlation: ", K["days_birth"].corr(K["amt_credit"]))

# correlation matrix between every pair of numerical variables
corrmat = K.corr(numeric_only=True)
corrmat
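
The `corr` method computes the Pearson coefficient by default, which measures linear association. As a side note, a rank-based (Spearman) coefficient can be requested with the `method` parameter. The sketch below uses made-up columns `x` and `y` with a monotone but non-linear relationship; the data are an assumption for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical data: y grows monotonically with x, but not linearly
toy = pd.DataFrame({"x": np.linspace(0, 5, 50)})
toy["y"] = np.exp(toy["x"])

pearson = toy.corr(method="pearson").loc["x", "y"]
spearman = toy.corr(method="spearman").loc["x", "y"]

print("Pearson: ", round(pearson, 3))   # below 1, the relationship is not linear
print("Spearman:", round(spearman, 3))  # the ranks agree perfectly
```

With the real data, the call would be `K.corr(numeric_only=True, method="spearman")`.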

In [None]:
# the correlation matrix can be plotted by heatmap
g = sns.heatmap(corrmat.round(2), annot=True)

In [None]:
g = sns.relplot(data=K, x="days_birth", y="ext_source_1") # scatterplot
g = sns.displot(data=K, x="days_birth", y="ext_source_1", cbar=True) # heatmap
g = sns.displot(data=K, x="days_birth", y="ext_source_1", kind="kde") # contourplot

In [None]:
# lineplot
hlptab = K.groupby('own_car_age').agg({'days_birth': "mean"})
g = sns.relplot(data=hlptab, x="own_car_age", y="days_birth", kind='line')

A scatterplot or contourplot can be combined with graphs of the individual distributions (histogram, density). This is what the *jointplot* method does.

In [None]:
# jointplot - both scatterplot and individual distributions
g = sns.jointplot(data=K, x="days_birth", y="ext_source_1")