# Visualisation (Python – seaborn)

## 2. Tasks for you

We will use the same data as above (file *application_train.csv* from *kaggle_home_credit.zip*), but a larger portion of it.

1. Read the file *application_train.csv* again and draw a random sample of 5 000 records from it.

In [None]:
### Setup
%matplotlib inline
# should enable plotting without an explicit call to .show()

# Import libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# classes for special types
from pandas.api.types import CategoricalDtype

# Apply the default theme (pass font_scale=... to set_theme for a bigger font)
sns.set_theme()

# Reading and adjusting data
K = pd.read_csv("application_train.csv")
K = K.sample(n=5000, axis=0) # random sample of 5000 records
K.columns = K.columns.str.lower() # column names to lowercase
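The sample above changes on every run, so the numbers and plots that follow will differ slightly each time. If reproducibility matters, a seed can be passed to `sample` through its `random_state` parameter. The sketch below uses a small synthetic data frame as a stand-in for the real file; the seed value 42 is an arbitrary choice.

```python
import pandas as pd

# Small synthetic stand-in for application_train.csv (hypothetical data)
df = pd.DataFrame({"SK_ID_CURR": range(100_000, 100_020)})

# The same random_state always selects the same rows
s1 = df.sample(n=5, random_state=42)
s2 = df.sample(n=5, random_state=42)

print(s1.index.equals(s2.index))
```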

2. Transform the data: *days_birth* -> *age*, *days_employed* -> *yrs_employed*, taking 1 year = 365.25 days. The original values are negative, so flip the sign to make them positive. If any values are still negative after the transformation, replace them with np.nan.

In [None]:
K["age"] = -K["days_birth"] / 365.25 
K["yrs_employed"] = -K["days_employed"] / 365.25
K["age"] = np.where(K["age"] < 0, np.nan, K["age"]) # cleaning from nonsense values
K["yrs_employed"] = np.where(K["yrs_employed"] < 0, np.nan, K["yrs_employed"]) # cleaning from nonsense values
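The same cleaning can also be written with `Series.where`, which keeps the result a pandas Series instead of going through a NumPy array. This is only a sketch on made-up values; the large positive number 365243 is assumed here to be the placeholder that the dataset uses for people without employment, which is why it turns negative after the sign flip.

```python
import numpy as np
import pandas as pd

# Hypothetical values; 365243 is assumed to be a "not employed" placeholder
days_employed = pd.Series([-3652.5, -730.5, 365243, 0])

yrs = -days_employed / 365.25
# Series.where keeps values where the condition holds and puts NaN elsewhere
yrs = yrs.where(yrs >= 0)
print(yrs.tolist())
```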

3. Explore the distribution of *age* using an ECDF, a density estimate, a histogram and a boxplot:
 - In the histogram, use bins of 5 years with sensible boundaries (e.g. 20-25 etc., see the parameter *bins*).
 - In the density estimate, limit the curve to the range of the variable (see the parameter *cut* in *kdeplot*). Try various amounts of smoothing.
 - For one graph (no matter which one), add neat annotation (a proper title, axis labels) and try changing the theme (*set_theme* method), the font size (*font_scale* in the *set* method) and the color (find out how yourself).

In [None]:
g = sns.displot(data=K, x="age", kind="ecdf")
g = sns.displot(data=K, x="age", kind="kde", cut=0, bw_adjust=0.8)
g = sns.catplot(data=K, y="age", kind="box")

sns.set_theme(style="whitegrid") # changing theme
g = sns.displot(data=K, x="age", bins=range(20, 75, 5), color="green") \
 .set_axis_labels("Age [years]", "Count") \
 .set(title="Distribution of applicants' age")
sns.set_theme() # changing theme back
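To check what `bins=range(20, 75, 5)` means without plotting, the bin edges can be passed to `np.histogram`; as far as I know, seaborn relies on the same NumPy binning internally, but treat that as an assumption. The ages below are made up for illustration.

```python
import numpy as np

# Hypothetical ages
ages = np.array([21.3, 24.9, 25.0, 38.2, 44.4, 69.0])

# Edges 20, 25, ..., 70, the same values as range(20, 75, 5)
edges = np.arange(20, 75, 5)
counts, _ = np.histogram(ages, bins=edges)

# Each bin is half-open [lo, hi), except the last one, which includes its upper edge
for lo, hi, c in zip(edges[:-1], edges[1:], counts):
    print(f"{lo}-{hi}: {c}")
```

Note that the last edge is 70, not 75, because `range` excludes its stop value; any age above 70 would fall outside the bins.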

4. Is the distribution of *age* closer to Gaussian, or is it skewed? Does the 2-sigma rule hold for it?

In [None]:
# In the histogram, the distribution looks like something between a uniform and a normal distribution,
# i.e. it is roughly symmetric and closer to Gaussian than to a skewed distribution.
# For the 2-sigma rule, let's compute the mean and SD ("sigma"):
age_mean = K["age"].mean()
age_std = K["age"].std()
print("Mean age: ", "%.1f" % age_mean)
print("SD age: ", "%.1f" % age_std)
print("Share of records within 2 sigma:", "%.3f" % np.mean(np.abs(K["age"] - age_mean) < 2*age_std))
# More of the data than expected lies within 2 sigma (a normal distribution gives about 95 %)
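For reference, the theoretical share within 2 sigma can be computed for the two shapes the histogram sits between. The sketch below uses only the standard library; the age range 20 to 70 is a hypothetical stand-in, and for the uniform case the result does not depend on it.

```python
import math

# Normal distribution: P(|X - mu| < 2 sigma) = erf(2 / sqrt(2))
share_normal = math.erf(2 / math.sqrt(2))

# Uniform distribution on [a, b]: sigma = (b - a) / sqrt(12),
# so 2 sigma is about 0.577 (b - a), which exceeds the half-width 0.5 (b - a)
a, b = 20.0, 70.0  # hypothetical age range
sigma_uniform = (b - a) / math.sqrt(12)
half_width = (b - a) / 2
share_uniform = min(1.0, 2 * sigma_uniform / half_width)

print(f"normal:  {share_normal:.4f}")
print(f"uniform: {share_uniform:.4f}")
```

A distribution that lies between the two shapes can therefore be expected to have a share somewhere between about 0.954 and 1.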

5. Explore the distribution of *cnt_fam_members*; treat it as an ordered categorical variable and make frequency table(s) and graphs.

In [None]:
# frequency table(s)
hlp_df = K.groupby("cnt_fam_members").agg(cnt_abs=("sk_id_curr", "count"))
hlp_df["cnt_cum"] = hlp_df["cnt_abs"].cumsum()
hlp_df["cnt_rel"] = hlp_df["cnt_abs"] / hlp_df["cnt_abs"].sum()
hlp_df["cnt_rel_cum"] = hlp_df["cnt_rel"].cumsum()
print(hlp_df)

# graphs
g = sns.displot(data=K, x="cnt_fam_members", discrete=True)
hlp_df["hlp"] = ""
g = sns.displot(data=hlp_df, x="hlp", hue="cnt_fam_members", multiple="stack", weights="cnt_rel")
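An alternative way to build the same frequency table is `value_counts` followed by `sort_index`, which avoids the need for an ID column to count. The family sizes below are made-up values, so this is a sketch of the technique rather than the result for the real data.

```python
import pandas as pd

# Hypothetical family sizes
fam = pd.Series([1, 2, 2, 3, 2, 1, 4, 2], name="cnt_fam_members")

# Absolute counts ordered by category value, then cumulative and relative versions
freq = fam.value_counts().sort_index().to_frame("cnt_abs")
freq["cnt_cum"] = freq["cnt_abs"].cumsum()
freq["cnt_rel"] = freq["cnt_abs"] / freq["cnt_abs"].sum()
freq["cnt_rel_cum"] = freq["cnt_rel"].cumsum()
print(freq)
```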

6. Explore the relationship between *flag_own_car*, *name_family_status*, *yrs_employed* and *ext_source_1* to answer the following questions:
 - What is the share of car owners in groups by family status? (Compute the owner shares as decimal numbers and plot them by category.)
 - Plot *ext_source_1* and *yrs_employed*, first together and then distinguishing car ownership as a category. (Hint: putting one of the axes on a log scale may help.)
 - What are the distributions of *ext_source_1* in groups by family status (make a plot)? Which statistics describe these distributions well? Compute them for each group.
 - Do the same for *yrs_employed* instead of *ext_source_1*. Do we use the same or different statistics to describe the distribution of *yrs_employed*? Again, compute them.

In [None]:
### Share of car owners by family status
# How is *flag_own_car* encoded?
print("Unique values of flag_own_car:")
print(K["flag_own_car"].unique()) # it is a string variable with values "Y", "N"

# if *flag_own_car* were 0/1 or True/False variable, a share of positive values (1's, True's)
# could be computed as the mean
# so we convert *flag_own_car* to True/False variable

# we can either make a new column
K["flag_own_car2"] = (K["flag_own_car"]=="Y")
hlp_df = K.groupby("name_family_status").agg(owner_share=("flag_own_car2", "mean"))
print(hlp_df)
# or make the conversion "on the fly"
hlp_df = K.assign(temp=K["flag_own_car"]=="Y") \
 .groupby("name_family_status").agg(owner_share=("temp", "mean"))
print(hlp_df)

# another way is to make a contingency table and compute relative frequencies within each category
hlp_df = pd.crosstab(K["name_family_status"], K["flag_own_car"], normalize="index")
print(hlp_df["Y"])

# for plotting, we can use barplot showing means for categories
# conversion of flag_own_car is made "on the fly"
g = sns.catplot(data=K.assign(temp=K["flag_own_car"]=="Y"),
 y="name_family_status", x="temp", kind="bar", errorbar=None) \
 .set_axis_labels("Share of car owners", "Family status")
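The two approaches used above (the mean of a True/False variable and a row-normalized contingency table) should give identical shares. The following sketch checks this on a tiny made-up data frame; the family status labels are illustrative only.

```python
import pandas as pd

# Hypothetical records
df = pd.DataFrame({
    "name_family_status": ["Married", "Married", "Single", "Single", "Single", "Widow"],
    "flag_own_car": ["Y", "N", "N", "N", "Y", "N"],
})

# Share of owners as the mean of a boolean variable
by_mean = df["flag_own_car"].eq("Y").groupby(df["name_family_status"]).mean()

# Share of owners as the "Y" column of a row-normalized contingency table
by_crosstab = pd.crosstab(df["name_family_status"], df["flag_own_car"],
                          normalize="index")["Y"]

print(pd.DataFrame({"mean": by_mean, "crosstab": by_crosstab}))
```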

In [None]:
### Analyzing ext_source_1 and yrs_employed
g = sns.relplot(data=K, x="ext_source_1", y="yrs_employed") # too many points overlapping
g = sns.displot(data=K, x="ext_source_1", y="yrs_employed", cbar=True) # a bit better
g = sns.relplot(data=K, x="ext_source_1", y="yrs_employed").set(yscale="log") # a log scale on the Y axis is useful, too
# bad idea: g = sns.displot(data=K, x="ext_source_1", y="yrs_employed", cbar=True).set(yscale="log")
g = sns.displot(data=K.assign(yrs_log=np.log10(K["yrs_employed"])),
 x="ext_source_1", y="yrs_log", cbar=True)
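One caveat with taking the logarithm by hand: `np.log10` returns `-inf` for zero, and *yrs_employed* may contain zeros if some applicants started their job on the application day (an assumption about the data, not something verified here). A possible guard, sketched on made-up values, is to blank out non-positive values first.

```python
import numpy as np
import pandas as pd

yrs = pd.Series([0.0, 0.5, 10.0, np.nan])  # hypothetical values

# Keep only strictly positive values; zeros and NaN become NaN before the log
yrs_log = np.log10(yrs.where(yrs > 0))
print(yrs_log.tolist())
```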

In [None]:
# and when splitting by flag_own_car...
g = sns.relplot(data=K, x="ext_source_1", y="yrs_employed", hue="flag_own_car")
g = sns.relplot(data=K, x="ext_source_1", y="yrs_employed", hue="flag_own_car").set(yscale="log")

In [None]:
### Distribution of ext_source_1 in groups by family status
# histograms
# ugly: g = sns.displot(data=K, x="ext_source_1", hue="name_family_status", stat="probability", common_norm=False)
g = sns.displot(data=K, x="ext_source_1", hue="name_family_status",
 multiple="dodge", stat="probability", common_norm=False) # too many bars
# common_norm is False here because we want to normalize for each category, not overall
# the best is to make separate graphs for each category
g = sns.displot(data=K, x="ext_source_1", hue="name_family_status", bins=[i/10.0 for i in range(0, 11)],
 col="name_family_status", stat="probability", common_norm=False)

# density estimates are better suited than histograms to sharing one graph
g = sns.displot(data=K, x="ext_source_1", hue="name_family_status", kind="kde", common_norm=False)

# boxplots
g = sns.catplot(data=K, x="name_family_status", y="ext_source_1", kind="box")

# and statistics - looks Gaussian, so means and SDs by category are a good idea
K.groupby("name_family_status").agg({"ext_source_1": ["mean", "std", "count"]})

In [None]:
### Distribution of yrs_employed in groups by family status
# histograms
# with the experience from above, let's just make separate graphs for each category
g = sns.displot(data=K, x="yrs_employed", hue="name_family_status",
 col="name_family_status", stat="probability", common_norm=False)

# density estimates are better suited than histograms to sharing one graph
g = sns.displot(data=K, x="yrs_employed", hue="name_family_status", kind="kde", common_norm=False, cut=0)
# cut at 0 - lower values cannot appear

# boxplots
g = sns.catplot(data=K, x="name_family_status", y="yrs_employed", kind="box")

# and statistics - looks skewed, so we will use median and quartiles instead of mean and SD
def quartile_l(x):
 return np.nanquantile(x, q=0.25)

def quartile_h(x):
 return np.nanquantile(x, q=0.75)

K.groupby("name_family_status").agg(emp_median=("yrs_employed", "median"), # the "median" alias skips NaN values
 emp_ql=("yrs_employed", quartile_l),
 emp_qh=("yrs_employed", quartile_h),
 emp_count=("yrs_employed", "count"))
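The helper functions can be avoided by calling `quantile` on the grouped column with a list of probabilities; like the pandas median, it skips missing values. The sketch below runs on made-up numbers, and the output column names are simply chosen to match the ones used above.

```python
import numpy as np
import pandas as pd

# Hypothetical records, including one missing value
df = pd.DataFrame({
    "name_family_status": ["Married"] * 5 + ["Single"] * 4,
    "yrs_employed": [1.0, 2.0, 3.0, 4.0, np.nan, 0.5, 1.5, 2.5, 10.0],
})

# One row per group, one column per requested quantile
stats = (df.groupby("name_family_status")["yrs_employed"]
           .quantile([0.25, 0.5, 0.75])
           .unstack())
stats.columns = ["emp_ql", "emp_median", "emp_qh"]
print(stats)
```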

7. Make a plot of *age* distribution for grouping by *code_gender* and *cnt_children* (together, i. e. nested grouping).

In [None]:
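# split=True is meant for a hue with two levels; older seaborn versions raise an error if code_gender has more
g = sns.catplot(data=K, x="cnt_children", y="age", hue="code_gender", kind="violin", split=True)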
g = sns.catplot(data=K, x="cnt_children", y="age", hue="code_gender", kind="box")