{ "cells": [ { "cell_type": "markdown", "id": "bf4f17ca", "metadata": {}, "source": [ "# Visualisation in examples (Python – seaborn)" ] }, { "cell_type": "markdown", "id": "acab7550", "metadata": {}, "source": [ "Here are examples of graphs made in Python with the pandas and seaborn packages. Make sure you have seaborn version 0.12 or newer.\n", "\n", "Complete tutorials for pandas and seaborn can be found at these links:\n", "\n", "* [Pandas](https://pandas.pydata.org)\n", "* [Seaborn general](https://seaborn.pydata.org)\n", "* [Seaborn Objects](https://seaborn.pydata.org/tutorial/objects_interface.html)" ] }, { "cell_type": "markdown", "id": "c6060570", "metadata": {}, "source": [ "First we load the packages, set up the environment, and read and adjust the data." ] }, { "cell_type": "code", "execution_count": null, "id": "ea2bf1de", "metadata": {}, "outputs": [], "source": [ "### Setup\n", "%matplotlib inline\n", "# enables inline plotting without an explicit call to .show()\n", "\n", "# Import libraries\n", "import pandas as pd\n", "import numpy as np\n", "import seaborn as sns\n", "import seaborn.objects as so\n", "import matplotlib.pyplot as plt\n", "\n", "# classes for special types\n", "from pandas.api.types import CategoricalDtype\n", "\n", "# Apply the default seaborn theme\n", "sns.set_theme()\n", "\n", "# Reading and adjusting data\n", "K = pd.read_csv(\"application_train.csv\")\n", "\n", "# random sample of 2000 rows\n", "K = K.sample(n=2000, axis=0)\n", "\n", "# selection of columns\n", "K.columns = K.columns.str.lower() # column names to lowercase\n", "K = K[['sk_id_curr', 'target', 'name_contract_type', 'code_gender', 'cnt_children', 'amt_credit',\n", "       'occupation_type', 'name_education_type', 'days_birth', 'own_car_age', 'ext_source_1']]" ] }, { "cell_type": "markdown", "id": "522875cc", "metadata": {}, "source": [ "## 1. General analysis\n", "\n", "This was part of the EDA practice." ] }, { "cell_type": "markdown", "id": "0892af80", "metadata": {}, "source": [ "## 2. 
Distributions of individual variables\n", "\n", "We did an exploratory analysis by numerical means: frequencies, means, standard deviations etc. Now we support it by visualisation.\n", "\n", "The basic seaborn method for plotting the distribution of a single variable is [displot](https://seaborn.pydata.org/tutorial/distributions.html). It can make plots for both categorical and numeric variables." ] }, { "cell_type": "markdown", "id": "e0305cc3", "metadata": {}, "source": [ "### 2.1 Categorical variable\n", "\n", "Before we make a graph, let's make a frequency table (we combine absolute and relative frequencies)." ] }, { "cell_type": "code", "execution_count": null, "id": "2dfe3f07", "metadata": {}, "outputs": [], "source": [ "# absolute and relative frequency table\n", "freqtab = K.groupby(\"name_education_type\").agg(count=(\"sk_id_curr\", \"count\")) # absolute frequencies (counts)\n", "freqtab[\"count_rel\"] = freqtab[\"count\"] / freqtab[\"count\"].sum() # relative frequencies\n", "freqtab" ] }, { "cell_type": "markdown", "id": "e002deb6", "metadata": {}, "source": [ "It is reasonable to make the variable `name_education_type` ordered. Then we can compute cumulative frequencies." 
] }, { "cell_type": "code", "execution_count": null, "id": "e6011dea", "metadata": {}, "outputs": [], "source": [ "# making a new ordered categorical variable\n", "# (values not listed in the categories become NaN)\n", "cat_type = CategoricalDtype(categories=[\"Lower secondary\", \"Secondary / secondary special\",\n", "                                        \"Incomplete higher\", \"Higher education\"],\n", "                            ordered=True)\n", "K[\"education\"] = K[\"name_education_type\"].astype(cat_type)\n", "\n", "# frequency table\n", "freqtab = K.groupby(\"education\", observed=False).agg(count=(\"sk_id_curr\", \"count\")) # absolute frequencies (counts)\n", "freqtab[\"count_cum\"] = freqtab[\"count\"].cumsum() # cumulative frequencies\n", "freqtab[\"count_rel\"] = freqtab[\"count\"] / freqtab[\"count\"].sum() # relative frequencies\n", "freqtab[\"count_relcum\"] = freqtab[\"count_rel\"].cumsum() # cumulative relative frequencies\n", "freqtab" ] }, { "cell_type": "markdown", "id": "546133dd", "metadata": {}, "source": [ "The visualisation of frequencies is simple: we use a barplot, either standard (bars side by side) or stacked (useful for cumulative frequencies). The universal method for a univariate distribution is *displot*. The variable name is assigned to either the *x* or the *y* parameter; the bars are then vertical or horizontal, respectively." 
] }, { "cell_type": "code", "execution_count": null, "id": "83e28ff9", "metadata": {}, "outputs": [], "source": [ "# graphs for absolute and relative frequencies\n", "# done directly from the DataFrame, no need to compute a frequency table\n", "# horizontal bars because of long labels, but vertical bars are more common\n", "g = sns.displot(data=K, y=\"education\") # absolute freqs\n", "g = sns.displot(data=K, y=\"education\", stat='probability') # relative freqs - only the scale of the frequency axis differs" ] }, { "cell_type": "code", "execution_count": null, "id": "89b571af", "metadata": {}, "outputs": [], "source": [ "# countplot can be used for absolute freqs\n", "g = sns.countplot(data=K, y='education') # absolute freqs" ] }, { "cell_type": "code", "execution_count": null, "id": "15bb487e", "metadata": {}, "outputs": [], "source": [ "# for a stacked barplot, we use the frequency table computed above\n", "freqtab[\"hlp\"] = [\"\"] * len(freqtab) # dummy variable, just for filling the seaborn parameter\n", "# \"education\" is the name of the index of freqtab; seaborn accepts it as a variable name\n", "g = sns.displot(data=freqtab, x=\"hlp\", hue=\"education\", multiple=\"stack\", weights=\"count_rel\")\n", "\n", "# for stacked absolute frequencies, use \"count\" instead of \"count_rel\"" ] }, { "cell_type": "markdown", "id": "4ab8a647", "metadata": {}, "source": [ "If we want to annotate the graph, we can use the *set* methods. For more fine-tuning (colors etc.), see the seaborn tutorial." 
] }, { "cell_type": "code", "execution_count": null, "id": "73ed3e9c", "metadata": {}, "outputs": [], "source": [ "g = sns.displot(data=freqtab, x=\"hlp\", hue=\"education\", multiple=\"stack\", weights=\"count_rel\") \\\n", "    .set_axis_labels(\"Education\", \"Relative frequency\") \\\n", "    .set(title=\"Distribution of education\")" ] }, { "cell_type": "code", "execution_count": null, "id": "0115fcb7", "metadata": {}, "outputs": [], "source": [ "# similar graphs as above, but by Seaborn Objects\n", "# .show() is called on each plot because a notebook cell renders only its last expression\n", "so.Plot(K, y=\"education\").add(so.Bar(), so.Hist()).show()\n", "so.Plot(K, y=\"education\").add(so.Bar(), so.Hist(stat='probability')).show()\n", "so.Plot(K, y=\"education\", color='education').add(so.Bar(), so.Hist()).show()\n", "so.Plot(freqtab, x=\"hlp\", y=\"count_rel\", color=\"education\").add(so.Bar(), so.Stack()).show()\n", "so.Plot(freqtab, x=\"hlp\", y=\"count_rel\", color=\"education\") \\\n", "    .add(so.Bar(), so.Stack()) \\\n", "    .label(x=\"Education\", y=\"Relative frequency\", title=\"Distribution of education\") \\\n", "    .show()" ] }, { "cell_type": "markdown", "id": "02b232c4", "metadata": {}, "source": [ "### 2.2 Numerical variable\n", "\n", "Again, before we start plotting, let's calculate some statistical characteristics of the variable `days_birth` (the client's age in days, stored with a minus sign, presumably because it is counted backwards from the application date)." ] }, { "cell_type": "code", "execution_count": null, "id": "6e4b11a2", "metadata": {}, "outputs": [], "source": [ "# computing statistical characteristics of distribution\n", "print(\"Min and max: \", \"%.1f\" % K[\"days_birth\"].min(), \"|\", \"%.1f\" % K[\"days_birth\"].max())\n", "print(\"Mean: \", \"%.1f\" % K[\"days_birth\"].mean())\n", "print(\"Median: \", \"%.1f\" % K[\"days_birth\"].median())\n", "print(\"Std. 
dev.: \", \"%.1f\" % K[\"days_birth\"].std())\n", "# deciles\n", "hlp_10s = [i/10.0 for i in range(0, 11)]\n", "hlp_qs = K[\"days_birth\"].quantile(hlp_10s)\n", "print(\"Deciles:\\n\", hlp_qs)\n", "# IQR\n", "hlp_qs = K[\"days_birth\"].quantile([0.25, 0.75])\n", "print(\"IQR: \", hlp_qs.iloc[1]-hlp_qs.iloc[0])" ] }, { "cell_type": "markdown", "id": "49de6aa5", "metadata": {}, "source": [ "We make a bunch of graphs with different levels of detail and smoothing. Many of them use the *displot* method, whose *kind* parameter changes the type of graph (ecdf, density etc.) from the default, which is a histogram. Some graphs use the *catplot* method because stripplot and swarmplot belong to that method, not to displot.\n", "\n", "If the variable is numeric but has only a few unique values, we can treat it as categorical. Note the *discrete* parameter, which adjusts the bar positions in the histogram." ] }, { "cell_type": "code", "execution_count": null, "id": "ff9b6d68", "metadata": { "scrolled": true }, "outputs": [], "source": [ "### Numerical discrete variable\n", "# treated as categorical\n", "g = sns.displot(data=K, x=\"cnt_children\") # not so pretty\n", "g = sns.displot(data=K, x=\"cnt_children\", discrete=True) # better adjusted bars" ] }, { "cell_type": "markdown", "id": "494cf815", "metadata": {}, "source": [ "A continuous numeric variable can be plotted in many ways, depending on how much detail is required." 
] }, { "cell_type": "code", "execution_count": null, "id": "4ec2d5ff", "metadata": {}, "outputs": [], "source": [ "# histogram\n", "g = sns.displot(data=K, x=\"days_birth\", bins=5)\n", "# for less smoothing, use a larger number of bins:\n", "g = sns.displot(data=K, x=\"days_birth\", bins=20)" ] }, { "cell_type": "code", "execution_count": null, "id": "66e495d1", "metadata": {}, "outputs": [], "source": [ "# density with rug\n", "g = sns.displot(data=K, x=\"days_birth\", kind=\"kde\", rug=True, fill=True, bw_adjust=1.5)\n", "# for less smoothing, use a smaller bandwidth multiplier (bw_adjust):\n", "g = sns.displot(data=K, x=\"days_birth\", kind=\"kde\", rug=True, fill=True, bw_adjust=0.5)" ] }, { "cell_type": "code", "execution_count": null, "id": "ccf9b884", "metadata": {}, "outputs": [], "source": [ "# ecdf with rug\n", "g = sns.displot(data=K, x=\"days_birth\", kind=\"ecdf\", rug=True)\n", "# one may add a reference line to the graph\n", "g = sns.displot(data=K, x=\"days_birth\", kind=\"ecdf\", rug=True) \\\n", "    .refline(y=0.25)" ] }, { "cell_type": "code", "execution_count": null, "id": "c8024b05", "metadata": {}, "outputs": [], "source": [ "# the rug itself can be displayed via stripplot or swarmplot (both also available through catplot)\n", "# but these are unsuitable for big samples\n", "g = sns.stripplot(data=K, x=\"days_birth\", jitter=False, size=1)\n", "# for less or no overlapping, use\n", "g = sns.catplot(data=K, x=\"days_birth\")\n", "g = sns.catplot(data=K, x=\"days_birth\", kind=\"swarm\")\n", "# solution - make a smaller sample" ] }, { "cell_type": "code", "execution_count": null, "id": "5812c121", "metadata": { "scrolled": true }, "outputs": [], "source": [ "# boxplot\n", "g = sns.catplot(data=K, y=\"days_birth\", kind=\"box\")" ] }, { "cell_type": "markdown", "id": "93c6b1d2", "metadata": {}, "source": [ "## 3. Relationships of variables\n", "\n", "Methods for analysis and plotting differ depending on the type (categorical or numerical) of both variables. 
\n", "\n", "* If one of the variables is categorical, the basic strategy is to split the data into categories by this variable and to study the distribution of the other variable within each category (and to compare the distributions across categories).\n", "* If both variables are numerical, we use bivariate plots and compute statistics such as correlation.\n", "* A numerical variable can also be binned to obtain a categorical variable." ] }, { "cell_type": "markdown", "id": "d01de385", "metadata": {}, "source": [ "### 3.1 Categorical vs. categorical\n", "\n", "In this case we usually compute a contingency table (2-D frequency table)." ] }, { "cell_type": "code", "execution_count": null, "id": "921e6914", "metadata": {}, "outputs": [], "source": [ "# contingency table with absolute frequencies\n", "pd.crosstab(K[\"education\"], K[\"code_gender\"])" ] }, { "cell_type": "code", "execution_count": null, "id": "ef30d3ff", "metadata": {}, "outputs": [], "source": [ "# for relative frequencies in contingency table, use parameter normalize:\n", "pd.crosstab(K[\"education\"], K[\"code_gender\"], normalize=\"columns\") # relative by columns" ] }, { "cell_type": "markdown", "id": "dab1e664", "metadata": {}, "source": [ "A contingency table, like a frequency table, can be visualised by some kind of barplot. 
Bars can be:\n", "\n", "* placed side by side\n", "* stacked within each category as absolute counts\n", "* stacked within each category as relative counts (each stacked bar sums up to 1)" ] }, { "cell_type": "code", "execution_count": null, "id": "e6a9d1cc", "metadata": {}, "outputs": [], "source": [ "# barplot with bars side by side\n", "g = sns.displot(data=K, x=\"code_gender\", hue=\"education\", multiple=\"dodge\")\\\n", "    .refline(x=0.5) # auxiliary line to split categories" ] }, { "cell_type": "code", "execution_count": null, "id": "59b48b3b", "metadata": {}, "outputs": [], "source": [ "# barplot with stacked bars as absolute counts\n", "g = sns.displot(data=K, x=\"code_gender\", hue=\"education\", multiple=\"stack\")" ] }, { "cell_type": "code", "execution_count": null, "id": "eba6fe5e", "metadata": {}, "outputs": [], "source": [ "# barplot stacked as relative counts (sums up to 1)\n", "# needs data preparation\n", "hlp_df = pd.crosstab(K[\"education\"], K[\"code_gender\"], normalize=\"columns\")\n", "print(hlp_df)\n", "# for plotting stacked barplot, we need to transform this \"wide\" format to \"long\" format\n", "hlp_df.reset_index(inplace=True)\n", "hlp_df = pd.melt(hlp_df, id_vars=\"education\", var_name=\"code_gender\", value_name=\"prop\")\n", "print(hlp_df)\n", "\n", "g = sns.displot(data=hlp_df, x=\"code_gender\", hue=\"education\", multiple=\"stack\", weights=\"prop\")" ] }, { "cell_type": "markdown", "id": "d5abb212", "metadata": {}, "source": [ "Another idea is to make a *heatmap*: replace each cell of the contingency table by a color tone according to the value in the cell. This works well for absolute frequencies but may be confusing for relative ones." 
] }, { "cell_type": "code", "execution_count": null, "id": "a73cd696", "metadata": {}, "outputs": [], "source": [ "# discrete heatmap\n", "g = sns.displot(data=K, x=\"code_gender\", y=\"education\", cbar=True)" ] }, { "cell_type": "markdown", "id": "6e4f817a", "metadata": {}, "source": [ "### 3.2 Categorical vs. numerical\n", "\n", "We can split the data by the categorical variable, compute statistics such as the mean, median or SD for each category, and compare them. This is easy with the pandas *groupby* and *agg* methods." ] }, { "cell_type": "code", "execution_count": null, "id": "1bf6630f", "metadata": {}, "outputs": [], "source": [ "# statistics by categories\n", "# it is possible to use user-defined functions\n", "def quartileH(x): # upper quartile\n", "    return x.quantile(q=0.75)\n", "\n", "K.groupby(\"code_gender\").agg({\"ext_source_1\": [\"mean\", \"median\", \"std\", \"max\", \"min\", quartileH]})" ] }, { "cell_type": "markdown", "id": "b76ac0e9", "metadata": {}, "source": [ "There are several ways to split by categories when plotting:\n", "\n", "* multiple lines (curves), possibly overlapping\n", "* use one axis for categories (sections inside one graph), with a separate distribution graph in each section\n", "* split the figure into separate graphs\n", "\n", "We can use either *displot* with the parameters *hue* or *col*, or *catplot* with the category variable as *x* (or *y*, if we want to split the graph horizontally). For plotting aggregated statistics we can use a *barplot*, which is available through the *catplot* method." ] }, { "cell_type": "code", "execution_count": null, "id": "c449adab", "metadata": {}, "outputs": [], "source": [ "# numeric vs. 
category as overlapping lines/curves\n", "g = sns.displot(data=K, x=\"ext_source_1\", hue=\"code_gender\") # overlapping histograms\n", "g = sns.displot(data=K, x=\"ext_source_1\", hue=\"code_gender\", kind=\"kde\") # overlapping KDE" ] }, { "cell_type": "code", "execution_count": null, "id": "90c08922", "metadata": {}, "outputs": [], "source": [ "# numeric vs. category as separate graphs\n", "g = sns.displot(data=K, x=\"ext_source_1\", col=\"code_gender\",\n", "                stat=\"probability\", common_norm=False) # separate histograms" ] }, { "cell_type": "code", "execution_count": null, "id": "027ca328", "metadata": {}, "outputs": [], "source": [ "# numeric vs. category as sections of one graph\n", "g = sns.catplot(data=K, x=\"code_gender\", y=\"ext_source_1\") # stripplot\n", "g = sns.catplot(data=K, x=\"code_gender\", y=\"ext_source_1\", kind=\"violin\") # violinplot\n", "g = sns.catplot(data=K.assign(temp=\"\"), x=\"temp\", y=\"ext_source_1\", hue=\"code_gender\", kind=\"violin\", split=True)" ] }, { "cell_type": "code", "execution_count": null, "id": "a14191c4", "metadata": {}, "outputs": [], "source": [ "# barplot of an estimator (the mean by default) by categories, and boxplots\n", "g = sns.catplot(data=K, x=\"code_gender\", y=\"ext_source_1\", kind=\"bar\")\n", "g = sns.catplot(data=K, x=\"code_gender\", y=\"ext_source_1\", kind=\"box\")" ] }, { "cell_type": "markdown", "id": "83b72bda", "metadata": {}, "source": [ "### 3.3 Numerical vs. numerical\n", "\n", "The basic statistic for this case is the correlation coefficient. 
For plotting we use the *relplot* or *displot* method, with two basic cases:\n", "\n", "* for each x value there can be several observations – *scatterplot* (a cloud of points), heatmap, contourplot\n", "* for each x value there is only one observation, or we want to aggregate over the y axis – *lineplot* (time series)" ] }, { "cell_type": "code", "execution_count": null, "id": "400f1398", "metadata": {}, "outputs": [], "source": [ "# correlation between two columns\n", "print(\"Correlation: \", K[\"days_birth\"].corr(K[\"amt_credit\"]))\n", "\n", "# correlation matrix between every pair of numerical variables\n", "corrmat = K.corr(numeric_only=True)\n", "corrmat" ] }, { "cell_type": "code", "execution_count": null, "id": "bf390edf", "metadata": {}, "outputs": [], "source": [ "# the correlation matrix can be plotted as a heatmap\n", "g = sns.heatmap(corrmat.round(2), annot=True)" ] }, { "cell_type": "code", "execution_count": null, "id": "6b70ed52", "metadata": {}, "outputs": [], "source": [ "g = sns.relplot(data=K, x=\"days_birth\", y=\"ext_source_1\") # scatterplot\n", "g = sns.displot(data=K, x=\"days_birth\", y=\"ext_source_1\", cbar=True) # heatmap\n", "g = sns.displot(data=K, x=\"days_birth\", y=\"ext_source_1\", kind=\"kde\") # contourplot" ] }, { "cell_type": "code", "execution_count": null, "id": "26d32cb5", "metadata": {}, "outputs": [], "source": [ "# lineplot of the mean days_birth for each value of own_car_age\n", "hlptab = K.groupby('own_car_age').agg({'days_birth': \"mean\"})\n", "g = sns.relplot(data=hlptab, x=\"own_car_age\", y=\"days_birth\", kind='line')" ] }, { "cell_type": "markdown", "id": "ff1ff38c", "metadata": {}, "source": [ "A scatterplot or contourplot can be combined with graphs of the individual distributions (histogram, density). The *jointplot* method does this." 
] }, { "cell_type": "code", "execution_count": null, "id": "f9cab4fc", "metadata": {}, "outputs": [], "source": [ "# jointplot - both scatterplot and individual distributions\n", "g = sns.jointplot(data=K, x=\"days_birth\", y=\"ext_source_1\")" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.18" } }, "nbformat": 4, "nbformat_minor": 5 }