{ "cells": [ { "cell_type": "markdown", "id": "bf4f17ca", "metadata": {}, "source": [ "# Visualisation (Python – seaborn)" ] }, { "cell_type": "markdown", "id": "869d2b82", "metadata": {}, "source": [ "## 2. Tasks for you\n", "\n", "We will use the same data as above (file *application_train.csv* from *kaggle_home_credit.zip*) but bigger volume of it.\n", "\n", "1. Read file *application_train.csv* again and make from it a random sample of 5 000 records." ] }, { "cell_type": "code", "execution_count": null, "id": "2890db4e", "metadata": {}, "outputs": [], "source": [ "### Setup\n", "%matplotlib inline\n", "# should enable plotting without explicit call .show()\n", "\n", "# Import libraries\n", "import pandas as pd\n", "import numpy as np\n", "import seaborn as sns\n", "import matplotlib.pyplot as plt\n", "\n", "# classes for special types\n", "from pandas.api.types import CategoricalDtype\n", "\n", "# Apply the default theme, set bigger font\n", "sns.set_theme()\n", "\n", "# Reading and adjusting data\n", "K = pd.read_csv(\"application_train.csv\")\n", "K = K.sample(n=5000, axis=0) # random sample of 5000 records\n", "K.columns = K.columns.str.lower() # column names to lowercase" ] }, { "cell_type": "markdown", "id": "bf2606f8", "metadata": {}, "source": [ "2. Transform data: *data_birth* -> *age*, *days_employed* -> *years_employed*, consider 1 year = 365.25 days. Original data are negative, so transform to positive. If there are any negative values after the transformation, replace them by np.nan." ] }, { "cell_type": "code", "execution_count": null, "id": "501f029e", "metadata": {}, "outputs": [], "source": [ "K[\"age\"] = -K[\"days_birth\"] / 365.25 \n", "K[\"yrs_employed\"] = -K[\"days_employed\"] / 365.25\n", "K[\"age\"] = np.where(K[\"age\"] < 0, np.nan, K[\"age\"]) # cleaning from nonsense values\n", "K[\"yrs_employed\"] = np.where(K[\"yrs_employed\"] < 0, np.nan, K[\"yrs_employed\"]) # cleaning from nonsense values" ] }, { "cell_type": "markdown", "id": "6cc8cd3c", "metadata": {}, "source": [ "3. Explore distribution of *age* by ECDF, density estimation, histogram, boxplot:\n", " - In histogram use bins of 5 years, try to make reasonable boundaries of them (e. g. 20-25 etc., see parameter *bins*).\n", " - In density estimation, limit the curve to the variable range (see parameter *cut* in *kdeplot*). Try various amount of smoothing.\n", " - For one graph (no matter which one) do a neat annotation (proper title, axis labels), try to change theme (*set_theme* method), font size (*font_scale* in *set* method), color (find yourself)." ] }, { "cell_type": "code", "execution_count": null, "id": "49d3a8ad", "metadata": {}, "outputs": [], "source": [ "g = sns.displot(data=K, x=\"age\", kind=\"ecdf\")\n", "g = sns.displot(data=K, x=\"age\", kind=\"kde\", cut=0, bw_adjust=0.8)\n", "g = sns.catplot(data=K, y=\"age\", kind=\"box\")\n", "\n", "sns.set_theme(style=\"whitegrid\") # changing theme\n", "g = sns.displot(data=K, x=\"age\", bins=range(20, 75, 5), color=\"green\") \\\n", " .set_axis_labels(\"Age [years]\", \"Count\") \\\n", " .set(title=\"Distribution of applicants' age\")\n", "sns.set_theme() # changing theme back" ] }, { "cell_type": "markdown", "id": "be97d148", "metadata": {}, "source": [ "4. Is distribution of *age* more likely Gaussian-like, or skewed? Does 2-sigma rule hold for it?" ] }, { "cell_type": "code", "execution_count": null, "id": "c15b4701", "metadata": {}, "outputs": [], "source": [ "# From the histogram the distribution looks like something between uniform and normal distribution,\n", "# i. e. it is close to Gaussian.\n", "# For 2-sigma rule, let's compute the mean and SD (\"sigma\"):\n", "age_mean = K[\"age\"].mean()\n", "age_std = K[\"age\"].std()\n", "print(\"Mean age: \", \"%.1f\" % age_mean)\n", "print(\"SD age: \", \"%.1f\" % age_std)\n", "print(\"Share of record within 2 sigma:\", \"%.3f\" % np.mean(np.abs(K[\"age\"] - age_mean) < 2*age_std))\n", "# Within 2-sigma, there is more than expected part of data" ] }, { "cell_type": "markdown", "id": "49c46413", "metadata": {}, "source": [ "5. Explore distribution of *cnt_fam_members*, consider it like a categorial ordered variable – make frequency table(s) and graphs." ] }, { "cell_type": "code", "execution_count": null, "id": "0fe3c520", "metadata": {}, "outputs": [], "source": [ "# frequency table(s)\n", "hlp_df = K.groupby(\"cnt_fam_members\").agg(cnt_abs=(\"sk_id_curr\", \"count\"))\n", "hlp_df[\"cnt_cum\"] = hlp_df[\"cnt_abs\"].cumsum()\n", "hlp_df[\"cnt_rel\"] = hlp_df[\"cnt_abs\"] / sum(hlp_df[\"cnt_abs\"])\n", "hlp_df[\"cnt_rel_cum\"] = hlp_df[\"cnt_rel\"].cumsum()\n", "print(hlp_df)\n", "\n", "# graphs\n", "g = sns.displot(data=K, x=\"cnt_fam_members\", discrete=True)\n", "hlp_df[\"hlp\"] = \"\"\n", "g = sns.displot(data=hlp_df, x=\"hlp\", hue=\"cnt_fam_members\", multiple=\"stack\", weights=\"cnt_rel\")" ] }, { "cell_type": "markdown", "id": "a2600583", "metadata": {}, "source": [ "6. Explore relationship of *flag_own_car*, *name_family_status*, *yrs_employed* and *ext_source_1* to answer following questions:\n", " - What is share of car owners in groups by family status? (Compute owner shares as decimal numbers and plot them as by categories.)\n", " - Plot *ext_source_1* and *yrs_employed* first together and then with distinction of car ownership as a category. (Hint: making some axis in log scale may help.)\n", " - What are distributions of *ext_source_1* in groups by family status (make a plot)? What statistics do describe well them distribution? Compute them for each group.\n", " - Do the same for *yrs_employed* instead of *ext_source_1*. Do we use same or different statistics to describe distribution of *yrs_employed*? Again, compute them." ] }, { "cell_type": "code", "execution_count": null, "id": "8945fa6a", "metadata": {}, "outputs": [], "source": [ "### Share of car owners by family status\n", "# How is *flag_own_car* encoded?\n", "print(\"Unique values of flag_own_car:\")\n", "print(K[\"flag_own_car\"].unique()) # it is a string variable with values \"Y\", \"N\"\n", "\n", "# if *flag_own_car* were 0/1 or True/False variable, a share of positive values (1's, True's)\n", "# could be computed as the mean\n", "# so we convert *flag_own_car* to True/False variable\n", "\n", "# we can either make a new column\n", "K[\"flag_own_car2\"] = (K[\"flag_own_car\"]==\"Y\")\n", "hlp_df = K.groupby(\"name_family_status\").agg(owner_share=(\"flag_own_car2\", \"mean\"))\n", "print(hlp_df)\n", "# or to make the conversion \"on the fly\"\n", "hlp_df = K.assign(temp=K[\"flag_own_car\"]==\"Y\") \\\n", " .groupby(\"name_family_status\").agg(owner_share=(\"temp\", \"mean\"))\n", "print(hlp_df)\n", "\n", "# another way is to make contingency table and to compute relative frequency by categories\n", "hlp_df = pd.crosstab(K[\"name_family_status\"], K[\"flag_own_car\"], normalize=\"index\")\n", "print(hlp_df[\"Y\"])\n", "\n", "# for plotting, we can use barplot showing means for categories\n", "# conversion of flag_own_car is made \"on the fly\"\n", "g = sns.catplot(data=K.assign(temp=K[\"flag_own_car\"]==\"Y\"),\n", " y=\"name_family_status\", x=\"temp\", kind=\"bar\", errorbar=None) \\\n", " .set_axis_labels(\"Share of car owners\", \"Family status\")" ] }, { "cell_type": "code", "execution_count": null, "id": "4954729a", "metadata": {}, "outputs": [], "source": [ "### Analyzing ext_source_1 and yrs_employed\n", "g = sns.relplot(data=K, x=\"ext_source_1\", y=\"yrs_employed\") # too many points overlapping\n", "g = sns.displot(data=K, x=\"ext_source_1\", y=\"yrs_employed\", cbar=True) # a bit better\n", "g = sns.relplot(data=K, x=\"ext_source_1\", y=\"yrs_employed\").set(yscale=\"log\") # using log scale Y is useful, too\n", "# bad idea: g = sns.displot(data=K, x=\"ext_source_1\", y=\"yrs_employed\", cbar=True).set(yscale=\"log\")\n", "g = sns.displot(data=K.assign(yrs_log=np.log10(K[\"yrs_employed\"])),\n", " x=\"ext_source_1\", y=\"yrs_log\", cbar=True)" ] }, { "cell_type": "code", "execution_count": null, "id": "21ba5780", "metadata": {}, "outputs": [], "source": [ "# and when splitting by flag_own_car...\n", "g = sns.relplot(data=K, x=\"ext_source_1\", y=\"yrs_employed\", hue=\"flag_own_car\")\n", "g = sns.relplot(data=K, x=\"ext_source_1\", y=\"yrs_employed\", hue=\"flag_own_car\").set(yscale=\"log\")" ] }, { "cell_type": "code", "execution_count": null, "id": "8547c3fb", "metadata": {}, "outputs": [], "source": [ "### Distribution of ext_source_1 in groups by family status\n", "# histograms\n", "# ugly: g = sns.displot(data=K, x=\"ext_source_1\", hue=\"name_family_status\", stat=\"probability\", common_norm=False) # ugly\n", "g = sns.displot(data=K, x=\"ext_source_1\", hue=\"name_family_status\",\n", " multiple=\"dodge\", stat=\"probability\", common_norm=False) # too many bars\n", "# common_norm is False here because we want to normalize for each category, not overall\n", "# the best is to make separate graphs for each category\n", "g = sns.displot(data=K, x=\"ext_source_1\", hue=\"name_family_status\", bins=[i/10.0 for i in range(0, 11)],\n", " col=\"name_family_status\", stat=\"probability\", common_norm=False)\n", "\n", "# density estimations are better than histograms to share the same graph\n", "g = sns.displot(data=K, x=\"ext_source_1\", hue=\"name_family_status\", kind=\"kde\", common_norm=False)\n", "\n", "# boxplots\n", "g = sns.catplot(data=K, x=\"name_family_status\", y=\"ext_source_1\", kind=\"box\")\n", "\n", "# and statistics - looks gaussian, so means and SD by categories are a good idea\n", "K.groupby(\"name_family_status\").agg({\"ext_source_1\": [\"mean\", \"std\", \"count\"]})" ] }, { "cell_type": "code", "execution_count": null, "id": "c3c2b7f4", "metadata": {}, "outputs": [], "source": [ "### Distribution of yrs_employed in groupy by family status\n", "# histograms\n", "# having experience from above, let's make just separate graphs for each category\n", "g = sns.displot(data=K, x=\"yrs_employed\", hue=\"name_family_status\",\n", " col=\"name_family_status\", stat=\"probability\", common_norm=False)\n", "\n", "# density estimations are better than histograms to share the same graph\n", "g = sns.displot(data=K, x=\"yrs_employed\", hue=\"name_family_status\", kind=\"kde\", common_norm=False, cut=0)\n", "# cut at 0 - lower values cannot appear\n", "\n", "# boxplots\n", "g = sns.catplot(data=K, x=\"name_family_status\", y=\"yrs_employed\", kind=\"box\")\n", "\n", "# and statistics - looks skewed, so we will use median and quartiles instead of mean and SD\n", "def quartile_l(x):\n", " return np.nanquantile(x, q=0.25)\n", "\n", "def quartile_h(x):\n", " return np.nanquantile(x, q=0.75)\n", "\n", "K.groupby(\"name_family_status\").agg(emp_median=(\"yrs_employed\", np.median),\n", " emp_ql=(\"yrs_employed\", quartile_l),\n", " emp_qh=(\"yrs_employed\", quartile_h),\n", " emp_count=(\"yrs_employed\", \"count\"))" ] }, { "cell_type": "markdown", "id": "c89d7dc5", "metadata": {}, "source": [ "7. Make a plot of *age* distribution for grouping by *code_gender* and *cnt_children* (together, i. e. nested grouping)." ] }, { "cell_type": "code", "execution_count": null, "id": "7aa7a039", "metadata": {}, "outputs": [], "source": [ "g = sns.catplot(data=K, x=\"cnt_children\", y=\"age\", hue=\"code_gender\", kind=\"violin\", split=True)\n", "g = sns.catplot(data=K, x=\"cnt_children\", y=\"age\", hue=\"code_gender\", kind=\"box\")" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.18" } }, "nbformat": 4, "nbformat_minor": 5 }