{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "bf4f17ca",
   "metadata": {},
   "source": [
    "# Visualisation (Python &ndash; seaborn)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "869d2b82",
   "metadata": {},
   "source": [
    "## 2. Tasks for you\n",
    "\n",
    "We will use the same data as above (file *application_train.csv* from *kaggle_home_credit.zip*) but bigger volume of it.\n",
    "\n",
    "1. Read file *application_train.csv* again and make from it a random sample of 5 000 records."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2890db4e",
   "metadata": {},
   "outputs": [],
   "source": [
    "### Setup\n",
    "%matplotlib inline\n",
    "# should enable plotting without explicit call .show()\n",
    "\n",
    "# Import libraries\n",
    "import pandas as pd\n",
    "import numpy as np\n",
    "import seaborn as sns\n",
    "import matplotlib.pyplot as plt\n",
    "\n",
    "# classes for special types\n",
    "from pandas.api.types import CategoricalDtype\n",
    "\n",
    "# Apply the default theme, set bigger font\n",
    "sns.set_theme()\n",
    "\n",
    "# Reading and adjusting data\n",
    "K = pd.read_csv(\"application_train.csv\")\n",
    "K = K.sample(n=5000, axis=0) # random sample of 5000 records\n",
    "K.columns = K.columns.str.lower() # column names to lowercase"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "bf2606f8",
   "metadata": {},
   "source": [
    "2. Transform data: *data_birth* -> *age*, *days_employed* -> *years_employed*, consider 1 year = 365.25 days. Original data are negative, so transform to positive. If there are any negative values after the transformation, replace them by np.nan."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "501f029e",
   "metadata": {},
   "outputs": [],
   "source": [
    "K[\"age\"] = -K[\"days_birth\"] / 365.25 \n",
    "K[\"yrs_employed\"] = -K[\"days_employed\"] / 365.25\n",
    "K[\"age\"] = np.where(K[\"age\"] < 0, np.nan, K[\"age\"]) # cleaning from nonsense values\n",
    "K[\"yrs_employed\"] = np.where(K[\"yrs_employed\"] < 0, np.nan, K[\"yrs_employed\"]) # cleaning from nonsense values"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6cc8cd3c",
   "metadata": {},
   "source": [
    "3. Explore distribution of *age* by ECDF, density estimation, histogram, boxplot:\n",
    "   - In histogram use bins of 5 years, try to make reasonable boundaries of them (e. g. 20-25 etc., see parameter *bins*).\n",
    "   - In density estimation, limit the curve to the variable range (see parameter *cut* in *kdeplot*). Try various amount of smoothing.\n",
    "   - For one graph (no matter which one) do a neat annotation (proper title, axis labels), try to change theme (*set_theme* method), font size (*font_scale* in *set* method), color (find yourself)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "49d3a8ad",
   "metadata": {},
   "outputs": [],
   "source": [
    "g = sns.displot(data=K, x=\"age\", kind=\"ecdf\")\n",
    "g = sns.displot(data=K, x=\"age\", kind=\"kde\", cut=0, bw_adjust=0.8)\n",
    "g = sns.catplot(data=K, y=\"age\", kind=\"box\")\n",
    "\n",
    "sns.set_theme(style=\"whitegrid\") # changing theme\n",
    "g = sns.displot(data=K, x=\"age\", bins=range(20, 75, 5), color=\"green\") \\\n",
    "    .set_axis_labels(\"Age [years]\", \"Count\") \\\n",
    "    .set(title=\"Distribution of applicants' age\")\n",
    "sns.set_theme() # changing theme back"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "be97d148",
   "metadata": {},
   "source": [
    "4. Is distribution of *age* more likely Gaussian-like, or skewed? Does 2-sigma rule hold for it?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c15b4701",
   "metadata": {},
   "outputs": [],
   "source": [
    "# From the histogram the distribution looks like something between uniform and normal distribution,\n",
    "# i. e. it is close to Gaussian.\n",
    "# For 2-sigma rule, let's compute the mean and SD (\"sigma\"):\n",
    "age_mean = K[\"age\"].mean()\n",
    "age_std = K[\"age\"].std()\n",
    "print(\"Mean age: \", \"%.1f\" % age_mean)\n",
    "print(\"SD age: \", \"%.1f\" % age_std)\n",
    "print(\"Share of record within 2 sigma:\", \"%.3f\" % np.mean(np.abs(K[\"age\"] - age_mean) < 2*age_std))\n",
    "# Within 2-sigma, there is more than expected part of data"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "49c46413",
   "metadata": {},
   "source": [
    "5. Explore distribution of *cnt_fam_members*, consider it like a categorial ordered variable &ndash; make frequency table(s) and graphs."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0fe3c520",
   "metadata": {},
   "outputs": [],
   "source": [
    "# frequency table(s)\n",
    "hlp_df = K.groupby(\"cnt_fam_members\").agg(cnt_abs=(\"sk_id_curr\", \"count\"))\n",
    "hlp_df[\"cnt_cum\"] = hlp_df[\"cnt_abs\"].cumsum()\n",
    "hlp_df[\"cnt_rel\"] = hlp_df[\"cnt_abs\"] / sum(hlp_df[\"cnt_abs\"])\n",
    "hlp_df[\"cnt_rel_cum\"] = hlp_df[\"cnt_rel\"].cumsum()\n",
    "print(hlp_df)\n",
    "\n",
    "# graphs\n",
    "g = sns.displot(data=K, x=\"cnt_fam_members\", discrete=True)\n",
    "hlp_df[\"hlp\"] = \"\"\n",
    "g = sns.displot(data=hlp_df, x=\"hlp\", hue=\"cnt_fam_members\", multiple=\"stack\", weights=\"cnt_rel\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a2600583",
   "metadata": {},
   "source": [
    "6. Explore relationship of *flag_own_car*, *name_family_status*, *yrs_employed* and *ext_source_1* to answer following questions:\n",
    "   - What is share of car owners in groups by family status? (Compute owner shares as decimal numbers and plot them as by categories.)\n",
    "   - Plot *ext_source_1* and *yrs_employed* first together and then with distinction of car ownership as a category. (Hint: making some axis in log scale may help.)\n",
    "   - What are distributions of *ext_source_1* in groups by family status (make a plot)? What statistics do describe well them distribution? Compute them for each group.\n",
    "   - Do the same for *yrs_employed* instead of *ext_source_1*. Do we use same or different statistics to describe distribution of *yrs_employed*? Again, compute them."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8945fa6a",
   "metadata": {},
   "outputs": [],
   "source": [
    "### Share of car owners by family status\n",
    "# How is *flag_own_car* encoded?\n",
    "print(\"Unique values of flag_own_car:\")\n",
    "print(K[\"flag_own_car\"].unique()) # it is a string variable with values \"Y\", \"N\"\n",
    "\n",
    "# if *flag_own_car* were 0/1 or True/False variable, a share of positive values (1's, True's)\n",
    "#   could be computed as the mean\n",
    "# so we convert *flag_own_car* to True/False variable\n",
    "\n",
    "# we can either make a new column\n",
    "K[\"flag_own_car2\"] = (K[\"flag_own_car\"]==\"Y\")\n",
    "hlp_df = K.groupby(\"name_family_status\").agg(owner_share=(\"flag_own_car2\", \"mean\"))\n",
    "print(hlp_df)\n",
    "# or to make the conversion \"on the fly\"\n",
    "hlp_df = K.assign(temp=K[\"flag_own_car\"]==\"Y\") \\\n",
    "    .groupby(\"name_family_status\").agg(owner_share=(\"temp\", \"mean\"))\n",
    "print(hlp_df)\n",
    "\n",
    "# another way is to make contingency table and to compute relative frequency by categories\n",
    "hlp_df = pd.crosstab(K[\"name_family_status\"], K[\"flag_own_car\"], normalize=\"index\")\n",
    "print(hlp_df[\"Y\"])\n",
    "\n",
    "# for plotting, we can use barplot showing means for categories\n",
    "# conversion of flag_own_car is made \"on the fly\"\n",
    "g = sns.catplot(data=K.assign(temp=K[\"flag_own_car\"]==\"Y\"),\n",
    "                y=\"name_family_status\", x=\"temp\", kind=\"bar\", errorbar=None) \\\n",
    "    .set_axis_labels(\"Share of car owners\", \"Family status\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4954729a",
   "metadata": {},
   "outputs": [],
   "source": [
    "### Analyzing ext_source_1 and yrs_employed\n",
    "g = sns.relplot(data=K, x=\"ext_source_1\", y=\"yrs_employed\") # too many points overlapping\n",
    "g = sns.displot(data=K, x=\"ext_source_1\", y=\"yrs_employed\", cbar=True) # a bit better\n",
    "g = sns.relplot(data=K, x=\"ext_source_1\", y=\"yrs_employed\").set(yscale=\"log\") # using log scale Y is useful, too\n",
    "# bad idea: g = sns.displot(data=K, x=\"ext_source_1\", y=\"yrs_employed\", cbar=True).set(yscale=\"log\")\n",
    "g = sns.displot(data=K.assign(yrs_log=np.log10(K[\"yrs_employed\"])),\n",
    "                x=\"ext_source_1\", y=\"yrs_log\", cbar=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "21ba5780",
   "metadata": {},
   "outputs": [],
   "source": [
    "# and when splitting by flag_own_car...\n",
    "g = sns.relplot(data=K, x=\"ext_source_1\", y=\"yrs_employed\", hue=\"flag_own_car\")\n",
    "g = sns.relplot(data=K, x=\"ext_source_1\", y=\"yrs_employed\", hue=\"flag_own_car\").set(yscale=\"log\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8547c3fb",
   "metadata": {},
   "outputs": [],
   "source": [
    "### Distribution of ext_source_1 in groups by family status\n",
    "# histograms\n",
    "# ugly: g = sns.displot(data=K, x=\"ext_source_1\", hue=\"name_family_status\", stat=\"probability\", common_norm=False) # ugly\n",
    "g = sns.displot(data=K, x=\"ext_source_1\", hue=\"name_family_status\",\n",
    "                multiple=\"dodge\", stat=\"probability\", common_norm=False) # too many bars\n",
    "# common_norm is False here because we want to normalize for each category, not overall\n",
    "# the best is to make separate graphs for each category\n",
    "g = sns.displot(data=K, x=\"ext_source_1\", hue=\"name_family_status\", bins=[i/10.0 for i in range(0, 11)],\n",
    "                col=\"name_family_status\", stat=\"probability\", common_norm=False)\n",
    "\n",
    "# density estimations are better than histograms to share the same graph\n",
    "g = sns.displot(data=K, x=\"ext_source_1\", hue=\"name_family_status\", kind=\"kde\", common_norm=False)\n",
    "\n",
    "# boxplots\n",
    "g = sns.catplot(data=K, x=\"name_family_status\", y=\"ext_source_1\", kind=\"box\")\n",
    "\n",
    "# and statistics - looks gaussian, so means and SD by categories are a good idea\n",
    "K.groupby(\"name_family_status\").agg({\"ext_source_1\": [\"mean\", \"std\", \"count\"]})"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c3c2b7f4",
   "metadata": {},
   "outputs": [],
   "source": [
    "### Distribution of yrs_employed in groupy by family status\n",
    "# histograms\n",
    "# having experience from above, let's make just separate graphs for each category\n",
    "g = sns.displot(data=K, x=\"yrs_employed\", hue=\"name_family_status\",\n",
    "                col=\"name_family_status\", stat=\"probability\", common_norm=False)\n",
    "\n",
    "# density estimations are better than histograms to share the same graph\n",
    "g = sns.displot(data=K, x=\"yrs_employed\", hue=\"name_family_status\", kind=\"kde\", common_norm=False, cut=0)\n",
    "# cut at 0 - lower values cannot appear\n",
    "\n",
    "# boxplots\n",
    "g = sns.catplot(data=K, x=\"name_family_status\", y=\"yrs_employed\", kind=\"box\")\n",
    "\n",
    "# and statistics - looks skewed, so we will use median and quartiles instead of mean and SD\n",
    "def quartile_l(x):\n",
    "    return np.nanquantile(x, q=0.25)\n",
    "\n",
    "def quartile_h(x):\n",
    "    return np.nanquantile(x, q=0.75)\n",
    "\n",
    "K.groupby(\"name_family_status\").agg(emp_median=(\"yrs_employed\", np.median),\n",
    "                                    emp_ql=(\"yrs_employed\", quartile_l),\n",
    "                                    emp_qh=(\"yrs_employed\", quartile_h),\n",
    "                                    emp_count=(\"yrs_employed\", \"count\"))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c89d7dc5",
   "metadata": {},
   "source": [
    "7. Make a plot of *age* distribution for grouping by *code_gender* and *cnt_children* (together, i. e. nested grouping)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7aa7a039",
   "metadata": {},
   "outputs": [],
   "source": [
    "g = sns.catplot(data=K, x=\"cnt_children\", y=\"age\", hue=\"code_gender\", kind=\"violin\", split=True)\n",
    "g = sns.catplot(data=K, x=\"cnt_children\", y=\"age\", hue=\"code_gender\", kind=\"box\")"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.18"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}