{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "bf4f17ca",
   "metadata": {},
   "source": [
    "# Visualisation (Python &ndash; seaborn)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "869d2b82",
   "metadata": {},
   "source": [
    "## 2. Tasks for you\n",
    "\n",
    "We will use the same data as above (file *application_train.csv* from *kaggle_home_credit.zip*), but a larger volume of it.\n",
    "\n",
    "1. Read the file *application_train.csv* again and draw a random sample of 5,000 records from it."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "bf2606f8",
   "metadata": {},
   "source": [
    "2. Transform the data: *days_birth* -> *age*, *days_employed* -> *yrs_employed*, taking 1 year = 365.25 days. The original values are negative, so flip their sign to make them positive. If any values are still negative after the transformation, replace them with np.nan."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6cc8cd3c",
   "metadata": {},
   "source": [
    "3. Explore distribution of *age* by ECDF, density estimation, histogram, boxplot:\n",
    "   - In histogram use bins of 5 years, try to make reasonable boundaries of them (e. g. 20-25 etc., see parameter *bins*).\n",
    "   - In density estimation, limit the curve to the variable range (see parameter *cut* in *kdeplot*). Try various amount of smoothing.\n",
    "   - For one graph (no matter which one) do a neat annotation (proper title, axis labels), try to change theme (*set_theme* method), font size (*font_scale* in *set* method), color (find yourself)."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "be97d148",
   "metadata": {},
   "source": [
    "4. Is the distribution of *age* closer to Gaussian, or is it skewed? Does the 2-sigma rule hold for it?"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "49c46413",
   "metadata": {},
   "source": [
    "5. Explore the distribution of *cnt_fam_members*, treating it as an ordered categorical variable &ndash; make frequency table(s) and graphs."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a2600583",
   "metadata": {},
   "source": [
    "6. Explore the relationships among *flag_own_car*, *name_family_status*, *yrs_employed* and *ext_source_1* to answer the following questions:\n",
    "   - What is the share of car owners in each family-status group? (Compute the owner shares as decimal numbers and plot them by category.)\n",
    "   - Plot *ext_source_1* against *yrs_employed*, first for all records together and then distinguishing car ownership as a category. (Hint: putting one of the axes on a log scale may help.)\n",
    "   - What are the distributions of *ext_source_1* in the family-status groups (make a plot)? Which statistics describe these distributions well? Compute them for each group.\n",
    "   - Do the same for *yrs_employed* instead of *ext_source_1*. Would you use the same or different statistics to describe the distribution of *yrs_employed*? Again, compute them."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c89d7dc5",
   "metadata": {},
   "source": [
    "7. Plot the distribution of *age* grouped by *code_gender* and *cnt_children* together (i.e. nested grouping)."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.18"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}