{ "cells": [ { "cell_type": "markdown", "id": "bf4f17ca", "metadata": {}, "source": [ "# Visualisation (Python – seaborn)" ] }, { "cell_type": "markdown", "id": "869d2b82", "metadata": {}, "source": [ "## 2. Tasks for you\n", "\n", "We will use the same data as above (file *application_train.csv* from *kaggle_home_credit.zip*), but a larger volume of it.\n", "\n", "1. Read the file *application_train.csv* again and draw a random sample of 5 000 records from it." ] }, { "cell_type": "markdown", "id": "bf2606f8", "metadata": {}, "source": [ "2. Transform the data: *days_birth* -> *age*, *days_employed* -> *yrs_employed*, taking 1 year = 365.25 days. The original values are negative, so convert them to positive. If any values are still negative after the transformation, replace them with np.nan." ] }, { "cell_type": "markdown", "id": "6cc8cd3c", "metadata": {}, "source": [ "3. Explore the distribution of *age* using an ECDF, a density estimate, a histogram, and a boxplot:\n", " - In the histogram, use bins 5 years wide and give them sensible boundaries (e.g. 20-25 etc., see parameter *bins*).\n", " - In the density estimate, limit the curve to the range of the variable (see parameter *cut* in *kdeplot*). Try various amounts of smoothing.\n", " - For one graph (no matter which one), add neat annotation (a proper title and axis labels), and try changing the theme (*set_theme* function), the font size (*font_scale* parameter of the *set* function), and the color (find out how yourself)." ] }, { "cell_type": "markdown", "id": "be97d148", "metadata": {}, "source": [ "4. Is the distribution of *age* closer to Gaussian, or is it skewed? Does the 2-sigma rule hold for it?" ] }, { "cell_type": "markdown", "id": "49c46413", "metadata": {}, "source": [ "5. Explore the distribution of *cnt_fam_members*, treating it as an ordered categorical variable: make frequency table(s) and graphs." ] }, { "cell_type": "markdown", "id": "a2600583", "metadata": {}, "source": [ "6. 
Explore the relationships among *flag_own_car*, *name_family_status*, *yrs_employed* and *ext_source_1* to answer the following questions:\n", " - What is the share of car owners in each family status group? (Compute the owner shares as decimal numbers and plot them by category.)\n", " - Plot *ext_source_1* against *yrs_employed*, first for all records together and then distinguishing car ownership as a category. (Hint: putting one of the axes on a log scale may help.)\n", " - What are the distributions of *ext_source_1* within the family status groups (make a plot)? Which statistics describe these distributions well? Compute them for each group.\n", " - Do the same for *yrs_employed* instead of *ext_source_1*. Would you use the same or different statistics to describe the distribution of *yrs_employed*? Again, compute them." ] }, { "cell_type": "markdown", "id": "c89d7dc5", "metadata": {}, "source": [ "7. Make a plot of the *age* distribution grouped by *code_gender* and *cnt_children* (together, i.e. nested grouping)." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.18" } }, "nbformat": 4, "nbformat_minor": 5 }