{ "cells": [ { "cell_type": "markdown", "id": "0128ad0b", "metadata": {}, "source": [ "# Data preparation\n", "\n", "* From **business understanding**, we know the task to be solved.\n", "* Then we do **data understanding** to explore the data.\n", "* Now we apply the data transformations that are necessary or useful for reaching that goal.\n", "\n", "## Outline\n", "0. Summary of data understanding\n", "1. Missing and invalid data\n", "2. Feature extraction\n", "3. Making different statistical units\n", "4. Data transformation\n", "\n", "## Data and tasks\n", "* Titanic2 (*titanic_train.csv*) - data preparation for an analysis of ticket fares\n", "* Home Credit (*application_train.csv*) - segmentation of clients by family situation" ] }, { "cell_type": "code", "execution_count": 1, "id": "2fb31a8e", "metadata": {}, "outputs": [], "source": [ "# setup\n", "import pandas as pd\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns\n", "\n", "sns.set_theme(style=\"ticks\", color_codes=True)" ] }, { "cell_type": "markdown", "id": "465ee695", "metadata": {}, "source": [ "## Part I. Titanic and ticket fares\n", "### Summary of data understanding\n", "Just a few facts from the exploration, as needed for this exercise.\n", "\n", "Let's consider these columns only: *pclass*, *sex*, *age*, *ticket*, *fare*, *cabin*, *embarked*" ] }, { "cell_type": "code", "execution_count": 2, "id": "2a75a9ce", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "|  | passenger_id | ticket | pclass | fare | sex | age | cabin | embarked |\n", "
|---|---|---|---|---|---|---|---|---|\n", "
| 0 | 1216 | 335432 | 3 | 7.7333 | female | NaN | NaN | Q |\n", "
| 1 | 699 | 315089 | 3 | 8.6625 | male | 38.0 | NaN | S |\n", "
| 2 | 1267 | 345773 | 3 | 24.1500 | female | 30.0 | NaN | S |\n", "
| 3 | 449 | 29105 | 2 | 23.0000 | female | 54.0 | NaN | S |\n", "
| 4 | 576 | 28221 | 2 | 13.0000 | male | 40.0 | NaN | S |\n", "
| ... | ... | ... | ... | ... | ... | ... | ... | ... |\n", "
| 845 | 158 | 680 | 1 | 50.0000 | male | 55.0 | C39 | S |\n", "
| 846 | 174 | 11771 | 1 | 29.7000 | male | 58.0 | B37 | C |\n", "
| 847 | 467 | 244367 | 2 | 26.0000 | female | 24.0 | NaN | S |\n", "
| 848 | 1112 | SOTON/O.Q. 3101315 | 3 | 13.7750 | female | 3.0 | NaN | S |\n", "
| 849 | 425 | 250647 | 2 | 13.0000 | male | 52.0 | NaN | S |\n", "
850 rows × 8 columns
\n", "\n", "|  | days_birth | code_gender | cnt_children | cnt_fam_members | name_family_status |\n", "
|---|---|---|---|---|---|\n", "
| 0 | -9461 | M | 0 | 1.0 | Single / not married |\n", "
| 1 | -16765 | F | 0 | 2.0 | Married |\n", "
| 2 | -19046 | M | 0 | 1.0 | Single / not married |\n", "
| 3 | -19005 | F | 0 | 2.0 | Civil marriage |\n", "
| 4 | -19932 | M | 0 | 1.0 | Single / not married |\n", "
| ... | ... | ... | ... | ... | ... |\n", "
| 307506 | -9327 | M | 0 | 1.0 | Separated |\n", "
| 307507 | -20775 | F | 0 | 1.0 | Widow |\n", "
| 307508 | -14966 | F | 0 | 1.0 | Separated |\n", "
| 307509 | -11961 | F | 0 | 2.0 | Married |\n", "
| 307510 | -16856 | F | 0 | 2.0 | Married |\n", "
307511 rows × 5 columns
\n", "