{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "If you haven't it done before: \n",
    "\n",
    "* Install python & pandas from [Anaconda](https://www.anaconda.com/products/individual)\n",
    "* Download `titanic2.zip` data from repository."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "pd.set_option(\"display.precision\", 2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "The Challenge\n",
    "\n",
    "The sinking of the Titanic is one of the most infamous shipwrecks in history.\n",
    "\n",
    "On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg.\n",
    "Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew.\n",
    "\n",
    "While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.\n",
    "\n",
    "In this challenge, we ask you to build a predictive model that answers the question: “what sorts of people were more likely to survive?” using passenger data (ie name, age, gender, socio-economic class, etc).\n",
    "\n",
    "(c) [Kaggle](https://www.kaggle.com/c/titanic/overview)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>passenger_id</th>\n",
       "      <th>pclass</th>\n",
       "      <th>name</th>\n",
       "      <th>sex</th>\n",
       "      <th>age</th>\n",
       "      <th>sibsp</th>\n",
       "      <th>parch</th>\n",
       "      <th>ticket</th>\n",
       "      <th>fare</th>\n",
       "      <th>cabin</th>\n",
       "      <th>embarked</th>\n",
       "      <th>boat</th>\n",
       "      <th>body</th>\n",
       "      <th>home.dest</th>\n",
       "      <th>survived</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1216</td>\n",
       "      <td>3</td>\n",
       "      <td>Smyth, Miss. Julia</td>\n",
       "      <td>female</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>335432</td>\n",
       "      <td>7.73</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Q</td>\n",
       "      <td>13</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>699</td>\n",
       "      <td>3</td>\n",
       "      <td>Cacic, Mr. Luka</td>\n",
       "      <td>male</td>\n",
       "      <td>38.0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>315089</td>\n",
       "      <td>8.66</td>\n",
       "      <td>NaN</td>\n",
       "      <td>S</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Croatia</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>1267</td>\n",
       "      <td>3</td>\n",
       "      <td>Van Impe, Mrs. Jean Baptiste (Rosalie Paula Go...</td>\n",
       "      <td>female</td>\n",
       "      <td>30.0</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>345773</td>\n",
       "      <td>24.15</td>\n",
       "      <td>NaN</td>\n",
       "      <td>S</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>449</td>\n",
       "      <td>2</td>\n",
       "      <td>Hocking, Mrs. Elizabeth (Eliza Needs)</td>\n",
       "      <td>female</td>\n",
       "      <td>54.0</td>\n",
       "      <td>1</td>\n",
       "      <td>3</td>\n",
       "      <td>29105</td>\n",
       "      <td>23.00</td>\n",
       "      <td>NaN</td>\n",
       "      <td>S</td>\n",
       "      <td>4</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Cornwall / Akron, OH</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>576</td>\n",
       "      <td>2</td>\n",
       "      <td>Veal, Mr. James</td>\n",
       "      <td>male</td>\n",
       "      <td>40.0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>28221</td>\n",
       "      <td>13.00</td>\n",
       "      <td>NaN</td>\n",
       "      <td>S</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Barre, Co Washington, VT</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   passenger_id  pclass                                               name  \\\n",
       "0          1216       3                                 Smyth, Miss. Julia   \n",
       "1           699       3                                    Cacic, Mr. Luka   \n",
       "2          1267       3  Van Impe, Mrs. Jean Baptiste (Rosalie Paula Go...   \n",
       "3           449       2              Hocking, Mrs. Elizabeth (Eliza Needs)   \n",
       "4           576       2                                    Veal, Mr. James   \n",
       "\n",
       "      sex   age  sibsp  parch  ticket   fare cabin embarked boat  body  \\\n",
       "0  female   NaN      0      0  335432   7.73   NaN        Q   13   NaN   \n",
       "1    male  38.0      0      0  315089   8.66   NaN        S  NaN   NaN   \n",
       "2  female  30.0      1      1  345773  24.15   NaN        S  NaN   NaN   \n",
       "3  female  54.0      1      3   29105  23.00   NaN        S    4   NaN   \n",
       "4    male  40.0      0      0   28221  13.00   NaN        S  NaN   NaN   \n",
       "\n",
       "                  home.dest  survived  \n",
       "0                       NaN         1  \n",
       "1                   Croatia         0  \n",
       "2                       NaN         0  \n",
       "3      Cornwall / Akron, OH         1  \n",
       "4  Barre, Co Washington, VT         0  "
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df = pd.read_csv('titanic_train.csv')\n",
    "\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "1. How many rows has the dataframe?\n",
    "2. How many columns has the dataframe?\n",
    "3. What is the percentage of non-null values in the age column?\n",
    "4. How many text columns has the dataframe?\n",
    "5. How many men and women are in the dataset?\n",
    "6. What is the average age of men and women? On average who is younger?\n",
    "7. What is the percentage of passengers travelling in 3rd class cabins? (pclass variable)\n",
    "8. How much did the most expensive ticket cost? (fare variable)\n",
    "9. Calculate average age, proportion of females and average fare per pclass.\n",
    "10. Who is more likely to travel alone men or women? (consider a passenger as travelling alone if he/she has neither siblings/spouse not parents/children)\n",
    "11. What is the most popular lastname? firstname? (name column)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "850"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# How many rows has the dataframe?\n",
    "\n",
    "df.shape[0]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "15"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# How many columns has the dataframe?\n",
    "\n",
    "df.shape[1]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.7952941176470588"
      ]
     },
     "execution_count": 21,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# What is the percentage of non-null values in the age column?\n",
    "\n",
    "df['age'].count() / df.shape[0]\n",
    "# alternatively:\n",
    "# df['age'].isna().sum() / df.shape[0]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "7"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# How many text columns has the dataframe?\n",
    "\n",
    "(df.dtypes == 'object').sum()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "sex\n",
       "male      551\n",
       "female    299\n",
       "Name: count, dtype: int64"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# How many men and women are in the dataset?\n",
    "\n",
    "df['sex'].value_counts()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>avg_age</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>sex</th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>female</th>\n",
       "      <td>28.86</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>male</th>\n",
       "      <td>29.90</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "        avg_age\n",
       "sex            \n",
       "female    28.86\n",
       "male      29.90"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# What is the average age of men and women? On average who is younger?\n",
    "\n",
    "df.groupby('sex').agg(avg_age=('age', 'mean'))\n",
    "# alternatively:\n",
    "# df.groupby('sex')['age'].describe() # more statistics"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "pclass\n",
       "1    0.24\n",
       "2    0.20\n",
       "3    0.56\n",
       "Name: proportion, dtype: float64"
      ]
     },
     "execution_count": 20,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# What is the percentage of passengers travelling in the 3rd class? (pclass variable)\n",
    "\n",
    "df['pclass'].value_counts(normalize=True).sort_index()\n",
    "# alternatively:\n",
    "# df.groupby('pclass').agg(pass_rel_count=('passenger_id', 'count')) / df.shape[0]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "512.3292"
      ]
     },
     "execution_count": 24,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# How much did the most expensive ticket cost? (fare variable)\n",
    "\n",
    "df['fare'].max()\n",
    "# alternatively:\n",
    "# df.sort_values('fare', ascending=False).head(1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>avg_age</th>\n",
       "      <th>prop_female</th>\n",
       "      <th>avg_fare</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>pclass</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>39.11</td>\n",
       "      <td>0.46</td>\n",
       "      <td>91.15</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>28.60</td>\n",
       "      <td>0.40</td>\n",
       "      <td>21.26</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>24.69</td>\n",
       "      <td>0.29</td>\n",
       "      <td>13.77</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "        avg_age  prop_female  avg_fare\n",
       "pclass                                \n",
       "1         39.11         0.46     91.15\n",
       "2         28.60         0.40     21.26\n",
       "3         24.69         0.29     13.77"
      ]
     },
     "execution_count": 25,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# calculate average age, proportion of females and average fare per pclass\n",
    "\n",
    "(\n",
    "    df\n",
    "    .assign(is_female=lambda _: _['sex'] == 'female')\n",
    "    .groupby('pclass').agg(\n",
    "        avg_age=('age', 'mean'),\n",
    "        prop_female=('is_female', 'mean'),\n",
    "        avg_fare=('fare', 'mean')\n",
    "    )\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>prop_singles</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>sex</th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>female</th>\n",
       "      <td>0.43</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>male</th>\n",
       "      <td>0.70</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "        prop_singles\n",
       "sex                 \n",
       "female          0.43\n",
       "male            0.70"
      ]
     },
     "execution_count": 26,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Who is more likely to travel alone: men or women? (Consider a passenger as travelling alone if he/she has no siblings/children)\n",
    "\n",
    "(\n",
    "    df\n",
    "    .assign(travel_alone=lambda _: (_['parch'] == 0) & (_['sibsp'] == 0))\n",
    "    .groupby('sex').agg(prop_singles=('travel_alone', 'mean'))\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "name\n",
       "Sage         10\n",
       "Andersson     8\n",
       "Goodwin       6\n",
       "Kelly         5\n",
       "Rice          5\n",
       "Name: count, dtype: int64"
      ]
     },
     "execution_count": 27,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# What is the most popular lastname? firstname? (name column)\n",
    "\n",
    "df['name'].str.split(',').str[0].value_counts().head(5)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "name\n",
       "mr         492\n",
       "miss       171\n",
       "mrs        124\n",
       "william     51\n",
       "master      46\n",
       "john        45\n",
       "henry       27\n",
       "james       24\n",
       "thomas      23\n",
       "joseph      22\n",
       "Name: count, dtype: int64"
      ]
     },
     "execution_count": 28,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# most popular first name\n",
    "\n",
    "df['name'].str.lower().str.replace('[,\\.\\(\\)\\\"]', '', regex=True).str.split().explode().value_counts().head(10)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.18"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}