{ "cells": [ { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "If you haven't it done before: \n", "\n", "* Install python & pandas from [Anaconda](https://www.anaconda.com/products/individual)\n", "* Download `titanic2.zip` data from repository." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "\n", "pd.set_option(\"display.precision\", 2)" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "The Challenge\n", "\n", "The sinking of the Titanic is one of the most infamous shipwrecks in history.\n", "\n", "On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg.\n", "Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew.\n", "\n", "While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.\n", "\n", "In this challenge, we ask you to build a predictive model that answers the question: “what sorts of people were more likely to survive?” using passenger data (ie name, age, gender, socio-economic class, etc).\n", "\n", "(c) [Kaggle](https://www.kaggle.com/c/titanic/overview)" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
passenger_idpclassnamesexagesibspparchticketfarecabinembarkedboatbodyhome.destsurvived
012163Smyth, Miss. JuliafemaleNaN003354327.73NaNQ13NaNNaN1
16993Cacic, Mr. Lukamale38.0003150898.66NaNSNaNNaNCroatia0
212673Van Impe, Mrs. Jean Baptiste (Rosalie Paula Go...female30.01134577324.15NaNSNaNNaNNaN0
34492Hocking, Mrs. Elizabeth (Eliza Needs)female54.0132910523.00NaNS4NaNCornwall / Akron, OH1
45762Veal, Mr. Jamesmale40.0002822113.00NaNSNaNNaNBarre, Co Washington, VT0
\n", "
" ], "text/plain": [ " passenger_id pclass name \\\n", "0 1216 3 Smyth, Miss. Julia \n", "1 699 3 Cacic, Mr. Luka \n", "2 1267 3 Van Impe, Mrs. Jean Baptiste (Rosalie Paula Go... \n", "3 449 2 Hocking, Mrs. Elizabeth (Eliza Needs) \n", "4 576 2 Veal, Mr. James \n", "\n", " sex age sibsp parch ticket fare cabin embarked boat body \\\n", "0 female NaN 0 0 335432 7.73 NaN Q 13 NaN \n", "1 male 38.0 0 0 315089 8.66 NaN S NaN NaN \n", "2 female 30.0 1 1 345773 24.15 NaN S NaN NaN \n", "3 female 54.0 1 3 29105 23.00 NaN S 4 NaN \n", "4 male 40.0 0 0 28221 13.00 NaN S NaN NaN \n", "\n", " home.dest survived \n", "0 NaN 1 \n", "1 Croatia 0 \n", "2 NaN 0 \n", "3 Cornwall / Akron, OH 1 \n", "4 Barre, Co Washington, VT 0 " ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.read_csv('titanic_train.csv')\n", "\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "1. How many rows has the dataframe?\n", "2. How many columns has the dataframe?\n", "3. What is the percentage of non-null values in the age column?\n", "4. How many text columns has the dataframe?\n", "5. How many men and women are in the dataset?\n", "6. What is the average age of men and women? On average who is younger?\n", "7. What is the percentage of passengers travelling in 3rd class cabins? (pclass variable)\n", "8. How much did the most expensive ticket cost? (fare variable)\n", "9. Calculate average age, proportion of females and average fare per pclass.\n", "10. Who is more likely to travel alone men or women? (consider a passenger as travelling alone if he/she has neither siblings/spouse not parents/children)\n", "11. What is the most popular lastname? firstname? (name column)" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [ { "data": { "text/plain": [ "850" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# How many rows has the dataframe?\n", "\n", "df.shape[0]" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [ { "data": { "text/plain": [ "15" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# How many columns has the dataframe?\n", "\n", "df.shape[1]" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [ { "data": { "text/plain": [ "0.7952941176470588" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# What is the percentage of non-null values in the age column?\n", "\n", "df['age'].count() / df.shape[0]\n", "# alternatively:\n", "# df['age'].isna().sum() / df.shape[0]" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [ { "data": { "text/plain": [ "7" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# How many text columns has the dataframe?\n", "\n", "(df.dtypes == 'object').sum()" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [ { "data": { "text/plain": [ "sex\n", "male 551\n", "female 299\n", "Name: count, dtype: int64" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# How many men and women are in the dataset?\n", "\n", "df['sex'].value_counts()" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
avg_age
sex
female28.86
male29.90
\n", "
" ], "text/plain": [ " avg_age\n", "sex \n", "female 28.86\n", "male 29.90" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# What is the average age of men and women? On average who is younger?\n", "\n", "df.groupby('sex').agg(avg_age=('age', 'mean'))\n", "# alternatively:\n", "# df.groupby('sex')['age'].describe() # more statistics" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [ { "data": { "text/plain": [ "pclass\n", "1 0.24\n", "2 0.20\n", "3 0.56\n", "Name: proportion, dtype: float64" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# What is the percentage of passengers travelling in the 3rd class? (pclass variable)\n", "\n", "df['pclass'].value_counts(normalize=True).sort_index()\n", "# alternatively:\n", "# df.groupby('pclass').agg(pass_rel_count=('passenger_id', 'count')) / df.shape[0]" ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [ { "data": { "text/plain": [ "512.3292" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# How much did the most expensive ticket cost? (fare variable)\n", "\n", "df['fare'].max()\n", "# alternatively:\n", "# df.sort_values('fare', ascending=False).head(1)" ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
avg_ageprop_femaleavg_fare
pclass
139.110.4691.15
228.600.4021.26
324.690.2913.77
\n", "
" ], "text/plain": [ " avg_age prop_female avg_fare\n", "pclass \n", "1 39.11 0.46 91.15\n", "2 28.60 0.40 21.26\n", "3 24.69 0.29 13.77" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# calculate average age, proportion of females and average fare per pclass\n", "\n", "(\n", " df\n", " .assign(is_female=lambda _: _['sex'] == 'female')\n", " .groupby('pclass').agg(\n", " avg_age=('age', 'mean'),\n", " prop_female=('is_female', 'mean'),\n", " avg_fare=('fare', 'mean')\n", " )\n", ")" ] }, { "cell_type": "code", "execution_count": 26, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
prop_singles
sex
female0.43
male0.70
\n", "
" ], "text/plain": [ " prop_singles\n", "sex \n", "female 0.43\n", "male 0.70" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Who is more likely to travel alone: men or women? (Consider a passenger as travelling alone if he/she has no siblings/children)\n", "\n", "(\n", " df\n", " .assign(travel_alone=lambda _: (_['parch'] == 0) & (_['sibsp'] == 0))\n", " .groupby('sex').agg(prop_singles=('travel_alone', 'mean'))\n", ")" ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [ { "data": { "text/plain": [ "name\n", "Sage 10\n", "Andersson 8\n", "Goodwin 6\n", "Kelly 5\n", "Rice 5\n", "Name: count, dtype: int64" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# What is the most popular lastname? firstname? (name column)\n", "\n", "df['name'].str.split(',').str[0].value_counts().head(5)" ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [ { "data": { "text/plain": [ "name\n", "mr 492\n", "miss 171\n", "mrs 124\n", "william 51\n", "master 46\n", "john 45\n", "henry 27\n", "james 24\n", "thomas 23\n", "joseph 22\n", "Name: count, dtype: int64" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# most popular first name\n", "\n", "df['name'].str.lower().str.replace('[,\\.\\(\\)\\\"]', '', regex=True).str.split().explode().value_counts().head(10)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.18" } }, "nbformat": 4, "nbformat_minor": 1 }