If you haven't done it before:

* Install Python and pandas from [Anaconda](https://www.anaconda.com/products/individual)
* Download the `titanic2.zip` data from the repository.
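
The cells below read `titanic_train.csv`, so the archive has to be unpacked first. Here is a minimal sketch using only the standard library. It assumes `titanic2.zip` has been saved to the working directory; the helper name `extract_archive` is hypothetical, and the exact contents of the archive are not listed here.

```python
import zipfile
from pathlib import Path


def extract_archive(zip_path, target_dir="."):
    """Unpack every file in the archive and return the extracted member names."""
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target_dir)
        return archive.namelist()


# Assumption: the archive sits next to the notebook; skip quietly if it does not.
if Path("titanic2.zip").exists():
    print(extract_archive("titanic2.zip"))
```

The printed list shows which CSV files are now available to `pd.read_csv`.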

In [5]:
import pandas as pd

pd.set_option("display.precision", 2)

The Challenge

The sinking of the Titanic is one of the most infamous shipwrecks in history.

On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg.
Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew.

While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.

In this challenge, we ask you to build a predictive model that answers the question: “what sorts of people were more likely to survive?” using passenger data (ie name, age, gender, socio-economic class, etc).

(c) [Kaggle](https://www.kaggle.com/c/titanic/overview)

In [6]:
df = pd.read_csv('titanic_train.csv')

df.head()

Unnamed: 0,passenger_id,pclass,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest,survived
0,1216,3,"Smyth, Miss. Julia",female,,0,0,335432,7.73,,Q,13.0,,,1
1,699,3,"Cacic, Mr. Luka",male,38.0,0,0,315089,8.66,,S,,,Croatia,0
2,1267,3,"Van Impe, Mrs. Jean Baptiste (Rosalie Paula Go...",female,30.0,1,1,345773,24.15,,S,,,,0
3,449,2,"Hocking, Mrs. Elizabeth (Eliza Needs)",female,54.0,1,3,29105,23.0,,S,4.0,,"Cornwall / Akron, OH",1
4,576,2,"Veal, Mr. James",male,40.0,0,0,28221,13.0,,S,,,"Barre, Co Washington, VT",0


1. How many rows does the dataframe have?
2. How many columns does the dataframe have?
3. What is the percentage of non-null values in the age column?
4. How many text columns does the dataframe have?
5. How many men and women are in the dataset?
6. What is the average age of men and women? On average, who is younger?
7. What is the percentage of passengers travelling in 3rd class cabins? (pclass variable)
8. How much did the most expensive ticket cost? (fare variable)
9. Calculate the average age, proportion of females and average fare per pclass.
10. Who is more likely to travel alone: men or women? (Consider a passenger as travelling alone if he/she has neither siblings/spouse nor parents/children.)
11. What is the most popular last name? First name? (name column)

In [7]:
# How many rows does the dataframe have?

df.shape[0]

850

In [8]:
# How many columns does the dataframe have?

df.shape[1]

15

In [21]:
# What is the percentage of non-null values in the age column?

df['age'].count() / df.shape[0]  # fraction of non-null values; multiply by 100 for a percentage
# alternatively:
# df['age'].notna().mean()

0.7952941176470588
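
The result above is a fraction, while the question asks for a percentage. Here is a small sketch of the conversion on a toy series. The `ages` values are made up for illustration and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy data (an assumption for illustration): five ages, one of them missing.
ages = pd.Series([38.0, np.nan, 30.0, 54.0, 40.0], name="age")

# notna() yields booleans; their mean is the share of non-null values.
share_non_null = ages.notna().mean()

# Scale to a percentage and round for display.
pct_non_null = round(100 * share_non_null, 2)
print(pct_non_null)
```

The same two steps should apply unchanged to `df['age']`.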

In [11]:
# How many text columns does the dataframe have?

(df.dtypes == 'object').sum()

7

In [13]:
# How many men and women are in the dataset?

df['sex'].value_counts()

sex
male      551
female    299
Name: count, dtype: int64

In [19]:
# What is the average age of men and women? On average who is younger?

df.groupby('sex').agg(avg_age=('age', 'mean'))
# alternatively:
# df.groupby('sex')['age'].describe() # more statistics

sex,avg_age
female,28.86
male,29.9


In [20]:
# What is the percentage of passengers travelling in the 3rd class? (pclass variable)

df['pclass'].value_counts(normalize=True).sort_index()
# alternatively:
# df.groupby('pclass').agg(pass_rel_count=('passenger_id', 'count')) / df.shape[0]

pclass
1    0.24
2    0.20
3    0.56
Name: proportion, dtype: float64

In [24]:
# How much did the most expensive ticket cost? (fare variable)

df['fare'].max()
# alternatively:
# df.sort_values('fare', ascending=False).head(1)

512.3292

In [25]:
# Calculate the average age, proportion of females and average fare per pclass

(
    df
    .assign(is_female=lambda d: d['sex'] == 'female')
    .groupby('pclass').agg(
        avg_age=('age', 'mean'),
        prop_female=('is_female', 'mean'),
        avg_fare=('fare', 'mean')
    )
)

pclass,avg_age,prop_female,avg_fare
1,39.11,0.46,91.15
2,28.6,0.4,21.26
3,24.69,0.29,13.77


In [26]:
# Who is more likely to travel alone: men or women? (Consider a passenger as travelling alone if he/she has neither siblings/spouse nor parents/children)

(
    df
    .assign(travel_alone=lambda d: (d['parch'] == 0) & (d['sibsp'] == 0))
    .groupby('sex').agg(prop_singles=('travel_alone', 'mean'))
)

sex,prop_singles
female,0.43
male,0.7


In [27]:
# What is the most popular last name? (name column)

df['name'].str.split(',').str[0].value_counts().head(5)

name
Sage         10
Andersson     8
Goodwin       6
Kelly         5
Rice          5
Name: count, dtype: int64

In [28]:
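# Most popular first name. The count covers every word in the name, so titles
# (mr, miss, mrs, master) and surnames appear too; the top non-title token is william.

df['name'].str.lower().str.replace(r'[,.()"]', '', regex=True).str.split().explode().value_counts().head(10)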

name
mr         492
miss       171
mrs        124
william     51
master      46
john        45
henry       27
james       24
thomas      23
joseph      22
Name: count, dtype: int64
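
The token count above mixes titles, surnames and given names. Here is one way to isolate first names, sketched under the assumption that every name follows the `Last, Title. First ...` layout seen in the `df.head()` output. The regular expression is illustrative and is not the notebook's own method.

```python
import pandas as pd

# Sample names copied from the df.head() output earlier in the notebook.
names = pd.Series(
    [
        "Smyth, Miss. Julia",
        "Cacic, Mr. Luka",
        "Hocking, Mrs. Elizabeth (Eliza Needs)",
        "Veal, Mr. James",
    ],
    name="name",
)

# Assumed layout: "<last name>, <title>. <first name> ..."
# Skip the surname and the title, then capture the first word after the title's period.
first_names = names.str.extract(r",\s*[^.]+\.\s*(\w+)", expand=False)

print(first_names.str.lower().value_counts().head(10))
```

On the full dataset the same expression would be applied to `df['name']`. Entries for married women appear to list the husband's name before the parentheses, so the counts should be treated as approximate.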