{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "0533a4a3",
   "metadata": {},
   "source": [
    "# Modelling I – fitting linear regression on Titanic data\n",
    "\n",
    "This notebook contains an example of fitting and evaluating linear regression model on Titanic data. We will use tickets as modelling units (rows, entities), *fare* as target (possibly log fare) and various features as predictors.\n",
    "\n",
    "## Data\n",
    "\n",
    "We use the dataset Titanic and data preparation from the recent practice (see Data Preparation)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3f8ec285",
   "metadata": {},
   "outputs": [],
   "source": [
    "# setup\n",
    "import pandas as pd\n",
    "import numpy as np\n",
    "import seaborn as sns\n",
    "from sklearn.linear_model import LinearRegression\n",
    "from sklearn.model_selection import cross_val_score\n",
    "import statsmodels.api as sm\n",
    "import statsmodels.formula.api as smf\n",
    "\n",
    "pd.set_option(\"display.precision\", 2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f8c7abc8",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Titanic data reading and preparing - reminder from `Data Preparation` practice\n",
    "df_t1 = pd.read_csv('titanic_train.csv') # adjust file path\n",
    "df_t1 = df_t1[['passenger_id', 'ticket', 'pclass', 'fare', 'sex', 'age', 'cabin', 'embarked']]\n",
    "\n",
    "# cleaning\n",
    "df_t1 = df_t1[df_t1['fare'].notna() & (df_t1['fare']>0) & (df_t1['embarked'].notna())]\n",
    "\n",
    "# making new dataset of tickets\n",
    "# User function\n",
    "def rate_males(s):\n",
    "    return np.mean(np.where(s=='male', 1, 0))\n",
    "\n",
    "### Base table\n",
    "df_t2_base = df_t1[['ticket', 'pclass', 'fare']].drop_duplicates()\n",
    "df_t2_base = df_t2_base.set_index('ticket') # setting 'ticket' column as key\n",
    "\n",
    "### Multiple embarkment solution\n",
    "df_t2_emb = df_t1.groupby('ticket').agg({'embarked': 'max'})\n",
    "# no need to set index - groupby + agg sets index by default\n",
    "\n",
    "### Some chosen features\n",
    "df_t2_feat = df_t1.groupby('ticket').agg({'ticket': 'count', 'sex': [rate_males],\n",
    "                                      'age': ['min', 'max', np.mean, 'count'], 'cabin': 'nunique'})\n",
    "# column names update\n",
    "df_t2_feat.columns = ['pass_cnt', 'rate_males', 'age_min', 'age_max', 'age_mean', 'age_valid_cnt', 'cabin_cnt']\n",
    "\n",
    "# sex of the oldest person for the ticket\n",
    "df_t2_feat_sex_oldest = df_t1.sort_values(by=['ticket', 'age'], ascending=[True, False]) \\\n",
    "    .drop_duplicates('ticket')[['ticket', 'sex']]\n",
    "df_t2_feat_sex_oldest = df_t2_feat_sex_oldest.set_index('ticket') # setting 'ticket' column as key\n",
    "df_t2_feat_sex_oldest.columns = ['sex_oldest']\n",
    "\n",
    "### Joining tables together\n",
    "df_t2 = df_t2_base.join(df_t2_emb) # join is by default LEFT and index<->index\n",
    "df_t2 = df_t2.join(df_t2_feat)\n",
    "df_t2 = df_t2.join(df_t2_feat_sex_oldest)\n",
    "\n",
    "# mathematical transformations\n",
    "df_t2['fare_log'] = np.log10(df_t2['fare']) # we use log10 for better interpretation, but simple log is ok, too.\n",
    "df_t2['fare_per_pass'] = df_t2['fare'] / df_t2['pass_cnt']\n",
    "\n",
    "# binning, making categories and flags\n",
    "### pass_cnt\n",
    "df_t2['pass_cnt_cat'] = pd.cut(df_t2['pass_cnt'], [0, 1, 2, 3, 1000], labels=['1', '2', '3', '4+'])\n",
    "\n",
    "### age_mean\n",
    "df_t2['age_mean_cat'] = pd.cut(df_t2['age_mean'], [0, 15, 20, 25, 30, 40, 1000],\n",
    "                             labels=['15-', '15-20', '20-25', '25-30', '30-40', '40+'])\n",
    "\n",
    "### cabin_cnt (same approach as pass_cnt)\n",
    "df_t2['cabin_cnt_cat'] = pd.cut(df_t2['cabin_cnt'], [0, 1, 2, 1000], right=False, labels=['none', '1', '2+'])\n",
    "\n",
    "# flags\n",
    "df_t2['flag_child'] = (df_t2['age_min'] < 15)\n",
    "df_t2['flag_baby'] = (df_t2['age_min'] < 3)\n",
    "\n",
    "### cleanup\n",
    "del df_t2_base\n",
    "del df_t2_emb\n",
    "del df_t2_feat\n",
    "del df_t2_feat_sex_oldest"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f29e14aa",
   "metadata": {},
   "outputs": [],
   "source": [
    "df_t2"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "62820c88",
   "metadata": {},
   "source": [
    "## Linear regression\n",
    "\n",
    "We learned that *fare* is very skew, we have transformed it by log10. So we take *fare_log* as target and *embarked*, *pclass* and *pass_cnt* as predictors."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3548cf7a",
   "metadata": {},
   "outputs": [],
   "source": [
    "g = sns.displot(data=df_t2, x=\"fare_log\", kind=\"kde\") \\\n",
    "    .set_axis_labels(\"Log10(ticket fare)\", \"Density\") \\\n",
    "    .set(title=\"Distribution of log ticket fare\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b83f69b1",
   "metadata": {},
   "outputs": [],
   "source": [
    "X = df_t2[['pass_cnt']]\n",
    "y = df_t2['fare_log']\n",
    "\n",
    "# fit model\n",
    "modelA = LinearRegression().fit(X, y)\n",
    "\n",
    "# get coefficients\n",
    "print('Intercept: ', modelA.intercept_)\n",
    "print('Beta coefficients: ', modelA.coef_)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b60134f9",
   "metadata": {},
   "source": [
    "Try to interpret coefficients.\n",
    "\n",
    "How is the model performing? Let's go for a cross-validation and R2 score."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3b0774b0",
   "metadata": {},
   "outputs": [],
   "source": [
    "scores = cross_val_score(LinearRegression(), X, y, cv=4)\n",
    "print('R2 by cval: ', scores)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5b6ee763",
   "metadata": {},
   "source": [
    "Not bad. Let's look at the coefficient significance."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ab291e6a",
   "metadata": {},
   "outputs": [],
   "source": [
    "modelC = smf.ols(\"fare_log ~ pass_cnt\", data=df_t2).fit()\n",
    "print(modelC.summary()) # detailed information of model and coefficients"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3c9af8ae",
   "metadata": {},
   "source": [
    "Ok. Try yourself to improve the model by adding *pclass*, *embarked* and maybe some other predictors.  \n",
    "**Warning!** *pclass* is encoded as integer but in fact this is categorical predictor."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ac655e97",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.8"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}