{ "cells": [ { "cell_type": "markdown", "id": "23c5302f", "metadata": {}, "source": [ "# Modelling I – fitting logistic regression on HomeCredit data\n", "\n", "This notebook contains an example of fitting and evaluating logistic regression model for the lecture Modelling I." ] }, { "cell_type": "markdown", "id": "0533a4a3", "metadata": {}, "source": [ "## Data\n", "\n", "We use the basic dataset of clients *application_train.csv*.\n", "\n", "* target (default = did not repay the loan)\n", "* loan type\n", "* sex, age, family status, count of children and family members\n", "* ownership of car, realty, income type\n", "* 3 external scores" ] }, { "cell_type": "code", "execution_count": null, "id": "3f8ec285", "metadata": {}, "outputs": [], "source": [ "# setup\n", "from sklearn.linear_model import LogisticRegression\n", "from sklearn.model_selection import cross_val_score\n", "import statsmodels.api as sm\n", "import statsmodels.formula.api as smf\n", "import pandas as pd\n", "import numpy as np\n", "import seaborn as sns\n", "\n", "pd.set_option(\"display.precision\", 2)" ] }, { "cell_type": "code", "execution_count": null, "id": "e73ecffd", "metadata": {}, "outputs": [], "source": [ "# Home Credit data reading and preparing (a little)\n", "df_hc = pd.read_csv('application_train.csv') # adjust file path\n", "df_hc.columns = df_hc.columns.str.lower()\n", "\n", "# data reduction - selection of columns\n", "df_hc_colnames = ['sk_id_curr', 'target', 'name_contract_type', 'code_gender', 'flag_own_car',\n", " 'flag_own_realty', 'name_income_type', 'days_birth', 'ext_source_1',\n", " 'ext_source_2', 'ext_source_3', 'cnt_children', 'cnt_fam_members', 'name_family_status']\n", "df_hc = df_hc[df_hc_colnames]\n", "# and renaming to simpler names\n", "df_hc_colnames_new = ['id', 'target', 'loan_type', 'sex', 'has_car',\n", " 'has_realty', 'income_type', 'age_days', 'score1',\n", " 'score2', 'score3', 'cnt_children', 'cnt_fam_members', 'fam_status']\n", "df_hc.columns = df_hc_colnames_new\n", "\n", "# data transformation\n", "df_hc['age_days'] = -df_hc['age_days']\n", "df_hc['age'] = df_hc['age_days'] / 365.25" ] }, { "cell_type": "markdown", "id": "502a9c92", "metadata": {}, "source": [ "## Logistic regression\n", "We take *target* as target and *score2* (and later *sex* and *age*) as predictor(s)." ] }, { "cell_type": "code", "execution_count": null, "id": "1ca764ca", "metadata": {}, "outputs": [], "source": [ "# data preparation\n", "df_hc = df_hc[df_hc['target'].notna() & df_hc['score2'].notna() & df_hc['sex'].notna() & df_hc['age'].notna()]\n", "df_hc = df_hc[df_hc['sex']!='XNA']" ] }, { "cell_type": "code", "execution_count": null, "id": "3fa806ef", "metadata": {}, "outputs": [], "source": [ "print('Target positive rate: ', np.mean(df_hc['target']))" ] }, { "cell_type": "code", "execution_count": null, "id": "3e9e371f", "metadata": {}, "outputs": [], "source": [ "# chart of score2\n", "g = sns.displot(data=df_hc, x=\"score2\") \\\n", " .set_axis_labels(\"Score2\", \"Density\") \\\n", " .set(title=\"Distribution of Score2\")" ] }, { "cell_type": "code", "execution_count": null, "id": "6a6e1440", "metadata": {}, "outputs": [], "source": [ "### let's go modelling\n", "X = df_hc[['score2']]\n", "y = df_hc['target']\n", "\n", "# fit model\n", "modelA = LogisticRegression(solver='newton-cg', penalty='none').fit(X, y)\n", "\n", "# get coefficients\n", "print('Intercept: ', modelA.intercept_)\n", "print('Beta coefficients: ', modelA.coef_)" ] }, { "cell_type": "code", "execution_count": null, "id": "f7f35c39", "metadata": {}, "outputs": [], "source": [ "# chart of fitted positive rate vs. score\n", "hlp_x = np.array(range(101)) / 100.0 # fitting for score from 0 to 1\n", "hlp_x2 = [[x] for x in hlp_x.tolist()]\n", "hlp_y = modelA.predict_proba(hlp_x2) # fitted probabilities\n", "hlp_y = [x[1] for x in hlp_y] # transformation to a list\n", "hlp_x2 = [x[0] for x in hlp_x2] # unlisting the list\n", "hlp_df = pd.DataFrame({'score2': hlp_x2, 'pred': hlp_y}) # auxilliary DataFrame for seaborn\n", "\n", "g = sns.relplot(data=hlp_df, x=\"score2\", y=\"pred\") \\\n", " .set_axis_labels(\"Score2\", \"Probability\") \\\n", " .set(title=\"Predicted probability\")" ] }, { "cell_type": "markdown", "id": "b27265d8", "metadata": {}, "source": [ "After fitting the model, let's evaluate the performance.\n", "\n", "We can compare fitted vs. actual target values on all records, but this is not recommended -- it overrates the model. The correct way is to do **cross-validation**.\n", "\n", "The **metric** use by default in sklearn LogisticRegression is **accuracy**. It's not good in unbalanced situations like here:\n", "* it uses prediction threshold 0.5 when we work with small (or big) probabilities;\n", "* predicting all negative or all positive gives high accuracy" ] }, { "cell_type": "code", "execution_count": null, "id": "2c2880f4", "metadata": {}, "outputs": [], "source": [ "### now evaluate model\n", "# assess model performance\n", "# i. scoring itself directly\n", "print('Accuracy on itself: ', modelA.score(X, y))\n", "# Bad for two reasons:\n", "# 1. generally overestimates the performance\n", "# 2. if the target is highly unbalanced, accuracy is bad matric\n", "\n", "# ii. scoring by a cross-validation\n", "# but use smarter metric instead of accuracy - e. g. ROC AUC\n", "scores = cross_val_score(LogisticRegression(solver='newton-cg', penalty='none'), X, y, cv=5,\n", " scoring='roc_auc')\n", "print('ROC AUC by cval: ', scores)" ] }, { "cell_type": "markdown", "id": "781a13e2", "metadata": {}, "source": [ "It looks like *score2* makes our model much better than null model. The ROC AUC is about 0.65 (null model gives 0.5).\n", "\n", "We can assure ourselves in the significance of *score2* predictor by other view provided by `statmodels` package." ] }, { "cell_type": "code", "execution_count": null, "id": "ae0def59", "metadata": {}, "outputs": [], "source": [ "modelC = smf.logit(\"target ~ score2\", data=df_hc).fit()\n", "print(modelC.summary()) # detailed information of model and coefficients" ] }, { "cell_type": "markdown", "id": "ef16274e", "metadata": {}, "source": [ "> The coefficient at *score2* has p-value much lower than 0.05 threshold, therefore it is significantly different from zero and statistically significant.\n", "\n", "Now we add two predictors more: *sex* and *age*." ] }, { "cell_type": "code", "execution_count": null, "id": "212b6654", "metadata": {}, "outputs": [], "source": [ "# data preparation\n", "# it's necessary to make dummies for categorical predictor \"sex\"\n", "df_hc['sex_orig'] = df_hc['sex'] # backup of the original column - it will be lost!\n", "df_hc = pd.get_dummies(df_hc, columns=['sex'], drop_first=True)\n", "df_hc.rename(columns = {'sex_orig':'sex'}, inplace = True)" ] }, { "cell_type": "code", "execution_count": null, "id": "e017f903", "metadata": {}, "outputs": [], "source": [ "X = df_hc[['score2', 'age', 'sex_M']]\n", "y = df_hc['target']\n", "\n", "# fit model\n", "modelA = LogisticRegression(solver='newton-cg', penalty='none').fit(X, y)\n", "\n", "# get coefficients\n", "print('Intercept: ', modelA.intercept_)\n", "print('Beta coefficients: ', modelA.coef_)" ] }, { "cell_type": "code", "execution_count": null, "id": "8c2ba714", "metadata": {}, "outputs": [], "source": [ "scores = cross_val_score(LogisticRegression(solver='newton-cg', penalty='none'), X, y, cv=5,\n", " scoring='roc_auc')\n", "print('ROC AUC by cval: ', scores)" ] }, { "cell_type": "markdown", "id": "66ddfd23", "metadata": {}, "source": [ "As we can see, the model performance has slightly increased (0.65 -> 0.67). Is the model with three predictors significantly better than the former model? It seems so, but we can judge it by p-values at coefficients for *age* and *sex*. " ] }, { "cell_type": "code", "execution_count": null, "id": "91969b97", "metadata": {}, "outputs": [], "source": [ "modelC = smf.logit(\"target ~ score2 + age + sex_M\", data=df_hc).fit()\n", "print(modelC.summary()) # detailed information of model and coefficients" ] }, { "cell_type": "markdown", "id": "7d461b30", "metadata": {}, "source": [ "> Again, all coefficients at predictors are statistically significant -- different from zero.\n", "\n", "We can compare predicted and actual values. It should be made on predictions from cross-validation, but just for illustration, let's do it on predictions from model itself." ] }, { "cell_type": "code", "execution_count": null, "id": "28523a08", "metadata": {}, "outputs": [], "source": [ "X_pred = df_hc.sample(50)[['target', 'score2', 'age', 'sex_M']]\n", "y_pred = modelC.predict(X_pred) # fitted probabilities\n", "y_act = X_pred[['target']] # actual target values\n", "y_act['pred'] = y_pred\n", "y_act" ] }, { "cell_type": "markdown", "id": "106919d7", "metadata": {}, "source": [ "As we can see, all predicted probabilites are below 0.5. So for applying any smart metric, one have to put the threshold lower than 0.5. Let's try threshold 0.15 and show the confusion matrix." ] }, { "cell_type": "code", "execution_count": null, "id": "721c4d4e", "metadata": {}, "outputs": [], "source": [ "from sklearn.metrics import confusion_matrix\n", "y_class = (y_pred > 0.15)\n", "confusion_matrix(y_act['target'], y_class)" ] }, { "cell_type": "markdown", "id": "0d46820d", "metadata": {}, "source": [ "## Try it yourself\n", "Now try to use another columns as predictors and to improve risk model. Consider binning some numeric columns to categories." ] }, { "cell_type": "code", "execution_count": null, "id": "a9f3a6ea", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.8" } }, "nbformat": 4, "nbformat_minor": 5 }