{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "631f0979",
   "metadata": {},
   "source": [
    "# Nearest neighbors"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4f7d5e68",
   "metadata": {},
   "source": [
    "This notebook demonstrates the *k nearest neighbors (kNN)* model. We work with the `application_train.csv` data and try to predict whether the target is positive. We use three scores (*score1*, *score2*, *score3*) as predictors, so we keep only rows where all three are non-NA.\n",
    "\n",
    "* [Nearest Neighbors generally in scikit-learn](https://scikit-learn.org/stable/modules/neighbors.html)\n",
    "* [kNN classification example](https://scikit-learn.org/stable/auto_examples/neighbors/plot_classification.html)\n",
    "* [KNeighborsClassifier documentation](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html#sklearn.neighbors.KNeighborsClassifier)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "80a6699e",
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import numpy as np\n",
    "import seaborn as sns\n",
    "\n",
    "from sklearn.neighbors import KNeighborsClassifier\n",
    "from sklearn.model_selection import cross_val_score"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "fead6274",
   "metadata": {},
   "outputs": [],
   "source": [
    "# read the Home Credit data and do some light preparation\n",
    "df_hc = pd.read_csv('application_train.csv') # adjust file path\n",
    "df_hc.columns = df_hc.columns.str.lower()\n",
    "\n",
    "# data reduction - selection of columns\n",
    "df_hc_colnames = ['sk_id_curr', 'target', 'name_contract_type', 'code_gender', 'flag_own_car',\n",
    "                 'flag_own_realty', 'days_birth', 'ext_source_1',\n",
    "                 'ext_source_2', 'ext_source_3', 'cnt_children', 'cnt_fam_members', 'name_family_status']\n",
    "df_hc = df_hc[df_hc_colnames]\n",
    "# and renaming to simpler names\n",
    "df_hc_colnames_new = ['id', 'target', 'loan_type', 'sex', 'has_car',\n",
    "                 'has_realty', 'age_days', 'score1',\n",
    "                 'score2', 'score3', 'cnt_children', 'cnt_fam_members', 'fam_status']\n",
    "df_hc.columns = df_hc_colnames_new\n",
    "\n",
    "# data transformation\n",
    "# keep only rows with all scores known (copy to avoid SettingWithCopyWarning below)\n",
    "df_hc = df_hc.dropna(subset=['score1', 'score2', 'score3']).copy()\n",
    "# age_days is negated before converting it to age in years\n",
    "df_hc['age'] = -df_hc['age_days'] / 365.25\n",
    "# drop() returns a new DataFrame, so the result has to be assigned back\n",
    "df_hc = df_hc.drop(columns=['age_days'])\n",
    "df_hc"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ca8eafcc",
   "metadata": {},
   "source": [
    "We take the train/test and validation sets as random samples from the base dataset. Because kNN relies on distances between observations, it is reasonable (though not strictly required) to scale all predictors to zero mean and unit variance first, so that no single column dominates the distance."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "20caeb13",
   "metadata": {},
   "outputs": [],
   "source": [
    "# random sample for kNN (arbitrary fixed seed so the results are reproducible)\n",
    "df_hc2 = df_hc.sample(n=10000, random_state=0)\n",
    "\n",
    "# scaling scores to zero mean and SD 1\n",
    "# StandardScaler from sklearn.preprocessing can be used, too\n",
    "for col in ['score1', 'score2', 'score3']:\n",
    "    df_hc2[col + '_std'] = (df_hc2[col] - df_hc2[col].mean()) / df_hc2[col].std()\n",
    "\n",
    "# splitting into train/test and validation parts\n",
    "# (copies, so that columns can be added later without SettingWithCopyWarning)\n",
    "df_hc_tt = df_hc2[:5000].copy()\n",
    "df_hc_val = df_hc2[5000:].copy()"
   ]
  },
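  {
   "cell_type": "markdown",
   "id": "b7e2a914",
   "metadata": {},
   "source": [
    "The comment above mentions `StandardScaler` as an alternative. The cell below is a minimal sketch of that approach, not a replacement for the manual scaling: it assumes `df_hc_tt` and `df_hc_val` from the previous cell, fits the scaler on the train/test part only and reuses the fitted scaler for the validation part. The names `score_cols`, `scaler`, `X_alt` and `X_val_alt` are illustrative and are not used elsewhere in this notebook. Note that `StandardScaler` divides by the population SD (`ddof=0`) while pandas `.std()` uses `ddof=1`, so the values differ very slightly from the manual version."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c3d9f051",
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.preprocessing import StandardScaler\n",
    "\n",
    "score_cols = ['score1', 'score2', 'score3']  # illustrative helper name\n",
    "\n",
    "# fit the scaler on the train/test part only, then apply it to both parts\n",
    "scaler = StandardScaler().fit(df_hc_tt[score_cols])\n",
    "X_alt = scaler.transform(df_hc_tt[score_cols])\n",
    "X_val_alt = scaler.transform(df_hc_val[score_cols])\n",
    "\n",
    "# the scaled train/test columns should have mean close to 0 and SD close to 1\n",
    "print('Column means: ', X_alt.mean(axis=0))\n",
    "print('Column SDs:   ', X_alt.std(axis=0))"
   ]
  },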
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "39fd0c32",
   "metadata": {},
   "outputs": [],
   "source": [
    "X = df_hc_tt[['score1_std', 'score2_std', 'score3_std']]\n",
    "y = df_hc_tt['target']\n",
    "X_val = df_hc_val[['score1_std', 'score2_std', 'score3_std']]\n",
    "y_val = df_hc_val['target']\n",
    "\n",
    "neigh = KNeighborsClassifier(n_neighbors=20)\n",
    "neigh.fit(X, y)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "900a5eb0",
   "metadata": {},
   "outputs": [],
   "source": [
    "# performance on the data the model was fitted on (an optimistic estimate)\n",
    "print('Accuracy on itself (train data): ', neigh.score(X, y))\n",
    "\n",
    "# 5-fold cross-validation: accuracy\n",
    "scores = cross_val_score(KNeighborsClassifier(n_neighbors=20), X, y, cv=5)\n",
    "print('Accuracy by cross-validation: ', scores)\n",
    "\n",
    "# 5-fold cross-validation: area under the ROC curve\n",
    "scores = cross_val_score(KNeighborsClassifier(n_neighbors=20), X, y, cv=5,\n",
    "                         scoring='roc_auc')\n",
    "print('ROC AUC by cross-validation: ', scores)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "09ecc591",
   "metadata": {},
   "source": [
    "The accuracy looks very high, but this is mostly due to the low overall share of positive cases (only around 8 %): a model that always predicts the negative class would reach a similar accuracy. Because positive cases are sparse, they rarely form a majority among the nearest neighbors, so the predicted probability of the positive class mostly stays below 0.5. We can therefore lower the classification threshold, say, to 0.35."
   ]
  },
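  {
   "cell_type": "markdown",
   "id": "d5a1c8e3",
   "metadata": {},
   "source": [
    "To put the accuracy into context, it can be compared with a trivial baseline. The cell below is a small sketch, assuming `X`, `y`, `X_val`, `y_val` and `neigh` from the cells above; it uses `DummyClassifier` from scikit-learn with the `most_frequent` strategy, which always predicts the majority class. The name `baseline` is illustrative."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e8f40b27",
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.dummy import DummyClassifier\n",
    "\n",
    "# baseline that ignores the predictors and always predicts the most frequent class\n",
    "baseline = DummyClassifier(strategy='most_frequent').fit(X, y)\n",
    "\n",
    "print('Share of positive cases in train data: ', y.mean())\n",
    "print('Baseline accuracy on validation data:  ', baseline.score(X_val, y_val))\n",
    "print('kNN accuracy on validation data:       ', neigh.score(X_val, y_val))"
   ]
  },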
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "504e4d96",
   "metadata": {},
   "outputs": [],
   "source": [
    "# performance on validation dataset\n",
    "predp = neigh.predict_proba(X_val) # probabilities of categories\n",
    "clas = neigh.predict(X_val) # result of classification\n",
    "df_hc_val.loc[:, 'pred'] = clas\n",
    "\n",
    "print('Validation data -- classification vs. true category:')\n",
    "print(pd.crosstab(df_hc_val['target'], df_hc_val['pred']))\n",
    "\n",
    "print('Accuracy on validation data: ', neigh.score(X_val, y_val))\n",
    "\n",
    "# moving classification threshold lower to get more positive classifications\n",
    "df_hc_val.loc[:, 'pred2'] = (predp[:, 1] > 0.35).astype(int)\n",
    "print('Validation data -- classification vs. true category (threshold 0.35):')\n",
    "pomtab = pd.crosstab(df_hc_val['target'], df_hc_val['pred2'])\n",
    "print(pomtab)\n",
    "\n",
    "# share of matching labels; unlike indexing pomtab, this also works if a class is never predicted\n",
    "print('Accuracy on validation data (threshold 0.35): ', (df_hc_val['pred2'] == df_hc_val['target']).mean())"
   ]
  },
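  {
   "cell_type": "markdown",
   "id": "f2b6d7a9",
   "metadata": {},
   "source": [
    "With such an imbalanced target, accuracy alone says little about how a threshold treats the positive class. The cell below is a hedged sketch that compares precision and recall for the two thresholds used above, assuming `neigh`, `X_val` and `y_val` from the earlier cells. The names `p_pos`, `thr` and `pred_thr` are illustrative, and `zero_division=0` only suppresses the warning in case no positive prediction is made."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a94c1e60",
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.metrics import precision_score, recall_score\n",
    "\n",
    "p_pos = neigh.predict_proba(X_val)[:, 1]  # predicted probability of target == 1\n",
    "\n",
    "for thr in [0.5, 0.35]:\n",
    "    pred_thr = (p_pos > thr).astype(int)  # 1 if the probability exceeds the threshold\n",
    "    prec = precision_score(y_val, pred_thr, zero_division=0)\n",
    "    rec = recall_score(y_val, pred_thr, zero_division=0)\n",
    "    print(f'threshold {thr}: precision {prec:.3f}, recall {rec:.3f}')"
   ]
  },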
  {
   "cell_type": "markdown",
   "id": "95b3c07a",
   "metadata": {},
   "source": [
    "Now try it yourself: add another predictor (e.g. age) and experiment with different settings (number of neighbors, sample size, weighted aggregation of \"votes\", etc.)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1b6d900d",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.18"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}