# Nearest neighbors

This is an example of the *k nearest neighbors (kNN)* model. We will work with the `application_train.csv` data and try to predict whether the target is positive. We will use three scores (*score1*, *score2*, *score3*) as predictors, so we keep only the rows where all three are non-NA.

* [Nearest Neighbors generally in scikit-learn](https://scikit-learn.org/stable/modules/neighbors.html)
* [kNN classification example](https://scikit-learn.org/stable/auto_examples/neighbors/plot_classification.html)
* [KNeighborsClassifier documentation](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html#sklearn.neighbors.KNeighborsClassifier)

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns

from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

In [None]:
# Home Credit data reading and preparing (a little)
df_hc = pd.read_csv('application_train.csv') # adjust file path
df_hc.columns = df_hc.columns.str.lower()

# data reduction - selection of columns
df_hc_colnames = ['sk_id_curr', 'target', 'name_contract_type', 'code_gender', 'flag_own_car',
 'flag_own_realty', 'days_birth', 'ext_source_1',
 'ext_source_2', 'ext_source_3', 'cnt_children', 'cnt_fam_members', 'name_family_status']
df_hc = df_hc[df_hc_colnames]
# and renaming to simpler names
df_hc_colnames_new = ['id', 'target', 'loan_type', 'sex', 'has_car',
 'has_realty', 'age_days', 'score1',
 'score2', 'score3', 'cnt_children', 'cnt_fam_members', 'fam_status']
df_hc.columns = df_hc_colnames_new

# data transformation
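# keep only rows with all scores known (.copy() avoids SettingWithCopyWarning below)
df_hc = df_hc[df_hc['score1'].notna() & df_hc['score2'].notna() & df_hc['score3'].notna()].copy()
df_hc['age_days'] = -df_hc['age_days']  # flip the sign so that age comes out positive
df_hc['age'] = df_hc['age_days'] / 365.25
df_hc = df_hc.drop(['age_days'], axis=1)  # drop() returns a new frame, so assign it back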
df_hc

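From the base dataset we take the train and validation sets as random samples. Because kNN relies on distances between observations, it is reasonable (though not strictly required) to scale all predictor columns to zero mean and unit variance before computing them, so that no single column dominates the distance.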
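One way to do the scaling is `StandardScaler` from `sklearn.preprocessing`. The following is a minimal sketch on synthetic data, not the Home Credit file: the `demo` frame and its 500/500 split are made up for illustration. It assumes we want to fit the scaler on the training part only, so that validation rows do not influence the mean and SD.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the three score columns (an assumption, not real data)
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.uniform(0, 1, size=(1000, 3)),
                    columns=['score1', 'score2', 'score3'])

demo_train = demo.iloc[:500]
demo_val = demo.iloc[500:]

# Fit the scaler on the training part only, then apply it to both parts
scaler = StandardScaler()
train_std = scaler.fit_transform(demo_train)
val_std = scaler.transform(demo_val)

print('Train means after scaling:', train_std.mean(axis=0).round(3))
print('Train SDs after scaling:  ', train_std.std(axis=0).round(3))
print('Validation means after scaling:', val_std.mean(axis=0).round(3))
```

Note that `StandardScaler` divides by the population SD (`ddof=0`), while pandas `.std()` uses the sample SD (`ddof=1`) by default, so the two approaches differ very slightly on finite samples.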

In [None]:
# making a sample for kNN
df_hc2 = df_hc.sample(n=10000)

# scaling scores to zero mean and SD 1
# StandardScaler from sklearn.preprocessing can be used, too
df_hc2['score1_std'] = (df_hc2['score1'] - df_hc2['score1'].mean()) / df_hc2['score1'].std()
df_hc2['score2_std'] = (df_hc2['score2'] - df_hc2['score2'].mean()) / df_hc2['score2'].std()
df_hc2['score3_std'] = (df_hc2['score3'] - df_hc2['score3'].mean()) / df_hc2['score3'].std()

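# dividing to train/test and validation
# (.copy() so that adding prediction columns later does not raise SettingWithCopyWarning)
df_hc_tt = df_hc2.iloc[:5000].copy()
df_hc_val = df_hc2.iloc[5000:].copy()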

In [None]:
X = df_hc_tt[['score1_std', 'score2_std', 'score3_std']]
y = df_hc_tt['target']
X_val = df_hc_val[['score1_std', 'score2_std', 'score3_std']]
y_val = df_hc_val['target']

neigh = KNeighborsClassifier(n_neighbors=20)
neigh.fit(X, y)

In [None]:
# performance on train-test dataset
print('Accuracy on itself (train data): ', neigh.score(X,y))

# cross-validation
scores = cross_val_score(KNeighborsClassifier(n_neighbors=20), X, y, cv=5)
print('Accuracy by cval: ', scores)

scores = cross_val_score(KNeighborsClassifier(n_neighbors=20), X, y, cv=5,
 scoring='roc_auc')
print('ROC AUC by cval: ', scores)

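Although the accuracy seems very high, this is caused by the low overall share of positives (only around 8 %): a model that always predicts the negative class would score about as well. Because positive cases are sparse, the predicted probability of the positive class will mostly be low and below 0.5, so the default threshold rarely yields a positive classification. We can therefore shift the threshold lower, say, to 0.35.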
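To see how much of the accuracy comes from class imbalance alone, it can be compared with a trivial baseline. The sketch below uses synthetic labels with roughly 8 % positives (an assumption mimicking the share mentioned above, not the real data) and scikit-learn's `DummyClassifier`, which here always predicts the most frequent class.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Synthetic labels with about 8 % positives; features are pure noise
rng = np.random.default_rng(0)
y_demo = (rng.uniform(size=5000) < 0.08).astype(int)
X_demo = rng.normal(size=(5000, 3))

# Baseline that ignores the features and always predicts the majority class
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X_demo, y_demo)
baseline_acc = baseline.score(X_demo, y_demo)

print('Share of positives:', y_demo.mean())
print('Accuracy of always predicting 0:', baseline_acc)
```

The baseline accuracy equals one minus the share of positives, so any kNN accuracy should be judged against that number rather than against 0.5.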

In [None]:
# performance on validation dataset
predp = neigh.predict_proba(X_val)  # probabilities of categories
clas = neigh.predict(X_val)  # result of classification
df_hc_val.loc[:, 'pred'] = clas

print('Validation data -- classification vs. true category:')
print(pd.crosstab(df_hc_val['target'], df_hc_val['pred']))

print('Accuracy on validation data: ', neigh.score(X_val, y_val))

# moving classification threshold lower to get more positive classifications
df_hc_val.loc[:, 'pred2'] = (predp[:, 1] > 0.35).astype(int)
print('Validation data -- classification vs. true category (threshold 0.35):')
pomtab = pd.crosstab(df_hc_val['target'], df_hc_val['pred2'])
print(pomtab)

print('Accuracy on validation data (threshold 0.35): ', (pomtab.loc[0, 0] + pomtab.loc[1, 1]) / pomtab.sum().sum())
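Accuracy alone hides what the threshold shift changes. As a rough sketch, precision and recall can be tracked for several thresholds. The data here are generated with `make_classification` (an assumption standing in for the scaled scores), and the thresholds 0.5, 0.35 and 0.2 are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic imbalanced data: about 92 % negatives, 3 informative features
X_demo, y_demo = make_classification(n_samples=4000, n_features=3,
                                     n_informative=3, n_redundant=0,
                                     weights=[0.92], random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X_demo, y_demo, test_size=0.5,
                                          stratify=y_demo, random_state=0)

knn = KNeighborsClassifier(n_neighbors=20).fit(X_tr, y_tr)
p_pos = knn.predict_proba(X_va)[:, 1]  # estimated probability of class 1

# Precision and recall of the positive class for each threshold
results = {}
for thr in (0.5, 0.35, 0.2):
    pred = (p_pos > thr).astype(int)
    results[thr] = (precision_score(y_va, pred, zero_division=0),
                    recall_score(y_va, pred))
    print(f'threshold {thr}: precision {results[thr][0]:.3f}, '
          f'recall {results[thr][1]:.3f}')
```

Lowering the threshold can only add positive classifications, so recall never decreases; precision usually pays for it.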

Now try it yourself: add another predictor (e.g. age) and use a different parametrization (number of neighbors, sample size, weighted aggregation of "votes", etc.).
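As a possible starting point, several parametrizations can be compared in one go with `GridSearchCV`. This is only a sketch on synthetic data: the grid values and the fourth feature (standing in for something like a scaled age) are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic imbalanced data with four predictors (not the Home Credit data)
X_demo, y_demo = make_classification(n_samples=2000, n_features=4,
                                     n_informative=3, n_redundant=1,
                                     weights=[0.92], random_state=0)

# Candidate settings: number of neighbors and how the "votes" are weighted
param_grid = {
    'n_neighbors': [5, 20, 50],
    'weights': ['uniform', 'distance'],
}

search = GridSearchCV(KNeighborsClassifier(), param_grid,
                      cv=5, scoring='roc_auc')
search.fit(X_demo, y_demo)

print('Best parameters:', search.best_params_)
print('Best mean ROC AUC:', search.best_score_)
```

With `weights='distance'`, closer neighbors get a larger say in the vote, which is one form of the weighted aggregation mentioned above.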