# Naive Bayes classifier

We work with the Home Credit data (`application_train.csv`) and try to predict whether an applicant is male or female. For this purpose we use the Categorical Naive Bayes model, which requires all predictors to be categorical.

In the first stage we use three predictors: family status, number of children, and car ownership. You can then add other categorical predictors, such as realty ownership or `loan_type`, or numerical predictors binned into categories (e.g. age).

* [Naive Bayes in scikit-learn (general overview)](https://scikit-learn.org/stable/modules/naive_bayes.html)
* [CategoricalNB documentation](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.CategoricalNB.html)
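
Binning a numerical predictor such as age can be sketched as follows. This is a minimal illustration on made-up values, not part of the original analysis: the toy frame, the bin edges, and the labels are assumptions, and it assumes `age_days` holds the age as a negative number of days, as in the Home Credit `DAYS_BIRTH` column.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the 'age_days' column (assumed: negative days relative to the application date)
toy = pd.DataFrame({'age_days': [-9461, -16765, -19046, -19932, -12005, -23000]})

# convert to years, then cut into labelled intervals (edges chosen arbitrarily for illustration)
toy['age_years'] = -toy['age_days'] / 365.25
bins = [0, 30, 40, 50, 60, np.inf]
labels = ['<30', '30-39', '40-49', '50-59', '60+']
toy['age_cat'] = pd.cut(toy['age_years'], bins=bins, labels=labels, right=False)

# integer codes 0, 1, ..., K-1, the form CategoricalNB expects
toy['age_cat2'] = toy['age_cat'].cat.codes
print(toy)
```

With `right=False` each interval includes its left edge, so an applicant aged exactly 30 falls into `'30-39'`.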

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns

from sklearn.naive_bayes import CategoricalNB
from sklearn.preprocessing import OrdinalEncoder
from sklearn.model_selection import cross_val_score

In [None]:
# read the Home Credit data and prepare it lightly
df_hc = pd.read_csv('application_train.csv') # adjust file path
df_hc.columns = df_hc.columns.str.lower()

# data reduction - selection of columns
df_hc_colnames = ['sk_id_curr', 'target', 'name_contract_type', 'code_gender', 'flag_own_car',
 'flag_own_realty', 'days_birth', 'ext_source_1',
 'ext_source_2', 'ext_source_3', 'cnt_children', 'cnt_fam_members', 'name_family_status']
df_hc = df_hc[df_hc_colnames]
# and renaming to simpler names
df_hc_colnames_new = ['id', 'target', 'loan_type', 'sex', 'has_car',
 'has_realty', 'age_days', 'score1',
 'score2', 'score3', 'cnt_children', 'cnt_fam_members', 'fam_status']
df_hc.columns = df_hc_colnames_new

# data transformation
df_hc = df_hc[df_hc['sex'] != 'XNA'].copy() # drop rows with unknown sex; .copy() avoids SettingWithCopyWarning below
df_hc['cnt_children'] = np.minimum(df_hc['cnt_children'], 4) # merge 4+ children into one category
df_hc

We draw a random sample from the whole dataset and split it into a train set and a validation set. Then we look at possible relationships between the target (here `sex`) and the predictors.
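
An alternative way to split can be sketched with `train_test_split`. This is a hedged illustration on a made-up toy frame, not the split used in this notebook: the stratified split keeps the female/male shares similar in both parts, and `min_categories` tells `CategoricalNB` how many categories each feature has, so that a category missing from the train part does not cause an error at prediction time.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import CategoricalNB
from sklearn.preprocessing import OrdinalEncoder

# Toy data (invented for illustration only)
toy = pd.DataFrame({
    'sex':        ['F', 'F', 'M', 'F', 'M', 'F', 'M', 'F', 'F', 'M', 'F', 'M'],
    'has_car':    ['N', 'N', 'Y', 'N', 'Y', 'Y', 'Y', 'N', 'N', 'N', 'N', 'Y'],
    'fam_status': ['Married', 'Widow', 'Married', 'Single', 'Married', 'Married',
                   'Single', 'Widow', 'Married', 'Single', 'Separated', 'Married'],
})

# encode on the whole sample so that every category gets a code
enc = OrdinalEncoder()
X = enc.fit_transform(toy[['has_car', 'fam_status']])
y = toy['sex']

# stratified 50/50 split; random_state makes it reproducible
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.5, stratify=y, random_state=0)

# number of categories per feature, taken from the fitted encoder
n_cats = [len(c) for c in enc.categories_]
clf = CategoricalNB(alpha=1, min_categories=n_cats).fit(X_tr, y_tr)
print(clf.predict(X_val))
```

Here 'Separated' occurs only once, so it can land in the validation part alone; with `min_categories` set, the model still reserves a (smoothed) probability for it.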

In [None]:
# making a sample for Naive Bayes
df_hc2 = df_hc.sample(n=10000)

# data preparation for CategoricalNB - categorical features must be encoded as integers 0, 1, ..., K-1
enc = OrdinalEncoder()
df_hc2[['fam_status2', 'has_car2', 'cnt_children2']] = enc.fit_transform(df_hc2[['fam_status', 'has_car', 'cnt_children']])

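# splitting into a train/test part (fitting and cross-validation) and a validation part
df_hc_tt = df_hc2[:5000].copy() # .copy() so that columns can be added later without SettingWithCopyWarning
df_hc_val = df_hc2[5000:].copy()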

In [None]:
print(pd.crosstab(df_hc2['fam_status'], df_hc2['sex'], normalize='columns'))
print(pd.crosstab(df_hc2['cnt_children'], df_hc2['sex'], normalize='columns'))
print(pd.crosstab(df_hc2['has_car'], df_hc2['sex'], normalize='columns'))

It seems that being a widow or separated indicates a woman, and car ownership indicates a man. Now we fit Naive Bayes on the train data (for each category of the target, the share of each value of each predictor is computed) and decide on a prior. First we try a uniform prior, i.e. *P(target is male) = P(target is female) = 1/2*.
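
To make the computation concrete, the sketch below reproduces the posterior by hand on a tiny made-up dataset and compares it with `CategoricalNB`. The helper `manual_posterior` is a hypothetical name introduced only for this illustration. It uses a uniform prior and the smoothed share *(count of value in class + alpha) / (class size + alpha · number of categories)* for each predictor.

```python
import numpy as np
from sklearn.naive_bayes import CategoricalNB

# Toy data (invented): columns are has_car (0/1) and fam_status (0/1/2)
X = np.array([[0, 0], [0, 1], [0, 2], [1, 0], [0, 0],
              [1, 0], [1, 0], [0, 2], [1, 1]])
y = np.array(['F'] * 5 + ['M'] * 4)

def manual_posterior(X, y, x_new, alpha=1.0):
    """Posterior class probabilities under a uniform prior (illustrative helper)."""
    classes = np.unique(y)
    scores = []
    for c in classes:
        Xc = X[y == c]                   # rows belonging to class c
        score = 1.0 / len(classes)       # uniform prior
        for j, v in enumerate(x_new):
            n_cat = X[:, j].max() + 1    # number of categories of feature j
            # smoothed share of value v within class c
            score *= ((Xc[:, j] == v).sum() + alpha) / (len(Xc) + alpha * n_cat)
        scores.append(score)
    scores = np.array(scores)
    return scores / scores.sum()         # normalise so the probabilities sum to 1

x_new = [1, 0]                           # has a car, fam_status code 0
by_hand = manual_posterior(X, y, x_new)
clf = CategoricalNB(alpha=1, fit_prior=False).fit(X, y)
by_sklearn = clf.predict_proba([x_new])[0]
print(by_hand, by_sklearn)
```

For this toy example both computations give probabilities 1/3 for 'F' and 2/3 for 'M': owning a car shifts the prediction towards male.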

In [None]:
# fitting Naive Bayes
X = df_hc_tt[['fam_status2', 'has_car2', 'cnt_children2']]
y = df_hc_tt['sex']
X_val = df_hc_val[['fam_status2', 'has_car2', 'cnt_children2']]
y_val = df_hc_val['sex']

clf = CategoricalNB(alpha=1, fit_prior=False) # uniform prior
clf.fit(X, y)

### uniform prior
# performance on the train/test dataset
print('Accuracy on the train data itself: ', clf.score(X, y))
# cross-validation
scores = cross_val_score(CategoricalNB(alpha=1, fit_prior=False), X, y, cv=5)
print('Accuracy by cval: ', scores)

scores = cross_val_score(CategoricalNB(alpha=1, fit_prior=False), X, y, cv=5,
 scoring='roc_auc')
print('ROC AUC by cval: ', scores)

# performance on validation dataset
predp = clf.predict_proba(X_val) # class probabilities, columns ordered as in clf.classes_
clas = clf.predict(X_val) # predicted classes

df_hc_val['sex_clas'] = clas

print('Validation data -- classification vs. true category:')
print(pd.crosstab(df_hc_val['sex'], df_hc_val['sex_clas']))

print('Accuracy on validation data: ', clf.score(X_val, y_val))

The prior can also be chosen to reflect the data. We check the actual shares of females and males in the train data and set the prior accordingly.
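
As a small sketch on made-up labels (not the Home Credit data), the class shares can be passed to `class_prior` directly. The order must follow `classes_`, which scikit-learn sorts, so 'F' comes before 'M'. Setting `fit_prior=True` lets the model estimate the same shares from the training labels by itself.

```python
import numpy as np
import pandas as pd
from sklearn.naive_bayes import CategoricalNB

# Toy labels and a single encoded predictor (invented for illustration)
y = pd.Series(['F', 'F', 'F', 'F', 'M', 'M'])
X = np.array([[0], [1], [0], [0], [1], [1]])

# class shares, sorted by label so that they line up with clf.classes_
prior = y.value_counts(normalize=True).sort_index()

clf_manual = CategoricalNB(alpha=1, class_prior=prior.to_numpy()).fit(X, y)
clf_auto = CategoricalNB(alpha=1, fit_prior=True).fit(X, y)

print(clf_manual.classes_)
print(np.exp(clf_manual.class_log_prior_), np.exp(clf_auto.class_log_prior_))
```

Both models end up with the same prior here; a hand-set `class_prior` is mainly useful when the population shares are believed to differ from those in the training sample.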

In [None]:
df_hc_tt['sex'].value_counts() / df_hc_tt['sex'].count()

In [None]:
# Naive Bayes - prior set by hand from the class shares
clf2 = CategoricalNB(alpha=1, class_prior=[2/3, 1/3]) # order follows clf2.classes_ (sorted labels: 'F', then 'M')
clf2.fit(X, y)

# performance on validation dataset
predp2 = clf2.predict_proba(X_val) # class probabilities, columns ordered as in clf2.classes_
clas2 = clf2.predict(X_val) # predicted classes

df_hc_val['sex_clas2'] = clas2

print('Validation data -- classification vs. true category:')
print(pd.crosstab(df_hc_val['sex'], df_hc_val['sex_clas2']))

print('Accuracy on validation data: ', clf2.score(X_val, y_val))

Now try it yourself: use other predictors and set a different prior.