{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Naive Bayes classifier"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We will work with the Home Credit data (`application_train.csv`) and try to predict whether an applicant is male or female. For this purpose we will use the Categorical Naive Bayes model, which requires all predictors to be categorical.\n",
    "\n",
    "In the first stage we use three predictors: family status, number of children, and car ownership. You can then add other categorical predictors, such as realty ownership or loan type, or numerical predictors binned into categories (e.g. age).\n",
    "\n",
    "* [Naive Bayes in scikit-learn (overview)](https://scikit-learn.org/stable/modules/naive_bayes.html)\n",
    "* [CategoricalNB documentation](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.CategoricalNB.html)"
   ]
  },
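  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a brief sketch of what the model assumes (the notation is chosen here for illustration): for a class $c$ (male or female) and predictor values $x_1, \\dots, x_p$, Naive Bayes treats the predictors as conditionally independent given the class, so by Bayes' theorem\n",
    "\n",
    "$$P(c \\mid x_1, \\dots, x_p) \\propto P(c) \\prod_{j=1}^{p} P(x_j \\mid c).$$\n",
    "\n",
    "The predicted class is the one with the largest right-hand side; $P(c)$ is the prior and each $P(x_j \\mid c)$ is estimated from the train data."
   ]
  },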
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import numpy as np\n",
    "import seaborn as sns\n",
    "\n",
    "from sklearn.naive_bayes import CategoricalNB\n",
    "from sklearn.preprocessing import OrdinalEncoder\n",
    "from sklearn.model_selection import cross_val_score"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# read the Home Credit data and do some light preparation\n",
    "df_hc = pd.read_csv('application_train.csv') # adjust file path\n",
    "df_hc.columns = df_hc.columns.str.lower()\n",
    "\n",
    "# data reduction - selection of columns\n",
    "df_hc_colnames = ['sk_id_curr', 'target', 'name_contract_type', 'code_gender', 'flag_own_car',\n",
    "                 'flag_own_realty', 'days_birth', 'ext_source_1',\n",
    "                 'ext_source_2', 'ext_source_3', 'cnt_children', 'cnt_fam_members', 'name_family_status']\n",
    "df_hc = df_hc[df_hc_colnames]\n",
    "# and renaming to simpler names\n",
    "df_hc_colnames_new = ['id', 'target', 'loan_type', 'sex', 'has_car',\n",
    "                 'has_realty', 'age_days', 'score1',\n",
    "                 'score2', 'score3', 'cnt_children', 'cnt_fam_members', 'fam_status']\n",
    "df_hc.columns = df_hc_colnames_new\n",
    "\n",
    "# data transformation\n",
    "df_hc = df_hc[df_hc['sex'] != 'XNA'].copy() # drop rows with unknown sex; copy to avoid SettingWithCopyWarning\n",
    "df_hc['cnt_children'] = np.minimum(df_hc['cnt_children'], 4) # merge 4+ children into one category\n",
    "df_hc"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We draw a random sample of 10,000 rows from the whole dataset and split it into a train/test part and a validation part (5,000 rows each). Then we look at possible relationships between the target and the predictors."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# making a sample for Naive Bayes\n",
    "df_hc2 = df_hc.sample(n=10000)\n",
    "\n",
    "# data preparation for CategoricalNB - categorical features have to be encoded as integers 0, 1, ..., K-1\n",
    "enc = OrdinalEncoder()\n",
    "df_hc2[['fam_status2', 'has_car2', 'cnt_children2']] = enc.fit_transform(df_hc2[['fam_status', 'has_car', 'cnt_children']])\n",
    "\n",
    "# split into train/test and validation parts (copies, so that columns can be added later)\n",
    "df_hc_tt = df_hc2[:5000].copy()\n",
    "df_hc_val = df_hc2[5000:].copy()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(pd.crosstab(df_hc2['fam_status'], df_hc2['sex'], normalize='columns'))\n",
    "print(pd.crosstab(df_hc2['cnt_children'], df_hc2['sex'], normalize='columns'))\n",
    "print(pd.crosstab(df_hc2['has_car'], df_hc2['sex'], normalize='columns'))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "It seems that being a widow or separated indicates women, while car ownership indicates men. Now we fit Naive Bayes on the train data (for each category of the target, the shares of the values of each predictor are computed) and choose a prior. First we try a uniform prior, i.e. *P(target is male) = P(target is female) = 1/2*."
   ]
  },
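  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A sketch of the estimates behind the fit, following the formula given in the CategoricalNB documentation: with smoothing parameter $\\alpha$, the conditional probability of category $t$ of predictor $j$ within class $c$ is\n",
    "\n",
    "$$\\hat P(x_j = t \\mid c) = \\frac{N_{tjc} + \\alpha}{N_c + \\alpha\\, n_j},$$\n",
    "\n",
    "where $N_{tjc}$ is the number of train rows of class $c$ with $x_j = t$, $N_c$ is the number of train rows of class $c$, and $n_j$ is the number of categories of predictor $j$. With $\\alpha = 1$ (Laplace smoothing), a category never seen within a class still gets a small nonzero probability. As a made-up illustration (the numbers are hypothetical, not taken from the data): if a class had $N_c = 100$ rows, none of them in a given category of a predictor with $n_j = 6$ categories, the estimate would be $(0 + 1)/(100 + 6) \\approx 0.0094$ rather than $0$."
   ]
  },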
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# fitting Naive Bayes\n",
    "X = df_hc_tt[['fam_status2', 'has_car2', 'cnt_children2']]\n",
    "y = df_hc_tt['sex']\n",
    "X_val = df_hc_val[['fam_status2', 'has_car2', 'cnt_children2']]\n",
    "y_val = df_hc_val['sex']\n",
    "\n",
    "clf = CategoricalNB(alpha=1, fit_prior=False) # uniform prior\n",
    "clf.fit(X, y)\n",
    "\n",
    "### uniform prior\n",
    "# performance on the train/test dataset\n",
    "print('Accuracy on itself (train data): ', clf.score(X,y))\n",
    "# cross-validation\n",
    "scores = cross_val_score(CategoricalNB(alpha=1, fit_prior=False), X, y, cv=5)\n",
    "print('Accuracy by cval: ', scores)\n",
    "\n",
    "scores = cross_val_score(CategoricalNB(alpha=1, fit_prior=False), X, y, cv=5,\n",
    "                         scoring='roc_auc')\n",
    "print('ROC AUC by cval: ', scores)\n",
    "\n",
    "# performance on validation dataset\n",
    "predp = clf.predict_proba(X_val) # class probabilities\n",
    "clas = clf.predict(X_val) # predicted classes\n",
    "\n",
    "df_hc_val['sex_clas'] = clas\n",
    "\n",
    "print('Validation data -- classification vs. true category:')\n",
    "print(pd.crosstab(df_hc_val['sex'], df_hc_val['sex_clas']))\n",
    "\n",
    "print('Accuracy on validation data: ', clf.score(X_val, y_val))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The prior can also be chosen to reflect the data. We check the actual shares of females and males in the train data and set the prior accordingly."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_hc_tt['sex'].value_counts(normalize=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Naive Bayes - prior given explicitly\n",
    "# class_prior follows the order of clf2.classes_ (sorted labels, here F then M)\n",
    "clf2 = CategoricalNB(alpha=1, class_prior=[2/3, 1/3])\n",
    "clf2.fit(X, y)\n",
    "\n",
    "# performance on validation dataset\n",
    "predp2 = clf2.predict_proba(X_val) # class probabilities\n",
    "clas2 = clf2.predict(X_val) # predicted classes\n",
    "\n",
    "df_hc_val['sex_clas2'] = clas2\n",
    "\n",
    "print('Validation data -- classification vs. true category:')\n",
    "print(pd.crosstab(df_hc_val['sex'], df_hc_val['sex_clas2']))\n",
    "\n",
    "print('Accuracy on validation data: ', clf2.score(X_val, y_val))"
   ]
  },
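  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To see how the prior enters the decision, the two posteriors can be compared as odds (a sketch, assuming the class labels are `F` and `M` and that the prior `[2/3, 1/3]` is given in that order):\n",
    "\n",
    "$$\\frac{P(M \\mid x)}{P(F \\mid x)} = \\frac{P(M)}{P(F)} \\cdot \\frac{P(x \\mid M)}{P(x \\mid F)}.$$\n",
    "\n",
    "With a uniform prior the first factor is $1$, so an applicant is classified as male whenever the likelihood ratio exceeds $1$. With $P(F) = 2/3$ and $P(M) = 1/3$ the prior odds are $1/2$, so the likelihood ratio has to exceed $2$ before the model predicts male. Fewer applicants should therefore be classified as male than under the uniform prior."
   ]
  },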
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now try using other predictors and a different prior yourself."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.18"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}