{ "cells": [ { "cell_type": "markdown", "id": "42b920d8", "metadata": {}, "source": [ "# Game sessions - dimensionality reduction and clustering\n", "\n", "## Motivation\n", "\n", "We would like to identify typical patterns of a gambler's play in a gaming venue. A playing session can be described by, for example, its length, game speed, breaks, and bet amounts. Other characteristics may include the time of day, day of week, winnings, etc.\n", "\n", "Once we have such a typology, we can check whether an individual gambler keeps using similar playing patterns.\n", "\n", "## Data\n", "\n", "We have records of individual games (user, time, bet amount, place id, machine id, game id). From that table, game sessions were created and saved in the file *sessions_hra.csv*. The file contains many columns; they are described (in Czech) in the file *sessions_popis.pdf*. Here is an example of the data:" ] }, { "cell_type": "code", "execution_count": null, "id": "9af9b225", "metadata": {}, "outputs": [], "source": [ "# Setup - basic packages\n", "import warnings\n", "warnings.simplefilter(action='ignore', category=FutureWarning)\n", "\n", "import pandas as pd\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "\n", "%matplotlib inline\n", "pd.options.display.max_columns = None" ] }, { "cell_type": "code", "execution_count": null, "id": "baa7176d", "metadata": {}, "outputs": [], "source": [ "# Reading and showing a random sample\n", "D = pd.read_csv(\"sessions_hra.csv\")\n", "print('Number of rows:', len(D))\n", "print('Number of columns:', D.shape[1])\n", "D_sample = D.sample(n=20) # random sample\n", "D_sample" ] }, { "cell_type": "markdown", "id": "f8473284", "metadata": {}, "source": [ "For this practice I recommend filtering the dataset to keep:\n", "* only sessions with the number of games above some threshold,\n", "* a sample of users (not of rows, so that all rows of one user stay together), and\n", "* a sample of columns.\n", "\n", "See the code below (it contains a recommended list of columns)." ] }, { "cell_type": "code", "execution_count": null, "id": "69e20b79", "metadata": {}, "outputs": [], "source": [ "c_min_games = 100 # minimum number of individual games played in a session\n", "c_num_users = 400 # number of users in the sample\n", "c_columns = [\"idsession_h\", \"df_v_konto_key\", \"dt_zac\", \"ses_delka\", \"pozic_unik_pocet\", \"pozice_top1_pocet_her\", \"hra_pocet\",\n", "             \"hra_unik_pocet\", \"hra_top1_pocet\", \"sazka_sum\", \"sazka_pocet5czk\", \"sazka_pocet10czk\", \"sazka_pocet20czk\",\n", "             \"sazka_pocet50czk\", \"sazka_pocet100czk\", \"vyhra_sum\", \"vyhra_pocet\", \"serie_pocet\", \"serie_delka_sum\",\n", "             \"serie_nad30m_pocet\", \"serie_delka_max\", \"serie_delka_med\", \"serie_pauza_nad30m_pocet\"]\n", "\n", "D[\"dt_zac\"] = pd.to_datetime(D[\"dt_zac\"], format=\"ISO8601\") # convert date and time from string to datetime\n", "D = D[D[\"hra_pocet\"] >= c_min_games][c_columns] # keep only rows above the threshold and the recommended columns\n", "\n", "# find unique users and draw a random sample of them\n", "D_users = D[[\"df_v_konto_key\"]].drop_duplicates().sample(n=c_num_users)\n", "D_users.set_index(\"df_v_konto_key\", inplace=True)\n", "D = D.join(D_users, on='df_v_konto_key', how='inner')\n", "\n", "D # show the DataFrame after sampling" ] },
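{ "cell_type": "markdown", "id": "a1f01c10", "metadata": {}, "source": [ "An optional quick sanity check of the filtered sample (a minimal sketch using the columns selected above): how many sessions and users remain, and how many sessions a user typically has." ] }, { "cell_type": "code", "execution_count": null, "id": "a1f01c11", "metadata": {}, "outputs": [], "source": [ "# Optional sanity check of the sample (sketch)\n", "print('Sessions after filtering and sampling:', len(D))\n", "print('Users in the sample:', D[\"df_v_konto_key\"].nunique())\n", "# distribution of the number of sessions per user\n", "D.groupby(\"df_v_konto_key\")[\"idsession_h\"].count().describe()" ] },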
{ "cell_type": "markdown", "id": "584c1db5", "metadata": {}, "source": [ "## Methods\n", "\n", "1. dimensionality reduction\n", "    - PCA, UMAP\n", "2. clustering\n", "    - k-means, DBSCAN\n", "\n", "* [PCA tutorial](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html)\n", "* [UMAP tutorial](https://umap-learn.readthedocs.io/en/latest/parameters.html), [package documentation](https://pypi.org/project/umap-learn/)\n", "* [k-means tutorial](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html)\n", "* [DBSCAN tutorial](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html)\n", "\n", "## Dimensionality reduction\n", "\n", "First of all, we must **select columns** for dimensionality reduction and **prepare the data**: check the distributions of the columns and, if needed, transform and scale them. [Box-Cox power transformation](https://en.wikipedia.org/wiki/Power_transform) may be useful." ] }, { "cell_type": "code", "execution_count": null, "id": "26146112", "metadata": {}, "outputs": [], "source": [ "# Import packages for transformations, dimensionality reduction and clustering\n", "from scipy import stats\n", "import seaborn as sns\n", "import umap\n", "from sklearn.cluster import DBSCAN, KMeans\n", "from sklearn.decomposition import PCA\n", "\n", "### z-score transformation (X-EX)/sd(X)\n", "#\n", "# x = Series of numerical values\n", "def scale(x):\n", "    return ((x - x.mean()) / x.std())\n", "\n", "### Box-Cox standardized power transformation (normalized by the geometric mean, see\n", "# https://en.wikipedia.org/wiki/Power_transform)\n", "# If we intend to scale the result, scipy's boxcox function is enough - we don't need this function\n", "#\n", "# x = Series of positive numerical values\n", "# lmbda = parameter of the Box-Cox power transformation\n", "def boxcox_norm(x, lmbda):\n", "    hlp_gm = np.exp(np.mean(np.log(x)))  # geometric mean of x\n", "    if lmbda == 0:\n", "        return np.log(x) * hlp_gm\n", "    else:\n", "        return (np.power(x, lmbda) - 1) / (lmbda * np.power(hlp_gm, lmbda-1))\n" ] }, { "cell_type": "markdown", "id": "32ca18ba", "metadata": {}, "source": [ "Let's choose six features:\n", "* session length\n", "* share of the session spent playing\n", "* average games per minute\n", "* average bet per game\n", "* share of games played on the most played machine\n", "* share of games played on the most played game id\n", "\n", "Naturally, we could take more variables, e.g. the shares of particular bet amounts (5, 10, 20 CZK etc.). You can try using them instead of the aggregated sum of bets." ] }, { "cell_type": "code", "execution_count": null, "id": "a77eb596", "metadata": {}, "outputs": [], "source": [ "# columns for dimensionality reduction (together with id columns so that we do not lose the connection to the original data)\n", "D2 = pd.DataFrame({\"idsession_h\": D[\"idsession_h\"], \"df_v_konto_key\": D[\"df_v_konto_key\"],\n", "                   \"ses_delka\": D[\"ses_delka\"],\n", "                   \"serie_delka_rel\": D[\"serie_delka_sum\"] / D[\"ses_delka\"],\n", "                   \"hra_na_min\": (D[\"hra_pocet\"] - D[\"serie_pocet\"]) / D[\"serie_delka_sum\"] * 60,\n", "                   \"sazka_na_hru\": D[\"sazka_sum\"] / D[\"hra_pocet\"],\n", "                   \"pozice_top1_podil\": D[\"pozice_top1_pocet_her\"] / D[\"hra_pocet\"],\n", "                   \"hra_top1_podil\": D[\"hra_top1_pocet\"] / D[\"hra_pocet\"]\n", "                  })" ] }, { "cell_type": "markdown", "id": "0d409fdf", "metadata": {}, "source": [ "Let's consider the *ses_delka* column. Should it be transformed, and how? We can get the answer from a visualisation (e.g. a histogram) and from the *boxcox* function (which recommends an optimal lambda)." ] }, { "cell_type": "code", "execution_count": null, "id": "d696d7fe", "metadata": {}, "outputs": [], "source": [ "# ses_delka (session length): probably skewed - what would boxcox say?\n", "a, l = stats.boxcox(D2[\"ses_delka\"])\n", "print('Recommended lambda for ses_delka =', l)\n", "\n", "# the recommended lambda is around 0, so we use the logarithm (and scale the result)\n", "D2[\"ses_delka_norm\"] = scale(np.log(D2[\"ses_delka\"]))\n", "\n", "# make a chart of the original and transformed distribution\n", "fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(6,3), sharey=True)\n", "fig.suptitle('Distribution')\n", "\n", "sns.histplot(data=D2, x=\"ses_delka\", ax=ax1)\n", "ax1.set_title('Original data')\n", "\n", "sns.histplot(data=D2, x=\"ses_delka_norm\", ax=ax2)\n", "ax2.set_title('Normalized data')" ] }, { "cell_type": "markdown", "id": "79bd3d93", "metadata": {}, "source": [ "Similarly, we transform and normalize the other columns." ] }, { "cell_type": "code", "execution_count": null, "id": "a3553ec8", "metadata": {}, "outputs": [], "source": [ "# D2[\"ses_delka_norm\"] = scale(np.log(D2[\"ses_delka\"])) - has been transformed above\n", "D2[\"serie_delka_rel_norm\"] = scale(stats.boxcox(D2[\"serie_delka_rel\"], 2.5))\n", "D2[\"hra_na_min_norm\"] = scale(stats.boxcox(D2[\"hra_na_min\"], 1.5))\n", "D2[\"sazka_na_hru_norm\"] = scale(np.log(D2[\"sazka_na_hru\"]))\n", "D2[\"pozice_top1_podil_norm\"] = scale(D2[\"pozice_top1_podil\"])\n", "D2[\"hra_top1_podil_norm\"] = scale(D2[\"hra_top1_podil\"])\n", "\n", "# columns for reduction and clustering\n", "c_sl = [\"ses_delka_norm\", \"serie_delka_rel_norm\", \"hra_na_min_norm\", \"sazka_na_hru_norm\",\n", "        \"pozice_top1_podil_norm\", \"hra_top1_podil_norm\"]" ] }, { "cell_type": "markdown", "id": "a800aaf3", "metadata": {}, "source": [ "For visualising the reduced and clustered data, we prepare a few other columns." ] }, { "cell_type": "code", "execution_count": null, "id": "3ab423fd", "metadata": {}, "outputs": [], "source": [ "# columns for checking how meaningful the reduction and transformation are\n", "D2[\"hodina\"] = D[\"dt_zac\"].dt.hour # hour of the day when the session started\n", "D2[\"dvt\"] = D[\"dt_zac\"].dt.dayofweek # day of week when the session started\n", "D2[\"dvm\"] = D[\"dt_zac\"].dt.day # day of month when the session started\n", "D2[\"wknd\"] = (D2[\"dvt\"]>=4) & (D2[\"dvt\"]<=6) # session on a weekend? (Fri, Sat, Sun)\n", "D2[\"noc\"] = (D2[\"hodina\"]>=19) | (D2[\"hodina\"]<=4) # session at night?\n", "D2[\"vp\"] = D[\"vyhra_sum\"] / D[\"sazka_sum\"] # ratio of winnings to bets (could be transformed to reduce skewness)" ] }, { "cell_type": "markdown", "id": "d3b3398d", "metadata": {}, "source": [ "To get some insight into the relationships in the data, we compute the correlation matrix." ] }, { "cell_type": "code", "execution_count": null, "id": "1fc3d11d", "metadata": {}, "outputs": [], "source": [ "# correlation matrix\n", "D2[c_sl].corr()" ] }, { "cell_type": "markdown", "id": "34caf4b6", "metadata": {}, "source": [ "As we can see, there are only two rather strong relationships: session length vs. the share of the session spent playing, and the share of games played on the most played machine vs. the share played on the most played game id. The rest seem to be uncorrelated or only slightly correlated. It looks like the 6 dimensions can be linearly reduced to 4, but we probably cannot expect fewer (3 or even 2 dimensions)." ] },
{ "cell_type": "code", "execution_count": null, "id": "2ece19aa", "metadata": {}, "outputs": [], "source": [ "# PCA\n", "pca = PCA(n_components=4)\n", "pca.fit(D2[c_sl])\n", "\n", "print('Transformation matrix:')\n", "print(pca.components_)\n", "print('Variability share of individual components:')\n", "print(pca.explained_variance_ratio_)\n", "out = pca.transform(D2[c_sl]) # pca is already fitted above, so transform is enough\n", "\n", "# save component values to the D2 DataFrame\n", "D2[\"pca1\"] = out[:, 0]\n", "D2[\"pca2\"] = out[:, 1]\n", "D2[\"pca3\"] = out[:, 2]\n", "D2[\"pca4\"] = out[:, 3]" ] }, { "cell_type": "markdown", "id": "d92441e9", "metadata": {}, "source": [ "This does not look too bad: the first two dimensions cover more than 50 % of the total variability. However, the chart shows that the first two principal components do not form clear clusters. We use color to show how the first two principal components depend on session length. Try using another variable for the color." ] }, { "cell_type": "code", "execution_count": null, "id": "c8be8711", "metadata": {}, "outputs": [], "source": [ "# PCA1 x PCA2\n", "g = sns.relplot(data=D2, x=\"pca1\", y=\"pca2\", s=10, hue=\"ses_delka_norm\")" ] }, { "cell_type": "markdown", "id": "2cbd2ea0", "metadata": {}, "source": [ "Let's try UMAP dimensionality reduction. After it is fitted (which takes a rather **long time**), we plot a chart; it shows that UMAP can capture similarities and differences more subtly. Again, we use color to show the relationship between the UMAP result and session length.\n", "\n", "Try setting other parameter values and coloring by other variables to see how UMAP separates the sessions by them (an optional coloring sketch follows the UMAP chart below)." ] }, { "cell_type": "code", "execution_count": null, "id": "7471fed6", "metadata": {}, "outputs": [], "source": [ "# UMAP transformation - runs rather long\n", "c_n_neighbors = 17\n", "c_min_dist = 0.005\n", "c_metric = 'euclidean' # there are other options, e.g. cosine\n", "\n", "fit = umap.UMAP(\n", "    n_neighbors=c_n_neighbors,\n", "    min_dist=c_min_dist,\n", "    n_components=2,\n", "    metric=c_metric\n", ")\n", "u = fit.fit_transform(D2[c_sl])\n", "\n", "# coordinates of points after the transformation\n", "D2[\"u_x\"] = u[:, 0]\n", "D2[\"u_y\"] = u[:, 1]\n", "D2[\"dummy\"] = 0 # auxiliary variable for later visualisation" ] }, { "cell_type": "code", "execution_count": null, "id": "225e4c28", "metadata": {}, "outputs": [], "source": [ "# result of UMAP with color mapping\n", "g = sns.relplot(data=D2, x=\"u_x\", y=\"u_y\", s=15, hue=\"ses_delka_norm\")" ] },
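{ "cell_type": "markdown", "id": "d4e66f01", "metadata": {}, "source": [ "An optional sketch of the coloring exercise suggested above: the same UMAP projection colored by two other variables, the normalized bet per game and the night-session flag." ] }, { "cell_type": "code", "execution_count": null, "id": "d4e66f02", "metadata": {}, "outputs": [], "source": [ "# Optional: the UMAP projection colored by other variables (sketch)\n", "g = sns.relplot(data=D2, x=\"u_x\", y=\"u_y\", s=15, hue=\"sazka_na_hru_norm\")\n", "g = sns.relplot(data=D2, x=\"u_x\", y=\"u_y\", s=15, hue=\"noc\")" ] },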
{ "cell_type": "markdown", "id": "e1155044", "metadata": {}, "source": [ "## Clustering\n", "\n", "The clustering can be done:\n", "\n", "1. on the original subset of columns (without dimensionality reduction) - generally not recommended, but possible here thanks to the small number (6) of dimensions\n", "2. on principal components - either the first two (good for visualisation) or all four (covering more information)\n", "3. on the UMAP 2D projection - good for visualisation while keeping the similarity information\n", "\n", "We start with the **UMAP** projection, which is best clustered by a density-based method (**DBSCAN** here). The chart below shows the ten biggest clusters in different colors, unassigned points in light grey, and the remaining clusters together in grey. Try setting different parameters for DBSCAN." ] }, { "cell_type": "code", "execution_count": null, "id": "d5b09c38", "metadata": {}, "outputs": [], "source": [ "# DBSCAN clustering on the UMAP projection\n", "clustering = DBSCAN(eps=0.2, min_samples=10).fit(D2[[\"u_x\", \"u_y\"]])\n", "D2[\"clst\"] = clustering.labels_\n", "\n", "# relabel clusters so that they are numbered by size (except -1, which is the cluster of unassigned points)\n", "D_cl = D2[D2[\"clst\"]>=0].groupby(\"clst\").agg(cnt=pd.NamedAgg(column=\"clst\", aggfunc=\"count\"))\n", "D_cl[\"clst_new\"] = D_cl.rank(method='first', ascending=False).astype('int') - 1\n", "D_cl = D_cl.drop(columns=[\"cnt\"])\n", "D_cl = pd.concat([D_cl, pd.DataFrame({\"clst_new\": [-1]}, index=[-1])])\n", "\n", "D2 = D2.join(D_cl, on=\"clst\", how=\"inner\")\n", "D2 = D2.drop(columns=[\"clst\"]).rename(columns={\"clst_new\": \"clst\"})" ] }, { "cell_type": "code", "execution_count": null, "id": "d6b3a0c0", "metadata": { "scrolled": false }, "outputs": [], "source": [ "plt.figure()\n", "sns.scatterplot(data=D2, x=\"u_x\", y=\"u_y\", s=15, hue=\"dummy\", palette=[\"grey\"], legend=False)\n", "sns.scatterplot(data=D2[D2[\"clst\"]==-1], x=\"u_x\", y=\"u_y\", s=15, hue=\"dummy\", palette=[\"lightgrey\"], legend=False)\n", "sns.scatterplot(data=D2[(D2[\"clst\"]>=0) & (D2[\"clst\"]<10)], x=\"u_x\", y=\"u_y\", s=15, hue=\"clst\", palette=\"deep\")\n", "plt.show()" ] }, { "cell_type": "markdown", "id": "ba604082", "metadata": {}, "source": [ "Do the clusters represent different behaviour patterns? We can check the distributions of variables both within a cluster and overall. We make a table for all clusters and a chart for cluster #7 (a random pick among the moderate-size clusters). Try to analyse other clusters yourself." ] }, { "cell_type": "code", "execution_count": null, "id": "42372f38", "metadata": {}, "outputs": [], "source": [ "# means/medians of variables overall and within individual clusters\n", "Dm = pd.DataFrame({\"cnt\": len(D2), \"ses_delka\": D2[\"ses_delka\"].median(), \"serie_delka_rel\": D2[\"serie_delka_rel\"].mean(),\n", "                   \"hra_na_min\": D2[\"hra_na_min\"].median(), \"sazka_na_hru\": D2[\"sazka_na_hru\"].median(),\n", "                   \"wknd\": D2[\"wknd\"].mean(), \"noc\": D2[\"noc\"].mean(), \"vp\": D2[\"vp\"].median()}, index=[0])\n", "\n", "Dg = D2.groupby(\"clst\").agg({\"clst\": \"count\", \"ses_delka\": \"median\", \"serie_delka_rel\": \"mean\",\n", "                             \"hra_na_min\": \"median\", \"sazka_na_hru\": \"median\",\n", "                             \"wknd\": \"mean\", \"noc\": \"mean\", \"vp\": \"median\"}).rename(columns={\"clst\": \"cnt\"}).reset_index()\n", "\n", "Dg[\"clst\"] = Dg[\"clst\"].astype('str')\n", "Dm[\"clst\"] = \"overall\"\n", "\n", "D_stat = pd.concat([Dm, Dg], ignore_index=True)\n", "# reorder columns so that 'clst' comes first\n", "hlp = ['clst'] + D_stat.columns[:-1].tolist()\n", "D_stat = D_stat[hlp]\n", "\n", "D_stat" ] },
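{ "cell_type": "markdown", "id": "e9b10001", "metadata": {}, "source": [ "Optionally, the separation of the DBSCAN clusters in the UMAP plane can be summarised by a single number, e.g. the silhouette score (a minimal sketch; unassigned points are excluded)." ] }, { "cell_type": "code", "execution_count": null, "id": "e9b10002", "metadata": {}, "outputs": [], "source": [ "# Optional: silhouette score of the DBSCAN clusters in the UMAP plane (sketch; noise points excluded)\n", "from sklearn.metrics import silhouette_score\n", "\n", "mask = D2[\"clst\"] >= 0\n", "print('Silhouette score:', silhouette_score(D2.loc[mask, [\"u_x\", \"u_y\"]], D2.loc[mask, \"clst\"]))" ] },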
{ "cell_type": "code", "execution_count": null, "id": "1abc8a2f", "metadata": {}, "outputs": [], "source": [ "# show charts with statistics for one cluster\n", "\n", "clst_id = 7\n", "# compare cluster stats with overall stats\n", "# make an auxiliary table with all points and the selected cluster added\n", "D_hlp = pd.concat([D2.assign(grp=\"all\"), D2[D2[\"clst\"]==clst_id].assign(grp=\"cluster\")], ignore_index=True)\n", "\n", "fig, ax = plt.subplots(2, 3, figsize=(9,6))\n", "fig.suptitle(\"Cluster \"+str(clst_id)+\" vs. all data\")\n", "\n", "g = sns.scatterplot(data=D_hlp, x=\"u_x\", y=\"u_y\", s=15, hue=\"grp\", palette=[\"lightgrey\", \"red\"], legend=False, ax=ax[0][0])\n", "g.set_title(\"\")\n", "g.set_xlabel(\"\")\n", "g.set_ylabel(\"\")\n", "\n", "g = sns.boxplot(data=D_hlp, y=\"ses_delka\", x=\"grp\", palette=[\"lightgrey\", \"red\"],\n", "                ax=ax[0][1])\n", "g.set_title(\"Session length [seconds]\")\n", "g.set_xlabel(\"\")\n", "g.set_ylabel(\"\")\n", "g.set_yscale('log')\n", "\n", "g = sns.kdeplot(data=D_hlp, hue=\"grp\", x=\"hra_na_min\", palette=[\"lightgrey\", \"red\"], legend=False,\n", "                common_norm=False, ax=ax[0][2])\n", "g.set_title(\"Games per minute\")\n", "g.set_xlabel(\"\")\n", "g.set_ylabel(\"\")\n", "\n", "g = sns.histplot(data=D_hlp, hue=\"grp\", x=\"sazka_na_hru\", palette=[\"lightgrey\", \"red\"], legend=False,\n", "                 stat=\"density\", common_norm=False, ax=ax[1][0])\n", "g.set_title(\"Bet per game [CZK]\")\n", "g.set_xlabel(\"\")\n", "g.set_ylabel(\"\")\n", "\n", "g = sns.barplot(data=D_hlp, x=\"grp\", y=\"wknd\", palette=[\"lightgrey\", \"red\"], errorbar=None, ax=ax[1][1])\n", "g.set_title(\"Ratio of weekend sessions\")\n", "g.set_xlabel(\"\")\n", "g.set_ylabel(\"\")\n", "\n", "g = sns.barplot(data=D_hlp, x=\"grp\", y=\"noc\", palette=[\"lightgrey\", \"red\"], errorbar=None, ax=ax[1][2])\n", "g.set_title(\"Ratio of night sessions\")\n", "g.set_xlabel(\"\")\n", "g.set_ylabel(\"\")\n" ] }, { "cell_type": "markdown", "id": "efc98620", "metadata": {}, "source": [ "Finally, let's check whether the sessions of one user are concentrated in one or a few clusters. One can expect that the playing pattern of a particular user stays mostly the same. We take the user with the 3rd highest number of sessions and plot their sessions in the UMAP projection." ] }, { "cell_type": "code", "execution_count": null, "id": "44316381", "metadata": {}, "outputs": [], "source": [ "# show an individual user\n", "D_u = D2.groupby(\"df_v_konto_key\").agg(cnt=pd.NamedAgg(column=\"idsession_h\", aggfunc=\"count\")).sort_values(\"cnt\", ascending=False)\n", "c_user = D_u.index[2] # user with the 3rd highest number of sessions\n", "\n", "plt.figure()\n", "sns.scatterplot(data=D2, x=\"u_x\", y=\"u_y\", s=15, hue=\"dummy\", palette=[\"lightgrey\"], legend=False)\n", "sns.scatterplot(data=D2[D2[\"df_v_konto_key\"]==c_user], x=\"u_x\", y=\"u_y\", s=15, hue=\"dummy\", palette=[\"red\"], legend=False)\n", "plt.show()" ] }, { "cell_type": "markdown", "id": "681409d9", "metadata": {}, "source": [ "Now we use **k-means** clustering. It is a global method, so it is not reasonable to apply it to the UMAP projection, which keeps only local similarities (and deforms distances). That's why we cluster the sessions based on the PCA output.\n", "\n", "We fit clusters both on all 4 PCA dimensions and on the two strongest dimensions. The former keeps more information, the latter enables prettier visualisation.\n", "\n", "*Note: You can also try to cluster the original points without dimensionality reduction.*" ] },
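{ "cell_type": "markdown", "id": "e5a77001", "metadata": {}, "source": [ "The number of clusters below is set to 6 without much justification. An optional way to guide this choice is the elbow method: fit k-means for several values of k and look at where the inertia (within-cluster sum of squares) stops dropping quickly. This is only a sketch on the 4 PCA dimensions." ] }, { "cell_type": "code", "execution_count": null, "id": "e5a77002", "metadata": {}, "outputs": [], "source": [ "# Optional: elbow method for choosing the number of k-means clusters (sketch)\n", "ks = range(2, 11)\n", "inertias = [KMeans(n_clusters=k, random_state=0, n_init=\"auto\").fit(D2[[\"pca1\", \"pca2\", \"pca3\", \"pca4\"]]).inertia_\n", "            for k in ks]\n", "\n", "plt.figure(figsize=(5,3))\n", "plt.plot(list(ks), inertias, marker='o')\n", "plt.xlabel('Number of clusters k')\n", "plt.ylabel('Inertia')\n", "plt.show()" ] },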
{ "cell_type": "code", "execution_count": null, "id": "b0cf428f", "metadata": {}, "outputs": [], "source": [ "# fitting k-means\n", "c_n_clst = 6 # number of clusters\n", "c_pca2 = [\"pca1\", \"pca2\"]\n", "c_pca4 = [\"pca1\", \"pca2\", \"pca3\", \"pca4\"]\n", "km2 = KMeans(n_clusters=c_n_clst, random_state=0, n_init=\"auto\").fit(D2[c_pca2])\n", "km4 = KMeans(n_clusters=c_n_clst, random_state=0, n_init=\"auto\").fit(D2[c_pca4])\n", "\n", "D2[\"km2_clst\"] = km2.labels_\n", "D2[\"km4_clst\"] = km4.labels_" ] }, { "cell_type": "code", "execution_count": null, "id": "a0378284", "metadata": {}, "outputs": [], "source": [ "fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8,4))\n", "fig.suptitle('K-means on PCA')\n", "\n", "g = sns.scatterplot(data=D2, x=\"pca1\", y=\"pca2\", s=10, hue=\"km2_clst\", palette=\"deep\", ax=ax1)\n", "ax1.set_title('Based on 2 dimensions')\n", "\n", "g = sns.scatterplot(data=D2, x=\"pca1\", y=\"pca2\", s=10, hue=\"km4_clst\", palette=\"deep\", ax=ax2)\n", "ax2.set_title('Based on 4 dimensions')" ] }, { "cell_type": "markdown", "id": "93d3469d", "metadata": {}, "source": [ "Now you can check whether k-means clustering on the PCA output divides the sessions into reasonable clusters. Try comparing the sessions in each cluster with the overall statistics, as we did above for the DBSCAN clustering on the UMAP projection." ] }, { "cell_type": "code", "execution_count": null, "id": "77488c36", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.18" } }, "nbformat": 4, "nbformat_minor": 5 }