{ "cells": [ { "cell_type": "markdown", "id": "2d17e9c1", "metadata": {}, "source": [ "# Hazard sessions\n", "\n", "For the practice of data transformation, we use data from hazard games.\n", "\n", "## Data\n", "Download *hazard.zip* unpack and read two files:\n", "\n", "* ***hra-upr.csv***: (BIG!) table of individual games playes by a gambler, sorted by gambler id and time\n", " - *df_v_misto_key*: place id \n", " - *df_v_konto_key*: gambler id\n", " - *df_v_herni_pozice_key*: machine id\n", " - *sazkavysepuvodni*: bet amount (in CZK)\n", " - *sazkaprijeticas*: exact date and time of playing a game\n", " - *vyhravysepuvodni*: winnings amount (in CZK)\n", " - *df_v_evidence_her_key*: game id\n", " - *zmena_konto_misto*: flag if the current record has different gambler or place id from the previous (change of place or start of records of another gambler)\n", " - *cas_pred*: time difference from previous record's *sazkaprijeticas*\n", "* ***misto-upr.csv***: table of places\n", " - *df_v_misto_key*: place id\n", " - *jtsk_x*, *jtsk_y*: coordinates in square localization system (different from GPS)\n", " - *obec*, *ulice*, *psc*, *cp*: address of the place\n", " - *kraj*: region of the Czech Republic\n", " - *typmisto*: place type\n", " - *sidlokodruian*: address code in the RUIAN register" ] }, { "cell_type": "code", "execution_count": 32, "id": "9ed66ea3", "metadata": {}, "outputs": [], "source": [ "### Setup\n", "%matplotlib inline\n", "# should enable plotting without explicit call .show()\n", "\n", "# Import libraries\n", "import pandas as pd\n", "import numpy as np\n", "import seaborn as sns\n", "import seaborn.objects as so\n", "import matplotlib.pyplot as plt" ] }, { "cell_type": "code", "execution_count": 33, "id": "90ec2dbc", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
df_v_misto_keydf_v_konto_keydf_v_herni_pozice_keysazkavysepuvodnisazkaprijeticasvyhravysepuvodnidf_v_evidence_her_keyzmena_konto_mistocas_pred
0127018397313476750751576569001002023-12-18T15:16:17Z0507989000TrueNaN
1127018397313476750751576569001002023-12-18T15:16:19.600Z0507989000False2.6
2127018397313476750751576569001002023-12-18T15:16:22.100Z0507989000False2.5
3127018397313476750751576569001002023-12-18T15:16:24.600Z0507989000False2.5
4127018397313476750751576569001002023-12-18T15:16:27.100Z0507989000False2.5
..............................
74201652771795957981676789016081222152023-12-31T18:02:37Z0420825383False2.0
74201662771795957981676789016081222152023-12-31T18:02:40Z0420825383False3.0
74201672771795957981676789016081222152023-12-31T18:02:42Z0420825383False2.0
74201682771795957981676789016081222152023-12-31T18:02:44Z0420825383False2.0
74201692771795957981676789016081222152023-12-31T18:02:46Z0420825383False2.0
\n", "

7420170 rows × 9 columns

\n", "
" ], "text/plain": [ " df_v_misto_key df_v_konto_key df_v_herni_pozice_key \\\n", "0 127018397 31347675075 157656900 \n", "1 127018397 31347675075 157656900 \n", "2 127018397 31347675075 157656900 \n", "3 127018397 31347675075 157656900 \n", "4 127018397 31347675075 157656900 \n", "... ... ... ... \n", "7420165 277179595 79816767890 160812221 \n", "7420166 277179595 79816767890 160812221 \n", "7420167 277179595 79816767890 160812221 \n", "7420168 277179595 79816767890 160812221 \n", "7420169 277179595 79816767890 160812221 \n", "\n", " sazkavysepuvodni sazkaprijeticas vyhravysepuvodni \\\n", "0 100 2023-12-18T15:16:17Z 0 \n", "1 100 2023-12-18T15:16:19.600Z 0 \n", "2 100 2023-12-18T15:16:22.100Z 0 \n", "3 100 2023-12-18T15:16:24.600Z 0 \n", "4 100 2023-12-18T15:16:27.100Z 0 \n", "... ... ... ... \n", "7420165 5 2023-12-31T18:02:37Z 0 \n", "7420166 5 2023-12-31T18:02:40Z 0 \n", "7420167 5 2023-12-31T18:02:42Z 0 \n", "7420168 5 2023-12-31T18:02:44Z 0 \n", "7420169 5 2023-12-31T18:02:46Z 0 \n", "\n", " df_v_evidence_her_key zmena_konto_misto cas_pred \n", "0 507989000 True NaN \n", "1 507989000 False 2.6 \n", "2 507989000 False 2.5 \n", "3 507989000 False 2.5 \n", "4 507989000 False 2.5 \n", "... ... ... ... \n", "7420165 420825383 False 2.0 \n", "7420166 420825383 False 3.0 \n", "7420167 420825383 False 2.0 \n", "7420168 420825383 False 2.0 \n", "7420169 420825383 False 2.0 \n", "\n", "[7420170 rows x 9 columns]" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "D_hra = pd.read_csv(\"hra-upr.csv\")\n", "D_hra" ] }, { "cell_type": "code", "execution_count": 34, "id": "0567f039", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
df_v_misto_keytypmistosidlokodruianjtsk_yjtsk_xkrajobecpsculicecp
0145526315HNaNNaNNaNLBKSemily51301Komenského nám.119
1258252794HNaNNaNNaNJHMVyškov68201Masarykovo náměstí39
2271154701HNaNNaNNaNZLKRožnov pod Radhoštěm75661Svazarmovská1682
3130950653H951749.0824561.291091140.86PLKStod33301nám. ČSA72
4130950654H1498967.0763128.561049512.75STCLoděnice26712Plzeňská44
5127018394H2771217.0517707.491082628.66MSKÚvalno79391Úvalno246
6130950651H5541751.0773463.881126872.63JHCMirovice39806Masarykovo náměstí44
7130950656H5918901.0783349.251061108.00STCŽebrák26753Náměstí11
8130950647H7655762.0632009.891058808.04PAKPardubice53002Teplého1375
9127018395H11545755.0700196.471071764.05STCKutná Hora28401Václavské náměstí177
10127018397H18167829.0464086.171103583.85STCLysá nad Labem28922Smetanova789
11130950652H20427751.0NaNNaNOLKJavorník79070nám. Svobody415
12130950649H23638907.0NaNNaNJHMZbýšov66411Masarykova54
13277179595H26321823.0756748.40989637.48ULKLitoměřice41201Pokratická1851
14127018396H27911420.0719473.70949421.61JHMHustopeče69301Dukelské nám.1326
\n", "
" ], "text/plain": [ " df_v_misto_key typmisto sidlokodruian jtsk_y jtsk_x kraj \\\n", "0 145526315 H NaN NaN NaN LBK \n", "1 258252794 H NaN NaN NaN JHM \n", "2 271154701 H NaN NaN NaN ZLK \n", "3 130950653 H 951749.0 824561.29 1091140.86 PLK \n", "4 130950654 H 1498967.0 763128.56 1049512.75 STC \n", "5 127018394 H 2771217.0 517707.49 1082628.66 MSK \n", "6 130950651 H 5541751.0 773463.88 1126872.63 JHC \n", "7 130950656 H 5918901.0 783349.25 1061108.00 STC \n", "8 130950647 H 7655762.0 632009.89 1058808.04 PAK \n", "9 127018395 H 11545755.0 700196.47 1071764.05 STC \n", "10 127018397 H 18167829.0 464086.17 1103583.85 STC \n", "11 130950652 H 20427751.0 NaN NaN OLK \n", "12 130950649 H 23638907.0 NaN NaN JHM \n", "13 277179595 H 26321823.0 756748.40 989637.48 ULK \n", "14 127018396 H 27911420.0 719473.70 949421.61 JHM \n", "\n", " obec psc ulice cp \n", "0 Semily 51301 Komenského nám. 119 \n", "1 Vyškov 68201 Masarykovo náměstí 39 \n", "2 Rožnov pod Radhoštěm 75661 Svazarmovská 1682 \n", "3 Stod 33301 nám. ČSA 72 \n", "4 Loděnice 26712 Plzeňská 44 \n", "5 Úvalno 79391 Úvalno 246 \n", "6 Mirovice 39806 Masarykovo náměstí 44 \n", "7 Žebrák 26753 Náměstí 11 \n", "8 Pardubice 53002 Teplého 1375 \n", "9 Kutná Hora 28401 Václavské náměstí 177 \n", "10 Lysá nad Labem 28922 Smetanova 789 \n", "11 Javorník 79070 nám. Svobody 415 \n", "12 Zbýšov 66411 Masarykova 54 \n", "13 Litoměřice 41201 Pokratická 1851 \n", "14 Hustopeče 69301 Dukelské nám. 1326 " ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "D_misto = pd.read_csv(\"misto-upr.csv\")\n", "D_misto" ] }, { "cell_type": "markdown", "id": "18473a5d", "metadata": {}, "source": [ "## Goal\n", "We want to detect gambler ids suspected of the misuse - that is, more players share one id. For that purpose:\n", "\n", "* we divide records to sessions (one session contains records of one gambler's visit in a place)\n", "* we assign session id to each record\n", "* we make a table of sessions with aggregated statistics that help us detect suspicious sessions\n", "* (if needed) we transform computed statistics\n", "* (recommended) we make an exploration of statistics to find usual and unusual values\n", "* we assign a flags or score of suspection to each session\n", "* we aggregate that score for each gambler id and catch the decievers :-)\n", "\n", "The output of your work should be the table of sessions, each session containing session id, gambler id, place id, statistics of the session and some metrics of suspicion." ] }, { "cell_type": "markdown", "id": "02432c28", "metadata": {}, "source": [ "## Useful functions and methods\n", "\n", "* `pd.to_datetime` - converts string to datetime, here format is \"ISO8601\", see [tutorial](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.to_datetime.html)\n", "* time difference in seconds - apply `total_seconds()` method on a difference (datetime2 - datetime1)\n", "* conversion of a timezone - apply `dt.tz_convert(tz=[timezone_string])` on a Series, see [tutorial](https://pandas.pydata.org/docs/reference/api/pandas.Series.dt.tz_convert.html)\n", "* shift a Series by *n* elements - see user function definition below\n", "\n", "```\n", "# s [Series]: Series to be shifted\n", "# n [integer]: number of steps to shift the Series ahead\n", "def lag(s, n=1):\n", " res = pd.concat([pd.Series([pd.NA]*n), s.head(-n)])\n", " return res\n", "\n", "### examples\n", "# lag(pd.Series(np.arange(10))) # returns Series NA, 0, 1, ..., 8\n", "# lag(pd.Series(np.arange(10)), 2) # returns Series NA, NA, 0, 1, ..., 7\n", "```\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.18" } }, "nbformat": 4, "nbformat_minor": 5 }