# Hazard sessions

To practise data transformation, we use data on gambling games.

## Data
Download *hazard.zip*, unpack it, and read the two files:

* ***hra-upr.csv***: (BIG!) table of individual games played by a gambler, sorted by gambler id and time
  - *df_v_misto_key*: place id 
  - *df_v_konto_key*: gambler id
  - *df_v_herni_pozice_key*: machine id
  - *sazkavysepuvodni*: bet amount (in CZK)
  - *sazkaprijeticas*: exact date and time of playing a game
  - *vyhravysepuvodni*: winnings amount (in CZK)
  - *df_v_evidence_her_key*: game id
  - *zmena_konto_misto*: flag set when the current record has a different gambler id or place id than the previous record (change of place, or start of the records of another gambler)
  - *cas_pred*: time difference in seconds from the previous record's *sazkaprijeticas*
* ***misto-upr.csv***: table of places
  - *df_v_misto_key*: place id
  - *jtsk_x*, *jtsk_y*: planar coordinates in the Czech S-JTSK coordinate system (not GPS latitude and longitude)
  - *obec*, *ulice*, *psc*, *cp*: address of the place
  - *kraj*: region of the Czech Republic
  - *typmisto*: place type
  - *sidlokodruian*: address code in the RUIAN register

In [32]:
### Setup
%matplotlib inline
# render plots inline, without an explicit call to plt.show()

# Import libraries
import pandas as pd
import numpy as np
import seaborn as sns
import seaborn.objects as so
import matplotlib.pyplot as plt

In [33]:
D_hra = pd.read_csv("hra-upr.csv")
D_hra

Unnamed: 0,df_v_misto_key,df_v_konto_key,df_v_herni_pozice_key,sazkavysepuvodni,sazkaprijeticas,vyhravysepuvodni,df_v_evidence_her_key,zmena_konto_misto,cas_pred
0,127018397,31347675075,157656900,100,2023-12-18T15:16:17Z,0,507989000,True,
1,127018397,31347675075,157656900,100,2023-12-18T15:16:19.600Z,0,507989000,False,2.6
2,127018397,31347675075,157656900,100,2023-12-18T15:16:22.100Z,0,507989000,False,2.5
3,127018397,31347675075,157656900,100,2023-12-18T15:16:24.600Z,0,507989000,False,2.5
4,127018397,31347675075,157656900,100,2023-12-18T15:16:27.100Z,0,507989000,False,2.5
...,...,...,...,...,...,...,...,...,...
7420165,277179595,79816767890,160812221,5,2023-12-31T18:02:37Z,0,420825383,False,2.0
7420166,277179595,79816767890,160812221,5,2023-12-31T18:02:40Z,0,420825383,False,3.0
7420167,277179595,79816767890,160812221,5,2023-12-31T18:02:42Z,0,420825383,False,2.0
7420168,277179595,79816767890,160812221,5,2023-12-31T18:02:44Z,0,420825383,False,2.0


In [34]:
D_misto = pd.read_csv("misto-upr.csv")
D_misto

Unnamed: 0,df_v_misto_key,typmisto,sidlokodruian,jtsk_y,jtsk_x,kraj,obec,psc,ulice,cp
0,145526315,H,,,,LBK,Semily,51301,Komenského nám.,119
1,258252794,H,,,,JHM,Vyškov,68201,Masarykovo náměstí,39
2,271154701,H,,,,ZLK,Rožnov pod Radhoštěm,75661,Svazarmovská,1682
3,130950653,H,951749.0,824561.29,1091140.86,PLK,Stod,33301,nám. ČSA,72
4,130950654,H,1498967.0,763128.56,1049512.75,STC,Loděnice,26712,Plzeňská,44
5,127018394,H,2771217.0,517707.49,1082628.66,MSK,Úvalno,79391,Úvalno,246
6,130950651,H,5541751.0,773463.88,1126872.63,JHC,Mirovice,39806,Masarykovo náměstí,44
7,130950656,H,5918901.0,783349.25,1061108.0,STC,Žebrák,26753,Náměstí,11
8,130950647,H,7655762.0,632009.89,1058808.04,PAK,Pardubice,53002,Teplého,1375
9,127018395,H,11545755.0,700196.47,1071764.05,STC,Kutná Hora,28401,Václavské náměstí,177


## Goal
We want to detect gambler ids suspected of misuse, that is, ids shared by several players. For that purpose:

* we divide the records into sessions (one session contains the records of one gambler's visit to one place)
* we assign session id to each record
* we make a table of sessions with aggregated statistics that help us detect suspicious sessions
* (if needed) we transform the computed statistics
* (recommended) we explore the statistics to find usual and unusual values
* we assign flags or a suspicion score to each session
* we aggregate that score for each gambler id and catch the deceivers :-)
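
The first three steps above might be sketched as follows. This is only a minimal illustration on a tiny made-up table, not the intended solution. It assumes that a new session starts wherever *zmena_konto_misto* is true or where the pause *cas_pred* exceeds an arbitrary threshold; the threshold `MAX_PAUSE_S` and the output column names (`session_id`, `n_games`, `total_bet`, ...) are hypothetical.

```python
import pandas as pd

# Tiny made-up sample using the column names of hra-upr.csv (all values are invented)
D_demo = pd.DataFrame({
    "df_v_konto_key": [1, 1, 1, 1, 2, 2],
    "df_v_misto_key": [10, 10, 10, 11, 10, 10],
    "sazkavysepuvodni": [100, 100, 50, 50, 5, 5],
    "vyhravysepuvodni": [0, 200, 0, 0, 0, 10],
    "sazkaprijeticas": pd.to_datetime([
        "2023-12-18T15:16:17Z", "2023-12-18T15:16:20Z", "2023-12-18T17:00:00Z",
        "2023-12-18T18:00:00Z", "2023-12-31T18:02:40Z", "2023-12-31T18:02:42Z",
    ]),
    "zmena_konto_misto": [True, False, False, True, True, False],
    "cas_pred": [None, 3.0, 6220.0, 3600.0, None, 2.0],
})

# Assumption: a pause longer than 30 minutes also starts a new session
MAX_PAUSE_S = 1800
new_session = D_demo["zmena_konto_misto"] | (D_demo["cas_pred"] > MAX_PAUSE_S)

# The cumulative sum of the "new session" flag gives a running session id
D_demo["session_id"] = new_session.cumsum()

# One row per session with a few aggregated statistics
D_sessions = (
    D_demo.groupby("session_id")
    .agg(
        gambler_id=("df_v_konto_key", "first"),
        place_id=("df_v_misto_key", "first"),
        n_games=("sazkavysepuvodni", "count"),
        total_bet=("sazkavysepuvodni", "sum"),
        total_win=("vyhravysepuvodni", "sum"),
        start=("sazkaprijeticas", "min"),
        end=("sazkaprijeticas", "max"),
    )
    .reset_index()
)
D_sessions["duration_s"] = (D_sessions["end"] - D_sessions["start"]).dt.total_seconds()
print(D_sessions)
```

On the real data the same steps would be applied to `D_hra`; whether a pause threshold is appropriate, and how large it should be, is something to check during exploration.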

The output of your work should be a table of sessions with one row per session, containing the session id, gambler id, place id, the statistics of the session, and some metrics of suspicion.
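
One possible shape of the suspicion metrics is sketched below on a made-up session table. The two rules (an implausibly long session, and a change of place faster than anyone could travel) and their thresholds `MAX_DURATION_S` and `MIN_TRAVEL_S` are assumptions chosen for illustration only; the column names are hypothetical as well. Suitable rules should come out of your own exploration of the data.

```python
import pandas as pd

# Made-up session table; column names and values are hypothetical
D_sessions = pd.DataFrame({
    "session_id": [1, 2, 3, 4, 5],
    "gambler_id": [1, 1, 1, 2, 2],
    "place_id": [10, 11, 10, 10, 10],
    "start": pd.to_datetime(["2023-12-18 10:00:00", "2023-12-18 11:00:30",
                             "2023-12-19 09:00:00", "2023-12-18 08:00:00",
                             "2023-12-19 08:00:00"]),
    "end": pd.to_datetime(["2023-12-18 11:00:00", "2023-12-18 12:00:00",
                           "2023-12-20 01:00:00", "2023-12-18 09:00:00",
                           "2023-12-19 10:00:00"]),
})
D_sessions["duration_s"] = (D_sessions["end"] - D_sessions["start"]).dt.total_seconds()

MAX_DURATION_S = 12 * 3600  # assumed: one person rarely plays over 12 hours in one visit
MIN_TRAVEL_S = 60           # assumed: changing place takes at least one minute

# End time and place of the previous session of the same gambler
D_sessions = D_sessions.sort_values(["gambler_id", "start"])
prev = D_sessions.groupby("gambler_id")[["end", "place_id"]].shift(1)
gap_s = (D_sessions["start"] - prev["end"]).dt.total_seconds()

# Flags per session; comparisons with missing values evaluate to False
D_sessions["flag_long"] = D_sessions["duration_s"] > MAX_DURATION_S
D_sessions["flag_fast_switch"] = (gap_s < MIN_TRAVEL_S) & (
    D_sessions["place_id"] != prev["place_id"]
)
D_sessions["score"] = D_sessions[["flag_long", "flag_fast_switch"]].sum(axis=1)

# Aggregate the session scores for each gambler id
D_gamblers = D_sessions.groupby("gambler_id")["score"].agg(["sum", "mean", "max"])
print(D_gamblers)
```

A simple count of flags is used as the score here because it is easy to inspect; a weighted or standardized score would be an equally reasonable choice.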

## Useful functions and methods

* `pd.to_datetime` - converts strings to datetime; here use `format="ISO8601"`, see the [documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.to_datetime.html)
* time difference in seconds - apply the `total_seconds()` method to a difference (datetime2 - datetime1); for a Series of differences use `.dt.total_seconds()`
* conversion of a timezone - apply `dt.tz_convert(tz=[timezone_string])` to a Series, see the [documentation](https://pandas.pydata.org/docs/reference/api/pandas.Series.dt.tz_convert.html)
* shift a Series by *n* elements - see the user-defined function below (pandas also provides the built-in `Series.shift`)

```
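# s [Series]: Series to be shifted
# n [integer]: number of positions to shift the values towards the end of the Series
def lag(s, n=1):
    # Series.shift keeps the original index, so the result can be assigned
    # directly as a new column; the first n values become missing (NaN/NaT)
    res = s.shift(n)
    return res

### examples
# lag(pd.Series(np.arange(10))) # returns Series NaN, 0, 1, ..., 8 (as floats)
# lag(pd.Series(np.arange(10)), 2) # returns Series NaN, NaN, 0, 1, ..., 7 (as floats)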
```
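
The datetime helpers listed above can be tried on two timestamps written in the same form as the *sazkaprijeticas* column. This is a small sketch, assuming pandas 2.0 or newer (needed for `format="ISO8601"`) and assuming `"Europe/Prague"` as the local timezone of interest; the variable names `t` and `t_local` are made up for the example.

```python
import pandas as pd

# Two timestamps, one without and one with fractional seconds
t = pd.to_datetime(
    pd.Series(["2023-12-18T15:16:17Z", "2023-12-18T15:16:19.600Z"]),
    format="ISO8601",
)

# Difference of two single timestamps: Timedelta.total_seconds()
print((t[1] - t[0]).total_seconds())

# Differences between consecutive rows of a Series: use the .dt accessor
print(t.diff().dt.total_seconds())

# Convert from UTC to local time (timezone name is an assumption)
t_local = t.dt.tz_convert(tz="Europe/Prague")
print(t_local)
```

The trailing `Z` marks the strings as UTC, so the parsed Series is timezone-aware and `tz_convert` can be applied directly.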
