Technologies I

for Data Science

Outline

  • Pick & set up an IDE
  • A tour of the Python ecosystem
  • Pandas in a nutshell

Data Science process

Getting Python

Developing in Python

  • The Jupyter family
    • Jupyter notebook
    • JupyterHub - multi-user, designed for companies, classrooms, …
    • JupyterLab - next-gen notebook interface
  • VSCode
    • plus extensions for Jupyter, Quarto etc.
  • Some alternatives:
    • JetBrains PyCharm (free for students)
    • JetBrains DataSpell
    • Spyder (RStudio for Python)

Jupyter

  • Web-based IDE
    • Jupyter stands for Julia + Python + R
  • Report oriented
    • Code & Markdown cells
    • But handles .py etc. as well
  • Install & run
    • pip install jupyterlab
    • jupyter lab
    • (… or use the Anaconda distribution)
  • Kernel (Docs)
    • The interpreter running your commands
  • Two modes: Edit and Command

Jupyter - shortcuts

  • Esc / Enter (Edit <–> Command)
  • Command mode:
    • Select: arrow up/down
    • Add cell: (A)bove, (B)elow; (DD) delete cell
    • Make cell (M)arkdown, (Y) code
  • Edit mode:
    • Tab - code completion
    • Ctrl + Z/Y - undo/redo
  • Both modes:
    • Ctrl + Enter - run a cell
    • Shift + Enter - run & select next
  • Check out some cheat sheets, e.g. this one

VSCode cool things

  • Devtools integration: 
    • git, docker, pylint, …
  • Quite smart Python autocompletion
    • plus Copilot/Tabnine integration
  • Interactive debugger 
  • Database plugin 
  • Unit-tests runner 
  • Remote development
  • Quarto integration 
    • Not available in JetBrains 😞

VSCode setup

  • Install & run
  • Set up a project dir
    • Open folder
    • (or connect to a remote host for remote development), …
  • Install extensions
    • python
    • jupyter
    • Python Debugger
    • Optional: path intellisense, pylint, black, yaml, wsl, …
  • Key shortcut:
    • ctrl + shift + p – command palette
    • Just type in what you need to do

Note

You can use your favourite IDE, of course.

Jupyter in VSCode

  • Just create a new .ipynb file (or open an existing one)
    • Command palette (ctrl + shift + p) –> (type) New file –> Jupyter Notebook
  • Setup kernel
    • Note: by default, VSCode uses the env virtual environment in the project directory
    • Click Select kernel in top right corner & follow instructions
    • Or you can create one on your own (e.g., terminal -> python -m venv env)
  • Work with notebook as usual
    • Same shortcuts (you can adjust them as well) etc.

Debugging Jupyter in VSCode

  • Add a breakpoint
    • In a code cell, click in the left margin next to the line you’d like to debug
  • Run cell with Debug (see below)
  • Now you can
    • Explore variables, call stack, breakpoints, … (left panel)
    • Interact with interpreter (Debug Console; bottom panel)
    • Execute lines step by step (floating panel / shortcuts)

Jupyter nbconvert

Do not share your results as source code.

  • nbconvert helps you export the notebook to other formats (e.g., HTML)
  • VSCode
    • Navigate ... (top panel) –> Export –> HTML file
  • CMD
    • (Terminal in bottom panel)
    • (Be sure you have jupyter installed in python env)
    • jupyter nbconvert <path/to/ntb.ipynb> --to html
      • hide code cells via --no-input
  • TIP: pretty-jupyter package
    • Make your output prettier
    • Just add --template pj
  • TIP: Quarto can do that for you as well (spoiler alert, see the next slide)
    • Command palette -> (type) Quarto render -> select HTML
    • Shortcut: ctrl + shift + k

Quarto

  • An open-source scientific and technical publishing system
  • It’s like Jupyter, but better 😃
    • Works with Python, R, Julia and Observable
    • Based on Pandoc; one input, many outputs
  • Install & run
    • Download & install quarto app
    • Install VSCode extension: Quarto
      • Not available for PyCharm 😞.
      • Therefore, I’m using VSCode

  • Stay tuned, details will be covered in DS Reporting lecture.

Python toolkit

Python toolkit [1]

  • Data manipulation
    • Pandas 
    • Apache Spark (pyspark) 
    • Numpy
  • Data visualization
    • matplotlib
    • seaborn
    • seaborn.objects
    • plotly
    • plotnine

At least one from each section is a must.

Python toolkit [2]

  • Stat and Machine learning
    • scikit-learn (and scikit-* family) 
    • statsmodels
    • many others: prophet (time-series data), h2o, …
  • Gradient boosting
    • catboost
    • lightgbm
    • xgboost
  • MLOps
    • mlflow
    • metaflow

Python toolkit [3]

  • NLP
    • nltk
    • spacy
    • gensim
  • Vision
    • opencv (cv2)
    • scikit-image 
  • Graphs
    • networkx
    • snap-stanford

Extras. Highly depends on your projects.

Pandas

Pandas basics

  • import pandas as pd
  • pd.Series
    • 1d labeled array
    • May contain mixed data types
  • pd.DataFrame
    • 2d array, aka table
    • a dict like container of pd.Series
  • Axes
    • 0 – index (row-wise)
    • 1 – columns (column-wise)
    • See the sketch below
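
A minimal sketch of the axis semantics (self-contained; names are illustrative):

import pandas as pd

df_axes = pd.DataFrame({"a": [1, 2], "b": [10, 20]})
df_axes.sum(axis=0)  # collapse along the index (rows): a -> 3, b -> 30
df_axes.sum(axis=1)  # collapse along the columns: row 0 -> 11, row 1 -> 22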

Pandas indices and columns

  • Index
    • aka primary key for pd.Series and pd.DataFrame (indicating rows)
    • without UNIQUE constraint!
    • df.index – access the index
    • df.set_index(col_name), df.reset_index()
  • Columns - column index of a pd.DataFrame
    • df.columns
    • Change colnames via assigning:
      • df.columns = ["a", "b", "c"]
      • df.columns = [c.lower() for c in df.columns]
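
A minimal sketch of the index round-trip (toy data; names are illustrative):

import pandas as pd

people = pd.DataFrame({"Name": ["Ann", "Bob"], "Age": [30, 40]})
people = people.set_index("Name")        # "Name" becomes the row index
people.loc["Ann"]                        # look a row up by its index value
people = people.reset_index()            # move the index back into a column
people.columns = [c.lower() for c in people.columns]  # -> ["name", "age"]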

DataFrame - example

import pandas as pd
import seaborn as sns  

s = pd.Series(["A", "B", 1, 2, 3], index=["a", "b", "c", "d", "e"])
s
a    A
b    B
c    1
d    2
e    3
dtype: object
df = sns.load_dataset("titanic")
df.head()
   survived  pclass     sex   age  sibsp  parch     fare embarked  class    who  adult_male deck  embark_town alive  alone
0         0       3    male  22.0      1      0   7.2500        S  Third    man        True  NaN  Southampton    no  False
1         1       1  female  38.0      1      0  71.2833        C  First  woman       False    C    Cherbourg   yes  False
2         1       3  female  26.0      0      0   7.9250        S  Third  woman       False  NaN  Southampton   yes   True
3         1       1  female  35.0      1      0  53.1000        S  First  woman       False    C  Southampton   yes  False
4         0       3    male  35.0      0      0   8.0500        S  Third    man        True  NaN  Southampton    no   True
df.index
RangeIndex(start=0, stop=891, step=1)
df.columns
Index(['survived', 'pclass', 'sex', 'age', 'sibsp', 'parch', 'fare',
       'embarked', 'class', 'who', 'adult_male', 'deck', 'embark_town',
       'alive', 'alone'],
      dtype='object')

Data types

  • Available data types
    • Numbers – Integers x Floating-point
    • Booleans 
    • Dates
    • Categories
    • Text
  • Explore dtypes: df.dtypes
  • Select by dtype: df.select_dtypes
  • Change dtype: df.astype({col: type})

Data types - example

df.dtypes
survived          int64
pclass            int64
sex              object
age             float64
sibsp             int64
parch             int64
fare            float64
embarked         object
class          category
who              object
adult_male         bool
deck           category
embark_town      object
alive            object
alone              bool
dtype: object
df.select_dtypes("float")
      age     fare
0    22.0   7.2500
1    38.0  71.2833
2    26.0   7.9250
3    35.0  53.1000
4    35.0   8.0500
..    ...      ...
886  27.0  13.0000
887  19.0  30.0000
888   NaN  23.4500
889  26.0  30.0000
890  32.0   7.7500

[891 rows x 2 columns]

df.astype({"age": "string"}).dtypes
survived                int64
pclass                  int64
sex                    object
age            string[python]
sibsp                   int64
parch                   int64
fare                  float64
embarked               object
class                category
who                    object
adult_male               bool
deck                 category
embark_town            object
alive                  object
alone                    bool
dtype: object

Pandas I/O

  • Files
    • json, csv, tsv –> pd.read_csv() / pd.read_json()
    • parquet - (pyarrow / fastparquet) –> pd.read_parquet()
    • excel - (openpyxl) –> pd.read_excel()
    • pickle  –> pd.read_pickle()
  • Database
    • sqlalchemy –> create_engine –> pandas
      • connection string needed (e.g., via pyodbc)
      • e.g.: pd.read_sql_table(tabname, con=engine) or pd.read_sql_query()
  • Writing
    • to_*() family of functions (csv, json, parquet, excel)
    • use compression to reduce file size and speed up I/O (.gz, .zip)

Pandas I/O - example

df.to_csv("tips.csv")

df2 = pd.read_csv("tips.csv")
df2.head()
   Unnamed: 0  survived  pclass     sex   age  sibsp  parch     fare embarked  class    who  adult_male deck  embark_town alive  alone
0           0         0       3    male  22.0      1      0   7.2500        S  Third    man        True  NaN  Southampton    no  False
1           1         1       1  female  38.0      1      0  71.2833        C  First  woman       False    C    Cherbourg   yes  False
2           2         1       3  female  26.0      0      0   7.9250        S  Third  woman       False  NaN  Southampton   yes   True
3           3         1       1  female  35.0      1      0  53.1000        S  First  woman       False    C  Southampton   yes  False
4           4         0       3    male  35.0      0      0   8.0500        S  Third    man        True  NaN  Southampton    no   True
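
Tip: pass index=False to to_csv to avoid the Unnamed: 0 column above. The database route is not shown here; a minimal sketch (assumes SQLAlchemy is installed; an in-memory SQLite engine stands in for a real connection string):

from sqlalchemy import create_engine

engine = create_engine("sqlite://")              # in-memory SQLite database
df.to_sql("titanic", con=engine, index=False)    # write the frame as a table
pd.read_sql_query("SELECT survived, fare FROM titanic", con=engine).head()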

Getting the right data

  • By columns
    • A single column df["age"]  –> pd.Series
    • Many columns df[["age", "sex"]]  –> pd.DataFrame
    • By index of a column df.iloc[:, 0], df.iloc[:, [2,3]], df.iloc[:, 2:6]
  • By rows
    • By index value df.loc[0], df.loc[0:5]
    • By integer df.iloc[0], df.iloc[0:3]
  • Subset
    • By condition df[df["age"] >= 30]
    • By multiple conditions df[(df["age"] >= 30) & (df["sex"] == "female")]
    • Boolean indexing operators & (and), | (or), ~ (not)
    • Tip: the .query() method (see the sketch after the example)

Getting the right data - example

# single column -> pd.Series
print(type(df["age"]))
df["age"].head()  
<class 'pandas.core.series.Series'>
0    22.0
1    38.0
2    26.0
3    35.0
4    35.0
Name: age, dtype: float64
# several columns -> pd.DataFrame
print(type(df[["age", "sex"]]))
df[["age", "sex"]].head()  
<class 'pandas.core.frame.DataFrame'>
    age     sex
0  22.0    male
1  38.0  female
2  26.0  female
3  35.0  female
4  35.0    male
# by column order
df.iloc[:,0]  # = all rows (:) and the first column (0)
0      0
1      1
2      1
3      1
4      0
      ..
886    0
887    1
888    0
889    1
890    0
Name: survived, Length: 891, dtype: int64
df.iloc[0,:] # = first row, all columns
survived                 0
pclass                   3
sex                   male
age                   22.0
sibsp                    1
parch                    0
fare                  7.25
embarked                 S
class                Third
who                    man
adult_male            True
deck                   NaN
embark_town    Southampton
alive                   no
alone                False
Name: 0, dtype: object
# select rows with given index (row pk)
df.set_index("sex").loc["male",:]  # weird, but functional
      survived  pclass   age  sibsp  parch     fare embarked   class    who  adult_male deck  embark_town alive  alone
sex
male         0       3  22.0      1      0   7.2500        S   Third    man        True  NaN  Southampton    no  False
male         0       3  35.0      0      0   8.0500        S   Third    man        True  NaN  Southampton    no   True
male         0       3   NaN      0      0   8.4583        Q   Third    man        True  NaN   Queenstown    no   True
male         0       1  54.0      0      0  51.8625        S   First    man        True    E  Southampton    no   True
male         0       3   2.0      3      1  21.0750        S   Third  child       False  NaN  Southampton    no  False
...        ...     ...   ...    ...    ...      ...      ...     ...    ...         ...  ...          ...   ...    ...
male         0       2  28.0      0      0  10.5000        S  Second    man        True  NaN  Southampton    no   True
male         0       3  25.0      0      0   7.0500        S   Third    man        True  NaN  Southampton    no   True
male         0       2  27.0      0      0  13.0000        S  Second    man        True  NaN  Southampton    no   True
male         1       1  26.0      0      0  30.0000        C   First    man        True    C    Cherbourg   yes   True
male         0       3  32.0      0      0   7.7500        Q   Third    man        True  NaN   Queenstown    no   True

[577 rows x 14 columns]
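
The .query() tip above, as a sketch equivalent to the boolean-indexing subset:

df.query("age >= 30 and sex == 'female'").head()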

DataFrame operations & attributes

  • df.head(), df.tail(), df.sample(n=12) - a quick glimpse at the data

  • df.columns, df.shape - data dimensions

  • df.count()

  • df.describe() - summary stats for numeric columns

  • df["sex"].value_counts() - frequency table

  • df.sort_index() or df.sort_values("column", ascending=True)

    • You can sort by multiple columns – df.sort_values(["a", "b"])
  • df["val"].astype("newtype") - change dtype

Example

df.shape
(891, 15)
df.iloc[:,:4].count()
survived    891
pclass      891
sex         891
age         714
dtype: int64
df.describe()
         survived      pclass         age       sibsp       parch        fare
count  891.000000  891.000000  714.000000  891.000000  891.000000  891.000000
mean     0.383838    2.308642   29.699118    0.523008    0.381594   32.204208
std      0.486592    0.836071   14.526497    1.102743    0.806057   49.693429
min      0.000000    1.000000    0.420000    0.000000    0.000000    0.000000
25%      0.000000    2.000000   20.125000    0.000000    0.000000    7.910400
50%      0.000000    3.000000   28.000000    0.000000    0.000000   14.454200
75%      1.000000    3.000000   38.000000    1.000000    0.000000   31.000000
max      1.000000    3.000000   80.000000    8.000000    6.000000  512.329200
df["pclass"].value_counts()
pclass
3    491
1    216
2    184
Name: count, dtype: int64
df.sort_values("fare")[["fare"]]
         fare
806    0.0000
413    0.0000
481    0.0000
466    0.0000
179    0.0000
..        ...
341  263.0000
438  263.0000
679  512.3292
258  512.3292
737  512.3292

[891 rows x 1 columns]
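
Sorting by multiple columns, as mentioned above (column choice is illustrative):

# class first, then the most expensive fares within each class
df.sort_values(["pclass", "fare"], ascending=[True, False]).head()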

DataFrame operations & attributes

  • df.drop(labels, axis="columns")

  • df.drop_duplicates()

    • You can specify which columns to check for duplicates via subset
  • df.rename({"A": "B"}, axis="columns") - rename

  • df["sex"].map({"male": 0, "female": 1}) - relevel

  • df["age"].replace(0, 999) - replace

  • pd.cut(df["age"], bins=[17, 21, 35, 45, 100])

Example

df.drop(["age", "sex"], axis="columns").head()
   survived  pclass  sibsp  parch     fare embarked  class    who  adult_male deck  embark_town alive  alone
0         0       3      1      0   7.2500        S  Third    man        True  NaN  Southampton    no  False
1         1       1      1      0  71.2833        C  First  woman       False    C    Cherbourg   yes  False
2         1       3      0      0   7.9250        S  Third  woman       False  NaN  Southampton   yes   True
3         1       1      1      0  53.1000        S  First  woman       False    C  Southampton   yes  False
4         0       3      0      0   8.0500        S  Third    man        True  NaN  Southampton    no   True
df.drop([0], axis="rows").head()
   survived  pclass     sex   age  sibsp  parch     fare embarked  class    who  adult_male deck  embark_town alive  alone
1         1       1  female  38.0      1      0  71.2833        C  First  woman       False    C    Cherbourg   yes  False
2         1       3  female  26.0      0      0   7.9250        S  Third  woman       False  NaN  Southampton   yes   True
3         1       1  female  35.0      1      0  53.1000        S  First  woman       False    C  Southampton   yes  False
4         0       3    male  35.0      0      0   8.0500        S  Third    man        True  NaN  Southampton    no   True
5         0       3    male   NaN      0      0   8.4583        Q  Third    man        True  NaN   Queenstown    no   True
df.drop_duplicates().shape
(784, 15)
df.drop_duplicates(["pclass"]).shape
(3, 15)
df[["pclass"]].replace({"pclass": {1: "1st", 2: "2nd", 3: "3rd"}}).head()
  pclass
0    3rd
1    1st
2    3rd
3    1st
4    3rd
pd.cut(df["age"], [0, 20, 40, 80, 100]).value_counts()
age
(20, 40]     385
(0, 20]      179
(40, 80]     150
(80, 100]      0
Name: count, dtype: int64
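
rename and map from the list above are not covered by the example; a minimal sketch:

df.rename({"sex": "gender"}, axis="columns").head()   # rename a column
df["sex"].map({"male": 0, "female": 1}).head()        # relevel / recode values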

DataFrame operations & attributes

  • df.assign() - ❗ Deriving new columns
  • syntax of params:
    • new_column_name = fun, where fun is a function operating on the df
  • You can use columns derived in preceding arguments of the same .assign call
import numpy as np 

df[["fare", "alive"]].assign(
    is_alive = lambda d: np.where(d["alive"] == "yes", 1, 0),
    fare_sq = lambda d: d["fare"]**2,
    fare_sq_sqrt = lambda d: d["fare_sq"]**(1/2)
).head()
      fare alive  is_alive      fare_sq  fare_sq_sqrt
0   7.2500    no         0    52.562500        7.2500
1  71.2833   yes         1  5081.308859       71.2833
2   7.9250   yes         1    62.805625        7.9250
3  53.1000   yes         1  2819.610000       53.1000
4   8.0500    no         0    64.802500        8.0500

DataFrame piping

df2["is_male"] = df2["gender"] == 'male'
male_proportion_by_class = df2.groupby("pclass").agg({"is_male": "mean"})
male_proportion_by_class.sort_values().head(1)

vs

(
    df
    .assign(is_male = lambda d: d["sex"] == 'male')
    .groupby("pclass")
    .agg({"is_male": "mean"})
    .sort_values("is_male")
    .head(1)
)

Missing data

  • Missing completely at random
    • errors during data collection or processing, in a non-systematic way
  • Missing at random
    • missing values caused by known facts only (e.g., not having a wife –> unknown wife’s age)
  • Missing not at random - missing values caused by unknown variables, too (e.g., rich people not motivated enough to fill in a poll –> bias)
  • Typical strategies to deal with missing data
    • Drop column - df.drop(columns=["colA"])
    • Drop rows  - df.dropna(subset=["colA"])
    • Impute with a constant (mean, mode, 0), e.g.: df.fillna({"colA": 0})
    • Impute with a model 
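
The model-based option is not shown later; a minimal sketch using scikit-learn's KNNImputer (one possible model among many; assumes scikit-learn is installed):

from sklearn.impute import KNNImputer

# impute age from the other numeric columns via k-nearest neighbours
numeric = df[["age", "fare", "pclass"]]
imputed = KNNImputer(n_neighbors=5).fit_transform(numeric)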

Missings in Pandas

  • None – Python’s general representation of a missing value

  • np.nan – NumPy’s NaN is what pandas usually uses

  • Note: By default, NaNs are excluded from aggregate functions 

  • To check whether a value is missing

    • df["age"].isna() or df["age"].notna()
  • We can drop rows with missing values 

    • df.dropna(subset=["age"])
  • We can fill missing values with a constant 

    • df["age"].fillna(val)

Missing data - example

df.dropna().shape
(182, 15)
df.dropna(subset = ["age"]).shape
(714, 15)
df[["age"]].value_counts(dropna=False).head()
age 
NaN     177
24.0     30
22.0     27
18.0     26
19.0     25
Name: count, dtype: int64
df[["age"]].fillna(-1).value_counts().head()
age  
-1.0     177
 24.0     30
 22.0     27
 18.0     26
 19.0     25
Name: count, dtype: int64
df[["age"]].ffill().value_counts().head() # wrong in this case
age 
24.0    39
19.0    38
21.0    33
22.0    32
28.0    32
Name: count, dtype: int64
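
The isna/notna checks from the previous slide, as one-liners:

df["age"].isna().sum()    # 177 missing ages, matching the counts above
df["age"].notna().mean()  # share of non-missing values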

Summary tables

  • How to produce a summary table for two or more categorical variables?
    • pd.crosstab – frequency table
    • df.pivot_table – any aggregation fn
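
A minimal sketch of both on the titanic data (column choice is illustrative):

pd.crosstab(df["sex"], df["pclass"])  # frequency table: counts per sex x class

# any aggregation: mean fare per class and sex
df.pivot_table(values="fare", index="pclass", columns="sex", aggfunc="mean")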

Combining tables - merge Docs

  • Database like join – df.merge 
  • on (or the pair left_on, right_on) defines the column name(s) to merge on
    • You can use left_index=True / right_index=True as well
  • how defines the type of the join
    • left, right, outer, inner, cross

Merge - example

ports = (
    df[["embarked"]]
    .assign(port_label = lambda d: d["embarked"].astype(str).str.lower())
    .drop_duplicates()
    .set_index("embarked")
)
ports
         port_label
embarked
S                 s
C                 c
Q                 q
NaN             nan
(
    df
    .merge(ports, how="left", left_on="embarked", right_index=True)
    .filter(items = ["survived", "embarked", "port_label"])
    .head()
)
   survived embarked port_label
0         0        S          s
1         1        C          c
2         1        S          s
3         1        S          s
4         0        S          s
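
When the key is a regular column on both sides, on= plus how= is enough (toy frames; names are illustrative):

left = pd.DataFrame({"key": ["a", "b"], "x": [1, 2]})
right = pd.DataFrame({"key": ["b", "c"], "y": [3, 4]})
left.merge(right, on="key", how="outer")  # keep keys from both sides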

Combining tables - concat Docs

  • We can also glue dataframes together by rows/columns using pd.concat 

pd.concat([df1, df2, df3])

pd.concat([df1, df2], axis="columns")

Concat - example

df1 = df[:10]
df2 = df[10:]

print(df.shape)
print(df1.shape)
print(df2.shape)
(891, 15)
(10, 15)
(881, 15)
pd.concat([df1, df2], axis="rows").shape
(891, 15)
pd.concat([df1, df2], axis="columns").shape
(891, 30)

Aggregations

  • Split the data into groups and compute summary statistics for each group
    • Tip: use as_index=False to keep the grouping col(s)
(
    df
    .groupby([group_key, another_one])
    .agg(example = ("column_name", "agg_function"))
)
  • Aggregate functions
    • min, max, mean, nunique, sum, size, count, var, sem (standard error of the mean), describe
    • first, last, nth

Aggregations - example

(
    df.groupby(["pclass"], as_index=False).agg(
        n=("survived", "count"),
        mean_fare=("fare", "mean"),
        custom_agg_fun=("age", lambda x: x[~x.isna()].count()),
    )
)
   pclass    n  mean_fare  custom_agg_fun
0       1  216  84.154687             186
1       2  184  20.662183             173
2       3  491  13.675550             355