Jupyter for Data Science

`jupyter` stands for **Ju**lia + **Py**thon + **R** (it handles `.py` etc. as well).

```
pip install jupyterlab
jupyter lab
```
- `Esc` / `Enter` – switch mode (Edit <–> Command)
- `A` – insert cell above, `B` – below; `DD` – delete cell
- `M` – Markdown, (`Y`) – code
- `Tab` – code completion
- `Ctrl + Z`/`Y` – undo/redo
- `Ctrl + Enter` – run a cell
- `Shift + Enter` – run & select next

VS Code: `git`, `docker`, `pylint`, …; Open folder, Connect to … (for remote development), …

Extensions: `python`, `jupyter`, Python Debugger, path intellisense, `pylint`, `black`, `yaml`, `wsl`, …

`ctrl + shift + p` – command palette

Note: You can use your favourite IDE, of course.
Create a `.ipynb` file (or open an existing one):

- `ctrl + shift + p` –> (type) New file –> Jupyter Notebook
- `env` – virtual environment in the project directory
- Select kernel (top right corner) & follow the instructions
- Debug Console (bottom panel)

Do not share your results as source code.
`nbconvert` helps you export the notebook:

- (top panel) –> Export –> html file
- or via Terminal (bottom panel; `jupyter` installed in the python env):

```
jupyter nbconvert <path/to/ntb.ipynb> --to html
```

- `--no-input` – hide the code cells
- `pretty-jupyter` package: `--template pj`
Quarto: render –> select HTML (`ctrl + shift + k`); needs the `quarto` app and the VSCode extension Quarto.
At least one of each section is a must. Extras: highly depends on your projects.
import pandas as pd

- `pd.Series` and `pd.DataFrame`
- index (indicating rows): `df.index` – access the index; `df.set_index(col_name)`, `df.reset_index()`
- `pd.DataFrame` columns: `df.columns`
- `df.columns = ["a", "b", "c"]`
- `df.columns = [c.lower() for c in df.columns]`
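A minimal sketch of these index and column operations on a toy frame (the data and column names are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Ann", "Bob"], "Age": [30, 41]})

# rename columns in bulk, e.g. lowercase them
df.columns = [c.lower() for c in df.columns]

# promote a column to the index, then turn it back into a column
df = df.set_index("name")
print(df.index.tolist())      # ['Ann', 'Bob'] – the index now holds the names
df = df.reset_index()
print(df.columns.tolist())    # ['name', 'age']
```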
import pandas as pd
import seaborn as sns
s = pd.Series(["A", "B", 1, 2, 3], index=["a", "b", "c", "d", "e"])
s
a A
b B
c 1
d 2
e 3
dtype: object
| | survived | pclass | sex | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alive | alone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.2500 | S | Third | man | True | NaN | Southampton | no | False |
1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | First | woman | False | C | Cherbourg | yes | False |
2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.9250 | S | Third | woman | False | NaN | Southampton | yes | True |
3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1000 | S | First | woman | False | C | Southampton | yes | False |
4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.0500 | S | Third | man | True | NaN | Southampton | no | True |
- `df.dtypes` – data type of each column
- `df.select_dtypes(include=...)` – subset columns by dtype
- `df.astype({col: type})` – convert column dtypes
survived int64
pclass int64
sex object
age float64
sibsp int64
parch int64
fare float64
embarked object
class category
who object
adult_male bool
deck category
embark_town object
alive object
alone bool
dtype: object
| | age | fare |
|---|---|---|
0 | 22.0 | 7.2500 |
1 | 38.0 | 71.2833 |
2 | 26.0 | 7.9250 |
3 | 35.0 | 53.1000 |
4 | 35.0 | 8.0500 |
... | ... | ... |
886 | 27.0 | 13.0000 |
887 | 19.0 | 30.0000 |
888 | NaN | 23.4500 |
889 | 26.0 | 30.0000 |
890 | 32.0 | 7.7500 |
891 rows × 2 columns
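The dtype listing and the numeric subset above come from calls like these; a small self-contained sketch (toy data, not the titanic frame):

```python
import pandas as pd

df = pd.DataFrame({"age": [22.0, 38.0],
                   "sex": ["male", "female"],
                   "survived": [0, 1]})

print(df.dtypes)                           # one dtype per column
num = df.select_dtypes(include="number")   # keep only numeric columns
print(num.columns.tolist())                # ['age', 'survived']
df = df.astype({"survived": "bool"})       # convert a column's dtype
print(df["survived"].dtype)                # bool
```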
- `pd.read_csv()` / `pd.read_json()`
- (`pyarrow` / `fastparquet`) –> `pd.read_parquet()`
- (`openpyxl`) –> `pd.read_excel()`
- `pd.read_pickle()`
- `sqlalchemy` –> `create_engine` –> `pandas`: `pd.read_sql_table(tabname, con=engine)` or `pd.read_sql_query()`
- writing: the `to_*()` family of functions (csv, json, parquet, excel)

| | Unnamed: 0 | survived | pclass | sex | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alive | alone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.2500 | S | Third | man | True | NaN | Southampton | no | False |
1 | 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | First | woman | False | C | Cherbourg | yes | False |
2 | 2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.9250 | S | Third | woman | False | NaN | Southampton | yes | True |
3 | 3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1000 | S | First | woman | False | C | Southampton | yes | False |
4 | 4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.0500 | S | Third | man | True | NaN | Southampton | no | True |
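The `Unnamed: 0` column above is the old index that `to_csv` wrote out by default; passing `index=False` when writing (or `index_col=0` when reading) avoids it. A sketch with an in-memory buffer:

```python
import io
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# default index=True: re-reading yields an extra "Unnamed: 0" column
buf = io.StringIO()
df.to_csv(buf)
buf.seek(0)
print(pd.read_csv(buf).columns.tolist())   # ['Unnamed: 0', 'a', 'b']

# index=False keeps the file clean
buf = io.StringIO()
df.to_csv(buf, index=False)
buf.seek(0)
print(pd.read_csv(buf).columns.tolist())   # ['a', 'b']
```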
- `df["age"]` –> `pd.Series`
- `df[["age", "sex"]]` –> `pd.DataFrame`
- `df.iloc[:, 0]`, `df.iloc[:, [2,3]]`, `df.iloc[:, 2:6]`
- `df.loc[0]`, `df.loc[0:5]` – by index label
- `df.iloc[0]`, `df.iloc[0:3]` – by position
- `df[df["age"] >= 30]`
- `df[(df["age"] >= 30) & (df["sex"] == "female")]`
  - `&` (and), `|` (or), `~` (not)
- the `.query()` method
method<class 'pandas.core.series.Series'>
0 22.0
1 38.0
2 26.0
3 35.0
4 35.0
Name: age, dtype: float64
<class 'pandas.core.frame.DataFrame'>
| | age | sex |
|---|---|---|
0 | 22.0 | male |
1 | 38.0 | female |
2 | 26.0 | female |
3 | 35.0 | female |
4 | 35.0 | male |
0 0
1 1
2 1
3 1
4 0
..
886 0
887 1
888 0
889 1
890 0
Name: survived, Length: 891, dtype: int64
survived 0
pclass 3
sex male
age 22.0
sibsp 1
parch 0
fare 7.25
embarked S
class Third
who man
adult_male True
deck NaN
embark_town Southampton
alive no
alone False
Name: 0, dtype: object
| sex | survived | pclass | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alive | alone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
male | 0 | 3 | 22.0 | 1 | 0 | 7.2500 | S | Third | man | True | NaN | Southampton | no | False |
male | 0 | 3 | 35.0 | 0 | 0 | 8.0500 | S | Third | man | True | NaN | Southampton | no | True |
male | 0 | 3 | NaN | 0 | 0 | 8.4583 | Q | Third | man | True | NaN | Queenstown | no | True |
male | 0 | 1 | 54.0 | 0 | 0 | 51.8625 | S | First | man | True | E | Southampton | no | True |
male | 0 | 3 | 2.0 | 3 | 1 | 21.0750 | S | Third | child | False | NaN | Southampton | no | False |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
male | 0 | 2 | 28.0 | 0 | 0 | 10.5000 | S | Second | man | True | NaN | Southampton | no | True |
male | 0 | 3 | 25.0 | 0 | 0 | 7.0500 | S | Third | man | True | NaN | Southampton | no | True |
male | 0 | 2 | 27.0 | 0 | 0 | 13.0000 | S | Second | man | True | NaN | Southampton | no | True |
male | 1 | 1 | 26.0 | 0 | 0 | 30.0000 | C | First | man | True | C | Cherbourg | yes | True |
male | 0 | 3 | 32.0 | 0 | 0 | 7.7500 | Q | Third | man | True | NaN | Queenstown | no | True |
577 rows × 14 columns
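The `.query()` method mentioned above expresses the same filter as a boolean mask, just as a string; a toy sketch:

```python
import pandas as pd

df = pd.DataFrame({"age": [22.0, 38.0, 26.0, 35.0],
                   "sex": ["male", "female", "female", "female"]})

# boolean mask: combine conditions with & | ~ and parentheses
mask = df[(df["age"] >= 30) & (df["sex"] == "female")]

# same filter as a query string
q = df.query("age >= 30 and sex == 'female'")

print(len(mask), len(q))   # both keep the same 2 rows
```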
- `df.head()`, `df.tail()`, `df.sample(n=12)` – a quick glimpse at the data
- `df.columns`, `df.shape` – data dimensions
- `df.count()`
- `df.describe()` – summary stats for numeric columns
- `df["sex"].value_counts()` – frequency table
- `df.sort_index()` or `df.sort_values("column", ascending=True)`, `df.sort_values(["a", "b"])`
- `df["val"].astype("newtype")` – change dtype
| | survived | pclass | age | sibsp | parch | fare |
|---|---|---|---|---|---|---|
count | 891.000000 | 891.000000 | 714.000000 | 891.000000 | 891.000000 | 891.000000 |
mean | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
std | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
min | 0.000000 | 1.000000 | 0.420000 | 0.000000 | 0.000000 | 0.000000 |
25% | 0.000000 | 2.000000 | 20.125000 | 0.000000 | 0.000000 | 7.910400 |
50% | 0.000000 | 3.000000 | 28.000000 | 0.000000 | 0.000000 | 14.454200 |
75% | 1.000000 | 3.000000 | 38.000000 | 1.000000 | 0.000000 | 31.000000 |
max | 1.000000 | 3.000000 | 80.000000 | 8.000000 | 6.000000 | 512.329200 |
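A quick sketch of the inspection helpers listed above on toy data (values are made up):

```python
import pandas as pd

df = pd.DataFrame({"sex": ["male", "female", "male", "male"],
                   "fare": [7.25, 71.28, 8.05, 13.00]})

print(df.shape)                    # (4, 2)
print(df["sex"].value_counts())    # frequency table: male 3, female 1

# most expensive ticket first
print(df.sort_values("fare", ascending=False).head(1))
```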
- `df.drop(labels, axis="columns")`
- `df.drop_duplicates()`
- `df.rename({"A": "B"}, axis="columns")` – rename
- `df["sex"].map({"male": 0, "female": 1})` – relevel
- `df["age"].replace(0, 999)` – replace
- `pd.cut(df["age"], bins=[17, 21, 35, 45, 100])`
| | survived | pclass | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alive | alone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 3 | 1 | 0 | 7.2500 | S | Third | man | True | NaN | Southampton | no | False |
1 | 1 | 1 | 1 | 0 | 71.2833 | C | First | woman | False | C | Cherbourg | yes | False |
2 | 1 | 3 | 0 | 0 | 7.9250 | S | Third | woman | False | NaN | Southampton | yes | True |
3 | 1 | 1 | 1 | 0 | 53.1000 | S | First | woman | False | C | Southampton | yes | False |
4 | 0 | 3 | 0 | 0 | 8.0500 | S | Third | man | True | NaN | Southampton | no | True |
| | survived | pclass | sex | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alive | alone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | First | woman | False | C | Cherbourg | yes | False |
2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.9250 | S | Third | woman | False | NaN | Southampton | yes | True |
3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1000 | S | First | woman | False | C | Southampton | yes | False |
4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.0500 | S | Third | man | True | NaN | Southampton | no | True |
5 | 0 | 3 | male | NaN | 0 | 0 | 8.4583 | Q | Third | man | True | NaN | Queenstown | no | True |
| | pclass |
|---|---|
0 | 3rd |
1 | 1st |
2 | 3rd |
3 | 1st |
4 | 3rd |
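The `pclass` relabelling above is a `map`; `pd.cut` bins a numeric column in a similar spirit. A toy sketch:

```python
import pandas as pd

df = pd.DataFrame({"pclass": [3, 1, 3, 1, 3],
                   "age": [22.0, 38.0, 26.0, 35.0, 35.0]})

# relevel: map each value through a dict
labels = df["pclass"].map({1: "1st", 2: "2nd", 3: "3rd"})
print(labels.tolist())     # ['3rd', '1st', '3rd', '1st', '3rd']

# bin a numeric column into intervals
bins = pd.cut(df["age"], bins=[17, 21, 35, 45, 100])
print(bins.iloc[1])        # the interval containing 38.0
```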
`df.assign()` – ❗ deriving new columns: `new_column_name=fun`, where `fun` is a function operating on the df passed to `.assign`:
import numpy as np
df[["fare", "alive"]].assign(
is_alive = lambda d: np.where(d["alive"] == "yes", 1, 0),
fare_sq = lambda d: d["fare"]**2,
fare_sq_sqrt = lambda d: d["fare_sq"]**(1/2)
).head()
| | fare | alive | is_alive | fare_sq | fare_sq_sqrt |
|---|---|---|---|---|---|
0 | 7.2500 | no | 0 | 52.562500 | 7.2500 |
1 | 71.2833 | yes | 1 | 5081.308859 | 71.2833 |
2 | 7.9250 | yes | 1 | 62.805625 | 7.9250 |
3 | 53.1000 | yes | 1 | 2819.610000 | 53.1000 |
4 | 8.0500 | no | 0 | 64.802500 | 8.0500 |
df2["is_male"] = df2["gender"] == "male"
male_proportion_by_class = df2.groupby("pclass").agg({"is_male": "mean"})
male_proportion_by_class.sort_values("is_male").head(1)
vs
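The contrast above is with a method-chaining style; a sketch of the chained equivalent (assuming a hypothetical `df2` with `pclass` and `gender` columns, as in the snippet):

```python
import pandas as pd

df2 = pd.DataFrame({"pclass": [1, 1, 3, 3],
                    "gender": ["male", "female", "male", "male"]})

# one chain: derive the flag, aggregate per class,
# pick the class with the lowest share of men
result = (
    df2.assign(is_male=lambda d: d["gender"] == "male")
       .groupby("pclass")
       .agg({"is_male": "mean"})
       .sort_values("is_male")
       .head(1)
)
print(result)
```

No intermediate variables leak into the namespace, and each step reads top to bottom.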
- `df.drop(columns=["colA"])`
- `df.dropna(subset=["colA"])`
- `df.fillna({"colA": 0})`
- `None` – Python's general representation of a missing value
- `np.nan` – NumPy's NaN is usually used in pandas
- Note: by default, NaNs are excluded from aggregate functions
- To check whether a value is missing: `df["age"].isna()` or `df["age"].notna()`
- We can drop rows with missing values: `df.dropna(subset=["age"])`
- We can fill missing values with a constant: `df["age"].fillna(val)`
age
NaN 177
24.0 30
22.0 27
18.0 26
19.0 25
Name: count, dtype: int64
age
-1.0 177
24.0 30
22.0 27
18.0 26
19.0 25
Name: count, dtype: int64
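The two frequency tables above are before/after filling; `value_counts(dropna=False)` keeps the NaN row. A toy sketch of the same moves:

```python
import numpy as np
import pandas as pd

age = pd.Series([22.0, np.nan, 24.0, np.nan])

print(age.isna().sum())    # 2 missing values
print(age.mean())          # NaNs excluded from aggregates -> 23.0

# fill with a sentinel and count, keeping (now absent) NaNs visible
print(age.fillna(-1).value_counts(dropna=False))
```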
- `pd.crosstab` – frequency table
- `df.pivot_table` – any aggregation fn

merge (Docs): `df.merge`

- `on` (or a pair `left_on`, `right_on`) defines the column name(s) to perform the merge on
  - `left_index=True` / `right_index=True` work as well
- `how` defines the type of the join: `left`, `right`, `outer`, `inner`, `cross`

concat (Docs): `pd.concat`

- `pd.concat([df1, df2, df3])`
- `pd.concat([df1, df2], axis="columns")`
(891, 15)
(10, 15)
(881, 15)
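A minimal sketch of `merge` and `concat` on toy frames (the `how` argument picks the join type):

```python
import pandas as pd

left = pd.DataFrame({"id": [1, 2, 3], "age": [22, 38, 26]})
right = pd.DataFrame({"id": [2, 3, 4], "fare": [71.28, 7.93, 53.10]})

inner = left.merge(right, on="id", how="inner")   # only ids present in both
outer = left.merge(right, on="id", how="outer")   # union of ids, NaN where missing
print(inner.shape, outer.shape)                   # (2, 3) (4, 3)

stacked = pd.concat([left, left])                 # rows stacked on top of each other
print(stacked.shape)                              # (6, 2)
```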
- `as_index=False` to keep the grouping col(s)
- aggregate functions: `min`, `max`, `mean`, `nunique`, `sum`, `size`, `count`, `var`, `sem` (standard error of the mean), `describe`, `first`, `last`, `nth`
(
df.groupby(["pclass"], as_index=False).agg(
n=("survived", "count"),
mean_fare=("fare", "mean"),
custom_agg_fun=("age", lambda x: x[~x.isna()].count()),
)
)
| | pclass | n | mean_fare | custom_agg_fun |
|---|---|---|---|---|
0 | 1 | 216 | 84.154687 | 186 |
1 | 2 | 184 | 20.662183 | 173 |
2 | 3 | 491 | 13.675550 | 355 |
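The `pd.crosstab` and `pivot_table` functions mentioned earlier complement `groupby().agg()`; a toy sketch of both:

```python
import pandas as pd

df = pd.DataFrame({"pclass": [1, 1, 3, 3, 3],
                   "sex": ["male", "female", "male", "male", "female"],
                   "fare": [84.0, 80.0, 13.0, 14.0, 12.0]})

# frequency table: counts of sex per class
print(pd.crosstab(df["pclass"], df["sex"]))

# any aggregation fn: mean fare per class and sex
print(df.pivot_table(values="fare", index="pclass", columns="sex", aggfunc="mean"))
```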