```mermaid
flowchart LR
    a["Version 1"] --> b["Version 2"]
    b --> c["Version 3"]
    c --> d["Version 4"]
```
## Modeling and evaluation
```text
reports/
src/                      # keep that in this folder
    data_flow.py          # stay tuned :)
    my_package/           # one repo = one py package
        __init__.py
        my_module.py      # you can add any number of submodules
        my_subpackage/    # ... and subpackages
            __init__.py
            ...
    ...
.gitignore
```
```text
src/                      # python source code
    my_package/
        __init__.py
        my_module.py
        my_subpackage/
            __init__.py
            ...
tests/
    test_my_module.py
    ...
.gitignore                # git setup
pyproject.toml            # py package configuration
requirements.txt          # specify the python env
README.md                 # project info & run instructions
...
```
## Tracking your progress
### mlflow

`mlflow` is a repository for various ML artifacts: models, parameters, metrics, and other files.

Alternatives: Comet ML, neptune.ai, …

### Why MLflow?

### MLflow terminology

Install the `mlflow` python package first. The tracking server can be configured via the `MLFLOW_TRACKING_URI` environment variable as well as through the `mlflow` tracking API:
```python
import mlflow
import mlflow.sklearn
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

# configure mlflow
mlflow.set_tracking_uri("http://127.0.0.1:8007")
mlflow.set_experiment("my_experiment")

with mlflow.start_run():
    # train model
    model = LogisticRegression(...)
    model.fit(train_X, train_y)

    # store the model and its params to mlflow
    mlflow.sklearn.log_model(model, "model")
    mlflow.log_param("coefficients", ",".join(str(c) for c in model.coef_.squeeze()))
    mlflow.log_param("L2", model.C)
    mlflow.log_param("train_prevalence", train_y.mean())
    mlflow.log_param("test_prevalence", test_y.mean())

    # compute and log metrics
    train_f1 = f1_score(train_y, model.predict(train_X))
    test_f1 = f1_score(test_y, model.predict(test_X))
    mlflow.log_metric("train_f1", train_f1)
    mlflow.log_metric("test_f1", test_f1)
```
Alternatively, `mlflow.autolog()` can log parameters, metrics, and the model automatically.
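A minimal sketch of the autolog approach, assuming the same `train_X`, `train_y` data as above:

```python
import mlflow
from sklearn.linear_model import LogisticRegression

mlflow.autolog()  # patches supported libraries (sklearn, ...) to log runs automatically

with mlflow.start_run():
    model = LogisticRegression()
    model.fit(train_X, train_y)  # params, metrics, and the model are logged for us
```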
```mermaid
flowchart LR
    a["Version 1"] --> b["Version 2"]
    b --> c["Version 3"]
    c --> d["Version 4"]
```
Each registered model version can be assigned a stage:

- `stage: new`
- `stage: staging`
- `stage: prod`
- `stage: archived`

```mermaid
flowchart LR
    1["Version 1 (Archived)"] --> 2["Version 2 (Prod)"]
    2 --> 3["Version 3 (Staging)"]
    3 --> 4["Version 4"]
    style 1 fill:#ededed,stroke:lightgrey
    style 2 fill:lightgreen,stroke:black
    style 3 fill:yellow,stroke:black
```
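A hedged sketch of moving a version between stages via the registry client; the model name `my_model` is hypothetical:

```python
from mlflow import MlflowClient

client = MlflowClient()
client.transition_model_version_stage(
    name="my_model",   # hypothetical registered model name
    version=3,         # version to move
    stage="Staging",   # target stage for this version
)
```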
```mermaid
flowchart LR
    A[Login form] -- login/password --> B[Authentication service]
```
```mermaid
flowchart LR
    A[Load data] --> B[Preprocess data] --> C[Train model] --> D[Predict]
```
```python
data = load_dataset("/path/to/file")
feature1 = prepare_feature1(data)
feature2 = prepare_feature2(data)
features = join(feature1, feature2)
model = train_model(features)
```
Graph representation:
```mermaid
flowchart LR
    load_dataset --> prepare_feature1 --> prepare_feature2 --> join --> train_model
```
Parallelizing:
```mermaid
flowchart LR
    load_dataset --> prepare_feature1
    load_dataset --> prepare_feature2
    prepare_feature1 --> join
    prepare_feature2 --> join
    join --> train_model
```
```python
# file: flow.py
from metaflow import FlowSpec, step


class BasicFlow(FlowSpec):

    @step
    def start(self):
        print("start")
        self.x = 10
        self.next(self.a)

    @step
    def a(self):
        self.x = self.x + 1
        self.y = 5
        self.next(self.end)

    @step
    def end(self):
        print("end")
        print(f"x: {self.x}, y: {self.y}")


if __name__ == "__main__":
    BasicFlow()
```
**Important:** Metaflow has some trouble running on Windows, but it works fine on WSL / Linux.
Run it with `python flow.py run`:

```text
Creating local datastore in current directory (/home/jpalasek/mlops-skoleni/lectures/.metaflow)
Validating your flow...
    The graph looks good!
Running pylint...
    Pylint is happy!
2023-04-13 14:43:01.233 Workflow starting (run-id 1681389781228269):
2023-04-13 14:43:01.236 [1681389781228269/start/1 (pid 849611)] Task is starting.
2023-04-13 14:43:02.652 [1681389781228269/start/1 (pid 849611)] start
2023-04-13 14:43:02.925 [1681389781228269/start/1 (pid 849611)] Task finished successfully.
2023-04-13 14:43:02.933 [1681389781228269/a/2 (pid 850507)] Task is starting.
2023-04-13 14:43:03.846 [1681389781228269/a/2 (pid 850507)] Task finished successfully.
2023-04-13 14:43:03.854 [1681389781228269/end/3 (pid 850522)] Task is starting.
2023-04-13 14:43:04.335 [1681389781228269/end/3 (pid 850522)] end
2023-04-13 14:43:04.622 [1681389781228269/end/3 (pid 850522)] x: 11, y: 5
2023-04-13 14:43:04.770 [1681389781228269/end/3 (pid 850522)] Task finished successfully.
2023-04-13 14:43:04.771 Done!
```
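To inspect the artifacts a finished run stored, a small sketch using the Metaflow client API (assuming the run above completed):

```python
from metaflow import Flow

run = Flow("BasicFlow").latest_run  # most recent run of this flow
print(run.successful)               # True if it finished without errors
print(run.data.x, run.data.y)       # artifacts assigned via self.x / self.y
```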
Flow parameters can be declared with `metaflow.Parameter` and passed on the command line.
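A minimal sketch; the flow and parameter names are illustrative:

```python
from metaflow import FlowSpec, Parameter, step


class ParamFlow(FlowSpec):

    alpha = Parameter("alpha", help="Example parameter", default=0.01)

    @step
    def start(self):
        print(f"alpha = {self.alpha}")  # the value arrives via --alpha
        self.next(self.end)

    @step
    def end(self):
        pass


if __name__ == "__main__":
    ParamFlow()
```

Run with e.g. `python flow.py run --alpha 0.1`.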
Failed runs can be resumed:

- `python flow.py resume` – resume from the last failed step in the last run
- `python flow.py resume a` – resume from a specific step in the last run
- `python flow.py resume --origin-run-id 1692086818954876` – resume a specific run

Steps can also branch and join:

```python
class BranchFlow(FlowSpec):

    @step
    def start(self):
        self.next(self.a, self.b)

    @step
    def a(self):
        self.x = 1
        self.next(self.join)

    @step
    def b(self):
        self.x = 2
        self.next(self.join)

    @step
    def join(self, inputs):
        print('a is %s' % inputs.a.x)
        print('b is %s' % inputs.b.x)
        print('total is %d' % sum(input.x for input in inputs))
        self.next(self.end)

    @step
    def end(self):
        pass
```
Parallel branches run as separate tasks; the `--max-workers` option limits how many run at once (default 4).

Useful step decorators:

- `@retry(times=3)` – retry a failed step
- `@catch(var='compute_failed')` – catch a failure and store the exception in the given artifact
- `@timeout` – limit how long a step may run
- `@ignore_failure`
- `@conda`, `@conda_base` – manage dependencies per step / per flow
- `@schedule` – run the flow on a schedule (when deployed) using AWS Step Functions
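A hedged sketch of the fault-tolerance decorators in combination; the step bodies are illustrative:

```python
from metaflow import FlowSpec, catch, retry, step, timeout


class RobustFlow(FlowSpec):

    @retry(times=3)                # re-run the step up to 3 times on failure
    @catch(var="compute_failed")   # if it still fails, store the exception here
    @timeout(seconds=60)           # abort any attempt running longer than 60 s
    @step
    def start(self):
        self.compute_failed = None  # default in case the step succeeds
        self.next(self.end)

    @step
    def end(self):
        if self.compute_failed:
            print("start step failed:", self.compute_failed)


if __name__ == "__main__":
    RobustFlow()
```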
## Exercise

Overall goal: use the data provided to build a model to predict the `TARGET`.

- Data: the `application_train.csv` file (will be specified); `TARGET` is the target variable.
- Use `ydata-profiling` to get an initial overview of the data.
- Split the data into train and test sets (`sklearn.model_selection.train_test_split`).
- Train a model (e.g. `catboost`).
- Evaluate it (e.g. `sklearn.metrics.roc_auc_score`).
- Install the `mlflow` package.
- Start a tracking server with `mlflow server` (tip: use `nohup` to keep the server running; the UI is at `http://127.0.0.1:5000`).
- Create an experiment (`mlflow.create_experiment` or using the UI).
- Alternatively, track to local files (`mlflow.set_tracking_uri("file:./mlruns")`).
- Select the experiment (`mlflow.set_experiment`).
- Start an mlflow run via the context manager (`with mlflow.start_run() as run`).
- Log the model (`mlflow.sklearn.log_model`).
- Log the metrics (`mlflow.log_metrics`).
- Log the parameters (`mlflow.log_params`).
- Use `argparse` to pass the params.
- Try `mlflow.autolog`.

A sketch of how these pieces could fit together follows.
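This is one possible shape of the training script, not a reference solution: the experiment name, hyperparameters, and numeric-only feature selection are illustrative, and it logs the model via the catboost flavor (`mlflow.catboost.log_model`) rather than the sklearn flavor named above, since the model is a CatBoost classifier.

```python
import argparse

import mlflow
import mlflow.catboost
import pandas as pd
from catboost import CatBoostClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# pass hyperparameters from the command line
parser = argparse.ArgumentParser()
parser.add_argument("--iterations", type=int, default=100)
parser.add_argument("--learning-rate", type=float, default=0.1)
args = parser.parse_args()

mlflow.set_tracking_uri("http://127.0.0.1:5000")
mlflow.set_experiment("homework")  # hypothetical experiment name

data = pd.read_csv("application_train.csv")
X = data.drop(columns=["TARGET"]).select_dtypes("number")  # numeric features only, for simplicity
y = data["TARGET"]
train_X, test_X, train_y, test_y = train_test_split(X, y, random_state=42)

with mlflow.start_run() as run:
    model = CatBoostClassifier(
        iterations=args.iterations, learning_rate=args.learning_rate, verbose=False
    )
    model.fit(train_X, train_y)

    mlflow.log_params({"iterations": args.iterations, "learning_rate": args.learning_rate})
    mlflow.log_metrics({
        "train_auc": roc_auc_score(train_y, model.predict_proba(train_X)[:, 1]),
        "test_auc": roc_auc_score(test_y, model.predict_proba(test_X)[:, 1]),
    })
    mlflow.catboost.log_model(model, "model")
```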
Then rewrite the pipeline with Metaflow:

- Create a class inheriting from `FlowSpec`.
- Add steps (`@step`) to the class: `start`, `modelling`, `end`.
- Connect the steps with `self.next(self.<step_name>)`.
- Run the flow (`python flow.py run`).
- Add your own step between `start` and `modelling` (`start -> <your> -> modelling`).
- Add a branch between `start` and `modelling`.

One possible skeleton is sketched below.
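The flow and step bodies here are placeholders to fill in; `HomeworkFlow` is a hypothetical name:

```python
# file: flow.py
from metaflow import FlowSpec, step


class HomeworkFlow(FlowSpec):

    @step
    def start(self):
        # load the data here, e.g. self.data = pd.read_csv("application_train.csv")
        self.next(self.modelling)

    @step
    def modelling(self):
        # split the data, train the model, and log everything to mlflow here
        self.next(self.end)

    @step
    def end(self):
        print("done")


if __name__ == "__main__":
    HomeworkFlow()
```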