Competing in Kaggle

Modeling and evaluation

Outline

  • First steps in a project
  • Tracking your progress
  • ML pipelines with Python
  • MLOps tips

Motivation

Example: Kaggle / Hackathon

  • Kaggle competition - link
  • Why
    • Narrowed project setup
    • Known goal
    • Known metrics
    • Limited data/scope

Example: Leaderboard

Example - 2nd place solution overview

First steps in a project

Identify project scope

  • Do you know your goal? E.g.:
    • Predict whether a user will click on an ad
    • Detect malicious sessions
  • Do you know how your model is going to be used? E.g.:
    • Batch inference / Real-time inference
    • On what population will it be used?
  • Investigate what is at your disposal
    • Data description report
    • Initial exploration report
    • Goal: set up (fix) a population

Create modelling population

  • It should be as close to real world use-case as possible.
  • Implement a script that fixes the population (see the sketch after this list)
    • Parametrized (period of time, inclusion criteria, …)
    • Ad hoc execution vs. materialized table
    • Watch out for reproducibility!
  • Think ahead: Train x Test split
    • Random split?
    • Time-based split? Gap period? Period sizes?
  • In case you have plenty of data
    • start small
    • find a period that is not affected by some interventions (e.g., campaigns)
    • prefer recent data (to avoid some hidden drift)
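
A minimal sketch of such a population script (the column names, inclusion criterion, and gap logic below are illustrative assumptions, not project specifics):

import pandas as pd

def build_population(data: pd.DataFrame, start: str, end: str) -> pd.DataFrame:
    """Fix the modelling population: restrict to a period and apply inclusion criteria."""
    in_period = (data["event_date"] >= start) & (data["event_date"] < end)
    is_active = data["status"] == "active"  # illustrative inclusion criterion
    return data.loc[in_period & is_active]

def time_based_split(population: pd.DataFrame, split_date: str, gap_days: int = 0):
    """Time-based train/test split with an optional gap period to limit leakage."""
    split = pd.Timestamp(split_date)
    train = population[population["event_date"] < split - pd.Timedelta(days=gap_days)]
    test = population[population["event_date"] >= split]
    return train, test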

Reference model

  • Goal:
    • Have a clue about target predictability
    • Have a reference for comparison (see the baseline sketch below)
    • Have something at hand
      • to experiment with
      • to let downstream developers work with
      • as a fallback result
  • Performance is not the first concern
    • Moreover, it’s great to watch things getting better 😄
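
A minimal baseline sketch (the synthetic data is an assumption standing in for your modelling population):

from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# synthetic stand-in data; replace with your fixed population
X, y = make_classification(n_samples=1_000, random_state=0)
train_X, test_X, train_y, test_y = train_test_split(X, y, random_state=0)

baseline = DummyClassifier(strategy="prior").fit(train_X, train_y)  # predicts the prior
simple = LogisticRegression(max_iter=1_000).fit(train_X, train_y)

print("baseline AUC:", roc_auc_score(test_y, baseline.predict_proba(test_X)[:, 1]))
print("simple AUC:", roc_auc_score(test_y, simple.predict_proba(test_X)[:, 1]))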

Use version control

  • Build production ready code
  • Use Git
    • advantages
      • history
      • branches
      • collaboration
    • disadvantages
      • ???
  • Use branches
    • Four-eyes principle (every change is reviewed by a second person)
    • Use it for reports as well

Setup project structure

reports/
src/                    # python source code
  data_flow.py          # stay tuned :)
  my_package/           # one repo = one py package
    __init__.py
    my_module.py        # you can add any number of submodules
    my_subpackage/      # ... and subpackages
      __init__.py
      ...
  ...
tests/
  test_my_module.py
  ...
.gitignore              # git setup
pyproject.toml          # py package configuration
requirements.txt        # specify python env
README.md               # write down project info & run instructions
...

MLflow

Tracking your progress

Motivation

How to track your progress?

  • ❌ Use Excel & make notes
  • ✅ Use an experiment tracking tool: mlflow
  • mlflow is a repository for various ML artifacts:
    • (trained) models,
    • metrics,
    • parameters,
    • other artifacts (files,…).
  • Alternatives e.g. Comet ML, neptune.ai, …

Why MLflow?

MLflow GUI

MLflow terminology

  • Run: ~ data about one model training.
  • Experiment: group of (related) model runs.

MLflow tracking server

  • Setup
    • You need to install the mlflow Python package first.
mlflow server --host 127.0.0.1 --port 8007
  • Access it from a browser: http://localhost:8007
  • Access it from Python
    • You can use the MLFLOW_TRACKING_URI environment variable as well
import mlflow

mlflow.set_tracking_uri(uri="http://localhost:8007")

MLflow API

MLflow tracking API:

import mlflow

mlflow.create_experiment("my_experiment")
# mlflow.set_experiment("my_experiment")

with mlflow.start_run() as run:
    mlflow.log_metric("train_roc_auc", 0.74)

MlflowClient:

from mlflow import MlflowClient

client = MlflowClient()
experiment_id = client.create_experiment("my_experiment")
run = client.create_run(experiment_id)
client.log_metric(run.info.run_id, "train_roc_auc", 0.74)

Example – logging run manually

import mlflow
import sklearn.metrics
from sklearn.linear_model import LogisticRegression

# configure mlflow
mlflow.set_tracking_uri("http://127.0.0.1:8007")
mlflow.set_experiment("my_experiment")

with mlflow.start_run():
    # train model
    model = LogisticRegression(...)
    model.fit(train_X, train_y)

    # store the model and its params to mlflow
    mlflow.sklearn.log_model(model, "model")
    mlflow.log_param("coefficients", ",".join(map(str, model.coef_.squeeze())))
    mlflow.log_param("L2", model.C)
    mlflow.log_param("train_prevalence", train_y.mean())
    mlflow.log_param("test_prevalence", test_y.mean())

    # compute and log metrics
    train_f1 = sklearn.metrics.f1_score(train_y, model.predict(train_X))
    test_f1 = sklearn.metrics.f1_score(test_y, model.predict(test_X))
    mlflow.log_metric("train_f1", train_f1)
    mlflow.log_metric("test_f1", test_f1)

Example - mlflow.autolog

mlflow.set_tracking_uri("http://127.0.0.1:8007")
mlflow.set_experiment("my_experiment")
mlflow.autolog()

with mlflow.start_run():
    model = LogisticRegression(...)
    model.fit(train_X, train_y)

Retrieving information

import mlflow

mlflow.set_tracking_uri("http://127.0.0.1:8007")

experiment = mlflow.get_experiment_by_name("my_experiment")
run_id = "..."

df = mlflow.search_runs([experiment.experiment_id])
run = mlflow.get_run(run_id)

from mlflow import MlflowClient

client = MlflowClient(tracking_uri="http://127.0.0.1:8007")

experiment = client.get_experiment_by_name("my_experiment")
run_id = "..."

runs = client.search_runs([experiment.experiment_id])
run = client.get_run(run_id)

Models Management

  • Over time, a lot of models of multiple types accumulate in our MLflow.
  • Need for organization.
    • Working model?
    • Candidate for production?
  • ➡️ named models
    • Assign a name to a model you pick.
    • Defines model lineage.
    • Use versions as well!
  • Can be assigned in our flow or manually in the GUI.
    • CI/CD integration

Example - register model manually

flowchart LR
    a["Version 1"] --> b["Version 2"]
    b --> c["Version 3"]
    c --> d["Version 4"]

Chain of model versions.

with mlflow.start_run() as run:
    ...
    mlflow.sklearn.log_model(..., "artifact-name")

# create a model version
model_name = "propensity-model"
mlflow.register_model(f"runs:/{run.info.run_id}/artifact-name", model_name)

Use model for interventions

  • The MLflow UI provides hints on how to use models
    • Just navigate to Artifacts & select the logged model.
model_name = "propensity-model"
model_version = "latest"

# Load the model from the Model Registry
model_uri = f"models:/{model_name}/{model_version}"
model = mlflow.sklearn.load_model(model_uri)

# Use the model
model.predict(X_test)

Model Tags

  • Motivation
    • We want to detect which models are being tested / deployed to production / archived etc.
  • Tags are key-value maps. Stages example:
    1. new: newly created version. (stage: new)
    2. staging: candidate for production. (stage: staging)
    3. prod: model in production. (stage: prod)
    4. archived: old, irrelevant models. (stage: archived)

flowchart LR
    1["Version 1 (Archived)"] --> 2["Version 2 (Prod)"]
    2 --> 3["Version 3 (Staging)"]
    3 --> 4["Version 4"]

    style 1 fill:#ededed,stroke:lightgrey
    style 2 fill:lightgreen,stroke:black
    style 3 fill:yellow,stroke:black

Model versions with stages for ‘accurate’ model.

with mlflow.start_run():
    mlflow.set_tag("release.version", "2.2.0")

Pipelines

Pipelines

  • Representing algorithm as a specific graph.
  • Example: login

flowchart LR
    A[Login form] -- login/password --> B[Authentication service]

  • Example: modelling

flowchart LR
    A[Load data] --> B[Preprocess data] --> C[Train model] --> D[Predict]

  • DataFlow = a pipeline dealing with data

Example

data = load_dataset("/path/to/file")
feature1 = prepare_feature1(data)
feature2 = prepare_feature2(data)
features = join(feature1, feature2)
model = train_model(features)

Graph representation:

flowchart LR
    load_dataset --> prepare_feature1 --> prepare_feature2 --> join --> train_model

Parallelizing:

flowchart LR
    load_dataset --> prepare_feature1
    load_dataset --> prepare_feature2
    prepare_feature1 --> join
    prepare_feature2 --> join
    join --> train_model

Why use pipelines

  • Reproducibility
    • Automated workflows reduce human error
    • Ensures consistent execution every time
    • Makes experiments truly repeatable
  • Modularization
    • Complex workflows split into manageable pieces
    • Clear separation of concerns
    • Easier maintenance and updates
  • Testability
    • Individual components can be tested separately
  • Reusability
    • Components can be shared across projects
  • Handling failures
  • Automatic parallelization
  • Scaling
  • Deployment

Types of DataFlows

  • Batch
    • Big chunks of data.
    • Example: SQL databases.
    • Big overhead.
  • Real-time
    • Small chunks of data.
    • Small overhead.
    • Example: Kafka, RabbitMQ, Spark Streaming.

Piping via Metaflow

  • Metaflow
    • Python library.
    • Helps develop, deploy, and operate data-intensive applications
    • Has more features than we’re going to cover
      • E.g., running time-consuming steps on AWS Batch etc.

MetaFlow alternatives

  • Data Version Control or DVC
    • CmdLine tool, file-oriented
  • Apache AirFlow
    • A platform to author, schedule and monitor data pipelines
  • Luigi
    • Python package, helps build complex pipelines
  • Kubeflow

Metaflow example

# file: flow.py
from metaflow import FlowSpec, step

class BasicFlow(FlowSpec):
    @step
    def start(self):
        print("start")
        self.x = 10
        self.next(self.a)

    @step
    def a(self):
        self.x = self.x + 1
        self.y = 5
        self.next(self.end)

    @step
    def end(self):
        print("end")
        print(f"x: {self.x}, y: {self.y}")

if __name__ == "__main__":
    BasicFlow()

Example: Result

Important

Metaflow has some trouble running on Windows, but it works fine on WSL / Linux.

python flow.py run  # run the dataflow
Creating local datastore in current directory (/home/jpalasek/mlops-skoleni/lectures/.metaflow)
Validating your flow...
    The graph looks good!
Running pylint...
    Pylint is happy!
2023-04-13 14:43:01.233 Workflow starting (run-id 1681389781228269):
2023-04-13 14:43:01.236 [1681389781228269/start/1 (pid 849611)] Task is starting.
2023-04-13 14:43:02.652 [1681389781228269/start/1 (pid 849611)] start
2023-04-13 14:43:02.925 [1681389781228269/start/1 (pid 849611)] Task finished successfully.
2023-04-13 14:43:02.933 [1681389781228269/a/2 (pid 850507)] Task is starting.
2023-04-13 14:43:03.846 [1681389781228269/a/2 (pid 850507)] Task finished successfully.
2023-04-13 14:43:03.854 [1681389781228269/end/3 (pid 850522)] Task is starting.
2023-04-13 14:43:04.335 [1681389781228269/end/3 (pid 850522)] end
2023-04-13 14:43:04.622 [1681389781228269/end/3 (pid 850522)] x: 11, y: 5
2023-04-13 14:43:04.770 [1681389781228269/end/3 (pid 850522)] Task finished successfully.
2023-04-13 14:43:04.771 Done!

Parametrize a Metaflow flow

  • No need for argparse etc.
  • Just add a class attribute of type metaflow.Parameter
from metaflow import FlowSpec, step, Parameter

class BasicFlow(FlowSpec):

    data_path = Parameter(
        "data_path",
        default="data/homecredit_kaggle_train/application_train.csv",
        help="Source data path",
    )

    @step
    def start(self):
        ...
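
Metaflow derives a CLI option from the parameter name, so this flow can then be run as python flow.py run --data_path <path>; when the option is omitted, the default is used.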

Proposed project structure

...
analyses/
    topic1/
        exploration_a.qmd
        exploration_b.qmd
    ...
src/
    my_flow.py  # flow
    my_package/
        included.py
        __init__.py
pyproject.toml
...

Metaflow features

  • DataStore
    • = Directory with all info about runs
    • Data, metadata, params; but no code
  • Resume
    • python flow.py resume – last failed step in the last run
    • python flow.py resume a – specific step in the last run
    • python flow.py resume --origin-run-id 1692086818954876 – specific run
  • Inspection
from metaflow import Flow
flow = Flow("BasicFlow")
run = flow[run_id]
step = run[step_name]  # e.g. step_name = "start"
task = step.task
data = task.data

graph LR
    Flow --> Run --> Step --> Task --> D["Data Artifact"]
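
For quick inspection, Flow("BasicFlow").latest_run returns the most recent run without looking up a run id.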

Simple branching

from metaflow import FlowSpec, step

class BranchFlow(FlowSpec):
    @step
    def start(self):
        self.next(self.a, self.b)

    @step
    def a(self):
        self.x = 1
        self.next(self.join)

    @step
    def b(self):
        self.x = 2
        self.next(self.join)

    @step
    def join(self, inputs):
        print('a is %s' % inputs.a.x)
        print('b is %s' % inputs.b.x)
        print('total is %d' % sum(inp.x for inp in inputs))
        self.next(self.end)

    @step
    def end(self):
        pass
  • It can handle dynamic branching (foreach) as well; see the sketch below
  • Parallel branches are limited by --max-workers (default 4).
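
A minimal sketch of dynamic branching via foreach (names and values are illustrative):

from metaflow import FlowSpec, step

class ForeachFlow(FlowSpec):
    @step
    def start(self):
        self.params = [0.1, 1.0, 10.0]  # e.g., hyperparameter candidates
        self.next(self.train, foreach="params")

    @step
    def train(self):
        self.score = self.input  # self.input holds this branch's item
        self.next(self.join)

    @step
    def join(self, inputs):
        self.best = max(inp.score for inp in inputs)
        self.next(self.end)

    @step
    def end(self):
        print(f"best: {self.best}")

if __name__ == "__main__":
    ForeachFlow()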

Decorators to adjust step behavior

  • @retry(times=3)
  • @catch(var='compute_failed')
    • catch a step’s error and handle it gracefully in a later step
  • @timeout
  • @ignore_failure

Example: Decorators

from metaflow import FlowSpec, step, retry

class RetryFlow(FlowSpec):
    @retry
    @step
    def start(self):
        import time
        if int(time.time()) % 2 == 0:
            raise Exception("Bad luck!")
        else:
            print("Lucky you!")
        self.next(self.end)

    @step
    def end(self):
        print("Phew!")

Scaling up

  • @conda, @conda_base
    • Set up dependencies at the step level.
    • Watch out, it is slow! (But you can use a custom image.)
  • Run on AWS
    • You can even deploy (and @schedule) using AWS Step Functions
python flow.py --environment conda --datastore s3 run --with batch

Exercises

Overall goal: use data provided to build a model to predict the TARGET

Exercise - MLflow setup

  • Data: download application_train.csv file (will be specified)
    • TARGET is the target variable.
    • you can run ydata-profiling to get initial overview of data
  • Create a script to fit a model to the data (a possible skeleton follows after this list)
    • use test/train split (e.g., sklearn.model_selection.train_test_split)
    • use any model you want (e.g., catboost)
    • start small, use just a few columns as features
    • collect metrics (e.g., sklearn.metrics.roc_auc_score)
  • MLflow server
    • install the mlflow package
    • run mlflow server
    • do not turn it off; use another shell window or nohup to keep the server running
    • check it out in a browser (url depends on what you have set up, defaults to http://127.0.0.1:5000)
  • MLflow experiment
    • setup an experiment (mlflow.create_experiment or using UI)
    • you can add tags etc.
  • MLflow log model run
    • set the tracking URI (e.g., mlflow.set_tracking_uri("http://127.0.0.1:5000"); a local file store like "file:./mlruns" also works)
    • set experiment (mlflow.set_experiment)
    • add mlflow.run context manager (with mlflow.start_run() as run)
    • log fitted model (e.g., mlflow.sklearn.log_model)
    • log metrics (e.g., mlflow.log_metrics)
    • log parameters (e.g., mlflow.log_params)
    • run the script & check out MLflow UI
  • Tracking your experiments
    • Run the script multiple times with different parameters
      • you can use argparse to pass the params
    • (Optional): experiment with mlflow.autolog
    • Check out the experiments in MLflow UI
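
A possible skeleton for this exercise; a sketch only: the tracking URI assumes the server above runs on the default port, and the feature column names are assumptions about the dataset:

import mlflow
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

mlflow.set_tracking_uri("http://127.0.0.1:5000")
mlflow.set_experiment("my_experiment")

data = pd.read_csv("application_train.csv")
features = ["AMT_INCOME_TOTAL", "AMT_CREDIT"]  # start small: a few columns (assumed present)
train_X, test_X, train_y, test_y = train_test_split(
    data[features], data["TARGET"], random_state=0
)

with mlflow.start_run():
    model = LogisticRegression(max_iter=1_000).fit(train_X, train_y)
    mlflow.sklearn.log_model(model, "model")
    mlflow.log_params({"features": ",".join(features)})
    mlflow.log_metrics({
        "train_roc_auc": roc_auc_score(train_y, model.predict_proba(train_X)[:, 1]),
        "test_roc_auc": roc_auc_score(test_y, model.predict_proba(test_X)[:, 1]),
    })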

Exercise - MLflow models

  • Register a model
    • Pick one of your recent runs (e.g., the most performant one)
    • Register a model in MLflow UI
      • Artifacts -> Model detail -> Register model
      • You need to specify a new model name
    • Check out the model in the MLflow UI (top level “Models”)
  • Use of a model
    • Pick any of your models (either a run artifact or a registered model)
    • Check out model entry in artifacts
    • Follow the instructions in the MLflow UI to load the model into a jupyter/script
    • Use the model for predictions

Exercise - Metaflow pipeline (optional)

  • Gentle reminder: you need to use either Linux or WSL
  • Build empty pipeline
    • Create a class inheriting from FlowSpec
    • Add steps (method with @step) to the class
      • start
      • modelling
      • end
    • Connect steps together using self.next(self.<step_name>)
    • Check you’re able to run your pipeline (python flow.py run)
  • Add some content
    • start
      • handle data loading
        • Data needs to be stored as an attribute (self.…) to be accessible in other steps!
    • modelling
      • use model fitting code you already have!
    • run the flow & check it out in the MLflow UI
  • Add a feature engineering step
    • add a new step
    • add some feature engineering code (even a simple one)
    • insert the step in the pipeline (start -> <your> -> modelling)
    • note: you can try branching as well. (just add more than 1 step between start and modelling)
    • run the flow & check it out in the MLflow UI