Competing in Kaggle

Modeling and evaluation

Outline

  • First steps in a project
  • Tracking your progress
  • ML pipelines with Python
  • MLOps tips

Motivation

Example: Kaggle / Hackathon

  • Kaggle competition - link
  • Why
    • Narrowed project setup
    • Known goal
    • Known metrics
    • Limited data/scope

Example: Leaderboard

Example - 2nd place solution overview

First steps in a project

Identify project scope

  • Do you know your goal? E.g.:
    • Predict whether a user will click on an ad
    • Detect malicious sessions
  • Do you know how your model is going to be used? E.g.:
    • Batch inference / Real-time inference
    • On what population will it be used?
  • Investigate what is at your disposal
    • Data description report
    • Initial exploration report
    • Goal: set up (fix) a population

Create modelling population

  • It should be as close to real world use-case as possible.
  • Implement a script that fixes the population (see the sketch after this list)
    • Parametrized (period of time, inclusion criteria, …)
    • Ad hoc execution vs. materialized table
    • Watch out for reproducibility!
  • Think ahead: Train x Test split
    • Random split?
    • Time-based split? Gap period? Period sizes?
  • In case you have plenty of data
    • start small
    • find a period that is not affected by some interventions (e.g., campaigns)
    • prefer recent data (to avoid some hidden drift)
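
A minimal sketch of such a population script (the column names, inclusion criterion, and gap logic below are illustrative assumptions, not project specifics):

import pandas as pd

def build_population(data: pd.DataFrame, start: str, end: str) -> pd.DataFrame:
    """Fix the modelling population: restrict to a period and apply inclusion criteria."""
    in_period = (data["event_date"] >= start) & (data["event_date"] < end)
    is_active = data["status"] == "active"  # illustrative inclusion criterion
    return data.loc[in_period & is_active]

def time_based_split(population: pd.DataFrame, split_date: str, gap_days: int = 0):
    """Time-based train/test split with an optional gap period to limit leakage."""
    split = pd.Timestamp(split_date)
    train = population[population["event_date"] < split - pd.Timedelta(days=gap_days)]
    test = population[population["event_date"] >= split]
    return train, test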

Reference model

  • Goal:
    • Have a clue about target predictability
    • Have a reference for comparison (see the baseline sketch below)
    • Have something at hand
      • to experiment with
      • to let downstream developers work with
      • as a fallback result
  • Performance is not the first concern
    • Moreover, it’s great to watch things getting better 😄
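
A minimal baseline sketch (the synthetic data is an assumption standing in for your modelling population):

from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# synthetic stand-in data; replace with your fixed population
X, y = make_classification(n_samples=1_000, random_state=0)
train_X, test_X, train_y, test_y = train_test_split(X, y, random_state=0)

baseline = DummyClassifier(strategy="prior").fit(train_X, train_y)  # predicts the prior
simple = LogisticRegression(max_iter=1_000).fit(train_X, train_y)

print("baseline AUC:", roc_auc_score(test_y, baseline.predict_proba(test_X)[:, 1]))
print("simple AUC:", roc_auc_score(test_y, simple.predict_proba(test_X)[:, 1]))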

Use version control

  • Build production ready code
  • Use Git
    • advantages
      • history
      • branches
      • collaboration
    • disadvantages
      • ???
  • Use branches
    • Four-eyes principle (every change is reviewed by a second person)
    • Use it for reports as well

Setup project structure

reports/
src/                    # python source code
  data_flow.py          # stay tuned :)
  my_package/           # one repo = one py package
    __init__.py
    my_module.py        # you can add any number of submodules
    my_subpackage/      # ... and subpackages
      __init__.py
      ...
  ...
tests/
  test_my_module.py
  ...
.gitignore              # git setup
pyproject.toml          # py package configuration
requirements.txt        # specify python env
README.md               # write down project info & run instructions
...

MLflow

Tracking your progress

Motivation

How to track your progress?

  • ❌ Use Excel & make notes
  • ✅ Use an experiment tracking tool: mlflow
  • mlflow is a repository for various ML artifacts:
    • (trained) models,
    • metrics,
    • parameters,
    • other artifacts (files,…).
  • Alternatives e.g. Comet ML, neptune.ai, …

Why MLflow?

MLflow GUI

MLflow terminology

  • Run: ~ data about one model training.
  • Experiment: group of (related) model runs.

MLflow tracking server

  • Setup
    • You need to install the mlflow Python package first.
mlflow server --host 127.0.0.1 --port 8007
  • Access it from a browser: http://localhost:8007
  • Access it from Python
    • You can use the MLFLOW_TRACKING_URI environment variable as well
import mlflow

mlflow.set_tracking_uri(uri="http://localhost:8007")

MLflow API

MLflow tracking API:

import mlflow

mlflow.create_experiment("my_experiment")
# mlflow.set_experiment("my_experiment")

with mlflow.start_run() as run:
    mlflow.log_metric("train_roc_auc", 0.74)

MlflowClient:

from mlflow import MlflowClient

client = MlflowClient()
experiment_id = client.create_experiment("my_experiment")
run = client.create_run(experiment_id)
client.log_metric(run.info.run_id, "train_roc_auc", 0.74)

Example – logging run manually

import mlflow
import sklearn.metrics
from sklearn.linear_model import LogisticRegression

# configure mlflow
mlflow.set_tracking_uri("http://127.0.0.1:8007")
mlflow.set_experiment("my_experiment")

with mlflow.start_run():
    # train model
    model = LogisticRegression(...)
    model.fit(train_X, train_y)

    # store the model and its params to mlflow
    mlflow.sklearn.log_model(model, "model")
    mlflow.log_param("coefficients", ",".join(map(str, model.coef_.squeeze())))
    mlflow.log_param("L2", model.C)
    mlflow.log_param("train_prevalence", train_y.mean())
    mlflow.log_param("test_prevalence", test_y.mean())

    # compute and log metrics
    train_f1 = sklearn.metrics.f1_score(train_y, model.predict(train_X))
    test_f1 = sklearn.metrics.f1_score(test_y, model.predict(test_X))
    mlflow.log_metric("train_f1", train_f1)
    mlflow.log_metric("test_f1", test_f1)

Example - mlflow.autolog

mlflow.set_tracking_uri("http://127.0.0.1:8007")
mlflow.set_experiment("my_experiment")
mlflow.autolog()

with mlflow.start_run():
    model = LogisticRegression(...)
    model.fit(train_X, train_y)

Retrieving information

import mlflow

mlflow.set_tracking_uri("http://127.0.0.1:8007")

experiment = mlflow.get_experiment_by_name("my_experiment")
run_id = "..."

df = mlflow.search_runs([experiment.experiment_id])
run = mlflow.get_run(run_id)

from mlflow import MlflowClient

client = MlflowClient(tracking_uri="http://127.0.0.1:8007")

experiment = client.get_experiment_by_name("my_experiment")
run_id = "..."

runs = client.search_runs([experiment.experiment_id])
run = client.get_run(run_id)

Models Management

  • Over time, a lot of models of multiple types accumulate in our MLflow.
  • Need for organization.
    • Working model?
    • Candidate for production?
  • ➡️ named models
    • Assign a name to a model you pick.
    • Defines model lineage.
    • Use versions as well!
  • Can be assigned in our flow or manually in the GUI.
    • CI/CD integration

Example - register model manually

flowchart LR
    a["Version 1"] --> b["Version 2"]
    b --> c["Version 3"]
    c --> d["Version 4"]

Chain of model versions.

with mlflow.start_run() as run:
    ...
    mlflow.sklearn.log_model(..., "artifact-name")

# create a model version
model_name = "propensity-model"
mlflow.register_model(f"runs:/{run.info.run_id}/artifact-name", model_name)

Use model for interventions

  • The MLflow UI provides hints on how to use models
    • Just navigate to Artifacts & select the logged model.
model_name = "propensity-model"
model_version = "latest"

# Load the model from the Model Registry
model_uri = f"models:/{model_name}/{model_version}"
model = mlflow.sklearn.load_model(model_uri)

# Use the model
model.predict(X_test)

Model Tags

  • Motivation
    • We want to detect which models are being tested / deployed to production / archived etc.
  • Tags are key-value maps. Stages example:
    1. new: newly created version. (stage: new)
    2. staging: candidate for production. (stage: staging)
    3. prod: model in production. (stage: prod)
    4. archived: old, irrelevant models. (stage: archived)

flowchart LR
    1["Version 1 (Archived)"] --> 2["Version 2 (Prod)"]
    2 --> 3["Version 3 (Staging)"]
    3 --> 4["Version 4"]

    style 1 fill:#ededed,stroke:lightgrey
    style 2 fill:lightgreen,stroke:black
    style 3 fill:yellow,stroke:black

Model versions with stages for ‘accurate’ model.

with mlflow.start_run():
    mlflow.set_tag("release.version", "2.2.0")

Pipelines

Pipelines

  • Representing algorithm as a specific graph.
  • Example: login

flowchart LR
    A[Login form] -- login/password --> B[Authentication service]

  • Example: modelling

flowchart LR
    A[Load data] --> B[Preprocess data] --> C[Train model] --> D[Predict]

  • DataFlow = a pipeline dealing with data

Example

data = load_dataset("/path/to/file")
feature1 = prepare_feature1(data)
feature2 = prepare_feature2(data)
features = join(feature1, feature2)
model = train_model(features)

Graph representation:

flowchart LR
    load_dataset --> prepare_feature1 --> prepare_feature2 --> join --> train_model

Parallelizing:

flowchart LR
    load_dataset --> prepare_feature1
    load_dataset --> prepare_feature2
    prepare_feature1 --> join
    prepare_feature2 --> join
    join --> train_model

Why use pipelines

  • Reproducibility
    • Automated workflows reduce human error
    • Ensures consistent execution every time
    • Makes experiments truly repeatable
  • Modularization
    • Complex workflows split into manageable pieces
    • Clear separation of concerns
    • Easier maintenance and updates
  • Testability
    • Individual components can be tested separately
  • Reusability
    • Components can be shared across projects
  • Handling failures
  • Automatic parallelization
  • Scaling
  • Deployment

Types of DataFlows

  • Batch
    • Big chunks of data.
    • Example: SQL databases.
    • Big overhead.
  • Real-time
    • Small chunks of data.
    • Small overhead.
    • Example: Kafka, RabbitMQ, Spark Streaming.

Piping via Metaflow

  • Metaflow
    • Python library.
    • Helps develop, deploy, and operate data-intensive applications
    • Has more features than we’re going to cover
      • E.g., running time-consuming steps on AWS Batch etc.

MetaFlow alternatives

  • Data Version Control or DVC
    • CmdLine tool, file-oriented
  • Apache AirFlow
    • A platform to author, schedule and monitor data pipelines
  • Luigi
    • Python package, helps build complex pipelines
  • Kubeflow

Metaflow example

# file: flow.py
from metaflow import FlowSpec, step

class BasicFlow(FlowSpec):
    @step
    def start(self):
        print("start")
        self.x = 10
        self.next(self.a)

    @step
    def a(self):
        self.x = self.x + 1
        self.y = 5
        self.next(self.end)

    @step
    def end(self):
        print("end")
        print(f"x: {self.x}, y: {self.y}")

if __name__ == "__main__":
    BasicFlow()

Example: Result

Important

Metaflow has some trouble running on Windows, but it works fine on WSL / Linux.

python flow.py run  # run the dataflow
Creating local datastore in current directory (/home/jpalasek/mlops-skoleni/lectures/.metaflow)
Validating your flow...
    The graph looks good!
Running pylint...
    Pylint is happy!
2023-04-13 14:43:01.233 Workflow starting (run-id 1681389781228269):
2023-04-13 14:43:01.236 [1681389781228269/start/1 (pid 849611)] Task is starting.
2023-04-13 14:43:02.652 [1681389781228269/start/1 (pid 849611)] start
2023-04-13 14:43:02.925 [1681389781228269/start/1 (pid 849611)] Task finished successfully.
2023-04-13 14:43:02.933 [1681389781228269/a/2 (pid 850507)] Task is starting.
2023-04-13 14:43:03.846 [1681389781228269/a/2 (pid 850507)] Task finished successfully.
2023-04-13 14:43:03.854 [1681389781228269/end/3 (pid 850522)] Task is starting.
2023-04-13 14:43:04.335 [1681389781228269/end/3 (pid 850522)] end
2023-04-13 14:43:04.622 [1681389781228269/end/3 (pid 850522)] x: 11, y: 5
2023-04-13 14:43:04.770 [1681389781228269/end/3 (pid 850522)] Task finished successfully.
2023-04-13 14:43:04.771 Done!

Parametrize a Metaflow flow

  • No need for argparse etc.
  • Just add a class attribute of type metaflow.Parameter
from metaflow import FlowSpec, step, Parameter

class BasicFlow(FlowSpec):

    data_path = Parameter(
        "data_path",
        default="data/homecredit_kaggle_train/application_train.csv",
        help="Source data path",
    )

    @step
    def start(self):
        ...
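
Metaflow derives a CLI option from the parameter name, so this flow can then be run as python flow.py run --data_path <path>; when the option is omitted, the default is used.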

Proposed project structure

...
analyses/
    topic1/
        exploration_a.qmd
        exploration_b.qmd
    ...
src/
    my_flow.py  # flow
    my_package/
        included.py
        __init__.py
pyproject.toml
...

Metaflow features

  • DataStore
    • = Directory with all info about runs
    • Data, metadata, params; but no code
  • Resume
    • python flow.py resume – last failed step in the last run
    • python flow.py resume a – specific step in the last run
    • python flow.py resume --origin-run-id 1692086818954876 – specific run
  • Inspection
from metaflow import Flow
flow = Flow("BasicFlow")
run = flow[run_id]
step = run[step_name]  # e.g. step_name = "start"
task = step.task
data = task.data

graph LR
    Flow --> Run --> Step --> Task --> D["Data Artifact"]
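
For quick inspection, Flow("BasicFlow").latest_run returns the most recent run without looking up a run id.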

Simple branching

from metaflow import FlowSpec, step

class BranchFlow(FlowSpec):
    @step
    def start(self):
        self.next(self.a, self.b)

    @step
    def a(self):
        self.x = 1
        self.next(self.join)

    @step
    def b(self):
        self.x = 2
        self.next(self.join)

    @step
    def join(self, inputs):
        print('a is %s' % inputs.a.x)
        print('b is %s' % inputs.b.x)
        print('total is %d' % sum(inp.x for inp in inputs))
        self.next(self.end)

    @step
    def end(self):
        pass
  • It can handle dynamic branching (foreach) as well; see the sketch below
  • Parallel branches are limited by --max-workers (default 4).
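
A minimal sketch of dynamic branching via foreach (names and values are illustrative):

from metaflow import FlowSpec, step

class ForeachFlow(FlowSpec):
    @step
    def start(self):
        self.params = [0.1, 1.0, 10.0]  # e.g., hyperparameter candidates
        self.next(self.train, foreach="params")

    @step
    def train(self):
        self.score = self.input  # self.input holds this branch's item
        self.next(self.join)

    @step
    def join(self, inputs):
        self.best = max(inp.score for inp in inputs)
        self.next(self.end)

    @step
    def end(self):
        print(f"best: {self.best}")

if __name__ == "__main__":
    ForeachFlow()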

Decorators to adjust step behavior

  • @retry(times=3)
  • @catch(var='compute_failed')
    • catch a step’s error and handle it gracefully in a later step
  • @timeout
  • @ignore_failure

Example: Decorators

from metaflow import FlowSpec, step, retry

class RetryFlow(FlowSpec):
    @retry
    @step
    def start(self):
        import time
        if int(time.time()) % 2 == 0:
            raise Exception("Bad luck!")
        else:
            print("Lucky you!")
        self.next(self.end)

    @step
    def end(self):
        print("Phew!")

Scaling up

  • @conda, @conda_base
    • Set up dependencies at the step level.
    • Watch out, it is slow! (But you can use a custom image.)
  • Run on AWS
    • You can even deploy (and @schedule) using AWS Step Functions
python flow.py --environment conda --datastore s3 run --with batch

Exercises

Overall goal: use data provided to build a model to predict the TARGET

Exercise - MLflow setup

  • Data: download application_train.csv file (will be specified)
    • TARGET is the target variable.
    • you can run ydata-profiling to get initial overview of data
  • Create a script to fit a model to the data (a possible skeleton follows after this list)
    • use test/train split (e.g., sklearn.model_selection.train_test_split)
    • use any model you want (e.g., catboost)
    • start small, use just a few columns as features
    • collect metrics (e.g., sklearn.metrics.roc_auc_score)
  • MLflow server
    • install the mlflow package
    • run mlflow server
    • do not turn it off; use another shell window or nohup to keep the server running
    • check it out in a browser (url depends on what you have set up, defaults to http://127.0.0.1:5000)
  • MLflow experiment
    • setup an experiment (mlflow.create_experiment or using UI)
    • you can add tags etc.
  • MLflow log model run
    • set the tracking URI (e.g., mlflow.set_tracking_uri("http://127.0.0.1:5000"); a local file store like "file:./mlruns" also works)
    • set experiment (mlflow.set_experiment)
    • add mlflow.run context manager (with mlflow.start_run() as run)
    • log fitted model (e.g., mlflow.sklearn.log_model)
    • log metrics (e.g., mlflow.log_metrics)
    • log parameters (e.g., mlflow.log_params)
    • run the script & check out MLflow UI
  • Tracking your experiments
    • Run the script multiple times with different parameters
      • you can use argparse to pass the params
    • (Optional): experiment with mlflow.autolog
    • Check out the experiments in MLflow UI
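
A possible skeleton for this exercise; a sketch only: the tracking URI assumes the server above runs on the default port, and the feature column names are assumptions about the dataset:

import mlflow
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

mlflow.set_tracking_uri("http://127.0.0.1:5000")
mlflow.set_experiment("my_experiment")

data = pd.read_csv("application_train.csv")
features = ["AMT_INCOME_TOTAL", "AMT_CREDIT"]  # start small: a few columns (assumed present)
train_X, test_X, train_y, test_y = train_test_split(
    data[features], data["TARGET"], random_state=0
)

with mlflow.start_run():
    model = LogisticRegression(max_iter=1_000).fit(train_X, train_y)
    mlflow.sklearn.log_model(model, "model")
    mlflow.log_params({"features": ",".join(features)})
    mlflow.log_metrics({
        "train_roc_auc": roc_auc_score(train_y, model.predict_proba(train_X)[:, 1]),
        "test_roc_auc": roc_auc_score(test_y, model.predict_proba(test_X)[:, 1]),
    })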

Exercise - MLflow models

  • Register a model
    • Pick one of your recent runs (e.g., the most performant one)
    • Register a model in MLflow UI
      • Artifacts -> Model detail -> Register model
      • You need to specify a new model name
    • Check out the model in the MLflow UI (top level “Models”)
  • Use of a model
    • Pick any of your models (either a run artifact or a registered model)
    • Check out model entry in artifacts
    • Follow the instructions in the MLflow UI to load the model into a jupyter/script
    • Use the model for predictions

Exercise - Metaflow pipeline (optional)

  • Gentle reminder: you need to use either Linux or WSL
  • Build empty pipeline
    • Create a class inheriting from FlowSpec
    • Add steps (method with @step) to the class
      • start
      • modelling
      • end
    • Connect steps together using self.next(self.<step_name>)
    • Check you’re able to run your pipeline (python flow.py run)
  • Add some content
    • start
      • handle data loading
        • Data needs to be stored as an attribute (self.…) to be accessible in other steps!
    • modelling
      • use model fitting code you already have!
    • run the flow & check it out in the MLflow UI
  • Add a feature engineering step
    • add a new step
    • add some feature engineering code (even a simple one)
    • insert the step in the pipeline (start -> <your> -> modelling)
    • note: you can try branching as well. (just add more than 1 step between start and modelling)
    • run the flow & check it out in the MLflow UI