Dimension Reduction & Clustering

Advanced Data Preparation / Unsupervised Modelling

Outline

  • Intro
  • Dimension Reduction
  • Clustering
  • Challenges

…and examples along the road

Intro

Dimension Reduction and Clustering can be considered as part of…

  • Data Preparation stage
    • “Advanced transformations”
    • Used as features in models
  • Modelling stage
    • Unsupervised modelling
      • Focused on discovering structures and patterns in the data
      • No “unknown quantity” to predict, no target to fit against

Supervised Learning

  • Model Relationship between features and a label
  • Examples
    • propensity
      • who is likely to buy a certain product?
    • churn
      • who is likely to leave?
    • regression
      • e.g. what is person’s predicted consumption for the next year?
  • Challenges
    • Fit on train set - generalize to test set
    • Sensitive to temporal issues, leakage, feature drift
    • Modelling stage often standardized - e.g., gradient boosting, etc.

Unsupervised Learning

  • No labels - learn structure and patterns from similarities in the data
  • Examples
    • empirical modelling
      • find relationships and flags without a target
    • anomaly detection
      • identify unusual or suspicious behavior
    • clustering
      • discover natural groups of points
  • Challenges
    • Define similarity measure between diverse features
    • Modelling is hard to automate
      • no ground truth against which to judge feature informativeness or the inferred boundaries

Metrics

  • Metric = how we measure similarity (or dissimilarity).
  • Main building block for most unsupervised problems.
  • Euclidean:
    • d(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}
    • simple geometrical interpretation, differentiable, but highly sensitive to scale
  • Cosine:
    • d(p, q) = 1 - \frac{p \cdot q}{\|p\| \|q\|}
    • semantic distance (often used in NLP), scale invariant, but its geometric interpretation can be misleading
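
A minimal sketch contrasting the two metrics on toy vectors (SciPy assumed available):

    import numpy as np
    from scipy.spatial.distance import euclidean, cosine

    p = np.array([1.0, 2.0, 3.0])
    q = np.array([2.0, 4.0, 6.0])   # same direction as p, twice the magnitude

    print(euclidean(p, q))          # ~3.74: sensitive to scale
    print(cosine(p, q))             # 0.0: scale invariant (parallel vectors)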


Metrics: Toy Example – The Ferret Problem

Standard Setup: OHE → Standardization → Euclidean

  • Height: X \sim N(\mu,\sigma^2)
  • Pet: dog / ferret \sim \text{Be}(0.5)

After one-hot + standardization

  • Height: X \sim N(0,1)
  • Pet: Z_D \in \{-1,1\} \sim \text{Be}(0.5)
  • Categorical contribution: (1 - (-1))^2 = 4

Compare with height

  • P\left[(X_1 - X_2)^2 > 4\right] \approx 0.16
  • Pets dominate: height beats “pet” only ~16%
  • Audience Question: What if pet \sim \text{Be}(0.02)?

  • The graph shows
    • 200 points with dog / ferret \sim Be(0.5), Height: X \sim N(0,1)
    • a median reference point for height (assigned a ferret)
    • colour = how far the point is from the ferret reference point

  • Solution? Reweighting of variables (or weighted metrics like Gower’s distance), empirical evaluation, etc.
  • Downweight the categorical feature by a downweighting constant c: \text{ferret} = \text{ferret} \cdot c (sketched below)
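
A sketch of the toy setup and the downweighting fix; the constant c = 0.3 below is a hypothetical value that would in practice be chosen empirically:

    import numpy as np

    rng = np.random.default_rng(0)
    n = 200
    height = rng.normal(0, 1, n)                  # already on the N(0,1) scale
    pet = rng.binomial(1, 0.5, n)                 # 1 = ferret, 0 = dog
    pet_std = (pet - pet.mean()) / pet.std()      # roughly +/-1 after standardization

    d_height = (height[:, None] - height[None, :]) ** 2   # pairwise squared differences in height
    differ = pet[:, None] != pet[None, :]                  # pairs with different pets

    for c in [1.0, 0.3]:                          # c = 1: plain OHE; c = 0.3: assumed downweighting
        d_pet = (c * pet_std[:, None] - c * pet_std[None, :]) ** 2
        print(c, (d_height > d_pet)[differ].mean())   # ~0.16 without downweighting, much higher with it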

Dimension Reduction (DR)

Motivation

  • Goals of transforming data by DR methods
    • Reduce dataset complexity, avoid the “curse of dimensionality”
    • Preserve as much information as possible
  • Use cases
    • Models with fewer features
      • Less complex structure
      • Better performance (overfitting, multicollinearity…)
    • Finding patterns
      • Visualization of similar entities is easier in fewer dims
  • Commonly used workflow: DR → clustering

Note: not part of this lecture

  • Reducing dimension by feature selection (various methods)
  • DL Embedding methods
    • E.g. word embeddings
    • Also produce lower-dim. representations of data
    • Main goal is to learn task-relevant representation to serve as input to downstream models
    • Not designed to preserve local and/or global structure → not good for visualization

Types of dimred methods

  • Global
    • Preserve relative distances or rank information between points, i.e. the relative positions of neighborhoods from the high-dimensional space
    • Stronger focus on the preservation of distances between points that are farther away from each other
    • Cannot capture complex nonlinear structures
  • Local
    • Preserve each observation’s neighborhood from the high-dimensional space, i.e. its set of high-dimensional neighbors
    • Preserve the relative distances or rank information among neighbors

Timeline of DR methods

(timeline figure)

PCA

  • Main idea:
    • Find linear transformations (linear combinations) of original observations \mathbf{x}_i
    • Directions (=PCs) defining the transformations are orthonormal
    • The first transformed variable has the largest variance (explains the most variance); each further one has the largest variance subject to its direction being orthogonal to the previous directions (i.e. once the effect of the previous components is removed)
  • Essentially a change of coordinate system such that if we ignore dimensions j=k+1,\dots,p we lose as little information as possible

  • Formally the j-th principal component is

    \mathbf{w}_j = \mathrm{arg\,max}_{\mathbf{w} \in \mathbb{R}^p,\ ||\mathbf{w}||=1,\ \mathbf{w} \perp \mathbf{w}_1, \dots, \mathbf{w}_{j-1}} \mathrm{var}(\mathbf{w}^T\mathbf{x}_i)

  • Solution (just linear algebra)
    • Principal components are eigenvectors of \mathbb{V}, the covariance matrix of \mathbf{x}_i
    • The maximum variance achieved is the corresponding eigenvalue
  • Terminology
    • \mathbf{w}_j, j=1, \dots, p are principal components
    • t_{ij} = \mathbf{w}_j^T\mathbf{x}_i are principal components scores
  • Matrix form: \mathbb{T} = \mathbb{X}\mathbb{W}, where \mathbb{W} = (\mathbf{w}_1 | \dots | \mathbf{w}_p)

(figure: PCA on the Iris dataset)
  • Computation (assuming centered data):
    • Compute spectral decomposition of sample covariance matrix (since true is unknown):

      \widehat{\mathbb{V}} \propto \mathbb{X}^T\mathbb{X} = \mathbb{W}\mathbb{\Lambda}\mathbb{W}^T → eigenvectors in \mathbb{W} are PCs

    • Compute SVD of \mathbb{X} = \mathbb{U}\sqrt{\mathbb{\Lambda}}\mathbb{W}^T → right singular vectors in \mathbb{W} are PCs (used in practice)

  • Criteria for choosing the number of PCs
    • % of total variance explained
    • Scree plot (eigenvalue index vs. % of explained variance → elbow method)
    • 2-3 for plotting
  • Limitations
    • Sensitivity to scale → center and scale to unit variance beforehand
    • Can capture only linear relationships between features
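
A minimal scikit-learn sketch (Iris used as a stand-in dataset): scale, fit PCA, and inspect the explained variance used for the criteria above.

    from sklearn.datasets import load_iris
    from sklearn.preprocessing import StandardScaler
    from sklearn.decomposition import PCA

    X = load_iris().data
    X_std = StandardScaler().fit_transform(X)    # PCA is scale sensitive

    pca = PCA()                                  # keep all components first
    scores = pca.fit_transform(X_std)            # T = X W (principal component scores)
    print(pca.explained_variance_ratio_)         # input for the % variance rule / scree plot
    X_2d = scores[:, :2]                         # first two PCs for plotting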

Example: projection of odd values

What was it about

  • Vendor Selection Hackathon - the company was looking for a strategic partner for DS/ML/AI
  • Task: build POC of solution for credit scoring within 168 hrs
  • Challenge:
    • All data (except PK, target and few other variables) are masked (names var1-var5xx)
    • How to tackle this if our main strength lies in crafting powerful features based on deep business and data understanding, and in being able to interpret the model and find stories in it?
  • Observation: variables often had odd values which seemed out of distribution (different types of missings, default values…?)

Approach (one of many)

  • Guess what are “odd values” → indicator matrix → PCA
  • Revealed groups representing clients coming from different sources with various amount of information available
  • Segments work as naïve risk categories
    • cluster B: “well-known and safe”
    • cluster D: “risky newbies in credit”

Lesson learned: PCA still may be useful for specific data patterns
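
A hedged sketch of this approach; the data frame and the list of “odd values” below are purely illustrative stand-ins for the masked variables:

    import pandas as pd
    from sklearn.decomposition import PCA

    df = pd.DataFrame({                          # stand-in for the masked var1..var5xx
        "var1": [1.2, -1.0, 9999.0, 0.4],
        "var2": [0.0, 0.0, 3.1, 9999.0],
    })

    ODD_VALUES = [-1, 0, 9999]                   # hypothetical guesses at "odd values"
    indicator = pd.concat(
        [df.isna()] + [df.eq(v) for v in ODD_VALUES], axis=1
    ).astype(int)                                # indicator matrix of "oddness"

    proj = PCA(n_components=2).fit_transform(indicator)   # 2-D projection revealing the groups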

Timeline of DR methods

(timeline figure)


t-SNE

  • = t-Distributed Stochastic Neighbor Embedding
  • Based on conditional probability distribution POV on data:
    • Distribution represents process of sampling neighbours:

      p_{j|i} := \mathbb{P}(\text{"choose } \mathbf{x}_j \text{ as neighbor"} \mid \text{centered at } \mathbf{x}_i)

    • Similar (close) points have high probability to be picked as neighbours (and vice versa)

  • The algorithm creates aligned high-dim. and low-dim. distributions

(figure: MNIST)
High dim space
  • Assume generating neighbourhood of \mathbf{x}_i is proportional to Gaussian distribution \mathcal{N}_p(\mathbf{x}_i, \sigma_i^2 \mathbb{I}_p):

    p_{j∣i} = \frac{\exp(-\frac{||\mathbf{x}_j - \mathbf{x}_i||_2^2}{2\sigma_i^2})}{\sum_{k \neq i} \exp(-\frac{||\mathbf{x}_k - \mathbf{x}_i||_2^2}{2\sigma_i^2})}

  • Pick the \sigma_i-s based on a chosen Perplexity: \mathrm{Perp} := 2^{-\sum_j p_{j|i} \log_{2}p_{j|i}}
    • Uniform probs for K neighbours p_{j∣i} = 1/K give \mathrm{Perp} = 2^{\log_{2}K} = K, therefore we can understand perplexity as “effective number of neighbours”.
    • Sigmas are chosen so that effective no. of neighbours is the same for all obs.
    • Notice: Perplexity is just a transformation of Shannon entropy, therefore it is a measure of the uncertainty of the distribution, here specifically of the uncertainty about \mathbf{x}_i’s neighbours.
  • Define the symmetric probability p_{ij} = \frac{p_{i|j} + p_{j|i}}{2n}
Low dim space
  • Assume generating the neighbourhood of \mathbf{y}_i is proportional to a multivariate t-distribution with 1 degree of freedom, t_q(\mathbf{y}_i, \mathbb{I}_q, 1):

    q_{ij} = \frac{(1+||\mathbf{y}_j - \mathbf{y}_i||_2^2)^{-1}}{\sum_{k \neq l} (1+||\mathbf{y}_k - \mathbf{y}_l||_2^2)^{-1}}

    • Note: already symmetric

(figure: MNIST - SNE vs t-SNE)
Algorithm of probs alignment
  • Loss function: Kullback–Leibler divergence KL(P, Q) = \sum_i \sum_j p_{ij} \log \frac{p_{ij}}{q_{ij}}
    • KL divergence is a natural measure of “distance” between probability distributions (not a proper distance - it is not symmetric)
    • Relation to cross-entropy: KL(P, Q) = H(P, Q) - H(P)
  • Init: \mathbf{y}_i \sim \mathcal{N}_q(\mathbf{o}, 10^{-4}\mathbb{I}_q)
  • Optim: GD with momentum, adaptive learning rate
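
A usage-level sketch with scikit-learn (the digits dataset used as a small MNIST-like stand-in):

    from sklearn.datasets import load_digits
    from sklearn.manifold import TSNE

    X = load_digits().data
    emb = TSNE(
        n_components=2,
        perplexity=30,      # "effective number of neighbours"
        init="random",      # random initialization, as described above
        random_state=0,
    ).fit_transform(X)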

UMAP

  • Based on graph representation of data
    • Weighted graph G
    • Vertices ~ observations \mathbf{x}_i, i=1, \dots, n
    • Edges ~ being among kNN (from either side)
    • Weights ~ proportional to similarities, can be understood as probabilities (sampling data points around \mathbf{x}_i)
  • Algorithm creates aligned high-dim. and low-dim. graph representations
  • Variations and improvements still appear: parametric UMAP, UMATO…

Algorithm phases
  1. Graph construction for the high-dimensional space
    • Find the kNN (in a given metric), compute \rho_i, the minimum positive distance to a neighbour

    • Find scaling parameter \sigma_i (normalize distances to neighbours keeping proximities in orig. dim.)

    • Define a weight function w(\mathbf{x}_i, \mathbf{x}_j) (the closest neighbour gets weight 1, then weights decrease exponentially)

    • Define weighted graph G with weights based on symmetrized versions of w(\mathbf{x}_i, \mathbf{x}_j):

      \bar{w}_{ij} := w(\mathbf{x}_i, \mathbf{x}_j) + w(\mathbf{x}_j, \mathbf{x}_i) - w(\mathbf{x}_i, \mathbf{x}_j)w(\mathbf{x}_j, \mathbf{x}_i)

  2. Optimization of the low-dimensional graph layout
    • Init: spectral embedding (Laplacian Eigenmaps)
    • (Stochastic) Gradient Descent applied on observations (iterate over edges):
      • Attractive force: move \mathbf{y}_i towards \mathbf{y}_j

        \mathbf{y}_i := \mathbf{y}_i + \alpha \cdot f(||\mathbf{y}_i - \mathbf{y}_j||_2^2) \bar{w}_{ij} (\mathbf{y}_i - \mathbf{y}_j)

        (a la GD with learning rate \alpha, f contains hyperparams)

      • Repulsive force: push \mathbf{y}_i away from \mathbf{y}_k (random non-kNN point)

        \mathbf{y}_i := \mathbf{y}_i + \alpha \cdot g(||\mathbf{y}_i - \mathbf{y}_k||_2^2) (1-\bar{w}_{ik}) (\mathbf{y}_i - \mathbf{y}_k)
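
At usage level these graph-construction and layout phases are wrapped by the umap-learn package; a minimal sketch (package assumed installed as umap-learn):

    from sklearn.datasets import load_digits
    import umap

    X = load_digits().data
    emb = umap.UMAP(
        n_neighbors=15,      # size of the kNN neighbourhood used for the graph
        min_dist=0.1,        # how tightly points may be packed in the low-dim layout
        metric="euclidean",  # metric used to find the kNN in high dimension
        random_state=42,
    ).fit_transform(X)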

Example: time patterns of ATM withdrawals

  • Special type of transactions: ATM withdrawals
  • Focus on avg no. of withdrawals in every hour of the week (each ATM is represented by 168 features)
  • UMAP with cosine metric (= the weekly pattern matters, not the absolute number of withdrawals)
  • ATMs with “shared quality” are projected close to each other

Timeline of DR methods

(timeline figure)


PaCMAP

  • Idea: use three types of point pairs to first capture more global structure, then refine local structure → better trade-off between preserving both local and global structure

(figure: MNIST)
Types of pairs
  • Neighbour pairs
    • Observation \mathbf{x}_i + each of n_{NB} nearest neighbours

    • Distance used: scaled Euclidean ||\mathbf{x}_i - \mathbf{x}_j||_2^2 / (\sigma_i \sigma_j)

      (scale to account for different magnitudes of neighbourhoods: larger scale if neighbours are far)

  • Mid-near pairs
    • Observation \mathbf{x}_i + 2nd nearest obs. out of 6 randomly selected points
    • Repeat to obtain n_{MN} such pairs for each \mathbf{x}_i
    • n_{MN} = \lfloor n_{NB} \cdot MN_\mathrm{ratio} \rfloor
  • Further pairs
    • Observation \mathbf{x}_i + each of n_{FP} randomly sampled non-neighbour points
    • n_{FP} = \lfloor n_{NB} \cdot FP_\mathrm{ratio} \rfloor

Algorithm
  • Hyperparams: n_{NB} = 10, MN_\mathrm{ratio} = 0.5, FP_\mathrm{ratio} = 2

  • Loss function:

    • \tilde{d}_{ij} = 1 + ||\mathbf{y}_i - \mathbf{y}_j||_2^2
    • Notice: the low-dim distances are similar to t-SNE’s - motivated by the t-distribution
    • Constructed to adhere to the principles of a “good loss function” (introduced in the paper)
  • Init: PCA or random \mathcal{N}_q(\mathbf{o}, 10^{-4}\mathbb{I}_q)

    • Notice: random init as in t-SNE
  • Optim: Adam optimizer, the weights of the loss change with iterations
    • 1st phase: big weight on MN pairs loss → embed so that mainly global structure is preserved (weights decrease with iterations → focus moved to local structure → preserved at least to certain degree)
    • 2nd phase: small (but not zero) weight for mid-near pairs → improve local structure while maintaining the global structure captured before
    • 3rd phase: reduce MN (and NN) pair weights → separate possible clusters and make their boundary clearer
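
A usage-level sketch with the pacmap package (assumed installed as pacmap; the call follows its documented interface, to the best of my knowledge):

    from sklearn.datasets import load_digits
    import pacmap

    X = load_digits().data
    reducer = pacmap.PaCMAP(
        n_components=2,
        n_neighbors=10,     # n_NB
        MN_ratio=0.5,       # mid-near pairs per neighbour pair
        FP_ratio=2.0,       # further pairs per neighbour pair
    )
    emb = reducer.fit_transform(X, init="pca")   # PCA initialization, as above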

Clustering

Motivation

  • Unsupervised technique that links similar items together
  • Use cases
    • Understand the world - get new insights
      • Pattern discovery
      • Customer segmentation
      • Anomaly detection…
    • Acting on clusters
      • Features for prediction or explanatory models
      • Different marketing campaign strategies
      • Segment-specific eligibility rules and scoring…

K-Means Clustering

History & Motivation

  • Origin: Stuart Lloyd, Bell Labs, 1957
    • Initially for analog-to-digital signal quantization
  • Extension to data clustering: James MacQueen, 1967
    • Apply same principle to multi-dimensional data
  • Core idea: Group n points into k clusters by minimizing distance to centroids

Algorithm (Iterative Steps)

  1. Initialize k centroids (random, K-Means++, or domain knowledge)
  2. Assign each point to the nearest centroid (Assignment step)
  3. Recompute each centroid as the mean of assigned points (Update step)
  4. Repeat steps 2–3 until convergence (minimal change in centroids or WCSS)

WCSS = \sum_{i=1}^{k} \sum_{x \in C_i} \|x - \mu_i\|^2
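
A minimal NumPy sketch of Lloyd's iteration (no empty-cluster handling, for brevity):

    import numpy as np

    def kmeans(X, k, n_iter=100, seed=0):
        rng = np.random.default_rng(seed)
        centroids = X[rng.choice(len(X), k, replace=False)]       # random initialization
        for _ in range(n_iter):
            # assignment step: nearest centroid in squared Euclidean distance
            d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
            labels = d.argmin(axis=1)
            # update step: centroid = mean of the assigned points
            new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
            if np.allclose(new_centroids, centroids):              # convergence check
                break
            centroids = new_centroids
        wcss = ((X - centroids[labels]) ** 2).sum()                # within-cluster sum of squares
        return labels, centroids, wcss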

Convergence Property

Theorem: WCSS decreases monotonically with each iteration:

WCSS^{(1)} \ge WCSS^{(2)} \ge WCSS^{(3)} \ge \dots

  • The assignment and update steps never increase the intra-cluster variance
  • May converge to local minimum, not guaranteed global minimum

Centroid Choices

  • Mean (standard K-Means): minimizes squared distance for each cluster

\mu = \arg\min_{t} \frac{1}{n} \sum_{x \in \text{Cluster}} \|x - t\|^2

  • Median (K-Medians) or an actual data point (K-Medoids): more robust to outliers

Advantages

  • Simple, fast, scalable (roughly linear in the number of points per iteration)
  • Works well for large datasets
  • Intuitive and easy to implement and distribute (e.g. Spark MLlib)
  • Centroids can be the point of the analysis
    • Example: Where to distribute COVID vaccination centers?

Disadvantages

  • K-Means assumes each cluster has a natural center
  • Fails for:
    • elongated, curved, or irregular clusters
    • nested structures or clusters without a clear center

Hierarchical Clustering

History & Motivation

The technique gained traction in the 1960s and 1970s.

  • Complex dataset explosion in biological sciences and psychometrics.
  • Need for taxonomy and interpretable sub-segmentation.
  • Data often heterogeneous or categorical, clusters non-convex.

Result: The Dendrogram

(figure: example dendrogram)

Linkage types

Single Linkage
  • Cluster distance defined as the minimum pairwise distance
    D(A,B) = \min_{i \in A, j \in B} d(i,j)

  • Builds clusters by connecting nearest neighbors

  • Sensitive to local minima — may form long chains of points

  • Characteristics

    • Captures elongated or non-convex shapes
    • Handles arbitrary geometries
    • Very sensitive to noise and outliers
    • Tends to produce “chaining effect” – merging distant points through narrow connections

Complete Linkage
  • Cluster distance defined as the maximum pairwise distance
    D(A,B) = \max_{i \in A, j \in B} d(i,j)

  • Merges clusters only when all points in both clusters are relatively close

  • Strong preference for small-diameter, tightly packed clusters

  • Characteristics

    • Produces compact, spherical clusters
    • Much less prone to chaining than single linkage
    • Very sensitive to outliers—a single far point prevents merging
    • Rarely used - if one aims for compact clusters, Ward is usually a better choice

Average Linkage
  • Cluster distance = average pairwise distance
    D(A,B) = \frac{1}{|A||B|} \sum_{i \in A, j \in B} d(i,j)

  • Balances between single (min) and complete (max) linkage

  • Characteristics

    • Produces moderately compact, balanced clusters
    • Less sensitive to noise than single linkage
    • Works well when clusters are roughly spherical
    • May fail for elongated or uneven-density clusters

Ward’s Method
  • Merges clusters that cause the smallest increase in within-cluster variance
    D(A,B) = \Delta WCSS = WCSS_{A \cup B} - (WCSS_A + WCSS_B)

  • Equivalent to hierarchical variance minimization (similar to k-means)

  • Characteristics
    • Produces compact, spherical clusters
    • Favors clusters of similar size and variance
    • Fails for elongated or irregularly shaped clusters
    • Not suitable for non-Euclidean distances

Summary: Linkage Comparison

Method   | Definition           | Produces                     | Fails for
Single   | \min d(i,j)          | Elongated, chained clusters  | Noisy data, chaining effect
Average  | \text{mean } d(i,j)  | Balanced, compact clusters   | Elongated or uneven-density clusters
Ward     | \Delta\text{SSE}     | Compact, spherical clusters  | Elongated or irregular clusters
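
A SciPy sketch comparing linkage methods and cutting the tree into a fixed number of clusters:

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

    X = np.random.default_rng(0).normal(size=(50, 4))      # toy data

    for method in ["single", "complete", "average", "ward"]:
        Z = linkage(X, method=method)                       # merge history (the dendrogram)
        labels = fcluster(Z, t=3, criterion="maxclust")     # cut into 3 clusters
    # dendrogram(Z) plots the tree of the last fitted method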

DBSCAN Clustering

Motivation

  • Traditional clustering algorithms struggle with non-convex shapes

  • Many datasets contain noise points (measurement errors, anomalies)

    • Forcing them into clusters can distort results
  • Cannot learn the number of clusters directly from data

  • Solution: DBSCAN

    • Density-Based Spatial Clustering of Applications with Noise
    • Defines clusters as dense regions of points

Dense Region Definition

Two key parameters:

  • ε (epsilon): radius for neighborhood
  • MinPts: minimum number of points to form a dense region

Intuition:

A point belongs to a cluster if it is in a dense region (≥ MinPts in ε-neighborhood)

Algorithm Steps

  1. Pick an unvisited point
  2. Count points within ε distance
    • If ≥ MinPts → core point, start new cluster
    • If < MinPts → temporarily labeled as noise
  3. Expand cluster by adding all points density-reachable from the core point
  4. Repeat until all points are visited
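
A scikit-learn sketch; eps and min_samples correspond to ε and MinPts above:

    import numpy as np
    from sklearn.cluster import DBSCAN

    X = np.random.default_rng(0).normal(size=(200, 2))      # toy data
    labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
    # labels == -1 marks noise points; other integers index the discovered clusters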

Advantages

  • Can detect clusters of arbitrary shape
  • Less resource-intensive than hierarchical clustering
  • Automatically determines number of clusters
  • Handles noise points automatically

Limitations

  • Sensitive to parameter choice:
    • ε too small: natural clusters split
    • ε too large: separate clusters merge
  • Struggles when clusters have different densities
  • Parameter tuning may require domain knowledge

Example: Clustering of Banking Dataset

  • We clustered banking clients based on
    MCC spending categories, temporal expenditure patterns,
    and related behavioural features.

  • Goal: derive customer prototypes for high-quality
    synthetic data generation.

    1. Standardization was sufficient since all features were
      expressed as percentages or normalized proportions.
    2. Applied UMAP with a cosine metric to capture behavioural similarity.
    3. Used DBSCAN to identify dense groups in the embedding space.
  • Our priority was obtaining clean, interpretable prototypes.

  • Edge points (border/noise points) would distort prototype profiles.
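
A hedged sketch of the pipeline above; the data is a random stand-in for the behavioural shares, and eps / min_samples are illustrative values that were in reality tuned on the development sample:

    import numpy as np
    import pandas as pd
    import umap
    from sklearn.preprocessing import StandardScaler
    from sklearn.cluster import DBSCAN

    rng = np.random.default_rng(0)
    X = np.abs(rng.normal(size=(1000, 40)))
    X = X / X.sum(axis=1, keepdims=True)                     # shares summing to 1 per client

    X_std = StandardScaler().fit_transform(X)                # step 1: standardization
    emb = umap.UMAP(metric="cosine", random_state=0).fit_transform(X_std)   # step 2
    labels = DBSCAN(eps=0.3, min_samples=20).fit_predict(emb)               # step 3

    # prototypes = per-cluster means, with noise points (-1) excluded to keep them clean
    prototypes = (
        pd.DataFrame(X).assign(cluster=labels).query("cluster != -1").groupby("cluster").mean()
    )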

Challenges

Mixed data types

Problem
  • Methods often implicitly assume “Gaussian-like” distributions
    • Continuous
    • Sample space: \mathbb{R}
    • Not too skewed
  • Therefore distances based on Euclidean metric are perfectly natural
  • But real life data…
    • Both numeric and categorical
    • Often with limited support ([0, \infty), [0,1]), skewed, mixtures…
    • Contain missing values

Solution
  • Consider if using all different kinds of data makes sense
    • Working with “data of similar kind” usually brings better results (justification, interpretability, stability)

  • Consider different metrics
    • Std. distance or custom (domain specific)
    • Think of how metric behaves and if that agrees with what you want to put most significance on
    • Warning: Custom metrics may introduce computational difficulties
  • Preprocess the data beforehand
    • Make them more “Gaussian-like” to be easily consumed by algos with Euclidean distances
    • E.g. log and other functional transforms, dummy variables, imputation, re-weighting…
    • Warning: Problematic justification, sensitivity to weights, the metric tends to be dominated by indicator variables

Interpretation

Descriptive approach using stats and visuals:

  • Colour projection by variables
    • Find out relation between high-dim and low-dim variables
    • Locate concentrations of specific variable values and see how they correspond to clusters
  • Compare distributions in cluster and overall (or in complement)
    • Proper stats and plot types for different variable types

Example: for the feature \mathtt{weight}, check whether

\mathbb{E}[\mathtt{weight} \mid \text{in cluster}] > \mathbb{E}[\mathtt{weight} \mid \text{out cluster}]

Explanatory approach: interpret helper model

  • Fit (multiclass) classification: target = cluster, features = (not only) input variables
  • Detect main drivers of cluster membership and how the variables associate with clusters
  • The difference from the descriptive approach (previous slide) is that here we control for the other variables
  • Example: Considering features weight and height, we fit a cluster membership model

    \frac{p}{1-p}= \exp(\beta_0 + \beta_{wt} \mathtt{weight} + \beta_{ht} \mathtt{height})

    • If \beta_{wt} > 0, then people with higher weight are more likely to belong in the cluster (while holding the rest constant).
    • If |\beta_{wt}| > |\beta_{ht}|, weight is a stronger contributor than height (assuming comparably scaled features).
    • More advanced “black-box” models (GB, NN) combined with SHAP values can be used.
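
A minimal sketch of such a helper model (synthetic weight/height features and cluster labels; logistic regression as the interpretable choice):

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.preprocessing import StandardScaler

    rng = np.random.default_rng(0)
    features = rng.normal(size=(500, 2))          # columns: weight, height (synthetic)
    clusters = rng.integers(0, 3, size=500)       # labels from a previously fitted clustering

    X_std = StandardScaler().fit_transform(features)   # makes coefficients comparable
    clf = LogisticRegression(max_iter=1000).fit(X_std, clusters)
    print(clf.coef_)    # one row per cluster; signs and magnitudes read as on this slide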

Confounding matters

  • Audience Question: Can the descriptive and explanatory lenses disagree?
  • Let us imagine we have a dataset with two features, weight and height.

  • We fit a logistic regression (or use other explanatory technique of our choice)

    \frac{p}{1-p} = \exp(\beta_0 + \beta_{wt} \mathtt{weight} + \beta_{ht} \mathtt{height})

Can it happen that \beta_{wt} < 0 but \mathbb{E}[\mathtt{weight} | \text{in cluster}] > \mathbb{E}[\mathtt{weight} | \text{out cluster}]?

Problem: Tall and Thin

  • Cluster: “Tall and Thin” people
    • Height and weight are correlated.
    • Descriptive stats may show:
      • Tall and heavier than average.
    • But relative to their height, they are actually lighter than expected.
  • Interesting things happen where the correlation structure breaks
    • People under insolvency = People who are technologically conservative for their age.

Labelling consistency

Problem
  • Nature of methods is often stochastic
  • Even with a fixed random state and exactly the same data set, DR may produce e.g. a rotated or flipped projection and permuted cluster labels
  • Once we refit on a different sample from the same distribution, results might seem completely different at first sight
  • We know the cluster we interpreted as xxx is still there, but it has a different number and different coordinates in low-dim space, so it is hard to find

Solution
  • Formulate as
    • Graph problem: minimum weight perfect matching in bipartite graph
    • Integer programming problem: optimal assignment
  • Such a problem can be solved by the “Hungarian algorithm” (implemented e.g. in scipy); see the sketch below
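
A sketch of label matching via scipy.optimize.linear_sum_assignment (the Hungarian algorithm), assuming both runs label clusters 0..k-1:

    import numpy as np
    from scipy.optimize import linear_sum_assignment
    from sklearn.metrics import confusion_matrix

    def match_labels(ref_labels, new_labels):
        # cost = negative overlap, so the optimal assignment maximizes agreement
        overlap = confusion_matrix(ref_labels, new_labels)
        ref_ids, new_ids = linear_sum_assignment(-overlap)
        mapping = dict(zip(new_ids, ref_ids))
        return np.array([mapping[l] for l in new_labels])

    # aligned = match_labels(labels_previous_fit, labels_refit)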

Validation

  • Two ways to validate clustering
    • Technical
      • Size balance
      • Correct shape
      • Stability across retraining
    • Semantic
      • Clusters make sense
      • Target informativeness
      • Separation on unseen features

Technical Validation: Shape

  • Definition: Clustering is grouping points that are close to each other and far from others
  • Three key concepts
    • Compactness
      • Silhouette, WCSS, Davies–Bouldin
    • Separation
      • Dunn Index, Inter-cluster distance
    • Connectedness
      • CHAS, Graph-based measures
  • Algorithms
    • K-Means, Ward → Compactness
    • DBSCAN, Single Linkage → Connectedness
  • Note: Useful for automation, but don’t over-rely

Silhouette Score (for each point i):

  • s(i) = \frac{b(i) - a(i)}{\max(a(i), b(i))}\in [-1, 1]

    • a(i) = avg. dist. from i to points in the same cluster
    • b(i) = avg. dist. from i to points in the nearest other cluster

Technical Validation: Stability

  • Clusters are stable if they stay similar under:
    • resampling (removing a few points),
    • reinitializations (different random seeds),
    • small changes or noise in the data (perturbations).

  • Rand Index (RI)
    • Measures agreement between two clusterings based on pairs of points: RI = \frac{\text{number of agreeing pairs}}{\text{total number of pairs}}
    • Some pairs will agree by chance, so we adjust for the expected baseline: \text{ARI}(C_1, C_2) = \frac{\text{RI} - \text{Expected RI}}{\text{Max RI} - \text{Expected RI}}
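
A scikit-learn sketch: silhouette for shape, ARI for stability across two re-initializations (toy data, K-Means as an example algorithm):

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_score, adjusted_rand_score

    X = np.random.default_rng(0).normal(size=(300, 5))
    labels_a = KMeans(n_clusters=4, random_state=0, n_init=10).fit_predict(X)
    labels_b = KMeans(n_clusters=4, random_state=1, n_init=10).fit_predict(X)

    print(silhouette_score(X, labels_a))            # compactness vs. separation, in [-1, 1]
    print(adjusted_rand_score(labels_a, labels_b))  # 1 = identical partitions, ~0 = chance level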

Semantic Validation: Target Informativeness

  • Goal: Measure how useful the clustering is for an external business-relevant target (e.g., loan uptake, churn, campaign response)

  • Reformulate to supervised learning:
    Does cluster membership carry information about the target?

Example: Car Loan Uptake

  1. Fit clustering at reference date (e.g. 1.1.2025)
  2. Observe who takes a car loan within 6 months
  3. Compute cluster-level informativeness
    I(\text{cluster}) = \log\!\left(\frac{P(\text{loan} \mid \text{cluster})}{P(\text{loan})}\right)
  4. Overall informativeness of clustering
    I = \sum_{\text{cluster}} P(\text{cluster}) \cdot \,\lvert I(\text{cluster})\rvert
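
A small pandas sketch of steps 3-4 (column names and numbers are illustrative):

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({
        "cluster":   [0, 0, 0, 1, 1, 2, 2, 2, 2, 2],
        "took_loan": [1, 0, 0, 1, 1, 0, 0, 0, 0, 1],
    })

    p_loan = df["took_loan"].mean()                           # P(loan)
    per_cluster = df.groupby("cluster")["took_loan"].agg(["mean", "size"])
    per_cluster["I"] = np.log(per_cluster["mean"] / p_loan)   # I(cluster), the log-lift
    weights = per_cluster["size"] / len(df)                   # P(cluster)
    I_total = (weights * per_cluster["I"].abs()).sum()        # overall informativeness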

Semantic Validation: Omitted Variables

  • Idea: Validate clusters using variables that were not used during clustering.

  • Why it matters: If clusters show clear structure in unseen variables, it means the model captured
    real distinct population patterns, not spurious relationships in the fitted features.

Example: ATM Withdrawals (Unseen Variables)

  • Clustering is trained only on transaction patterns.
  • Big-city ATMs form distinct temporal patterns
    • life in city areas starts at later hours

Prod usage

  • Bringing DR and clustering solutions to prod is… difficult
    • Some techniques are primarily “exploratory” and don’t have a transform method for new observations
    • Many decisions during implementation are crafted manually on dev sample → problem to automate refits
    • Interpretation is often mostly manual → costly re-interpretations

Two goals which go against each other:

Interpretability & stability

  • Fewer, not too small, quite well-separated clusters
  • Nice stories for most of them
  • Robust against small distribution shifts and hyperparameter changes
  • Clusters are still quite variable, and the information loss is too high to produce accurate predictions

Usability in models (predictive power)

  • Microsegments, complex projection structures
  • Some segments might have very interesting backgrounds, others more blended, impossible to interpret everything
  • Can bring high added value on selected microsegments
  • Sensitive to data changes, hyperparams

Extras

Meta-Clustering

  • Clustering is often underspecified, unstable, and difficult to use directly in practice.
  • Rich Caruana’s famous lecture: “Clustering is approximately useless.”
  • Result: growing literature on embedding clustering into more complex, stable, and useful frameworks.

Prototype Clustering (Caruana et al.)

A single clustering is often unstable and heavily dependent on feature weights, distance metrics, preprocessing choices, and the clustering algorithm.

Idea

Search for stable structure by generating a large ensemble of clusterings
(e.g., 100,000 runs with random weights or metrics).

Steps
  1. Run many clusterings under diverse settings.
  2. Measure similarity between cluster assignments
    (e.g., Rand Index, Variation of Information).
  3. Cluster the clusterings themselves → obtain prototype clusterings.
  4. Present prototype clusterings to users and select the most
    useful, interpretable, or actionable one.
Outcome

A small set of representative, stable clusterings that summarizes the landscape of possible solutions and lets the end user pick.

Feedback-Driven Clustering

  • Real-world feedback often expresses constraints such as:

    • Must-Link (ML): “These points belong together.”
    • Cannot-Link (CL): “These points must not be grouped together.”
  • Many methods incorporate such constraints:

    • Constrained K-Means
    • Finding the clustering closest to a baseline while satisfying ML/CL
    • Learning a distance metric consistent with the constraints
  • This enables iterative refinement with domain experts.

  • As the business is part of the solution, they are more likely to be on board.

Voting / Consensus Clustering

Generate multiple clusterings on the same dataset:

  • Different distance metrics
  • Different subsets of variables
  • Different data types (categorical, count, skewed, etc.)

Define a co-association matrix:

  • Distance between two points = proportion of clusterings where they do not appear together.
  • Cluster this matrix to obtain a consensus clustering.
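
A sketch of the co-association construction (K-Means over several seeds stands in for the diverse clusterings described above):

    import numpy as np
    from sklearn.cluster import KMeans, AgglomerativeClustering

    X = np.random.default_rng(0).normal(size=(200, 6))
    runs = [KMeans(n_clusters=4, random_state=s, n_init=10).fit_predict(X) for s in range(10)]

    # co-association: share of runs in which two points land in the same cluster
    co = np.mean([lab[:, None] == lab[None, :] for lab in runs], axis=0)
    dist = 1.0 - co                                           # the distance defined above

    consensus = AgglomerativeClustering(
        n_clusters=4, metric="precomputed", linkage="average"   # older scikit-learn uses affinity=
    ).fit_predict(dist)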

Advantages:

  • Produces a solution that is more stable than any individual clustering.
  • Naturally handles heterogeneous data by mixing clusterings tailored to different variable types.