Dimension Reduction & Clustering

Advanced Data Preparation / Unsupervised Modelling

Outline

  • Intro
  • Dimension Reduction
  • Clustering
  • Challenges

…and examples along the road

Intro

Dimension Reduction and Clustering can be considered as part of…

  • Data Preparation stage
    • “Advanced transformations”
    • Used as features in models
  • Modelling stage
    • Unsupervised modelling
      • Focused on discovering structures and patterns in the data
      • No “unknown quantity” to predict, no target to fit against

Supervised Learning

  • Model Relationship between features and a label
  • Examples
    • propensity
      • who is likely to buy a certain product?
    • churn
      • who is likely to leave?
    • regression
      • e.g. what is person’s predicted consumption for the next year?
  • Challenges
    • Fit on train set - generalize to test set
    • Sensitive to temporal issues, leakage, feature drift
    • Modelling stage often standardized - e.g., gradient boosting, etc.

Unsupervised Learning

  • No labels - learn structure and patterns from similarities in the data
  • Examples
    • empirical modelling
      • find relationships and flags without a target
    • anomaly detection
      • identify unusual or suspicious behavior
    • clustering
      • discover natural groups of points
  • Challenges
    • Define similarity measure between diverse features
    • Modelling is hard to automate
      • no ground truth against which to judge feature informativeness or the inferred boundaries

Metrics

  • Metric = how we measure similarity (or dissimilarity).
  • Main building block for most unsupervised problems.
  • Euclidean:
    • d(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}
    • simple geometrical interpretation, differentiable, but highly sensitive to scale
  • Cosine:
    • d(p, q) = 1 - \frac{p \cdot q}{\|p\| \|q\|}
    • semantic distance (often used in NLP), scale invariant, but its geometric interpretation can be misleading
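
A minimal sketch contrasting the two metrics on toy vectors (SciPy assumed available):

    import numpy as np
    from scipy.spatial.distance import euclidean, cosine

    p = np.array([1.0, 2.0, 3.0])
    q = np.array([2.0, 4.0, 6.0])   # same direction as p, twice the magnitude

    print(euclidean(p, q))          # ~3.74: sensitive to scale
    print(cosine(p, q))             # 0.0: scale invariant (parallel vectors)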


Metrics: Toy Example – The Ferret Problem

Standard Setup: OHE → Standardization → Euclidean

  • Height: X \sim N(\mu,\sigma^2)
  • Pet: dog / ferret \sim \text{Be}(0.5)

After one-hot + standardization

  • Height: X \sim N(0,1)
  • Pet: Z_D \in \{-1,1\} \sim \text{Be}(0.5)
  • Categorical contribution: (1 - (-1))^2 = 4

Compare with height

  • P\left[(X_1 - X_2)^2 > 4\right] \approx 0.16
  • Pets dominate: height beats “pet” only ~16%
  • Audience Question: What if pet \sim \text{Be}(0.02)?

  • The graph shows
    • 200 points with dog / ferret \sim Be(0.5), Height: X \sim N(0,1)
    • a median reference point for height (assigned a ferret)
    • colour = how far the point is from the ferret reference point

  • Solution? Reweighting of variables (or weighted metrics like Gower’s distance), empirical evaluation, etc.
  • Downweight the categorical feature by a downweighting constant c: \text{ferret} = \text{ferret} \cdot c (sketched below)
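
A sketch of the toy setup and the downweighting fix; the constant c = 0.3 below is a hypothetical value that would in practice be chosen empirically:

    import numpy as np

    rng = np.random.default_rng(0)
    n = 200
    height = rng.normal(0, 1, n)                  # already on the N(0,1) scale
    pet = rng.binomial(1, 0.5, n)                 # 1 = ferret, 0 = dog
    pet_std = (pet - pet.mean()) / pet.std()      # roughly +/-1 after standardization

    d_height = (height[:, None] - height[None, :]) ** 2   # pairwise squared differences in height
    differ = pet[:, None] != pet[None, :]                  # pairs with different pets

    for c in [1.0, 0.3]:                          # c = 1: plain OHE; c = 0.3: assumed downweighting
        d_pet = (c * pet_std[:, None] - c * pet_std[None, :]) ** 2
        print(c, (d_height > d_pet)[differ].mean())   # ~0.16 without downweighting, much higher with it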

Dimension Reduction (DR)

Motivation

  • Goals of transforming data by DR methods
    • Reduce dataset complexity, avoid the “curse of dimensionality”
    • Preserve as much information as possible
  • Use cases
    • Models with fewer features
      • Less complex structure
      • Better performance (overfitting, multicollinearity…)
    • Finding patterns
      • Visualization of similar entities is easier in fewer dims
  • Commonly used workflow: DR → clustering

Note: not part of this lecture

  • Reducing dimension by feature selection (various methods)
  • DL Embedding methods
    • E.g. word embeddings
    • Also produce lower-dim. representations of data
    • Main goal is to learn task-relevant representation to serve as input to downstream models
    • Not designed to preserve local and/or global structure → not good for visualization

Types of dimred methods

  • Global
    • Preserve relative distances or rank information between points, i.e. the relative positions of neighborhoods from the high-dimensional space
    • Stronger focus on the preservation of distances between points that are farther away from each other
    • Cannot capture complex nonlinear structures
  • Local
    • Preserve each observation’s neighborhood from the high-dimensional space, i.e. its set of high-dimensional neighbors
    • Preserve the relative distances or rank information among neighbors

Timeline of DR methods

(timeline figure)

PCA

  • Main idea:
    • Find linear transformations (linear combinations) of original observations \mathbf{x}_i
    • Directions (=PCs) defining the transformations are orthonormal
    • The first transformed variable has the largest variance (explains the most variance); each further one has the largest variance subject to its direction being orthogonal to the previous directions (i.e. once the effect of the previous components is removed)
  • Essentially a change of coordinate system such that if we ignore dimensions j=k+1,\dots,p we lose as little information as possible

  • Formally the j-th principal component is

    \mathbf{w}_j = \mathrm{arg\,max}_{\mathbf{w} \in \mathbb{R}^p,\ ||\mathbf{w}||=1,\ \mathbf{w} \perp \mathbf{w}_1, \dots, \mathbf{w}_{j-1}} \mathrm{var}(\mathbf{w}^T\mathbf{x}_i)

  • Solution (just linear algebra)
    • Principal components are eigenvectors of \mathbb{V}, the covariance matrix of \mathbf{x}_i
    • The maximum variance achieved is the corresponding eigenvalue
  • Terminology
    • \mathbf{w}_j, j=1, \dots, p are principal components
    • t_{ij} = \mathbf{w}_j^T\mathbf{x}_i are principal components scores
  • Matrix form: \mathbb{T} = \mathbb{X}\mathbb{W}, where \mathbb{W} = (\mathbf{w}_1 | \dots | \mathbf{w}_p)

(figure: PCA on the Iris dataset)
  • Computation (assuming centered data):
    • Compute spectral decomposition of sample covariance matrix (since true is unknown):

      \widehat{\mathbb{V}} \propto \mathbb{X}^T\mathbb{X} = \mathbb{W}\mathbb{\Lambda}\mathbb{W}^T → eigenvectors in \mathbb{W} are PCs

    • Compute SVD of \mathbb{X} = \mathbb{U}\sqrt{\mathbb{\Lambda}}\mathbb{W}^T → right singular vectors in \mathbb{W} are PCs (used in practice)

  • Criteria for choosing the number of PCs
    • % of total variance explained
    • Scree plot (eigenvalue index vs. % of explained variance → elbow method)
    • 2-3 for plotting
  • Limitations
    • Sensitivity to scale → center and scale to unit variance beforehand
    • Can capture only linear relationships between features
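
A minimal scikit-learn sketch (Iris used as a stand-in dataset): scale, fit PCA, and inspect the explained variance used for the criteria above.

    from sklearn.datasets import load_iris
    from sklearn.preprocessing import StandardScaler
    from sklearn.decomposition import PCA

    X = load_iris().data
    X_std = StandardScaler().fit_transform(X)    # PCA is scale sensitive

    pca = PCA()                                  # keep all components first
    scores = pca.fit_transform(X_std)            # T = X W (principal component scores)
    print(pca.explained_variance_ratio_)         # input for the % variance rule / scree plot
    X_2d = scores[:, :2]                         # first two PCs for plotting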

Example: projection of odd values

What was it about

  • Vendor Selection Hackathon - the company was looking for a strategic partner for DS/ML/AI
  • Task: build POC of solution for credit scoring within 168 hrs
  • Challenge:
    • All data (except PK, target and few other variables) are masked (names var1-var5xx)
    • How to tackle this if our main strength lies in crafting powerful features based on deep business and data understanding, and in being able to interpret the model and find stories in it?
  • Observation: variables often had odd values which seemed out of distribution (different types of missings, default values…?)

Approach (one of many)

  • Guess what are “odd values” → indicator matrix → PCA
  • Revealed groups representing clients coming from different sources with various amount of information available
  • Segments work as naïve risk categories
    • cluster B: “well-known and safe”
    • cluster D: “risky newbies in credit”

Lesson learned: PCA still may be useful for specific data patterns
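
A hedged sketch of this approach; the data frame and the list of “odd values” below are purely illustrative stand-ins for the masked variables:

    import pandas as pd
    from sklearn.decomposition import PCA

    df = pd.DataFrame({                          # stand-in for the masked var1..var5xx
        "var1": [1.2, -1.0, 9999.0, 0.4],
        "var2": [0.0, 0.0, 3.1, 9999.0],
    })

    ODD_VALUES = [-1, 0, 9999]                   # hypothetical guesses at "odd values"
    indicator = pd.concat(
        [df.isna()] + [df.eq(v) for v in ODD_VALUES], axis=1
    ).astype(int)                                # indicator matrix of "oddness"

    proj = PCA(n_components=2).fit_transform(indicator)   # 2-D projection revealing the groups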

Timeline of DR methods

(timeline figure)


t-SNE

  • = t-Distributed Stochastic Neighbor Embedding
  • Based on conditional probability distribution POV on data:
    • Distribution represents process of sampling neighbours:

      p_{j|i} := \mathbb{P}(\text{"choose } \mathbf{x}_j \text{ as neighbor"} \mid \text{centered at } \mathbf{x}_i)

    • Similar (close) points have high probability to be picked as neighbours (and vice versa)

  • The algorithm creates aligned high-dim. and low-dim. distributions

(figure: MNIST)
High dim space
  • Assume generating neighbourhood of \mathbf{x}_i is proportional to Gaussian distribution \mathcal{N}_p(\mathbf{x}_i, \sigma_i^2 \mathbb{I}_p):

    p_{j∣i} = \frac{\exp(-\frac{||\mathbf{x}_j - \mathbf{x}_i||_2^2}{2\sigma_i^2})}{\sum_{k \neq i} \exp(-\frac{||\mathbf{x}_k - \mathbf{x}_i||_2^2}{2\sigma_i^2})}

  • Pick the \sigma_i-s based on a chosen Perplexity: \mathrm{Perp} := 2^{-\sum_j p_{j|i} \log_{2}p_{j|i}}
    • Uniform probs for K neighbours p_{j∣i} = 1/K give \mathrm{Perp} = 2^{\log_{2}K} = K, therefore we can understand perplexity as “effective number of neighbours”.
    • Sigmas are chosen so that effective no. of neighbours is the same for all obs.
    • Notice: Perplexity is just a transformation of Shannon entropy, therefore it is a measure of the uncertainty of the distribution, here specifically of the uncertainty about \mathbf{x}_i’s neighbours.
  • Define the symmetric probability p_{ij} = \frac{p_{i|j} + p_{j|i}}{2n}
Low dim space
  • Assume generating the neighbourhood of \mathbf{y}_i is proportional to a multivariate t-distribution with 1 degree of freedom, t_q(\mathbf{y}_i, \mathbb{I}_q, 1):

    q_{ij} = \frac{(1+||\mathbf{y}_j - \mathbf{y}_i||_2^2)^{-1}}{\sum_{k \neq l} (1+||\mathbf{y}_k - \mathbf{y}_l||_2^2)^{-1}}

    • Note: already symmetric

(figure: MNIST - SNE vs t-SNE)
Algorithm of probs alignment
  • Loss function: Kullback–Leibler divergence KL(P, Q) = \sum_i \sum_j p_{ij} \log \frac{p_{ij}}{q_{ij}}
    • KL divergence is a natural measure of “distance” between probability distributions (not a proper distance - it is not symmetric)
    • Relation to cross-entropy: KL(P, Q) = H(P, Q) - H(P)
  • Init: \mathbf{y}_i \sim \mathcal{N}_q(\mathbf{o}, 10^{-4}\mathbb{I}_q)
  • Optim: GD with momentum, adaptive learning rate
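
A usage-level sketch with scikit-learn (the digits dataset used as a small MNIST-like stand-in):

    from sklearn.datasets import load_digits
    from sklearn.manifold import TSNE

    X = load_digits().data
    emb = TSNE(
        n_components=2,
        perplexity=30,      # "effective number of neighbours"
        init="random",      # random initialization, as described above
        random_state=0,
    ).fit_transform(X)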

UMAP

  • Based on graph representation of data
    • Weighted graph G
    • Vertices ~ observations \mathbf{x}_i, i=1, \dots, n
    • Edges ~ being among kNN (from either side)
    • Weights ~ proportional to similarities, can be understood as probabilities (sampling data points around \mathbf{x}_i)
  • Algorithm creates aligned high-dim. and low-dim. graph representations
  • Variations and improvements still appear: parametric UMAP, UMATO…

Algorithm phases
  1. Graph construction for the high-dimensional space
    • Find the kNN (in a given metric), compute \rho_i, the minimum positive distance to a neighbour

    • Find scaling parameter \sigma_i (normalize distances to neighbours keeping proximities in orig. dim.)

    • Define a weight function w(\mathbf{x}_i, \mathbf{x}_j) (the closest neighbour gets weight 1, then weights decrease exponentially)

    • Define weighted graph G with weights based on symmetrized versions of w(\mathbf{x}_i, \mathbf{x}_j):

      \bar{w}_{ij} := w(\mathbf{x}_i, \mathbf{x}_j) + w(\mathbf{x}_j, \mathbf{x}_i) - w(\mathbf{x}_i, \mathbf{x}_j)w(\mathbf{x}_j, \mathbf{x}_i)

  2. Optimization of the low-dimensional graph layout
    • Init: spectral embedding (Laplacian Eigenmaps)
    • (Stochastic) Gradient Descent applied on observations (iterate over edges):
      • Attractive force: move \mathbf{y}_i towards \mathbf{y}_j

        \mathbf{y}_i := \mathbf{y}_i + \alpha \cdot f(||\mathbf{y}_i - \mathbf{y}_j||_2^2) \bar{w}_{ij} (\mathbf{y}_i - \mathbf{y}_j)

        (a la GD with learning rate \alpha, f contains hyperparams)

      • Repulsive force: push \mathbf{y}_i away from \mathbf{y}_k (random non-kNN point)

        \mathbf{y}_i := \mathbf{y}_i + \alpha \cdot g(||\mathbf{y}_i - \mathbf{y}_k||_2^2) (1-\bar{w}_{ik}) (\mathbf{y}_i - \mathbf{y}_k)
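
At usage level these graph-construction and layout phases are wrapped by the umap-learn package; a minimal sketch (package assumed installed as umap-learn):

    from sklearn.datasets import load_digits
    import umap

    X = load_digits().data
    emb = umap.UMAP(
        n_neighbors=15,      # size of the kNN neighbourhood used for the graph
        min_dist=0.1,        # how tightly points may be packed in the low-dim layout
        metric="euclidean",  # metric used to find the kNN in high dimension
        random_state=42,
    ).fit_transform(X)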

Example: time patterns of ATM withdrawals

  • Special type of transactions: ATM withdrawals
  • Focus on avg no. of withdrawals in every hour of the week (each ATM is represented by 168 features)
  • UMAP with cosine metric (= the weekly pattern matters, not the absolute number of withdrawals)
  • ATMs with “shared quality” are projected close to each other

Timeline of DR methods

(timeline figure)


PaCMAP

  • Idea: use three types of point pairs to first capture more global structure, then refine local structure → better trade-off between preserving both local and global structure

(figure: MNIST)
Types of pairs
  • Neighbour pairs
    • Observation \mathbf{x}_i + each of n_{NB} nearest neighbours

    • Distance used: scaled Euclidean ||\mathbf{x}_i - \mathbf{x}_j||_2^2 / (\sigma_i \sigma_j)

      (scale to account for different magnitudes of neighbourhoods: larger scale if neighbours are far)

  • Mid-near pairs
    • Observation \mathbf{x}_i + 2nd nearest obs. out of 6 randomly selected points
    • Repeat to obtain n_{MN} such pairs for each \mathbf{x}_i
    • n_{MN} = \lfloor n_{NB} \cdot MN_\mathrm{ratio} \rfloor
  • Further pairs
    • Observation \mathbf{x}_i + each of n_{FP} randomly sampled non-neighbour points
    • n_{FP} = \lfloor n_{NB} \cdot FP_\mathrm{ratio} \rfloor

Algorithm
  • Hyperparams: n_{NB} = 10, MN_\mathrm{ratio} = 0.5, FP_\mathrm{ratio} = 2

  • Loss function:

    • \tilde{d}_{ij} = 1 + ||\mathbf{y}_i - \mathbf{y}_j||_2^2
    • Notice: the low-dim distances are similar to t-SNE’s - motivated by the t-distribution
    • Constructed to adhere to the principles of a “good loss function” (introduced in the paper)
  • Init: PCA or random \mathcal{N}_q(\mathbf{o}, 10^{-4}\mathbb{I}_q)

    • Notice: random init as in t-SNE
  • Optim: Adam optimizer, the weights of the loss change with iterations
    • 1st phase: big weight on MN pairs loss → embed so that mainly global structure is preserved (weights decrease with iterations → focus moved to local structure → preserved at least to certain degree)
    • 2nd phase: small (but not zero) weight for mid-near pairs → improve local structure while maintaining the global structure captured before
    • 3rd phase: reduce MN (and NN) pair weights → separate possible clusters and make their boundary clearer
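
A usage-level sketch with the pacmap package (assumed installed as pacmap; the call follows its documented interface, to the best of my knowledge):

    from sklearn.datasets import load_digits
    import pacmap

    X = load_digits().data
    reducer = pacmap.PaCMAP(
        n_components=2,
        n_neighbors=10,     # n_NB
        MN_ratio=0.5,       # mid-near pairs per neighbour pair
        FP_ratio=2.0,       # further pairs per neighbour pair
    )
    emb = reducer.fit_transform(X, init="pca")   # PCA initialization, as above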

Clustering

Motivation

  • Unsupervised technique that links similar items together
  • Use cases
    • Understand the world - get new insights
      • Pattern discovery
      • Customer segmentation
      • Anomaly detection…
    • Acting on clusters
      • Features for prediction or explanatory models
      • Different marketing campaign strategies
      • Segment-specific eligibility rules and scoring…

K-Means Clustering

History & Motivation

  • Origin: Stuart Lloyd, Bell Labs, 1957
    • Initially for analog-to-digital signal quantization
  • Extension to data clustering: James MacQueen, 1967
    • Apply same principle to multi-dimensional data
  • Core idea: Group n points into k clusters by minimizing distance to centroids

Algorithm (Iterative Steps)

  1. Initialize k centroids (random, K-Means++, or domain knowledge)
  2. Assign each point to the nearest centroid (Assignment step)
  3. Recompute each centroid as the mean of assigned points (Update step)
  4. Repeat steps 2–3 until convergence (minimal change in centroids or WCSS)

WCSS = \sum_{i=1}^{k} \sum_{x \in C_i} \|x - \mu_i\|^2
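
A minimal NumPy sketch of Lloyd's iteration (no empty-cluster handling, for brevity):

    import numpy as np

    def kmeans(X, k, n_iter=100, seed=0):
        rng = np.random.default_rng(seed)
        centroids = X[rng.choice(len(X), k, replace=False)]       # random initialization
        for _ in range(n_iter):
            # assignment step: nearest centroid in squared Euclidean distance
            d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
            labels = d.argmin(axis=1)
            # update step: centroid = mean of the assigned points
            new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
            if np.allclose(new_centroids, centroids):              # convergence check
                break
            centroids = new_centroids
        wcss = ((X - centroids[labels]) ** 2).sum()                # within-cluster sum of squares
        return labels, centroids, wcss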

Convergence Property

Theorem: WCSS decreases monotonically with each iteration:

WCSS^{(1)} \ge WCSS^{(2)} \ge WCSS^{(3)} \ge \dots

  • The assignment and update steps never increase the intra-cluster variance
  • May converge to local minimum, not guaranteed global minimum

Centroid Choices

  • Mean (standard K-Means): minimizes squared distance for each cluster

\mu = \arg\min_{t} \frac{1}{n} \sum_{x \in \text{Cluster}} \|x - t\|^2

  • Median (K-Medians) or an actual data point (K-Medoids): more robust to outliers

Advantages

  • Simple, fast, scalable (roughly linear in the number of points per iteration)
  • Works well for large datasets
  • Intuitive and easy to implement and distribute (e.g. Spark MLlib)
  • Centroids can be the point of the analysis
    • Example: Where to distribute COVID vaccination centers?

Disadvantages

  • K-Means assumes each cluster has a natural center
  • Fails for:
    • elongated, curved, or irregular clusters
    • nested structures or clusters without a clear center

Hierarchical Clustering

History & Motivation

The technique gained traction in the 1960s and 1970s.

  • Complex dataset explosion in biological sciences and psychometrics.
  • Need for taxonomy and interpretable sub-segmentation.
  • Data often heterogeneous or categorical, clusters non-convex.

Result: The Dendrogram

(figure: example dendrogram)

Linkage types

Single Linkage
  • Cluster distance defined as the minimum pairwise distance
    D(A,B) = \min_{i \in A, j \in B} d(i,j)

  • Builds clusters by connecting nearest neighbors

  • Sensitive to local minima — may form long chains of points

  • Characteristics

    • Captures elongated or non-convex shapes
    • Handles arbitrary geometries
    • Very sensitive to noise and outliers
    • Tends to produce “chaining effect” – merging distant points through narrow connections

Complete Linkage
  • Cluster distance defined as the maximum pairwise distance
    D(A,B) = \max_{i \in A, j \in B} d(i,j)

  • Merges clusters only when all points in both clusters are relatively close

  • Strong preference for small-diameter, tightly packed clusters

  • Characteristics

    • Produces compact, spherical clusters
    • Much less prone to chaining than single linkage
    • Very sensitive to outliers—a single far point prevents merging
    • Rarely used - if one aims for compact clusters, Ward is usually a better choice

Average Linkage
  • Cluster distance = average pairwise distance
    D(A,B) = \frac{1}{|A||B|} \sum_{i \in A, j \in B} d(i,j)

  • Balances between single (min) and complete (max) linkage

  • Characteristics

    • Produces moderately compact, balanced clusters
    • Less sensitive to noise than single linkage
    • Works well when clusters are roughly spherical
    • May fail for elongated or uneven-density clusters

Ward’s Method
  • Merges clusters that cause the smallest increase in within-cluster variance
    D(A,B) = \Delta WCSS = WCSS_{A \cup B} - (WCSS_A + WCSS_B)

  • Equivalent to hierarchical variance minimization (similar to k-means)

  • Characteristics
    • Produces compact, spherical clusters
    • Favors clusters of similar size and variance
    • Fails for elongated or irregularly shaped clusters
    • Not suitable for non-Euclidean distances

Summary: Linkage Comparison

Method   | Definition           | Produces                     | Fails for
Single   | \min d(i,j)          | Elongated, chained clusters  | Noisy data, chaining effect
Average  | \text{mean } d(i,j)  | Balanced, compact clusters   | Elongated or uneven-density clusters
Ward     | \Delta\text{SSE}     | Compact, spherical clusters  | Elongated or irregular clusters
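
A SciPy sketch comparing linkage methods and cutting the tree into a fixed number of clusters:

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

    X = np.random.default_rng(0).normal(size=(50, 4))      # toy data

    for method in ["single", "complete", "average", "ward"]:
        Z = linkage(X, method=method)                       # merge history (the dendrogram)
        labels = fcluster(Z, t=3, criterion="maxclust")     # cut into 3 clusters
    # dendrogram(Z) plots the tree of the last fitted method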

DBSCAN Clustering

Motivation

  • Traditional clustering algorithms struggle with non-convex shapes

  • Many datasets contain noise points (measurement errors, anomalies)

    • Forcing them into clusters can distort results
  • Cannot learn the number of clusters directly from data

  • Solution: DBSCAN

    • Density-Based Spatial Clustering of Applications with Noise
    • Defines clusters as dense regions of points

Dense Region Definition

Two key parameters:

  • ε (epsilon): radius for neighborhood
  • MinPts: minimum number of points to form a dense region

Intuition:

A point belongs to a cluster if it is in a dense region (≥ MinPts in ε-neighborhood)

Algorithm Steps

  1. Pick an unvisited point
  2. Count points within ε distance
    • If ≥ MinPts → core point, start new cluster
    • If < MinPts → temporarily labeled as noise
  3. Expand cluster by adding all points density-reachable from the core point
  4. Repeat until all points are visited
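
A scikit-learn sketch; eps and min_samples correspond to ε and MinPts above:

    import numpy as np
    from sklearn.cluster import DBSCAN

    X = np.random.default_rng(0).normal(size=(200, 2))      # toy data
    labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
    # labels == -1 marks noise points; other integers index the discovered clusters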

Advantages

  • Can detect clusters of arbitrary shape
  • Less resource-intensive than hierarchical clustering
  • Automatically determines number of clusters
  • Handles noise points automatically

Limitations

  • Sensitive to parameter choice:
    • ε too small: natural clusters split
    • ε too large: separate clusters merge
  • Struggles when clusters have different densities
  • Parameter tuning may require domain knowledge

Example: Clustering of Banking Dataset

  • We clustered banking clients based on
    MCC spending categories, temporal expenditure patterns,
    and related behavioural features.

  • Goal: derive customer prototypes for high-quality
    synthetic data generation.

    1. Standardization was sufficient since all features were
      expressed as percentages or normalized proportions.
    2. Applied UMAP with a cosine metric to capture behavioural similarity.
    3. Used DBSCAN to identify dense groups in the embedding space.
  • Our priority was obtaining clean, interpretable prototypes.

  • Edge points (border/noise points) would distort prototype profiles.
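
A hedged sketch of the pipeline above; the data is a random stand-in for the behavioural shares, and eps / min_samples are illustrative values that were in reality tuned on the development sample:

    import numpy as np
    import pandas as pd
    import umap
    from sklearn.preprocessing import StandardScaler
    from sklearn.cluster import DBSCAN

    rng = np.random.default_rng(0)
    X = np.abs(rng.normal(size=(1000, 40)))
    X = X / X.sum(axis=1, keepdims=True)                     # shares summing to 1 per client

    X_std = StandardScaler().fit_transform(X)                # step 1: standardization
    emb = umap.UMAP(metric="cosine", random_state=0).fit_transform(X_std)   # step 2
    labels = DBSCAN(eps=0.3, min_samples=20).fit_predict(emb)               # step 3

    # prototypes = per-cluster means, with noise points (-1) excluded to keep them clean
    prototypes = (
        pd.DataFrame(X).assign(cluster=labels).query("cluster != -1").groupby("cluster").mean()
    )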

Challenges

Mixed data types

Problem
  • Methods often implicitly assume “Gaussian-like” distributions
    • Continuous
    • Sample space: \mathbb{R}
    • Not too skewed
  • Therefore distances based on Euclidean metric are perfectly natural
  • But real life data…
    • Both numeric and categorical
    • Often with limited support ([0, \infty), [0,1]), skewed, mixtures…
    • Contain missing values

Solution
  • Consider if using all different kinds of data makes sense
    • Working with “data of similar kind” usually brings better results (justification, interpretability, stability)

  • Consider different metrics
    • Std. distance or custom (domain specific)
    • Think of how metric behaves and if that agrees with what you want to put most significance on
    • Warning: Custom metrics may introduce computational difficulties
  • Preprocess the data beforehand
    • Make them more “Gaussian-like” to be easily consumed by algos with Euclidean distances
    • E.g. log and other functional transforms, dummy variables, imputation, re-weighting…
    • Warning: Problematic justification, sensitivity to weights, the metric tends to be dominated by indicator variables

Interpretation

Descriptive approach using stats and visuals:

  • Colour projection by variables
    • Find out relation between high-dim and low-dim variables
    • Locate concentrations of specific variable values and see how they correspond to clusters
  • Compare distributions in cluster and overall (or in complement)
    • Proper stats and plot types for different variable types

Example: for the feature \mathtt{weight}, check whether

\mathbb{E}[\mathtt{weight} \mid \text{in cluster}] > \mathbb{E}[\mathtt{weight} \mid \text{out cluster}]

Explanatory approach: interpret helper model

  • Fit (multiclass) classification: target = cluster, features = (not only) input variables
  • Detect main drivers of cluster membership and how the variables associate with clusters
  • The difference from the descriptive approach (previous slide) is that here we control for the other variables
  • Example: Considering features weight and height, we fit a cluster membership model

    \frac{p}{1-p}= \exp(\beta_0 + \beta_{wt} \mathtt{weight} + \beta_{ht} \mathtt{height})

    • If \beta_{wt} > 0, then people with higher weight are more likely to belong in the cluster (while holding the rest constant).
    • If |\beta_{wt}| > |\beta_{ht}|, weight is a stronger contributor than height (assuming comparably scaled features).
    • More advanced “black-box” models (GB, NN) combined with SHAP values can be used.
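
A minimal sketch of such a helper model (synthetic weight/height features and cluster labels; logistic regression as the interpretable choice):

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.preprocessing import StandardScaler

    rng = np.random.default_rng(0)
    features = rng.normal(size=(500, 2))          # columns: weight, height (synthetic)
    clusters = rng.integers(0, 3, size=500)       # labels from a previously fitted clustering

    X_std = StandardScaler().fit_transform(features)   # makes coefficients comparable
    clf = LogisticRegression(max_iter=1000).fit(X_std, clusters)
    print(clf.coef_)    # one row per cluster; signs and magnitudes read as on this slide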

Confounding matters

  • Audience Question: Can the descriptive and explanatory lenses disagree?
  • Let us imagine we have a dataset with two features, weight and height.

  • We fit a logistic regression (or use other explanatory technique of our choice)

    \frac{p}{1-p} = \exp(\beta_0 + \beta_{wt} \mathtt{weight} + \beta_{ht} \mathtt{height})

Can it happen that \beta_{wt} < 0 but \mathbb{E}[\mathtt{weight} | \text{in cluster}] > \mathbb{E}[\mathtt{weight} | \text{out cluster}]?

Problem: Tall and Thin

  • Cluster: “Tall and Thin” people
    • Height and weight are correlated.
    • Descriptive stats may show:
      • Tall and heavier than average.
    • But relative to their height, they are actually lighter than expected.
  • Interesting things happen where the correlation structure breaks
    • People under insolvency = People who are technologically conservative for their age.

Labelling consistency

Problem
  • Nature of methods is often stochastic
  • Even with a fixed random state and exactly the same data set, DR may produce e.g. a rotated or flipped projection and permuted cluster labels
  • Once we refit on a different sample from the same distribution, results might seem completely different at first sight
  • We know the cluster we interpreted as xxx is still there, but it has a different number and different coordinates in low-dim space, so it is hard to find

Solution
  • Formulate as
    • Graph problem: minimum weight perfect matching in bipartite graph
    • Integer programming problem: optimal assignment
  • Such a problem can be solved by the “Hungarian algorithm” (implemented e.g. in scipy); see the sketch below
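
A sketch of label matching via scipy.optimize.linear_sum_assignment (the Hungarian algorithm), assuming both runs label clusters 0..k-1:

    import numpy as np
    from scipy.optimize import linear_sum_assignment
    from sklearn.metrics import confusion_matrix

    def match_labels(ref_labels, new_labels):
        # cost = negative overlap, so the optimal assignment maximizes agreement
        overlap = confusion_matrix(ref_labels, new_labels)
        ref_ids, new_ids = linear_sum_assignment(-overlap)
        mapping = dict(zip(new_ids, ref_ids))
        return np.array([mapping[l] for l in new_labels])

    # aligned = match_labels(labels_previous_fit, labels_refit)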

Validation

  • Two ways to validate clustering
    • Technical
      • Size balance
      • Correct shape
      • Stability across retraining
    • Semantic
      • Clusters make sense
      • Target informativeness
      • Separation on unseen features

Technical Validation: Shape

  • Definition: Clustering is grouping points that are close to each other and far from others
  • Three key concepts
    • Compactness
      • Silhouette, WCSS, Davies–Bouldin
    • Separation
      • Dunn Index, Inter-cluster distance
    • Connectedness
      • CHAS, Graph-based measures
  • Algorithms
    • K-Means, Ward → Compactness
    • DBSCAN, Single Linkage → Connectedness
  • Note: Useful for automation, but don’t over-rely

Silhouette Score (for each point i):

  • s(i) = \frac{b(i) - a(i)}{\max(a(i), b(i))}\in [-1, 1]

    • a(i) = avg. dist. from i to points in the same cluster
    • b(i) = avg. dist. from i to points in the nearest other cluster

Technical Validation: Stability

  • Clusters are stable if they stay similar under:
    • resampling (removing a few points),
    • reinitializations (different random seeds),
    • small changes or noise in the data (perturbations).

  • Rand Index (RI)
    • Measures agreement between two clusterings based on pairs of points: RI = \frac{\text{number of agreeing pairs}}{\text{total number of pairs}}
    • Some pairs will agree by chance, so we adjust for the expected baseline: \text{ARI}(C_1, C_2) = \frac{\text{RI} - \text{Expected RI}}{\text{Max RI} - \text{Expected RI}}
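
A scikit-learn sketch: silhouette for shape, ARI for stability across two re-initializations (toy data, K-Means as an example algorithm):

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_score, adjusted_rand_score

    X = np.random.default_rng(0).normal(size=(300, 5))
    labels_a = KMeans(n_clusters=4, random_state=0, n_init=10).fit_predict(X)
    labels_b = KMeans(n_clusters=4, random_state=1, n_init=10).fit_predict(X)

    print(silhouette_score(X, labels_a))            # compactness vs. separation, in [-1, 1]
    print(adjusted_rand_score(labels_a, labels_b))  # 1 = identical partitions, ~0 = chance level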

Semantic Validation: Target Informativeness

  • Goal: Measure how useful the clustering is for an external business-relevant target (e.g., loan uptake, churn, campaign response)

  • Reformulate to supervised learning:
    Does cluster membership carry information about the target?

Example: Car Loan Uptake

  1. Fit clustering at reference date (e.g. 1.1.2025)
  2. Observe who takes a car loan within 6 months
  3. Compute cluster-level informativeness
    I(\text{cluster}) = \log\!\left(\frac{P(\text{loan} \mid \text{cluster})}{P(\text{loan})}\right)
  4. Overall informativeness of clustering
    I = \sum_{\text{cluster}} P(\text{cluster}) \cdot \,\lvert I(\text{cluster})\rvert
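
A small pandas sketch of steps 3-4 (column names and numbers are illustrative):

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({
        "cluster":   [0, 0, 0, 1, 1, 2, 2, 2, 2, 2],
        "took_loan": [1, 0, 0, 1, 1, 0, 0, 0, 0, 1],
    })

    p_loan = df["took_loan"].mean()                           # P(loan)
    per_cluster = df.groupby("cluster")["took_loan"].agg(["mean", "size"])
    per_cluster["I"] = np.log(per_cluster["mean"] / p_loan)   # I(cluster), the log-lift
    weights = per_cluster["size"] / len(df)                   # P(cluster)
    I_total = (weights * per_cluster["I"].abs()).sum()        # overall informativeness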

Semantic Validation: Omitted Variables

  • Idea: Validate clusters using variables that were not used during clustering.

  • Why it matters: If clusters show clear structure in unseen variables, it means the model captured
    real distinct population patterns, not spurious relationships in the fitted features.

Example: ATM Withdrawals (Unseen Variables)

  • Clustering is trained only on transaction patterns.
  • Big-city ATMs form distinct temporal patterns
    • life in city areas starts at later hours

Prod usage

  • Bringing DR and clustering solutions to prod is… difficult
    • Some techniques are primarily “exploratory” and don’t have a transform method for new observations
    • Many decisions during implementation are crafted manually on dev sample → problem to automate refits
    • Interpretation is often mostly manual → costly re-interpretations

Two goals which go against each other:

Interpretability & stability

  • Fewer, not too small, quite well-separated clusters
  • Nice stories for most of them
  • Robust against small distribution shifts and hyperparameter changes
  • Clusters are still quite variable, and the information loss is too high to produce accurate predictions

Usability in models (predictive power)

  • Microsegments, complex projection structures
  • Some segments might have very interesting backgrounds, others more blended, impossible to interpret everything
  • Can bring high added value on selected microsegments
  • Sensitive to data changes, hyperparams

Extras

Meta-Clustering

  • Clustering is often underspecified, unstable, and difficult to use directly in practice.
  • Rich Caruana’s famous lecture: “Clustering is approximately useless.”
  • Result: growing literature on embedding clustering into more complex, stable, and useful frameworks.

Prototype Clustering (Caruana et al.)

A single clustering is often unstable and heavily dependent on feature weights, distance metrics, preprocessing choices, and the clustering algorithm.

Idea

Search for stable structure by generating a large ensemble of clusterings
(e.g., 100,000 runs with random weights or metrics).

Steps
  1. Run many clusterings under diverse settings.
  2. Measure similarity between cluster assignments
    (e.g., Rand Index, Variation of Information).
  3. Cluster the clusterings themselves → obtain prototype clusterings.
  4. Present prototype clusterings to users and select the most
    useful, interpretable, or actionable one.
Outcome

A small set of representative, stable clusterings that summarizes the landscape of possible solutions and lets the end user pick.

Feedback-Driven Clustering

  • Real-world feedback often expresses constraints such as:

    • Must-Link (ML): “These points belong together.”
    • Cannot-Link (CL): “These points must not be grouped together.”
  • Many methods incorporate such constraints:

    • Constrained K-Means
    • Finding the clustering closest to a baseline while satisfying ML/CL
    • Learning a distance metric consistent with the constraints
  • This enables iterative refinement with domain experts.

  • As the business is part of the solution, they are more likely to be on board.

Voting / Consensus Clustering

Generate multiple clusterings on the same dataset:

  • Different distance metrics
  • Different subsets of variables
  • Different data types (categorical, count, skewed, etc.)

Define a co-association matrix:

  • Distance between two points = proportion of clusterings where they do not appear together.
  • Cluster this matrix to obtain a consensus clustering.
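
A sketch of the co-association construction (K-Means over several seeds stands in for the diverse clusterings described above):

    import numpy as np
    from sklearn.cluster import KMeans, AgglomerativeClustering

    X = np.random.default_rng(0).normal(size=(200, 6))
    runs = [KMeans(n_clusters=4, random_state=s, n_init=10).fit_predict(X) for s in range(10)]

    # co-association: share of runs in which two points land in the same cluster
    co = np.mean([lab[:, None] == lab[None, :] for lab in runs], axis=0)
    dist = 1.0 - co                                           # the distance defined above

    consensus = AgglomerativeClustering(
        n_clusters=4, metric="precomputed", linkage="average"   # older scikit-learn uses affinity=
    ).fit_predict(dist)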

Advantages:

  • Produces a solution that is more stable than any individual clustering.
  • Naturally handles heterogeneous data by mixing clusterings tailored to different variable types.