Advanced Data Preparation / Unsupervised Modelling
…and examples along the way
Dimension Reduction and Clustering can be considered as part of…
Standard Setup: OHE → Standardization → Euclidean
After one-hot + standardization
Compare with height
Note: not part of this lecture
Formally, the j-th principal component is
\mathbf{w}_j = \arg\max_{\mathbf{w} \in \mathbb{R}^p,\ ||\mathbf{w}||=1,\ \mathbf{w} \perp \mathbf{w}_1, \dots, \mathbf{w}_{j-1}} \mathrm{var}(\mathbf{w}^T\mathbf{x}_i)
Compute spectral decomposition of sample covariance matrix (since true is unknown):
\widehat{\mathbb{V}} \propto \mathbb{X}^T\mathbb{X} = \mathbb{W}\mathbb{\Lambda}\mathbb{W}^T → eigenvectors in \mathbb{W} are PCs
Compute SVD of \mathbb{X} = \mathbb{U}\sqrt{\mathbb{\Lambda}}\,\mathbb{W}^T → right singular vectors in \mathbb{W} are PCs (used in practice)
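A minimal numpy sketch of both routes, assuming a column-centred data matrix \mathbb{X} (data and variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X = X - X.mean(axis=0)                  # centre columns (required for PCA)

# Route 1: spectral decomposition of X^T X (proportional to the sample covariance)
eigvals, W_eig = np.linalg.eigh(X.T @ X)
order = np.argsort(eigvals)[::-1]       # eigh returns eigenvalues in ascending order
W_eig = W_eig[:, order]

# Route 2: SVD of X (numerically preferred in practice)
U, s, Wt = np.linalg.svd(X, full_matrices=False)
W_svd = Wt.T                            # right singular vectors = principal components

# the two bases agree up to sign flips of individual columns
assert np.allclose(np.abs(W_eig), np.abs(W_svd))

scores = X @ W_svd[:, :2]               # project onto the first two PCs
```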
Lesson learned: PCA may still be useful for specific data patterns
Distribution represents process of sampling neighbours:
p_{j∣i} := \mathbb{P}(\text{"choose } \mathbf{x}_j \text{ as neighbor"} ∣ \text{centered at }\mathbf{x}_i)
Similar (close) points have a high probability of being picked as neighbours (and vice versa)
Assume generating neighbourhood of \mathbf{x}_i is proportional to Gaussian distribution \mathcal{N}_p(\mathbf{x}_i, \sigma_i^2 \mathbb{I}_p):
p_{j∣i} = \frac{\exp(-\frac{||\mathbf{x}_j - \mathbf{x}_i||_2^2}{2\sigma_i^2})}{\sum_{k \neq i} \exp(-\frac{||\mathbf{x}_k - \mathbf{x}_i||_2^2}{2\sigma_i^2})}
Assume generating neighbourhood of \mathbf{y}_i is proportional to multivariate t-distribution t_q(\mathbf{y}_i, \mathbb{I}_q, 1):
q_{ij} = \frac{(1+||\mathbf{y}_j - \mathbf{y}_i||_2^2)^{-1}}{\sum_{k \neq l} (1+||\mathbf{y}_k - \mathbf{y}_l||_2^2)^{-1}}
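A small numpy sketch of the two neighbourhood distributions; for simplicity σ_i is fixed to a single value here, whereas t-SNE tunes it per point via the perplexity:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 10))    # high-dimensional points x_i
Y = rng.normal(size=(50, 2))     # current low-dimensional embedding y_i

def p_conditional(X, sigma=1.0):
    """p_{j|i}: Gaussian neighbourhood probabilities in the original space."""
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    P = np.exp(-d2 / (2 * sigma ** 2))
    np.fill_diagonal(P, 0.0)                 # a point is not its own neighbour
    return P / P.sum(axis=1, keepdims=True)  # normalise each row over j != i

def q_joint(Y):
    """q_{ij}: Student-t (1 df) neighbourhood probabilities in the embedding."""
    d2 = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    Q = 1.0 / (1.0 + d2)
    np.fill_diagonal(Q, 0.0)
    return Q / Q.sum()                       # normalise over all pairs k != l

P, Q = p_conditional(X), q_joint(Y)
```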
Find kNN (in given metric), compute min. pos. distance to neighbour \rho_i
Find scaling parameter \sigma_i (normalize distances to neighbours keeping proximities in orig. dim.)
Define weight function w(\mathbf{x}_i, \mathbf{x}_j) (closest point has 1, then decreases exponentially)
Define weighted graph G with weights based on symmetrized versions of w(\mathbf{x}_i, \mathbf{x}_j):
\bar{w}_{ij} := w(\mathbf{x}_i, \mathbf{x}_j) + w(\mathbf{x}_j, \mathbf{x}_i) - w(\mathbf{x}_i, \mathbf{x}_j)w(\mathbf{x}_j, \mathbf{x}_i)
Attractive force: move \mathbf{y}_i towards \mathbf{y}_j
\mathbf{y}_i := \mathbf{y}_i + \alpha \cdot f(||\mathbf{y}_i - \mathbf{y}_j||_2^2)\, \bar{w}_{ij}\, (\mathbf{y}_j - \mathbf{y}_i)
(a la GD with learning rate \alpha, f contains hyperparams)
Repulsive force: push \mathbf{y}_i away from \mathbf{y}_k (random non-kNN)
\mathbf{y}_i := \mathbf{y}_i + \alpha \cdot g(||\mathbf{y}_i - \mathbf{y}_k||_2^2) (1-\bar{w}_{ik}) (\mathbf{y}_i - \mathbf{y}_k)
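A schematic sketch of one optimisation step; the kernels f and g, the learning rate, and the weights are placeholders, not UMAP's actual defaults:

```python
import numpy as np

def attract(y_i, y_j, w_bar, alpha=0.1, f=lambda d2: 1.0 / (1.0 + d2)):
    """Pull y_i towards its graph neighbour y_j (f is a placeholder kernel)."""
    d2 = np.sum((y_i - y_j) ** 2)
    return y_i + alpha * f(d2) * w_bar * (y_j - y_i)

def repel(y_i, y_k, w_bar, alpha=0.1, g=lambda d2: 1.0 / (1.0 + d2)):
    """Push y_i away from a randomly sampled non-neighbour y_k."""
    d2 = np.sum((y_i - y_k) ** 2)
    return y_i + alpha * g(d2) * (1.0 - w_bar) * (y_i - y_k)

y_i = np.array([0.0, 0.0])
y_i = attract(y_i, np.array([1.0, 1.0]), w_bar=0.8)   # strong edge: pull together
y_i = repel(y_i, np.array([0.2, 0.1]), w_bar=0.05)    # weak edge: push apart
```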
Observation \mathbf{x}_i + each of n_{NB} nearest neighbours
Distance used: scaled Euclidean ||\mathbf{x}_i - \mathbf{x}_j||_2^2 / (\sigma_i \sigma_j)
(scale to account for different magnitudes of neighbourhoods: larger scale if neighbours are far)
Hyperparams: n_{NB} = 10, MN_\mathrm{ratio} = 0.5, FP_\mathrm{ratio} = 2
Loss function:
Init: PCA or random \mathcal{N}_q(\mathbf{o}, 10^{-4}\mathbb{I}_q)
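A rough sketch of the scaled distance above; σ_i is taken here as the mean distance to the n_{NB} nearest neighbours, which is an illustrative simplification rather than the exact definition:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 8))
n_nb = 10

# distances to the n_NB nearest neighbours (index 0 is the point itself)
dist, _ = NearestNeighbors(n_neighbors=n_nb + 1).fit(X).kneighbors(X)
sigma = dist[:, 1:].mean(axis=1)     # per-point neighbourhood scale (illustrative choice)

i, j = 0, 1
d_scaled = np.sum((X[i] - X[j]) ** 2) / (sigma[i] * sigma[j])   # scaled squared Euclidean
```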
WCSS = \sum_{i=1}^{k} \sum_{x \in C_i} \|x - \mu_i\|^2
Theorem: WCSS decreases monotonically with each iteration:
WCSS^{(1)} \ge WCSS^{(2)} \ge WCSS^{(3)} \ge \dots \ge WCSS^{(T)} (superscripts index iterations, T = number of iterations)
\mu_i = \arg\min_{t} \frac{1}{|C_i|} \sum_{x \in C_i} \|x - t\|^2 (the cluster mean minimizes the within-cluster sum of squares)
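A minimal numpy sketch of the k-means iterations with the WCSS recorded after every update; the sequence is non-increasing, matching the theorem above (data and k are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(300, 2))
k = 3
centers = X[rng.choice(len(X), size=k, replace=False)]   # random init from data points

def wcss(X, centers, labels):
    return sum(np.sum((X[labels == c] - centers[c]) ** 2) for c in range(len(centers)))

history = []
for _ in range(20):
    # assignment step: nearest centre in Euclidean distance
    labels = np.argmin(((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1), axis=1)
    # update step: each centre becomes the mean of its cluster
    centers = np.array([X[labels == c].mean(axis=0) for c in range(k)])
    history.append(wcss(X, centers, labels))

assert all(a >= b - 1e-9 for a, b in zip(history, history[1:]))   # monotone decrease
```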
The technique gained traction in the 1960s and 1970s.
Result: The Dendrogram
Cluster distance defined as the minimum pairwise distance
D(A,B) = \min_{i \in A, j \in B} d(i,j)
Builds clusters by connecting nearest neighbors
Sensitive to noise and outliers; may form long chains of points (chaining effect)
Characteristics
Cluster distance defined as the maximum pairwise distance
D(A,B) = \max_{i \in A, j \in B} d(i,j)
Merges clusters only when all points in both clusters are relatively close
Strong preference for small-diameter, tightly packed clusters
Characteristics
Cluster distance = average pairwise distance
D(A,B) = \frac{1}{|A||B|} \sum_{i \in A, j \in B} d(i,j)
Balances between single (min) and complete (max) linkage
Characteristics
Merges clusters that cause the smallest increase in within-cluster variance
D(A,B) = \Delta WCSS = WCSS_{A \cup B} - (WCSS_A + WCSS_B)
Equivalent to hierarchical variance minimization (similar to k-means)
| Method | Definition | Produces | Fails for |
|---|---|---|---|
| Single | \min d(i,j) | Elongated, chained clusters | Noisy data, chaining effect |
| Average | \text{mean } d(i,j) | Balanced, compact clusters | Elongated or uneven-density clusters |
| Ward | \Delta\text{SSE} | Compact, spherical clusters | Elongated or irregular clusters |
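A short scipy sketch comparing the linkage rules from the table on illustrative data (two blobs plus a thin bridge of points that tends to trigger chaining):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0, 0.3, (50, 2)),
               rng.normal(3, 0.3, (50, 2)),
               np.column_stack([np.linspace(0.5, 2.5, 10), np.zeros(10)])])

for method in ["single", "complete", "average", "ward"]:
    Z = linkage(X, method=method)                 # merge history (encodes the dendrogram)
    labels = fcluster(Z, t=2, criterion="maxclust")
    print(method, np.bincount(labels)[1:])        # cluster sizes differ by linkage
```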
Traditional clustering algorithms struggle with non-convex shapes
Many datasets contain noise points (measurement errors, anomalies)
Cannot learn the number of clusters directly from data
Solution: DBSCAN
Two key parameters: ε (neighbourhood radius) and MinPts (minimum number of points in an ε-neighborhood)
Intuition:
A point belongs to a cluster if it is in a dense region (≥ MinPts in ε-neighborhood)
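A hedged sklearn sketch; the eps and min_samples values are illustrative, not a recommendation:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# non-convex shapes plus noise: the setting where k-means struggles
X, _ = make_moons(n_samples=300, noise=0.08, random_state=0)

db = DBSCAN(eps=0.2, min_samples=5).fit(X)    # ε-neighbourhood radius and MinPts
labels = db.labels_                           # -1 marks noise points
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print(n_clusters, int(np.sum(labels == -1)))  # number of clusters and noise points
```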
We clustered banking clients based on
MCC spending categories, temporal expenditure patterns,
and related behavioural features.
Goal: derive customer prototypes for high-quality
synthetic data generation.
Our priority was obtaining clean, interpretable prototypes.
Edge points (border/noise points) would distort prototype profiles.
Descriptive approach using stats and visuals:
Example: for the feature weight, check whether
\mathbb{E}[\mathtt{weight} \mid \text{in cluster}] > \mathbb{E}[\mathtt{weight} \mid \text{out cluster}]
Explanatory approach: interpret helper model
Example: Considering features weight and height, we fit a cluster membership model
\frac{p}{1-p}= \exp(\beta_0 + \beta_{wt} \mathtt{weight} + \beta_{ht} \mathtt{height})
If \beta_{wt} > 0, clients with higher weight are more likely to belong to the cluster (while holding the rest constant); if additionally |\beta_{wt}| > |\beta_{ht}| (on standardized features), weight is a higher contributor than height.
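A sketch of the explanatory approach on hypothetical weight/height data; the cluster labels come from an arbitrary clustering run, and the choice of k-means and of cluster 0 as the target is purely illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(5)
X = np.column_stack([rng.normal(70, 10, 300),    # weight (kg, hypothetical)
                     rng.normal(175, 8, 300)])   # height (cm, hypothetical)

Xs = StandardScaler().fit_transform(X)           # standardise so coefficients are comparable
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(Xs)

# helper model: membership in cluster 0 vs. the rest
y = (labels == 0).astype(int)
clf = LogisticRegression().fit(Xs, y)
beta_wt, beta_ht = clf.coef_[0]
print(f"beta_weight={beta_wt:.2f}, beta_height={beta_ht:.2f}")
```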
Let us imagine we have a dataset with two features, weight and height.
We fit a logistic regression (or use another explanatory technique of our choice)
\frac{p}{1-p} = \exp(\beta_0 + \beta_{wt} \mathtt{weight} + \beta_{ht} \mathtt{height})
Can it happen that \beta_{wt} < 0 but \mathbb{E}[\mathtt{weight} | \text{in cluster}] > \mathbb{E}[\mathtt{weight} | \text{out cluster}]?
Silhouette Score (for each point i, with a(i) the mean distance to points in its own cluster and b(i) the mean distance to points in the nearest other cluster):
s(i) = \frac{b(i) - a(i)}{\max(a(i), b(i))}\in [-1, 1]
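A short sklearn sketch on illustrative data; silhouette_score averages s(i) over all points, silhouette_samples returns the per-point values:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score, silhouette_samples

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, round(silhouette_score(X, labels), 3))   # higher is better, in [-1, 1]

# per-point values s(i), e.g. to spot poorly assigned points
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)
s_i = silhouette_samples(X, labels)
```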
Goal: Measure how useful the clustering is for an external business-relevant target (e.g., loan uptake, churn, campaign response)
Reformulate to supervised learning:
Does cluster membership carry information about the target?
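One way to sketch this check: use cluster membership as the only feature and see whether it predicts the external target better than chance (the target here is simulated, purely for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, blob_id = make_blobs(n_samples=600, centers=3, random_state=0)
# hypothetical business target, loosely tied to the latent structure plus 10% label noise
rng = np.random.default_rng(0)
y = ((blob_id == 0) ^ (rng.random(600) < 0.1)).astype(int)

clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
Z = np.eye(3)[clusters]                   # one-hot cluster membership as the only features

auc = cross_val_score(LogisticRegression(), Z, y, cv=5, scoring="roc_auc").mean()
print(f"cluster-only AUC: {auc:.2f}")     # ~0.5 would mean no information about the target
```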
Idea: Validate clusters using variables that were not used during clustering.
Why it matters: If clusters show clear structure in unseen variables, it means the model captured
real distinct population patterns, not spurious relationships in the fitted features.
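A sketch of this check; the held-out variable is simulated here, while in practice it would be a real column deliberately excluded from the clustering features:

```python
import numpy as np
from scipy.stats import f_oneway
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, blob_id = make_blobs(n_samples=600, centers=3, random_state=1)
rng = np.random.default_rng(1)
holdout = blob_id * 2.0 + rng.normal(0, 1, 600)   # unseen variable, correlated with the structure

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)   # clustering never sees `holdout`

groups = [holdout[labels == c] for c in np.unique(labels)]
print([round(g.mean(), 2) for g in groups])       # per-cluster means of the unseen variable
print(f_oneway(*groups))                          # ANOVA: do the cluster means differ?
```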
Two goals that pull against each other:
Interpretability & stability
Usability in models (predictive power)
A single clustering is often unstable and heavily dependent on feature weights, distance metrics, preprocessing choices, and the clustering algorithm.
Search for stable structure by generating a large ensemble of clusterings
(e.g., 100,000 runs with random weights or metrics).
A small set of representative, stable clusterings that summarize the landscape of possible solutions and allow the end user to pick.
Real-world feedback often expresses constraints such as:
Many methods incorporate such constraints:
This enables iterative refinement with domain experts.
As the business is part of the solution, they are more likely to be on board.
Generate multiple clusterings on the same dataset:
Define a co-association matrix: M_{ij} = fraction of clusterings in which observations i and j fall in the same cluster (see the sketch after this list)
Advantages:
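A compact sketch of the ensemble idea: many k-means runs with random feature weights collected into a co-association matrix, then a consensus clustering on (1 − co-association) used as a distance; every choice here (k-means, 50 runs, average linkage) is illustrative:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200, centers=3, random_state=2)
rng = np.random.default_rng(2)
n, n_runs = len(X), 50

coassoc = np.zeros((n, n))
for _ in range(n_runs):
    w = rng.uniform(0.1, 1.0, X.shape[1])            # random feature weights
    labels = KMeans(n_clusters=3, n_init=5).fit_predict(X * w)
    coassoc += (labels[:, None] == labels[None, :])  # 1 if i and j share a cluster in this run
coassoc /= n_runs                                    # fraction of runs in which i, j co-cluster

# consensus clustering: average linkage on the co-association "distance"
dist = 1.0 - coassoc
np.fill_diagonal(dist, 0.0)
Z = linkage(squareform(dist, checks=False), method="average")
consensus = fcluster(Z, t=3, criterion="maxclust")
```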