Data Understanding

The Importance of Thinking

Outline

  • What is DU
    • Goals
    • Three sources of DU
    • Why is it important?
    • What to find out - basics
  • DU steps in a project
      1. Data & Existing Knowledge Collection
      2. First looks into the data
      3. Data description (profiling)
      4. Data exploration

…and many examples along the road

What is Data Understanding (DU)?

Goals

  • Increase familiarity with the data via profiling and exploration
  • Identify data quality problems and potential issues
  • Discover initial insights into the data
  • Find interesting patterns
  • Generate hypotheses for further phases of analysis

Three sources of understanding

1. Existing knowledge base

What others know about the data

2. Data themselves

What we can find from the data

3. Own critical thinking

To compare and verify outcomes of previous sources

Why is it important?

Make informed decisions about how to proceed

  • Helps with justification for the business
  • What should we focus on?
    • Don’t waste resources: many data sets, columns or rows may be irrelevant
    • Set right order of next tasks: from core to the rest & from simple to difficult
  • Key inputs for next CRISP phases
    • Which data preprocessing is needed?

    • Which modelling techniques will be suitable?

      (no. of obs., no. of potential features, NAs, dtypes, relationships between target and variables or variables themselves…)

Tackle problems more easily

  • Prevents misinterpretation and incorrect conclusions
  • Early detection of problematic columns → saved effort in mining features and crafting models that depend on them
  • Something doesn’t work as expected → without DU it’s difficult to track down the issue - where would you even start?

Example: loan application

About

  • UK FinTech StartUp
  • Consumer loans via aggregating websites
    • Form: Fill info about you
    • API calls to credit bureau, income verification…
    • Competitors offer rate → customer selects among offers
  • Task: predict whether the startup’s offer is clicked, optimize price to improve the click rate

Importance of DU

  • First model iteration (quick and dirty) - a lot of added value from the employment sector and position features
  • The reason? A data leak

What to find out - basics

Tip

Documenting findings in a reproducible way already during your exploration >>> generating stats and plots at random and then trying to make sense of them retrospectively.

More in week 4

Properties of a data set

  • File format (DB table, csv, binary, audiovisual…)
  • Population of entities → what does the data set cover and what does it not?
  • Data creation pipeline
    • Original data source
    • Owner and users with access
    • Process of verification, correction, cleaning, consolidation
    • Anonymization / pseudonymization
    • → where to ask for a repeated ingest? how much can we trust the data?

Properties of a data set (cont.)

  • Size
    • small / big data (does it fit in RAM / on disk?)
    • no. of rows, columns
    • → data processing engine
  • Historization
    • fixed / changes in time (valid date available?)
    • → what can be used for modelling?
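
A minimal sketch of these size and historization checks in pandas; the file name and the valid_from column are hypothetical placeholders, not part of the original material.

```python
import pandas as pd

df = pd.read_csv("loans.csv")                    # hypothetical data source

# Size: does it fit comfortably in RAM? Which processing engine do we need?
print(df.shape)                                  # no. of rows, columns
print(df.memory_usage(deep=True).sum() / 1e6)    # approx. size in MB
print(df.dtypes)                                 # column types at a glance

# Historization: is there a validity date, and what period does it cover?
if "valid_from" in df.columns:                   # assumed column name
    print(pd.to_datetime(df["valid_from"]).agg(["min", "max"]))
```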

Data sets relations

  • Relation keys (possibility of missings, uniqueness…)
  • Relation cardinality stats (no. of records per basic entity)
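
A sketch of basic relation-key checks, assuming two hypothetical tables (customers and applications) joined on customer_id.

```python
import pandas as pd

customers = pd.read_csv("customers.csv")         # hypothetical tables
applications = pd.read_csv("applications.csv")

# Key quality: uniqueness and missings on both sides of the relation
print(customers["customer_id"].is_unique)
print(applications["customer_id"].isna().mean())

# Orphaned records: applications whose customer_id has no match
orphans = ~applications["customer_id"].isin(customers["customer_id"])
print(orphans.sum())

# Relation cardinality: no. of applications per customer
print(applications.groupby("customer_id").size().describe())
```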

Properties of variables in a data set

  • Meaning (incl. code sheets)
  • Value origin - exact value / estimated / model result?
  • Types (numeric / categorical / time / text…) and formats (decimal places, encoding…)
  • Missings - possible or not, encoding
  • Errors, outliers and specificities
  • Relevance to the addressed problem
  • Summary statistics
  • Distribution visualization - more next week
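
A per-variable overview along these lines (types, missings, cardinality, summary statistics); the data set and its columns are whatever you load, nothing here is tied to a specific schema.

```python
import pandas as pd

df = pd.read_csv("loans.csv")   # hypothetical data set

# One row per variable: type, missingness, cardinality, an example value
overview = pd.DataFrame({
    "dtype": df.dtypes.astype(str),
    "pct_missing": df.isna().mean().round(3),
    "n_distinct": df.nunique(),
    "example": df.apply(lambda s: s.dropna().iloc[0] if s.notna().any() else None),
})
print(overview.sort_values("pct_missing", ascending=False))

# Quick summary statistics for numeric and categorical columns
print(df.describe(include=["number"]))
print(df.describe(include=["object"]))
```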

Variables relations

  • Correlations and dependencies
  • Special focus: relation to target variable
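
A sketch for checking pairwise correlations and the relation to a hypothetical binary target column named clicked; the employment_sector column is also just illustrative.

```python
import pandas as pd

df = pd.read_csv("offers.csv")   # hypothetical data set with a 0/1 target "clicked"

# Pairwise correlations among numeric variables
corr = df.select_dtypes("number").corr()
print(corr.round(2))

# Correlation of each numeric variable with the target, strongest first
print(corr["clicked"].drop("clicked").sort_values(key=abs, ascending=False))

# Categorical variable vs. target: compare click rates across levels
print(df.groupby("employment_sector")["clicked"].mean().sort_values())
```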

DU steps in a project

1. Data & Existing Knowledge Collection

  • Gather access to all relevant data sources (databases, files etc.)
    • Security concerns
    • Anonymization / pseudonymization
    • Collection difficulties…
  • Get all documentation available
    • We don’t have documentation
    • It hasn’t been updated for years
  • Meet people who work with the data
    • We don’t have time to explain…
    • It’s clear from the names what the data mean…

Using AI

  • Transcriptions & meeting summaries
  • Documentation - extraction, summarization

Warning

Be careful about security and GDPR.

Be doubtful - check outcomes and don’t rely on them blindly.

2. First looks into the data

  • It pays off to spend some time just looking at the data
    • Confronting the first impressions gained during BU or while collecting existing data knowledge
    • Inspiration for sanity checks
    • Helps to formulate first hypotheses
  • Use small random sample (SRSWOR, stratified…)
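
A sketch of drawing a small first-look sample: simple random sampling without replacement (SRSWOR, which pandas' sample does by default) and a stratified variant; the stratification column region is hypothetical.

```python
import pandas as pd

df = pd.read_csv("applications.csv")   # hypothetical data set

# Simple random sample without replacement (SRSWOR)
srs = df.sample(n=1000, random_state=42)

# Stratified sample: the same fraction drawn from each level of a stratum column
strat = df.groupby("region", group_keys=False).sample(frac=0.01, random_state=42)

print(srs.head())
print(strat["region"].value_counts(normalize=True))   # check strata proportions
```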

Example: loan application

  • BU stage: after scoring, there are two streams, either of which may lead to decline or approval
    • auto stream
    • manual stream - a loan expert evaluates the customer
  • First check of the Application Status Table - it’s a bit more complicated…
Example application 1:

time      state
10:00:00  scoring
10:00:02  manual check
12:00:00  scoring

Example application 2:

time      state
10:00:00  scoring
10:00:02  waits for income verif docs
10:15:00  scoring
10:15:02  income verif failed
10:15:03  waits for manual income verif
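
A first-look sketch for such a status table: pull the full status history of a few randomly chosen applications and eyeball the sequences. The table and column names (status_log, application_id, time, state) are hypothetical.

```python
import pandas as pd

log = pd.read_csv("status_log.csv", parse_dates=["time"])   # hypothetical table

# Pick a handful of applications and look at their complete status history
sample_ids = log["application_id"].drop_duplicates().sample(5, random_state=1)
first_look = (
    log[log["application_id"].isin(sample_ids)]
    .sort_values(["application_id", "time"])
)
print(first_look[["application_id", "time", "state"]])
```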

3. Data description (profiling)

(Tool logos: ydata-profiling, DataPrep)

Tip

  • Focus on overall and 1D descriptions
  • Use sample if necessary
  • Subset columns if necessary
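
A minimal profiling sketch with ydata-profiling (one of the tools hinted at by the logos), run on a sample and a subset of columns as the tip suggests; the file and column names are placeholders.

```python
import pandas as pd
from ydata_profiling import ProfileReport   # pip install ydata-profiling

df = pd.read_csv("loans.csv")               # hypothetical data set

# Profile a sample and a subset of columns to keep the report fast
subset = df.sample(n=10_000, random_state=42)[["amount", "rate", "employment_sector"]]
report = ProfileReport(subset, title="Loans - data profiling", minimal=True)
report.to_file("loans_profile.html")        # overall and 1D descriptions in one HTML
```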

Variable types

  • Numeric (quantitative)
    • By metric type
      • Continuous - uncountable (height…)
      • Discrete - countable (count of transactions…)
    • By divisibility
      • Ratio - ratio of values makes sense (weight, age…)
      • Interval - ratio doesn’t make sense (temp. in °C, cal. year…)
  • Categorical (qualitative)
    • Nominal - no natural ordering (marital status, city…)
    • Ordinal - categories can be sorted and compared (education…)
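
A sketch of how these types can be made explicit in pandas, e.g. marking an ordinal variable so that sorting and comparisons respect the category order; the example values are purely illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "education": ["secondary", "primary", "university", "secondary"],
    "marital_status": ["single", "married", "single", "divorced"],
    "n_transactions": [3, 0, 12, 5],            # numeric, discrete
})

# Nominal: plain categorical, no ordering
df["marital_status"] = df["marital_status"].astype("category")

# Ordinal: ordered categorical, so comparisons and sorting make sense
edu_order = ["primary", "secondary", "university"]
df["education"] = pd.Categorical(df["education"], categories=edu_order, ordered=True)
print(df["education"].min(), df["education"].max())
```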

From exploration perspective

  • Summary stats and visualization techniques for numerics are the same whether the variable is continuous or discrete
  • Special types
    • Binary (boolean) - categorical (nominal or ordinal); can be reformulated as an indicator → behaves a bit like a numeric (stats such as the mean make sense) (gender, open/closed…)
    • Date/Time - numeric, but many of the usual stats are irrelevant
    • Keys - usually represented as discrete numerics or strings, but the primary interest is uniqueness and missings

Variable distributions

  • discrete / continuous
    • Num.: discrete with infinite support (Poisson…) or continuous (normal, exponential…) distribution
    • Cat.: discrete distribution with finite support (Bernoulli, categorical)
  • symmetrical / skewed
  • heavy-tailed
  • unimodal / multimodal

Summary statistics

  • Location - what is a typical value? where is the main mass of the distribution located?
    • Num.: mean, median, trimmed mean…
    • Cat.: mode (median only for ordinal)
  • Variability - how much can values differ? how spread are they?
    • Num.: SD, IQR, MAD, CV, range…
    • Cat.: levels count
  • Distribution
    • Num.: quantiles, skewness, kurtosis
    • Cat.: proportions per level
  • Other artifacts
    • % of NAs, distinct, zeroes, positive/negative, infinite
    • monotonicity
    • outliers (extreme values)
    • common values
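
A sketch of computing some of these statistics for one hypothetical numeric column (amount); the trimmed mean comes from scipy, and the MAD is computed by hand since pandas dropped Series.mad().

```python
import pandas as pd
from scipy import stats

df = pd.read_csv("loans.csv")    # hypothetical data set
x = df["amount"]                 # hypothetical numeric column

summary = {
    # Location
    "mean": x.mean(),
    "median": x.median(),
    "trimmed_mean_10pct": stats.trim_mean(x.dropna(), 0.1),
    # Variability
    "std": x.std(),
    "iqr": x.quantile(0.75) - x.quantile(0.25),
    "mad": (x - x.median()).abs().median(),
    # Distribution shape
    "skewness": x.skew(),
    "kurtosis": x.kurt(),
    # Other artifacts
    "pct_missing": x.isna().mean(),
    "pct_zero": (x == 0).mean(),
    "n_distinct": x.nunique(),
}
print(pd.Series(summary).round(3))
```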

4. Data exploration

  • aka EDA (Exploratory Data Analysis)
  • Digging deeper
    • Specificities of variables
    • Relationships between variables
    • Investigating whether hypotheses could be true
    • Finding patterns
  • Using more advanced visualizations (and possibly tables)
    • More next week
  • Later: focusing on specific parts of data

Using AI

  • Currently
    • Tools automate routine aggregations and plotting (good old automation, no genAI)
    • AI can help drafting ideas: generating more hypotheses
    • Interpretation is up to you

  • Future?
    • Describe what table / plot says in human language
    • Pin interesting observations
    • Validate against existing knowledge
    • Recommend cleaning steps

Example: loan application

EDA - possible status changes on aggregated level

  • Inspired by first look into the data → questions:
    • How frequent is the manual check? How does it affect the duration of the application process? Does it have an effect on acceptance?
    • Where does income verification sit in the process? Does it always happen the same way?
    • What about the rest of the process (before and after the scoring part) - is it simpler?
  • Confrontation of three sources of DU
    • BU / meetings to gather existing knowledge: some status transitions are not possible
    • The data tell us otherwise - and it does not seem to be a random error (see the transition-count sketch at the end of this example)
    • Our own thinking → brainstorm ideas why that might be
  • Inputs for further steps
    • Second loop of BU
      • Are our findings valid? Why was the initial impression different?
      • Why does income verification happen in several different ways?
    • Data selection based on quality
      • What can be mined without leaks?
      • Are there any applications we should drop?
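
A sketch of the aggregated transition check referenced above: count how often each status is followed by each other status and compare against the transitions BU said were possible. The table, column names, and the allowed set are hypothetical.

```python
import pandas as pd

log = pd.read_csv("status_log.csv", parse_dates=["time"])   # hypothetical table

# Order events within each application, then pair each state with the next one
log = log.sort_values(["application_id", "time"])
log["next_state"] = log.groupby("application_id")["state"].shift(-1)

# Aggregated transition counts: rows = current state, columns = next state
transitions = pd.crosstab(log["state"], log["next_state"])
print(transitions)

# Transitions that BU considered impossible but the data contain anyway
# (the allowed set below is purely illustrative)
allowed = {("scoring", "manual check"), ("manual check", "scoring")}
observed = transitions.stack()
suspicious = observed[(observed > 0) & ~observed.index.isin(allowed)]
print(suspicious.sort_values(ascending=False))
```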