Data Understanding

The Importance of Thinking

Outline

  • What is DU
    • Goals
    • Three sources of DU
    • Why is it important?
    • What to find out - basics
  • DU steps in a project
      1. Data & Existing Knowledge Collection
      2. First looks into the data
      3. Data description (profiling)
      4. Data exploration

…and many examples along the road

What is Data Understanding (DU)?

Goals

  • Increase familiarity with the data via profiling and exploration
  • Identify data quality problems and potential issues
  • Discover initial insights into the data
  • Find interesting patterns
  • Generate hypotheses for further phases of analysis

Three sources of understanding

1. Existing knowledge base

What others know about the data

2. Data themselves

What we can find from the data

3. Own critical thinking

To compare and verify outcomes of previous sources

Why is it important?

Make informed decisions about how to proceed

  • Helps with justification for the business
  • What should we focus on?
    • Don’t waste resources: many data sets, columns or rows may be irrelevant
    • Set right order of next tasks: from core to the rest & from simple to difficult
  • Key inputs for next CRISP phases
    • Which data preprocessing is needed?

    • Which modelling techniques will be suitable?

      (no. of obs., no. of potential features, NAs, dtypes, relationships between target and variables or variables themselves…)

Tackle problems more easily

  • Prevents misinterpretation and incorrect conclusions
  • Early detection of problematic columns → saved effort in mining features and crafting models that depend on them
  • Something doesn’t work as expected → without DU it’s difficult to track down the issue - where would you even start?

Example: loan application

About

  • UK FinTech StartUp
  • Consumer loans via aggregating websites
    • Form: Fill info about you
    • API calls to credit bureau, income verification…
    • Competitors offer rate → customer selects among offers
  • Task: predict whether the startup’s offer is clicked, optimize price to improve the click rate

Importance of DU

  • First model iteration (quick and dirty) - a lot of added value from the employment sector and position features
  • The reason? A data leak

What to find out - basics

Tip

Documenting findings in a reproducible way already during your exploration >>> generating stats and plots at random and then trying to make sense of them retrospectively.

More in week 4

Properties of a data set

  • File format (DB table, csv, binary, audiovisual…)
  • Population of entities → what does the data set cover and what does it not?
  • Data creation pipeline
    • Original data source
    • Owner and users with access
    • Process of verification, correction, cleaning, consolidation
    • Anonymization / pseudonymization
    • → where to ask for a repeated ingest? how much can we trust the data?

Properties of a data set (cont.)

  • Size
    • small / big data (does it fit in RAM / on disk?)
    • no. of rows, columns
    • → data processing engine
  • Historization
    • fixed / changes in time (valid date available?)
    • → what can be used for modelling?
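
A minimal sketch of these size and historization checks in pandas; the file name and the valid_from column are hypothetical placeholders, not part of the original material.

```python
import pandas as pd

df = pd.read_csv("loans.csv")                    # hypothetical data source

# Size: does it fit comfortably in RAM? Which processing engine do we need?
print(df.shape)                                  # no. of rows, columns
print(df.memory_usage(deep=True).sum() / 1e6)    # approx. size in MB
print(df.dtypes)                                 # column types at a glance

# Historization: is there a validity date, and what period does it cover?
if "valid_from" in df.columns:                   # assumed column name
    print(pd.to_datetime(df["valid_from"]).agg(["min", "max"]))
```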

Data sets relations

  • Relation keys (possibility of missings, uniqueness…)
  • Relation cardinality stats (no. of records per basic entity)
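
A sketch of basic relation-key checks, assuming two hypothetical tables (customers and applications) joined on customer_id.

```python
import pandas as pd

customers = pd.read_csv("customers.csv")         # hypothetical tables
applications = pd.read_csv("applications.csv")

# Key quality: uniqueness and missings on both sides of the relation
print(customers["customer_id"].is_unique)
print(applications["customer_id"].isna().mean())

# Orphaned records: applications whose customer_id has no match
orphans = ~applications["customer_id"].isin(customers["customer_id"])
print(orphans.sum())

# Relation cardinality: no. of applications per customer
print(applications.groupby("customer_id").size().describe())
```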

Properties of variables in a data set

  • Meaning (incl. code sheets)
  • Value origin - exact value / estimated / model result?
  • Types (numeric / categorical / time / text…) and formats (decimal places, encoding…)
  • Missings - possible or not, encoding
  • Errors, outliers and specificities
  • Relevance to the addressed problem
  • Summary statistics
  • Distribution visualization - more next week
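
A per-variable overview along these lines (types, missings, cardinality, summary statistics); the data set and its columns are whatever you load, nothing here is tied to a specific schema.

```python
import pandas as pd

df = pd.read_csv("loans.csv")   # hypothetical data set

# One row per variable: type, missingness, cardinality, an example value
overview = pd.DataFrame({
    "dtype": df.dtypes.astype(str),
    "pct_missing": df.isna().mean().round(3),
    "n_distinct": df.nunique(),
    "example": df.apply(lambda s: s.dropna().iloc[0] if s.notna().any() else None),
})
print(overview.sort_values("pct_missing", ascending=False))

# Quick summary statistics for numeric and categorical columns
print(df.describe(include=["number"]))
print(df.describe(include=["object"]))
```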

Variables relations

  • Correlations and dependencies
  • Special focus: relation to target variable
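
A sketch for checking pairwise correlations and the relation to a hypothetical binary target column named clicked; the employment_sector column is also just illustrative.

```python
import pandas as pd

df = pd.read_csv("offers.csv")   # hypothetical data set with a 0/1 target "clicked"

# Pairwise correlations among numeric variables
corr = df.select_dtypes("number").corr()
print(corr.round(2))

# Correlation of each numeric variable with the target, strongest first
print(corr["clicked"].drop("clicked").sort_values(key=abs, ascending=False))

# Categorical variable vs. target: compare click rates across levels
print(df.groupby("employment_sector")["clicked"].mean().sort_values())
```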

DU steps in a project

1. Data & Existing Knowledge Collection

  • Gather access to all relevant data sources (databases, files etc.)
    • Security concerns
    • Anonymization / pseudonymization
    • Collection difficulties…
  • Get all documentation available
    • We don’t have documentation
    • It hasn’t been updated for years
  • Meet people who work with the data
    • We don’t have time to explain…
    • It’s clear from the names what the data mean…

Using AI

  • Transcriptions & meeting summaries
  • Documentation - extraction, summarization

Warning

Be careful about security and GDPR.

Be doubtful - check outcomes and don’t rely on them blindly.

2. First looks into the data

  • It pays off to spend some time just looking at the data
    • Confronting the first impressions gained during BU or while collecting existing data knowledge
    • Inspiration for sanity checks
    • Helps to formulate first hypotheses
  • Use small random sample (SRSWOR, stratified…)
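
A sketch of drawing a small first-look sample: simple random sampling without replacement (SRSWOR, which pandas' sample does by default) and a stratified variant; the stratification column region is hypothetical.

```python
import pandas as pd

df = pd.read_csv("applications.csv")   # hypothetical data set

# Simple random sample without replacement (SRSWOR)
srs = df.sample(n=1000, random_state=42)

# Stratified sample: the same fraction drawn from each level of a stratum column
strat = df.groupby("region", group_keys=False).sample(frac=0.01, random_state=42)

print(srs.head())
print(strat["region"].value_counts(normalize=True))   # check strata proportions
```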

Example: loan application

  • BU stage: after scoring, there are two streams, either of which may lead to decline or approval
    • auto stream
    • manual stream - a loan expert evaluates the customer
  • First check of the Application Status Table - it’s a bit more complicated…
Example application 1:

time      state
10:00:00  scoring
10:00:02  manual check
12:00:00  scoring

Example application 2:

time      state
10:00:00  scoring
10:00:02  waits for income verif docs
10:15:00  scoring
10:15:02  income verif failed
10:15:03  waits for manual income verif
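
A first-look sketch for such a status table: pull the full status history of a few randomly chosen applications and eyeball the sequences. The table and column names (status_log, application_id, time, state) are hypothetical.

```python
import pandas as pd

log = pd.read_csv("status_log.csv", parse_dates=["time"])   # hypothetical table

# Pick a handful of applications and look at their complete status history
sample_ids = log["application_id"].drop_duplicates().sample(5, random_state=1)
first_look = (
    log[log["application_id"].isin(sample_ids)]
    .sort_values(["application_id", "time"])
)
print(first_look[["application_id", "time", "state"]])
```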

3. Data description (profiling)

(Tool logos: ydata-profiling, DataPrep)

Tip

  • Focus on overall and 1D descriptions
  • Use sample if necessary
  • Subset columns if necessary
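
A minimal profiling sketch with ydata-profiling (one of the tools hinted at by the logos), run on a sample and a subset of columns as the tip suggests; the file and column names are placeholders.

```python
import pandas as pd
from ydata_profiling import ProfileReport   # pip install ydata-profiling

df = pd.read_csv("loans.csv")               # hypothetical data set

# Profile a sample and a subset of columns to keep the report fast
subset = df.sample(n=10_000, random_state=42)[["amount", "rate", "employment_sector"]]
report = ProfileReport(subset, title="Loans - data profiling", minimal=True)
report.to_file("loans_profile.html")        # overall and 1D descriptions in one HTML
```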

Variable types

  • Numeric (quantitative)
    • By metric type
      • Continuous - uncountable (height…)
      • Discrete - countable (count of transactions…)
    • By divisibility
      • Ratio - ratio of values makes sense (weight, age…)
      • Interval - ratio doesn’t make sense (temp. in °C, cal. year…)
  • Categorical (qualitative)
    • Nominal - no natural ordering (marital status, city…)
    • Ordinal - categories can be sorted and compared (education…)
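
A sketch of how these types can be made explicit in pandas, e.g. marking an ordinal variable so that sorting and comparisons respect the category order; the example values are purely illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "education": ["secondary", "primary", "university", "secondary"],
    "marital_status": ["single", "married", "single", "divorced"],
    "n_transactions": [3, 0, 12, 5],            # numeric, discrete
})

# Nominal: plain categorical, no ordering
df["marital_status"] = df["marital_status"].astype("category")

# Ordinal: ordered categorical, so comparisons and sorting make sense
edu_order = ["primary", "secondary", "university"]
df["education"] = pd.Categorical(df["education"], categories=edu_order, ordered=True)
print(df["education"].min(), df["education"].max())
```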

From exploration perspective

  • Summary stats and visualization techniques for numerics are the same whether the variable is continuous or discrete
  • Special types
    • Binary (boolean) - categorical (nominal or ordinal); can be reformulated as an indicator → behaves a bit like a numeric (stats such as the mean make sense) (gender, open/closed…)
    • Date/Time - numeric, but many of the usual stats are irrelevant
    • Keys - usually represented as discrete numerics or strings, but the primary interest is uniqueness and missings

Variable distributions

  • discrete / continuous
    • Num.: discrete with infinite support (Poisson…) or continuous (normal, exponential…) distribution
    • Cat.: discrete distribution with finite support (Bernoulli, categorical)
  • symmetrical / skewed
  • heavy-tailed
  • unimodal / multimodal

Summary statistics

  • Location - what is a typical value? where is the main mass of the distribution located?
    • Num.: mean, median, trimmed mean…
    • Cat.: mode (median only for ordinal)
  • Variability - how much can values differ? how spread are they?
    • Num.: SD, IQR, MAD, CV, range…
    • Cat.: levels count
  • Distribution
    • Num.: quantiles, skewness, kurtosis
    • Cat.: proportions per level
  • Other artifacts
    • % of NAs, distinct, zeroes, positive/negative, infinite
    • monotonicity
    • outliers (extreme values)
    • common values
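
A sketch of computing some of these statistics for one hypothetical numeric column (amount); the trimmed mean comes from scipy, and the MAD is computed by hand since pandas dropped Series.mad().

```python
import pandas as pd
from scipy import stats

df = pd.read_csv("loans.csv")    # hypothetical data set
x = df["amount"]                 # hypothetical numeric column

summary = {
    # Location
    "mean": x.mean(),
    "median": x.median(),
    "trimmed_mean_10pct": stats.trim_mean(x.dropna(), 0.1),
    # Variability
    "std": x.std(),
    "iqr": x.quantile(0.75) - x.quantile(0.25),
    "mad": (x - x.median()).abs().median(),
    # Distribution shape
    "skewness": x.skew(),
    "kurtosis": x.kurt(),
    # Other artifacts
    "pct_missing": x.isna().mean(),
    "pct_zero": (x == 0).mean(),
    "n_distinct": x.nunique(),
}
print(pd.Series(summary).round(3))
```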

4. Data exploration

  • aka EDA (Exploratory Data Analysis)
  • Digging deeper
    • Specificities of variables
    • Relationships between variables
    • Investigating whether hypotheses could be true
    • Finding patterns
  • Using more advanced visualizations (and possibly tables)
    • More next week
  • Later: focusing on specific parts of data

Using AI

  • Currently
    • Tools automate routine aggregations and plotting (good old automation, no genAI)
    • AI can help drafting ideas: generating more hypotheses
    • Interpretation is up to you

  • Future?
    • Describe what table / plot says in human language
    • Pin interesting observations
    • Validate against existing knowledge
    • Recommend cleaning steps

Example: loan application

EDA - possible status changes on aggregated level

  • Inspired by first look into the data → questions:
    • How frequent is the manual check? How does it affect the duration of the application process? Does it have an effect on acceptance?
    • Where does income verification sit in the process? Does it always happen the same way?
    • What about the rest of the process (before and after the scoring part) - is it simpler?
  • Confrontation of three sources of DU
    • BU / meetings to gather existing knowledge: some status transitions are not possible
    • The data tell us otherwise - and it does not seem to be a random error (see the transition-count sketch at the end of this example)
    • Our own thinking → brainstorm ideas why that might be
  • Inputs for further steps
    • Second loop of BU
      • Are our findings valid? Why was the initial impression different?
      • Why does income verification happen in several different ways?
    • Data selection based on quality
      • What can be mined without leaks?
      • Are there any applications we should drop?
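
A sketch of the aggregated transition check referenced above: count how often each status is followed by each other status and compare against the transitions BU said were possible. The table, column names, and the allowed set are hypothetical.

```python
import pandas as pd

log = pd.read_csv("status_log.csv", parse_dates=["time"])   # hypothetical table

# Order events within each application, then pair each state with the next one
log = log.sort_values(["application_id", "time"])
log["next_state"] = log.groupby("application_id")["state"].shift(-1)

# Aggregated transition counts: rows = current state, columns = next state
transitions = pd.crosstab(log["state"], log["next_state"])
print(transitions)

# Transitions that BU considered impossible but the data contain anyway
# (the allowed set below is purely illustrative)
allowed = {("scoring", "manual check"), ("manual check", "scoring")}
observed = transitions.stack()
suspicious = observed[(observed > 0) & ~observed.index.isin(allowed)]
print(suspicious.sort_values(ascending=False))
```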