DATA INTEGRATION AND ERROR: BIG DATA FROM THE 1930’S TO NOW


Copyright ©2012 The Nielsen Company. Confidential and proprietary.

CONTENTS

• Big Data in the 1930’s and why that matters now

• TV measurement and Return Path Data (STB)

• Interesting questions for understanding error


BIG DATA 1930’S STYLE


PROBABILITY SAMPLING 1930’S STYLE


EVOLUTION OF STATISTICAL CONCEPTS IN RESEARCH

Early days: Novel, non-scientific

1930’s: Scientific sampling

Since the 1950’s: weighting, probability models, imputation techniques, data fusion, time series analyses, hybrid (Big Data/sample integration)


NIELSEN AND AUDIENCE MEASUREMENT

1923: Nielsen founded

1950: Introduces TV Audience Measurement

Current technology: People Meter

• Electronic measurement

• Probability samples

• All people and sets in home measured

Nielsen Ratings are the currency for US TV advertising


THE CHANGING TV ENVIRONMENT

• Fragmentation of Viewing Choices

• Proliferation of Devices

• Increasing Population Diversity


RESEARCH DATA - STATISTICAL TOOLS

From: Sample/Measure/Project (Panel Data)

To: Sample/Measure/Project + Integrate

- Data Fusion

- Probability Modeling

- Calibration

- Predictive Modeling

Using Multiple Panels, Census Data, Surveys


WHAT STB AND PANELS CAN GIVE US

STB (DATA): Large convenience samples, stable results

+

Panels (RESEARCH PRODUCTS): Completeness of Audience Measurement

=

In combination, STB + Panels offer the possibility of stable, UNBIASED RESEARCH


STB GAPS AND BIAS

1. Data Quality/coverage/timeliness/representativeness

2. Set Activity (On/Off/Other Source)

3. Household Characteristics

4. Persons viewing (including visitors in the home)

5. Other Viewing Activity

[Figure: Total Survey Error, split into Bias and Standard Error, shown for STB, for People Meter, and for STB + People Meter?]
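One common way to put bias and standard error on a single scale is root mean squared error, where total error is the square root of bias squared plus standard error squared. The sketch below assumes that definition (the slides do not state which formula is used), and every number in it is hypothetical, chosen only to show how a large biased source and a small unbiased source might trade off.

```python
import math


def total_survey_error(bias: float, standard_error: float) -> float:
    """Root mean squared error: sqrt(bias^2 + standard_error^2)."""
    return math.sqrt(bias ** 2 + standard_error ** 2)


# Hypothetical figures in rating points, for illustration only.
stb = total_survey_error(bias=0.8, standard_error=0.1)           # large sample, biased
people_meter = total_survey_error(bias=0.1, standard_error=0.6)  # small sample, low bias
hybrid = total_survey_error(bias=0.2, standard_error=0.2)        # the hoped-for combination

for name, tse in [("STB", stb), ("People Meter", people_meter), ("Hybrid", hybrid)]:
    print(f"{name}: {tse:.2f}")
```

Under this definition a source can have a tiny standard error and still carry a large total error, which is why the question mark on the combined case matters.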


STB DATA QUALITY – EXAMPLE ANALYSES

• Good… • Not so good…

[Figure: STB Tuning Activity. Y axis: AA % (0 to 70); X axis: hour of day, 12:00 AM to 11:00 PM.]

[Figure: Adjacent Tuning Sessions - April 22nd 2011. Y axis: % (0.00 to 3.00); X axis: half hours, 5:00 PM to 1:00 AM; series: Same Channel, Different Channel. Annotations: Machine Reboot Activity; Program junction spikes.]


ARE WE IMPROVING THE MEASUREMENT?

1. Transparency and validation at each step and overall

2. Total Survey Error (Bias, Standard Error)

[Figure: Local Station Ratings M-F Nov 2010 - Women 18+. Y axis: 0.0 to 12.0; X axis: dayparts from 5 AM - 6 AM to 11:30 PM - 1 AM; series: People Meter, Hybrid.]

Total Survey Error % Reduction

                   Broadcast   Cable   Total
Females 18+               19       5       7
Females 18 - 34           38      41      40


ASSESSING INTEGRATION ERROR

• Input Error (GIGO)

• Matching Error

• Statistical Error

• Validity Levels

• Multiple Database error compounding


ASSESSING INTEGRATION ERRORS

• Input Error (GIGO)

- Coverage gaps, definitional problems, input errors etc.

- But possible improvement through integration weighting effects

Most problems remain but some can be mitigated through integration


ASSESSING INTEGRATION ERRORS

• Matching Error (e.g. address matching)

- Good – correct match, Bad – no match, Ugly – incorrect match

- Trade-off between match rates and error rates

Multiple databases may have correlated errors. That may be preferable to random errors, since the overall effect is restricted to a smaller group (e.g. new householders in some address lists).
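The trade-off between match rates and error rates can be sketched with a simple threshold matcher. This is an illustration under stated assumptions, not the matching method used in the work described: the address lists and ids are hypothetical, and the similarity score comes from Python's standard `difflib.SequenceMatcher`. A lower threshold links more records, and typically lets more incorrect links through.

```python
from difflib import SequenceMatcher

# Hypothetical address lists. The ids are ground truth used only for scoring.
LIST_A = [(1, "12 oak street"), (2, "7 elm avenue"), (3, "98 pine road"), (4, "15 oak street")]
LIST_B = [(1, "12 oak st"), (2, "7 elm ave"), (3, "98 pine rd"), (4, "15 oak str"), (5, "16 oak street")]


def similarity(a: str, b: str) -> float:
    """String similarity between 0 and 1 from difflib."""
    return SequenceMatcher(None, a, b).ratio()


def link(threshold: float) -> dict:
    """Link each A record to its most similar B record when the similarity
    reaches the threshold, and score the outcome against the true ids."""
    counts = {"good": 0, "bad": 0, "ugly": 0}
    for id_a, addr_a in LIST_A:
        id_b, addr_b = max(LIST_B, key=lambda rec: similarity(addr_a, rec[1]))
        if similarity(addr_a, addr_b) < threshold:
            counts["bad"] += 1       # no match made
        elif id_a == id_b:
            counts["good"] += 1      # correct match
        else:
            counts["ugly"] += 1      # incorrect match
    return counts


for threshold in (0.6, 0.8, 0.95):
    c = link(threshold)
    matched = c["good"] + c["ugly"]
    print(threshold, "match rate:", matched / len(LIST_A), "incorrect links:", c["ugly"])
```

Sweeping the threshold on a file with known truth is one way to choose an operating point before matching the full databases.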


STATISTICAL ERROR (SAMPLE-BASED IMPUTATION)

• Model bias leads to attenuation (regression to mean)

• Individual data point bias can be undetectable due to sampling error

[Figure: Persons 2+ Total Viewing Weekly Average Hours across 1000 Product Categories. Axis: 0 to 40; series: Fused, Actual, Fused Best Fit.]
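Attenuation of this kind shows up as a best-fit slope below 1 when fused values are regressed on actual values. The sketch below simulates that situation with hypothetical numbers: `retention` (the share of each category's true deviation from the mean that survives fusion) and `noise_sd` are assumptions, not figures from the study.

```python
import random


def ols_slope(x, y):
    """Ordinary least squares slope of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx


def simulate_fusion(n_categories=1000, retention=0.9, noise_sd=1.0, seed=0):
    """Hypothetical actual and fused weekly hours for each product category.
    Fused values are pulled toward the overall mean, plus sampling noise."""
    rng = random.Random(seed)
    actual = [rng.uniform(15, 35) for _ in range(n_categories)]
    mean_actual = sum(actual) / n_categories
    fused = [mean_actual + retention * (a - mean_actual) + rng.gauss(0, noise_sd)
             for a in actual]
    return actual, fused


actual, fused = simulate_fusion()
print(f"best-fit slope of fused on actual: {ols_slope(actual, fused):.2f}")
```

A slope near 1 would indicate little regression to the mean. Individual categories can still sit well off the line through sampling noise alone, which is the second bullet's point.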


SEPARATING MODEL BIAS AND SAMPLING ERROR

[Figure: Actual vs Expected Distribution of Differences between Real and Fused Results. Axis: 0 to 1000; series: M-Sun 1am-6am, Expected.]

Z-tests on each comparison and evaluation of Z-score distributions

Deviation from expected distribution gives bias estimate
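One way to read the two lines above: if fused and real results differed only through sampling error, the z-scores of their differences would be roughly standard normal, with variance close to 1. Any extra spread can then be attributed to model bias. The sketch below illustrates that reading on simulated data. The estimator (square root of the variance of z minus 1) and all inputs are assumptions made for illustration, not the procedure behind the chart.

```python
import math
import random
import statistics


def z_scores(fused, actual, se_diff):
    """Z-score of each fused-minus-actual difference against its standard error."""
    return [(f - a) / s for f, a, s in zip(fused, actual, se_diff)]


def implied_bias_sd(z):
    """Spread of z beyond the unit variance expected from sampling error alone,
    expressed in standard-error units."""
    return math.sqrt(max(statistics.pvariance(z) - 1.0, 0.0))


def simulate(n=5000, bias_sd=0.5, seed=1):
    """Hypothetical comparisons: each difference is sampling noise plus a model bias."""
    rng = random.Random(seed)
    actual = [rng.uniform(0, 10) for _ in range(n)]
    se_diff = [1.0] * n
    fused = [a + rng.gauss(0, bias_sd) + rng.gauss(0, 1.0) for a in actual]
    return fused, actual, se_diff


z = z_scores(*simulate())
print(f"variance of z: {statistics.pvariance(z):.2f}, implied bias SD: {implied_bias_sd(z):.2f}")
```

The design choice here is to estimate bias from the whole distribution instead of from individual comparisons, since any single difference is too noisy to judge on its own.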


STATISTICAL ERROR - MULTIPLE DATA SETS

[Diagram: two fusion designs linking the TV, Web and Buy data sets, Hub and Spoke and Sequential, with the fusion steps numbered.]

Comparison with Single Source Data: Nielsen National People Meter TV and Internet matched with Credit Card Purchase Data


ACCURACY TEST

[Diagram: Hub and Spoke and Sequential fusions of the TV, Web and Buy data sets, annotated with R = 0.4, R = 0.5, R = 0.67 and R = 0.44.]

Correlation of 8 product categories with 14 TV Networks and 60 Websites


SEQUENTIAL VS HUB AND SPOKE

• Unless the Hub has all the relevant linking information, a sequential approach gives better results

• In our example, we captured interactions between web and purchase behavior through the sequential fusion

• However, sequential fusions can fall down with too many data sets, as error compounds.


VALIDITY LEVELS – INDIVIDUAL VS AGGREGATED

Individual Prediction

• IDEAL SCENARIO: You can predict every individual’s behavior

• REALITY: With most imputation methods we can do better than random, but rarely can we get close to 100% accuracy.

• E.g. ~40% improvement on random when predicting product users based on cookies, i.e. 14% of online ad impressions delivered to product users rather than 10%

Aggregate Prediction

• Imputation methods can reliably predict aggregate level behavior given good predictive variables

• E.g. 90% accuracy (10% regression to mean) for TV audience estimates by product users

• Errors compound with multiple sources but extent varies by case
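The individual-prediction figures can be checked with a line of arithmetic: moving from 10% to 14% of impressions reaching product users is a 40% relative improvement, yet most impressions still miss. The function name below is an illustrative choice; the two rates are the ones quoted above.

```python
def relative_lift(targeted_rate: float, baseline_rate: float) -> float:
    """Relative improvement of a targeted hit rate over the random baseline."""
    return targeted_rate / baseline_rate - 1.0


lift = relative_lift(targeted_rate=0.14, baseline_rate=0.10)
missed = 1.0 - 0.14  # share of impressions still delivered to non-users

print(f"relative lift: {lift:.0%}")
print(f"impressions to non-users: {missed:.0%}")
```

This is why a model can look strong in relative terms while remaining far from accurate for any one individual.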


CONCLUSION

• Data Everywhere!

• Data quality and relevance are essential

• Integration brings insights and error

• Statistical Integrity is as important now as it was in the 1930’s


APPENDIX


AD EFFECTIVENESS - MORE COMPLICATED

• Imagine a data set of 10,000 people for whom you have tracked exposure to a brand’s website and subsequent purchase of that brand.

• In our initial thought experiment, 76% converted.

[Diagram: hub and spoke. HUB: Matching info; spokes: PURCHASE, Website visit, and the remainder marked TBD...]


A BASIC EXPERIMENT

• Now imagine that you have measurement error in 10% of your cases. We ran a simulation of 1000 datasets which had incorrect data on site visits in 10% of cases.

• The difference between the original conversion rate and that in the 1000 error-ridden test cases is about 8.5%. SD is xx.
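A simulation of this shape can be sketched as follows. It is not the simulation behind the figures above and is not expected to reproduce them: the visit rate and the purchase rate among non-visitors are hypothetical, conversion is assumed to mean the share of recorded site visitors who purchased, and errors are modeled as independent flips of the visit flag.

```python
import random
import statistics


def make_people(n_people, visit_rate, conv_if_visit, conv_if_no_visit, rng):
    """Error-free data set of (visited_site, purchased) pairs."""
    people = []
    for _ in range(n_people):
        visited = rng.random() < visit_rate
        p_buy = conv_if_visit if visited else conv_if_no_visit
        people.append((visited, rng.random() < p_buy))
    return people


def conversion_rate(people):
    """Share of recorded site visitors who purchased."""
    purchases = [bought for visited, bought in people if visited]
    return sum(purchases) / len(purchases)


def corrupt_visits(people, error_rate, rng):
    """Flip the recorded visit flag in roughly error_rate of cases."""
    return [((not v) if rng.random() < error_rate else v, b) for v, b in people]


def run_experiment(n_people=10_000, n_sims=1000, error_rate=0.10, seed=0):
    """Return (error-free conversion, mean and SD of conversion across noisy copies)."""
    rng = random.Random(seed)
    # 0.76 is the conversion quoted in the slides; 0.5 and 0.30 are hypothetical.
    truth = make_people(n_people, 0.5, 0.76, 0.30, rng)
    noisy = [conversion_rate(corrupt_visits(truth, error_rate, rng)) for _ in range(n_sims)]
    return conversion_rate(truth), statistics.mean(noisy), statistics.stdev(noisy)
```

Calling `run_experiment()` with the defaults mirrors the 10,000 people and 1000 data sets described above. Smaller values of `n_sims` run faster and are enough to see the direction of the shift.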


A BASIC EXPERIMENT

• What happens when we add another data set?

[Diagram: hub and spoke. HUB: Matching info; spokes: PURCHASE!, Website visit, Saw TV ad, and the remainder marked TBD...]


MORE DATA – SAME ERROR

• Given two types of ad exposure data to measure, the impact of error in a single data source should be less...

• Imagine that you have measurement error in 10% of your cases for one data source – the same error as in the previous experiment.

• As expected, conversion values are closer to our error-free data set. SD =


MORE DATA – MORE ERROR

• Next, we introduced error into the TV data set as well.

• Performance worsens. SD is xx.

• But it looks more additive than exponential.


MORE DATA – EVEN MORE ERROR

• Next, we imagined combining 6 data sets, each with 10% error.

• WHAT DO WE SEE?


MATCHING ERROR

• In any data combination, there is an additional source of error – mismatches to the HUB or identity variable.

• Misspelled names can lead to false negatives. Non-deterministic matching can lead to false positives.

• Introducing 10% matching error (to the first data set only, to both, and to the second only) suggests that the impact on conversion is negligible relative to the error-free data.

• This suggests that the quality of the data is more important than the matching quality.


ASIDE: THE IMPORTANCE OF WEIGHT

• Here, TV data was heavily weighted toward exposure.

• That overwhelmed any error from website visit data. Indeed, it appeared to counterbalance it.


ASIDE: THE IMPORTANCE OF CORRELATION

• The greater the correlation between the dependent and independent variable, the greater the impact of error.

Weaker correlation between webvisit and purchase (xx)

Strong correlation between webvisit and purchase (xx)
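A closed-form sketch shows why a stronger relationship makes the same error rate more damaging. It assumes the visit flag is flipped independently of purchase with a fixed probability and that conversion is measured among recorded visitors. Under those assumptions the shift in measured conversion is proportional to the gap between visitor and non-visitor purchase rates. All rates below are hypothetical.

```python
def observed_conversion(visit_rate, conv_if_visit, conv_if_no_visit, error_rate):
    """Expected conversion among recorded visitors when the visit flag is
    flipped independently with probability error_rate."""
    true_pos = visit_rate * (1 - error_rate)       # real visitors still recorded as visitors
    false_pos = (1 - visit_rate) * error_rate      # non-visitors wrongly recorded as visitors
    return (true_pos * conv_if_visit + false_pos * conv_if_no_visit) / (true_pos + false_pos)


def conversion_shift(visit_rate, conv_if_visit, conv_if_no_visit, error_rate):
    """How far measured conversion falls below the error-free value."""
    return conv_if_visit - observed_conversion(
        visit_rate, conv_if_visit, conv_if_no_visit, error_rate)


# Same 10% error rate, two hypothetical strengths of relationship.
weak = conversion_shift(0.5, conv_if_visit=0.40, conv_if_no_visit=0.30, error_rate=0.10)
strong = conversion_shift(0.5, conv_if_visit=0.76, conv_if_no_visit=0.30, error_rate=0.10)
print(f"weak relationship shift: {weak:.3f}")
print(f"strong relationship shift: {strong:.3f}")
```

When visitors and non-visitors buy at the same rate, misrecorded visits change nothing. The further apart the two rates are, the more each misrecorded case moves the result.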


WHAT DO WE KNOW THUS FAR?

• Still more work to do, certainly. But we have formed certain hypotheses:

• When combining multiple data sets, the error appears additive.

• Error rates being equal, the underlying aspects of the data are more likely to impact the outcome than the combination.

• It is important, however, to qualify basic relatedness between each independent variable and the dependent outcome. This argues for a hub and spoke approach to data combination.

• So how did these hypotheses fare in a quick test using real world data? (next slide on your recent error work)


COMBINING DATA SETS

There are two basic paths to integrating data.

A serial integration: (A+B)+C

Each data set resulting from an integration is smaller than either original source due to non-matches.

[Diagram: Data Source A + Data Source B = Data Source A+B; Data Source A+B + Data Source C = Data Source A+B+C]


COMBINING DATA SETS

Another approach is a hub and spoke model: (A+B)+(A+C)...etc.

While the final integrated set is still reduced due to non-matches, the error from each match to the HUB is known.

[Diagram: hub and spoke. HUB: Matching info; spokes all marked TBD...]
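The two integration paths can be compared on toy data. The sketch below treats each source as a set of record ids and matching as exact id overlap, which is a simplifying assumption; the sources A, B and C and their sizes are hypothetical. It shows the serial result shrinking at each step, and the hub and spoke version exposing a separate match count for each spoke.

```python
def serial_integration(a, b, c):
    """(A+B)+C: each step keeps only ids found in both inputs."""
    ab = a & b
    return ab, ab & c


def hub_and_spoke(hub, spokes):
    """Match every spoke to the hub on its own, then keep hub ids found in all spokes."""
    per_spoke = {name: hub & ids for name, ids in spokes.items()}
    combined = set(hub)
    for matched in per_spoke.values():
        combined &= matched
    return per_spoke, combined


# Hypothetical sources with partly overlapping ids.
A = set(range(0, 100))
B = set(range(20, 120))
C = set(range(40, 140))

ab, abc = serial_integration(A, B, C)
print("serial sizes:", len(A), "->", len(ab), "->", len(abc))

per_spoke, combined = hub_and_spoke(A, {"B": B, "C": C})
for name, matched in per_spoke.items():
    print(f"spoke {name}: {len(matched)} of {len(A)} hub records matched")
print("hub and spoke combined size:", len(combined))
```

With exact id matching both paths end with the same records. The practical difference is that the hub and spoke version reports a match rate per source against a common base, so each link can be assessed on its own.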


AD EFFECTIVENESS - MORE COMPLICATED

Ad effectiveness captures the correlation between exposure to advertising and subsequent purchase of a product.

When someone who sees an ad buys a product, we say they have CONVERTED.

[Diagram: hub and spoke. HUB: Matching info; spokes: PURCHASE, and the remainder marked TBD...]
