
Roger Hoerl SAY Award Presentation, 2013


Presentation given by Roger Hoerl when he received the Statistical Advocate of the Year Award from the Chicago Section of the American Statistical Association (ASA), May 9th, 2013.


Slide 1: Statistical Engineering and BIG DATA

'Big Data' - A Challenge for Statistical Leadership

Chicago Chapter ASA

SAY Award Luncheon

Roger W. Hoerl

Union College

Schenectady, NY

With significant input from Ron Snee

Slide 2: Abstract

The Wall Street Journal, New York Times, and other respected publications have recently run major features on Big Data - the massive data sets which are becoming commonplace - and on the new, "sexy" data mining methods developed to analyze them. These articles, as well as much of the professional data mining and Big Data literature, may give casual users the impression that if one has a powerful enough algorithm and a lot of data, good models and good results are guaranteed at the push of a button. Obviously, this is not the case. The leadership challenge to the statistical profession is to ensure that Big Data projects are built upon a sound foundation of good modeling, and not upon the sandy foundation of hype and unstated assumptions. Further, we need to accomplish this without giving the impression that we are "against" Big Data or newer methods. I feel that the principles of statistical engineering (see Anderson-Cook and Lu 2012) can provide a path to do just this. Three statistical engineering principles that are often overlooked or underemphasized by Big Data enthusiasts are the importance of data quality - knowing the "pedigree" of the data; the need to view statistical studies as part of the sequential process of scientific discovery - versus the "one-shot study" so common in textbooks; and the criticality of using subject-matter knowledge when developing models. I will present examples of the severe problems that can arise in Big Data studies when these principles are not understood or are ignored. In summary, I argue that the development of Big Data analytics provides significant opportunities for the profession, but at the same time requires a more proactive role from us if we are to provide true leadership in the Big Data phenomenon.

Slide 3: Outline

Statistical Leadership (Advocacy)

The "Big Data" Phenomenon

What Could Possibly Go Wrong?

Statistical Engineering, and How It Can Help

Leading the Way – Doing Big Data the Right Way

Summary

Slide 4: Statistical Leadership

Leadership: taking people from one paradigm to another.

Enabling people to think statistically, and apply statistical methods, requires leadership.

Opinion: too many statisticians are satisfied being experts in the tools themselves, without worrying much about the overall impact our profession is having on society. Can't see the forest for the trees.

As a result, society too often compartmentalizes statisticians as narrow specialists, and does not view us as thought leaders; they look elsewhere for leadership. Passive consultants versus proactive leaders.

As a case in point, most professionals view the "Big Data" phenomenon as being led by computer scientists, engineers, or data scientists (whatever that means), rather than by statisticians.

Ron Snee, Gerry Hahn, and other leaders have been noting for years that statisticians need to be more proactive, and guide society as to what needs to be done. We shouldn't be satisfied being the "tools guys".

"Everything Rises and Falls on Leadership." - John Maxwell

Slide 5: Data Mining and Big Data

The technology for acquiring, storing, and processing data has been increasing exponentially ("Big Data"), providing new opportunities to "mine" the data.

According to IBM, there are now 1.6 zettabytes (10²¹ bytes) of digital data available. To use 1.6 zettabytes of bandwidth, you would need to watch HD TV for 47,000 years.

"I keep saying that the sexy job in the next 10 years will be statisticians," said Hal Varian, chief economist at Google. "And I'm not kidding."

March 2012: The White House announced a national "Big Data Initiative" in which six Federal departments and agencies committed more than $200 million to Big Data research projects.

As noted by Ron Snee, data mining has been around for decades:

1950s: Stepwise regression first developed at Esso (now Exxon) by Efroymson to analyze refinery data.

1960s: Graphical methods developed by Tukey, Wilk, Gnanadesikan, and others at Bell Labs to gain insight from large data sets.

1970s: DuPont uses data compression algorithms in process monitoring with on-line systems.

Big Data and Data Mining Are Growing Rapidly, but Are Not New.

Slide 6: What's New?

Sheer size of data – often requires compression, parallel processing, and sampling to store and analyze.

Some traditional methods are no longer relevant, e.g., hypothesis testing.

Insight from graphical methods must be rethought – it is difficult to find outliers in zettabytes of data.

The sample sizes, coupled with faster computing, enable much more complex models relative to data sets of 30.

Due to the above, newer techniques have become popular (a brief illustrative sketch follows this list):

CART and other tree-based methods; recursive splits on the data.

Neural networks; non-linear models involving combinations of variables – very flexible.

Methods based on bootstrapping – resampling and combining models; random forests, "bagging", etc.

Clustering and classification methods designed for massive data sets; K-means clustering, support vector machines, etc.

Good News: We Have More Data and Powerful Analysis Methods.
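To make the techniques list above concrete, here is a minimal sketch, assuming Python with scikit-learn; the synthetic data set, parameter choices, and printed accuracies are illustrative only and are not part of the original slides.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier      # CART-style tree
from sklearn.ensemble import RandomForestClassifier  # bootstrapped ("bagged") trees
from sklearn.cluster import KMeans                   # K-means clustering

# Synthetic data standing in for a (modestly) large data set
X, y = make_classification(n_samples=100_000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(max_depth=5).fit(X_train, y_train)
forest = RandomForestClassifier(n_estimators=100, n_jobs=-1).fit(X_train, y_train)
print("CART accuracy:         ", tree.score(X_test, y_test))
print("Random forest accuracy:", forest.score(X_test, y_test))

# Unsupervised structure in the same features
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print("Cluster sizes:", np.bincount(clusters))

Note that nothing in this sketch checks where the data came from or whether the model form makes scientific sense - which is exactly the gap the rest of the talk addresses.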

Slide 7: What Could Possibly Go Wrong?

Slide 8: What Could Possibly Go Wrong?

Duke Genomics Center published several groundbreaking articles conclusively identifying cancer biomarkers in the 2005-2010 timeframe.

Unfortunately, clinical trials based on this research did not pan out. Women died unexpectedly.

Two statisticians, Keith Baggerly and Kevin Coombes, dug into the research.

New York Times, July 8, 2011: "Dr. Baggerly and Dr. Coombes found errors almost immediately. Some seemed careless - moving a row or column over by one in a giant spreadsheet - while others seemed inexplicable. The Duke team shrugged them off as 'clerical errors'... In the end, four gene signature papers were retracted. Duke shut down three trials using the results. (Lead investigator) Dr. Potti resigned from Duke... His collaborator and mentor, Dr. Nevins, no longer directs one of Duke's genomics centers. The cancer world is reeling."

Large Amounts of Data Plus Sophisticated Algorithms Do Not Guarantee Success.

Slide 9: What Could Possibly Go Wrong?

Financial giant Lehman Brothers declared bankruptcy on September 15th, 2008.

This was the largest bankruptcy filing in US history, with Lehman Brothers holding roughly $600 billion in assets.

The Dow Jones Industrial Average dropped over 500 points that day, and several other financial institutions followed Lehman Brothers into bankruptcy... and the rest is history.

A few years earlier, I had visited Lehman Brothers headquarters in New York with representatives of GE Capital: Lehman was selling models to predict corporate defaults. Their models were quite sophisticated, and based on large amounts of historical financial data.

Virtually all financial institutions impacted by the crisis had models.

"Historical Results Do Not Guarantee Future Performance."

Slide 10: What Could Possibly Go Wrong?

On April 18th, 2011, the book "The Making of a Fly" went on sale on Amazon.com.

Amazon's automated algorithm placed a price of $1,730,045 on the book. Later in the day, the Amazon price went up to $23,698,656 - plus $3.55 for shipping and handling.

No one bought the book that day.

Days later, the Amazon price was $106, and people started to buy the book.

"We Are Writing Things That No One Can Read." - Kevin Slavin (2011 TED Conference)

Slide 11: What Could Possibly Go Wrong?

Our quandary:

All other things being equal, "Big Data" is better than "little data".

The newer data mining tools are powerful, and work quite well in numerous cases.

Yet, modeling disasters continue to occur; why? Clearly, we are missing something in the equation.

Could It Be That the Fundamentals Are Still Important?

Slide 12: Can Statistical Engineering Principles Help?

Some Background, and a Definition

Slide 13: Interesting Course Taught at Harvard

Stat 399: Problem Solving in Statistics

"...emphasizes deep, broad, and creative statistical thinking instead of technical problems that correspond to a recognizable textbook chapter."*

*Xiao-Li Meng, The American Statistician, August 2009

Do the Important Problems We Face "Correspond to a Recognizable Textbook Chapter?"

Slide 14: Susan Hockfield – MIT President

Around the dawn of the 20th century, physicists discovered the basic building blocks of the universe; a "parts list", if you will. Engineers said "we can build something from this list," and produced the electronics revolution, and subsequently the computer revolution.

More recently, biologists have discovered and mapped the basic "parts list" of life - the human genome. Engineers have said "we can build something from this list," and are producing a revolution in personalized medicine.*

Who Is Building Something Meaningful From the Statistical Science Parts List of Tools?

*Loosely quoted from a January 2010 seminar at GE Global Research

Slide 15: Statistical Engineering Definition

Statistical engineering: the study of how to best utilize statistical concepts, methods, and tools, and integrate them with information technology and other relevant sciences, to generate improved results (Hoerl and Snee 2010a).

In other words: trying to build something meaningful from the statistical science tools list.

It enables us to attack the large, complex, unstructured problems "that do not correspond to a recognizable textbook chapter."

Notes:

This is a different definition from that used by Eisenhart, who we believe was the first to use the term, in 1950.

Good statisticians have always done this, but little practical guidance has been documented in the literature.

This Definition Is Consistent with Dictionary Definitions of Engineering.

Slide 16: Typical Phases of Statistical Engineering Projects

1. Identify problems: find the high-impact issues inhibiting achievement of the organization's strategic goals.

2. Create structure: carefully define the problem, objectives, constraints, metrics for success, and so on.

3. Understand the context: identify important stakeholders (e.g., customers, organizations, individuals, management), research the history of the issue, identify unstated complications and cultural issues, and locate relevant data sources.

4. Develop a strategy: create an overall, high-level approach to attacking the problem, based on phases 2 and 3.

5. Establish tactics: develop and implement diverse initiatives or projects that collectively will accomplish the strategy.

There Are No "Seven Easy Steps" to Statistical Engineering Projects.

Slide 17: Statistical Engineering – Critical Considerations for BIG DATA

Data Quality
- Free of omissions, errors, missing values, etc.
- Missing variables
- High measurement variation
- Biases – human, equipment, ...

Subject Matter Knowledge – used in many different ways
- Variable selection and appropriate scales (e.g., log, inverse, square, ...)
- Selection of model form: linear, curvilinear, multiplicative
- Interpretation of results
- Ability to extrapolate findings

Use of Sequential Approaches
- Big problems are not solved with one analysis, or even one data set
- Strategy must move beyond the one-shot-study mindset

Three Macro Issues That Seem to Be Overlooked in the Big Data Literature.

Slide 18: Understanding the "Data Pedigree"

Trust but verify: data pedigree must be assessed when analyzing Big Data. Data quality is an issue with all sources of data (a small screening sketch follows at the end of this slide).

Careful thought must be given to the model form needed to answer the question, and whether the current data are sufficient for that purpose.

Multiple sources of data require careful thought as to data pedigree and how to fit the databases together to produce useful results. Different data sources are typically associated with political issues, different agendas, different objectives, etc.

Good Principle: Data Are Guilty Until Proven Innocent.
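As one concrete illustration of "trust but verify", here is a minimal sketch, assuming Python with pandas, of a first-pass data-pedigree screen; the file name big_data_extract.csv and the 50% missing-value threshold are hypothetical choices, and a real pedigree review would also trace how, when, and by whom the data were collected.

import pandas as pd

def pedigree_report(df: pd.DataFrame) -> pd.DataFrame:
    """Summarize basic data-quality indicators, one row per column."""
    report = pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_missing": df.isna().sum(),
        "pct_missing": df.isna().mean().round(3),
        "n_unique": df.nunique(),
    })
    # Flag columns that never vary (possible constant fill value or stuck
    # sensor) and columns that are mostly missing.
    report["suspicious"] = (report["n_unique"] <= 1) | (report["pct_missing"] > 0.5)
    return report

df = pd.read_csv("big_data_extract.csv")         # hypothetical source file
print("Duplicate rows:", df.duplicated().sum())  # exact duplicate records
print(pedigree_report(df))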

Slide 19: The Advantages of a Sequential Approach

Much of our professional literature, and virtually all of our textbooks, assume that statistical problems are, by their nature, "one-shot studies":

We are handed a fixed data set, and must develop the "best" model to fit the data.

Articles are frequently published challenging previously published analyses, and proposing a better model for the same data.

This is clearly the tone of many high-profile data analysis competitions, beginning with the Netflix Challenge and continuing today with Kaggle.com.

Are Most Statistical Problems One-Shot Studies?

Slide 20: The Advantages of a Sequential Approach

In 30 years working as a statistician in the private sector, I almost always needed a sequential approach, involving more than one statistical tool, to solve the important problems I faced.

If one is in the midst of a sequential process, he or she approaches data analysis from a very different viewpoint versus one-shot studies. A key goal in the process is to direct the next round of data gathering and analysis, as opposed to finding the "optimal" model.

Sequential approaches, as proposed by Box, Hunter, and Hunter (2005), also offer the opportunity to use hindsight to our advantage: "The best time to design an experiment is after examining the results."

Are Netflix and Kaggle.com Missing Something?

Slide 21: The Importance of Subject Matter Knowledge

"Data have no meaning in themselves; they are meaningful only in relation to a conceptual model of the phenomenon being studied." - Box, Hunter, and Hunter

The implied message of the data mining, machine learning, and Big Data literature: "Data have complete meaning in themselves; no theory is required."

For example, only subject matter theory, NOT statistics, allows us to extrapolate the results of a study, say a clinical trial, to a broader population.

Subject matter theory guides the statistical process, including data collection, analysis, and interpretation.

This is a "scientific method" approach to statistics, as opposed to a "test" approach to statistics. Such an approach allows statistics and statisticians an active role in developing new theories, as opposed to simply providing yes/no answers to existing theories (proactive leadership vs. the passive consulting paradigm).

New subject matter insights lead naturally to new questions, and new data, directly linking this principle to the sequential approach principle.

Data and Understanding Are Not Synonyms.

Slide 22: Integration of Subject Matter Knowledge

[Diagram: Data (from the business process and the customer) and Subject Matter Theory interact in a repeating cycle; Process Knowledge Increases with each pass.]

From Hoerl & Snee, Statistical Thinking: Improving Business Performance, 2nd Ed., Wiley, 2012

Slide 23: Putting It All Together

Providing Leadership to Ensure We Do Big Data the Right Way

Slide 24: Statistical Engineering Approach to Big Data

Leadership is needed to avoid the "Big Data + powerful algorithms = success" fallacy; if we don't lead the way, it probably won't happen.

The fundamentals still apply - in fact, they are even more critical.

The phases of statistical engineering provide a framework with which to attack Big Data projects more scientifically:

1. Identify problems: find the high-impact Big Data problems - don't wait for them to come to you.

2. Create structure: carefully define the real (versus stated) problem, objectives, constraints, metrics for success, and so on.

3. Understand the context: obtain as much subject-matter knowledge as possible, research the history of the issue, locate relevant data sources, and so on.

4. Develop a strategy: create an overall, high-level approach to attacking the problem, based on phases 2 and 3; incorporate a sequential approach, applying what we learn in the initial analysis.

5. Establish tactics: develop and implement individual steps in the strategy - stay flexible, but start with a defined plan.

Big Data Constitutes One of Our Profession's Best Leadership Opportunities in Our History.

Slide 25: Summary

The glass is half-full: Big Data and associated tools offer a unique opportunity to solve important problems that were previously intractable.

Fundamentals of good science, analytical modeling, and interpretation still apply. Ignoring these fundamentals increases the probability that invalid conclusions are reached and inappropriate actions taken.

Statistical engineering provides a useful approach for using Big Data to solve important problems. A five-phase framework is suggested to guide the work associated with Big Data problems, which are typically large, complex, and unstructured.

The probability of success is significantly increased when the following aspects of statistical engineering are incorporated in the approach:

Understanding of data pedigree

Utilization of sequential approaches

Integration of subject matter knowledge

Statistical Engineering Can Help Big Data Projects Be Successful.

Slide 26: References

Davenport, T. H. and J. G. Harris (2007), Competing on Analytics, Harvard Business School Press, Boston, MA.

De Veaux, R. D. and D. J. Hand (2005), "How to Lie with Bad Data", Statistical Science, Vol. 20, No. 3, 231-238.

Hoerl, R. W. and R. D. Snee (2012), Statistical Thinking: Improving Business Performance, 2nd Ed., Wiley.

Pierrard, J. M. (1974), "Relating Automotive Emissions and Urban Air Quality", DuPont Innovation, Vol. 5, No. 2, pp. 6-9.

Pierrard, J. M., R. D. Snee and J. Zelson (1973), "A New Approach to Setting Vehicle Emission Standards", presented at the Air Pollution Control Association Annual Meeting, June 24-28, 1973.

Pierrard, J. M., R. D. Snee and J. Zelson (1974), "A New Approach to Setting Vehicle Emission Standards", Air Pollution Control Association Journal, Vol. 24, No. 9, pp. 841-848.

Snee, R. D. and R. W. Hoerl (2003), Leading Six Sigma – A Step by Step Guide Based on Experience With General Electric and Other Six Sigma Companies, FT Prentice Hall, New York, NY.

Snee, R. D. and R. W. Hoerl (2012), "Inquiry on Pedigree – Do You Know the Quality and Origin of Your Data?", Quality Progress, December 2012, 66-68.

Snee, R. D. and J. M. Pierrard (1977), "The Annual Average: An Alternative to the Second Highest Value as a Measure of Air Quality", Air Pollution Control Association Journal, Vol. 27, No. 2, pp. 131-133.

Slide 27: Articles on Statistical Engineering by Hoerl and Snee

Roger W. Hoerl and Ronald D. Snee (2009), "Post Financial Meltdown: What Do Services Industries Need From Us Now?", Applied Stochastic Models in Business and Industry, December 2009, pp. 509-521.

Roger W. Hoerl and Ronald D. Snee (2010), "Moving the Statistics Profession Forward to the Next Level", The American Statistician, February 2010, pp. 10-14.

Roger W. Hoerl and Ronald D. Snee (2010), "Closing the Gap: Statistical Engineering Can Bridge Statistical Thinking with Methods and Tools", Quality Progress, May 2010, pp. 52-53.

Roger W. Hoerl and Ronald D. Snee (2010), "Tried and True – Organizations Put Statistical Engineering to the Test and See Real Results", Quality Progress, June 2010, pp. 58-60.

Roger W. Hoerl and Ronald D. Snee (2010), "Statistical Thinking and Methods in Quality Improvement: A Look to the Future", Quality Engineering, Vol. 22, No. 3, pp. 119-139.

Roger W. Hoerl and Ronald D. Snee (2011), "Statistical Engineering: Is This Just Another Term for Applied Statistics?", Joint Newsletter of the ASA Section on Physical and Engineering Sciences and Quality and Productivity, March 2011, 4-6.

Ronald D. Snee and Roger W. Hoerl (2010), "Further Explanation: Clarifying Points About Statistical Engineering", Quality Progress, December 2010, pp. 68-72.

Ronald D. Snee and Roger W. Hoerl (2011), "Engineering an Advantage", Six Sigma Forum Magazine, Guest Editorial, February 2011, 6-7.

Ronald D. Snee and Roger W. Hoerl (2011), "Proper Blending: Finding the Right Mix of Statistical Engineering and Traditional Applied Statistics", Quality Progress, June 2011.