Can Automated Feature Engineering prevent target leaks? The many ways you set up your problem wrong. Meir Maor

Page 1: Can automated feature engineering prevent target leaks

Can Automated Feature Engineering prevent target leaks?

The many ways you set up your problem wrong

Meir Maor

Page 2: Can automated feature engineering prevent target leaks

About Me

Meir Maor

Chief Architect @ SparkBeyond

At SparkBeyond we leverage the collective human knowledge to solve the world's toughest problems

Page 3: Can automated feature engineering prevent target leaks

This talk

Problem setup mistakes: target leaks, sampling bias, and friends

How can we detect them? Can we look at the data in a way that makes these flaws obvious?

Diverse examples from real (anonymized) problems.

Page 4: Can automated feature engineering prevent target leaks

Target Leak

Using information that is not actually available at prediction time: something from the future, or something affected by the outcome.

Make sure all fields in your training data are indeed available at prediction time. Easy, right?

Page 5: Can automated feature engineering prevent target leaks

A Retail Example

A large retailer wants to predict who will make a purchase and how much he or she will spend.

Since there are big differences between first-time and repeat customers, these were modeled separately.

One of the fields we may use is Address, which holds lots of information. Many users enter it at sign-up, so it's available at prediction time.

Page 6: Can automated feature engineering prevent target leaks

The leak

100% of those who have ordered have the address filled out, though this was not the case initially.

Though the field is available at prediction time, we do not have a temporal database to tell us what its value was then.

Page 7: Can automated feature engineering prevent target leaks

Feature engineering Address

Token TF-IDF

ZipCode / county

Geo-location

Address length, address non-empty
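As a minimal sketch of why the "address non-empty" feature makes this leak visible, the snippet below computes the purchase rate for each value of that flag. The `customers` records and the `purchase_rate_by_flag` helper are hypothetical and only illustrate the idea; they are not data or code from the retailer.

```python
from collections import defaultdict

# Hypothetical toy records: (address as stored today, made a purchase).
customers = [
    ("12 Main St, Springfield", True),
    ("", False),
    ("7 Oak Ave", True),
    ("", False),
    ("99 Elm Rd", False),
    ("", False),
]

def purchase_rate_by_flag(rows):
    """Return {address_non_empty: purchase rate} for the given rows."""
    counts = defaultdict(lambda: [0, 0])  # flag -> [purchases, total]
    for address, purchased in rows:
        flag = bool(address.strip())
        counts[flag][0] += int(purchased)
        counts[flag][1] += 1
    return {flag: bought / total for flag, (bought, total) in counts.items()}

rates = purchase_rate_by_flag(customers)
for flag, rate in sorted(rates.items()):
    print(f"address_non_empty={flag}: purchase rate {rate:.2f}")
```

A purchase rate of exactly zero for empty addresses is the warning sign: the field was filled in as a consequence of ordering, not before it.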

Page 8: Can automated feature engineering prevent target leaks

Mining for Unobtainium*

A client in the never-never land wants to find new Unobtainium deposits there.

A large part of the land has been explored, and we have a map of the mines.

Many areas were not explored; for these we have no map.

* Identifying client details were changed

Page 9: Can automated feature engineering prevent target leaks

Modelling Take 1

Place a grid on the never-never land map

All grid squares with a known deposit are positive

Since Unobtainium is rare, all others can be assumed to be negative

Use advanced imaging, radiometric, magnetic, topographic, and geological maps, and more, as explanatory variables.

Page 10: Can automated feature engineering prevent target leaks

99% AUC!! We are going to be rich!

Using topographic data, a big hole in the ground predicts a large deposit perfectly.

We are detecting existing active mines.

Back to the archives to find 50-year-old maps from before most mines were opened.
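One way to catch a feature like the "big hole in the ground" early is to score every feature on its own and flag any that separates the classes almost perfectly. The sketch below assumes a rank-based AUC and a threshold of 0.95; the feature names, values, and threshold are hypothetical illustrations, not the project's data or method.

```python
def auc(scores, labels):
    """Rank-based AUC: chance that a random positive outscores a random negative."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def suspicious_features(columns, labels, threshold=0.95):
    """Return features whose single-feature AUC (in either direction) is too good."""
    flagged = {}
    for name, values in columns.items():
        score = auc(values, labels)
        strength = max(score, 1 - score)  # a strongly inverted feature is also suspect
        if strength >= threshold:
            flagged[name] = strength
    return flagged

# Hypothetical grid squares: 1 = known deposit, 0 = assumed negative.
labels = [1, 1, 0, 0, 0, 0]
columns = {
    "pit_depth_m": [40, 55, 0, 1, 0, 2],       # measures the mine, not the geology
    "magnetic_anomaly": [3, 1, 2, 4, 1, 3],
}

flagged = suspicious_features(columns, labels)
print(flagged)
```

A flagged feature is not necessarily a leak, but it deserves a human look before anyone celebrates the headline AUC.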

Page 11: Can automated feature engineering prevent target leaks

96% AUC! We are going to be rich!

Distance from roads is an excellent predictor.

Not only do all existing mines have roads leading to them, but past exploration was also primarily in accessible areas.

Removing roads is not enough; they are hidden in all the data.

Page 12: Can automated feature engineering prevent target leaks
Page 13: Can automated feature engineering prevent target leaks

A cure for cancer?

Early detection of cancer based on routine medical tests.

Page 14: Can automated feature engineering prevent target leaks

Modeling Take 1

Predict cancer X time units in advance of current discovery date.

For sick people, take data up to X prior to diagnosis

For healthy people, take a fixed time window relative to an average diagnosis date.

Replace all dates with relative time stamps.
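The setup above can be sketched as follows, assuming each patient has a list of test dates, a reference date (the diagnosis date for sick patients, an average diagnosis date for healthy ones), and a lead time X expressed in days. The `relative_days` helper and the example dates are hypothetical.

```python
from datetime import date, timedelta

def relative_days(test_dates, reference, lead_days):
    """Keep tests taken at least lead_days before the reference date and
    express each one as the number of days before that cutoff."""
    cutoff = reference - timedelta(days=lead_days)
    return sorted((cutoff - d).days for d in test_dates if d <= cutoff)

# Hypothetical sick patient: reference date is the actual diagnosis date.
sick_tests = [date(2015, 1, 10), date(2015, 5, 20), date(2015, 6, 1), date(2015, 8, 15)]
sick = relative_days(sick_tests, reference=date(2015, 9, 1), lead_days=90)

# Hypothetical healthy patient: reference date is an average diagnosis date.
healthy_tests = [date(2014, 3, 5), date(2014, 11, 20)]
healthy = relative_days(healthy_tests, reference=date(2015, 9, 1), lead_days=90)

print("sick:", sick)
print("healthy:", healthy)
```

Under this construction the relative timestamps themselves carry information about how the reference date was chosen, so comparing their distribution between the two groups is a cheap first check.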

Page 15: Can automated feature engineering prevent target leaks

We always model the easiest part

Detecting when the samples were taken is much easier than detecting cancer, so that is what the model does.

Page 16: Can automated feature engineering prevent target leaks

Take 2

A quarterly snapshot, with different positives & negatives each quarter

If we allow repeat patients we get correlated examples

If we randomly assign a patient to a quarter we don’t have enough positives

If we deduplicate but keep all positives we get a skewed distribution.
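To make the trade-off concrete, here is a small sketch comparing the three sampling strategies on hypothetical patient-quarter records. The record layout `(patient_id, quarter, is_positive)`, the helper names, and the toy population are assumptions made for illustration.

```python
import random

def group_by_patient(rows):
    """Collect each patient's quarterly snapshots into a list."""
    groups = {}
    for row in rows:
        groups.setdefault(row[0], []).append(row)
    return groups

def one_random_quarter(rows, seed=0):
    """Assign each patient to one randomly chosen quarter."""
    rng = random.Random(seed)
    return [rng.choice(group) for group in group_by_patient(rows).values()]

def dedupe_keep_positives(rows, seed=0):
    """One row per patient, but always keep the positive snapshot if there is one."""
    rng = random.Random(seed)
    kept = []
    for group in group_by_patient(rows).values():
        positives = [r for r in group if r[2]]
        kept.append(positives[0] if positives else rng.choice(group))
    return kept

def positive_rate(rows):
    return sum(r[2] for r in rows) / len(rows)

# Hypothetical population: 10 patients, 4 quarters, one positive snapshot in total.
snapshots = [(p, q, p == 0 and q == 3) for p in range(10) for q in range(1, 5)]

random_pick = one_random_quarter(snapshots)
deduped = dedupe_keep_positives(snapshots)

print("all snapshots:        ", len(snapshots), positive_rate(snapshots))
print("random quarter:       ", len(random_pick), positive_rate(random_pick))
print("dedupe, keep positive:", len(deduped), positive_rate(deduped))
```

In this toy population, keeping every positive while deduplicating quadruples the positive rate relative to the full set of snapshots, which is the skew described above.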

Page 17: Can automated feature engineering prevent target leaks

Feature engineering

Each of these flaws is easily spotted when we look at a well-engineered feature that exploits it

Poorly engineered features may exploit the leak or bias only to a limited extent, so it never gets discovered

Complex models with simple features can exploit the leaks fully, but they are opaque, so the leak can go unnoticed

Page 18: Can automated feature engineering prevent target leaks

Automatic feature discovery

Exploit each leak to its fullest

Human-understandable top insights reveal target leaks

Allow data scientists to focus on problem definition and complex feature engineering, and to iterate rapidly.

Page 19: Can automated feature engineering prevent target leaks

Join Us

http://www.sparkbeyond.com/careers

Try the SparkBeyond Challenge: http://bit.ly/dss16-quiz