Can automated feature engineering prevent target leaks

Can Automated Feature Engineering prevent target leaks?

The many ways you set up your problem wrong

Meir Maor

About Me

Meir Maor

Chief Architect @ SparkBeyond

At SparkBeyond we leverage the collective human knowledge to solve the world's toughest problems

This talk

Problem setup mistakes: target leaks, sampling bias, and friends

How can we detect them? Can we look at the data in a way that makes these flaws obvious?

Diverse examples from real (anonymized) problems.

Target Leak

Using information not actually available at prediction time: something from the future, or something affected by the outcome.

Make sure all fields in your training data are indeed available. Easy, right?

A Retail Example

A large retailer wants to predict who will make a purchase and how much he or she will spend.

Since there are big differences between first-time and repeat customers, these were modeled separately.

One of the fields we may use is Address; it has lots of information. Many users enter it at sign-up, so it’s available at prediction time.

The leak

100% of those who have ordered have the address filled out, even though many had not filled it out initially.

Though the field is available at prediction time, we do not have a temporal database to tell us what its value was then.

Feature engineering Address

Token TF-IDF

ZipCode / county

Geo-location

Address length, address non-empty
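The simpler address features above can be sketched in a few lines. This is a minimal illustration, not the feature set used in the project: the function names, the five-digit ZIP pattern, and the toy rows are all assumptions made for the example. Comparing the purchase rate across `address_non_empty` is the kind of check that makes the leak visible.

```python
import re

def address_features(address):
    """Turn a raw address string (or None) into a few simple features."""
    text = (address or "").strip()
    # Assumption: a US-style five-digit ZIP code; other formats need another pattern.
    zip_match = re.search(r"\b\d{5}\b", text)
    return {
        "address_non_empty": bool(text),
        "address_length": len(text),
        "zip_code": zip_match.group(0) if zip_match else None,
        "tokens": text.lower().split(),  # input for a token TF-IDF step
    }

def purchase_rate_by(rows, feature):
    """Purchase rate for each value of one feature."""
    counts = {}
    for address, purchased in rows:
        key = address_features(address)[feature]
        bought, total = counts.get(key, (0, 0))
        counts[key] = (bought + int(purchased), total + 1)
    return {key: bought / total for key, (bought, total) in counts.items()}

# Invented toy rows: (address as stored today, did the customer purchase?)
toy_rows = [
    ("12 Main St, Springfield 12345", True),
    ("7 Oak Ave, Shelbyville 54321", True),
    ("3 Elm Rd, Springfield 12345", False),
    (None, False),
    ("", False),
    (None, False),
]
rates = purchase_rate_by(toy_rows, "address_non_empty")
print(rates)
```

In the toy rows nobody with an empty address has purchased, which mirrors the pattern described on the previous slide: the field as stored today was partly written by the purchase itself.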

Mining for Unobtainium*

A client in the never-never land wants to find new Unobtainium deposits there.

A large part of the land has been explored and we have a map of the mines

Many areas were not explored, and for those we have no map

* Identifying client details were changed

Modelling Take 1

Place a grid on the never-never land map

All grid squares with a known deposit are positive

Since Unobtainium is rare, all others can be assumed to be negative

Use advanced imaging, radiometric, magnetic, topographic maps, geological maps, and more as explanatory variables.
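The labelling step of this first attempt can be sketched as follows. The coordinate system, the cell size, and the function names are hypothetical choices for the illustration. Note that every square without a known deposit is labelled negative, including squares nobody has explored, which is exactly the Take 1 assumption.

```python
def grid_cell(x, y, cell_size):
    """Map a point to the (column, row) index of the grid square containing it."""
    return (int(x // cell_size), int(y // cell_size))

def label_grid(width, height, cell_size, deposits):
    """Label every grid square: 1 if it holds a known deposit, else 0."""
    positive_cells = {grid_cell(x, y, cell_size) for x, y in deposits}
    n_cols = int(width // cell_size)
    n_rows = int(height // cell_size)
    labels = {}
    for col in range(n_cols):
        for row in range(n_rows):
            # Unexplored squares also end up here as negatives.
            labels[(col, row)] = 1 if (col, row) in positive_cells else 0
    return labels

# Invented toy map: a 100 x 100 area with three known deposit locations.
labels = label_grid(width=100, height=100, cell_size=10,
                    deposits=[(5, 5), (7, 8), (95, 42)])
print(sum(labels.values()), "positive squares out of", len(labels))
```

Two of the toy deposits fall in the same square, so the grid has two positive squares out of one hundred.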

99% AUC!! We are going to be rich!

Using topographic data, a big hole in the ground predicts a large deposit perfectly.

We are detecting existing active mines.

Back to the archives to find 50-year-old maps from before most mines were opened.

96% AUC! We are going to be rich!

Distance from roads is an excellent predictor.

Not only do all existing mines have roads to them, but past exploration was also primarily in accessible areas

Removing roads is not enough; they are hidden in all the data.

A cure for cancer?

Early detection of cancer based on routine medical tests.

Modeling take 1

Predict cancer X time units in advance of the current discovery date.

For sick people, take data up to X before diagnosis

For healthy people, take a fixed time window from an average diagnosis date.

Replace all dates with relative time stamps.

We always model the easiest part

Detecting when the samples were taken is much easier than detecting cancer, so that is what the model does.
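A toy sketch of how this setup can leak sample timing. The day numbers below are invented to illustrate one plausible mechanism (sick patients tend to be tested shortly before their diagnosis), and the function names are hypothetical. After converting dates to relative time stamps, the gap between the last test and the cut-off separates the classes without looking at any test result.

```python
def relative_days(test_days, reference_day, horizon):
    """Keep tests up to `horizon` days before the reference day, as day offsets."""
    cutoff = reference_day - horizon
    return [day - cutoff for day in test_days if day <= cutoff]

def days_since_last_test(test_days, reference_day, horizon):
    """Gap between the most recent kept test and the cut-off (None if no tests)."""
    offsets = relative_days(test_days, reference_day, horizon)
    return -max(offsets) if offsets else None

horizon = 30  # the "X" of the slide, in days (assumed value)

# Invented toy patients: (days on which tests were taken, diagnosis day).
sick = [([100, 400, 690], 730), ([50, 500, 560], 600)]
# Healthy patients have no diagnosis, so an average diagnosis day is used.
healthy = [[30, 200, 900], [10, 120, 1000]]
average_diagnosis_day = sum(day for _, day in sick) / len(sick)

sick_gaps = [days_since_last_test(tests, day, horizon) for tests, day in sick]
healthy_gaps = [days_since_last_test(tests, average_diagnosis_day, horizon)
                for tests in healthy]
print("sick:", sick_gaps, "healthy:", healthy_gaps)
```

In this toy data every sick patient has a test within days of the cut-off and every healthy patient's last test is more than a year old, so a model only needs the timing feature to score perfectly.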

Take 2

A quarterly snapshot, with different positives & negatives each quarter

If we allow repeat patients we get correlated examples

If we randomly assign a patient to a quarter we don’t have enough positives

If we deduplicate but keep all positives we get a skewed distribution.
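The three sampling options can be compared on a toy cohort. This is only a sketch under stated assumptions: a row is labelled positive when the patient is diagnosed in that quarter, patients leave the cohort after diagnosis, and the cohort itself is invented. All function names are hypothetical.

```python
import random

def quarterly_examples(diagnosis_quarter, n_quarters):
    """All (patient, quarter, label) rows; patients drop out after diagnosis."""
    rows = []
    for patient, diag in diagnosis_quarter.items():
        for quarter in range(n_quarters):
            if diag is not None and quarter > diag:
                break
            rows.append((patient, quarter, int(diag == quarter)))
    return rows

def group_by_patient(rows):
    by_patient = {}
    for row in rows:
        by_patient.setdefault(row[0], []).append(row)
    return by_patient

def one_random_quarter(rows, rng):
    """Option 2: one randomly chosen quarter per patient."""
    return [rng.choice(p_rows) for p_rows in group_by_patient(rows).values()]

def dedupe_keep_positives(rows, rng):
    """Option 3: one row per patient, always keeping the positive row if any."""
    kept = []
    for p_rows in group_by_patient(rows).values():
        positives = [row for row in p_rows if row[2] == 1]
        kept.append(positives[0] if positives else rng.choice(p_rows))
    return kept

def positive_rate(rows):
    return sum(label for _, _, label in rows) / len(rows)

# Invented cohort: p1 is diagnosed in quarter 3, the others never.
cohort = {"p1": 3, "p2": None, "p3": None, "p4": None}
rng = random.Random(0)
all_rows = quarterly_examples(cohort, n_quarters=4)   # option 1: repeats allowed
random_rows = one_random_quarter(all_rows, rng)
dedup_rows = dedupe_keep_positives(all_rows, rng)
print(positive_rate(all_rows), positive_rate(dedup_rows))
```

With repeats allowed the toy positive rate is 1 in 16; keeping every positive while deduplicating raises it to 1 in 4, which is the skew described above. The random assignment keeps the positive only if it happens to draw the diagnosis quarter.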

Feature engineering

Each of the flaws is easily spotted when we look at a well-engineered feature that exploits it

Poorly engineered features may exploit the leak/bias only to a limited extent, so it is never discovered

Complex models with simple features can exploit the leaks fully but are opaque, so the leak can go unnoticed

Automatic feature discovery

Exploit each leak to its fullest

Human-understandable top insights show target leaks

Allows data scientists to focus on problem definition and complex feature engineering, and to iterate rapidly.
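One simple way to surface suspicious features is to score each one on its own and read the top of the list. The sketch below is a stand-in for illustration, not SparkBeyond's method: it ranks features by single-feature AUC, and the feature names and values are invented.

```python
def auc(scores, labels):
    """AUC as the probability that a random positive outscores a random negative."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            wins += 1.0 if p > n else 0.5 if p == n else 0.0
    return wins / (len(positives) * len(negatives))

def rank_features(columns, labels):
    """Sort features by how far their single-feature AUC is from 0.5."""
    scored = []
    for name, values in columns.items():
        a = auc(values, labels)
        # A feature that ranks positives last is as informative as one that ranks them first.
        scored.append((name, max(a, 1 - a)))
    return sorted(scored, key=lambda item: item[1], reverse=True)

# Invented toy data: three squares with a deposit, three without.
labels = [1, 1, 1, 0, 0, 0]
columns = {
    "distance_from_road_km": [0.2, 0.5, 0.1, 8, 12, 3],
    "magnetic_reading": [5, 2, 7, 4, 6, 3],
}
ranking = rank_features(columns, labels)
for name, score in ranking:
    print(f"{name}: {score:.2f}")
```

A feature that separates the classes perfectly on its own, like the road distance in this toy data, is a prompt to ask whether it is a genuine signal or a leak.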

Join Us

http://www.sparkbeyond.com/careers

Try the SparkBeyond Challenge: http://bit.ly/dss16-quiz
