Can Automated Feature Engineering prevent target leaks?
The many ways you set up your problem wrong
Meir Maor
About Me
Meir Maor
Chief Architect @ SparkBeyond
At SparkBeyond we leverage the collective human knowledge to solve the world's toughest problems
This talk
Problem setup mistakes: target leaks, sampling bias, and friends
How can we detect them? Can we look at the data in a way that makes these flaws obvious?
Diverse examples from real (anonymized) problems.
Target Leak
Using information not actually available at prediction time: something from the future, or something affected by the outcome.
Make sure all fields in your training data are indeed available. Easy, right?
A Retail Example
A large retailer wants to predict who will make a purchase and how much he or she will spend.
Since there are big differences between first and repeat customers these were modeled separately.
One of the fields we may use is Address; it has lots of information. Many users enter it at sign-up, so it's available at prediction time.
The leak
100% of those who have ordered have the address filled out, though many did not have it filled out initially.
Though the field is available at prediction time, we do not have a temporal database to tell us what its value was then.
Feature engineering Address
Token TF-IDF
ZipCode / county
Geo-location
Address length, address non-empty
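A quick way to see this leak is to tabulate the target against the simplest engineered feature, address non-empty. The sketch below uses invented records and hypothetical field names (`address`, `purchased`); it is an illustration under those assumptions, not the retailer's data or schema.

```python
from collections import defaultdict

# Toy snapshot of the customer table as stored today.
# Field names and values are hypothetical, for illustration only.
customers = [
    {"address": "12 Oak St, Springfield", "purchased": 1},
    {"address": "9 Elm Rd, Shelbyville", "purchased": 1},
    {"address": "4 Pine Ave, Springfield", "purchased": 1},
    {"address": "77 Lake Dr, Ogdenville", "purchased": 0},
    {"address": "", "purchased": 0},
    {"address": "", "purchased": 0},
    {"address": None, "purchased": 0},
    {"address": "", "purchased": 0},
]

def address_non_empty(record):
    """Engineered feature: True when an address string is present."""
    return bool(record["address"] and record["address"].strip())

def purchase_rate_by_feature(records, feature):
    """Return {feature_value: share of records with purchased == 1}."""
    totals = defaultdict(int)
    buyers = defaultdict(int)
    for r in records:
        key = feature(r)
        totals[key] += 1
        buyers[key] += r["purchased"]
    return {key: buyers[key] / totals[key] for key in totals}

rates = purchase_rate_by_feature(customers, address_non_empty)

# Share of buyers whose address is filled in today.
share_of_buyers_with_address = (
    sum(1 for r in customers if r["purchased"] and address_non_empty(r))
    / sum(r["purchased"] for r in customers)
)

print(rates)
print(share_of_buyers_with_address)
```

When every buyer has an address and no empty-address customer has bought, the field is most likely being filled in by the order itself rather than predicting it.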
Mining for Unobtainium*
A client in the never-never land wants to find new Unobtainium deposits there.
A large part of the land has been explored, and we have a map of the mines.
Many areas were not explored; for those we have no map.
* Identifying client details were changed
Modelling Take 1
Place a grid on the never-never land map
All grid squares with a known deposit are positive
Since Unobtainium is rare all others can be assumed to be negative
Use advanced imaging, radiometric, magnetic, topographic maps, geological maps, and more as explanatory variables.
99% AUC!! We are going to be rich!
Using topographic data, a big hole in the ground predicts a large deposit perfectly.
We are detecting existing active mines.
Back to the archives to find 50-year-old maps from before most mines were opened.
96% AUC! We are going to be rich!
Distance from roads is an excellent predictor.
Not only do all existing mines have roads leading to them,
past exploration was also primarily in accessible areas.
Removing roads is not enough; they are hidden in all the data.
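The exploration bias can be made visible by splitting the grid by whether a cell was ever explored. The sketch below assumes a hypothetical `explored` flag and invented values; in the scenario above such a flag may not exist, which is part of why the bias hides inside proxies like road distance.

```python
from statistics import mean

# Toy grid cells; every value here is invented for illustration.
# road_km:  distance to the nearest road
# explored: whether the cell was ever surveyed (hypothetical flag)
# label:    1 only where a deposit is already on the map
cells = [
    {"road_km": 0.5, "explored": True, "label": 1},
    {"road_km": 1.0, "explored": True, "label": 1},
    {"road_km": 2.0, "explored": True, "label": 0},
    {"road_km": 3.0, "explored": True, "label": 0},
    {"road_km": 20.0, "explored": False, "label": 0},
    {"road_km": 35.0, "explored": False, "label": 0},
    {"road_km": 50.0, "explored": False, "label": 0},
    {"road_km": 80.0, "explored": False, "label": 0},
]

def summarize(cells, explored):
    """Cell count, positive rate, and mean road distance for one group."""
    group = [c for c in cells if c["explored"] == explored]
    return {
        "cells": len(group),
        "positive_rate": mean(c["label"] for c in group),
        "mean_road_km": mean(c["road_km"] for c in group),
    }

explored_stats = summarize(cells, True)
unexplored_stats = summarize(cells, False)

print("explored:  ", explored_stats)
print("unexplored:", unexplored_stats)
```

Unexplored cells are negative by construction, not by geology, so any variable that tracks accessibility separates the classes without saying anything about deposits.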
A cure for cancer?
Early detection of cancer based on routine medical tests.
Modelling Take 1
Predict cancer X time units in advance of current discovery date.
For sick people, take data up to X before diagnosis.
For healthy people, take a fixed time window ending X before an average diagnosis date.
Replace all dates with relative time stamps.
We always model the easiest part
Detecting when the samples were taken is much easier than detecting Cancer, so that is what the model does.
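One plausible way this shows up, sketched with invented numbers: each sick patient gets a cutoff tied to their own diagnosis, while all healthy patients share one cutoff, so the relative timestamp of the most recent test separates the groups on its own. The horizon `X`, the day numbers, and the `recency` helper are all assumptions made for illustration.

```python
from statistics import mean

X = 30  # prediction horizon in days (hypothetical value)

# Test dates and diagnosis dates as day numbers; all values are invented.
sick = [
    {"tests": [100, 160, 200], "diagnosis": 240},
    {"tests": [300, 350, 395], "diagnosis": 430},
    {"tests": [50, 95], "diagnosis": 140},
]
healthy = [
    {"tests": [20, 110]},
    {"tests": [150, 400]},
    {"tests": [60, 180, 500]},
]

def recency(tests, cutoff):
    """Days from the most recent test at or before the cutoff to the cutoff."""
    return cutoff - max(t for t in tests if t <= cutoff)

# Sick patients: the cutoff is X days before their own diagnosis.
sick_recency = [recency(p["tests"], p["diagnosis"] - X) for p in sick]

# Healthy patients: one shared cutoff, X days before the average diagnosis date.
shared_cutoff = mean(p["diagnosis"] for p in sick) - X
healthy_recency = [recency(p["tests"], shared_cutoff) for p in healthy]

print("sick:   ", sick_recency)
print("healthy:", healthy_recency)
```

In this toy data a single threshold on recency splits the two groups completely, which is the model learning how the windows were cut rather than anything about the disease.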
Take 2
A quarterly snapshot, with different positives & negatives each quarter
If we allow repeat patients we get correlated examples
If we randomly assign a patient to a quarter we don’t have enough positives
If we deduplicate but keep all positives we get a skewed distribution.
Feature engineering
Each of the flaws is easily spotted when we look at a well-engineered feature that exploits it
Poorly engineered features may exploit the leak/bias only to a limited extent, so it never gets discovered
Complex models with simple features can exploit the leaks fully, but they are opaque, so the leak can go unnoticed
Automatic feature discovery
Exploit each leak to its fullest
Human-understandable top insights show target leaks
Allow data scientists to focus on problem definition and complex feature engineering, and to iterate rapidly.
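As a rough illustration of how top insights can surface a leak, the sketch below scores each feature on its own with a rank-based AUC and flags any that look too good to be true. It is a minimal stand-in written for this example, not SparkBeyond's engine; the feature names, values, and the 0.95 threshold are assumptions.

```python
def auc(scores, labels):
    """Rank-based AUC: chance a random positive outscores a random negative.

    Ties count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def flag_suspicious(features, labels, threshold=0.95):
    """Return (name, strength) for features whose solo AUC is suspiciously high.

    Strength is max(AUC, 1 - AUC), so the direction of the relationship
    does not matter.
    """
    flagged = []
    for name, values in features.items():
        a = auc(values, labels)
        strength = max(a, 1 - a)
        if strength >= threshold:
            flagged.append((name, strength))
    return sorted(flagged, key=lambda item: -item[1])

# Toy data in the spirit of the mining example; all values are invented.
labels = [1, 1, 1, 0, 0, 0, 0, 0]
features = {
    "pit_depth_m": [120, 90, 150, 0, 2, 1, 0, 3],
    "distance_to_road_km": [1, 2, 1, 30, 45, 3, 60, 25],
    "magnetic_anomaly": [5, 2, 7, 3, 6, 1, 4, 8],
}

flagged = flag_suspicious(features, labels)
for name, strength in flagged:
    print(f"{name}: {strength:.2f}")
```

A feature that predicts the target almost perfectly by itself is a prompt to ask how the data was assembled before celebrating the score.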
Join Us
http://www.sparkbeyond.com/careers
Try the SparkBeyond Challenge: http://bit.ly/dss16-quiz