Make Sense Out of Data with Feature Engineering Xavier Conort Chief Data Scientist at DataRobot @Melbourne Data Science Initiative 2016

Page 1: Make Sense Out of Data with Feature Engineering

Make Sense Out of Data with Feature Engineering

Xavier Conort, Chief Data Scientist at DataRobot

@Melbourne Data Science Initiative 2016

Page 2: Make Sense Out of Data with Feature Engineering

Agenda

● Preamble
● 2 examples
● Key takeaways

Page 3: Make Sense Out of Data with Feature Engineering

Automation is an integral part of human civilization

Page 4: Make Sense Out of Data with Feature Engineering

Key elements for a successful car journey

● Car
● Destination
● Crude oil
● Refined oil: process oil into more useful products such as gasoline
● A successful journey

Page 5: Make Sense Out of Data with Feature Engineering

Key elements for a successful data science journey

● Car = Modelling engine: Machine Learning solutions are increasingly replacing traditional statistical approaches; they can automate the modelling process and produce world-class predictive accuracy without much effort

● Destination = Outcome: a well-defined outcome to predict and a well-defined process for using it to optimize business problems

● Crude oil = Raw data: increased volume, and capacity to handle terabytes of data

● Refined oil = Feature Engineering: the talent to extract from raw data information that can be used by models

● open source programming

● social network of coders

● automated solutions

Page 6: Make Sense Out of Data with Feature Engineering

Refined oil for Machine Learning: Flat File Data Format

© DataRobot, Inc. All rights reserved. Confidential

● 1 record per prediction event

● 1 column for each predictive field / feature

● 1 column for the value to be predicted (training data only)

ID  Feature 1  Feature 2  Feature 3  Feature 4  Feature 5  Target
1   0.73       Female     5.28       Thursday   37.52      Yes
2   0.20       Male       4.20       Tuesday    35.04      Yes
3   1.82       Male       14.71      Friday     7.02       Yes
4   -0.69      Female     11.82      Sunday     -3.29      No
5   1.07       Male       16.55      Monday     12.59      Yes
6   -0.27      Male       10.87      Thursday   -8.19      Yes
7   2.88       Male       5.24       Wednesday  -21.67     No
8   1.35       Female     9.40       Tuesday    9.70       Yes
9   0.73       Female     1.04       Sunday     26.60      Yes
10  0.02       Female     -9.79      Saturday   -14.47     Yes
11  3.43       Male       11.59      Thursday   27.48      No
12  2.56       Female     -13.25     Saturday   12.41      No

Page 7: Make Sense Out of Data with Feature Engineering

Feature Engineering that we will cover today

● Variables you should not use

● Dealing with high dimensional features

● Using external data to add valuable information

● Dealing with transactional data

Page 8: Make Sense Out of Data with Feature Engineering

Example 1:

● Hosted by Practice Fusion, a cloud-based electronic health record platform for doctors and patients

● Challenge: Given a de-identified data set of patient electronic health records, build a model to determine who has a diabetes diagnosis

● Data:
  ○ 17 tables containing 4 years' history of medical records!

Page 9: Make Sense Out of Data with Feature Engineering

Think of variables you should not use

Feature Engineering the YearOfBirth Value

● We expect that as a patient gets older their risk of diabetes will increase, yet their YearOfBirth value remains static

● We need a feature that changes as the patient gets older

● The true predictor of diabetes is more likely to be age than year of birth

● The data is extracted at the end of 2012

● Age = 2012 - YearOfBirth
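
The age derivation above can be sketched in a few lines. This is a minimal illustration, not the speaker's code: the record layout and the helper name `age_at_extraction` are assumptions, and only the `YearOfBirth` field and the 2012 extraction year come from the slide.

```python
EXTRACTION_YEAR = 2012  # from the slide: the data is extracted at the end of 2012


def age_at_extraction(year_of_birth, extraction_year=EXTRACTION_YEAR):
    """Age = extraction year - YearOfBirth (hypothetical helper)."""
    return extraction_year - year_of_birth


# Hypothetical patient records, for illustration only
patients = [
    {"PatientGuid": "A", "YearOfBirth": 1950},
    {"PatientGuid": "B", "YearOfBirth": 1985},
]

# Add the engineered feature next to the raw one
for patient in patients:
    patient["Age"] = age_at_extraction(patient["YearOfBirth"])

ages = [patient["Age"] for patient in patients]
```

Keeping the extraction year as a parameter makes it explicit that the feature depends on when the snapshot was taken.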

Page 10: Make Sense Out of Data with Feature Engineering

Learn to deal with high dimensionality

● Add characteristics of levels of categorical features to see how similar they are

● Use location for regional categories to see how close they are

● Group hierarchical categories together based on those hierarchies

● Text mine the descriptions

● Use the overall ordinal frequency ranking as a feature

● Top Kagglers' likelihood / credibility features
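
Two of the ideas above, the frequency ranking and the likelihood / credibility feature, can be sketched as follows. This is one possible reading of those bullets, not a confirmed recipe from the talk: the function names, the `prior_weight` shrinkage parameter and the sample data are all assumptions.

```python
from collections import Counter, defaultdict


def frequency_rank(values):
    """Map each level to its rank by overall frequency (1 = most frequent)."""
    counts = Counter(values)
    # Sort by descending count, then by level name so ties are deterministic
    ordered = sorted(counts, key=lambda level: (-counts[level], level))
    return {level: rank for rank, level in enumerate(ordered, start=1)}


def credibility_likelihood(levels, targets, prior_weight=10.0):
    """Mean target per level, shrunk toward the global mean.

    Levels with few observations stay close to the global mean;
    prior_weight (an assumed parameter) controls how strongly.
    """
    global_mean = sum(targets) / len(targets)
    sums = defaultdict(float)
    counts = Counter()
    for level, y in zip(levels, targets):
        sums[level] += y
        counts[level] += 1
    return {
        level: (sums[level] + prior_weight * global_mean) / (counts[level] + prior_weight)
        for level in counts
    }


# Hypothetical data: a categorical feature and a binary target
states = ["CA", "CA", "CA", "NY", "NY", "TX"]
has_diabetes = [1, 0, 1, 0, 0, 1]

ranks = frequency_rank(states)
encoded = credibility_likelihood(states, has_diabetes, prior_weight=2.0)
```

Because the likelihood feature is built from the target, it is usually computed out of fold (on rows other than the one being encoded) to avoid leaking the target into the training data; the sketch omits that step for brevity.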

Page 11: Make Sense Out of Data with Feature Engineering

Case study: engineer state variable

● Replace each state with the latitude and longitude of its centroid, so that the machine learning algorithms can know which states are near (and possibly similar to) each other

● Centroids for each of the 51 states: http://dev.maxmind.com/geoip/legacy/codes/state_latlon/
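
A minimal sketch of the state encoding, assuming a lookup table of centroids. The coordinates below are rough illustrative values typed in by hand, not copied from the linked table, and the names `STATE_CENTROIDS`, `state_features` and `haversine_km` are hypothetical.

```python
import math

# Rough, illustrative (latitude, longitude) centroids in degrees.
# Assumption: in practice these would be loaded from the linked table.
STATE_CENTROIDS = {
    "CA": (36.2, -119.7),
    "NV": (38.4, -117.1),
    "NY": (42.1, -74.9),
}


def state_features(state):
    """Replace a state code with two numeric features."""
    lat, lon = STATE_CENTROIDS[state]
    return {"state_lat": lat, "state_lon": lon}


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (
        math.sin((lat2 - lat1) / 2) ** 2
        + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2
    )
    return 2 * 6371.0 * math.asin(math.sqrt(h))


ca_nv = haversine_km(STATE_CENTROIDS["CA"], STATE_CENTROIDS["NV"])
ca_ny = haversine_km(STATE_CENTROIDS["CA"], STATE_CENTROIDS["NY"])
```

With numeric coordinates, neighbouring states such as CA and NV end up closer in feature space than CA and NY, which a plain categorical code cannot express.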

Page 12: Make Sense Out of Data with Feature Engineering

Case study: engineer diagnosis

● Use the ICD9 code groupings https://en.wikipedia.org/wiki/List_of_ICD-9_codes

○ So that the machine learning algorithms can know which diagnoses are similar to each other

○ Count the observations in each group

● Use the ICD9 descriptions
  ○ Do text mining on the descriptions to find words or phrases within the descriptions
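
The grouping idea can be sketched by mapping each code to its ICD-9 chapter range and counting per patient. The chapter ranges follow the standard ICD-9 numeric chapters; the short group labels, the function names and the `(patient_id, code)` row layout are assumptions made for this illustration.

```python
from collections import Counter

# ICD-9 numeric chapter ranges; the short labels are this sketch's own
ICD9_CHAPTERS = [
    (1, 139, "infectious"),
    (140, 239, "neoplasms"),
    (240, 279, "endocrine"),
    (280, 289, "blood"),
    (290, 319, "mental"),
    (320, 389, "nervous"),
    (390, 459, "circulatory"),
    (460, 519, "respiratory"),
    (520, 579, "digestive"),
    (580, 629, "genitourinary"),
    (630, 679, "pregnancy"),
    (680, 709, "skin"),
    (710, 739, "musculoskeletal"),
    (740, 759, "congenital"),
    (760, 779, "perinatal"),
    (780, 799, "symptoms"),
    (800, 999, "injury"),
]


def icd9_group(code):
    """Return the chapter label for an ICD-9 code such as '454.1'."""
    code = code.strip().upper()
    if code.startswith(("E", "V")):
        return "supplementary"  # E and V codes sit outside the numeric chapters
    major = int(code.split(".")[0])
    for low, high, name in ICD9_CHAPTERS:
        if low <= major <= high:
            return name
    return "unknown"


def group_counts(rows):
    """Count diagnoses per chapter for each patient."""
    counts = {}
    for patient_id, code in rows:
        counts.setdefault(patient_id, Counter())[icd9_group(code)] += 1
    return counts


# Sample (patient, code) rows for illustration
rows = [
    ("12345", "391"), ("12345", "401"), ("12345", "401"),
    ("12345", "454.1"), ("12346", "410.3"), ("12346", "463"),
]
counts = group_counts(rows)
```

Each chapter count then becomes one column in the flat file, so a few dozen columns stand in for thousands of distinct codes.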

Page 13: Make Sense Out of Data with Feature Engineering

Case Study: engineer drugs

● Use drug databases http://www.fda.gov/drugs/informationondrugs/ucm142438.htm

● To enable the machine learning algorithm to know which drugs are similar:

○ Replace proprietary brand names with generic medication names

○ Text mine the list of pharmaceutical classes

Page 14: Make Sense Out of Data with Feature Engineering

But What About Relational Databases?

Challenge: many records per patient

○ 9,948 patients
○ 196,290 transcripts
○ 142,741 diagnostics
○ 66,487 medications
○ 3,030 lab results

Page 15: Make Sense Out of Data with Feature Engineering

Deal with one to many relationships

● Create predictive fields using summary statistics
  ○ e.g. averages of last 24 hours / week / month / year
  ○ e.g. variance or standard deviation
  ○ e.g. entropy
  ○ e.g. maximum or minimum values
  ○ e.g. counts
  ○ e.g. most frequent value
  ○ e.g. sequences of events
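
A sketch of how several of these summaries might collapse one patient's many records into a single row. The event layout `(visit_date, systolic_bp, specialty)`, the 30-day window and the function name `summarize` are assumptions chosen for illustration, and the data is made up.

```python
from collections import Counter
from datetime import date, timedelta
from statistics import mean, pstdev


def summarize(events, as_of, window_days=30):
    """Summarize a non-empty list of (date, value, kind) events for one patient."""
    cutoff = as_of - timedelta(days=window_days)
    values = [value for _, value, _ in events]
    # Values observed in the trailing window ending at as_of
    recent = [value for day, value, _ in events if cutoff < day <= as_of]
    kinds = Counter(kind for _, _, kind in events)
    return {
        "count": len(events),
        "mean_all": mean(values),
        "std_all": pstdev(values),
        "min_all": min(values),
        "max_all": max(values),
        "mean_recent": mean(recent) if recent else None,
        "most_frequent_kind": kinds.most_common(1)[0][0],
    }


# Hypothetical visits: (visit_date, systolic_bp, specialty)
events = [
    (date(2012, 1, 10), 120.0, "GP"),
    (date(2012, 11, 20), 140.0, "Cardiology"),
    (date(2012, 12, 15), 130.0, "GP"),
    (date(2012, 12, 28), 150.0, "GP"),
]
features = summarize(events, as_of=date(2012, 12, 31))
```

Running `summarize` once per patient yields one record per prediction event, which is the flat layout the models expect.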

Page 16: Make Sense Out of Data with Feature Engineering

Case Study: Relationship Between a Patient and Diagnosis

● One patient to many diagnoses

● Uniquely joined via PatientGuid

Page 17: Make Sense Out of Data with Feature Engineering

Case Study: Build sequences string

Feature Engineering the ICD9 Codes

● One patient can have from 1 to 75 diagnoses

● We need to compress this data to a single record per patient

● One way is to concatenate the sequence of diagnoses into a string and do text mining models that use ngrams on that string

● It often helps to remove consecutive duplicates

● Sometimes it is useful to know the first and last events, and the times of those events

Patient ID  ICD9Code
12345       391
12345       401
12345       401
12345       454.1
12346       410.3
12346       463
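
The sequence-string idea can be sketched on the sample rows above. The sketch assumes the rows are already in chronological order (the table carries no dates), and the function name and output keys are hypothetical.

```python
from itertools import groupby


def diagnosis_sequences(rows, dedupe_consecutive=True):
    """Build one sequence string per patient from (patient_id, code) rows.

    Assumes rows are in chronological order within each patient.
    """
    per_patient = {}
    for patient_id, code in rows:
        per_patient.setdefault(patient_id, []).append(code)

    result = {}
    for patient_id, codes in per_patient.items():
        if dedupe_consecutive:
            # groupby collapses runs of identical consecutive codes
            codes = [code for code, _ in groupby(codes)]
        result[patient_id] = {
            "sequence": " ".join(codes),
            "first": codes[0],
            "last": codes[-1],
        }
    return result


rows = [
    ("12345", "391"), ("12345", "401"), ("12345", "401"),
    ("12345", "454.1"), ("12346", "410.3"), ("12346", "463"),
]
sequences = diagnosis_sequences(rows)
```

The resulting strings can be fed to any n-gram text model, with each code acting as a "word".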

Page 18: Make Sense Out of Data with Feature Engineering

Case Study: Count and entropy

Feature Engineering the ICD9 Codes

● One patient can have from 1 to 75 diagnoses

● We need to compress this data to a single record per patient

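● Sometimes it is useful to know
  ○ The number of events
  ○ The most common event type
  ○ The level of variety of event types, e.g. the entropy (as my colleague Owen Zhang did for the KDD Cup 2015)

Patient ID  ICD9Code
12345       391
12345       401
12345       401
12345       454.1
12346       410.3
12346       463

Entropy term for a code that makes up a proportion propn of a patient's diagnoses (e.g. code 391 for patient 12345: 1 of 4 diagnoses); the entropy is the sum of these terms over the patient's distinct codes:

-propn * ln(propn) = -0.25 * ln(0.25) = 0.347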
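
A short sketch of the three summaries for one patient, using patient 12345's codes. The value 0.347 on the slide is the term for a single code with proportion 0.25; the full entropy below sums such terms over all distinct codes. The function name and output keys are assumptions.

```python
import math
from collections import Counter


def event_summary(codes):
    """Count, most common code and natural-log entropy of a list of codes."""
    counts = Counter(codes)
    total = len(codes)
    # Sum of -p * ln(p) over the distinct codes
    entropy = -sum((n / total) * math.log(n / total) for n in counts.values())
    return {
        "n_events": total,
        "most_common": counts.most_common(1)[0][0],
        "entropy": entropy,
    }


# Patient 12345: proportions are 0.25 (391), 0.5 (401) and 0.25 (454.1)
summary = event_summary(["391", "401", "401", "454.1"])
single_term = -0.25 * math.log(0.25)
```

A patient whose diagnoses are all the same code gets entropy 0, while a patient with many different codes gets a higher value.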

Page 19: Make Sense Out of Data with Feature Engineering

Case Study: Timing stats

Feature Engineering the Timing of Diagnoses

● One patient can have from 1 to 75 diagnoses

● We need to compress this data to a single record per patient

● Sometimes it is useful to know information about the timing of events

○ The range of event times e.g. mean, median, maximum, minimum, quantiles

○ The amount of time between events e.g. mean, median, maximum, minimum, quantiles

○ The regularity of the timing of events e.g. variance
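
These timing summaries might be computed as below. It is a sketch under assumptions: event dates are available per patient, a reference date (here the end of 2012) anchors the "how long ago" features, and the feature names are made up for illustration.

```python
from datetime import date
from statistics import mean, median, pvariance


def timing_features(event_dates, reference_date):
    """Timing summaries for one patient's non-empty list of event dates."""
    dates = sorted(event_dates)
    # How many days before the reference date each event happened
    days_ago = [(reference_date - d).days for d in dates]
    # Days elapsed between consecutive events
    gaps = [(later - earlier).days for earlier, later in zip(dates, dates[1:])]

    features = {
        "days_since_first": max(days_ago),
        "days_since_last": min(days_ago),
        "median_days_ago": median(days_ago),
    }
    if gaps:  # gap statistics need at least two events
        features.update(
            mean_gap=mean(gaps),
            max_gap=max(gaps),
            gap_variance=pvariance(gaps),  # 0 means perfectly regular timing
        )
    return features


# Hypothetical diagnosis dates, exactly 30 days apart
dates = [date(2012, 1, 1), date(2012, 1, 31), date(2012, 3, 1), date(2012, 3, 31)]
features = timing_features(dates, reference_date=date(2012, 12, 31))
```

Quantiles of `days_ago` or `gaps` could be added in the same way when a patient has enough events to make them meaningful.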

Page 20: Make Sense Out of Data with Feature Engineering

Example 2:

Hosted by XuetangX, a Chinese MOOC learning platform initiated by Tsinghua University

Challenge: predict whether a user will drop a course within the next 10 days based on his or her prior activities.

Data:

enrollment_train (120K rows) / enrollment_test (80K rows)
Columns: enrollment_id, username, course_id

log_train / log_test
Columns: enrollment_id, time, source, event, object

object
Columns: course_id, module_id, category, children, start

truth_train
Columns: enrollment_id, dropped_out

Page 21: Make Sense Out of Data with Feature Engineering

We applied the same recipes to the log data (5890 objects) and generated a flat file with 100s of features!!!

Page 22: Make Sense Out of Data with Feature Engineering

Techniques we used in … to describe course, enrollment and students from log data:

● counts
● time statistics (min, mean, max, diff)
● entropy
● sequences treated as text, on which we ran SVD and logistic regression on 3grams
● 20 first components of SVD on user x object

More can be found in http://www.slideshare.net/DataRobot/featurizing-log-data-before-xgboost
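
The "sequences treated as text" step starts by turning each enrollment's event sequence into 3-gram counts. The sketch below covers only that counting step, not the SVD or logistic regression that followed; the event names and the function name are hypothetical.

```python
from collections import Counter


def ngram_counts(events, n=3):
    """Count the n-grams of consecutive events in one enrollment's log."""
    return Counter(
        " ".join(events[i:i + n]) for i in range(len(events) - n + 1)
    )


# Hypothetical event sequence for one enrollment_id, in time order
events = ["access", "video", "access", "video", "access", "problem"]
grams = ngram_counts(events)
```

Each distinct 3-gram becomes a column of a sparse enrollment-by-3-gram matrix, which is the kind of input a linear model or an SVD can then work on.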

Page 23: Make Sense Out of Data with Feature Engineering

Key takeaways

Machine Learning (ML) can automatically generate world-class predictive accuracy

But feature engineering is still an art that requires a lot of creativity, business insight, curiosity and effort

Be careful! An infinite number of features can be generated… Start with winning recipes (steal them from others and make up your own) and then iterate with new recipes, ideas, external data... Stop when you don’t get much additional accuracy