Data Transformation and Feature Engineering
Charles Parker, Allston Trading
• Oregon State University (Structured output spaces)
• Music recognition
• Real-time strategy game-playing
• Kodak Research Labs
• Media classification (audio, video)
• Document Classification
• Performance Evaluation
• BigML
• Allston Trading (applying machine learning to market data)
Full Disclosure
• But it’s “machine learning”!
• Your data sucks (or at least I hope it does) . . .
• Data is broken
• Data is incomplete
• . . . but you know about it!
• Make the problem easier
• Make the answer more obvious
• Don’t waste time modeling the obvious
• Until you find the right algorithm for it
Data Transformation
Your Data Sucks I: Broken Features
• Suppose you have a market data feature called trade imbalance = (buy - sell) / total volume that you calculate every five minutes
• Now suppose there are no trades over five minutes
• What to do?
• Point or feature removal
• Easy default
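A minimal sketch of one way to handle the zero-volume case, returning an explicit missing value instead of dividing by zero (the function name and the option of a neutral default are illustrative assumptions):

```python
def trade_imbalance(buy_volume, sell_volume, default=None):
    """Compute (buy - sell) / total volume for one five-minute window.

    If there were no trades, the ratio is undefined; return an explicit
    missing value (or a caller-chosen neutral default) rather than crash.
    """
    total = buy_volume + sell_volume
    if total == 0:
        return default  # None marks the point for removal or later imputation
    return (buy_volume - sell_volume) / total

print(trade_imbalance(120, 80))  # 0.2
print(trade_imbalance(0, 0))     # None -> drop the point, or impute a default
```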
Your Data Sucks II: Missing Values
• Suppose you’re building a model to predict the presence or absence of cancer
• Each feature is a medical test
• Some are simple (height, weight, temperature)
• Some are complex (blood counts, CAT scan)
• Some patients have had all of these done, some have not.
• Does the presence or absence of a CAT scan tell you something? Should it be a feature?
Height   Weight   Blood Test   Cancer?
179      80                    No
160      60       2,4          No
150      65       4,5          Yes
155      70                    No
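If the presence of a test is itself informative, one option is to add a "was this test done?" indicator alongside an imputed value. A minimal pandas sketch under that assumption (column names, the median imputation, and the numeric values are illustrative, loosely mirroring the table above):

```python
import pandas as pd

df = pd.DataFrame({
    "height": [179, 160, 150, 155],
    "weight": [80, 60, 65, 70],
    "blood_test": [None, 2.4, 4.5, None],   # missing for untested patients
    "cancer": ["No", "No", "Yes", "No"],
})

# Encode the missingness itself as a feature, then impute the value
df["had_blood_test"] = df["blood_test"].notna().astype(int)
df["blood_test"] = df["blood_test"].fillna(df["blood_test"].median())
print(df)
```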
Simplifying Your Problem
• What about the class variable?
• It’s just another feature, so it can be engineered
• Change the problem
• Do you need so many classes?
• Do you need to do a regression?
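For instance, a continuous target can often be collapsed into a handful of classes; a minimal sketch (the target values and thresholds are illustrative):

```python
import numpy as np

# Instead of regressing on the exact value, predict a coarse class
y_continuous = np.array([1.2, 4.0, 8.5, 15.0, 30.0])
y_class = np.digitize(y_continuous, bins=[5.0, 20.0])  # 0 = low, 1 = medium, 2 = high
print(y_class)  # [0 0 1 1 2]
```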
Feature Engineering: What?
• Your data may be too “raw” for learning
• Multimedia Data
• Raw text data
• Something must be done to make the data “learnable”
• Compute edge histograms, SIFT features
• Do word counts, latent topic modeling
An Instructive Example
• Build a model to determine if two geo-coordinates are walking distance from one another
Lat. 1       Long. 1       Lat. 2       Long. 2        Can Walk?
48.871507    2.354350      48.872111    2.354933       Yes
48.872111    2.354933      44.597422    -123.248367    No
48.872232    2.354211      48.872111    2.354933       Yes
44.597422    -123.248367   48.872232    2.354211       No
• Whether two points are walking distance from each other is not an obvious function of the latitude and longitude
• But it is an obvious function of the distance between the two points
• Unfortunately, that function is quite complicated
• Fortunately, you know it already!
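The function in question is the great-circle (haversine) distance; a minimal sketch of collapsing the four raw coordinates into a single distance feature:

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points, in kilometres."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

# Replace the four raw coordinates with one engineered feature
print(haversine_km(48.872111, 2.354933, 44.597422, -123.248367))  # thousands of km
print(haversine_km(48.871507, 2.354350, 48.872111, 2.354933))     # well within walking range
```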
An Instructive Example
• Build a model to determine if two geo-coordinates are walking distance from one another
Lat. 1       Long. 1       Lat. 2       Long. 2        Distance (km)   Can Walk?
48.871507    2.354350      48.872111    2.354933       2               Yes
48.872111    2.354933      44.597422    -123.248367    9059            No
48.872232    2.354211      48.872111    2.354933       5               Yes
44.597422    -123.248367   48.872232    2.354211       9056            No
Feature Engineering
• One of the core (maybe the core) competencies of a machine learning engineer
• Requires domain understanding
• Requires algorithm understanding
• If you do it really well, you eliminate the need for machine learning entirely
• Gives you another path to success; you can often substitute domain knowledge for modeling expertise
• But what if you don’t have specific domain knowledge?
Techniques I: Discretization
• Construct meaningful bins for a continuous feature (two or more)
• Body temperature
• Credit score
• The new features are categorical features, each category of which has nice semantics
• Don’t make the algorithm waste effort modeling things that you already know about
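A minimal sketch using body temperature, with bin edges chosen from domain knowledge rather than learned from data (the edges and labels are illustrative):

```python
import numpy as np

body_temp_c = np.array([36.4, 36.8, 37.6, 39.1, 40.3])

bins = [37.5, 39.0]                                   # normal / fever / high fever
labels = np.array(["normal", "fever", "high fever"])
print(labels[np.digitize(body_temp_c, bins)])
# ['normal' 'normal' 'fever' 'high fever' 'high fever']
```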
Techniques II: Delta
• Sometimes, the difference between two features is the important bit
• As it was in the distance example
• Also holds a lot in the time domain
• Example: Hiss in speech recognition
• Struggling? Just differentiate! (In all seriousness, this sometimes works)
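A minimal sketch of a first-difference (delta) feature over a time series (the values are illustrative):

```python
import numpy as np

signal = np.array([10.0, 10.2, 10.1, 14.8, 15.0])

# First difference: each point becomes "how much did it change since last time?"
delta = np.diff(signal, prepend=signal[0])
print(delta)  # ~[0.0, 0.2, -0.1, 4.7, 0.2]
```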
Techniques III: Windowing
• If points are distributed in time, previous points in the same window are often very informative
• Weather
• Stock prices
• Add this to a 1-d sequence of points to get an instant machine learning problem!
• Sensor data
• User behavior
• Maybe add some delta features?
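A minimal pandas sketch of turning a 1-d price series into lagged window features plus a delta (the window size and column names are illustrative):

```python
import pandas as pd

prices = pd.Series([101.0, 102.5, 101.8, 103.2, 104.0], name="price")

# Predict each value from the previous two values and their difference
frame = pd.DataFrame({
    "lag_1": prices.shift(1),
    "lag_2": prices.shift(2),
    "delta_1": prices.shift(1) - prices.shift(2),
    "target": prices,
}).dropna()
print(frame)
```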
Techniques IV: Standardization
• Constrain each feature to have a mean of zero and standard deviation of one (subtract the mean and divide by the standard deviation).
• Good for domains with heterogeneous but Gaussian-distributed data sources
• Demographic data
• Medical testing
• Note that this generally has no effect on decision trees!
• Transformation is order preserving
• Decision tree splits rely only on ordering!
• Good for things like k-NN
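A minimal scikit-learn sketch; the statistics are learned on the training data and reused for test data (the numbers are illustrative):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[170.0, 65.0], [180.0, 90.0], [160.0, 55.0]])
X_test = np.array([[175.0, 70.0]])

scaler = StandardScaler().fit(X_train)   # learn mean and std on training data only
print(scaler.transform(X_train))         # each column now has mean 0, std 1
print(scaler.transform(X_test))          # test data uses the training statistics
```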
Techniques V: Normalization
• Force each feature vector to have unit norm (e.g., [0, 1, 0] -> [0, 1, 0] and [1, 1, 1] -> [0.57, 0.57, 0.57])
• Nice for sparse feature spaces like text
• Helps us tell the difference between documents and dictionaries
• We’ll come back to the idea of sparsity
• Note that this will affect decision trees
• Does not necessarily preserve order (co-dependency between features)
• A lesson against over-generalization of technique!
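A minimal scikit-learn sketch reproducing the example above:

```python
import numpy as np
from sklearn.preprocessing import normalize

X = np.array([[0.0, 1.0, 0.0],
              [1.0, 1.0, 1.0]])

# Scale each row (each feature vector) to unit L2 norm
print(normalize(X, norm="l2"))
# [[0.    1.    0.   ]
#  [0.577 0.577 0.577]]  (approximately)
```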
What Do We Really Want?
• This is nice, but what ever happened to “machine learning”?
• Construct a feature space in which “learning is easy”, whatever that means
• The space must preserve “important aspects of the data”, whatever that means
• Are there general ways of posing this problem? (Spoiler alert: yes)
Aside I: Projection
• A projection is a mapping from one feature space to another
• We want a function f(x) that projects a point x into a space where a good classifier is obvious
• The axes (features) in your new space are called your new basis
A Hack Projection: Distance to Cluster
• Do clustering on your data
• For each point, compute the distance to each cluster centroid
• These distances are your new features
• The new space can be either higher or lower dimensional than your original space
• For highly clustered data, this can be a fairly powerful feature space
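A minimal scikit-learn sketch (k-means, the synthetic data, and the number of clusters are illustrative choices):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200, centers=5, random_state=0)

km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X)
X_new = km.transform(X)   # distance from each point to each cluster centroid
print(X_new.shape)        # (200, 5): five "distance to cluster" features
```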
Principal Component Analysis
• Find the axis through the data with the highest variance
• Repeat for the next orthogonal axis and so on, until you run out of data or dimensions
• Each axis is a feature
PCA is Nice!
• Generally quite fast (matrix decomposition)
• Features are linear combinations of originals (which means you can project test data into the space)
• Features are linearly independent (great for some algorithms)
• Data can often be “explained” with just the first few components (so this can be “dimensionality reduction”)
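A minimal scikit-learn sketch on an illustrative dataset (the digits data and the choice of 10 components are assumptions for the example):

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)            # 64 raw pixel features

pca = PCA(n_components=10).fit(X)              # learn the projection on training data
X_proj = pca.transform(X)                      # project (works for test data too)
print(X_proj.shape)                            # (1797, 10)
print(pca.explained_variance_ratio_.sum())     # variance "explained" by 10 components
```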
Spectral Embeddings
• Two of the seminal ones are Isomap and LLE
• Generally, compute the nearest neighbor matrix and use this to create the embedding
• Pro: Pretty spectacular results
• Con: No projection matrix
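A minimal scikit-learn sketch of both methods on an illustrative dataset (the swiss-roll data and neighbor counts are assumptions for the example):

```python
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import Isomap, LocallyLinearEmbedding

X, _ = make_swiss_roll(n_samples=1000, random_state=0)

# Both methods start from a nearest-neighbour graph over the data
iso = Isomap(n_neighbors=10, n_components=2).fit_transform(X)
lle = LocallyLinearEmbedding(n_neighbors=10, n_components=2).fit_transform(X)
print(iso.shape, lle.shape)  # (1000, 2) (1000, 2)
```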
Combination Methods
• Large Margin Nearest Neighbor, Xing’s Method
• Create an objective function that preserves neighbor relationships
• Neighbor distances (unsupervised)
• Closest points of the same class (supervised)
• Clever search for a projection matrix that satisfies this objective (usually an elaborate sort of gradient descent)
• I’ve had some success with these
Aside II: Sparsity
• Machine learning is essentially compression, and constantly plays at the edges of this idea
• Minimum description length
• Bayesian information criteria
• L1 and L2 regularization
• Sparse representations are easily compressed
• So does that mean they’re more powerful?
Sparsity I: Text Data
• Text data is inherently sparse
• The fact that we choose a small number of words to use gives a document its semantics
• Text features are incredibly powerful in the grand scheme of feature spaces
• One or two words allow us to do accurate classification
• But those one or two words must be sparse
Sparsity II: EigenFaces
• Here are the first few components of PCA applied to a collection of face images
• A small number of these explain a huge part of a huge number of faces
• First components are like stop words, last few (sparse) components make recognition easy
Sparsity III: The Fourier Transform
• Very complex waveform
• Turns out to be easily expressible as a combination of a few (i.e., sparse) constant frequency signals
• Such representations make accurate speech recognition possible
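A minimal NumPy sketch: a waveform built from three sinusoids plus noise, recovered as three dominant frequencies (the frequencies and amplitudes are illustrative):

```python
import numpy as np

fs = 1000                                    # samples per second
t = np.arange(0, 1, 1 / fs)

# A "complex" waveform that is really just three sinusoids plus noise
signal = (np.sin(2 * np.pi * 50 * t)
          + 0.5 * np.sin(2 * np.pi * 120 * t)
          + 0.2 * np.sin(2 * np.pi * 300 * t)
          + 0.1 * np.random.randn(len(t)))

spectrum = np.abs(np.fft.rfft(signal))
freqs = np.fft.rfftfreq(len(t), d=1 / fs)
print(freqs[spectrum.argsort()[-3:]])        # ~[300. 120.  50.] dominate the spectrum
```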
Sparse Coding
• Iterate
• Choose a basis
• Evaluate that basis based on how well you can use it to reconstruct the input, and how sparse it is
• Take some sort of gradient step to improve that evaluation
• Andrew Ng’s efficient sparse coding algorithms and Hinton’s deep autoencoders are both flavors of this
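A minimal scikit-learn sketch using mini-batch dictionary learning as one flavor of this loop (the number of atoms and the sparsity penalty are illustrative assumptions):

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import MiniBatchDictionaryLearning

X, _ = load_digits(return_X_y=True)

# Learn a basis ("dictionary") whose sparse combinations reconstruct the input
dico = MiniBatchDictionaryLearning(n_components=50, alpha=1.0, random_state=0)
codes = dico.fit_transform(X)     # one row of sparse coefficients per image
print(codes.shape)                # (1797, 50)
print((codes != 0).mean())        # fraction of non-zero coefficients (the sparsity)
```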
The New Basis
• Text: Topics
• Audio: Frequency Transform
• Visual: Pen Strokes
Another Hack: Totally Random Trees
• Train a bunch of decision trees
• With no objective!
• Each leaf is a feature
• Ta-da! Sparse basis
• This actually works
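scikit-learn ships this idea as RandomTreesEmbedding; a minimal sketch (the dataset, tree count, and depth are illustrative):

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomTreesEmbedding

X, _ = load_digits(return_X_y=True)

# Fit totally random trees; each leaf becomes one sparse binary feature
rte = RandomTreesEmbedding(n_estimators=10, max_depth=5, random_state=0)
X_sparse = rte.fit_transform(X)    # sparse matrix of one-hot encoded leaves
print(X_sparse.shape)              # (1797, total number of leaves across all trees)
```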
And More and More
• There are a ton of variations on these themes
• Dimensionality Reduction
• Metric Learning
• “Coding” or “Encoding”
• Nice canonical implementations can be found at: http://lvdmaaten.github.io/drtoolbox/