Data Transformation and Feature Engineering
Charles Parker, Allston Trading
• Oregon State University (Structured output spaces)
• Music recognition
• Real-time strategy game-playing
• Kodak Research Labs
• Media classification (audio, video)
• Document Classification
• Performance Evaluation
• BigML
• Allston Trading (applying machine learning to market data)
Full Disclosure
• But it’s “machine learning”!
• Your data sucks (or at least I hope it does) . . .
• Data is broken
• Data is incomplete
• . . . but you know about it!
• Make the problem easier
• Make the answer more obvious
• Don’t waste time modeling the obvious
• Until you find the right algorithm for it
Data Transformation
Your Data Sucks I: Broken Features
• Suppose you have a market data feature called trade imbalance = (buy - sell) / total volume that you calculate every five minutes
• Now suppose there are no trades over five minutes
• What to do?
• Point or feature removal
• Easy default
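A minimal sketch of one way to handle the zero-volume case, returning an explicit missing value instead of dividing by zero (the function name and the option of a neutral default are illustrative assumptions):

```python
def trade_imbalance(buy_volume, sell_volume, default=None):
    """Compute (buy - sell) / total volume for one five-minute window.

    If there were no trades, the ratio is undefined; return an explicit
    missing value (or a caller-chosen neutral default) rather than crash.
    """
    total = buy_volume + sell_volume
    if total == 0:
        return default  # None marks the point for removal or later imputation
    return (buy_volume - sell_volume) / total

print(trade_imbalance(120, 80))  # 0.2
print(trade_imbalance(0, 0))     # None -> drop the point, or impute a default
```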
Your Data Sucks II: Missing Values
• Suppose you’re building a model to predict the presence or absence of cancer
• Each feature is a medical test
• Some are simple (height, weight, temperature)
• Some are complex (blood counts, CAT scan)
• Some patients have had all of these done, some have not.
• Does the presence or absence of a CAT scan tell you something? Should it be a feature?
Height   Weight   Blood Test   Cancer?
179      80                    No
160      60       2,4          No
150      65       4,5          Yes
155      70                    No
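If the presence of a test is itself informative, one option is to add a "was this test done?" indicator alongside an imputed value. A minimal pandas sketch under that assumption (column names, the median imputation, and the numeric values are illustrative, loosely mirroring the table above):

```python
import pandas as pd

df = pd.DataFrame({
    "height": [179, 160, 150, 155],
    "weight": [80, 60, 65, 70],
    "blood_test": [None, 2.4, 4.5, None],   # missing for untested patients
    "cancer": ["No", "No", "Yes", "No"],
})

# Encode the missingness itself as a feature, then impute the value
df["had_blood_test"] = df["blood_test"].notna().astype(int)
df["blood_test"] = df["blood_test"].fillna(df["blood_test"].median())
print(df)
```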
Simplifying Your Problem
• What about the class variable?
• It’s just another feature, so it can be engineered
• Change the problem
• Do you need so many classes?
• Do you need to do a regression?
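For instance, a continuous target can often be collapsed into a handful of classes; a minimal sketch (the target values and thresholds are illustrative):

```python
import numpy as np

# Instead of regressing on the exact value, predict a coarse class
y_continuous = np.array([1.2, 4.0, 8.5, 15.0, 30.0])
y_class = np.digitize(y_continuous, bins=[5.0, 20.0])  # 0 = low, 1 = medium, 2 = high
print(y_class)  # [0 0 1 1 2]
```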
Feature Engineering: What?
• Your data may be too “raw” for learning
• Multimedia Data
• Raw text data
• Something must be done to make the data “learnable”
• Compute edge histograms, SIFT features
• Do word counts, latent topic modeling
An Instructive Example
• Build a model to determine if two geo-coordinates are walking distance from one another
Lat. 1       Long. 1       Lat. 2       Long. 2        Can Walk?
48.871507    2.354350      48.872111    2.354933       Yes
48.872111    2.354933      44.597422    -123.248367    No
48.872232    2.354211      48.872111    2.354933       Yes
44.597422    -123.248367   48.872232    2.354211       No
• Whether two points are walking distance from each other is not an obvious function of the latitude and longitude
• But it is an obvious function of the distance between the two points
• Unfortunately, that function is quite complicated
• Fortunately, you know it already!
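The function in question is the great-circle (haversine) distance; a minimal sketch of collapsing the four raw coordinates into a single distance feature:

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points, in kilometres."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

# Replace the four raw coordinates with one engineered feature
print(haversine_km(48.872111, 2.354933, 44.597422, -123.248367))  # thousands of km
print(haversine_km(48.871507, 2.354350, 48.872111, 2.354933))     # well within walking range
```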
An Instructive Example
• Build a model to determine if two geo-coordinates are walking distance from one another
Lat. 1       Long. 1       Lat. 2       Long. 2        Distance (km)   Can Walk?
48.871507    2.354350      48.872111    2.354933       2               Yes
48.872111    2.354933      44.597422    -123.248367    9059            No
48.872232    2.354211      48.872111    2.354933       5               Yes
44.597422    -123.248367   48.872232    2.354211       9056            No
Feature Engineering
• One of the core (maybe the core) competencies of a machine learning engineer
• Requires domain understanding
• Requires algorithm understanding
• If you do it really well, you eliminate the need for machine learning entirely
• Gives you another path to success; you can often substitute domain knowledge for modeling expertise
• But what if you don’t have specific domain knowledge?
Techniques I: Discretization
• Construct meaningful bins for a continuous feature (two or more)
• Body temperature
• Credit score
• The new features are categorical features, each category of which has nice semantics
• Don’t make the algorithm waste effort modeling things that you already know about
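A minimal sketch using body temperature, with bin edges chosen from domain knowledge rather than learned from data (the edges and labels are illustrative):

```python
import numpy as np

body_temp_c = np.array([36.4, 36.8, 37.6, 39.1, 40.3])

bins = [37.5, 39.0]                                   # normal / fever / high fever
labels = np.array(["normal", "fever", "high fever"])
print(labels[np.digitize(body_temp_c, bins)])
# ['normal' 'normal' 'fever' 'high fever' 'high fever']
```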
Techniques II: Delta
• Sometimes, the difference between two features is the important bit
• As it was in the distance example
• Also holds a lot in the time domain
• Example: Hiss in speech recognition
• Struggling? Just differentiate! (In all seriousness, this sometimes works)
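A minimal sketch of a first-difference (delta) feature over a time series (the values are illustrative):

```python
import numpy as np

signal = np.array([10.0, 10.2, 10.1, 14.8, 15.0])

# First difference: each point becomes "how much did it change since last time?"
delta = np.diff(signal, prepend=signal[0])
print(delta)  # ~[0.0, 0.2, -0.1, 4.7, 0.2]
```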
Techniques III: Windowing
• If points are distributed in time, previous points in the same window are often very informative
• Weather
• Stock prices
• Add this to a 1-d sequence of points to get an instant machine learning problem!
• Sensor data
• User behavior
• Maybe add some delta features?
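A minimal pandas sketch of turning a 1-d price series into lagged window features plus a delta (the window size and column names are illustrative):

```python
import pandas as pd

prices = pd.Series([101.0, 102.5, 101.8, 103.2, 104.0], name="price")

# Predict each value from the previous two values and their difference
frame = pd.DataFrame({
    "lag_1": prices.shift(1),
    "lag_2": prices.shift(2),
    "delta_1": prices.shift(1) - prices.shift(2),
    "target": prices,
}).dropna()
print(frame)
```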
Techniques IV: Standardization
• Constrain each feature to have a mean of zero and standard deviation of one (subtract the mean and divide by the standard deviation).
• Good for domains with heterogeneous but Gaussian-distributed data sources
• Demographic data
• Medical testing
• Note that this generally has no effect on decision trees!
• Transformation is order preserving
• Decision tree splits rely only on ordering!
• Good for things like k-NN
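A minimal scikit-learn sketch; the statistics are learned on the training data and reused for test data (the numbers are illustrative):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[170.0, 65.0], [180.0, 90.0], [160.0, 55.0]])
X_test = np.array([[175.0, 70.0]])

scaler = StandardScaler().fit(X_train)   # learn mean and std on training data only
print(scaler.transform(X_train))         # each column now has mean 0, std 1
print(scaler.transform(X_test))          # test data uses the training statistics
```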
Techniques V: Normalization
• Force each feature vector to have unit norm (e.g., [0, 1, 0] -> [0, 1, 0] and [1, 1, 1] -> [0.57, 0.57, 0.57])
• Nice for sparse feature spaces like text
• Helps us tell the difference between documents and dictionaries
• We’ll come back to the idea of sparsity
• Note that this will affect decision trees
• Does not necessarily preserve order (co-dependency between features)
• A lesson against over-generalization of technique!
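A minimal scikit-learn sketch reproducing the example above:

```python
import numpy as np
from sklearn.preprocessing import normalize

X = np.array([[0.0, 1.0, 0.0],
              [1.0, 1.0, 1.0]])

# Scale each row (each feature vector) to unit L2 norm
print(normalize(X, norm="l2"))
# [[0.    1.    0.   ]
#  [0.577 0.577 0.577]]  (approximately)
```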
What Do We Really Want?
• This is nice, but what ever happened to “machine learning”?
• Construct a feature space in which “learning is easy”, whatever that means
• The space must preserve “important aspects of the data”, whatever that means
• Are there general ways of posing this problem? (Spoiler alert: yes)
Aside I: Projection
• A projection is a mapping from one feature space to another
• We want a function f(x) that projects a point x into a space where a good classifier is obvious
• The axes (features) in your new space are called your new basis
A Hack Projection: Distance to Cluster
• Do clustering on your data
• For each point, compute the distance to each cluster centroid
• These distances are your new features
• The new space can be either higher or lower dimensional than your original space
• For highly clustered data, this can be a fairly powerful feature space
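A minimal scikit-learn sketch (k-means, the synthetic data, and the number of clusters are illustrative choices):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200, centers=5, random_state=0)

km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X)
X_new = km.transform(X)   # distance from each point to each cluster centroid
print(X_new.shape)        # (200, 5): five "distance to cluster" features
```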
Principal Component Analysis
• Find the axis through the data with the highest variance
• Repeat for the next orthogonal axis and so on, until you run out of data or dimensions
• Each axis is a feature
PCA is Nice!
• Generally quite fast (matrix decomposition)
• Features are linear combinations of originals (which means you can project test data into the space)
• Features are linearly independent (great for some algorithms)
• Data can often be “explained” with just the first few components (so this can be “dimensionality reduction”)
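A minimal scikit-learn sketch on an illustrative dataset (the digits data and the choice of 10 components are assumptions for the example):

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)            # 64 raw pixel features

pca = PCA(n_components=10).fit(X)              # learn the projection on training data
X_proj = pca.transform(X)                      # project (works for test data too)
print(X_proj.shape)                            # (1797, 10)
print(pca.explained_variance_ratio_.sum())     # variance "explained" by 10 components
```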
Spectral Embeddings
• Two of the seminal ones are Isomap and LLE
• Generally, compute the nearest neighbor matrix and use this to create the embedding
• Pro: Pretty spectacular results
• Con: No projection matrix
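A minimal scikit-learn sketch of both methods on an illustrative dataset (the swiss-roll data and neighbor counts are assumptions for the example):

```python
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import Isomap, LocallyLinearEmbedding

X, _ = make_swiss_roll(n_samples=1000, random_state=0)

# Both methods start from a nearest-neighbour graph over the data
iso = Isomap(n_neighbors=10, n_components=2).fit_transform(X)
lle = LocallyLinearEmbedding(n_neighbors=10, n_components=2).fit_transform(X)
print(iso.shape, lle.shape)  # (1000, 2) (1000, 2)
```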
Combination Methods
• Large Margin Nearest Neighbor, Xing’s Method
• Create an objective function that preserves neighbor relationships
• Neighbor distances (unsupervised)
• Closest points of the same class (supervised)
• Clever search for a projection matrix that satisfies this objective (usually an elaborate sort of gradient descent)
• I’ve had some success with these
Aside II: Sparsity
• Machine learning is essentially compression, and constantly plays at the edges of this idea
• Minimum description length
• Bayesian information criteria
• L1 and L2 regularization
• Sparse representations are easily compressed
• So does that mean they’re more powerful?
Sparsity I: Text Data
• Text data is inherently sparse
• The fact that we choose a small number of words to use gives a document its semantics
• Text features are incredibly powerful in the grand scheme of feature spaces
• One or two words allow us to do accurate classification
• But those one or two words must be sparse
Sparsity II: EigenFaces
• Here are the first few components of PCA applied to a collection of face images
• A small number of these explain a huge part of a huge number of faces
• First components are like stop words, last few (sparse) components make recognition easy
Sparsity III: The Fourier Transform
• Very complex waveform
• Turns out to be easily expressible as a combination of a few (i.e., sparse) constant frequency signals
• Such representations make accurate speech recognition possible
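A minimal NumPy sketch: a waveform built from three sinusoids plus noise, recovered as three dominant frequencies (the frequencies and amplitudes are illustrative):

```python
import numpy as np

fs = 1000                                    # samples per second
t = np.arange(0, 1, 1 / fs)

# A "complex" waveform that is really just three sinusoids plus noise
signal = (np.sin(2 * np.pi * 50 * t)
          + 0.5 * np.sin(2 * np.pi * 120 * t)
          + 0.2 * np.sin(2 * np.pi * 300 * t)
          + 0.1 * np.random.randn(len(t)))

spectrum = np.abs(np.fft.rfft(signal))
freqs = np.fft.rfftfreq(len(t), d=1 / fs)
print(freqs[spectrum.argsort()[-3:]])        # ~[300. 120.  50.] dominate the spectrum
```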
Sparse Coding
• Iterate
• Choose a basis
• Evaluate that basis based on how well you can use it to reconstruct the input, and how sparse it is
• Take some sort of gradient step to improve that evaluation
• Andrew Ng’s efficient sparse coding algorithms and Hinton’s deep autoencoders are both flavors of this
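A minimal scikit-learn sketch using mini-batch dictionary learning as one flavor of this loop (the number of atoms and the sparsity penalty are illustrative assumptions):

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import MiniBatchDictionaryLearning

X, _ = load_digits(return_X_y=True)

# Learn a basis ("dictionary") whose sparse combinations reconstruct the input
dico = MiniBatchDictionaryLearning(n_components=50, alpha=1.0, random_state=0)
codes = dico.fit_transform(X)     # one row of sparse coefficients per image
print(codes.shape)                # (1797, 50)
print((codes != 0).mean())        # fraction of non-zero coefficients (the sparsity)
```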
The New Basis
• Text: Topics
• Audio: Frequency Transform
• Visual: Pen Strokes
Another Hack: Totally Random Trees
• Train a bunch of decision trees
• With no objective!
• Each leaf is a feature
• Ta-da! Sparse basis
• This actually works
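scikit-learn ships this idea as RandomTreesEmbedding; a minimal sketch (the dataset, tree count, and depth are illustrative):

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomTreesEmbedding

X, _ = load_digits(return_X_y=True)

# Fit totally random trees; each leaf becomes one sparse binary feature
rte = RandomTreesEmbedding(n_estimators=10, max_depth=5, random_state=0)
X_sparse = rte.fit_transform(X)    # sparse matrix of one-hot encoded leaves
print(X_sparse.shape)              # (1797, total number of leaves across all trees)
```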
And More and More
• There are a ton of variations on these themes
• Dimensionality Reduction
• Metric Learning
• “Coding” or “Encoding”
• Nice canonical implementations can be found at: http://lvdmaaten.github.io/drtoolbox/