Contact: Ben Beinecke | Ben@Ufora.com | 646-918-6435 | Copyright © 2015
Scaling Your Machine Learning
MLConf 2015 NYC
But Big Data Has Broken The Iterative Workflow
Prototyping is essential for data analysis, but big data has made prototyping expensive, painful, and slow.
The Problem:
1. Small data tools don’t scale
   • e.g. Matlab
2. Fixed frameworks are not customizable enough for real-world problems
   • e.g. Hadoop MapReduce
3. Customized solutions break, are hard to modify, and are expensive to maintain
   • e.g. C++ with MPI
Hand-Coded Infrastructure Isn’t Practical
• Data Science 1.0: Business Logic (Algorithm Code) encumbered with Implementation Logic (Infrastructure Code)
• Data Science 2.0: Implementation Logic is automatic (Business Logic free from Implementation)
Apply Learning Techniques to Data Distribution and Parallelization
[Figure: Data Science 1.0 vs. Data Science 2.0; labels: CPU’s, RAM, Smart Compute]
Part-of-Speech Tagging for Noisy Data Sets
Connie Yee
Text Analytics and Machine Learning (TAML)
Financial & Risk
Part-of-Speech Tagging
• Example: “Plays well with others”
– INPUT: Plays well with others
– AMBIGUITY: NNS/VBZ UH/JJ/NN/RB IN NNS
– OUTPUT: VBZ RB IN NNS
• Many uses including:
– Input to a full parser in order to facilitate deep processing
– Named-entity recognition
– How to pronounce “lead”?
Supervised Classification
A. Training
Training Data → Feature Generator → features → Trainer using Parameter Estimation → Classifier Model
B. Decoding (Prediction on unseen data)
Input Sentence → Feature Generator → features → Decoder using Beam Search → Tag Sequence
• A model includes parameter values for an event and all its possible outcomes
Tagging News and Twitter Data
• Wall St. Journal treebank from UPenn (PTB)
– Training: 38k sentences
– Test: 5k sentences
• Features
– Preceding tags
– Words surrounding target word
– Word shape, such as case, prefix, and suffix
System Accuracy
TAML 96.6
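The three feature types listed for news text (preceding tags, surrounding words, word shape) might be generated along these lines. This is a sketch only: the feature names, the three-character affix length, the one-word context window, and the shape categories are all assumptions, since the slides do not give TAML's actual feature templates.

```python
from typing import Dict, Sequence


def word_shape(word: str) -> str:
    """Coarse case pattern of a token (a hypothetical shape scheme)."""
    if word.isupper():
        return "ALLCAPS"
    if word[:1].isupper():
        return "Capitalized"
    if word.islower():
        return "lower"
    return "other"


def extract_features(words: Sequence[str], i: int, prev_tags: Sequence[str]) -> Dict[str, str]:
    """Build features for the word at position i, given tags already assigned."""
    word = words[i]
    return {
        # Word shape: case, prefix, and suffix
        "shape": word_shape(word),
        "prefix3": word[:3].lower(),
        "suffix3": word[-3:].lower(),
        # Words surrounding the target word
        "word": word.lower(),
        "prev_word": words[i - 1].lower() if i > 0 else "<s>",
        "next_word": words[i + 1].lower() if i + 1 < len(words) else "</s>",
        # Preceding tags
        "prev_tag": prev_tags[-1] if prev_tags else "<s>",
        "prev_two_tags": " ".join(prev_tags[-2:]) if prev_tags else "<s>",
    }


sentence = ["Plays", "well", "with", "others"]
first = extract_features(sentence, 0, [])
second = extract_features(sentence, 1, ["VBZ"])
```

The same function serves both training and decoding: during training `prev_tags` comes from the gold annotation, and during decoding it comes from the partial sequence being extended.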
• Twitter dataset from CMU sampled from 10/27/2010
– Training: 1000 tweets
– Test: 500 tweets
• Build features on top of News features
– Word clustering
   111010100010 : "lmao", "lmfao", "lmaoo", …
   111010100011 : "haha", "hahaha", "hehe", …
– Use PTB as a soft-constraint tag dictionary
System Accuracy
TAML – news features 74.56
+ normalization 84.84
+ word clustering 88.37
+ tag dictionary 88.53
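The word clustering and tag dictionary features can be sketched as follows. The cluster bit strings are the two shown on the slide; treating them as paths in a hierarchical clustering and emitting prefixes of length 4, 8, and 12 is an assumption, as are the feature names. The tag dictionary entries reuse the ambiguity classes from the "Plays well with others" example; a real dictionary would be collected from PTB. "Soft constraint" is read here as: dictionary tags become features the model can weigh, not a hard filter on the output.

```python
from typing import Dict, List, Sequence, Set

# Cluster IDs copied from the slide; a real table would cover many more words.
CLUSTERS: Dict[str, str] = {
    "lmao": "111010100010", "lmfao": "111010100010", "lmaoo": "111010100010",
    "haha": "111010100011", "hahaha": "111010100011", "hehe": "111010100011",
}

# Illustrative entries only; the talk derives its dictionary from PTB.
TAG_DICT: Dict[str, Set[str]] = {
    "plays": {"NNS", "VBZ"},
    "well": {"UH", "JJ", "NN", "RB"},
}


def cluster_features(word: str, prefix_lengths: Sequence[int] = (4, 8, 12)) -> List[str]:
    """Emit one feature per bit-string prefix, so related clusters share features."""
    bits = CLUSTERS.get(word.lower())
    if bits is None:
        return []
    return [f"cluster{n}={bits[:n]}" for n in prefix_lengths]


def tag_dict_features(word: str) -> List[str]:
    """Emit the dictionary tags as features instead of restricting the output."""
    return sorted(f"ptb_tag={t}" for t in TAG_DICT.get(word.lower(), ()))


lmao = cluster_features("lmao")
haha = cluster_features("haha")
```

Because the two clusters differ only in the last bit, "lmao" and "haha" share their shorter prefix features, which lets evidence about one spelling variant carry over to the other.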
Sample Tagged Twitter Data
• Spending_V the_D day_N withhh_P mommma_N !_,
• Its_L hard_A for_P me_O when_R I_O have_V too_R ask_V ,_, is_V it_O really_R that_P dull_A !?_,
• @JBieberzLuvies_@ LOL_! i_O ranther_R go_V see_V payton_^ rae_^ and_& MAYBE_R caitlin_^ beadles_N XDD_E
N Common noun
O Pronoun
^ Proper noun
V Verb
D Determiner
P Pre- or postposition, or subordinating conjunction
R Adverb
A Adjective
L Nominal + verbal
@ At-mention
E Emoticon
, Punctuation
! Interjection
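The `word_TAG` notation in the sample tweets can be read back into (word, tag) pairs with a short helper. This is a sketch, not part of the talk: `parse_tagged` and `describe` are hypothetical names, and the legend dictionary contains only the tags listed on the slide, so a tag such as `&` is reported as not in the legend.

```python
from typing import Dict, List, Tuple

# Tag legend as given on the slide.
TAG_NAMES: Dict[str, str] = {
    "N": "Common noun",
    "O": "Pronoun",
    "^": "Proper noun",
    "V": "Verb",
    "D": "Determiner",
    "P": "Pre- or postposition, or subordinating conjunction",
    "R": "Adverb",
    "A": "Adjective",
    "L": "Nominal + verbal",
    "@": "At-mention",
    "E": "Emoticon",
    ",": "Punctuation",
    "!": "Interjection",
}


def parse_tagged(line: str) -> List[Tuple[str, str]]:
    """Split 'word_TAG' tokens on the last underscore, so ',_,' and '!?_,' work."""
    pairs = []
    for token in line.split():
        word, tag = token.rsplit("_", 1)
        pairs.append((word, tag))
    return pairs


def describe(pairs: List[Tuple[str, str]]) -> List[str]:
    """Attach the legend name to each tag, flagging tags the legend omits."""
    return [f"{w}: {TAG_NAMES.get(t, '(not in legend)')}" for w, t in pairs]


pairs = parse_tagged("Spending_V the_D day_N withhh_P mommma_N !_,")
```

Splitting on the last underscore matters because the word itself may be punctuation or contain symbols, as in `@JBieberzLuvies_@`.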