Roadmap
Classification:
  Feature templates
  Case Study Examples
    Text Categorization
    Coreference Resolution
  Classification Systems
    Overview
    Mallet
Classification Problem Steps
Input processing:
  Split data into training/dev/test
  Convert data into an Attribute-Value Matrix
    Identify candidate features
    Perform feature selection
    Create AVM representation
Training
Testing
Evaluation
Feature Template
Example: Prevword (or w-1)
Template corresponds to many features
  e.g. time flies like an arrow
    w-1=<s>
    w-1=time
    w-1=flies
    w-1=like
    w-1=an
    …
Shorthand for: w-1=<s> 0 or w-1=time 1
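The expansion of a template into individual binary features can be sketched as below. This is only an illustration: the function names and the `<s>` default are assumptions made for the example, not part of any toolkit.

```python
def prevword_features(tokens, start="<s>"):
    """Instantiate the w-1 template once per token position."""
    previous = [start] + tokens[:-1]
    return ["w-1=" + p for p in previous]


def to_sparse_vector(active_feature, feature_space):
    """Expand the shorthand: every feature in the space gets value 0 or 1."""
    return {f: int(f == active_feature) for f in feature_space}


tokens = "time flies like an arrow".split()
features = prevword_features(tokens)

# The feature space here is just the features seen in this one sentence.
feature_space = sorted(set(features))

# Vector for the second token ("flies"), whose previous word is "time".
vector_for_flies = to_sparse_vector(features[1], feature_space)
```

In a real data set the feature space would range over the whole vocabulary, so each instance activates exactly one w-1 feature and leaves the rest at 0.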
AVM Example
Time flies like an arrow
Note: this is a compact form of the true sparse vector
  w-1=w is 0 or 1, for each w in V

      w-1     w0      w-1w0        w+1     label
x1    <s>     Time    <s> Time     flies   N
x2    Time    flies   Time flies   like    V
x3    flies   like    flies like   an      P
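A minimal sketch of how rows like these could be generated from a tagged sentence. The function name and the `</s>` end marker are assumptions for the example; only the three labels shown in the table are used, so the loop stops after the third token.

```python
def build_avm(tokens, labels, start="<s>", end="</s>"):
    """Build one attribute-value row per labeled token."""
    rows = []
    # zip stops at the shorter sequence, so unlabeled tokens are skipped.
    for i, (word, label) in enumerate(zip(tokens, labels)):
        prev_word = tokens[i - 1] if i > 0 else start
        next_word = tokens[i + 1] if i + 1 < len(tokens) else end
        rows.append({
            "w-1": prev_word,
            "w0": word,
            "w-1w0": prev_word + " " + word,
            "w+1": next_word,
            "label": label,
        })
    return rows


tokens = "Time flies like an arrow".split()
labels = ["N", "V", "P"]  # only the labels given in the table
rows = build_avm(tokens, labels)
```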
Text Categorization
Task:
  Given a document, assign it to one of a finite set of classes
What are the classes?
What are the features?
Text 1
Several hundred protesters, some wearing goggles and gas masks, marched past authorities in a downtown street Sunday, hours after riot police forced Occupy Portland demonstrators out of a pair of weeks-old encampments in nearby parks.
Police moved in shortly before noon and drove protesters into the street after dozens remained in the camp in defiance of city officials. Mayor Sam Adams had ordered that the camp shut down Saturday at midnight, citing unhealthy conditions and the encampment’s attraction of drug users and thieves.
Anti-Wall Street protesters and their supporters flooded a city park area in Portland early Sunday in defiance of an eviction order, and authorities elsewhere stepped up pressure against the demonstrators, arresting nearly two dozen. (Nov. 13)
More than 50 protesters were arrested in the police action, but officers did not use tear gas, rubber bullets or other so-called non-lethal weapons, police said.
Washington Post, online 11/13/2011
Text 2
George Washington coach Mike Lonergan looked at the stat sheet, tried to muster a smile then clicked off the reasons why the Colonials lost to No. 24 California on Sunday night.
A piercing 21-0 run by the Golden Bears at the end of the first half was at the top of the list.
Not even a second straight 20-point effort from Tony Taylor was enough to dig George Washington out of the early hole, and the Colonials spent the rest of the night in a futile game of catch-up.
“I’ve never really been involved with a run quite like that,” Lonergan said after Cal’s 81-54 win over George Washington. “I tried calling a couple timeouts. It was very disappointing that we just never really got our composure back the rest of that half. To end it that way and not even score any points, that was basically the game right there.”
Washington Post, online 11/13/2011
Text 3
‘Jersey Boys’ at the National Theatre
By Jane Horwitz, Sunday, November 13, 5:29 PM
“Jersey Boys” is irresistible, and the touring company now at the National Theatre gets it almost entirely right.
This Broadway hit (it has been running since fall 2005 and has played Washington before as well) rises well above the so-called jukebox show genre. Subtitled “The Story of Frankie Valli & the Four Seasons,” the musical tells a tale that transcends show business gossip to become a close character study of four talented but very different blue-collar guys from New Jersey — who just happen to have sung some of the best close-harmony rock/pop tunes of the late 1950s, the 1960s and into the 1970s.
Washington Post, online 11/13/2011
Example: Coreference
Queen Elizabeth set about transforming her husband, King George VI, into a viable monarch. Logue, a renowned speech therapist, was summoned to help the King overcome his speech impediment...
Can be viewed as a classification problem
  What are the inputs?
  What are the categories?
  What features would be useful?
Example: NER
Named Entity tagging:
  John visited New York last Friday
  [person John] visited [location New York] [time last Friday]
As a classification problem
  John/PER-B visited/O New/LOC-B York/LOC-I last/TIME-B Friday/TIME-I
Input? Features? Classes?
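The move from bracketed entities to per-token classes can be sketched as follows. The span representation (start index, exclusive end index, type) and the function name are assumptions for the example; the tag names follow the TYPE-B / TYPE-I pattern used above.

```python
def spans_to_tags(tokens, spans):
    """Turn entity spans into one class label per token."""
    tags = ["O"] * len(tokens)  # default: outside any entity
    for start, end, entity_type in spans:
        tags[start] = entity_type + "-B"       # first token of the entity
        for i in range(start + 1, end):
            tags[i] = entity_type + "-I"       # remaining tokens
    return tags


tokens = "John visited New York last Friday".split()
spans = [(0, 1, "PER"), (2, 4, "LOC"), (4, 6, "TIME")]
tags = spans_to_tags(tokens, spans)
tagged = " ".join(token + "/" + tag for token, tag in zip(tokens, tags))
```

Under this view each token is one classification instance and the tag set is the set of classes.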
Supervision in Classifiers
Supervised:
  True label/class of each training instance is provided to the learner at training time
  Naïve Bayes, MaxEnt, Decision Trees, Neural nets, etc.
Unsupervised:
  No true labels are provided for examples during training
  Clustering: k-means; Min-cut algorithms
Semi-supervised: (bootstrapping)
  True labels are provided for only a subset of examples
  Co-training, semi-supervised SVM/CRF, etc.
Inductive Bias
What form of function is learned?
  Function that separates members of different classes
    Linear separator
    Higher order functions
    Voronoi diagrams, etc.
Graphically, decision boundary
  [Figure: + and - points on either side of a decision boundary]
Machine Learning Functions
Problem: Can the representation effectively model the class to be learned?
Motivates selection of learning algorithm
  [Figure: a group of + points and a group of - points]
For this function,
  Linear discriminant is GREAT!
  Rectangular boundaries (e.g. ID trees) TERRIBLE!
Pick the right representation!
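A small illustration of why the choice of representation matters. The original figure is not recoverable, so the point layout below (classes split by a diagonal line) is an assumption, as are the function names. The sketch compares a linear discriminant with the best single axis-parallel threshold test, the kind of split a decision tree makes at one node.

```python
def linear_accuracy(points, labels, weights, bias):
    """Accuracy of the rule sign(w . x + b)."""
    correct = 0
    for (x1, x2), y in zip(points, labels):
        score = weights[0] * x1 + weights[1] * x2 + bias
        predicted = 1 if score > 0 else -1
        correct += predicted == y
    return correct / len(points)


def best_stump_accuracy(points, labels):
    """Best accuracy of any single axis-parallel threshold test."""
    best = 0.0
    for axis in (0, 1):
        values = sorted({p[axis] for p in points})
        thresholds = [v - 0.5 for v in values] + [values[-1] + 0.5]
        for t in thresholds:
            for sign in (1, -1):  # try both orientations of the split
                correct = sum(
                    (sign if p[axis] > t else -sign) == y
                    for p, y in zip(points, labels)
                )
                best = max(best, correct / len(points))
    return best


# Assumed layout: positives lie above the line x2 = x1, negatives below it.
positives = [(1, 2), (2, 3), (3, 4), (4, 5)]
negatives = [(2, 1), (3, 2), (4, 3), (5, 4)]
points = positives + negatives
labels = [1] * len(positives) + [-1] * len(negatives)

linear = linear_accuracy(points, labels, weights=(-1, 1), bias=0)
stump = best_stump_accuracy(points, labels)
```

On this layout the diagonal separator classifies every point correctly, while no single axis-parallel split does; a tree would need a staircase of many splits to approximate the same boundary.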
Machine Learning Features
Inputs:
  E.g. words, acoustic measurements, parts-of-speech, syntactic structures, semantic classes, ...
Vectors of features:
  E.g. word: letters
    ‘cat’: L1 = c; L2 = a; L3 = t
  Parts of syntax trees?
Machine Learning Features
Questions:
  Which features should be used?
  How should they relate to each other?
Issue 1: How do we define relation in feature space if features have different scales?
  Solution: Scaling/normalization
Issue 2: Which ones are important?
  If instances differ only in an irrelevant feature, the difference should be ignored
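One common form of scaling is min-max normalization, which maps each feature column to the range [0, 1]. The sketch below is one option among several (z-score standardization is another); the function name and the data values are made up for illustration.

```python
def min_max_scale(rows):
    """Rescale each feature column to [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(column) for column in columns]
    highs = [max(column) for column in columns]
    scaled = []
    for row in rows:
        scaled.append([
            # A constant column carries no information; map it to 0.0.
            (value - low) / (high - low) if high > low else 0.0
            for value, low, high in zip(row, lows, highs)
        ])
    return scaled


# Made-up data: the second feature is on a much larger scale than the first.
rows = [[2, 1000], [4, 3000], [6, 2000]]
scaled = min_max_scale(rows)
```

After scaling, a distance computed over both features is no longer dominated by the feature with the larger raw range.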
Machine Learning Toolkits
Many learners, many tools/implementations
Some broad tool sets
  weka
    Java, lots of classifiers, pedagogically oriented
  mallet
    Java, classifiers, sequence learners
    More heavy duty
Mallet
Machine learning toolkit
  Developed at UMass Amherst by Andrew McCallum
  Java implementation, open source
Large collection of machine learning algorithms
  Targeted to language processing
  Naïve Bayes, MaxEnt, Decision Trees, Winnow, Boosting
  Also, clustering, topic models, sequence learners
Widely used, but
  Research software: some bugs/gaps; odd documentation
Installation
Installed on patas
  /NLP_TOOLS/tool_sets/mallet/latest/
  Will be updated to 2.0.7
Directories:
  bin/: script files
  src/: java source code
  class/: java classes
  lib/: jar files
  sample-data/: wikipedia docs for language id, etc.
Environment
Should be set up on patas
$PATH should include
  /NLP_TOOLS/tool_sets/mallet/latest/bin
$CLASSPATH should include
  /NLP_TOOLS/tool_sets/mallet/latest/lib/mallet-deps.jar
  /NLP_TOOLS/tool_sets/mallet/latest/lib/mallet.jar
Check:
  which text2vectors
  /NLP_TOOLS/tool_sets/mallet/latest/bin
Mallet Commands
Mallet command types:
  Data preparation
  Data/model inspection
  Training
  Classification
Command line scripts
  Shell scripts
    Set up java environment
    Invoke java programs
  --help lists command line parameters for scripts
Mallet Data
Mallet data instances:
  Instance_id label f1 v1 f2 v2 …
Stored in internal binary format: “vectors”
Binary format used by learners, decoders
Need to convert text files to binary format
Data Preparation
Built-in data importers
One class per directory, one instance per file
  bin/mallet import-dir --input IF --output OF
  Label is directory name
  (Also text2vectors)
One instance per line
  bin/mallet import-file --input IF --output OF
  Line: instance label text …
  (Also csv2vectors)
Create binary representation of text feature counts
Data Preparation
bin/mallet import-svmlight --input IF --output OF
  Allows import of user constructed feature value pairs
  Format: label f1:v1 f2:v2 … fn:vn
    Features can be strings or indexes
  (Also bin/svmlight2vectors)
If building test data separately from original
  bin/mallet import-svmlight --input IF --output OF --use-pipe-from previously_built.vectors
  Ensures consistent feature representation
Note: can’t mix svmlight models with others
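User-constructed feature-value pairs can be written out in the `label f1:v1 f2:v2 …` layout with a few lines of code. This is a sketch under assumptions: the function name and the example instances are made up, and feature names are assumed to contain no whitespace or colons.

```python
def to_svmlight_line(label, features):
    """Format one instance as: label f1:v1 f2:v2 ... fn:vn"""
    pairs = [f"{name}:{value}" for name, value in sorted(features.items())]
    return " ".join([str(label)] + pairs)


# Made-up instances: word counts for a language id task.
instances = [
    ("en", {"the": 2, "book": 1}),
    ("de", {"das": 1, "buch": 1}),
]
lines = [to_svmlight_line(label, features) for label, features in instances]
```

Writing `"\n".join(lines)` to a text file would give an input file of the kind passed to `--input` above.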
Accessing Binary Formats
vectors2info --input IF
  --print-labels TRUE
    Prints list of category labels in data set
  --print-matrix sic
    Prints all features and values by string and number
    Returns original text feature-value list
    Possibly out of order
vectors2vectors --input IF --training-file TNF --testing-file TTF --training-portion pct
  Creates random training/test splits in some ratio
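The idea of a random split by training portion can be sketched as below. This illustrates the concept only and is not Mallet's implementation; the function name, the seed parameter, and the rounding choice are assumptions.

```python
import random


def random_split(instances, training_portion, seed=0):
    """Shuffle the instances, then cut them into training and test parts."""
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)  # fixed seed for repeatability
    cut = int(round(training_portion * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


train, test = random_split(range(10), training_portion=0.9)
```

Every instance lands in exactly one of the two parts, so the test part stays unseen during training.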
Building & Accessing Models
bin/mallet train-classifier --input vector_data_file --trainer classifiertype --training-portion 0.9 --output-classifier OF
  Builds classifier model
  Can also store model, produce scores, confusion matrix, etc.
  --trainer: MaxEnt, DecisionTree, NaiveBayes, etc.
  --report: train:accuracy, test:f1:en
Can also use pre-split training & testing files
  e.g. output of vectors2vectors
  --training-file, --testing-file

Confusion Matrix, row=true, column=predicted  accuracy=1.0
label    0   1   |total
0 de     1   .   |1
1 en     .   1   |1
Summary. train accuracy mean = 1.0 stddev = 0 stderr = 0
Summary. test accuracy mean = 1.0 stddev = 0 stderr = 0
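How a confusion matrix with rows for true labels and columns for predicted labels is tallied can be sketched as follows. The function name is an assumption, and the two-instance input simply mirrors the counts in the report above.

```python
from collections import Counter


def confusion_matrix(true_labels, predicted_labels):
    """Return labels, matrix[true][predicted], and overall accuracy."""
    counts = Counter(zip(true_labels, predicted_labels))
    labels = sorted(set(true_labels) | set(predicted_labels))
    matrix = [[counts[(t, p)] for p in labels] for t in labels]
    correct = sum(counts[(label, label)] for label in labels)  # diagonal
    return labels, matrix, correct / len(true_labels)


labels, matrix, accuracy = confusion_matrix(["de", "en"], ["de", "en"])
```

Off-diagonal cells count errors: a nonzero entry in row `de`, column `en` would mean a German instance was classified as English.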
Accessing Classifiers
classifier2info --classifier maxent.model
  Prints out contents of model file
FEATURES FOR CLASS en
<default> -0.036953801963395115
book 0.004605219133228236
the 0.24270652500835088
i 0.004605219133228236
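To see how per-class weights like these turn into a prediction, here is a sketch of the standard MaxEnt (multinomial logistic regression) scoring rule: sum the weights of the active features plus the default weight for each class, then normalize with a softmax. The `en` weights are the ones listed above; the `de` weights and the input document are hypothetical, invented only so the example has two classes.

```python
import math


def maxent_probabilities(weights, features):
    """P(class | features) under a softmax over weighted feature sums."""
    scores = {}
    for label, class_weights in weights.items():
        scores[label] = class_weights.get("<default>", 0.0) + sum(
            class_weights.get(name, 0.0) * value
            for name, value in features.items()
        )
    highest = max(scores.values())  # subtract the max for numerical stability
    exps = {label: math.exp(score - highest) for label, score in scores.items()}
    total = sum(exps.values())
    return {label: value / total for label, value in exps.items()}


weights = {
    "en": {
        "<default>": -0.036953801963395115,
        "book": 0.004605219133228236,
        "the": 0.24270652500835088,
        "i": 0.004605219133228236,
    },
    # Hypothetical weights: the model's "de" section is not shown above.
    "de": {"<default>": 0.0, "das": 0.25},
}
probabilities = maxent_probabilities(weights, {"the": 2, "book": 1})
```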
Testing
Use new data to test a previously built classifier
bin/mallet classify-svmlight --input testfile --output outputfile --classifier maxent.model
  Also instance file, directories: classify-file, classify-dir
  Prints class,score matrix

Inst_id    class1  score1   class2  score2
array:0    en      0.995    de      0.0046
array:1    en      0.970    de      0.0294
array:2    en      0.064    de      0.935
array:3    en      0.094    de      0.905
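A class,score matrix like this one can be reduced to one predicted label per instance by taking the highest-scoring class. The sketch assumes whitespace-separated fields laid out as in the rows above, with no header line; the function name is made up.

```python
def parse_classification_line(line):
    """Split 'id class1 score1 class2 score2 ...' and pick the best class."""
    fields = line.split()
    instance_id, rest = fields[0], fields[1:]
    scores = {rest[i]: float(rest[i + 1]) for i in range(0, len(rest), 2)}
    best = max(scores, key=scores.get)
    return instance_id, best, scores


output = """array:0 en 0.995 de 0.0046
array:1 en 0.970 de 0.0294
array:2 en 0.064 de 0.935
array:3 en 0.094 de 0.905"""

predictions = [parse_classification_line(line)[:2] for line in output.splitlines()]
```

Comparing these predictions with gold labels gives test accuracy for the stored model.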
General Use
bin/mallet import-svmlight --input svmltrain.vectors.txt --output svmltrain.vectors
  Builds binary representation from feature:value pairs
bin/mallet train-classifier --input svmltrain.vectors --trainer MaxEnt --output-classifier svml.model
  Trains MaxEnt classifier and stores model
bin/mallet classify-svmlight --input svmltest.vectors.txt --output - --classifier svml.model
  Tests on the new data
Other Information
Website:
  Download and documentation (such as it is)
  http://mallet.cs.umass.edu
API tutorial:
  http://mallet.cs.umass.edu/mallet-tutorial.pdf
Local guide (refers to older version 0.4)
  http://courses.washington.edu/ling572/winter07/homework/mallet_guide.pdf