TagHelper: Basics Part 1
Carolyn Penstein Rosé
Carnegie Mellon University
Funded through the Pittsburgh Science of Learning Center and The Office of Naval Research, Cognitive and Neural Sciences Division
Outline
  Setting up your data
  Creating a trained model
  Evaluating performance
  Using a trained model
  Overview of basic feature extraction from text
Setting Up Your Data
How do you know when you have coded enough data?
What distinguishes questions from statements?
  Not all questions end in a question mark.
  Not all WH words occur in questions.
  "I" versus "you" is not a reliable predictor.
You need to code enough data to avoid learning rules that won’t work.
Creating a Trained Model
Training and Testing
Start TagHelper tools by double-clicking the portal.bat icon in your TagHelperTools2 folder.
You will then see the following tool palette.
The idea is that you will train a prediction model on your coded data and then apply that model to uncoded data.
Click on Train New Models.
Loading a File
First click on Add a File.
Then select a file.
Simplest Usage
Click “GO!”
TagHelper will use its default settings to train a model on your coded examples.
It will use that model to assign codes to the uncoded examples.
More Advanced Usage
The second option is to modify the default settings.
You get to the options you can set by clicking on >> Options.
After you finish that, click “GO!”
Output
You can find the output in the OUTPUT folder.
There will be a text file named Eval_[name of coding dimension]_[name of input file].txt
  This is a performance report.
  E.g., Eval_Code_SimpleExample.xls.txt
There will also be a file named [name of input file]_OUTPUT.xls
  This is the coded output.
  E.g., SimpleExample_OUTPUT.xls
Using the Output File Prefix
If you use the Output file prefix, the text you enter will be prepended to the names of the output files.
There will be a text file named [prefix]_Eval_[name of coding dimension]_[name of input file].txt
  E.g., Prefix1_Eval_Code_SimpleExample.xls.txt
There will also be a file named [prefix]_[name of input file]_OUTPUT.xls
  E.g., Prefix1_SimpleExample.xls
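The naming patterns above can be sketched as a small helper. This is only an illustration of the patterns as written on the slides, not TagHelper's own code; the function name `output_file_names` and its parameters are hypothetical, and it assumes the coded output drops the input file's extension before adding `_OUTPUT.xls`, as in the SimpleExample case.

```python
def output_file_names(input_file, coding_dimension, prefix=""):
    """Return (performance_report_name, coded_output_name).

    Hypothetical helper that mirrors the naming patterns on the slides.
    """
    # The performance report keeps the full input file name, extension included.
    report = f"Eval_{coding_dimension}_{input_file}.txt"
    # The coded output uses the input file name without its final extension.
    stem = input_file.rsplit(".", 1)[0] if "." in input_file else input_file
    coded = f"{stem}_OUTPUT.xls"
    # An optional prefix is prepended to both names.
    if prefix:
        report = f"{prefix}_{report}"
        coded = f"{prefix}_{coded}"
    return report, coded


report_name, coded_name = output_file_names("SimpleExample.xls", "Code")
```

With the inputs shown, `report_name` is `Eval_Code_SimpleExample.xls.txt` and `coded_name` is `SimpleExample_OUTPUT.xls`.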
Evaluating Performance
Performance report
The performance report tells you:
  What dataset was used
  What the customization settings were
At the bottom of the file are reliability statistics and a confusion matrix that tells you which types of errors are being made.
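To make the two quantities at the bottom of the report concrete, here is a minimal sketch of a confusion matrix and one common reliability statistic, Cohen's kappa. Which statistics TagHelper actually prints is not stated on the slide, so treating kappa as the reliability statistic is an assumption; the function names and the toy labels are illustrative only.

```python
from collections import Counter


def confusion_matrix(gold, predicted):
    """Rows are human codes, columns are predicted codes."""
    labels = sorted(set(gold) | set(predicted))
    counts = Counter(zip(gold, predicted))
    matrix = [[counts[(g, p)] for p in labels] for g in labels]
    return labels, matrix


def cohens_kappa(gold, predicted):
    """Agreement between two codings, corrected for chance agreement."""
    n = len(gold)
    observed = sum(g == p for g, p in zip(gold, predicted)) / n
    gold_freq = Counter(gold)
    pred_freq = Counter(predicted)
    # Chance agreement: probability both codings pick the same label at random.
    expected = sum(gold_freq[label] * pred_freq[label] for label in gold_freq) / (n * n)
    if expected == 1.0:
        return 1.0
    return (observed - expected) / (1 - expected)


# Toy data: Q = question, S = statement.
human = ["Q", "Q", "S", "S"]
model = ["Q", "S", "S", "S"]
labels, matrix = confusion_matrix(human, model)
kappa = cohens_kappa(human, model)
```

In the toy data the off-diagonal cell shows that one question was mistaken for a statement, which is the kind of error pattern the confusion matrix is meant to expose.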
Output File
The output file contains:
  The codes for each segment
    Note that the segments that were already coded will retain their original code.
    The other segments will have their automatic predictions.
  The prediction column indicates the confidence of the prediction.
Using a Trained Model
Applying a Trained Model
Select a model file.
Then select a testing file.
Applying a Trained Model
Testing data should be set up with "?" as the code on uncoded examples.
Click Go! to process the file.
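The "?" convention can be sketched as follows. This is not TagHelper's implementation; it assumes each row is a (code, text) pair, and `predict` is a hypothetical stand-in for whatever trained model is being applied. Rows that already carry a code keep it, and only rows marked "?" receive a prediction.

```python
UNCODED = "?"


def split_coded(rows):
    """Separate rows that carry a human code from rows marked as uncoded."""
    coded = [(code, text) for code, text in rows if code != UNCODED]
    uncoded = [text for code, text in rows if code == UNCODED]
    return coded, uncoded


def fill_predictions(rows, predict):
    """Keep existing codes; replace '?' with the model's prediction."""
    return [
        (predict(text) if code == UNCODED else code, text)
        for code, text in rows
    ]


rows = [
    ("Q", "what is the common denominator?"),
    ("?", "the answer is 9"),
    ("S", "ok sure"),
]


def always_statement(text):
    # Placeholder model for illustration only.
    return "S"


coded, uncoded = split_coded(rows)
completed = fill_predictions(rows, always_statement)
```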
Results
Overview of Basic Feature Extraction from Text
Customizations
To customize the settings:
  Select the file
  Click on Options
Setting the Language
You can change the default language from English to German.
Chinese requires an additional license from Academia Sinica in Taiwan.
Preparing to Get a Performance Report
You can decide whether you want TagHelper to prepare a performance report for you. (It runs faster when this is disabled.)
TagHelper Customizations
Typical classification algorithms
  Naïve Bayes
  SMO (Weka’s implementation of Support Vector Machines)
  J48 (decision trees)
Rules of thumb:
  SMO is state-of-the-art for text classification.
  J48 is best with small feature sets; it also handles contingencies between features well.
  Naïve Bayes works well for models where decisions are made based on accumulating evidence rather than hard and fast rules.
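The phrase "accumulating evidence" can be illustrated with a tiny multinomial Naïve Bayes written from scratch. This is a sketch for intuition only, not Weka's implementation; the add-one smoothing, the function names, and the four training examples are all assumptions made for the illustration. Each token adds a little log-probability to every label's score, and the label with the highest total wins.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(examples):
    """examples: list of (label, list_of_tokens)."""
    label_counts = Counter()
    token_counts = defaultdict(Counter)
    vocabulary = set()
    for label, tokens in examples:
        label_counts[label] += 1
        token_counts[label].update(tokens)
        vocabulary.update(tokens)
    return {"labels": label_counts, "tokens": token_counts, "vocab": vocabulary}


def log_scores(model, tokens):
    """Sum the log evidence each known token contributes to each label."""
    total = sum(model["labels"].values())
    vocab_size = len(model["vocab"])
    scores = {}
    for label, count in model["labels"].items():
        score = math.log(count / total)  # prior
        denom = sum(model["tokens"][label].values()) + vocab_size
        for token in tokens:
            if token in model["vocab"]:  # ignore tokens never seen in training
                score += math.log((model["tokens"][label][token] + 1) / denom)
        scores[label] = score
    return scores


def classify(model, tokens):
    scores = log_scores(model, tokens)
    return max(scores, key=scores.get)


training = [
    ("Q", ["what", "is", "the", "answer", "?"]),
    ("Q", ["is", "it", "9", "?"]),
    ("S", ["the", "answer", "is", "9", "."]),
    ("S", ["it", "is", "9", "."]),
]
model = train_naive_bayes(training)
question_label = classify(model, ["is", "the", "answer", "9", "?"])
statement_label = classify(model, ["is", "the", "answer", "9", "."])
```

No single token decides the outcome here: most tokens are shared by both classes, and the final punctuation mark tips the accumulated score one way or the other.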
TagHelper Customizations
Feature Space Design
Think like a computer!
  Machine learning algorithms look for features that are good predictors, not features that are necessarily meaningful.
Look for approximations
  If you want to find questions, you don’t need to do a complete syntactic analysis.
  Look for question marks.
  Look for wh-terms that occur immediately before an auxiliary verb.
  Look for topics likely to be indicative of questions (if you’re talking about ice cream, and someone mentions flavor without mentioning a specific flavor, it might be a question).
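The first two approximations above can be sketched with a regular expression. The word lists below are short, hand-picked assumptions rather than anything TagHelper uses, and the cues are deliberately shallow: they are meant as predictive features for a learner, not as a definition of what a question is.

```python
import re

# Illustrative, incomplete word lists (assumptions for this sketch).
WH_TERMS = r"(?:who|what|when|where|why|which|how)"
AUXILIARIES = r"(?:is|are|was|were|do|does|did|can|could|will|would|should|has|have|had)"
WH_BEFORE_AUX = re.compile(rf"\b{WH_TERMS}\s+{AUXILIARIES}\b", re.IGNORECASE)


def question_cues(text):
    """Return shallow cues that tend to co-occur with questions."""
    return {
        "has_question_mark": "?" in text,
        "wh_before_aux": WH_BEFORE_AUX.search(text) is not None,
    }


cues_marked = question_cues("you think the answer is 9?")
cues_wh = question_cues("which is the answer")
cues_neither = question_cues("I know what you mean.")
```

Because these are approximations, they will sometimes fire on statements (for example, a relative clause such as "the answer which is 9" also contains a wh-term before an auxiliary), which is why they are combined with other features rather than used as rules.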
TagHelper Customizations
Feature Space Design
Punctuation can be a “stand in” for mood
  “you think the answer is 9?”
  “you think the answer is 9.”
Bigrams capture simple lexical patterns
  “common denominator” versus “common multiple”
POS bigrams capture stylistic information
  “the answer which is …” vs “which is the answer”
Line length can be a proxy for explanation depth
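A minimal sketch of the word-level features on this slide is shown below: unigrams (which include punctuation tokens), word bigrams, and line length in tokens. The tokenizer and the feature-name format are assumptions for illustration, not TagHelper's actual representation. POS bigrams are left out because they require a part-of-speech tagger.

```python
import re


def tokenize(text):
    """Lowercase and split into word tokens and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text.lower())


def extract_features(text):
    """Return (set of unigram and bigram features, line length in tokens)."""
    tokens = tokenize(text)
    features = {f"uni={t}" for t in tokens}
    features |= {f"bi={a}_{b}" for a, b in zip(tokens, tokens[1:])}
    return features, len(tokens)


question_feats, question_len = extract_features("you think the answer is 9?")
statement_feats, statement_len = extract_features("you think the answer is 9.")

# The only features that differ between the two lines involve the final punctuation.
difference = question_feats ^ statement_feats
```

The two example lines share every word, so the punctuation features are the only signal that separates them, which is the sense in which punctuation stands in for mood.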
TagHelper Customizations
Feature Space Design
The “contains non-stop word” feature can be a predictor of whether a conversational contribution is contentful
  “ok sure” versus “the common denominator”
The “remove stop words” option removes some distracting features
Stemming allows some generalization
  Multiple, multiply, multiplication
Removing rare features is a cheap form of feature selection
  Features that only occur once or twice in the corpus won’t generalize, so they are a waste of time to include in the vector space
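The three ideas on this slide can be sketched in a few lines. Everything here is an illustrative assumption: the stop-word list is tiny and hand-picked, `crude_stem` is a toy suffix stripper rather than the stemmer TagHelper uses, and the threshold of three occurrences simply encodes "more than once or twice".

```python
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not TagHelper's list).
STOP_WORDS = {"ok", "sure", "the", "a", "an", "is", "it"}

# Toy suffix list, checked in order; real stemmers are far more careful.
SUFFIXES = ("ication", "ing", "ed", "es", "e", "y", "s")


def contains_non_stop_word(tokens):
    """True if at least one token is not a stop word."""
    return any(t not in STOP_WORDS for t in tokens)


def crude_stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def remove_rare_features(documents, min_count=3):
    """Drop features that occur fewer than min_count times in the corpus."""
    counts = Counter(f for doc in documents for f in doc)
    return [[f for f in doc if counts[f] >= min_count] for doc in documents]


stems = {crude_stem(w) for w in ["multiple", "multiply", "multiplication"]}

corpus = [
    ["common", "denominator"],
    ["common", "multiple"],
    ["common", "flavor"],
]
pruned = remove_rare_features(corpus)
```

In this toy corpus only "common" appears three times, so it is the only feature that survives pruning, and the three word forms collapse to a single stem.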