TagHelper: Basics Part 1

Carolyn Penstein Rosé
Carnegie Mellon University

Funded through the Pittsburgh Science of Learning Center and The Office of Naval Research, Cognitive and Neural Sciences Division

Outline

Setting up your data
Creating a trained model
Evaluating performance
Using a trained model
Overview of basic feature extraction from text

Setting Up Your Data

How do you know when you have coded enough data?

What distinguishes questions and statements?

Not all questions end in a question mark.

Not all WH words occur in questions.

I versus you is not a reliable predictor.

You need to code enough to avoid learning rules that won’t work.

Creating a Trained Model

Training and Testing

Start TagHelper tools by double-clicking on the portal.bat icon in your TagHelperTools2 folder.

You will then see the following tool palette.

The idea is that you will train a prediction model on your coded data and then apply that model to uncoded data.

Click on Train New Models.

Loading a File

First click on Add a File.

Then select a file.

Simplest Usage

Click “GO!”

TagHelper will use its default setting to train a model on your coded examples.

It will use that model to assign codes to the uncoded examples.

More Advanced Usage

The second option is to modify the default settings.

You get to the options you can set by clicking on >> Options.

After you finish that, click “GO!”

Output

You can find the output in the OUTPUT folder.

There will be a text file named Eval_[name of coding dimension]_[name of input file].txt
  This is a performance report.
  E.g., Eval_Code_SimpleExample.xls.txt

There will also be a file named [name of input file]_OUTPUT.xls
  This is the coded output.
  E.g., SimpleExample_OUTPUT.xls
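The naming patterns above can be reproduced with a few lines of Python. This is only a sketch inferred from the two examples on the slide, not TagHelper's own code: the function names are hypothetical, and it assumes the extension of the input file is kept in the report name and dropped before `_OUTPUT.xls`, as the examples suggest.

```python
from pathlib import Path


def eval_report_name(coding_dimension: str, input_file: str) -> str:
    """Build Eval_[name of coding dimension]_[name of input file].txt."""
    # The slide's example keeps the input file's extension: SimpleExample.xls.txt
    return f"Eval_{coding_dimension}_{Path(input_file).name}.txt"


def coded_output_name(input_file: str) -> str:
    """Build [name of input file]_OUTPUT.xls."""
    # The slide's example drops the extension before adding _OUTPUT.xls
    return f"{Path(input_file).stem}_OUTPUT.xls"


print(eval_report_name("Code", "SimpleExample.xls"))
print(coded_output_name("SimpleExample.xls"))
```

Both calls use the file and coding dimension from the slide, so the printed names should match the examples given there.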

Using the Output file Prefix

If you use the Output file prefix, the text you enter will be prepended to the output files.

There will be a text file named [prefix]_Eval_[name of coding dimension]_[name of input file].txt
  E.g., Prefix1_Eval_Code_SimpleExample.xls.txt

There will also be a file named [prefix]_[name of input file]_OUTPUT.xls
  E.g., Prefix1_SimpleExample.xls

Evaluating Performance

Performance report

The performance report tells you:
  What dataset was used
  What the customization settings were

At the bottom of the file are reliability statistics and a confusion matrix that tells you which types of errors are being made.
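To make the last point concrete, here is a small sketch of how a confusion matrix and one common reliability statistic can be computed from gold codes and predicted codes. The slides do not say which statistics the report contains, so the use of Cohen's kappa here is an assumption, and the labels and data are invented for illustration.

```python
from collections import Counter


def confusion_matrix(gold, predicted):
    """Return (labels, matrix) where matrix[i][j] counts gold=labels[i], predicted=labels[j]."""
    labels = sorted(set(gold) | set(predicted))
    counts = Counter(zip(gold, predicted))
    return labels, [[counts[(g, p)] for p in labels] for g in labels]


def cohens_kappa(gold, predicted):
    """Agreement between gold and predicted codes, corrected for chance agreement."""
    n = len(gold)
    observed = sum(g == p for g, p in zip(gold, predicted)) / n
    gold_freq, pred_freq = Counter(gold), Counter(predicted)
    expected = sum(gold_freq[label] * pred_freq[label] for label in gold_freq) / (n * n)
    if expected == 1.0:  # both codings use a single identical label
        return 1.0
    return (observed - expected) / (1 - expected)


# Invented example: Q = question, S = statement
gold = ["Q", "Q", "S", "S", "S", "Q"]
predicted = ["Q", "S", "S", "S", "Q", "Q"]

labels, matrix = confusion_matrix(gold, predicted)
print(labels)   # row and column order
print(matrix)   # off-diagonal cells are the errors
print(cohens_kappa(gold, predicted))
```

Reading across a row shows how the segments of one gold code were distributed over the predicted codes, which is how the matrix reveals the types of errors being made.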







Output File

The output file contains the codes for each segment.
  Note that the segments that were already coded will retain their original code.
  The other segments will have their automatic predictions.

The prediction column indicates the confidence of the prediction.

Using a Trained Model

Applying a Trained Model

Select a model file.

Then select a testing file.

Testing data should be set up with ? on uncoded examples.

Click Go! to process the file.
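The "?" convention can be sketched as follows. This is an illustration of the idea rather than TagHelper's implementation: the (code, text) row layout is an assumption, and the `predict` function is a trivial stand-in for a trained model.

```python
UNCODED = "?"  # marker for examples that have no code yet


def split_examples(rows):
    """Separate (code, text) rows into coded training rows and uncoded texts."""
    coded = [(code, text) for code, text in rows if code != UNCODED]
    uncoded = [text for code, text in rows if code == UNCODED]
    return coded, uncoded


def fill_codes(rows, predict):
    """Keep existing codes and replace each '?' with the model's prediction."""
    return [(predict(text) if code == UNCODED else code, text) for code, text in rows]


def predict(text):
    # Placeholder predictor for the demo only, not a trained model
    return "Question" if text.endswith("?") else "Statement"


rows = [
    ("Question", "what is the answer?"),
    ("?", "I think it is 9."),
    ("Statement", "ok"),
    ("?", "is it 9?"),
]

coded, uncoded = split_examples(rows)
print(len(coded), len(uncoded))
for code, text in fill_codes(rows, predict):
    print(code, "|", text)
```

Rows that already carry a code pass through unchanged, so only the "?" rows receive automatic predictions.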

Results

Overview of Basic Feature Extraction from Text

Customizations

To customize the settings:
  Select the file
  Click on Options

Setting the Language

You can change the default language from English to German.

Chinese requires an additional license from Academia Sinica in Taiwan.

Preparing to get a performance report

You can decide whether you want it to prepare a performance report for you. (It runs faster when this is disabled.)

TagHelper Customizations

Typical classification algorithms
  Naïve Bayes
  SMO (Weka’s implementation of Support Vector Machines)
  J48 (decision trees)

Rules of thumb:
  SMO is state-of-the-art for text classification
  J48 is best with small feature sets; it also handles contingencies between features well
  Naïve Bayes works well for models where decisions are made based on accumulating evidence rather than hard and fast rules
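The phrase "accumulating evidence" can be illustrated with a tiny multinomial Naïve Bayes written from scratch. This is a teaching sketch, not Weka's implementation: the add-one smoothing, the function names, and the four training examples are all assumptions made for the demo.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(examples):
    """examples: list of (label, tokens). Returns the counts needed for scoring."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for label, tokens in examples:
        label_counts[label] += 1
        word_counts[label].update(tokens)
        vocabulary.update(tokens)
    return label_counts, word_counts, vocabulary


def log_scores(model, tokens):
    """Each known token adds a little log-probability evidence to every label."""
    label_counts, word_counts, vocabulary = model
    total = sum(label_counts.values())
    scores = {}
    for label in label_counts:
        score = math.log(label_counts[label] / total)  # prior
        denominator = sum(word_counts[label].values()) + len(vocabulary)
        for token in tokens:
            if token in vocabulary:  # tokens never seen in training are skipped
                score += math.log((word_counts[label][token] + 1) / denominator)
        scores[label] = score
    return scores


def classify(model, tokens):
    scores = log_scores(model, tokens)
    return max(scores, key=scores.get)


# Invented training data
examples = [
    ("Question", "what is the answer ?".split()),
    ("Question", "which is the common denominator ?".split()),
    ("Statement", "the answer is 9 .".split()),
    ("Statement", "ok sure .".split()),
]
model = train_naive_bayes(examples)
print(classify(model, "what is the common multiple ?".split()))
print(classify(model, "ok sure .".split()))
```

No single token decides the outcome; the label whose summed evidence is largest wins, which is the contrast with the hard and fast rules a decision tree would learn.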


Feature Space Design

Think like a computer!
  Machine learning algorithms look for features that are good predictors, not features that are necessarily meaningful

Look for approximations
  If you want to find questions, you don’t need to do a complete syntactic analysis
  Look for question marks
  Look for wh-terms that occur immediately before an auxiliary verb
  Look for topics likely to be indicative of questions (if you’re talking about ice cream, and someone mentions flavor without mentioning a specific flavor, it might be a question)
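Two of the approximations above, the question mark and the wh-term immediately before an auxiliary verb, can be sketched as simple surface features. The word lists and the tokenizer below are assumptions chosen for the example, not TagHelper's own resources.

```python
import re

# Hypothetical word lists; a real system would use fuller ones
WH_TERMS = {"who", "what", "when", "where", "why", "which", "how"}
AUXILIARIES = {"is", "are", "was", "were", "do", "does", "did", "can",
               "could", "will", "would", "should", "has", "have", "had"}


def tokenize(text):
    """Lowercase the text and split it into word tokens and sentence punctuation."""
    return re.findall(r"[a-z0-9']+|[?.!]", text.lower())


def question_features(text):
    tokens = tokenize(text)
    wh_before_aux = any(
        first in WH_TERMS and second in AUXILIARIES
        for first, second in zip(tokens, tokens[1:])
    )
    return {"has_question_mark": "?" in tokens, "wh_before_aux": wh_before_aux}


print(question_features("What is the common denominator"))
print(question_features("I know what you mean."))
print(question_features("you think the answer is 9?"))
```

Neither feature is a full syntactic analysis, and each will misfire on some inputs; the point is that cheap cues like these can still be good predictors.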


Feature Space Design

Punctuation can be a “stand in” for mood
  “you think the answer is 9?”
  “you think the answer is 9.”

Bigrams capture simple lexical patterns
  “common denominator” versus “common multiple”

POS bigrams capture stylistic information
  “the answer which is …” vs “which is the answer”

Line length can be a proxy for explanation depth
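A minimal sketch of three of these feature types (punctuation, word bigrams, and line length) is shown below. The feature names, the tokenizer, and measuring line length in word tokens are assumptions; POS bigrams are left out because they would need a part-of-speech tagger.

```python
import re

PUNCTUATION = {"?", ".", "!", ","}


def tokenize(text):
    """Lowercase the text and split it into word tokens and punctuation marks."""
    return re.findall(r"[a-z0-9']+|[?.!,]", text.lower())


def surface_features(text):
    tokens = tokenize(text)
    words = [t for t in tokens if t not in PUNCTUATION]
    features = {}
    for mark in tokens:
        if mark in PUNCTUATION:
            features[f"punct={mark}"] = 1          # punctuation as a cue for mood
    for first, second in zip(words, words[1:]):
        features[f"bigram={first}_{second}"] = 1   # adjacent word pairs
    features["line_length"] = len(words)           # length in word tokens
    return features


print(surface_features("you think the answer is 9?"))
print(surface_features("the common denominator"))
print(surface_features("the common multiple"))
```

The two "common ..." phrases share the word "common" but produce different bigram features, which is what lets a learner tell them apart.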


Feature Space Design

Contains non-stop word can be a predictor of whether a conversational contribution is contentful
  “ok sure” versus “the common denominator”

Remove stop words removes some distracting features

Stemming allows some generalization
  Multiple, multiply, multiplication

Removing rare features is a cheap form of feature selection
  Features that only occur once or twice in the corpus won’t generalize, so they are a waste of time to include in the vector space
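The stop-word and rare-feature options can be sketched as below. The stop list is hypothetical (it includes "ok" and "sure" only so that the slide's example comes out as non-contentful), and the threshold of three occurrences is one reading of "once or twice"; stemming is omitted because it needs a real stemmer.

```python
from collections import Counter

# Hypothetical stop list for the demo
STOP_WORDS = {"the", "a", "an", "is", "of", "to", "and", "ok", "sure"}


def contains_non_stop_word(tokens):
    """True if at least one token is not a stop word."""
    return any(token not in STOP_WORDS for token in tokens)


def remove_stop_words(tokens):
    return [token for token in tokens if token not in STOP_WORDS]


def drop_rare_features(documents, min_count=3):
    """Keep only tokens that occur at least min_count times across the corpus."""
    totals = Counter(token for doc in documents for token in doc)
    keep = {token for token, count in totals.items() if count >= min_count}
    return [[token for token in doc if token in keep] for doc in documents]


print(contains_non_stop_word(["ok", "sure"]))
print(contains_non_stop_word(["the", "common", "denominator"]))
print(remove_stop_words(["the", "common", "denominator"]))

corpus = [["common", "denominator"], ["common", "multiple"], ["common", "factor"]]
print(drop_rare_features(corpus))
```

In the toy corpus only "common" reaches the threshold, so the other tokens are dropped from the vector space.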
