MALLET: MAchine Learning for LanguagE Toolkit


Outline
• About MALLET

• Representing Data

• Command Line Processing

• Simple Evaluation

• Conclusion






About MALLET
• "MALLET: A Machine Learning for Language Toolkit."
  • Written by Andrew McCallum
  • http://mallet.cs.umass.edu. 2002.
  • Implemented in Java, currently version 2.0.6
• Motivation:
  • Text classification and information extraction
  • Commercial machine learning
  • Analysis and indexing of academic publications


About MALLET
• Main idea
  • Text focus: data is treated as discrete rather than continuous, even when values could be continuous
• How to use MALLET
  • Command line scripts:
    bin/mallet [command] --[option] [value] …
  • Text User Interface ("tui") classes
  • Direct Java API
    • http://mallet.cs.umass.edu/api






Representations
• Transform text documents to vectors x1, x2, …
• Elements of a vector are called feature values
  • Example: "Feature at row 345 is the number of times "dog" appears in the document"
• Retain the meaning of vector indices
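The idea of turning documents into count vectors while keeping track of what each index means can be sketched in a few lines of Python. This is only an illustration, not MALLET's implementation: the class name FeatureIndex, the function to_vector and the two example documents are hypothetical.

```python
from collections import Counter


class FeatureIndex:
    """Hypothetical two-way mapping between feature names and vector indices."""

    def __init__(self):
        self.index_of = {}  # feature name -> index
        self.names = []     # index -> feature name

    def lookup(self, name):
        # Assign the next free index the first time a feature is seen.
        if name not in self.index_of:
            self.index_of[name] = len(self.names)
            self.names.append(name)
        return self.index_of[name]


def to_vector(text, index):
    """Return a sparse count vector {feature index: count} for one document."""
    counts = Counter(text.lower().split())
    return {index.lookup(word): n for word, n in counts.items()}


index = FeatureIndex()
x1 = to_vector("the dog chased the other dog", index)
x2 = to_vector("the cat slept", index)

dog = index.index_of["dog"]
print(dog, x1[dog], index.names[dog])
```

Because both documents share one FeatureIndex, the same index refers to the same word in x1 and x2, and the index can always be mapped back to its word.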

Documents to Vectors





Instances






Command Line
• Importing Data

• Classification

• Sequence Tagging

• Topic Modeling

Importing Data
• One instance per file
  • files in the folder: sample-data/web/en or sample-data/web/de
  • command line:
    bin/mallet import-dir --input sample-data/web/* --output web.mallet
• One file, one instance per line
  • file format: [URL] [language] [text of the page...]
  • command line:
    bin/mallet import-file --input /data/web/data.txt --output web.mallet
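The one-instance-per-line format above can be illustrated with a small Python sketch that splits a line into its three fields. It assumes the fields are separated by whitespace and that only the text field may contain spaces; the function name parse_instance_line and the example line are hypothetical, and this is not how MALLET itself parses the file.

```python
def parse_instance_line(line):
    """Split '[URL] [language] [text of the page...]' into its three parts.

    Assumption: fields are whitespace-separated and only the last field
    (the page text) may contain whitespace.
    """
    name, label, text = line.rstrip("\n").split(None, 2)
    return name, label, text


# Hypothetical example line in the format described above.
line = "http://example.org/page1 en Hello world, this is a page\n"
name, label, text = parse_instance_line(line)
print(name, label, text, sep=" | ")
```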

Classification
• Training a classifier
  bin/mallet train-classifier --input training.mallet --output-classifier my.classifier
• Choosing an algorithm
  • MaxEnt, NaiveBayes, C45, DecisionTree and many others.
  bin/mallet train-classifier --input training.mallet --output-classifier my.classifier --trainer MaxEnt
• Evaluation
  • Randomly split the data into 90% training instances, which are used to train the classifier, and 10% testing instances.
  bin/mallet train-classifier --input labeled.mallet --training-portion 0.9
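What a random 90%/10% split does can be sketched in Python as follows. This is a minimal illustration of the idea behind --training-portion, not MALLET's own code; the function name split_instances and the fixed seed are assumptions made for reproducibility.

```python
import random


def split_instances(instances, training_portion=0.9, seed=0):
    """Shuffle the instances and cut them into a training and a testing part."""
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    cut = int(round(training_portion * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


# 100 placeholder instances (here just the integers 0..99).
train, test = split_instances(range(100), training_portion=0.9)
print(len(train), len(test))
```

Every instance ends up in exactly one of the two parts, so the classifier is never tested on data it was trained on.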

Sequence Tagging
• Sequence algorithms
  • hidden Markov models (HMMs)
  • linear chain conditional random fields (CRFs)
• SimpleTagger
  • a command line interface to the MALLET Conditional Random Field (CRF) class

SimpleTagger
• Input file: one token per line, in the form [feature1 feature2 ... featuren label]
  Bill CAPITALIZED noun
  slept non-noun
  here LOWERCASE STOPWORD non-noun
• Train a CRF
  • An input file "sample"
  • A trained CRF in the file "nouncrf"
  java -cp "~/mallet/class:~/mallet/lib/mallet-deps.jar" cc.mallet.fst.SimpleTagger --train true --model-file nouncrf sample
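To make the input format concrete, here is a small Python sketch that reads lines of the form [feature1 feature2 ... featuren label] into sequences. It assumes that a blank line separates one sequence from the next; the function name read_sequences is hypothetical and the code is not part of MALLET.

```python
def read_sequences(lines):
    """Group token lines into sequences of (features, label) pairs.

    Assumption: the last field on a line is the label, everything before
    it is a feature, and a blank line ends the current sequence.
    """
    sequences, current = [], []
    for line in lines:
        fields = line.split()
        if not fields:
            if current:
                sequences.append(current)
                current = []
            continue
        *features, label = fields
        current.append((features, label))
    if current:
        sequences.append(current)
    return sequences


sample = """Bill CAPITALIZED noun
slept non-noun
here LOWERCASE STOPWORD non-noun
""".splitlines()

sequences = read_sequences(sample)
for features, label in sequences[0]:
    print(label, features)
```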

SimpleTagger
• A file "stest" that needs to be labeled
  CAPITAL Al
  slept
  here
• Label the input
  java -cp "~/mallet/class:~/mallet/lib/mallet-deps.jar" cc.mallet.fst.SimpleTagger --model-file nouncrf stest
• Output
  Number of predicates: 5
  noun CAPITAL Al
  non-noun slept
  non-noun here

Topic Modeling
• Building Topic Models

bin/mallet train-topics --input topic-input.mallet --num-topics 100 --output-state topic-state.gz

--input [FILE]

--num-topics [NUMBER] The number of topics to use. The best number depends on what you are looking for in the model.

--num-iterations [NUMBER] The number of sampling iterations. Choose it as a trade-off between the time taken to complete sampling and the quality of the topic model.

--output-state [FILENAME] This option outputs a compressed text file containing the words in the corpus with their topic assignments.

Demo






Methodology
• Focus on the sequence tagging module in MALLET
  • CRF-based implementation
  • Some scripts written for importing data and evaluating results
• Small corpora collected from the web
  • Divided into two parts: 80% for training, 20% for testing
• Evaluate both POS tagging and Named Entity Recognition
  • Training performance
  • Accuracy (POS tagging) and Precision, Recall and FB1 (NER)
• All scripts, corpora and results can be found here
  • http://mallet-eval.googlecode.com


A Survey of Named Entity Corpora
• Well-known named entity corpora
  • Language-Independent Named Entity Recognition at CoNLL-2003
    • A manual annotation of a subset of RCV1 (Reuters Corpus Volume 1)
    • free and public, but needs the RCV1 raw texts as input
  • Message Understanding Conference (MUC) 6 / 7
    • not free
  • Automatic Content Extraction (ACE) Training Corpus
    • not free
• Other special-purpose corpora
  • Enron Email Dataset
    • email messages in this corpus are tagged with person names, dates and times.
  • A variety of biomedical corpora
    • some corpora in this collection are tagged with entities in the biomedical domain, such as gene names

Small Corpora
• Two small corpora collected from the web
  • Penn Treebank Sample
    • English POS tagging corpus, ~5% fragment of the Penn Treebank, (C) LDC 1995.
    • raw, tagged, parsed and combined data from the Wall Street Journal
    • 148120 tokens, 36 standard treebank POS tags
    • http://web.mit.edu/course/6/6.863/OldFiles/share/data/corpora/treebank/
  • HIT CIR LTP Corpora Sample
    • Chinese NER corpora integrated
    • 10% of the whole corpora (open to public)
    • 23751 tokens, 7 kinds of named entities
    • http://ir.hit.edu.cn/demo/ltp/Sharing_Plan.htm


Environment
• Hardware
  • CPU: Q8300 Quad Core 2.50 GHz
  • Memory: 3GB
• Software
  • Fedora 13 x86_64
  • Java 1.6.0_18
  • MALLET 2.0.6

Data Format and Labels
• Data format
  • One token per row, one feature per column
    Bill noun
    slept non-noun
    Here non-noun
• Labels
  • Standard treebank POS tags
    • CC Coordinating conjunction | CD Cardinal number | DT Determiner | EX Existential there | FW Foreign word | IN Preposition or subordinating conjunction | JJ Adjective | JJR Adjective, comparative | JJS Adjective, superlative | LS List item marker | MD Modal | NN Noun, singular or mass | NNS Noun, plural … … (36 tags in all)
  • HIT Named Entity
    • O not an NE | S- a single-token NE | B- beginning of an NE | I- middle of an NE | E- end of an NE
    • Nm numeral | Ni organization name | Ns place name | Nh person name | Nt time | Nr date | Nz proper noun
    • Example: 美国 B-Ni 洛杉矶 I-Ni 警察局 E-Ni (United States / Los Angeles / Police Department, together one organization name)
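How the S-/B-/I-/E-/O prefixes combine into whole entities can be sketched in Python. This is an illustrative decoder written for this tag scheme, not one of the evaluation scripts; the function name extract_entities is hypothetical, and malformed sequences (for example an E- tag with no preceding B-) are simply ignored.

```python
def extract_entities(tokens, tags):
    """Collect (entity type, entity text) pairs from B-/I-/E-/S-/O tags."""
    entities, start = [], None
    for i, tag in enumerate(tags):
        prefix, _, kind = tag.partition("-")
        if prefix == "S":                         # single-token entity
            entities.append((kind, tokens[i]))
            start = None
        elif prefix == "B":                       # entity begins here
            start = i
        elif prefix == "E" and start is not None:  # entity ends here
            entities.append((kind, "".join(tokens[start:i + 1])))
            start = None
        elif prefix == "O":                       # outside any entity
            start = None
        # prefix "I" continues the current entity, nothing to do
    return entities


tokens = ["美国", "洛杉矶", "警察局"]
tags = ["B-Ni", "I-Ni", "E-Ni"]
entities = extract_entities(tokens, tags)
print(entities)
```

Precision and recall for NER are usually counted over such whole entities rather than over individual tokens.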

Evaluation

Stage     Measure      pos        chunking   ner
Training  Instance #   3982       8936       1286
Training  Tokens #     95767      211727     20913
Training  Time         308m 23s   190m 50s   17m 13s
Test      Tokens #     46452      47377      2829
Test      Accuracy     85.67%     93.97%     98.55%
Test      Precision    -          90.54%     86.89%
Test      Recall       -          89.89%     86.89%
Test      FB1          -          90.21      86.89
Test      Time         15.80s     4.43s      0.8s
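FB1 is the harmonic mean of precision and recall. The short Python sketch below applies the standard F-measure formula to the reported precision and recall values; the function name f_beta is hypothetical and this is not the evaluation script used for the results.

```python
def f_beta(precision, recall, beta=1.0):
    """F-measure: (1 + beta^2) * P * R / (beta^2 * P + R); beta=1 gives FB1."""
    if precision + recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Precision and recall (in percent) as reported for chunking and NER.
chunking_fb1 = f_beta(90.54, 89.89)
ner_fb1 = f_beta(86.89, 86.89)
print(round(chunking_fb1, 2), round(ner_fb1, 2))
```

Rounded to two decimals this gives 90.21 for chunking and 86.89 for NER, matching the reported FB1 values.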

DEMO

Q&A