Corpus Bootstrapping with NLTK


Presented at Strata 2012 Deep Data session.


Jacob Perkins

http://www.weotta.com

http://streamhacker.com

http://text-processing.com

https://github.com/japerk/nltk-trainer

@japerk

Problem

you want to do NLProc

many proven supervised training algorithms

but you don’t have a training corpus

Solution

make a custom training corpus

Problems with Manual Annotation

takes time

requires expertise

expert time costs $$$

Solution: Bootstrap

less time

less expertise

costs less

requires thinking & creativity

Corpus Bootstrapping at Weotta

review sentiment

keyword classification

phrase extraction & classification

Bootstrapping Examples

english -> spanish sentiment

phrase extraction

Translating Sentiment

start with english sentiment corpus & classifier

english -> spanish -> spanish

English -> Spanish -> Spanish

1. translate english examples to spanish

2. train classifier

3. classify spanish text into new corpus

4. correct new corpus

5. retrain classifier

6. add to corpus & goto 4 until done
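
The loop above can be sketched as a toy self-training run with NLTK's NaiveBayesClassifier. The feature function and the tiny "corpus" here are illustrative stand-ins, not the deck's actual data or scripts:

```python
from nltk.classify import NaiveBayesClassifier

def feats(text):
    # Trivial bag-of-words features; a real run would use richer features.
    return {word: True for word in text.lower().split()}

# Seed corpus, standing in for the machine-translated Spanish reviews.
labeled = [
    (feats('excelente pelicula'), 'pos'),
    (feats('muy buena'), 'pos'),
    (feats('pelicula terrible'), 'neg'),
    (feats('muy mala'), 'neg'),
]
unlabeled = ['una pelicula excelente', 'una pelicula mala']

# Steps 2-6: train, classify new text, correct, add, retrain.
for _ in range(2):
    classifier = NaiveBayesClassifier.train(labeled)
    newly_labeled = [(feats(t), classifier.classify(feats(t))) for t in unlabeled]
    # ... manual correction happens here in a real run,
    # which would also pull fresh unlabeled text each pass ...
    labeled.extend(newly_labeled)
```

The manual-correction step is the whole point: each pass only adds value if a human fixes the classifier's mistakes before retraining.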

Translate Corpus

$ translate_corpus.py movie_reviews --source english --target spanish

Train Initial Classifier

$ train_classifier.py spanish_movie_reviews

Create New Corpus

$ classify_to_corpus.py spanish_sentiment --input spanish_examples.txt --classifier spanish_movie_reviews_NaiveBayes.pickle

Manual Correction

1. scan each file

2. move incorrect examples to correct file

Train New Classifier

$ train_classifier.py spanish_sentiment

Adding to the Corpus

start with >90% probability

retrain

carefully decrease probability threshold

Add more at a Lower Threshold

$ classify_to_corpus.py categorized_corpus --classifier categorized_corpus_NaiveBayes.pickle --threshold 0.8 --input new_examples.txt
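
Under the hood, a threshold like this works on the classifier's label probabilities. A minimal sketch using NLTK's `prob_classify` (the training data and the helper function are made up for illustration):

```python
from nltk.classify import NaiveBayesClassifier

# Toy labeled feature dicts standing in for the real corpus.
train = [
    ({'bueno': True}, 'pos'),
    ({'excelente': True}, 'pos'),
    ({'malo': True}, 'neg'),
    ({'terrible': True}, 'neg'),
]
classifier = NaiveBayesClassifier.train(train)

def classify_if_confident(feats, threshold=0.9):
    # Return a label only when its probability meets the threshold;
    # otherwise return None and leave the example for a later,
    # lower-threshold pass (or manual review).
    probs = classifier.prob_classify(feats)
    label = probs.max()
    return label if probs.prob(label) >= threshold else None
```

Lowering `threshold` on each pass admits more examples, at the cost of more manual correction.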

When are you done?

what level of accuracy do you need?

does your corpus reflect real text?

how much time do you have?

Tips

garbage in, garbage out

correct bad data

clean & scrub text

experiment with train_classifier.py options

create custom features
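
A custom feature function is just a dict builder. A hand-rolled sketch (names and the bigram idea are illustrative, not part of nltk-trainer):

```python
def bag_of_words(text):
    # Map every token to True -- the simplest NLTK-style feature dict.
    return {word: True for word in text.lower().split()}

def bag_of_words_and_bigrams(text):
    # Also add adjacent word pairs, so the classifier can pick up
    # combinations like ('not', 'good') that single words miss.
    words = text.lower().split()
    feats = {word: True for word in words}
    for pair in zip(words, words[1:]):
        feats[pair] = True
    return feats
```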

Bootstrapping a Phrase Extractor

1. find a pos tagged corpus


2. annotate raw text

3. train pos tagger

4. create pos tagged & chunked corpus

5. tag unknown words

6. train pos tagger & chunker

7. correct errors

8. add to corpus, goto 5 until done

NLTK Tagged Corpora

English: brown, conll2000, treebank

Portuguese: mac_morpho, floresta

Spanish: cess_esp, conll2002

Catalan: cess_cat

Dutch: alpino, conll2002

Indian Languages: indian

Chinese: sinica_treebank

see http://text-processing.com/demo/tag/

Train Tagger

$ train_tagger.py treebank --simplify_tags

Phrase Annotation

Hello world, [this is an important phrase].

Tag Phrases

$ tag_phrases.py my_corpus --tagger treebank_simplify_tags.pickle --input my_phrases.txt

Chunked & Tagged Phrase

Hello/N world/N ,/, [ this/DET is/V an/DET important/ADJ phrase/N ] ./.
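
NLTK can parse this bracketed tagged-string format directly with `tagstr2tree`. A sketch, using 'PHRASE' as our chunk label and the current `.label()` API (the deck's 2012-era code uses `.node` instead):

```python
from nltk.chunk.util import tagstr2tree
from nltk.tag import untag

s = 'Hello/N world/N ,/, [ this/DET is/V an/DET important/ADJ phrase/N ] ./.'
tree = tagstr2tree(s, chunk_label='PHRASE')

# Each bracketed span becomes a PHRASE subtree; untag() drops the tags.
phrases = [' '.join(untag(sub.leaves()))
           for sub in tree.subtrees(lambda t: t.label() == 'PHRASE')]
```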

Correct Unknown Words

1. find -NONE- tagged words

2. fix tags

Train New Tagger

$ train_tagger.py my_corpus --reader nltk.corpus.reader.ChunkedCorpusReader

Train Chunker

$ train_chunker.py my_corpus --reader nltk.corpus.reader.ChunkedCorpusReader

Extracting Phrases

import collections, nltk.data
from nltk import tokenize
from nltk.tag import untag

tagger = nltk.data.load('taggers/my_corpus_tagger.pickle')
chunker = nltk.data.load('chunkers/my_corpus_chunker.pickle')

def extract_phrases(t):
    d = collections.defaultdict(list)
    for sub in t.subtrees(lambda s: s.node != 'S'):
        d[sub.node].append(' '.join(untag(sub.leaves())))
    return d

sents = tokenize.sent_tokenize(text)
words = tokenize.word_tokenize(sents[0])
d = extract_phrases(chunker.parse(tagger.tag(words)))
# defaultdict(<type 'list'>, {'PHRASE_TAG': ['phrase']})

Final Tips

error correction is faster than manual annotation

find close enough corpora

use nltk-trainer to experiment

iterate -> quality

no substitute for human judgement