Corpus Bootstrapping with NLTK by Jacob Perkins

Corpus Bootstrapping with NLTK - O'Reilly Mediaassets.en.oreilly.com/1/event/75/Corpus Bootstrapping with NLTK... · Extracting Phrases import collections, nltk.data from nltk import

Jacob Perkins

http://www.weotta.com

http://streamhacker.com

http://text-processing.com

https://github.com/japerk/nltk-trainer

@japerk

Problem

you want to do NLProc

many proven supervised training algorithms

but you don’t have a training corpus

Solution

make a custom training corpus

Problems with Manual Annotation

takes time

requires expertise

expert time costs $$$

Solution: Bootstrap

less time

less expertise

costs less

requires thinking & creativity

Corpus Bootstrapping at Weotta

review sentiment

keyword classification

phrase extraction & classification

Bootstrapping Examples

english -> spanish sentiment

phrase extraction

Translating Sentiment

start with english sentiment corpus & classifier

english -> spanish -> spanish

English -> Spanish -> Spanish

1. translate english examples to spanish

2. train classifier

3. classify spanish text into new corpus

4. correct new corpus

5. retrain classifier

6. add to corpus & goto 4 until done
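
The six steps above can be sketched as a short loop. This is only an illustration under stated assumptions: `train`, `bootstrap`, and `correct` are hypothetical names, and the word-vote trainer is a toy stand-in for a real classifier such as the Naive Bayes model that train_classifier.py produces. The `correct` callback represents the human review in step 4.

```python
from collections import Counter

def train(corpus):
    """Toy stand-in trainer (an assumption, not the deck's classifier):
    each known word votes for the labels it was seen with."""
    counts = {}
    for text, label in corpus:
        for word in text.lower().split():
            counts.setdefault(word, Counter())[label] += 1

    def classify(text):
        votes = Counter()
        for word in text.lower().split():
            votes.update(counts.get(word, {}))
        # None means no known words; a human has to label such examples
        return votes.most_common(1)[0][0] if votes else None

    return classify

def bootstrap(seed_corpus, batches, correct):
    corpus = list(seed_corpus)            # step 1: translated examples
    classify = train(corpus)              # step 2: initial classifier
    for batch in batches:
        guessed = [(text, classify(text)) for text in batch]  # step 3
        corpus.extend(correct(guessed))   # steps 4 and 6: fix labels, then add
        classify = train(corpus)          # step 5: retrain
    return classify, corpus

seed = [("muy buena pelicula", "pos"), ("muy mala pelicula", "neg")]
batches = [["buena historia", "mala historia"]]
# identity stands in for manual correction; a reviewer would fix wrong labels here
classifier, corpus = bootstrap(seed, batches, correct=lambda guessed: guessed)
```

Each pass through the loop gives the classifier more in-domain text, which is why correction effort tends to shrink from one round to the next.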

Translate Corpus

$ translate_corpus.py movie_reviews --source english --target spanish

Train Initial Classifier

$ train_classifier.py spanish_movie_reviews

Create New Corpus

$ classify_to_corpus.py spanish_sentiment --input spanish_examples.txt --classifier spanish_movie_reviews_NaiveBayes.pickle

Manual Correction

1. scan each file

2. move misclassified examples to the file for the correct category

Train New Classifier

$ train_classifier.py spanish_sentiment

Adding to the Corpus

start with >90% probability

retrain

carefully decrease probability threshold
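
A minimal sketch of the threshold idea, assuming a scoring function that returns a dict of label probabilities. `select_confident` and `fake_prob_classify` are hypothetical names for illustration. With an NLTK classifier, `prob_classify(featureset)` returns a probability distribution whose `max()` and `prob(label)` methods give the same information.

```python
def select_confident(prob_classify, texts, threshold=0.9):
    """Split texts into (accepted, deferred) by the top label's probability."""
    accepted, deferred = [], []
    for text in texts:
        probs = prob_classify(text)            # dict: label -> probability
        label = max(probs, key=probs.get)
        if probs[label] > threshold:
            accepted.append((text, label))     # confident enough to add
        else:
            deferred.append(text)              # hold back for a later round
    return accepted, deferred

def fake_prob_classify(text):
    """Made-up scores standing in for a trained classifier."""
    scores = {"excelente": 0.95, "bien": 0.85, "regular": 0.55}
    p = scores.get(text, 0.5)
    return {"pos": p, "neg": 1 - p}

texts = ["excelente", "bien", "regular"]
first, rest = select_confident(fake_prob_classify, texts, threshold=0.9)
# in practice the classifier is retrained before the threshold is lowered
second, rest = select_confident(fake_prob_classify, rest, threshold=0.8)
```

Starting high keeps early mistakes out of the corpus. Lowering the threshold only after retraining means each new batch is judged by a classifier that has already seen the corrected data.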

Add more at a Lower Threshold

$ classify_to_corpus.py categorized_corpus --classifier categorized_corpus_NaiveBayes.pickle --threshold 0.8 --input new_examples.txt

When are you done?

what level of accuracy do you need?

does your corpus reflect real text?

how much time do you have?
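
One way to put a number on the first question is to hold back part of the corrected corpus and measure accuracy on it. The sketch below assumes examples are `(text, label)` pairs and that `classify` is any callable returning a label; `split_corpus` and `accuracy` are hypothetical helper names, not nltk-trainer functions.

```python
import random

def split_corpus(examples, test_fraction=0.25, seed=0):
    """Shuffle deterministically, then split into (train, test) lists."""
    examples = list(examples)
    random.Random(seed).shuffle(examples)
    cut = int(len(examples) * (1 - test_fraction))
    return examples[:cut], examples[cut:]

def accuracy(classify, test_set):
    """Fraction of held-out examples whose predicted label matches."""
    if not test_set:
        return 0.0
    hits = sum(1 for text, label in test_set if classify(text) == label)
    return hits / len(test_set)

examples = [("texto %d" % i, "pos" if i % 2 else "neg") for i in range(8)]
train_set, test_set = split_corpus(examples)
lookup = dict(examples)
perfect_score = accuracy(lookup.get, test_set)     # oracle classifier
constant_score = accuracy(lambda text: "pos", [("a", "pos"), ("b", "neg")])
```

The held-out examples should come from the real target text rather than the translated seed corpus; otherwise the score says little about the second question.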

Tips

garbage in, garbage out

correct bad data

clean & scrub text

experiment with train_classifier.py options

create custom features
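
As a rough illustration of the last two tips, the sketch below scrubs text and builds a feature dict of the kind NLTK classifiers accept (feature name mapped to value). `clean` and `features` are hypothetical names, and the specific choices (tag stripping, lowercasing, bigrams) are assumptions to experiment with, not recommendations from the talk.

```python
import re
import string

# string.punctuation is ASCII only, so add the Spanish inverted marks
PUNCTUATION = string.punctuation + "¡¿"

def clean(text):
    text = re.sub(r"<[^>]+>", " ", text)      # drop leftover HTML tags
    text = text.lower()
    text = text.translate(str.maketrans("", "", PUNCTUATION))
    return re.sub(r"\s+", " ", text).strip()  # collapse whitespace

def features(text):
    words = clean(text).split()
    feats = {word: True for word in words}    # bag of words
    # custom feature: adjacent word pairs
    feats.update({a + " " + b: True for a, b in zip(words, words[1:])})
    return feats

example = features("<b>¡Muy   BUENA!</b>")
```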

Bootstrapping a Phrase Extractor

1. find a pos tagged corpus

2. annotate raw text

3. train pos tagger

4. create pos tagged & chunked corpus

5. tag unknown words

6. train pos tagger & chunker

7. correct errors

8. add to corpus, goto 5 until done

NLTK Tagged Corpora

English: brown, conll2000, treebank

Portuguese: mac_morpho, floresta

Spanish: cess_esp, conll2002

Catalan: cess_cat

Dutch: alpino, conll2002

Indian Languages: indian

Chinese: sinica_treebank

see http://text-processing.com/demo/tag/

Train Tagger

$ train_tagger.py treebank --simplify_tags

Phrase Annotation

Hello world, [this is an important phrase].

Tag Phrases

$ tag_phrases.py my_corpus --tagger treebank_simplify_tags.pickle --input my_phrases.txt

Chunked & Tagged Phrase

Hello/N world/N ,/, [ this/DET is/V an/DET important/ADJ phrase/N ] ./.
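
To make the format concrete, here is a small stdlib-only sketch that reads a line like the one above into tokens and phrases. `parse_chunked` is a hypothetical helper; it assumes brackets are balanced and separated from words by spaces. In practice a corpus reader such as NLTK's ChunkedCorpusReader handles this kind of file.

```python
def parse_chunked(line):
    """Return (tokens, phrases): tokens are (word, tag) pairs,
    phrases are the lists of pairs found inside [ ... ]."""
    tokens, phrases, current = [], [], None
    for item in line.split():
        if item == "[":
            current = []                      # a phrase starts
        elif item == "]":
            phrases.append(current)           # the phrase ends
            current = None
        else:
            # split on the last slash so that ",/," and "./." still work
            word, _, tag = item.rpartition("/")
            tokens.append((word, tag))
            if current is not None:
                current.append((word, tag))
    return tokens, phrases

line = "Hello/N world/N ,/, [ this/DET is/V an/DET important/ADJ phrase/N ] ./."
tokens, phrases = parse_chunked(line)
phrase_text = " ".join(word for word, tag in phrases[0])
```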

Correct Unknown Words

1. find -NONE- tagged words

2. fix tags
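
Step 1 can be partly automated. The sketch below scans word/tag lines for tokens carrying the -NONE- tag so they can be reviewed by hand. `find_unknown` is a hypothetical helper, and the sample lines are invented for illustration.

```python
def find_unknown(lines, unknown_tag="-NONE-"):
    """Yield (line_number, word) for every token with the unknown tag."""
    for number, line in enumerate(lines, start=1):
        for item in line.split():
            word, sep, tag = item.rpartition("/")
            if sep and tag == unknown_tag:    # skips bare "[" and "]"
                yield number, word

sample = [
    "Hello/N Weotta/-NONE- ./.",
    "[ great/ADJ tacos/-NONE- ] ./.",
]
unknown = list(find_unknown(sample))
```

Listing the unknown words first makes it easier to fix the most frequent ones in one pass before retraining the tagger.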

Train New Tagger

$ train_tagger.py my_corpus --reader nltk.corpus.reader.ChunkedCorpusReader

Train Chunker

$ train_chunker.py my_corpus --reader nltk.corpus.reader.ChunkedCorpusReader

Extracting Phrases

import collections, nltk.data
from nltk import tokenize
from nltk.tag import untag

# load the tagger and chunker trained in the previous steps
tagger = nltk.data.load('taggers/my_corpus_tagger.pickle')
chunker = nltk.data.load('chunkers/my_corpus_chunker.pickle')

def extract_phrases(t):
    # group the words of each chunk by its label, skipping the 'S' root
    d = collections.defaultdict(list)
    # NLTK 3 uses s.label(); NLTK 2 used the s.node attribute
    for sub in t.subtrees(lambda s: s.label() != 'S'):
        d[sub.label()].append(' '.join(untag(sub.leaves())))
    return d

# text is the raw string to extract phrases from
sents = tokenize.sent_tokenize(text)
words = tokenize.word_tokenize(sents[0])
d = extract_phrases(chunker.parse(tagger.tag(words)))
# defaultdict(<class 'list'>, {'PHRASE_TAG': ['phrase']})

Final Tips

error correction is faster than manual annotation

find close enough corpora

use nltk-trainer to experiment

iterate -> quality

no substitute for human judgement