Corpus Bootstrapping with NLTK


Presented at Strata 2012 Deep Data session.


Jacob Perkins

http://www.weotta.com

http://streamhacker.com

http://text-processing.com

https://github.com/japerk/nltk-trainer

@japerk

Problem

you want to do NLProc

many proven supervised training algorithms

but you don’t have a training corpus

Solution

make a custom training corpus

Problems with Manual Annotation

takes time

requires expertise

expert time costs $$$

Solution: Bootstrap

less time

less expertise

costs less

requires thinking & creativity

Corpus Bootstrapping at Weotta

review sentiment

keyword classification

phrase extraction & classification

Bootstrapping Examples

english -> spanish sentiment

phrase extraction

Translating Sentiment

start with english sentiment corpus & classifier

english -> spanish -> spanish

English -> Spanish -> Spanish

1. translate english examples to spanish

2. train classifier

3. classify spanish text into new corpus

4. correct new corpus

5. retrain classifier

6. add to corpus & goto 4 until done
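
The loop above can be sketched as a toy self-training run with NLTK's NaiveBayesClassifier. The feature function and the tiny "corpus" here are illustrative stand-ins, not the deck's actual data or scripts:

```python
from nltk.classify import NaiveBayesClassifier

def feats(text):
    # Trivial bag-of-words features; a real run would use richer features.
    return {word: True for word in text.lower().split()}

# Seed corpus, standing in for the machine-translated Spanish reviews.
labeled = [
    (feats('excelente pelicula'), 'pos'),
    (feats('muy buena'), 'pos'),
    (feats('pelicula terrible'), 'neg'),
    (feats('muy mala'), 'neg'),
]
unlabeled = ['una pelicula excelente', 'una pelicula mala']

# Steps 2-6: train, classify new text, correct, add, retrain.
for _ in range(2):
    classifier = NaiveBayesClassifier.train(labeled)
    newly_labeled = [(feats(t), classifier.classify(feats(t))) for t in unlabeled]
    # ... manual correction happens here in a real run,
    # which would also pull fresh unlabeled text each pass ...
    labeled.extend(newly_labeled)
```

The manual-correction step is the whole point: each pass only adds value if a human fixes the classifier's mistakes before retraining.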

Translate Corpus

$ translate_corpus.py movie_reviews --source english --target spanish

Train Initial Classifier

$ train_classifier.py spanish_movie_reviews

Create New Corpus

$ classify_to_corpus.py spanish_sentiment --input spanish_examples.txt --classifier spanish_movie_reviews_NaiveBayes.pickle

Manual Correction

1. scan each file

2. move incorrect examples to correct file

Train New Classifier

$ train_classifier.py spanish_sentiment

Adding to the Corpus

start with >90% probability

retrain

carefully decrease probability threshold

Add more at a Lower Threshold

$ classify_to_corpus.py categorized_corpus --classifier categorized_corpus_NaiveBayes.pickle --threshold 0.8 --input new_examples.txt
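
Under the hood, a threshold like this works on the classifier's label probabilities. A minimal sketch using NLTK's `prob_classify` (the training data and the helper function are made up for illustration):

```python
from nltk.classify import NaiveBayesClassifier

# Toy labeled feature dicts standing in for the real corpus.
train = [
    ({'bueno': True}, 'pos'),
    ({'excelente': True}, 'pos'),
    ({'malo': True}, 'neg'),
    ({'terrible': True}, 'neg'),
]
classifier = NaiveBayesClassifier.train(train)

def classify_if_confident(feats, threshold=0.9):
    # Return a label only when its probability meets the threshold;
    # otherwise return None and leave the example for a later,
    # lower-threshold pass (or manual review).
    probs = classifier.prob_classify(feats)
    label = probs.max()
    return label if probs.prob(label) >= threshold else None
```

Lowering `threshold` on each pass admits more examples, at the cost of more manual correction.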

When are you done?

what level of accuracy do you need?

does your corpus reflect real text?

how much time do you have?

Tips

garbage in, garbage out

correct bad data

clean & scrub text

experiment with train_classifier.py options

create custom features
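
A custom feature function is just a dict builder. A hand-rolled sketch (names and the bigram idea are illustrative, not part of nltk-trainer):

```python
def bag_of_words(text):
    # Map every token to True -- the simplest NLTK-style feature dict.
    return {word: True for word in text.lower().split()}

def bag_of_words_and_bigrams(text):
    # Also add adjacent word pairs, so the classifier can pick up
    # combinations like ('not', 'good') that single words miss.
    words = text.lower().split()
    feats = {word: True for word in words}
    for pair in zip(words, words[1:]):
        feats[pair] = True
    return feats
```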

Bootstrapping a Phrase Extractor

1. find a pos tagged corpus


2. annotate raw text

3. train pos tagger

4. create pos tagged & chunked corpus

5. tag unknown words

6. train pos tagger & chunker

7. correct errors

8. add to corpus, goto 5 until done

NLTK Tagged Corpora

English: brown, conll2000, treebank

Portuguese: mac_morpho, floresta

Spanish: cess_esp, conll2002

Catalan: cess_cat

Dutch: alpino, conll2002

Indian Languages: indian

Chinese: sinica_treebank

see http://text-processing.com/demo/tag/

Train Tagger

$ train_tagger.py treebank --simplify_tags

Phrase Annotation

Hello world, [this is an important phrase].

Tag Phrases

$ tag_phrases.py my_corpus --tagger treebank_simplify_tags.pickle --input my_phrases.txt

Chunked & Tagged Phrase

Hello/N world/N ,/, [ this/DET is/V an/DET important/ADJ phrase/N ] ./.
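
NLTK can parse this bracketed tagged-string format directly with `tagstr2tree`. A sketch, using 'PHRASE' as our chunk label and the current `.label()` API (the deck's 2012-era code uses `.node` instead):

```python
from nltk.chunk.util import tagstr2tree
from nltk.tag import untag

s = 'Hello/N world/N ,/, [ this/DET is/V an/DET important/ADJ phrase/N ] ./.'
tree = tagstr2tree(s, chunk_label='PHRASE')

# Each bracketed span becomes a PHRASE subtree; untag() drops the tags.
phrases = [' '.join(untag(sub.leaves()))
           for sub in tree.subtrees(lambda t: t.label() == 'PHRASE')]
```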

Correct Unknown Words

1. find -NONE- tagged words

2. fix tags

Train New Tagger

$ train_tagger.py my_corpus --reader nltk.corpus.reader.ChunkedCorpusReader

Train Chunker

$ train_chunker.py my_corpus --reader nltk.corpus.reader.ChunkedCorpusReader

Extracting Phrases

import collections, nltk.data
from nltk import tokenize
from nltk.tag import untag

tagger = nltk.data.load('taggers/my_corpus_tagger.pickle')
chunker = nltk.data.load('chunkers/my_corpus_chunker.pickle')

def extract_phrases(t):
    d = collections.defaultdict(list)
    for sub in t.subtrees(lambda s: s.node != 'S'):
        d[sub.node].append(' '.join(untag(sub.leaves())))
    return d

sents = tokenize.sent_tokenize(text)
words = tokenize.word_tokenize(sents[0])
d = extract_phrases(chunker.parse(tagger.tag(words)))
# defaultdict(<type 'list'>, {'PHRASE_TAG': ['phrase']})

Final Tips

error correction is faster than manual annotation

find close enough corpora

use nltk-trainer to experiment

iterate -> quality

no substitute for human judgement