Natural Language Processing + Python by Ann C. Tan-Pohlmann February 22, 2014

Natural Language Processing and Python


DESCRIPTION

A short presentation on basic NLP concepts and computational challenges, using Python tools such as NLTK and Gensim. Presented by Ann C. Tan-Pohlmann.


Outline
  • NLP Basics
  • NLTK
  • Text Processing
  • Gensim (really, really short)
  • Text Classification

Natural Language Processing
  • computer science, artificial intelligence, and linguistics
  • human-computer interaction
  • natural language understanding
  • natural language generation
  (source: Wikipedia)

Star Trek's Universal Translator
http://www.youtube.com/watch?v=EaeSKUV2zp0

Spoken Dialog Systems

NLP Basics
  • Morphology: the study of word formation; how word forms vary in a sentence
  • Syntax: the branch of grammar concerned with how words are arranged in a sentence to show connections of meaning
  • Semantics: the study of the meaning of words, phrases and sentences

NLTK: Getting Started
  • Natural Language Toolkit for symbolic and statistical NLP
  • a teaching tool, a study tool, and a platform for prototyping
  • Python 2.7 is a prerequisite

    >>> import nltk
    >>> nltk.download()

Some NLTK methods
  • text.similar(str)
  • text.concordance(str)
  • len(text)
  • len(set(text))
  • lexical_diversity: len(text) / len(set(text))
  • text.collocations(): sequences of words that often occur together
  • Frequency Distribution: fd = FreqDist(text), fd.inc(str), fd[str], fd.N(), fd.max()
MORPHOLOGY > Syntax > Semantics

Frequency Distribution
  • fd = FreqDist(text)
  • fd.inc(str): increments the count for sample str
  • fd[str]: returns the number of occurrences of sample str
  • fd.N(): total number of samples counted (all tokens, not just distinct ones)
  • fd.max(): the sample with the greatest count

Corpus
  • a large collection of raw or categorized text on one or more domains
  • Examples: Gutenberg, Brown, Reuters, Web & Chat Text

    >>> from nltk.corpus import brown
    >>> brown.categories()
    ['adventure', 'belles_lettres', 'editorial', 'fiction', 'government', 'hobbies',
     'humor', 'learned', 'lore', 'mystery', 'news', 'religion', 'reviews', 'romance',
     'science_fiction']
    >>> adventure_text = brown.words(categories='adventure')

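The frequency-distribution operations listed earlier can be tried without downloading any corpus. The sketch below is an assumption-laden stand-in: it uses collections.Counter rather than NLTK itself, on the grounds that in NLTK 3 FreqDist is a Counter subclass and fd.inc(str) is no longer available. The token list is a made-up sample.

```python
from collections import Counter

# Hypothetical sample text; any list of tokens behaves the same way.
tokens = "the quick brown fox jumps over the lazy dog the end".split()

fd = Counter(tokens)       # one count per distinct token
fd["fox"] += 1             # replaces the older fd.inc("fox")

# Equivalent of fd.N(): every counted sample, repeats included.
total = sum(fd.values())

# Equivalent of fd.max(): the sample with the greatest count.
top_word, top_count = fd.most_common(1)[0]

# Ratio from the slide: average number of uses per distinct word.
# It is computed from the token list, which the increment above did not change.
lexical_diversity = len(tokens) / len(set(tokens))

print(total, top_word, top_count, round(lexical_diversity, 3))
```

With a real nltk.FreqDist the same indexing and most_common calls should apply, though that depends on the installed NLTK version.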
Corpora in Other Languages

    >>> import nltk
    >>> from nltk import FreqDist
    >>> from nltk.corpus import udhr
    >>> languages = nltk.corpus.udhr.fileids()
    >>> languages.index('Filipino_Tagalog-Latin1')
    >>> tagalog = nltk.corpus.udhr.raw('Filipino_Tagalog-Latin1')
    >>> tagalog_words = nltk.corpus.udhr.words('Filipino_Tagalog-Latin1')
    >>> tagalog_tokens = nltk.word_tokenize(tagalog)
    >>> tagalog_text = nltk.Text(tagalog_tokens)
    >>> fd = FreqDist(tagalog_text)
    >>> for sample in fd:
    ...     print(sample)

Using Corpus from Palito
  • Corpus: a large collection of raw or categorized text

    >>> import nltk
    >>> from nltk.corpus import PlaintextCorpusReader
    >>> corpus_dir = '/Users/ann/Downloads'
    >>> tagalog = PlaintextCorpusReader(corpus_dir, 'Tagalog_Literary_Text.txt')
    >>> raw = tagalog.raw()
    >>> sentences = tagalog.sents()
    >>> words = tagalog.words()
    >>> tokens = nltk.word_tokenize(raw)
    >>> tagalog_text = nltk.Text(tokens)

Spoken Dialog Systems
MORPHOLOGY > Syntax > Semantics

Tokenization
  • Tokenization: breaking up a string into words and punctuation

    >>> tokens = nltk.word_tokenize(raw)
    >>> tagalog_tokens = nltk.Text(tokens)
    >>> tagalog_tokens = set(sample.lower() for sample in tagalog_tokens)

Stemming
  • Stemming: normalizes a word to a base form; the result may not be the 'root' word

    >>> def stem(word):
    ...     for suffix in ['ing', 'ly', 'ed', 'ious', 'ies', 'ive', 'es', 's', 'ment']:
    ...         if word.endswith(suffix):
    ...             return word[:-len(suffix)]
    ...     return word
    ...
    >>> stem('reading')
    'read'
    >>> stem('moment')
    'mo'

Lemmatization
  • Lemmatization: uses a vocabulary list and morphological analysis (uses the POS of a word)

    >>> def stem(word):
    ...     for suffix in ['ing', 'ly', 'ed', 'ious', 'ies', 'ive', 'es', 's', 'ment']:
    ...         if word.endswith(suffix) and word[:-len(suffix)] in brown.words():
    ...             return word[:-len(suffix)]
    ...     return word
    ...
    >>> stem('reading')
    'read'
    >>> stem('moment')
    'moment'

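The difference between the two stem() functions above can be reproduced without the Brown corpus. In this minimal sketch, VOCABULARY is a hypothetical five-word stand-in for brown.words(), and the names naive_stem and checked_stem are made up for illustration. Holding the vocabulary in a set also avoids rescanning a whole corpus on every lookup.

```python
SUFFIXES = ['ing', 'ly', 'ed', 'ious', 'ies', 'ive', 'es', 's', 'ment']

# Hypothetical mini vocabulary standing in for brown.words().
VOCABULARY = frozenset(['read', 'walk', 'quick', 'moment', 'govern'])


def naive_stem(word):
    """Strip the first matching suffix, with no vocabulary check."""
    for suffix in SUFFIXES:
        if word.endswith(suffix):
            return word[:-len(suffix)]
    return word


def checked_stem(word, vocabulary=VOCABULARY):
    """Strip a suffix only when what remains is a known word."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and word[:-len(suffix)] in vocabulary:
            return word[:-len(suffix)]
    return word


for w in ['reading', 'moment', 'government']:
    print(w, naive_stem(w), checked_stem(w))
```

The vocabulary check is what keeps 'moment' intact while still reducing 'government' to 'govern'. A real lemmatizer such as the WordNet one does considerably more than this.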
NLTK Stemmers & Lemmatizer
  • Porter Stemmer and Lancaster Stemmer

    >>> porter = nltk.PorterStemmer()
    >>> lancaster = nltk.LancasterStemmer()
    >>> [porter.stem(w) for w in brown.words()[:100]]

  • WordNet Lemmatizer

    >>> wnl = nltk.WordNetLemmatizer()
    >>> [wnl.lemmatize(w) for w in brown.words()[:100]]

  • Comparison

    >>> [wnl.lemmatize(w) for w in ['investigation', 'women']]
    >>> [porter.stem(w) for w in ['investigation', 'women']]
    >>> [lancaster.stem(w) for w in ['investigation', 'women']]

Using Regular Expression

    Operator    Behavior
    .           Wildcard, matches any character
    ^abc        Matches some pattern abc at the start of a string
    abc$        Matches some pattern abc at the end of a string
    [abc]       Matches one of a set of characters
    [A-Z0-9]    Matches one of a range of characters
    ed|ing|s    Matches one of the specified strings (disjunction)
    *           Zero or more of the previous item, e.g. a*, [a-z]* (also known as Kleene closure)
    +           One or more of the previous item, e.g. a+, [a-z]+
    ?           Zero or one of the previous item (i.e. optional), e.g. a?, [a-z]?
    {n}         Exactly n repeats, where n is a non-negative integer
    {n,}        At least n repeats
    {,n}        No more than n repeats
    {m,n}       At least m and no more than n repeats
    a(b|c)+     Parentheses that indicate the scope of the operators

Using Regular Expression

    >>> import re
    >>> re.findall(r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)?$', 'reading')
    [('read', 'ing')]
    >>> def stem(word):
    ...     regexp = r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)?$'
    ...     stem, suffix = re.findall(regexp, word)[0]
    ...     return stem
    ...
    >>> stem('reading')
    'read'
    >>> stem('moment')
    'mo'

  • The regular expression has no vocabulary check, so it strips 'ment' from 'moment' just as the plain suffix loop does.

Spoken Dialog Systems
Morphology > SYNTAX > Semantics

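To see what the regular-expression stemmer from the morphology slides actually returns, the same pattern can be compiled once and run over a few words. This is a small sketch: regex_stem and the word list are illustrative names and inputs, not part of the original deck.

```python
import re

# Lazy stem group followed by an optional suffix group, as on the slide.
SUFFIX_PATTERN = re.compile(r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)?$')


def regex_stem(word):
    """Return the stem group; the suffix group is None when nothing matched."""
    stem, _suffix = SUFFIX_PATTERN.match(word).groups()
    return stem


results = {w: regex_stem(w) for w in ['reading', 'moment', 'processes', 'cat']}
print(results)
```

Because the first group is lazy, the engine settles on the shortest stem that lets some suffix reach the end of the word. That handles 'processes' well but also cuts 'moment' down to 'mo', since nothing checks the stem against a vocabulary.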
Lexical Resources
  • a collection of words with associated information (annotation)
  • Example: stopwords, high-frequency words with little lexical content

    >>> from nltk.corpus import stopwords
    >>> stopwords.words('english')
    >>> stopwords.words('german')

Part-of-Speech (POS) Tagging
  • the process of labeling each word with a particular part of speech, based on its definition and context

NLTK's POS Tag Sets* (1/2)

    Tag    Meaning         Examples
    ADJ    adjective       new, good, high, special, big, local
    ADV    adverb          really, already, still, early, now
    CNJ    conjunction     and, or, but, if, while, although
    DET    determiner      the, a, some, most, every, no
    EX     existential     there, there's
    FW     foreign word    dolce, ersatz, esprit, quo, maitre
    MOD    modal verb      will, can, would, may, must, should
    N      noun            year, home, costs, time, education
    NP     proper noun     Alison, Africa, April, Washington

    *simplified

NLTK's POS Tag Sets* (2/2)

    Tag    Meaning               Examples
    NUM    number                twenty-four, fourth, 1991, 14:24
    PRO    pronoun               he, their, her, its, my, I, us
    P      preposition           on, of, at, with, by, into, under
    TO     the word to           to
    UH     interjection          ah, bang, ha, whee, hmpf, oops
    V      verb                  is, has, get, do, make, see, run
    VD     past tense            said, took, told, made, asked
    VG     present participle    making, going, playing, working
    VN     past participle       given, taken, begun, sung
    WH     wh determiner         who, which, when, what, where, how

    *simplified

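The simplified tag tables above can be held in a dictionary to make tagged output easier to read. In this sketch the meanings are copied from the tables, while the describe helper and the hand-tagged sample sentence are hypothetical and were not produced by any tagger.

```python
# Tag meanings copied from the simplified tag tables.
SIMPLIFIED_TAGS = {
    'ADJ': 'adjective', 'ADV': 'adverb', 'CNJ': 'conjunction',
    'DET': 'determiner', 'EX': 'existential', 'FW': 'foreign word',
    'MOD': 'modal verb', 'N': 'noun', 'NP': 'proper noun',
    'NUM': 'number', 'PRO': 'pronoun', 'P': 'preposition',
    'TO': 'the word to', 'UH': 'interjection', 'V': 'verb',
    'VD': 'past tense', 'VG': 'present participle',
    'VN': 'past participle', 'WH': 'wh determiner',
}


def describe(tagged_words):
    """Swap each tag for its meaning; unknown tags pass through unchanged."""
    return [(word, SIMPLIFIED_TAGS.get(tag, tag)) for word, tag in tagged_words]


# Hypothetical sentence, tagged by hand with tags from the tables.
sample = [('he', 'PRO'), ('told', 'VD'), ('Alison', 'NP'), ('to', 'TO'), ('run', 'V')]

for word, meaning in describe(sample):
    print(word, '->', meaning)
```

Tags outside the table, such as the Penn Treebank tags that nltk.pos_tag returns, are left as they are.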
NLTK POS Tagger (Brown)

    >>> nltk.pos_tag(brown.words()[:30])
    [('The', 'DT'), ('Fulton', 'NNP'), ('County', 'NNP'), ('Grand', 'NNP'),
     ('Jury', 'NNP'), ('said', 'VBD'), ('Friday', 'NNP'), ('an', 'DT'),
     ('investigation', 'NN'), ('of', 'IN'), ("Atlanta's", 'JJ'), ('recent', 'JJ'),
     ('primary', 'JJ'), ('election', 'NN'), ('produced', 'VBN'), ('``', '``'),
     ('no', 'DT'), ('evidence', 'NN'), ("''", "''"), ('that', 'WDT'), ('any', 'DT'),
     ('irregularities', 'NNS'), ('took', 'VBD'), ('place', 'NN'), ('.', '.'),
     ('The', 'DT'), ('jury', 'NN'), ('further', 'RB'), ('said', 'VBD'), ('in', 'IN')]
    >>> brown.tagged_words(simplify_tags=True)
    [('The', 'DET'), ('Fulton', 'NP'), ('County', 'N'), ...]

NLTK POS Tagger (German)

    >>> german = nltk.corpus.europarl_raw.german
    >>> nltk.pos_tag(german.words()[:30])
    [(u'Wiederaufnahme', 'NNP'), (u'der', 'NN'), (u'Sitzungsperiode', 'NNP'),
     (u'Ich', 'NNP'), (u'erkl\xe4re', 'NNP'), (u'die', 'VB'), (u'am', 'NN'),
     (u'Freitag', 'NNP'), (u',', ','), (u'dem', 'NN'), (u'17.', 'CD'),
     (u'Dezember', 'NNP'), (u'unterbrochene', 'NN'), (u'Sitzungsperiode', 'NNP'),
     (u'des', 'VBZ'), (u'Europ\xe4ischen', 'JJ'), (u'Parlaments', 'NNS'),
     (u'f\xfcr', 'JJ'), (u'wiederaufgenommen', 'NNS'), (u',', ','),
     (u'w\xfcnsche', 'NNP'), (u'Ihnen', 'NNP'), (u'nochmals', 'NNS'),
     (u'alles', 'VBZ'), (u'Gute', 'NNP'), (u'zum', 'NN'), (u'Jahreswechsel', 'NNP'),
     (u'und', 'NN'), (u'hoffe', 'NN'), (u',', ',')]

  • \xe4 = ä, \xfc = ü
  • !!! DOES NOT WORK FOR GERMAN: nltk.pos_tag uses a tagger trained on English, so the tags above are unreliable.

NLTK POS Dictionary

    >>> pos = nltk.defaultdict(lambda: 'N')
    >>> pos['eat']
    'N'
    >>> pos.items()
    [('eat', 'N')]
    >>> for (word, tag) in brown.tagged_words(simplify_tags=True):
    ...     if word in pos:
    ...         if isinstance(pos[word], str):
    ...             new_list = [pos[word]]
    ...             pos[word] = new_list
    ...         if tag not in pos[word]:
    ...             pos[word].append(tag)
    ...     else:
    ...         pos[word] = [tag]
    ...
    >>> pos['eat']
    ['N', 'V']

What else can you do with NLTK?
  • Other Taggers
      • Unigram Tagging: nltk.UnigramTagger(), train the tagger using tagged sentence data
      • N-gram Tagging
  • Text classification using machine learning techniques
      • decision trees
      • naive Bayes classification (supervised)
      • Markov Models

Gensim
  • Tool that extracts the semantic structure of documents by examining statistical word co-occurrence patterns within a corpus of training documents.
  • Algorithms:
      1. Latent Semantic Analysis (LSA)
      2. Latent Dirichlet Allocation (LDA) or Random Projections
Morphology > Syntax > SEMANTICS

Gensim Features
  • memory independent
  • wrappers/converters for several data formats
  • Vector: representation of the document as an array of features or question-answer pairs
      1. (word occurrence, count)
      2. (paragraph, count)
      3. (font, count)
  • Model: transformation from one vector to another, learned from a training corpus without supervision

Wiki document classification
http://radimrehurek.com/gensim/wiki.html

Other NLP tools for Python
  • TextBlob: part-of-speech tagging, noun phrase extraction, sentiment analysis, classification, translation
    https://pypi.python.org/pypi/textblob
  • Pattern: part-of-speech taggers, n-gram search, sentiment analysis, WordNet, machine learning
    http://www.clips.ua.ac.be/pattern

Star Trek technology that became a reality
http://www.youtube.com/watch?v=sRZxwRIH9RI

Installation Guides
  • NLTK
    http://www.nltk.org/install.html
    http://www.nltk.org/data.html
  • Gensim
    http://radimrehurek.com/gensim/install.html
  • Palito
    http://ccs.dlsu.edu.ph:8086/Palito/find_project.jsp

Using iPython
http://ipython.org/install.html

    >>> documents = ["Human machine interface for lab abc computer applications",
    ...              "A survey of user opinion of computer system response time",
    ...              "The EPS user interface management system",
    ...              "System and human system engineering testing of EPS",
    ...              "Relation of user perceived response time to error measurement",
    ...              "The generation of random binary unordered trees",
    ...              "The intersection graph of paths in trees",
    ...              "Graph minors IV Widths of trees and well quasi ordering",
    ...              "Graph minors A survey"]

References
  • Natural Language Processing with Python, by Steven Bird, Ewan Klein, Edward Loper
    http://www.nltk.org/book/
  • http://radimrehurek.com/gensim/tutorial.html

Thank You!
For questions and comments: ann at auberonsolutions dot com
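The (word occurrence, count) vector representation mentioned under Gensim Features can be illustrated with three of the sample documents from the iPython slide. This is a plain-Python sketch and does not use Gensim's API: the whitespace tokenizer, the word_ids mapping and the to_bow helper are all assumptions made for illustration.

```python
from collections import Counter

# Three of the sample documents from the slide.
documents = [
    "Human machine interface for lab abc computer applications",
    "A survey of user opinion of computer system response time",
    "The EPS user interface management system",
]

# Naive tokenization: lowercase and split on whitespace.
texts = [doc.lower().split() for doc in documents]

# Give each distinct word an integer id, in order of first appearance.
word_ids = {}
for text in texts:
    for word in text:
        word_ids.setdefault(word, len(word_ids))


def to_bow(text):
    """Return the document as a sorted list of (word id, count) pairs."""
    counts = Counter(word_ids[word] for word in text)
    return sorted(counts.items())


vectors = [to_bow(text) for text in texts]
for vector in vectors:
    print(vector)
```

Each document becomes a sparse list of pairs, so a word absent from a document takes no space. Gensim ships its own dictionary and bag-of-words utilities for the same purpose, and they are preferable for real corpora.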