Click here to load reader

October 2005CSA3180: Text Processing II1 CSA3180: Natural Language Processing Text Processing 2 Shallow Parsing and Chunking Python and NLTK NLTK Exercises

  • View
    218

  • Download
    2

Embed Size (px)

Text of October 2005CSA3180: Text Processing II1 CSA3180: Natural Language Processing Text Processing 2...

  • October 2005CSA3180: Text Processing II*CSA3180: Natural Language ProcessingText Processing 2 Shallow Parsing and Chunking Python and NLTK NLTK Exercises

    CSA3180: Text Processing II

  • October 2006csa3180: Parsing Algorithms 1*Parsing ProblemGiven grammar G and sentence A discover all valid parse trees for G that exactly cover ASVPNPVDetNomNbookthatflight

    csa3180: Parsing Algorithms 1

  • October 2006csa3180: Parsing Algorithms 1*The elephant is in the gardenI shot an elephant in my gardenNPVPNPPPNPS

    csa3180: Parsing Algorithms 1

  • October 2006csa3180: Parsing Algorithms 1*I own the gardenI shot an elephant in my gardenNPVPPPNPS

    csa3180: Parsing Algorithms 1

  • October 2006csa3180: Parsing Algorithms 1*Parsing as SearchSearch within a space defined byStart StateGoal StateState to state transformationsTwo distinct parsing strategies:Top downBottom upDifferent parsing strategy, different state space, different problem.N.B. Parsing strategy search strategy

    csa3180: Parsing Algorithms 1

  • October 2005CSA3180: Text Processing II*Shallow/Chunk ParsingGoal: divide a sentence into a sequence of chunks.Chunks are non-overlapping regions of a text[I] saw [a tall man] in [the park].Chunks are non-recursive A chunk can not contain other chunksChunks are non-exhaustive Not all words are included in chunks

    CSA3180: Text Processing II

  • October 2005CSA3180: Text Processing II*Chunk Parsing ExamplesNoun-phrase chunking:[I] saw [a tall man] in [the park].Verb-phrase chunking:The man who [was in the park] [saw me].Prosodic chunking:

    [I saw] [a tall man] [in the park].Question answering:What [Spanish explorer] discovered [the Mississippi River]?

    CSA3180: Text Processing II

  • October 2005CSA3180: Text Processing II*MotivationLocating informatione.g., text retrievalIndex a document collection on its noun phrasesIgnoring informationGeneralize in order to study higher-level patternse.g. phrases involving gave in Penn treebank:gave NP; gave up NP in NP; gave NP up; gave NP help; gave NP to NPSometimes a full parse has too much structureToo nestedChunks usually are not recursive

    CSA3180: Text Processing II

  • October 2005CSA3180: Text Processing II*RepresentationBIO (or IOB) Trees

    CSA3180: Text Processing II

  • October 2005CSA3180: Text Processing II*Comparison with Full ParsingParsing is usually an intermediate stageBuilds structures that are used by later stages of processingFull parsing is a sufficient but not necessary intermediate stage for many NLP tasksParsing often provides more information than we need Shallow parsing is an easier problemLess word-order flexibility within chunks than between chunksMore locality:Fewer long-range dependenciesLess context-dependenceLess ambiguity

    CSA3180: Text Processing II

  • October 2005CSA3180: Text Processing II*Chunks and Constituency

    Constituents: [[a tall man] [ in [the park]]].Chunks: [a tall man] in [the park].

    A constituent is part of some higher unit in the hierarchical syntactic parse Chunks are not constituents Constituents are recursiveBut, chunks are typically subsequences of constituents Chunks do not cross major constituent boundaries

    CSA3180: Text Processing II

  • October 2005CSA3180: Text Processing II*UnchunkingRemove any chunk with a given patterne.g., unChunkRule(+, Unchunk NNDT)Combine with Chunk Rule +Chunk all matching subsequences:Input: the/DT little/JJ cat/NN sat/VBD on/IN the/DT mat/NNApply chunk rule[the/DT little/JJ cat/NN] sat/VBD on/IN [the/DT mat/NN]Apply unchunk rule[the/DT little/JJ cat/NN] sat/VBD on/IN the/DT mat/NN

    CSA3180: Text Processing II

  • October 2005CSA3180: Text Processing II*ChinkingA chink is a subsequence of the text that is not a chunk.Define a regular expression that matches the sequences of tags in a chinkA simple chink regexp for finding NP chunks: (|)+First apply chunk rule to chunk everythingInput: the/DT little/JJ cat/NN sat/VBD on/IN the/DT mat/NNChunkRule('+', Chunk everything)[the/DT little/JJ cat/NN sat/VBD on/IN the/DT mat/NN]Apply Chink rule above:[the/DT little/JJ cat/NN] sat/VBD on/IN [the/DT mat/NN]

    CSA3180: Text Processing II

  • October 2005CSA3180: Text Processing II*MergingCombine adjacent chunks into a single chunkDefine a regular expression that matches the sequences of tags on both sides of the point to be mergedExample:Merge a chunk ending in JJ with a chunk starting with NNMergeRule(, , Merge adjs and nouns) [the/DT little/JJ] [cat/NN] sat/VBD on/IN the/DT mat/NN [the/DT little/JJ cat/NN] sat/VBD on/IN the/DT mat/NN Splitting is the opposite of merging

    CSA3180: Text Processing II

  • October 2005CSA3180: Text Processing II*Python and NLTKNatural Language Toolkit (NLTK)http://nltk.sourceforge.net/NLTK Slides partly based on Diane Litman LecturesChunk parsing slides partly based on Marti Hearst Lectures

    CSA3180: Text Processing II

  • October 2005CSA3180: Text Processing II*Python for NLPPython is a great language for NLP:SimpleEasy to debug:ExceptionsInterpreted languageEasy to structureModulesObject oriented programmingPowerful string manipulation

    CSA3180: Text Processing II

  • October 2005CSA3180: Text Processing II*Python Modules and PackagesPython modules package program code and data for reuse. (Lutz)Similar to library in C, package in Java. Python packages are hierarchical modules (i.e., modules that contain other modules). Three commands for accessing modules:importfromimportreload

    CSA3180: Text Processing II

  • October 2005CSA3180: Text Processing II*Import CommandThe import command loads a module:# Load the regular expression module>>> import re To access the contents of a module, use dotted names:# Use the search method from the re module>>> re.search(\w+, str)To list the contents of a module, use dir:>>> dir(re) [DOTALL, I, IGNORECASE,]

    CSA3180: Text Processing II

  • October 2005CSA3180: Text Processing II*from...importThe fromimport command loads individual functions and objects from a module:# Load the search function from the re module>>> from re import searchOnce an individual function or object is loaded with fromimport, it can be used directly:# Use the search method from the re module>>> search (\w+, str)

    CSA3180: Text Processing II

  • October 2005CSA3180: Text Processing II*Import vs. from...importImportKeeps module functions separate from user functions.Requires the use of dotted names.Works with reload.fromimportPuts module functions and user functions together.More convenient names.Does not work with reload.

    CSA3180: Text Processing II

  • October 2005CSA3180: Text Processing II*ReloadIf you edit a module, you must use the reload command before the changes become visible in Python:>>> import mymodule...>>> reload (mymodule)The reload command only affects modules that have been loaded with import; it does not update individual functions and objects loaded with from...import.

    CSA3180: Text Processing II

  • October 2005CSA3180: Text Processing II*NLTK IntroductionThe Natural Language Toolkit (NLTK) provides:Basic classes for representing data relevant to natural language processing.Standard interfaces for performing tasks, such as tokenization, tagging, and parsing.Standard implementations of each task, which can be combined to solve complex problems.

    CSA3180: Text Processing II

  • October 2005CSA3180: Text Processing II*NLTK Example Modulesnltk.token: processing individual elements of text, such as words or sentences.nltk.probability: modeling frequency distributions and probabilistic systems.nltk.tagger: tagging tokens with supplemental information, such as parts of speech or wordnet sense tags.nltk.parser: high-level interface for parsing texts.nltk.chartparser: a chart-based implementation of the parser interface.nltk.chunkparser: a regular-expression based surface parser.

    CSA3180: Text Processing II

  • October 2005CSA3180: Text Processing II*Chunk Parsing in NLTKChunk parsers usually ignore lexical contentOnly need to look at part-of-speech tagsPossible steps in chunk parsingChunking, unchunkingChinkingMerging, splittingEvaluationCompare to a BaselineEvaluate in terms of Precision, Recall, F-MeasureMissed (False Negative), Incorrect (False Positive)

    CSA3180: Text Processing II

  • October 2005CSA3180: Text Processing II*Chunk Parsing in NLTKDefine a regular expression that matches the sequences of tags in a chunk

    A simple noun phrase chunk regexp:(Note that matches any tag starting with NN)? *

    Chunk all matching subsequences:

    the/DT little/JJ cat/NN sat/VBD on/IN the/DT mat/NN[the/DT little/JJ cat/NN] sat/VBD on/IN [the/DT mat/NN]

    If matching subsequences overlap, first 1 gets priority

    CSA3180: Text Processing II

  • October 2005CSA3180: Text Processing II*NLTK Exercises for Next WeekSeries of tutorials by Steven Bird, Edward Klein and Edward Loperhttp://nltk.sourceforge.net/University of PennsylvaniaBy next lecture please read and do exercises in:IntroductionProgrammingTokenizeTag

    CSA3180: Text Processing II

  • October 2005CSA3180: Text Processing II*Next SessionsNatural Language Toolkit (NLTK) Exerciseshttp://nltk.sourceforge.net/Discovery of Word AssociationsText ClassificationClustering/Data MiningTF.IDFLinear and Non-Linear ClassificationBinary ClassificationMulti-Class Classification

    CSA3180: Text Processing II

Search related