Methods in Computational Linguistics II with reference to Matt Huenerfauth’s Language Technology material Lecture 4: Matching Things. Regular Expressions

Today

• Regular Expressions
• Snippet on Speech Recognition
  – At least half of it.

Regular Expressions

• Can be viewed as a way to:
  – Specify search patterns over a text string
  – Design a particular kind of machine, a Finite State Automaton (FSA)
    • We probably won’t cover this today.
  – Define a formal “language”, i.e. a set of strings
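
As a rough illustration of the first and third views, the sketch below uses Python's standard `re` module. The pattern and the sample strings are made up for this example. `re.search` treats the expression as a search pattern over a text, while `re.fullmatch` asks whether a whole string belongs to the set of strings the expression defines.

```python
import re

# Hypothetical pattern: "ab" repeated one or more times.
pattern = r"(ab)+"

# View 1: a search pattern over a text string.
text = "xxababxx"
found = re.search(pattern, text)
span = found.span()  # (start, end) offsets of the first match

# View 3: a formal language, i.e. the strings the pattern accepts in full.
candidates = ["ab", "abab", "aba", "", "ba"]
in_language = [s for s in candidates if re.fullmatch(pattern, s)]

print(span)         # (2, 6)
print(in_language)  # ['ab', 'abab']
```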

Uses of Regular Expressions

• Simple but powerful tools for large corpus analysis and ‘shallow’ processing
  – What word is most likely to begin a sentence?
  – What word is most likely to begin a question?
  – Are you more or less polite than the people you correspond with?
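
The first two questions can be approximated with a few lines of Python. This is only a sketch: the toy corpus is invented, and sentence boundaries are assumed to be `.`, `!` or `?` followed by whitespace, which real text often violates (abbreviations, quotations).

```python
import re
from collections import Counter

# Toy corpus invented for illustration; a real study would read a large text file.
corpus = "The cat sat. The dog barked! Did you see it? The end. Did it rain?"

# A word at the very start of the text, or right after ., ! or ? plus whitespace.
first_words = re.findall(r"(?:^|[.!?]\s+)(\w+)", corpus)
sentence_starts = Counter(w.lower() for w in first_words)

# Same idea, but the lookahead requires that the next punctuation mark is "?".
question_first_words = re.findall(r"(?:^|[.!?]\s+)(\w+)(?=[^.!?]*\?)", corpus)
question_starts = Counter(w.lower() for w in question_first_words)

print(sentence_starts.most_common(1))  # [('the', 3)]
print(question_starts.most_common(1))  # [('did', 2)]
```

The lookahead `(?=...)` checks for the question mark without consuming it, so the punctuation is still available as the boundary before the next sentence.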

Definitions

• Regular Expression: Formula in algebraic notation for specifying a set of strings
• String: Any sequence of characters
• Regular Expression Search
  – Pattern: specifies the set of strings we want to search for
  – Corpus: the texts we want to search through

Simple Example

More Examples

And still more examples

Optionality and Repetition

• /[Ww]oodchucks?/
• /colou?r/
• /he{3}/
• /(he){3}/
• /(he){3,}/
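
The sketch below tries these quantifiers on a few made-up strings using Python's `re.fullmatch`, including the open-ended form `{3,}` (three or more repetitions). The helper name `matches_whole` is hypothetical and exists only for this example.

```python
import re

def matches_whole(pattern, s):
    """Return True if the entire string s matches the pattern."""
    return re.fullmatch(pattern, s) is not None

# ? makes the preceding character optional.
print(matches_whole(r"[Ww]oodchucks?", "Woodchuck"))    # True
print(matches_whole(r"[Ww]oodchucks?", "woodchuckss"))  # False
print(matches_whole(r"colou?r", "color"))               # True
print(matches_whole(r"colou?r", "colour"))              # True

# {3} applies only to the preceding character unless a group is used.
print(matches_whole(r"he{3}", "heee"))      # True
print(matches_whole(r"he{3}", "hehehe"))    # False
print(matches_whole(r"(he){3}", "hehehe"))  # True

# {3,} means three or more repetitions of the group.
print(matches_whole(r"(he){3,}", "hehehehe"))  # True
print(matches_whole(r"(he){3,}", "hehe"))      # False
```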

Character Groups

• Some groups of characters are used very frequently, so the RE language includes shorthands for them
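
For instance, Python's `re` module provides `\d` (digits), `\w` (letters, digits and underscore) and `\s` (whitespace). The sketch below applies them to an invented sample string; other RE dialects may define these shorthands slightly differently.

```python
import re

# Sample text invented for illustration.
text = "Room 42, call 555-0199 at 9am!"

digits = re.findall(r"\d+", text)        # runs of digits, like [0-9]+
words = re.findall(r"\w+", text)         # runs of letters, digits or underscore
spaces = re.findall(r"\s", text)         # single whitespace characters
non_word = re.findall(r"[^\w\s]", text)  # neither word characters nor whitespace

print(digits)       # ['42', '555', '0199', '9']
print(words)        # ['Room', '42', 'call', '555', '0199', 'at', '9am']
print(len(spaces))  # 5
print(non_word)     # [',', '-', '!']
```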

Special Characters

• These enable the matching of multiple occurrences of a pattern
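
Assuming the characters meant here are the usual quantifiers `*` (zero or more), `+` (one or more) and `?` (zero or one), a minimal Python sketch on a made-up string shows how they differ:

```python
import re

# Sample text invented for illustration.
text = "ba baa baaa b"

star = re.findall(r"ba*", text)      # b followed by zero or more a
plus = re.findall(r"ba+", text)      # b followed by one or more a
optional = re.findall(r"ba?", text)  # b followed by at most one a

print(star)      # ['ba', 'baa', 'baaa', 'b']
print(plus)      # ['ba', 'baa', 'baaa']
print(optional)  # ['ba', 'ba', 'ba', 'b']
```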

Escape Characters

• Sometimes you want to use an asterisk “*” as an asterisk and not as a modifier.
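
In Python, a backslash before the asterisk makes it literal, and `re.escape` does the same for a whole string. The sketch below, on an invented sample string, contrasts the escaped and unescaped readings of the same pattern.

```python
import re

# Sample text invented for illustration.
text = "a*b aab"

as_modifier = re.findall(r"a*b", text)  # * means "zero or more a"
as_literal = re.findall(r"a\*b", text)  # \* means an actual asterisk

# re.escape builds the escaped pattern from a plain string.
escaped = re.escape("a*b")

print(as_modifier)  # ['b', 'aab']
print(as_literal)   # ['a*b']
print(escaped)      # a\*b
```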

RE Matching in Python NLTK

• Set up:
  – import re
  – from nltk.util import re_show
  – sent = "colourless green ideas sleep furiously"
• re_show(pattern, str)
  – shows where the pattern matches
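
If NLTK is not installed, a similar display can be approximated with the standard library alone. The helper `show_matches` below is a hypothetical stand-in written for this example, not NLTK's implementation; it returns the string with every match wrapped in braces.

```python
import re

def show_matches(pattern, string, left="{", right="}"):
    """Hypothetical stand-in for re_show: wrap every match in delimiters."""
    return re.sub(pattern, lambda m: left + m.group(0) + right, string)

sent = "colourless green ideas sleep furiously"
marked = show_matches(r"ou", sent)

print(marked)  # col{ou}rless green ideas sleep furi{ou}sly
```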

Substitutions

• Replace every l with an s
• re.sub('l', 's', sent)
  – 'cosoursess green ideas sseep furioussy'
• re.sub('green', 'red', sent)
  – 'colourless red ideas sleep furiously'

Findall

• re.findall(pattern, sent)
  – will return all of the substrings that match the pattern
  – re.findall('(green|sleep)', sent)
    • ['green', 'sleep']
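
One detail worth knowing is that the return value of `re.findall` depends on the number of capturing groups in the pattern. The sketch below uses the same example sentence; the extra patterns are made up for illustration.

```python
import re

sent = "colourless green ideas sleep furiously"

# No group: the whole match is returned.
no_group = re.findall(r"green|sleep", sent)

# One capturing group: only the group's text is returned.
one_group = re.findall(r"(gr|sl)ee", sent)

# A non-capturing group (?:...) returns the whole match again.
non_capturing = re.findall(r"(?:gr|sl)ee", sent)

# Two capturing groups: a list of tuples, one tuple per match.
two_groups = re.findall(r"(\w+)(ly|less)\b", sent)

print(no_group)       # ['green', 'sleep']
print(one_group)      # ['gr', 'sl']
print(non_capturing)  # ['gree', 'slee']
print(two_groups)     # [('colour', 'less'), ('furious', 'ly')]
```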

Match

• Matches from the beginning of the string
• re.match(pattern, string)
  – Returns: a Match object or None (if not found)
• Match objects contain information about the search
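
A short sketch of what a Match object holds, using standard methods of Python's `re` module (`group`, `groups`, `start`, `end`, `span`). The pattern is made up for this example.

```python
import re

sent = "colourless green ideas sleep furiously"

# Two groups: the first and second word of the string.
m = re.match(r"(\w+)\s+(\w+)", sent)

print(m.group())   # 'colourless green' (the whole match)
print(m.group(1))  # 'colourless'
print(m.group(2))  # 'green'
print(m.groups())  # ('colourless', 'green')
print(m.start(), m.end())  # 0 16
print(m.span())    # (0, 16)

# match only looks at the beginning, so a later word is not found.
not_at_start = re.match(r"green", sent)
print(not_at_start)  # None
```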

Methods in Match

More Match Methods

Search

• re.search(pattern, string)
  – Finds the pattern anywhere in the string.
  – re.search(r'\d+', ' 1034 ').group()
    • '1034'
  – re.search(r'\d+', ' abc123 ').group()
    • '123'
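
The difference from `re.match` shows up on the same input. Because both functions return `None` when nothing matches, calling `.group()` directly can raise an `AttributeError`; the sketch below checks the result first. The variable names are chosen for this example.

```python
import re

text = " abc123 "

at_start = re.match(r"\d+", text)   # the string does not begin with a digit
anywhere = re.search(r"\d+", text)  # scans forward until a match is found

print(at_start)          # None
print(anywhere.group())  # '123'
print(anywhere.start())  # 4

# Guard against None before asking for the matched text.
missing = re.search(r"\d+", "no digits here")
value = missing.group() if missing else None
print(value)             # None
```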

Splitting

• 'text can be made into lists'.split()
• re.split(pattern, string)
  – uses the pattern to identify the split point
  – re.split(r'\d+', "I want 4 cats and 13 dogs")
    • ['I want ', ' cats and ', ' dogs']
  – re.split(r'\s*\d+\s*', "I want 4 cats and 13 dogs")
    • ['I want', 'cats and', 'dogs']

Joining

• ' '.join(['lists', 'can', 'be', 'made', 'into', 'strings'])

• This simple formatting can be helpful to report results or merge information
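
Joining is done with the string method `str.join`, called on the separator. The sketch below combines it with `re.split` to normalise irregular whitespace and to build a one-line report; the input string and the report format are invented for illustration.

```python
import re

# Input with irregular spacing, invented for illustration.
messy = "lists   can be\tmade into  strings"

tokens = re.split(r"\s+", messy)  # split on any run of whitespace
joined = " ".join(tokens)         # rebuild with single spaces

# A simple report line: each of the first two tokens with its length.
report = ", ".join(f"{t} ({len(t)})" for t in tokens[:2])

print(tokens)  # ['lists', 'can', 'be', 'made', 'into', 'strings']
print(joined)  # lists can be made into strings
print(report)  # lists (5), can (3)
```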

Stemming with Regular Expressions

import re

def stem(word):
    # Lazily capture the stem, then an optional suffix anchored at the end of the word.
    regexp = r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)?$'
    stem, suffix = re.findall(regexp, word)[0]
    return stem
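
To try the stemmer on the running example sentence, a self-contained sketch follows. It restates the function with the local variable renamed to `stem_part` (a name chosen here to avoid shadowing the function name), then stems each word.

```python
import re

def stem(word):
    # (.*?) is lazy, so the stem is the shortest prefix that leaves
    # either a listed suffix or nothing before the end of the word.
    regexp = r"^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)?$"
    stem_part, suffix = re.findall(regexp, word)[0]
    return stem_part

sent = "colourless green ideas sleep furiously"
stems = [stem(w) for w in sent.split()]

print(stems)               # ['colourles', 'green', 'idea', 'sleep', 'furious']
print(stem("processing"))  # 'process'
```

The result `colourles` shows the limits of the approach: the final `s` is stripped because `less` is not in the suffix list, and the regular expression has no notion of which words are real.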

Play with some code

Snippet on Speech Recognition