Page 1: Lecture  4: Matching Things. Regular Expressions

Methods in Computational Linguistics II, with reference to Matt Huenerfauth’s Language Technology material

Lecture 4: Matching Things. Regular Expressions

Page 2: Lecture  4: Matching Things. Regular Expressions

Today

• Regular Expressions
• Snippet on Speech Recognition
  – At least half of it.

Page 3: Lecture  4: Matching Things. Regular Expressions

Regular Expressions

• Can be viewed as a way to
  – Specify search patterns over a text string
  – Design a particular kind of machine, a Finite State Automaton (FSA)
    • We probably won’t cover this today.
  – Define a formal “language”, i.e. a set of strings

Page 4: Lecture  4: Matching Things. Regular Expressions

Uses of Regular Expressions

• Simple, powerful tools for large corpus analysis and ‘shallow’ processing
  – What word is most likely to begin a sentence?
  – What word is most likely to begin a question?
  – Are you more or less polite than the people you correspond with?
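
As a rough illustration of the first question, the sketch below counts sentence-initial words with a single regular expression. The toy corpus and the "starts the text or follows . ! ? plus whitespace" heuristic are assumptions made for this example, not part of the lecture; a real corpus would need a proper sentence splitter.

```python
import re
from collections import Counter

# Hypothetical toy corpus; any plain-text string would do.
corpus = ("The cat sat. The dog barked! Why did it bark? "
          "The cat ran. Why not?")

# Crude heuristic: a word is sentence-initial if it starts the text
# or follows sentence-final punctuation and whitespace.
initial_words = re.findall(r'(?:^|[.!?]\s+)(\w+)', corpus)

counts = Counter(initial_words)
print(initial_words)          # ['The', 'The', 'Why', 'The', 'Why']
print(counts.most_common(1))  # [('The', 3)]
```

The same counting idea extends to the other questions, for example by keeping only the sentences that end in a question mark.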

Page 5: Lecture  4: Matching Things. Regular Expressions

Definitions

• Regular Expression: formula in algebraic notation for specifying a set of strings
• String: any sequence of characters
• Regular Expression Search
  – Pattern: specifies the set of strings we want to search for
  – Corpus: the texts we want to search through

Page 6: Lecture  4: Matching Things. Regular Expressions

Simple Example

Page 7: Lecture  4: Matching Things. Regular Expressions

More Examples

Page 8: Lecture  4: Matching Things. Regular Expressions

And still more examples

Page 9: Lecture  4: Matching Things. Regular Expressions

Optionality and Repetition

• /[Ww]oodchucks?/
• /colou?r/
• /he{3}/
• /(he){3}/
• /(he){3,}/
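
A minimal sketch of how these patterns behave in Python's re module. The slashes on the slide are only delimiters and are dropped in Python. The helper name full is made up for this example, and the last pattern is read as {3,} ("three or more"), which is an assumption about what the slide intends.

```python
import re

def full(pattern, text):
    """Hypothetical helper: True if the whole of text matches pattern."""
    return re.fullmatch(pattern, text) is not None

print(full(r'[Ww]oodchucks?', 'woodchuck'))  # True: the final s is optional
print(full(r'colou?r', 'color'))             # True: the u is optional
print(full(r'he{3}', 'heee'))                # True: {3} applies to 'e' only
print(full(r'he{3}', 'hehehe'))              # False
print(full(r'(he){3}', 'hehehe'))            # True: the group repeats 3 times
print(full(r'(he){3,}', 'hehehehe'))         # True: three or more repetitions
print(full(r'(he){3,}', 'hehe'))             # False: only two repetitions
```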

Page 10: Lecture  4: Matching Things. Regular Expressions

Character Groups

• Some groups of characters are used very frequently, so the RE language includes shorthands for them
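
The slide's table of shorthands did not survive in this transcript. As an illustration, here are three shorthands that Python's re module provides (\d, \w, \s); the sample text is made up.

```python
import re

text = 'Room 42, floor 3'

digits = re.findall(r'\d+', text)   # \d: a digit
words = re.findall(r'\w+', text)    # \w: a letter, digit, or underscore
spaces = re.findall(r'\s', text)    # \s: a whitespace character

print(digits)  # ['42', '3']
print(words)   # ['Room', '42', 'floor', '3']
print(spaces)  # [' ', ' ', ' ']

# For plain ASCII text, \d finds the same matches as the explicit range [0-9].
print(re.findall(r'[0-9]+', text) == digits)  # True
```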

Page 11: Lecture  4: Matching Things. Regular Expressions

Special Characters

• These enable the matching of multiple occurrences of a pattern
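
The slide's list of special characters is missing from the transcript. Assuming it covered the usual repetition operators, the sketch below compares * (zero or more), + (one or more), and ? (zero or one) on a made-up string.

```python
import re

text = 'b ba baa baaa'

star = re.findall(r'ba*', text)      # b followed by zero or more a's
plus = re.findall(r'ba+', text)      # b followed by one or more a's
optional = re.findall(r'ba?', text)  # b followed by at most one a

print(star)      # ['b', 'ba', 'baa', 'baaa']
print(plus)      # ['ba', 'baa', 'baaa']
print(optional)  # ['b', 'ba', 'ba', 'ba']
```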

Page 12: Lecture  4: Matching Things. Regular Expressions

Escape Characters

• Sometimes you want to use an asterisk “*” as an asterisk and not as a modifier.
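
A short sketch of the difference, using a made-up string. A backslash before the asterisk makes it literal, and re.escape builds the escaped pattern for you.

```python
import re

text = 'a*b and aab'

as_modifier = re.findall(r'a*b', text)   # * means "zero or more a's"
as_literal = re.findall(r'a\*b', text)   # \* means a literal asterisk

print(as_modifier)  # ['b', 'aab']
print(as_literal)   # ['a*b']

# re.escape adds the backslashes needed to match a string literally.
escaped = re.escape('a*b')
print(escaped)                      # a\*b
print(re.findall(escaped, text))    # ['a*b']
```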

Page 13: Lecture  4: Matching Things. Regular Expressions

RE Matching in Python NLTK

• Set up:
  – import re
  – from nltk.util import re_show
  – sent = 'colourless green ideas sleep furiously'
• re_show(pattern, str)
  – shows where the pattern matches
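
If NLTK is not installed, the idea can be approximated with the standard library alone. The helper show_matches below is hypothetical (it is not NLTK's function); it wraps each match in braces, which is roughly the kind of display re_show produces.

```python
import re

sent = 'colourless green ideas sleep furiously'

def show_matches(pattern, string, left='{', right='}'):
    """Hypothetical stand-in for re_show: wrap every match in braces."""
    return re.sub(pattern, lambda m: left + m.group(0) + right, string)

print(show_matches(r'l+', sent))
# co{l}our{l}ess green ideas s{l}eep furious{l}y
print(show_matches(r'green', sent))
# colourless {green} ideas sleep furiously
```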

Page 14: Lecture  4: Matching Things. Regular Expressions

Substitutions

• Replace every l with an s
• re.sub('l', 's', sent)
  – 'cosoursess green ideas sseep furioussy'
• re.sub('green', 'red', sent)
  – 'colourless red ideas sleep furiously'
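
Beyond plain replacement, re.sub also accepts a count argument and backreferences to captured groups. The sketch below shows both on the same sentence; these particular calls are illustrations and do not come from the slides.

```python
import re

sent = 'colourless green ideas sleep furiously'

# count=1 replaces only the first match.
first_only = re.sub(r'l', 's', sent, count=1)
print(first_only)  # cosourless green ideas sleep furiously

# \1 and \2 refer to the first and second captured groups.
swapped = re.sub(r'(\w+) (\w+)', r'\2 \1', sent, count=1)
print(swapped)     # green colourless ideas sleep furiously

# An empty replacement deletes every match (here: all vowels).
no_vowels = re.sub(r'[aeiou]', '', sent)
print(no_vowels)   # clrlss grn ds slp frsly
```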

Page 15: Lecture  4: Matching Things. Regular Expressions

Findall

• re.findall(pattern, sent)
  – will return all of the substrings that match the pattern
  – re.findall('(green|sleep)', sent)
    • ['green', 'sleep']
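
One detail worth knowing: when the pattern contains capturing groups, re.findall returns the groups rather than the whole match, and with two or more groups it returns tuples. A small sketch, with patterns chosen for illustration:

```python
import re

sent = 'colourless green ideas sleep furiously'

# One group: the list holds the text of that group.
one_group = re.findall(r'(green|sleep)', sent)
print(one_group)    # ['green', 'sleep']

# Two groups: the list holds (group 1, group 2) tuples.
two_groups = re.findall(r'(\w+)(ly|less)\b', sent)
print(two_groups)   # [('colour', 'less'), ('furious', 'ly')]

# A non-capturing group (?:...) gives back the whole match instead.
whole = re.findall(r'\w+(?:ly|less)\b', sent)
print(whole)        # ['colourless', 'furiously']
```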

Page 16: Lecture  4: Matching Things. Regular Expressions

Match

• Matches from the beginning of the string
• match(pattern, string)
  – Returns: a Match object or None (if not found)
• Match objects contain information about the search
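
A brief sketch of what a Match object offers, using the standard group, start, end, and span methods. The example sentence is reused from the earlier slides; the patterns are illustrative.

```python
import re

sent = 'colourless green ideas sleep furiously'

m = re.match(r'colou?r', sent)
print(m.group())           # colour  (the matched text)
print(m.start(), m.end())  # 0 6     (where the match begins and ends)
print(m.span())            # (0, 6)  (both positions as a tuple)

# 'green' occurs in the sentence, but not at the beginning,
# so re.match returns None.
no_match = re.match(r'green', sent)
print(no_match)            # None
```

Because the result can be None, it is safest to test it before calling any method on it.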

Page 17: Lecture  4: Matching Things. Regular Expressions

Methods in Match

Page 18: Lecture  4: Matching Things. Regular Expressions

More Match Methods

Page 19: Lecture  4: Matching Things. Regular Expressions

Search

• re.search(pattern, string)
  – Finds the pattern anywhere in the string.
  – re.search(r'\d+', ' 1034 ').group()
    • '1034'
  – re.search(r'\d+', ' abc123 ').group()
    • '123'
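
To make the contrast with match concrete, the sketch below runs both functions on the same string and guards against the None that search returns when nothing is found. The variable names are chosen for this example.

```python
import re

text = ' abc123 '

# match only tries at position 0, where there is a space, so it fails.
at_start = re.match(r'\d+', text)
print(at_start)             # None

# search scans forward until it finds the digits.
anywhere = re.search(r'\d+', text)
print(anywhere.group())     # 123
print(anywhere.span())      # (4, 7)

# Calling .group() on None raises AttributeError, so check first.
missing = re.search(r'\d+', 'no digits here')
number = missing.group() if missing else None
print(number)               # None
```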

Page 20: Lecture  4: Matching Things. Regular Expressions

Splitting

• 'text can be made into lists'.split()
• re.split(pattern, string)
  – uses the pattern to identify the split point
  – re.split(r'\d+', 'I want 4 cats and 13 dogs')
    • ['I want ', ' cats and ', ' dogs']
  – re.split(r'\s*\d+\s*', 'I want 4 cats and 13 dogs')
    • ['I want', 'cats and', 'dogs']
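
Two further options of re.split that are often useful, shown as a sketch on the same sentence: a capturing group keeps the separators in the result, and maxsplit limits the number of splits.

```python
import re

text = 'I want 4 cats and 13 dogs'

# Wrapping the number in a group keeps it in the output list.
with_numbers = re.split(r'\s*(\d+)\s*', text)
print(with_numbers)   # ['I want', '4', 'cats and', '13', 'dogs']

# maxsplit=2 stops after two splits; the rest stays in one piece.
limited = re.split(r'\s+', text, maxsplit=2)
print(limited)        # ['I', 'want', '4 cats and 13 dogs']
```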

Page 21: Lecture  4: Matching Things. Regular Expressions

Joining

• ' '.join(['lists', 'can', 'be', 'made', 'into', 'strings'])

• This simple formatting can be helpful to report results or merge information
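
A small sketch of joining as a reporting step: tokens found with re.findall are joined back into a string, and non-string values are converted first, since join only accepts strings. The example data is made up.

```python
import re

words = re.findall(r'\w+', 'lists can be made into strings')

sentence = ' '.join(words)
print(sentence)   # lists can be made into strings

# A comma-separated report of the distinct words, in sorted order.
report = ', '.join(sorted(set(words)))
print(report)     # be, can, into, lists, made, strings

# Numbers must be converted to strings before joining.
numbers = ', '.join(str(n) for n in [4, 13])
print(numbers)    # 4, 13
```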

Page 22: Lecture  4: Matching Things. Regular Expressions

Stemming with Regular Expressions

import re

def stem(word):
    # (.*?) is a non-greedy stem; the optional second group is a known suffix
    regexp = r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)?$'
    stem, suffix = re.findall(regexp, word)[0]
    return stem
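
A usage sketch for this stemmer. The function is repeated so the block runs on its own, with the inner variable renamed to stem_part to avoid shadowing the function name. The test words are chosen for illustration. Because the stem group is non-greedy, the regex picks the shortest stem that leaves a listed suffix, which is why 'flies' comes out as 'fl'.

```python
import re

def stem(word):
    # Non-greedy stem followed by an optional suffix from a fixed list.
    regexp = r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)?$'
    stem_part, suffix = re.findall(regexp, word)[0]
    return stem_part

results = {w: stem(w) for w in
           ['walking', 'quickly', 'payment', 'cats', 'sleep', 'flies']}

for word, stemmed in results.items():
    print(word, '->', stemmed)
# walking -> walk
# quickly -> quick
# payment -> pay
# cats -> cat
# sleep -> sleep
# flies -> fl
```

The 'flies' case shows the limits of a purely suffix-stripping approach: it is fast and simple, but it over-strips some words.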

Page 23: Lecture  4: Matching Things. Regular Expressions

Play with some code

Page 24: Lecture  4: Matching Things. Regular Expressions

Snippet on Speech Recognition

