Methods in Computational Linguistics II
with reference to Matt Huenerfauth’s Language Technology material
Lecture 4: Matching Things. Regular Expressions
2
Today
• Regular Expressions
• Snippet on Speech Recognition
– At least half of it.
3
Regular Expressions
• Can be viewed as a way to
– Specify search patterns over a text string
– Design a particular kind of machine, a Finite State Automaton (FSA)
• we probably won’t cover this today.
– Define a formal “language”, i.e. a set of strings
4
Uses of Regular Expressions
• Simple but powerful tools for large corpus analysis and ‘shallow’ processing
– What word is most likely to begin a sentence?
– What word is most likely to begin a question?
– Are you more or less polite than the people you correspond with?
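As a rough illustration of the first question, the sketch below counts sentence-initial words in a tiny made-up text. The sample text, the variable names, and the simplistic sentence-boundary pattern are all assumptions for illustration; real corpora need more careful sentence splitting.

```python
import re
from collections import Counter

# Hypothetical sample text standing in for a large corpus.
text = "The cat sat. The dog ran! A bird flew. Is it late? The end."

# A word counts as sentence-initial if it starts the text or follows
# sentence-final punctuation plus whitespace (a deliberately crude rule).
first_words = re.findall(r"(?:^|[.!?]\s+)(\w+)", text)

# Tally the sentence-initial words and pick the most frequent one.
counts = Counter(first_words)
most_common_word, frequency = counts.most_common(1)[0]
print(most_common_word, frequency)
```

The same idea extends to questions by looking only at sentences that end in "?".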
5
Definitions
• Regular Expression: Formula in algebraic notation for specifying a set of strings
• String: Any sequence of characters
• Regular Expression Search
– Pattern: specifies the set of strings we want to search for
– Corpus: the texts we want to search through
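The two terms can be made concrete with a small sketch. The three-sentence corpus below is invented for illustration; the pattern describes the set of strings {"woodchuck", "woodchucks"}.

```python
import re

# Hypothetical toy corpus; any collection of strings would do.
corpus = [
    "The woodchuck chucked wood.",
    "Two woodchucks sat on a log.",
    "No rodents here.",
]

# Pattern: the set of strings we want to search for.
pattern = re.compile(r"woodchucks?")

# Search every text in the corpus and keep those containing a match.
hits = [text for text in corpus if pattern.search(text)]
print(hits)
```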
6
Simple Example
7
More Examples
8
And still more examples
9
Optionality and Repetition
• /[Ww]oodchucks?/
• /colou?r/
• /he{3}/
• /(he){3}/
• /(he){3},/
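These patterns can be tried out with Python's re module, as in the sketch below. The test strings are made up, and the final {3,} line is an added contrast that is not on the slide.

```python
import re

# [Ww] accepts either case; s? makes the plural "s" optional.
woodchucks = re.findall(r"[Ww]oodchucks?", "Woodchuck and woodchucks")

# u? makes the "u" optional, covering both spellings but not "colouur".
colours = [w for w in ["color", "colour", "colouur"]
           if re.fullmatch(r"colou?r", w)]

# {3} applies only to the item just before it: "e" here, "he" when grouped.
three_e = re.fullmatch(r"he{3}", "heee") is not None
three_he = re.fullmatch(r"(he){3}", "hehehe") is not None
no_group = re.search(r"he{3}", "hehehe") is None

# As written on the slide: three copies of "he" followed by a literal comma.
with_comma = re.search(r"(he){3},", "hehehe, he said").group(0)

# Added contrast (assumption, not from the slide): with the comma inside
# the braces, {3,} means "three or more".
three_or_more = re.fullmatch(r"(he){3,}", "hehehehe") is not None
```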
10
Character Groups
• Some groups of characters are used very frequently, so the RE language includes shorthands for them
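The slide's table of shorthands did not survive, so the sketch below assumes the usual ones in Python's re module: \d (digit), \w (word character), \s (whitespace), and \D (non-digit). The sample string is invented.

```python
import re

text = "Room 42, floor 3"

# \d matches a digit; \d+ matches a run of digits.
digits = re.findall(r"\d+", text)

# \w matches letters, digits and underscore.
words = re.findall(r"\w+", text)

# \s matches whitespace; here it serves as the separator for splitting.
chunks = re.split(r"\s+", text)

# An uppercase shorthand is the complement: \D matches any non-digit.
non_digits = re.findall(r"\D+", text)
```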
11
Special Characters
• These enable the matching of multiple occurrences of a pattern
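The slide's own list is not preserved here; assuming it covered the standard quantifiers *, + and ?, a minimal comparison on a made-up string looks like this:

```python
import re

text = "b ba baa"

# * means zero or more of the preceding item.
star = re.findall(r"ba*", text)

# + means one or more of the preceding item.
plus = re.findall(r"ba+", text)

# ? means zero or one of the preceding item.
optional = re.findall(r"ba?", text)
```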
12
Escape Characters
• Sometimes you want to use an asterisk “*” as an asterisk and not as a modifier.
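A short sketch of the difference, using invented strings: a backslash turns the asterisk into a literal character, and re.escape does the escaping for you when the text to match comes from elsewhere.

```python
import re

# Unescaped, * is a modifier: "a*" means zero or more "a" characters.
as_modifier = re.search(r"a*", "aa*").group(0)

# Escaped, \* is a literal asterisk: "a\*" means "a" followed by "*".
as_literal = re.search(r"a\*", "aa*").group(0)

# Every literal asterisk in an arithmetic expression.
stars = re.findall(r"\*", "2 * 3 * 4")

# re.escape builds a pattern that matches the given text literally.
escaped_pattern = re.escape("*")
found = re.findall(escaped_pattern, "a*b*c")
```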
13
RE Matching in Python NLTK
• Set up:
– import re
– from nltk.util import re_show
– sent = 'colourless green ideas sleep furiously'
• re_show(pattern, str)
– shows where the pattern matches
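If NLTK is not installed, a similar display can be approximated with the standard library alone. The helper name show_matches below is hypothetical and is not part of NLTK; it simply wraps each match in braces.

```python
import re

def show_matches(pattern, text, left="{", right="}"):
    """Hypothetical stand-in for re_show: wrap every match in braces."""
    return re.sub(pattern, lambda m: left + m.group(0) + right, text)

sent = "colourless green ideas sleep furiously"

# Mark every run of one or more "e" characters.
marked = show_matches(r"e+", sent)
print(marked)
```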
14
Substitutions
• Replace every l with an s
• re.sub('l', 's', sent)
– 'cosoursess green ideas sseep furioussy'
• re.sub('green', 'red', sent)
– 'colourless red ideas sleep furiously'
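Two further features of re.sub are often useful and are sketched below as additions to the slide: the count argument limits how many replacements are made, and \1, \2 in the replacement refer back to captured groups.

```python
import re

sent = "colourless green ideas sleep furiously"

# The slide's examples: replace every match.
all_l = re.sub(r"l", "s", sent)
recoloured = re.sub(r"green", "red", sent)

# count=1 replaces only the first match.
first_l = re.sub(r"l", "s", sent, count=1)

# Backreferences reuse captured text: swap two adjacent words.
swapped = re.sub(r"(\w+) (\w+)", r"\2 \1", "green ideas")
```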
15
Findall
• re.findall(pattern, sent)
– will return all of the substrings that match the pattern
– re.findall('(green|sleep)', sent)
• ['green', 'sleep']
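One detail worth seeing in action: when the pattern contains groups, re.findall returns the group contents instead of the whole match. The extra patterns below are made up to show this.

```python
import re

sent = "colourless green ideas sleep furiously"

# No groups: each result is the whole matched substring.
plain = re.findall(r"green|sleep", sent)

# One group: each result is just that group's text.
one_group = re.findall(r"(\w+)less", sent)

# Several groups: each result is a tuple with one item per group.
two_groups = re.findall(r"(\w+) (ideas|dreams)", sent)
```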
16
Match
• Matches from the beginning of the string
• re.match(pattern, string)
– Returns: a Match object or None (if not found)
• Match objects contain information about the search
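A minimal sketch of both outcomes, reusing the example sentence from the set-up slide. Because a failed match returns None, it is safest to test the result before using it.

```python
import re

sent = "colourless green ideas sleep furiously"

# "colour" is at the very start of the string, so match succeeds.
at_start = re.match(r"colour", sent)

# "green" occurs later in the string, so match returns None.
not_at_start = re.match(r"green", sent)

# Guard against None before asking the Match object for details.
if at_start is not None:
    matched_text = at_start.group(0)
else:
    matched_text = None
```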
17
Methods in Match
18
More Match Methods
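The slide's list of methods is not preserved here. As a sketch, these are some commonly used methods of Python Match objects, which may or may not be the ones the slide covered.

```python
import re

sent = "colourless green ideas sleep furiously"

# Capture the first two words as groups 1 and 2.
m = re.match(r"(\w+) (\w+)", sent)

whole = m.group(0)       # the entire matched text
first = m.group(1)       # text of the first group
both = m.groups()        # tuple of all group texts
start = m.start()        # index where the match begins
end = m.end()            # index just past the end of the match
second_span = m.span(2)  # (start, end) indices of the second group
```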
19
Search
• re.search(pattern, string)
– Finds the pattern anywhere in the string.
– re.search(r'\d+', ' 1034 ').group()
• '1034'
– re.search(r'\d+', ' abc123 ').group()
• '123'
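The contrast with match can be sketched as follows. Like match, search returns None when nothing is found, so calling .group() directly on the result raises an error in that case; the "no digits here" string is invented to show the guard.

```python
import re

# search scans the whole string, so digits after letters are found.
found = re.search(r"\d+", "abc123").group()

# match only tries the start of the string, where "a" is not a digit.
from_start = re.match(r"\d+", "abc123")

# With no digits at all, search returns None; check before using it.
result = re.search(r"\d+", "no digits here")
number = result.group() if result else None
```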
20
Splitting
• 'text can be made into lists'.split()
• re.split(pattern, string)
– uses the pattern to identify the split point
– re.split(r'\d+', 'I want 4 cats and 13 dogs')
• ['I want ', ' cats and ', ' dogs']
– re.split(r'\s*\d+\s*', 'I want 4 cats and 13 dogs')
• ['I want', 'cats and', 'dogs']
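Two variations, added here as a sketch beyond the slide: a capturing group in the pattern keeps the separators in the result, and maxsplit limits the number of splits.

```python
import re

text = "I want 4 cats and 13 dogs"

# The slide's examples.
on_digits = re.split(r"\d+", text)
trimmed = re.split(r"\s*\d+\s*", text)

# A capturing group keeps the matched separators in the output list.
with_numbers = re.split(r"(\d+)", text)

# maxsplit=1 stops after the first split point.
first_only = re.split(r"\d+", text, maxsplit=1)
```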
21
Joining
• ' '.join(['lists', 'can', 'be', 'made', 'into', 'strings'])
• This simple formatting can be helpful to report results or merge information
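A small sketch of joining for a report. The counts dictionary is made-up data; note that join only accepts strings, so numbers have to be converted first.

```python
words = ["lists", "can", "be", "made", "into", "strings"]

# The string before .join is placed between the items.
sentence = " ".join(words)

# Hypothetical word counts, formatted into a one-line report.
counts = {"green": 2, "sleep": 1}
report = ", ".join(f"{word}: {n}" for word, n in sorted(counts.items()))

# Non-string items must be converted before joining.
numbers = " ".join(str(n) for n in [4, 13])
```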
22
Stemming with Regular Expressions
def stem(word):
    regexp = r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)?$'
    stem, suffix = re.findall(regexp, word)[0]
    return stem
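To see how this stemmer behaves, the sketch below repeats the function so the block runs on its own and applies it to a few hand-picked words. The non-greedy (.*?) takes as little of the word as possible, which leaves the longest workable suffix for the second group; the results show that this crude rule over-stems some words.

```python
import re

def stem(word):
    # Non-greedy stem, then an optional suffix anchored at the end.
    regexp = r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)?$'
    stem, suffix = re.findall(regexp, word)[0]
    return stem

# Hand-picked sample words, chosen to show both good and bad results.
words = ["processing", "processes", "language", "flies", "is"]
stems = [stem(w) for w in words]
print(list(zip(words, stems)))
```

"flies" comes out as "fl" and "is" as "i", which is the kind of error a dictionary-free suffix stripper will make.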
23
Play with some code
24
Snippet on Speech Recognition