13
CS2304: Python for Java Programmers Monti and McQuain 2014 CS2304: Regular Expressions

CS2304: Regular Expressions - Virginia Techcs2304/spring2014/Notes/T09... · 2014. 5. 1. · CS2304: Python for Java Programmers Monti and McQuain 2014 Regular Expressions • A regular

  • Upload
    others

  • View
    3

  • Download
    0

Embed Size (px)

Citation preview

  • CS2304: Python for Java Programmers

    Monti and McQuain 2014

    CS2304: Regular Expressions

  • CS2304: Python for Java Programmers

    Monti and McQuain 2014

    Regular Expressions

    •  Many of these slides come from or were inspired by William McQuain’s slides on grep REs.

    •  These have been adapted for Python 3’s re module. The original version can be seen here:

    •  http://courses.cs.vt.edu/~cs2505/fall2013/Notes/T05_RegularExpressions.pdf

  • CS2304: Python for Java Programmers

    Monti and McQuain 2014

    Regular Expressions

    •  A regular expression is a sequence of characters that specifies a set of strings, which are said to match the regular expression.

    •  For example, in one flavor of regular expression syntax: •  gli..ering à set of strings that begin with "gli”,

    followed by any two characters, followed by "ering"

    •  Most characters simply stand for themselves. •  As an exmaple: “g” in glimmering just matches a “g”

    •  Meta-characters have special meaning.

  • CS2304: Python for Java Programmers

    Monti and McQuain 2014

    RE Syntax Meta-characters period (.) matches any single character

    a.c is matched by aac, abc, a)c, etc. b..t is matched by beet, best, boot, bart, etc.

    asterisk (*) matches zero or more occurrences of the preceding RE

    ab*c is matched by ac, abc, abbc, abbbc, etc. .* is matched by all strings

    plus (+) matches one or more occurrences of the preceding RE

    ab+c is matched by abc, abbc, abbbc, etc., but not by ac

  • CS2304: Python for Java Programmers

    Monti and McQuain 2014

    Example: How To Use REs

    •  So how would we use RE gli..er with Python? •  Let’s say we’d like to find all of the lines which

    match the RE: import re re_string = r“gli..er” f = open(“input.txt”, “r”) lines = f.readlines() for line in lines:

    if re.search(re_string, line): print(line)

  • CS2304: Python for Java Programmers

    Monti and McQuain 2014

    Example: How To Use REs

    •  If the input file contains the text of Moby Dick, our code would produce this output:

    a fine frosty night; how Orion glitters; what northern lights! Let them glittering teeth resembling ivory saws; others were tufted with knots of footfall in the passage, and saw a glimmer of light come into the room glittering in the clear, cold air. Huge hills and mountains of casks on glittering expression--all this sufficiently proclaimed him an inheritor looked celestial; seemed some plumed and glittering god uprising from suddenly relieved hull rolled away from it, to far down her glittering the wife sat frozen at the window, with tearless eyes, glitteringly glittering fiddle-bows of whale ivory, were presiding over the hilarious to glimmer into sight. Glancing upwards, he cried: "See! see!" and once At the first faintest glimmering of the dawn, his iron voice was heard leeward; and Ahab heading the onset. A pale, death-glimmer lit up glittering mouth yawned beneath the boat like an open-doored marble methodic intervals, the whale's glittering spout was regularly announced the moment, intolerably glittered and glared like a glacier; and

  • CS2304: Python for Java Programmers

    Monti and McQuain 2014

    Example: How To Use REs

    •  Another example, here the input file contains the output on the last slide, and our RE is “f..”

    a fine frosty night; how Orion glitters; what northern lights! Let them glittering teeth resembling ivory saws; others were tufted with knots of footfall in the passage, and saw a glimmer of light come into the room glittering in the clear, cold air. Huge hills and mountains of casks on glittering expression--all this sufficiently proclaimed him an inheritor looked celestial; seemed some plumed and glittering god uprising from suddenly relieved hull rolled away from it, to far down her glittering the wife sat frozen at the window, with tearless eyes, glitteringly glittering fiddle-bows of whale ivory, were presiding over the hilarious At the first faintest glimmering of the dawn, his iron voice was heard

  • CS2304: Python for Java Programmers

    Monti and McQuain 2014

    Other Common Meta-characters question mark (?) matches zero or one occurrence of the preceding RE

    ab?c is matched by ac and abc, but not by abbc b.?t is matched by bt, bat, bet, bxt, etc.

    logical or (|) matches the RE before | or the RE after |

    abc|def is matched by abc and def and nothing else caret (^) used outside brackets, matches only at the beginning of a line

    ^D.* is matched by any line beginning with D Semantics are normally different inside brackets…

  • CS2304: Python for Java Programmers

    Monti and McQuain 2014

    Other Common Meta-characters dollar sign ($) matches only at the end of a line

    .*d$ is matched by any line ending with a d backslash (\) escapes other metacharacters

    now\. is matched by "now.” parentheses () forms a group of characters to be treated as a unit

    a(bc)+ is matched by abc, abcbc, abcbcbc, etc.

  • CS2304: Python for Java Programmers

    Monti and McQuain 2014

    A Couple More Meta-characters square brackets [] specify a set of characters as a set; any character in the set will match

    [aeiou] is matched by any vowel [a-z] is matched by any lower-case letter ^ specifies the complement (negation) of the set [^aeiou] is matched by any character but 'a', 'e', 'i', 'o' and 'u’

    braces {} specifies the number of repetitions of an RE

    [a-z]{3} is matched by any three lower-case letters

  • CS2304: Python for Java Programmers

    Monti and McQuain 2014

    A More Complex Example

    •  Another example, here the input file contains the output from earlier Moby Dick example (slide 6)

    •  Our RE is: •  r“(^[gG].*g)|(hu(l){2})”

    glittering teeth resembling ivory saws; others were tufted with knots of glittering in the clear, cold air. Huge hills and mountains of casks on glittering expression--all this sufficiently proclaimed him an inheritor suddenly relieved hull rolled away from it, to far down her glittering glittering fiddle-bows of whale ivory, were presiding over the hilarious glittering mouth yawned beneath the boat like an open-doored marble

  • CS2304: Python for Java Programmers

    Monti and McQuain 2014

    Character Classes

    •  Python also has meta-characters that match a whole “type” of character.

    •  Often start with a \ \w

    Matches characters considered alphanumeric in the ASCII character set; [a-zA-Z0-9_]

    \s

    Matches whitespace characters. [ \t\n\r\f\v] \d

    Matches numbers, same as [0-9]

  • CS2304: Python for Java Programmers

    Monti and McQuain 2014

    Other Python RE Functions

    •  We’ve only seen search but the re module provides many other useful functions:

    re.match(pattern, string[, flags])

    If zero or more characters at the beginning of string match the regular expression pattern return a corresponding MatchObject instance.

    re.split(pattern, string[, maxsplit=0, flags=0]) Split string by the occurrences of pattern.

    re.findall(pattern, string[, flags])

    Return all non-overlapping matches of pattern in string, as a list of strings