Secrets of Regexp
Hiro Asari
Red Hat, Inc.
Let's Talk About Regular Expressions
• There is no regular expression
• A good approximation as a name
Let's Talk About Regexp
Some people, when confronted with a problem, think, "I know, I'll use regular expressions."
Now they have two problems.
Jamie Zawinski, 12 Aug 1997
http://regex.info/blog/2006-09-15/247
http://www.codinghorror.com/blog/2008/06/regular-expressions-now-you-have-two-problems.html
The point is not so much the evils of regular expressions as the evils of overusing them.
Formal Language Theory
• The Language L
• Over Alphabet Σ
• Alphabet Σ={a, b, c, d, e, …, z, λ} (example)
• Words over Σ: "a", "b", "ab", "aequafdhfad"
• Σ*: The set of all words over Σ
Formal Language over Σ
• A subset L of Σ* (with various properties)
• L can be finite, in which case its well-formed words can simply be enumerated, but it is often infinite
Example
• Language L over Σ = {a,b}
• 'a' is a word
• a word may be obtained by appending 'ab' to an existing word
• only words thus formed are legal
a, aab, aabab
Well-formed words
baaaababb
Ill-formed words
Succinctly…
• a(ab)*
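The expression above can be tried directly in Ruby. This is a minimal sketch: the names `language` and `well_formed` are illustrative and not from the slides, and the `\A` and `\z` anchors are added so that the whole word is tested rather than a substring.

```ruby
# The language from the example: 'a', followed by any number of 'ab'.
language = /\Aa(?:ab)*\z/

# Illustrative helper: is the word well-formed in this language?
well_formed = ->(word) { language.match?(word) }

%w[a aab aabab b aa abab].each do |word|
  puts "#{word}: #{well_formed.call(word)}"
end
```

The first three words are accepted and the last three are rejected.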
Expression
• A textual representation of the formal language, against which an input is tested to see whether it is a well-formed word in that language
Regular Languages
• ∅ (empty language) is regular
• For each a ∈ Σ (a belongs to Σ), the singleton language {a} is a regular language.
• If A and B are regular languages, then A ∪ B (union), A•B (concatenation), and A* (Kleene star) are regular languages
• No other languages over Σ are regular.
Regular Expressions
• Expressions of regular languages
Not Regular? Expressions
• It turns out that some expressions are more powerful and express non-regular languages
• Language of 'squares': (.*)\1
• a, aa, aaaa, WikiWiki
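A short Ruby sketch of the 'squares' expression. The `\A` and `\z` anchors are an addition here: without them, `(.*)\1` matches any string at all, because the group can capture the empty string. The variable name `square` is illustrative.

```ruby
# A word is a 'square' if it is some string repeated twice.
square = /\A(.*)\1\z/

%w[aa aaaa WikiWiki].each do |word|
  half = word[square, 1]           # the captured half
  puts "#{word} = #{half} + #{half}"
end

puts "Wiki".match?(square)         # false: 'Wiki' is not a square
```

The backreference `\1` is what takes the expression beyond regular languages: the second half must repeat whatever the first half captured.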
How does Regexp work?
• Build a finite state automaton representing a given regular expression
• Feed the string to the automaton and see if the match succeeds
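[Figures: finite state automata for the expressions a, ab*, .*, a$, a?, a|b, (ab|c), and (ab+|c). Edges are labeled with the character consumed; a? has an ε (empty) edge.]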
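To make the automaton idea concrete, here is a hand-built sketch for `ab*` in Ruby. The state names and the `accepts` lambda are made up for illustration, and it tests the whole input; Ruby's own engine is more elaborate than a lookup table like this.

```ruby
# Transition table for ab*: state => { input character => next state }.
transitions = {
  start:  { "a" => :seen_a },   # the mandatory leading 'a'
  seen_a: { "b" => :seen_a }    # any number of 'b's loop back
}
accepting = [:seen_a]

# Walk the table one character at a time; fail on a missing edge.
accepts = lambda do |input|
  state = :start
  input.each_char do |char|
    state = transitions[state][char]
    return false if state.nil?
  end
  accepting.include?(state)
end

%w[a abbb ba aba].each { |s| puts "#{s}: #{accepts.call(s)}" }
```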
Match is attempted at every character, left to right
zyxwvutsrqponmlkjihgfedcba
^
zyxwvutsrqponmlkjihgfedcba
 ^
zyxwvutsrqponmlkjihgfedcba
  ^
zyxwvutsrqponmlkjihgfedcba
   ^
⋮
zyxwvutsrqponmlkjihgfedcba
                         ^
/a$/
Regexp does not think, "'a$' can match only at the end of the line, so we should fast forward to the end of the line."
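The picture above can be emulated in Ruby with `StringScanner`, whose `match?` only tries the pattern at the current scan position. This is an illustration of the left-to-right attempts, not a trace of what the engine does internally; the variable names are illustrative.

```ruby
require 'strscan'

line = "zyxwvutsrqponmlkjihgfedcba"
scanner = StringScanner.new(line)

# Try /a$/ at each starting position in turn, as the caret does.
attempts = 0
found_at = nil
(0...line.length).each do |pos|
  attempts += 1
  scanner.pos = pos
  if scanner.match?(/a$/)   # anchored at pos; nil when it fails
    found_at = pos
    break
  end
end

puts "matched at index #{found_at} after #{attempts} attempts"
```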
[Animation: the match position (^) advancing through 'abc d a dfadg ' one character at a time]
# matches 'abc d a dfadg '
^\s*(.*)\s*$
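A small Ruby sketch of what the group captures. The greedy `(.*)` runs to the end of the line, so the trailing `\s*` is left with nothing to match. The lazy variant `(.*?)` is an addition here, shown only for contrast.

```ruby
input = " abc d a dfadg "

greedy = input[/^\s*(.*)\s*$/, 1]    # pattern from the slide
lazy   = input[/^\s*(.*?)\s*$/, 1]   # lazy variant, added for contrast

p greedy   # "abc d a dfadg "  (trailing space kept)
p lazy     # "abc d a dfadg"   (trailing space dropped)
```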
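# Builds a?a?…a?aa…a: n optional a's followed by n mandatory a's.
def pathological(n=5)
  Regexp.new('a?' * n + 'a' * n)
end

# Prints a timestamp after each match so the slowdown is visible.
1.upto(40) do |n|
  print n, ": "
  print Time.now, "\n" if 'a'*n =~ pathological(n)
end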
a?a?a?…a?aaa…a
aaa^
a?a?a?aaa
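Capturing each optional `a?` shows where the match ends up for n = 3. The capture groups are an addition for illustration. Every optional `a` first greedily takes a character, and only after all those combinations fail does the engine settle on each of them matching nothing.

```ruby
n = 3
# Same shape as the pathological pattern, with each a? captured.
pattern = Regexp.new('(a?)' * n + 'a' * n)

match = pattern.match('a' * n)
p match.captures   # ["", "", ""]: every optional 'a' matched the empty string
```

With n optional parts there are up to 2^n combinations to reject before this one is reached, which is why the loop above slows down so sharply as n grows.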
Regexp tips
UP_TO_256 = /\b(?:25[0-5]   # 250-255
  |2[0-4][0-9]              # 200-249
  |1[0-9][0-9]              # 100-199
  |[1-9][0-9]               # 2-digit numbers
  |[0-9])                   # single-digit numbers
  \b/x
IPV4_ADDRESS = /#{UP_TO_256}(?:\.#{UP_TO_256}){3}/
Use /x
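A usage sketch for the two expressions above. Lowercase local variables stand in for the slide's constants, and `whole_ipv4`, with its `\A` and `\z` anchors, is an addition for validating an entire string.

```ruby
up_to_256 = /\b(?:25[0-5]   # 250-255
  |2[0-4][0-9]              # 200-249
  |1[0-9][0-9]              # 100-199
  |[1-9][0-9]               # 2-digit numbers
  |[0-9])                   # single-digit numbers
  \b/x
ipv4_address = /#{up_to_256}(?:\.#{up_to_256}){3}/

# Added for illustration: require the whole string to be an address.
whole_ipv4 = /\A#{ipv4_address}\z/

%w[192.168.0.1 255.255.255.255 256.0.0.1 1.2.3].each do |candidate|
  puts "#{candidate}: #{whole_ipv4.match?(candidate)}"
end
```

Interpolating a Regexp keeps its own flags, so the commented `/x` piece still works inside an outer expression that does not use `/x`.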
\A, \z for strings; ^, $ for lines
• \A: the beginning of the string
• \z: the end of the string
• ^: after \n
• $: before \n
(^ and $ always behave this way in Ruby)
What's the problem?
also note the difference in what /m means
#! /usr/bin/env perl
$a = "abc\ndef";
if ($a =~ /^d/) {
  print "yes\n";
}
if ($a =~ /^d/m) {
  print "yes now\n";
}
# prints 'yes now'
#! /usr/bin/env ruby
a = "abc\ndef"
if a =~ /^d/
  p "yes"
end
# prints "yes"
What's the problem?
http://guides.rubyonrails.org/security.html#regular-expressions
class File < ActiveRecord::Base
  validates :name, :format => /^[\w\.\-\+]+$/
end
Security Implications
file.txt%0A<script>alert('hello')</script>
file.txt\n<script>alert('hello')</script>
/^[\w\.\-\+]+$/
Match succeeds: ActiveRecord validation succeeds
file.txt\n<script>alert('hello')</script>
/\A[\w\.\-\+]+\z/
Match fails: ActiveRecord validation fails
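The difference can be checked in plain Ruby without Rails. A minimal sketch, with illustrative variable names:

```ruby
# The decoded file name: a harmless first line, then a script tag.
name = "file.txt\n<script>alert('hello')</script>"

line_anchored   = /^[\w\.\-\+]+$/     # ^ and $ match around the first line only
string_anchored = /\A[\w\.\-\+]+\z/   # \A and \z cover the whole string

p name.match?(line_anchored)     # true: the first line alone satisfies it
p name.match?(string_anchored)   # false: the newline and tag are rejected
```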
require 'benchmark'

# simple benchmark for alternations and character class
n = 5_000
str = 'cafebabedeadbeef' * n

Benchmark.bmbm do |x|
  x.report('alternation') do
    str =~ /^(a|b|c|d|e|f)+$/
  end
  x.report('character class') do
    str =~ /^[a-f]+$/
  end
end
Prefer Character Class to Alternations
Ruby 1.8.7
                      user     system      total        real
alternation       0.030000   0.010000   0.040000 (  0.036702)
character class   0.000000   0.000000   0.000000 (  0.004704)

Ruby 2.0.0
                      user     system      total        real
alternation       0.020000   0.010000   0.030000 (  0.023139)
character class   0.000000   0.000000   0.000000 (  0.009641)

JRuby 1.7.4.dev
                      user     system      total        real
alternation       0.030000   0.000000   0.030000 (  0.021000)
character class   0.010000   0.000000   0.010000 (  0.007000)
Benchmarks
# case-insensitively match any non-word character…
# one is unlike the others
'r' =~ /(?i:[\W])/
's' =~ /(?i:[\W])/
't' =~ /(?i:[\W])/
Beware of Character Classes
's' matches, even though 's' is a word character
https://bugs.ruby-lang.org/issues/4044
/^1?$|^(11+?)\1+$/
• ^1?$: matches '1' or ''
• (11+?): non-greedily matches 2 or more 1's
• \1+: repeated 1 or more additional times
• ^(11+?)\1+$: matches a composite number of 1's
Matches a string of 1's if and only if the number of 1's is not prime
class Integer
  def prime?
    "1" * self !~ /^1?$|^(11+?)\1+$/
  end
end
Integer#prime?
No performance guarantee
Attributed to the Perl hacker Abigail
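A quick way to try the expression without reopening Integer. The lambda `unary_prime` and the range are illustrative; the regexp is the one from the slides.

```ruby
# Matches a unary string whose length is 0, 1, or composite.
not_prime = /^1?$|^(11+?)\1+$/

# Illustrative helper: write n in unary and test it.
unary_prime = ->(n) { ("1" * n) !~ not_prime }

primes = (0..20).select(&unary_prime)
p primes   # [2, 3, 5, 7, 11, 13, 17, 19]
```

The search for a repeating factor relies on backtracking, so this is a curiosity rather than a practical primality test.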
• @hiro_asari
• Github: BanzaiMan