Efficient Regular Expressions that produce Parse Trees

Aaron Karper, Niko Schwarz
University of Bern
January 7, 2014


Presenting a regular expression engine that produces parse trees in a single pass by modifying the standard nondeterministic finite-state automaton algorithm. My master's thesis.


Regular expressions so far

Regular expressions

https?://(([a-z]+\.)+([a-z]+))((/[a-z0-9]+)/?)

Here (([a-z]+\.)+([a-z]+)) is the domain and ((/[a-z0-9]+)/?) the path segments.

Example: http://www.reddit.com/r/computerscience/comments/1sg69d/

domain: www, reddit, com
path: r, computerscience, comments, 1sg69d

Regular expressions so far

Regular expressions are greedy by default: (a+)(a?) on "aaa" → "aaa" in group 0 and "" in group 1.
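The greedy behaviour can be checked with any common engine. A minimal sketch using Python's standard re module (not the engine presented here); note that Python counts capture groups from 1, so the slide's group 0 and group 1 are group(1) and group(2) below.

```python
import re

# Greedy matching: the first group takes as many "a" as it can,
# leaving nothing for the optional second group.
m = re.fullmatch(r"(a+)(a?)", "aaa")
first, second = m.group(1), m.group(2)
print(repr(first), repr(second))
```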

Regular expressions so far

- POSIX gives only one match per capture group.
- Regular languages are recognized, but parsing with combinatorial parsers takes O(n^3).
- Backtracking implementations (Java, Python, Perl, ...) are exponentially slow in the worst case.
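The "only one match" limitation is easy to observe. The sketch below uses Python's re module, which is a backtracking engine rather than a POSIX one, but it shares the limitation: a group inside a repetition reports only its last iteration. The domain sub-pattern is taken from the URL example; the input string is an assumption chosen for illustration.

```python
import re

# Domain part of the URL pattern: a repeated group followed by a final label.
DOMAIN = re.compile(r"(([a-z]+\.)+([a-z]+))")

m = DOMAIN.fullmatch("www.reddit.com")
whole, repeated, last = m.group(1), m.group(2), m.group(3)

# The repeated group matched "www." and then "reddit.",
# but only the final iteration is still available afterwards.
print(whole, repeated, last)
```

A parse tree would keep both iterations of the repeated group instead of overwriting the first one.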

Benchmarks

Parsing with https?://(([a-z]+\.)+([a-z]+))((/[a-z0-9]+)/?)

[Figure: Posix. Groups found in http://www.reddit.com/r/computerscience/comments/1sg69d]

[Figure: Our approach. Parse tree of http://www.reddit.com/r/computerscience/comments/1sg69d]

Benchmarks

Matching ((a+b)+c)+ against (a^200 bc)^2000.

Tool              Time
JParsec           4,498
java.util.regex   1,992
Ours              5,332

Extract all class names from our project with a complex regular expression [1].

Tool              Time
java.util.regex   11,319
Ours              8,047

[1] (.*?([a-z]+\.)*([A-Z][a-zA-Z]*))*.*?
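For readers who want to rebuild the first benchmark input, a small sketch follows. The helper name make_input is hypothetical, and Python's re module is used only to confirm that the input has the right shape; it is not one of the tools measured above and no timings are reproduced.

```python
import re

PATTERN = re.compile(r"((a+b)+c)+")

def make_input(a_count: int = 200, repetitions: int = 2000) -> str:
    """Build (a^a_count b c)^repetitions as a plain string."""
    return ("a" * a_count + "bc") * repetitions

# The full benchmark input is 2000 blocks of 202 characters each.
full_length = len(make_input())

# A scaled-down input keeps the sanity check cheap.
small = make_input(3, 4)
small_match = PATTERN.fullmatch(small)
print(full_length, small_match is not None)
```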

Benchmarks: Optimizations of the algorithm

Typically most time is spent in long repetitions. We optimize for that case by:

- Lazily compiling a deterministic FA.
- Not re-creating a state if a similar state has already been seen.
- Using a compressed representation inside static repetitions.
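The first optimization, lazy compilation of a deterministic FA, can be illustrated without capture groups. The following is a minimal sketch under stated assumptions: the NFA table (for (a|b)*abb), the class name LazyDFA and its methods are all hypothetical, and the engine presented here additionally tracks histories, which this sketch omits. DFA states are sets of NFA states, and a transition is computed only the first time it is needed.

```python
from typing import Dict, FrozenSet, Tuple

# Hypothetical NFA for (a|b)*abb without epsilon moves:
# state -> {symbol -> set of successor states}.
NFA = {
    0: {"a": {0, 1}, "b": {0}},
    1: {"b": {2}},
    2: {"b": {3}},
    3: {},
}
START, ACCEPT = 0, 3

DState = FrozenSet[int]

class LazyDFA:
    def __init__(self) -> None:
        # Transitions discovered so far; filled in on demand.
        self.cache: Dict[Tuple[DState, str], DState] = {}

    def step(self, state: DState, ch: str) -> DState:
        key = (state, ch)
        if key not in self.cache:
            # Interpret the NFA once for this (state set, character) pair.
            self.cache[key] = frozenset(
                t for s in state for t in NFA[s].get(ch, ())
            )
        return self.cache[key]

    def matches(self, text: str) -> bool:
        state: DState = frozenset({START})
        for ch in text:
            state = self.step(state, ch)
        return ACCEPT in state

dfa = LazyDFA()
print(dfa.matches("ababb"), dfa.matches("abab"), len(dfa.cache))
```

Long repetitions revisit the same few state sets, so after a short warm-up most characters cost a single dictionary lookup.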

Benchmarks NFA interpretation

Example: (a?(a)b)+

Parse (a?(a)b)+ over "a0a1b2a3b4" (the digit after each character is its position).

[Figure: parse tree over the input a a b a b at positions 0 to 4, with nodes labeled by groups 1 and 2.]

Example: (a?(a)b)+

Reading "a0a1b2a3b4"

Before the first character is read, threads appear one state at a time:

q1: [[], [], [], []]
q2: [[0], [], [], []]
q3: [[0], [], [], []]
q4: [[0], [], [0], []]
q5, q6, q7, q8, q9: (no thread)

Threads

State: q
Histories: h1 h2 h3 h4 h5 h6

- A copy of the thread is modified.
- Copying the array of histories makes reading a character O(m^2).
- A faster persistent data structure is needed to get O(m log m).
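To make the cost argument concrete, here is a minimal sketch of a thread as a state plus an array of histories. The class names Cell and Thread, the fork method and the choice of linked lists for the individual histories are assumptions made for illustration, not the thesis's data layout. Pushing a position onto one history is O(1) and shares all older entries, but every fork still copies the whole array of history heads, which is the per-fork cost the slide refers to.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass(frozen=True)
class Cell:
    """One history entry; `rest` points at older entries and is shared, never copied."""
    pos: int
    rest: Optional["Cell"]

@dataclass(frozen=True)
class Thread:
    state: str
    histories: Tuple[Optional[Cell], ...]  # one history head per tag

    def fork(self, state: str, tag: Optional[int] = None, pos: int = 0) -> "Thread":
        """Return a modified copy; the original thread is left untouched."""
        heads = list(self.histories)            # copies the array of history heads
        if tag is not None:
            heads[tag] = Cell(pos, heads[tag])  # O(1) push onto one history
        return Thread(state, tuple(heads))

def to_list(cell: Optional[Cell]) -> List[int]:
    """Read a history back in chronological order."""
    out = []
    while cell is not None:
        out.append(cell.pos)
        cell = cell.rest
    return out[::-1]

t1 = Thread("q1", (None, None, None, None))
t2 = t1.fork("q2", tag=0, pos=0)
t4 = t2.fork("q4", tag=2, pos=0)
print([to_list(h) for h in t4.histories])
```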

Optimized thread forking

Set entry 2 to 20:

[Figure: a tree with nodes 1 to 13. After the update, two new nodes labeled 1 and 20 appear next to the original tree.]
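The figure suggests a tree updated by path copying. The sketch below shows that general technique, not necessarily the exact tree layout on the slide: values live in the leaves of a balanced binary tree, and all names (Node, build, get, set_entry) are hypothetical. Setting one entry creates new nodes only along the path from the root to that entry, so a fork costs O(log m) instead of a full copy, and the old version stays valid.

```python
from typing import Any, List, Optional

class Node:
    __slots__ = ("left", "right", "value")

    def __init__(self, left: Optional["Node"] = None,
                 right: Optional["Node"] = None, value: Any = None) -> None:
        self.left, self.right, self.value = left, right, value

def build(values: List[Any], lo: int, hi: int) -> Node:
    """Build a balanced tree whose leaves hold values[lo:hi]."""
    if hi - lo == 1:
        return Node(value=values[lo])
    mid = (lo + hi) // 2
    return Node(build(values, lo, mid), build(values, mid, hi))

def get(node: Node, lo: int, hi: int, index: int) -> Any:
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if index < mid:
            node, hi = node.left, mid
        else:
            node, lo = node.right, mid
    return node.value

def set_entry(node: Node, lo: int, hi: int, index: int, value: Any) -> Node:
    """Return a new root; only nodes on the path to `index` are copied."""
    if hi - lo == 1:
        return Node(value=value)
    mid = (lo + hi) // 2
    if index < mid:
        return Node(set_entry(node.left, lo, mid, index, value), node.right)
    return Node(node.left, set_entry(node.right, mid, hi, index, value))

entries = list(range(1, 14))          # entries 1..13, as in the figure
old = build(entries, 0, 13)
new = set_entry(old, 0, 13, 1, 20)    # entry 2 sits at index 1
print(get(old, 0, 13, 1), get(new, 0, 13, 1), new.right is old.right)
```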

Example: (a?(a)b)+

Reading "a0a1b2a3b4"

q1: [[], [], [], []]
q2: [[0], [], [], []]
q3: [[0], [], [], []]
q4: [[0], [], [0], []]
q5, q6, q7, q8, q9: (no thread)

For each character read, threads start hungry and must eat immediately.

Example: (a?(a)b)+

Reading "a0a1b2a3b4"

q1: [[], [], [], []]
q2: [[0], [], [], []]
q3: [[0], [], [], []]
q5: [[0], [], [0], []]
q4, q6, q7, q8, q9: (no thread)

For each character read, threads start hungry and must eat immediately. Only a hungry thread can eat.

Example: (a?(a)b)+

Reading "a0a1b2a3b4"

q1: [[], [], [], []]
q2: [[0], [], [], []]
q3: [[0], [], [], []]
q5: [[0], [], [0], []]
q6: [[0], [], [0], [0]]
q4, q7, q8, q9: (no thread)

Example: (a?(a)b)+

Reading "a0a1b2a3b4"

q1: [[], [], [], []]
q2: [[0], [], [], []]
q5: [[0], [], [0], []]
q6: [[0], [], [0], [0]]
q3, q4, q7, q8, q9: (no thread)

Example: (a?(a)b)+

Reading "a0a1b2a3b4"

q3: [[0], [], [], []]
q4: [[0], [], [1], []]
q5: [[0], [], [0], []]
q6: [[0], [], [0], [0]]
q1, q2, q7, q8, q9: (no thread)

Example: (a?(a)b)+

Reading "a0a1b2a3b4"

q4: [[0], [], [1], []]
q6: [[0], [], [0], [0]]
q1, q2, q3, q5, q7, q8, q9: (no thread)


Example: (a?(a)b)+

Reading "a0a1b2a3b4"

q5: [[0], [], [1], []]
q6: [[0], [], [1], [1]]
q1, q2, q3, q4, q7, q8, q9: (no thread)

Example: (a?(a)b)+

Reading "a0a1b2a3b4"

q6: [[0], [], [1], [1]]
q1, q2, q3, q4, q5, q7, q8, q9: (no thread)

Example: (a?(a)b)+

Reading "a0a1b2a3b4"

q1: [[0], [2], [1], [1]]
q2: [[0,2], [2], [1], [1]]
q3: [[0,2], [2], [1], [1]]
q4: [[0,2], [2], [1,3], [1]]
q7: [[0], [], [1], [1]]
q8: [[0], [2], [1], [1]]
q9: [[0], [2], [1], [1]]
q5, q6: (no thread)


Example: (a?(a)b)+

Reading "a0a1b2a3b4"

q3: [[0,2], [2], [1], [1]]
q4: [[0,2], [2], [1,4], [1]]
q5: [[0,2], [2], [1,3], [1]]
q6: [[0,2], [2], [1,3], [1,3]]
q1, q2, q7, q8, q9: (no thread)


Example: (a?(a)b)+

Reading "a0a1b2a3b4"

q1: [[0,2], [2,4], [1,3], [1,3]]
q2: [[0,2,5], [2,4], [1,3], [1,3]]
q3: [[0,2,5], [2,4], [1,3], [1,3]]
q4: [[0,2,5], [2,4,5], [1,3], [1,3]]
q7: [[0,2], [2], [1,3], [1,3]]
q8: [[0,2], [2,4], [1,3], [1,3]]
q9: [[0,2], [2,4], [1,3], [1,3]]
q5, q6: (no thread)

Example: (a?(a)b)+

Reading "a0a1b2a3b4"

q9: [[0,2], [2,4], [1,3], [1,3]]

[Figure: resulting parse tree over the input a a b a b at positions 0 to 4, with nodes labeled by groups 1 and 2.]
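The walkthrough can be approximated in a few lines. What follows is a sketch of a Pike-style thread simulation under stated assumptions, not the thesis implementation: the instruction list PROGRAM is a hand-written, hypothetical encoding of (a?(a)b)+ rather than the automaton q1 to q9, histories are plain tuples copied on every update (the naive cost discussed earlier), and an opening tag records the index of the next character while a closing tag records the index of the last character consumed. Under that convention the outer group opens at 0 and 3, whereas the slides list 0 and 2; the closing and inner-group positions agree.

```python
from typing import List, Optional, Set, Tuple

Histories = Tuple[Tuple[int, ...], ...]

# History indices: 0 = group 1 open, 1 = group 1 close,
#                  2 = group 2 open, 3 = group 2 close.
PROGRAM = [
    ("open", 0),      # 0: outer group starts
    ("split", 2, 3),  # 1: a? prefers to consume an "a"
    ("char", "a"),    # 2
    ("open", 2),      # 3: inner group starts
    ("char", "a"),    # 4
    ("close", 3),     # 5: inner group ends
    ("char", "b"),    # 6
    ("close", 1),     # 7: outer group ends
    ("split", 0, 9),  # 8: + prefers another iteration
    ("match",),       # 9
]

def record(hist: Histories, index: int, pos: int) -> Histories:
    """Return a copy of hist with pos appended to history `index`."""
    return hist[:index] + (hist[index] + (pos,),) + hist[index + 1:]

def add_thread(threads: List, seen: Set[int], pc: int,
               hist: Histories, pos: int) -> None:
    """Follow non-consuming instructions; queue threads waiting for a character."""
    if pc in seen:  # a higher-priority thread already got here
        return
    seen.add(pc)
    op = PROGRAM[pc]
    if op[0] == "split":
        add_thread(threads, seen, op[1], hist, pos)
        add_thread(threads, seen, op[2], hist, pos)
    elif op[0] == "open":
        add_thread(threads, seen, pc + 1, record(hist, op[1], pos), pos)
    elif op[0] == "close":
        add_thread(threads, seen, pc + 1, record(hist, op[1], pos - 1), pos)
    else:  # "char" or "match"
        threads.append((pc, hist))

def parse(text: str) -> Optional[Histories]:
    threads: List = []
    add_thread(threads, set(), 0, ((), (), (), ()), 0)
    for pos, ch in enumerate(text):
        nxt: List = []
        seen: Set[int] = set()
        for pc, hist in threads:  # priority order
            op = PROGRAM[pc]
            if op[0] == "char" and op[1] == ch:
                add_thread(nxt, seen, pc + 1, hist, pos + 1)
        threads = nxt
    for pc, hist in threads:
        if PROGRAM[pc][0] == "match":
            return hist
    return None

print(parse("aabab"))
```

Each thread carries the full history of every tag, so both iterations of the outer and inner group survive to the end of the single pass.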


Download

https://github.com/nes1983/tree-regex


NFA construction

[Figure: NFA fragments for alternation S1|S2, optional S?, capture group (S), and star operation S*?.]

Backtracking’s nightmare

(a+a+)+b against "a^n b" will backtrack Θ(2^n) times.

Backtracking’s nightmare

Extract the first cell in a CSV that starts with "P" [1]:

^(.*?,)+(P.*?),

Failing against "1,2,3,4,5,6,7,8,9,10,11,12,13" is exponential.

[1] From http://www.regular-expressions.info/catastrophic.html
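The CSV pattern can be tried with a backtracking engine such as Python's re module. In this sketch the matching line "1,2,Pear,4,5" is an assumption added for contrast. The failing line from the slide is short enough to finish quickly, but each extra cell roughly doubles the number of ways the engine can split the line before it gives up.

```python
import re

CSV_P = re.compile(r"^(.*?,)+(P.*?),")

# A line that does contain a cell starting with "P".
hit = CSV_P.match("1,2,Pear,4,5")

# The failing line from the slide: no cell starts with "P",
# so the engine tries every way of grouping the cells first.
miss = CSV_P.match(",".join(str(i) for i in range(1, 14)))

print(hit.group(2), miss)
```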

Thread execution order matters

.*(a?)

[Figure: NFA with start state q1 and states q2 to q5; transitions are labeled any, τ1↑, a, and τ1↓.]
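The pattern is ambiguous: the final "a" can be consumed by .* or by the group. A small sketch with Python's re module (a backtracking engine, used here only to show that both parses exist) makes the two outcomes visible. The lazy variant .*?(a?) and the input "aa" are assumptions added for contrast.

```python
import re

# Greedy .* is tried first and swallows everything; the group stays empty.
greedy = re.fullmatch(r".*(a?)", "aa").group(1)

# With a lazy .*? the group gets the chance to take the last "a".
lazy = re.fullmatch(r".*?(a?)", "aa").group(1)

print(repr(greedy), repr(lazy))
```

In a thread-based engine the same choice is made by the order in which threads are executed.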

Priority matters

(a)|(a)

[Figure: NFA with start state q1 branching into two paths over states q2 to q6; one path is labeled τ1↑, a, τ1↓ and the other τ2↑, a, τ2↓.]
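Both alternatives match the same input, so only priority decides which group is filled. As a reference point, Python's re module prefers the left alternative, as the short check below shows.

```python
import re

# Both branches accept "a"; the left one has priority.
groups = re.fullmatch(r"(a)|(a)", "a").groups()
print(groups)
```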

Optimization Pipeline

1. Convert to nondeterministic FA.
2. Interpret nondeterministic FA, building deterministic FA lazily.
3. Find similar/mappable states to avoid creating an infinite DFA.
4. Run on DFA if possible.
5. Compactify DFA if creation of new states wasn’t necessary for a while.


NFA interpretation
