Regular Expressions Minh Hoang TO Portal Team

Regular expression made by To Minh Hoang - Portal team

This is a presentation from eXo Platform SEA.

Regular Expressions

Minh Hoang TO, Portal Team

Agenda

» Finite State Machine

» Pattern Parser

» Java Regex

» Parsers in GateIn

» Advanced Theory

Finite State Machine

State Diagram

JIRA Issue Lifecycle

Java Thread Lifecycle

Java Compilation Flow

Finite State Machine - FSM

» A behavioral model describing the workflow of a system

» Directed graph with labeled edges

Pattern Parser

Classic Problem

» A: a finite character set

Ex:

A = {a, b, c, d,..., z} or A = {a, b, c,..., z, public, class, extends, implements, while, if,...}

» A pattern P and an input sequence INPUT, both made of elements of A

Ex:

P = "a.*b" or P = "class.*extends.*"

INPUT = "aaabbbcc" or INPUT = a Java source file

→ The parser reads INPUT character by character and recognizes every subsequence matching pattern P

Classic Problem - Samples

» Split a sequence of characters into an array of subsequences

String path = "/portal/en/classic/home";
String[] segments = path.split("/");

» Handle comment block encountered in a file

» Override readLine() in BufferedReader

» Extract data from REST response

» Write an XML parser from scratch

Finite State Machine & Classic Problem

» Acceptor FSM?

» How can the Classic Problem be transformed into a graph traversal problem with a well-known generic solution?

Finding pattern occurrences ↔ Traversing a directed graph with labeled edges

FSM – Word Accepting

» Consider a word W, a sequence of characters from character set A

W = "abcd...xyz"

An FSM whose graph edges are labeled with characters from A accepts W if there exists a path connecting the START node to one of the END nodes

START = S1 → S2 → … → Sn = END

1. Intermediate nodes may appear more than once on the path

2. The transition S_i → S_(i+1) is determined (labeled) by the i-th character of W

Acceptor FSM

» Given a pattern P, an FSM is called an Acceptor FSM of P if it accepts exactly the words matching P.

Ex:

The Acceptor FSM of "a[0-9]b" accepts every element of the word set

{ "a0b", "a1b", "a2b", "a3b", "a4b", "a5b", "a6b", "a7b", "a8b", "a9b" }
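The acceptor for a[0-9]b can be sketched as a small table-driven graph walk. The class name AcceptorFsm, the integer state numbering, and the helper forADigitB are illustrative assumptions, not something taken from the slides.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

public class AcceptorFsm {
    // transitions.get(state).get(ch) is the next state; a missing entry means "no edge".
    private final Map<Integer, Map<Character, Integer>> transitions = new HashMap<>();
    private final int start;
    private final Set<Integer> ends;

    public AcceptorFsm(int start, Set<Integer> ends) {
        this.start = start;
        this.ends = ends;
    }

    public void addEdge(int from, char label, int to) {
        transitions.computeIfAbsent(from, k -> new HashMap<>()).put(label, to);
    }

    // Follows one labeled edge per character; accepts if the walk ends on an END node.
    public boolean accepts(String word) {
        int state = start;
        for (int i = 0; i < word.length(); i++) {
            Map<Character, Integer> edges = transitions.get(state);
            Integer next = (edges == null) ? null : edges.get(word.charAt(i));
            if (next == null) {
                return false; // no edge labeled with this character
            }
            state = next;
        }
        return ends.contains(state);
    }

    // Hypothetical numbering: 0 --a--> 1 --digit--> 2 --b--> 3 (END).
    public static AcceptorFsm forADigitB() {
        AcceptorFsm fsm = new AcceptorFsm(0, Collections.singleton(3));
        fsm.addEdge(0, 'a', 1);
        for (char d = '0'; d <= '9'; d++) {
            fsm.addEdge(1, d, 2);
        }
        fsm.addEdge(2, 'b', 3);
        return fsm;
    }

    public static void main(String[] args) {
        AcceptorFsm fsm = forADigitB();
        System.out.println(fsm.accepts("a7b"));
        System.out.println(fsm.accepts("ab"));
    }
}
```

Keeping the edges in a map means the same accepts loop works for any acceptor; only the edge list changes from pattern to pattern.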

How a Pattern Parser Works

Traverse the directed graph associated with the Acceptor FSM

1. Start from the root node

2. Read the next character from INPUT, then move according to the transition rules

3. Repeat step 2 until a leaf node is reached or INPUT is exhausted

4. Return OK if the leaf node represents a successful match

Example One

» Recognize pattern

eXo.*er

in:

AAAeXo123erBBBeXoerCCCeXoeXoerDDD

» Acceptor FSM with 8 states:

START – start reading the input sequence

e – encountered e

eX – encountered eX

eXo – encountered eXo

eXo.* – encountered eXo.*

eXo.*e – encountered eXo.*e

END – subsequence matching eXo.*er found

FAILURE
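A hand-written scanner built on these states might look like the sketch below. Several choices are assumptions, since the state diagram itself did not survive in the text: the class name ExoScanner, ending each match at the earliest "er", returning to START after a match, and treating FAILURE as "input ended before END".

```java
import java.util.ArrayList;
import java.util.List;

public class ExoScanner {
    // Working states from the slide. END is handled inline by recording a match
    // and going back to START; FAILURE corresponds to running out of input.
    enum State { START, E, EX, EXO, EXO_ANY, EXO_ANY_E }

    // Returns the non-overlapping subsequences matching eXo.*er,
    // each one ending at the earliest possible "er" (an assumption).
    static List<String> scan(String input) {
        List<String> matches = new ArrayList<>();
        State state = State.START;
        int matchStart = -1;
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            switch (state) {
                case START:
                    if (c == 'e') { state = State.E; matchStart = i; }
                    break;
                case E:
                    if (c == 'X') { state = State.EX; }
                    else if (c == 'e') { matchStart = i; }   // restart on the newer 'e'
                    else { state = State.START; }
                    break;
                case EX:
                    if (c == 'o') { state = State.EXO; }
                    else if (c == 'e') { state = State.E; matchStart = i; }
                    else { state = State.START; }
                    break;
                case EXO:
                case EXO_ANY:
                    state = (c == 'e') ? State.EXO_ANY_E : State.EXO_ANY;
                    break;
                case EXO_ANY_E:
                    if (c == 'r') {
                        matches.add(input.substring(matchStart, i + 1));
                        state = State.START;
                    } else if (c != 'e') {
                        state = State.EXO_ANY;
                    }
                    break;
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        System.out.println(scan("AAAeXo123erBBBeXoerCCCeXoeXoerDDD"));
        // prints [eXo123er, eXoer, eXoeXoer]
    }
}
```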

Example Two

» Recognize comment block

/* */

in:

/* Don't ask */ final int innerClassVariable;

» Acceptor FSM with 5 states:

START – start reading input sequence

OUT – stay away from comment blocks

ENTERING – at the beginning of comment block

IN – stay inside a comment block

LEAVING – at the end of comment block
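The five states above can be turned into a comment stripper along these lines. The class name CommentStripper and the exact transitions are assumptions reconstructed from the state names; the sketch deliberately ignores string literals and // line comments.

```java
public class CommentStripper {
    enum State { START, OUT, ENTERING, IN, LEAVING }

    // Copies everything outside /* ... */ blocks to the result.
    static String strip(String source) {
        StringBuilder out = new StringBuilder();
        State state = State.START;
        for (int i = 0; i < source.length(); i++) {
            char c = source.charAt(i);
            switch (state) {
                case START:
                case OUT:
                    if (c == '/') { state = State.ENTERING; }       // possible start of "/*"
                    else { out.append(c); state = State.OUT; }
                    break;
                case ENTERING:
                    if (c == '*') { state = State.IN; }             // saw "/*"
                    else if (c == '/') { out.append('/'); }         // previous '/' was plain text
                    else { out.append('/').append(c); state = State.OUT; }
                    break;
                case IN:
                    if (c == '*') { state = State.LEAVING; }        // possible start of "*/"
                    break;
                case LEAVING:
                    if (c == '/') { state = State.OUT; }            // saw "*/"
                    else if (c != '*') { state = State.IN; }
                    break;
            }
        }
        if (state == State.ENTERING) {
            out.append('/'); // a trailing '/' that never opened a comment
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(strip("/* Don't ask */final int innerClassVariable;"));
        // prints final int innerClassVariable;
    }
}
```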

Finite State Machine With Stack

» Example Two is slightly harder than Example One because the transition decision depends on past information → we must keep something in memory

» FSM with Stack = ordinary FSM + stack structure storing past information

The contextual transition is determined by the pair (next input character, stack state)
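As an illustration of a transition that depends on both the next character and the stack, here is a minimal bracket-balancing check. The example and the name BracketChecker are not from the slides; they are simply a common way to show why a stack is needed.

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class BracketChecker {
    // Push on an opening bracket; on a closing bracket the move is legal only
    // if the top of the stack holds the matching opener.
    static boolean isBalanced(String input) {
        Deque<Character> stack = new ArrayDeque<>();
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            switch (c) {
                case '(':
                case '[':
                case '{':
                    stack.push(c);
                    break;
                case ')':
                case ']':
                case '}':
                    if (stack.isEmpty() || stack.pop() != opener(c)) {
                        return false;
                    }
                    break;
                default:
                    break; // other characters do not touch the stack
            }
        }
        return stack.isEmpty();
    }

    private static char opener(char closer) {
        switch (closer) {
            case ')': return '(';
            case ']': return '[';
            default:  return '{';
        }
    }

    public static void main(String[] args) {
        System.out.println(isBalanced("{[()]}"));
        System.out.println(isBalanced("(]"));
    }
}
```

A plain FSM cannot do this for arbitrary nesting depth, because it would need one state per possible depth.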

Java Regex

Model

» Pattern: Acceptor Finite State Machine

» Matcher: Parser

java.util.regex.Pattern

» Construct an FSM accepting the pattern

Pattern p = Pattern.compile("a.*b");

FSM states are instances of java.util.regex.Pattern$Node

» Generate a parser working on an input sequence

Matcher matcher = p.matcher("aaabbbb");

java.util.regex.Matcher

» Find the next subsequence matching the pattern

find()

» Get the text captured by the latest match

group(), group(int)
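The find()/group() loop can be tried on the input from Example One. The helper name findAll is an assumption. The sketch also shows that the greedy .* and the reluctant .*? split the same input differently.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FindDemo {
    // Collects every match of regex in input, in order of appearance.
    static List<String> findAll(String regex, String input) {
        Matcher matcher = Pattern.compile(regex).matcher(input);
        List<String> found = new ArrayList<>();
        while (matcher.find()) {
            found.add(matcher.group()); // group() is the whole current match
        }
        return found;
    }

    public static void main(String[] args) {
        String input = "AAAeXo123erBBBeXoerCCCeXoeXoerDDD";
        // Greedy: .* runs to the last "er", so there is a single long match.
        System.out.println(findAll("eXo.*er", input));
        // prints [eXo123erBBBeXoerCCCeXoeXoer]
        // Reluctant: .*? stops at the first "er", giving three matches.
        System.out.println(findAll("eXo.*?er", input));
        // prints [eXo123er, eXoer, eXoeXoer]
    }
}
```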

Capturing Group

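Two Pattern objects

Pattern p = Pattern.compile("abcd.*efgh");
Pattern q = Pattern.compile("abcd(.*)efgh");
String text = "abcd12345efgh";
Matcher pM = p.matcher(text);
Matcher qM = q.matcher(text);

» pM.find() == qM.find(); both patterns match the same text

» qM.group(1) returns "12345", while pM.group(1) throws IndexOutOfBoundsException because p has no capturing group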

» Hold additional information on each match

while (matcher.find()) {
    matcher.group(index);
}

» Pattern P = (A)(B(C))

matcher.group(0) = the whole sequence ABC
matcher.group(1) = A
matcher.group(2) = BC
matcher.group(3) = C
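Group numbering follows the position of each opening parenthesis, counted from left to right. A short sketch (the helper name groupsOf is an assumption) makes this easy to check:

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GroupDemo {
    // Returns group(0)..group(groupCount()) of the first match, or an empty array.
    static String[] groupsOf(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.find()) {
            return new String[0];
        }
        String[] groups = new String[m.groupCount() + 1];
        for (int i = 0; i <= m.groupCount(); i++) {
            groups[i] = m.group(i);
        }
        return groups;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(groupsOf("(A)(B(C))", "ABC")));
        // prints [ABC, A, BC, C]
        System.out.println(Arrays.toString(groupsOf("abcd(.*)efgh", "abcd12345efgh")));
        // prints [abcd12345efgh, 12345]
    }
}
```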

» Pattern.compile("abc(defgh");
Pattern.compile("abcdef)gh");

→ PatternSyntaxException

» Pattern.compile("abc\\(defgh");
Pattern.compile("abcdef\\)gh");

→ Success thanks to the escape character '\'
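The four cases can be checked with a small helper; the names EscapeDemo and compiles are assumptions. When the whole string should be taken literally, Pattern.quote is an alternative to escaping each metacharacter by hand.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

public class EscapeDemo {
    // True if the regex compiles, false if it raises PatternSyntaxException.
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(compiles("abc(defgh"));    // unclosed group
        System.out.println(compiles("abcdef)gh"));    // unmatched ')'
        System.out.println(compiles("abc\\(defgh"));  // escaped '(' is a literal
        System.out.println(compiles("abcdef\\)gh"));  // escaped ')' is a literal
        // Pattern.quote turns the whole string into a literal pattern.
        System.out.println(Pattern.matches(Pattern.quote("abc(defgh"), "abc(defgh"));
    }
}
```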

Operators

» Union

[a-zA-Z0-9]

» Negation

[^abc]

[^X]
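A quick way to see both operators at work, using the union of the ranges a-z, A-Z and 0-9 and the negated class [^abc]. The class and method names are assumptions.

```java
import java.util.regex.Pattern;

public class CharClassDemo {
    // Union of three ranges, repeated one or more times.
    private static final Pattern ALNUM = Pattern.compile("[a-zA-Z0-9]+");
    // Negation: exactly one character that is not a, b or c.
    private static final Pattern NOT_ABC = Pattern.compile("[^abc]");

    static boolean isAlphanumeric(String s) {
        return ALNUM.matcher(s).matches();
    }

    static boolean isSingleCharOtherThanAbc(String s) {
        return NOT_ABC.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isAlphanumeric("eXo123"));       // letters and digits only
        System.out.println(isAlphanumeric("eXo-123"));      // '-' is outside the union
        System.out.println(isSingleCharOtherThanAbc("d"));
        System.out.println(isSingleCharOtherThanAbc("a"));
    }
}
```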

Contextual Match

» X(?=Y)

Once X matches, look ahead and expect to find Y

» X(?!Y)

Once X matches, look ahead and expect not to find Y

» X(?<=Y)

Once X matches, look behind and expect to find Y

» X(?<!Y)

Once X matches, look behind and expect not to find Y
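The four lookaround forms can be exercised as below. The sample strings and the helper name findAll are assumptions. In the two lookbehind patterns the assertion is written before the token it guards, which is the more common idiom; placing it after X is also legal and then inspects the text X has just consumed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LookaroundDemo {
    static List<String> findAll(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        List<String> found = new ArrayList<>();
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // Lookahead: a word only if ".java" follows; ".java" itself is not consumed.
        System.out.println(findAll("\\w+(?=\\.java)", "Foo.java Bar.txt"));  // prints [Foo]
        // Negative lookahead: a 'q' that is not followed by 'u'.
        System.out.println(findAll("q(?!u)", "Iraq quit"));                  // prints [q]
        // Lookbehind: digits only if '$' comes right before them.
        System.out.println(findAll("(?<=\\$)\\d+", "cost $42 or 17"));       // prints [42]
        // Negative lookbehind: whole numbers not preceded by '$'.
        System.out.println(findAll("(?<!\\$)\\b\\d+", "cost $42 or 17"));    // prints [17]
    }
}
```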

Tips

» A compiled Pattern is immutable and thread-safe → maximize reuse

We often see:

static final Pattern p = Pattern.compile("a*b");

» Be careful with String.split, which treats its argument as a regular expression

String.split vs Java loop + String.charAt
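Two String.split surprises and a loop-based alternative are sketched below. The helper splitOnChar and its choice to skip empty segments are assumptions made for illustration, not code from GateIn.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class SplitDemo {
    // Compiled once and reused, as the tip above recommends.
    private static final Pattern SLASH = Pattern.compile("/");

    // Plain loop + charAt: no regex involved; empty segments are skipped.
    static List<String> splitOnChar(String input, char delimiter) {
        List<String> segments = new ArrayList<>();
        int segmentStart = 0;
        for (int i = 0; i <= input.length(); i++) {
            if (i == input.length() || input.charAt(i) == delimiter) {
                if (i > segmentStart) {
                    segments.add(input.substring(segmentStart, i));
                }
                segmentStart = i + 1;
            }
        }
        return segments;
    }

    public static void main(String[] args) {
        String path = "/portal/en/classic/home";
        // The leading '/' produces an empty first element: 5 elements in total.
        System.out.println(path.split("/").length);
        System.out.println(SLASH.split(path).length);
        // "." is a regex metacharacter, so every character is a delimiter: 0 elements.
        System.out.println("a.b.c".split(".").length);
        // Escaping the dot gives the intended 3 elements.
        System.out.println("a.b.c".split("\\.").length);
        System.out.println(splitOnChar(path, '/'));
        // prints [portal, en, classic, home]
    }
}
```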

Parsers in GateIn

» JavaScript Compressor

» CSS Compressor

» Groovy Template Optimizer

» Navigation Controller

Extracting URL param = Regex matching + Backtracking algorithm

» StaxNavigator (Nice XML parser based on StAX)

Advanced Theory

Grammar & Language

» Any word matching pattern eXo.*er is produced by a combination of transforms, starting from S

S → eXoQer
Q → RQT
Q → ''
R → {a,b,c,d,...}
T → {a,b,c,d,...}

» Language of a grammar = the set of words generated by finite combinations of transforms, starting from S

Ex: Any valid Java source file is generated by a finite number of the transforms listed in the Java grammar (JLS)

Finite State Machine & Language

» A language accepted by an FSM with Stack (a pushdown automaton) is generated by some context-free grammar

» Conversely, the language of any context-free grammar is accepted by some FSM with Stack

Both directions have explicit constructions. Kleene's theorem is the analogous equivalence between stack-free FSMs and regular expressions