Regular Expressions January 27, 2015 Linux and basic scripting course Arzu Tugce Guler [email protected]


Page 1: Regular Expressions January 27, 2015 Linux and basic scripting course Arzu Tugce Guler a.t.guler@lumc.nl

Regular Expressions

January 27, 2015

Linux and basic scripting course

Arzu Tugce Guler

a.t.guler@lumc.nl

What is it?

A specific kind of text pattern, a sequence of characters, that can be used for:

• pattern matching with strings

Split a text

o e.g. tryptic digestion

>seq = VGTKCCTKPESERMPCTEDYLSLILNR

>split(/(?!P)(?<=[RK])/, seq)

>> VGTK

>> CCTKPESER

>> MPCTEDYLSLILNR
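As a minimal sketch, the same split can be reproduced in Python with the standard `re` module (Python 3.7 or later, where `re.split` accepts zero-width matches). The choice of Python and the names `pieces` and `peptides` are assumptions made for illustration; the slide itself uses generic pseudocode.

```python
import re

seq = "VGTKCCTKPESERMPCTEDYLSLILNR"

# Split at every position that is preceded by R or K and not followed by P.
# Both conditions are zero-width, so no residue is consumed by the split.
pieces = re.split(r"(?<=[RK])(?!P)", seq)

# The final R also counts as a cleavage site, which leaves an empty
# string at the end of the list; drop it.
peptides = [p for p in pieces if p]

print(peptides)  # ['VGTK', 'CCTKPESER', 'MPCTEDYLSLILNR']
```

The K in CCTKPESER is not cut because it is followed by P, which is exactly what the negative lookahead `(?!P)` encodes.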

What is it?

Find the text pattern in the input

o e.g. finding certain patterns in sequences, such as E[IL]+T

Replace text matching the pattern with other text

o e.g. translating a DNA sequence into a peptide sequence
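Both uses can be sketched in Python with the standard `re` module. The input sequences and the three-entry codon table below are hypothetical and only cover this toy example; a real translation would need the full genetic code.

```python
import re

# Hypothetical peptide sequence, used only for illustration.
peptide = "MKEILTAAEITGGELT"

# Find: an E, then one or more I or L, then a T.
hits = re.findall(r"E[IL]+T", peptide)
print(hits)  # ['EILT', 'EIT', 'ELT']

# Replace: translate DNA codon by codon. Only the codons needed for this
# toy input are listed; any other codon is replaced by 'X'.
codon_table = {"ATG": "M", "AAA": "K", "GGC": "G"}
dna = "ATGAAAGGC"
protein = re.sub(r"[ACGT]{3}", lambda m: codon_table.get(m.group(), "X"), dna)
print(protein)  # MKG
```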

What is it?

Matching patterns in strings using simple rules and symbols

Some background

Regular expressions stem from mathematics and theoretical computer science.

• The name refers to the mathematical notion of “regularity”.

• A regular expression can be implemented using a deterministic finite automaton.

Some background: Finite Automaton

[Figure: finite automaton diagram made of STATES and TRANSITIONS. States shown: START, INCREASE, ACCEPT. Transition symbols shown: + and #.]
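The idea of states and transitions can be sketched as a small lookup table in Python. This is not the automaton from the figure: the state names and the accepted language (an A, one or more B, then a C, i.e. the pattern AB+C that appears later in these slides) are assumptions chosen for illustration.

```python
# Transition table of a hypothetical deterministic finite automaton:
# (current state, input symbol) -> next state.
TRANSITIONS = {
    ("START", "A"): "SEEN_A",
    ("SEEN_A", "B"): "SEEN_B",
    ("SEEN_B", "B"): "SEEN_B",   # loop: any number of further B's
    ("SEEN_B", "C"): "ACCEPT",
}


def accepts(text: str) -> bool:
    """Return True if the automaton is in ACCEPT after reading all of text."""
    state = "START"
    for symbol in text:
        state = TRANSITIONS.get((state, symbol))
        if state is None:  # no transition defined for this symbol: reject
            return False
    return state == "ACCEPT"


print(accepts("ABBC"))  # True
print(accepts("AC"))    # False
```

Each input character moves the machine along exactly one arrow, which is what makes the automaton deterministic.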

NFA and DFA

• In computer science, regular expressions are normally represented by either a non-deterministic finite automaton (NFA) or a deterministic finite automaton (DFA).

• Every DFA is also an NFA, and every NFA can be translated into an equivalent DFA.

• Since this is a short course, we are NOT going to pursue them further; instead, we represent regular expressions with informal (unbound) state diagrams for the sake of simplicity.

Symbols

[…] matches one of the characters or symbols in the character list

[A] matches A

[AT] matches A or T
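A quick way to check a character list is to test it against single characters. The sketch below assumes Python's `re` module; the name `results` is only for illustration.

```python
import re

# [AT] matches exactly one character, which must be A or T.
results = {base: bool(re.fullmatch(r"[AT]", base)) for base in "ATGC"}

print(results)  # {'A': True, 'T': True, 'G': False, 'C': False}
```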

Symbols

[AT][G][ATGC]

One of A or T, followed by G, followed by one of A, T, G or C.

Matches: AGA, AGT, AGG, AGC, TGA, TGT, TGG, TGC

Symbols

[AT][G][ATGC]

In CCGCGCTGATT, the pattern matches the substring TGA.
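The pattern [AT][G][ATGC] can be explored in Python as a rough sketch: first by listing every string it can match, then by searching inside the longer sequence from the slide. The variable names are assumptions for illustration.

```python
import re
from itertools import product

pattern = re.compile(r"[AT][G][ATGC]")

# Enumerate every combination of one character from each list.
all_matches = ["".join(chars) for chars in product("AT", "G", "ATGC")]
print(all_matches)
# ['AGA', 'AGT', 'AGG', 'AGC', 'TGA', 'TGT', 'TGG', 'TGC']

# Search for the first occurrence inside a longer sequence.
m = pattern.search("CCGCGCTGATT")
print(m.group(), m.start())  # TGA 6
```

`search` scans from left to right and reports the first position where all three character lists fit in a row.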

Symbols

^ the beginning of the string

^[AT][G][ATGC]

CCGCGCTGATT (no match)

AAGAT (no match)

AGAT (matches AGA)

TGACA (matches TGA)

TGA (matches TGA)

TGCGGTCGATT (matches TGC)
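The effect of ^ on the strings above can be checked with a short Python sketch (the list name `starts_with_motif` is an assumption for illustration):

```python
import re

strings = ["CCGCGCTGATT", "AAGAT", "AGAT", "TGACA", "TGA", "TGCGGTCGATT"]

# ^ pins the match to the first position of each string.
starts_with_motif = [s for s in strings if re.search(r"^[AT][G][ATGC]", s)]

print(starts_with_motif)  # ['AGAT', 'TGACA', 'TGA', 'TGCGGTCGATT']
```

AAGAT is rejected even though it contains AGA, because the motif does not start at the first character.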

Symbols

$ the end of the string

[AT][G][ATGC]$

CCGCGCTGATT (no match)

AAGAT (no match)

AGAT (no match)

TGACA (no match)

TGA (matches TGA)

GGTCGATTTGC (matches TGC)
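A similar sketch for $, again assuming Python's `re` module and an illustrative list name:

```python
import re

strings = ["CCGCGCTGATT", "AAGAT", "AGAT", "TGACA", "TGA", "GGTCGATTTGC"]

# $ requires the match to finish at the end of the string.
ends_with_motif = [s for s in strings if re.search(r"[AT][G][ATGC]$", s)]

print(ends_with_motif)  # ['TGA', 'GGTCGATTTGC']
```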

Symbols

+ one or more of the preceding pattern

AB+C

An A, followed by one or more B, followed by a C.

Matches: ABC, ABBC, ABBBBC
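A small check of AB+C in Python, as a sketch. The candidate strings include AC, which is not on the slide and is added here only to show a non-match.

```python
import re

candidates = ["AC", "ABC", "ABBC", "ABBBBC"]

# + needs at least one B between A and C.
plus_results = {s: bool(re.fullmatch(r"AB+C", s)) for s in candidates}

print(plus_results)
# {'AC': False, 'ABC': True, 'ABBC': True, 'ABBBBC': True}
```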

Symbols

* zero or more of the preceding pattern

AB*C

An A, followed by zero or more B, followed by a C.

Matches: AC, ABC, ABBBC
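The difference between * and + shows up only when there is no B at all. A minimal Python sketch, with illustrative names:

```python
import re

candidates = ["AC", "ABC", "ABBBC"]

# Compare * (zero or more) with + (one or more) on the same strings.
star_results = {s: bool(re.fullmatch(r"AB*C", s)) for s in candidates}
plus_on_same = {s: bool(re.fullmatch(r"AB+C", s)) for s in candidates}

print(star_results)  # {'AC': True, 'ABC': True, 'ABBBC': True}
print(plus_on_same)  # {'AC': False, 'ABC': True, 'ABBBC': True}
```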

Symbols

? zero or one of the preceding pattern

COLOU?R

COLO, followed by an optional U, followed by R.

Matches: COLOR, COLOUR
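A sketch of the optional U in Python. The input text is made up for illustration and includes COLOUUR to show that ? allows at most one U.

```python
import re

text = "COLOR COLOUR COLOUUR"

# U? means the U may appear once or not at all.
spellings = re.findall(r"COLOU?R", text)

print(spellings)  # ['COLOR', 'COLOUR']
```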

Symbols

. any single character

.AT

Any one character, followed by AT.

Matches: HAT, CAT, RAT, FAT
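A short Python sketch of the dot. The extra test strings "AT" and " AT" are assumptions added to show that the dot must consume exactly one character, and that this character can be anything, including a space.

```python
import re

words = ["HAT", "CAT", "RAT", "FAT", "AT", " AT"]

# . stands for exactly one character of any kind.
dot_results = {w: bool(re.fullmatch(r".AT", w)) for w in words}

print(dot_results)
# {'HAT': True, 'CAT': True, 'RAT': True, 'FAT': True, 'AT': False, ' AT': True}
```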

Symbols

| or

TAT(A.|.A)T matches TATxyT, where x or y is A

() is for precedence

Symbols

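| or

TAT(A.|.A)T

[Figure: state diagram. After TAT the path branches: either A followed by any of A, T, G, C, or any of A, T, G, C followed by A; both branches end in T.]

Matches: TATCAT, TATAGT, TATAAT (TATAAT matches through either branch)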
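The alternation can be tried out in Python as a rough sketch. TATCGT is not on the slide; it is added as an assumed counter-example in which neither of the two middle positions is A.

```python
import re

pattern = re.compile(r"TAT(A.|.A)T")

candidates = ["TATCAT", "TATAGT", "TATAAT", "TATCGT"]

# The parentheses limit the | to the two middle characters.
alt_results = {s: bool(pattern.fullmatch(s)) for s in candidates}
print(alt_results)
# {'TATCAT': True, 'TATAGT': True, 'TATAAT': True, 'TATCGT': False}

# group(1) returns the part matched inside the parentheses.
middle = pattern.fullmatch("TATCAT").group(1)
print(middle)  # CA
```

Without the parentheses, `TATA.|.AT` would mean "TATA followed by any character" or "any character followed by AT", which is a different pattern.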

Symbols

• 0-9 any digit from 0 to 9 (used inside brackets: [0-9])

• A-Z any uppercase letter from A to Z (used inside brackets: [A-Z])

• a-z any lowercase letter from a to z (used inside brackets: [a-z])

• [^…] matches any character other than those inside the brackets

• {n,m} matches at least n and at most m of the preceding pattern

• {n,} matches at least n of the preceding pattern

• {,m} matches at most m of the preceding pattern (not supported by every flavor; {0,m} is the portable form)
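The ranges, the negated list and the counted repetitions above can be sketched together in Python. All input strings and variable names are made up for illustration.

```python
import re

# Ranges inside brackets.
digits = re.findall(r"[0-9]+", "chr12:345")   # runs of digits
upper = re.findall(r"[A-Z]", "Tp53")          # uppercase letters
lower = re.findall(r"[a-z]", "Tp53")          # lowercase letters

# Negated list: every character that is neither G nor C.
non_gc = re.findall(r"[^GC]", "GATTACA")

# Counted repetition of the preceding pattern.
text = "A AA AAA AAAA"
two_to_three = re.findall(r"A{2,3}", text)    # between 2 and 3 A's
three_or_more = re.findall(r"A{3,}", text)    # at least 3 A's
at_most_two = [s for s in text.split() if re.fullmatch(r"A{0,2}", s)]

print(digits, upper, lower)   # ['12', '345'] ['T'] ['p']
print(non_gc)                 # ['A', 'T', 'T', 'A', 'A']
print(two_to_three)           # ['AA', 'AAA', 'AAA']
print(three_or_more)          # ['AAA', 'AAAA']
print(at_most_two)            # ['A', 'AA']
```

In the run of four A's, `A{2,3}` takes the first three and cannot match the single A that is left over.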

Symbols

[^PRK][^RK]{4,}[RK]

1. Any character except P, R and K

2. Followed by at least 4 characters that are neither R nor K

3. Followed by R or K

Tryptic peptide with no missed cleavage

Minimum 6 aa.
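As a sketch, the pattern can be applied to the three peptides from the tryptic digestion example at the start of these slides. Using `fullmatch` (the whole peptide must fit the pattern) is an assumption about how the pattern is meant to be used.

```python
import re

# Pattern from the slide, written without spaces between its parts.
tryptic = re.compile(r"[^PRK][^RK]{4,}[RK]")

# Peptides from the tryptic digestion example.
peptides = ["VGTK", "CCTKPESER", "MPCTEDYLSLILNR"]

# fullmatch requires the whole peptide to fit the pattern.
tryptic_results = {p: bool(tryptic.fullmatch(p)) for p in peptides}

print(tryptic_results)
# {'VGTK': False, 'CCTKPESER': False, 'MPCTEDYLSLILNR': True}
```

VGTK is too short (fewer than 6 residues), and CCTKPESER is rejected because of its internal K: this simple pattern treats every internal R or K as a missed cleavage, including a K that is followed by P.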

Symbols

[0-9A-Z]{3,6}_DANRE

1. At least 3 and at most 6 characters, each a digit or an uppercase letter

2. Followed by _DANRE

UniProt zebrafish entry
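A minimal Python sketch of this pattern. The entry names below are illustrative strings chosen to exercise the pattern, not a list taken from UniProt.

```python
import re

entry = re.compile(r"[0-9A-Z]{3,6}_DANRE")

# Illustrative strings only.
names = ["P53_DANRE", "P53_HUMAN", "AB_DANRE", "ABCDEFG_DANRE"]

entry_results = {n: bool(entry.fullmatch(n)) for n in names}

print(entry_results)
# {'P53_DANRE': True, 'P53_HUMAN': False, 'AB_DANRE': False, 'ABCDEFG_DANRE': False}
```

The second name has the wrong suffix, the third has too few characters before the underscore, and the fourth has too many.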

Regular Expression flavors

• Basic regular expressions are normally supported by all utilities that support regular expressions.

• Sometimes a pattern needs support for extended regular expressions: in POSIX basic syntax, for example, +, ? and | are not operators, so grep needs the -E option to treat them as such.

Practicing Regex

An online tool for visually testing regex:

https://www.debuggex.com/