alicia-rebecca-carr
What is it?
A specific kind of text pattern (a sequence of characters) that can be used for:
• Pattern matching with strings
Splitting a text
o e.g. tryptic digestion
>seq = VGTKCCTKPESERMPCTEDYLSLILNR
>split(/(?!P)(?<=[RK])/, seq)
>> VGTK
>> CCTKPESER
>> MPCTEDYLSLILNR
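The split above can be reproduced in Python as a minimal sketch. The function name `tryptic_digest` is an assumption made for illustration, and the code relies on Python 3.7 or later, where `re.split` accepts patterns that match an empty string.

```python
import re

def tryptic_digest(seq):
    # Cut after R or K (lookbehind) unless the next residue is P (negative lookahead).
    pieces = re.split(r"(?<=[RK])(?!P)", seq)
    # The position after a C-terminal R/K also matches, which leaves an empty
    # final piece, so empty strings are dropped.
    return [p for p in pieces if p]

peptides = tryptic_digest("VGTKCCTKPESERMPCTEDYLSLILNR")
for p in peptides:
    print(p)
```

The K in CCTKPESER is not a cut site because it is followed by P, which is why that peptide stays in one piece.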
What is it?
Finding the text pattern in the input
o e.g. finding certain patterns in sequences
Find E[IL]+T IN
Replacing text that matches the pattern with other text
o e.g. translating a DNA sequence to a peptide sequence
…
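Finding and replacing can be sketched in Python as follows. The peptide and DNA strings are hypothetical inputs invented for this example, and `CODONS` is a deliberately tiny excerpt of the standard genetic code, not a full translation table.

```python
import re

# Hypothetical peptide, chosen only to illustrate the search.
peptide = "MKEILTAAELLTGG"
# Every occurrence of E, then one or more I/L, then T, with its start position.
hits = [(m.start(), m.group()) for m in re.finditer(r"E[IL]+T", peptide)]
print(hits)

# Small excerpt of the standard codon table (illustration only).
CODONS = {"ATG": "M", "GCC": "A", "AAA": "K", "TGG": "W", "TAA": "*"}
# Hypothetical DNA sequence, read in frame from its first base.
dna = "ATGGCCAAATGGTAA"
# Replace each successive triplet by its amino acid; unknown codons become X.
protein = re.sub(r"[ACGT]{3}", lambda m: CODONS.get(m.group(), "X"), dna)
print(protein)
```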
What is it?
Matching patterns in strings using simple rules and symbols
Some background..
Stems from mathematics and theoretical computer science.
• The name comes from the mathematical notion of “regularity”: a regular expression describes a regular language.
• Can be implemented using a deterministic finite automaton.
[Figure: state diagram with the states START, INCREASE and ACCEPT and edges labelled # and +]
Some background.. Finite Automaton
[Figure: finite automaton diagram with its STATES and TRANSITIONS labelled]
NFA and DFA
• In computer science, regular expressions are usually represented by either a non-deterministic finite automaton (NFA) or a deterministic finite automaton (DFA).
• Every DFA is also an NFA, and every NFA can be translated into an equivalent DFA.
• Since this is a short course, we are NOT going to cover them further; instead, we represent regular expressions with informal/unbound state diagrams for the sake of simplicity.
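To make the link between a regular expression and a DFA concrete, here is a minimal hand-built sketch for the pattern AB+C. The state names and the `dfa_accepts` helper are assumptions made for illustration; this is not how regex engines are necessarily implemented.

```python
import re

# Transition table of a hand-built DFA for the language of AB+C.
# Keys are (current state, input character); values are the next state.
TRANSITIONS = {
    ("start", "A"): "seen_A",
    ("seen_A", "B"): "seen_B",
    ("seen_B", "B"): "seen_B",   # loop: any further B stays in the same state
    ("seen_B", "C"): "accept",
}

def dfa_accepts(text):
    state = "start"
    for ch in text:
        state = TRANSITIONS.get((state, ch))
        if state is None:        # no transition defined: reject
            return False
    return state == "accept"

# The DFA and the regex agree on these inputs.
for s in ["ABC", "ABBBC", "AC", "ABCC"]:
    print(s, dfa_accepts(s), bool(re.fullmatch(r"AB+C", s)))
```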
Symbols
[…] matches one of the characters or symbols in the character list
[A] matches A
[AT] matches A or T
Symbols
[AT][G][ATGC] is built up step by step:
[AT]G
[AT][G]A
[AT][G]T
[AT][G]G
[AT][G]C
[AT][G][ATGC]
Matching strings:
A G A
A G T
A G G
A G C
T G A
T G T
T G G
T G C
Symbols
[AT][G][ATGC]
Searching in CCGCGCTGATT: the pattern matches the substring TGA (CCGCGC TGA TT).
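As a quick check of the character-class slides, the following sketch enumerates every three-letter DNA string and keeps those that match, then searches the example sequence. The variable names are illustrative.

```python
import re
from itertools import product

pattern = re.compile(r"[AT][G][ATGC]")

# All 64 three-letter strings over the DNA alphabet.
candidates = ["".join(p) for p in product("ACGT", repeat=3)]
# fullmatch requires the whole string to match the pattern.
matches = [s for s in candidates if pattern.fullmatch(s)]
print(matches)

# search finds the first place in a longer sequence where the pattern fits.
hit = pattern.search("CCGCGCTGATT")
print(hit.group(), hit.start())
```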
Symbols
^ the beginning of the string
^[AT][G][ATGC]
CCGCGCTGATT
AAGAT
AGAT
TGACA
TGA
TGCGGTCGATT
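The effect of ^ can be checked on the strings from the slide with a short sketch that compares the anchored and unanchored patterns.

```python
import re

strings = ["CCGCGCTGATT", "AAGAT", "AGAT", "TGACA", "TGA", "TGCGGTCGATT"]

# With ^ the three-character pattern must start at position 0.
anchored = [s for s in strings if re.search(r"^[AT][G][ATGC]", s)]
# Without ^ the pattern may be found anywhere in the string.
unanchored = [s for s in strings if re.search(r"[AT][G][ATGC]", s)]

print(anchored)
print(unanchored)
```

CCGCGCTGATT and AAGAT contain the pattern but do not start with it, so only the unanchored search finds them.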
Symbols
$ the end of the string
[AT][G][ATGC]$
CCGCGCTGATT
AAGAT
AGAT
TGACA
TGA
GGTCGATTTGC
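The same kind of check works for $, again using the strings from the slide.

```python
import re

strings = ["CCGCGCTGATT", "AAGAT", "AGAT", "TGACA", "TGA", "GGTCGATTTGC"]

# With $ the three-character pattern must be the last three characters.
at_end = [s for s in strings if re.search(r"[AT][G][ATGC]$", s)]
print(at_end)
```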
Symbols
+ one or more of the preceding pattern
AB+C
A AB+ AB+C
A B C
A B B C
A B B B B C
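A minimal sketch of the + quantifier; the test strings are those from the slide plus AC, which is added here as a counterexample.

```python
import re

tests = ["AC", "ABC", "ABBC", "ABBBBC"]
# B+ needs at least one B between A and C, so AC is rejected.
plus_matches = [s for s in tests if re.fullmatch(r"AB+C", s)]
print(plus_matches)
```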
Symbols
* zero or more of the preceding pattern
AB*C
A AB* AB*C
A C
A B C
A B B B C
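A minimal sketch of the * quantifier; BC is added here as a counterexample that lacks the leading A.

```python
import re

tests = ["AC", "ABC", "ABBBC", "BC"]
# B* allows zero B's, so AC is accepted; the leading A is still required.
star_matches = [s for s in tests if re.fullmatch(r"AB*C", s)]
print(star_matches)
```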
Symbols
? zero or one of the preceding pattern
COLOU?R
C CO COL COLO COLOU COLOU?R
C O L O R
C O L O U R
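A minimal sketch of the ? quantifier; COLOUUR is added here as a counterexample with two U's.

```python
import re

tests = ["COLOR", "COLOUR", "COLOUUR"]
# U? allows the U to be absent or present once, but not twice.
optional_matches = [s for s in tests if re.fullmatch(r"COLOU?R", s)]
print(optional_matches)
```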
Symbols
. any single character
. .A .AT
H A T
C A T
R A T
F A T
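A minimal sketch of the dot; AT is added here as a counterexample, since the dot must consume exactly one character.

```python
import re

tests = ["HAT", "CAT", "RAT", "FAT", "AT"]
# The dot stands for exactly one character, so the two-letter AT fails.
dot_matches = [s for s in tests if re.fullmatch(r".AT", s)]
print(dot_matches)
```

In Python the dot does not match a newline unless the `re.DOTALL` flag is set.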
Symbols
| or
TAT(A.|.A)T matches TATxyT, where x or y is A
() groups the alternatives (precedence)
Symbols
| or
TAT(A.|.A)T
[Figure: state diagram that branches after TAT into A. and .A, then rejoins at T; the . edges are labelled A,T,G,C]
T A T C A T (matches via .A)
T A T A G T (matches via A.)
T A T A A T (matches via either branch)
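A minimal sketch of alternation with grouping; TATCGT is added here as a counterexample in which neither variable position is A.

```python
import re

tests = ["TATCAT", "TATAGT", "TATAAT", "TATCGT"]
# The group tries "A then any character" first, then "any character then A".
alt_matches = [s for s in tests if re.fullmatch(r"TAT(A.|.A)T", s)]
print(alt_matches)
```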
Symbols
• 0-9 (inside […]) any digit 0 to 9
• A-Z (inside […]) any uppercase letter from A to Z
• a-z (inside […]) any lowercase letter from a to z
• [^…] matches any character other than those inside the brackets
• {n,m} match at least n and at most m of the preceding pattern
• {n,} match at least n of the preceding pattern
• {,m} match at most m of the preceding pattern (not accepted by every regex flavor; {0,m} is the portable form)
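The ranges, negation and counted repetition listed above can be sketched together in Python. The compiled pattern names and the test strings are illustrative, and `{0,3}` is used in place of `{,3}` because the explicit form is accepted more widely.

```python
import re

digits = re.compile(r"[0-9]{2,4}")        # 2 to 4 digits
upper = re.compile(r"[A-Z]{2,}")          # 2 or more uppercase letters
not_digit = re.compile(r"[^0-9]+")        # one or more non-digit characters
short_lower = re.compile(r"[a-z]{0,3}")   # at most 3 lowercase letters

def check(pattern, strings):
    # True where the whole string matches the pattern.
    return [bool(pattern.fullmatch(s)) for s in strings]

digit_results = check(digits, ["1", "123", "12345"])
upper_results = check(upper, ["A", "AB", "ABCDEF"])
negated_results = check(not_digit, ["abc", "a1c"])
short_results = check(short_lower, ["", "abc", "abcd"])
print(digit_results, upper_results, negated_results, short_results)
```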
Symbols
[^PRK][^RK]{4,}[RK]
1. Any character except P, R and K
2. Followed by at least 4 characters that are neither R nor K
3. Followed by R or K
Tryptic peptide with no missed cleavage
Minimum 6 aa.
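The tryptic-peptide pattern can be tried on a few candidate peptides. In this sketch VGTK and MPCTEDYLSLILNR come from the digestion example earlier in the deck, while AGSLIEK and AGKLIEEK are hypothetical strings invented for illustration.

```python
import re

tryptic = re.compile(r"[^PRK][^RK]{4,}[RK]")

candidates = ["VGTK", "MPCTEDYLSLILNR", "AGSLIEK", "AGKLIEEK"]
# fullmatch: the whole peptide must fit the pattern, so an internal R or K
# (a missed cleavage) or a length below 6 residues causes a rejection.
accepted = [p for p in candidates if tryptic.fullmatch(p)]
print(accepted)
```

VGTK is rejected for being shorter than 6 residues, and AGKLIEEK for containing an internal K.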
Symbols
[0-9A-Z]{3,6}_DANRE
1. At least 3 and at most 6 digits or uppercase letters
2. Followed by _DANRE
UniProt zebrafish entry
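A minimal sketch of the entry-name pattern. The strings below are illustrative test inputs chosen to exercise the pattern, not a list taken from UniProt.

```python
import re

entry = re.compile(r"[0-9A-Z]{3,6}_DANRE")

names = ["P53_DANRE", "AB_DANRE", "P53_HUMAN", "ABCDEFG_DANRE"]
# fullmatch: 3 to 6 digits/uppercase letters, then the literal suffix _DANRE.
zebrafish = [n for n in names if entry.fullmatch(n)]
print(zebrafish)
```

With `search` instead of `fullmatch`, ABCDEFG_DANRE would also be reported, because its last six prefix characters plus the suffix satisfy the pattern; anchoring with ^ and $ avoids that.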
Regular Expression flavors
• Basic regular expressions are normally supported by all utilities that support regular expressions.
• Some patterns need support for extended regular expressions.
Practicing Regex
An online tool for visually testing regex:
https://www.debuggex.com/