Upload
yannick-wurm
View
946
Download
0
Embed Size (px)
Citation preview
Regular expressions (regex): Text search on steroids.
Regular expression FindsDavid David
Dav(e|(id)) David, DaveDav(e|(id)|(ide)|o) David, Dave, Davide, Davo
At{1,2}enborough Attenborough, Atenborough
Atte[nm]borough Attenborough, Attemborough
At{1,2}[ei][nm]bo{0,1}ro((ugh)|w){0,1}Atimbro,
attenbrough,ateinborow
Easy counting, replacing all with “Sir David Attenborough”
Regex Special symbolsRegular expression Finds Example
[aeiou] any single vowel “e”
[aeiou]* between 0 and infinity vowels vowels, e.g.’ “eeooouuu"
[aeoiu]{1,3} between 1 and 3 vowels “oui”
a|i one of the 2 characters “"
((win)|(fail)) one of the two words in () fail
More Regex Special symbols
• Google “Regular expression cheat sheet”• ?regexp
Synonymous with[:digit:] [0-9]
[A-z] [A-z], ie [A-Za-z]
\s whitespace
. any single character
.+ one to many of anything
b* between 0 and infinity letter ‘b’
[^abc] any character other than a, b or c.
\( (
[:punct:] any of these: ! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { |
You want to scan a protein sequence database for a particular binding site. Type a single regular expression that will match the first two of the following peptide sequences,
but NOT the last one:
"HATSOMIKTIP""HAVSONYYIKTIP""HAVSQMIKTIP"
Variants of a microsatellite sequence are responsible for differential expression of vasopressin receptor, and in turn for
differences in social behaviour in voles & others. Create a regular expression that finds AGAGAGAGAGAGAGAG dinucleotide
microsatellite repeats with lengths of 5 to 500
Again
Make a regular expression
• matching “LMTSOMIKTIP” and “LMVSONYYIKTIP” but not “LMVSQMIKTIP”
• matching all variants of “ok” (e.g., “O.K.”, “Okay”…)
Which species names include ‘y’?Create a vector with only species names, but replace all ‘y’ with ‘Y!
ants <- read.table("https://goo.gl/3Ek1dL") colnames(ants) <- c("genus", "species")
Remove all vowels
Replace all vowels with ‘o’
Functions• R has many. e.g.: plot(), t.test()
• Making your own:
tree_age_estimate <- function(diameter, species) { growth_rate <- growth_rates[ species ] age_estimate <- diameter / growth_rate return(age_estimate)}
> tree_age_estimate(25, “White Oak”)+ 66> tree_age_estimate(60, “Carya ovata”)+ 190
“for” Loop
> possible_colours <- c('blue', 'cyan', 'sky-blue', 'navy blue', 'steel blue', 'royal blue', 'slate blue', 'light blue', 'dark blue', 'prussian blue', 'indigo', 'baby blue', 'electric blue')
> possible_colours [1] "blue" "cyan" "sky-blue" "navy blue" [5] "steel blue" "royal blue" "slate blue" "light blue" [9] "dark blue" "prussian blue" "indigo" "baby blue" [13] "electric blue"
> for (colour in possible_colours) {+ print(paste("The sky is oh so, so", colour))+ }
[1] "The sky is so, oh so blue"[1] "The sky is so, oh so cyan"[1] "The sky is so, oh so sky-blue"[1] "The sky is so, oh so navy blue"[1] "The sky is so, oh so steel blue"[1] "The sky is so, oh so royal blue"[1] "The sky is so, oh so slate blue"[1] "The sky is so, oh so light blue"[1] "The sky is so, oh so dark blue"[1] "The sky is so, oh so prussian blue"[1] "The sky is so, oh so indigo"[1] "The sky is so, oh so baby blue"[1] "The sky is so, oh so electric blue"
for (letter in LETTERS) { begins_with <- paste("^", letter, sep="") matches <- grep(pattern = begins_with, x = ants$genus) print(paste(length(matches), "begin with", letter))}
> LETTERS [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q" "R" "S"[20] "T" "U" "V" "W" "X" "Y" "Z"> ants <- read.table("https://goo.gl/3Ek1dL")> colnames(ants) <- c("genus", “species")> head(ants) genus species1 Anergates atratulus2 Camponotus sp.3 Crematogaster scutellaris4 Formica aquilonia5 Formica cunicularia6 Formica exsecta
What does this loop do?
Jasmin Zohren
Kim Warren
Bruno Vieira
Rodrigo Pracana
Leandro Santiago
JamesWright
Jingyuan Zhu
Hernani Oliveira
Andrea Hatlen