30
Regular expressions (regex): Text search on steroids.

2015 10-7-9am regex-functions-loops.key

Embed Size (px)

Citation preview

Regular expressions (regex): Text search on steroids.

Regular expressions (regex): Text search on steroids.

Regular expression FindsDavid David

Dav(e|(id)) David, DaveDav(e|(id)|(ide)|o) David, Dave, Davide, Davo

At{1,2}enborough Attenborough, Atenborough

Atte[nm]borough Attenborough, Attemborough

At{1,2}[ei][nm]bo{0,1}ro((ugh)|w){0,1}Atimbro,

attenbrough,ateinborow

Easy counting, replacing all with “Sir David Attenborough”

Regex Special symbolsRegular expression Finds Example

[aeiou] any single vowel “e”

[aeiou]* between 0 and infinity vowels vowels, e.g.’ “eeooouuu"

[aeoiu]{1,3} between 1 and 3 vowels “oui”

a|i one of the 2 characters “"

((win)|(fail)) one of the two words in () fail

More Regex Special symbols

• Google “Regular expression cheat sheet”• ?regexp

Synonymous with[:digit:] [0-9]

[A-z] [A-z], ie [A-Za-z]

\s whitespace

. any single character

.+ one to many of anything

b* between 0 and infinity letter ‘b’

[^abc] any character other than a, b or c.

\( (

[:punct:] any of these: ! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { |

You want to scan a protein sequence database for a particular binding site. Type a single regular expression that will match the first two of the following peptide sequences,

but NOT the last one:

"HATSOMIKTIP""HAVSONYYIKTIP""HAVSQMIKTIP"

(rubular)

Variants of a microsatellite sequence are responsible for differential expression of vasopressin receptor, and in turn for

differences in social behaviour in voles & others. Create a regular expression that finds AGAGAGAGAGAGAGAG dinucleotide

microsatellite repeats with lengths of 5 to 500

Again

Make a regular expression

• matching “LMTSOMIKTIP” and “LMVSONYYIKTIP” but not “LMVSQMIKTIP”

• matching all variants of “ok” (e.g., “O.K.”, “Okay”…)

Ok… so how do we use this?

• ?grep

• ?gsub

Which species names include ‘y’?Create a vector with only species names, but replace all ‘y’ with ‘Y!

ants <- read.table("https://goo.gl/3Ek1dL") colnames(ants) <- c("genus", "species")

Remove all vowels

Replace all vowels with ‘o’

Functions

Functions• R has many. e.g.: plot(), t.test()

• Making your own:

tree_age_estimate <- function(diameter, species) { growth_rate <- growth_rates[ species ] age_estimate <- diameter / growth_rate return(age_estimate)}

> tree_age_estimate(25, “White Oak”)+ 66> tree_age_estimate(60, “Carya ovata”)+ 190

Make a function• That converts fahrenheit to celsius

(subtract 32 then divide the result by 1.8)

Loops

“for” Loop

> possible_colours <- c('blue', 'cyan', 'sky-blue', 'navy blue', 'steel blue', 'royal blue', 'slate blue', 'light blue', 'dark blue', 'prussian blue', 'indigo', 'baby blue', 'electric blue')

> possible_colours [1] "blue" "cyan" "sky-blue" "navy blue" [5] "steel blue" "royal blue" "slate blue" "light blue" [9] "dark blue" "prussian blue" "indigo" "baby blue" [13] "electric blue"

> for (colour in possible_colours) {+ print(paste("The sky is oh so, so", colour))+ }

[1] "The sky is so, oh so blue"[1] "The sky is so, oh so cyan"[1] "The sky is so, oh so sky-blue"[1] "The sky is so, oh so navy blue"[1] "The sky is so, oh so steel blue"[1] "The sky is so, oh so royal blue"[1] "The sky is so, oh so slate blue"[1] "The sky is so, oh so light blue"[1] "The sky is so, oh so dark blue"[1] "The sky is so, oh so prussian blue"[1] "The sky is so, oh so indigo"[1] "The sky is so, oh so baby blue"[1] "The sky is so, oh so electric blue"

What does this loop do?for (index in 10:1) { print(paste(index, "mins befo lunch"))}

Again

• What does the following code do (decompose on pen and paper)

for (letter in LETTERS) { begins_with <- paste("^", letter, sep="") matches <- grep(pattern = begins_with, x = ants$genus) print(paste(length(matches), "begin with", letter))}

> LETTERS [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q" "R" "S"[20] "T" "U" "V" "W" "X" "Y" "Z"> ants <- read.table("https://goo.gl/3Ek1dL")> colnames(ants) <- c("genus", “species")> head(ants) genus species1 Anergates atratulus2 Camponotus sp.3 Crematogaster scutellaris4 Formica aquilonia5 Formica cunicularia6 Formica exsecta

What does this loop do?

Jasmin Zohren

Kim Warren

Bruno Vieira

Rodrigo Pracana

Leandro Santiago

JamesWright

Jingyuan Zhu

Hernani Oliveira

Andrea Hatlen

Programming in R?

If/else

Logical Operators

going further