
GBIO009-1 - Bioinformatics


Homework 1 and 2 review session

Presented by

Kirill Bessonov

November 2012


HW1: classical Q & A (GenomeGraphs) (1)

• The first two questions were on Bioconductor libraries. There are 608 BioC packages
• To get the citation for a particular library, use citation("library_name")
• You were asked to get genomic data on a specific gene

    library(GenomeGraphs)
    #connect to the Ensembl database of human genes
    ensembl_Human_Genes = useMart("ensembl", dataset="hsapiens_gene_ensembl")
    #get info on the gene from the database using its Ensembl ID
    gene <- makeGene(id = "ENSG00000115145", type="ensembl_gene_id",
                     biomart = ensembl_Human_Genes)
    #get info on the transcripts
    transcript <- makeTranscript(id = "ENSG00000115145", type="ensembl_gene_id",
                                 biomart = ensembl_Human_Genes)
    gdPlot(list("gene"=gene, "transcripts"=transcript))
    #retrieve info from the database, displaying the first 25 entries
    getBM(c("ensembl_gene_id", "hgnc_symbol", "description"),
          filters=c("with_exon_transcript", "with_protein_id", "with_transcript_variation"),
          values=list(TRUE, TRUE, TRUE), ensembl_Human_Genes)[1:25,]
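As a small illustration of the citation() call mentioned above, the sketch below uses the base package stats so that it runs without Bioconductor; the package name is only a stand-in, and for the homework one would pass the library actually used (for example "GenomeGraphs", assuming it is installed).

```r
# citation() returns a bibentry-type object describing how to cite a package
ref <- citation("stats")   # "stats" ships with R; replace with the library you used

print(ref)                 # human-readable reference

# The same reference as BibTeX lines, handy for a written report
bib <- toBibtex(ref)
print(bib)
```

Calling citation() with no argument gives the reference for R itself.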


HW1: classical Q & A (GenomeGraphs) (2)

• What is the gene name (i.e. hgnc_symbol) and function represented by the Ensembl ID ENSG00000115145?

    geneInfo = getBM(c("ensembl_gene_id", "hgnc_symbol", "description"),
                     filters=c("with_exon_transcript", "with_protein_id", "with_transcript_variation"),
                     values=list(TRUE, TRUE, TRUE), ensembl_Human_Genes)

    > geneInfo[geneInfo$ensembl_gene_id == "ENSG00000115145", ]
         ensembl_gene_id hgnc_symbol description
    4829 ENSG00000115145 STAM2       signal transducing adaptor molecule (SH3 domain and ITAM motif) 2

• How many exons does the Ensembl ID ENSG00000115145 have? 51 exons, counted as the rows returned by attr(gene, "ens")

    > attr(gene, "ens")
       ensembl_gene_id ensembl_transcript_id ensembl_exon_id exon_chrom_start exon_chrom_end rank strand biotype
    1  ENSG00000115145 ENST00000263904       ENSE00001351655 153032117        153032506      1    -1     protein_coding
    2  ENSG00000115145 ENST00000263904       ENSE00002888710 153006659        153006743      2    -1     protein_coding
    ……
    48 ENSG00000115145 ENST00000494589       ENSE00002785037 153004538        153004636      3    -1     protein_coding
    49 ENSG00000115145 ENST00000494589       ENSE00002808134 153003676        153003822      4    -1     protein_coding
    50 ENSG00000115145 ENST00000494589       ENSE00002929781 153001402        153001471      5    -1     protein_coding
    51 ENSG00000115145 ENST00000494589       ENSE00001828491 153000503        153000527      6    -1     protein_coding
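The exon table has one row per exon of each transcript, so the row count and the number of distinct exon IDs need not agree if an exon is shared between transcripts. The sketch below shows both ways of counting on a hypothetical stand-in data frame (`ens`) that holds only the six rows visible on the slide; with the real data one would use attr(gene, "ens") instead.

```r
# Hypothetical stand-in for attr(gene, "ens"): only the rows shown on the slide
ens <- data.frame(
  ensembl_transcript_id = c("ENST00000263904", "ENST00000263904",
                            "ENST00000494589", "ENST00000494589",
                            "ENST00000494589", "ENST00000494589"),
  ensembl_exon_id = c("ENSE00001351655", "ENSE00002888710",
                      "ENSE00002785037", "ENSE00002808134",
                      "ENSE00002929781", "ENSE00001828491"),
  rank = c(1, 2, 3, 4, 5, 6),
  stringsAsFactors = FALSE
)

n_rows   <- nrow(ens)                             # exon rows over all transcripts
n_unique <- length(unique(ens$ensembl_exon_id))   # distinct exon IDs
per_transcript <- table(ens$ensembl_transcript_id) # exon rows per transcript

print(n_rows)
print(n_unique)
print(per_transcript)
```

Stating which of the two counts is reported makes the answer easier to check.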


HW1: classical Q & A (GenomeGraphs) (3)

• Execute the following command. How many chromosomes do you see?

    > getBM("chromosome_name","","", ensembl_Human_Genes)[c(1:22,433:435),1]
     [1] "1" "10" "11" "12" "13" "14" "15" "16" "17" "18" "19" "2" "20" "21" "22" "3" "4" "5" "6" "7" "8" "9" "MT" "X" "Y"

  25 chromosomes: 22 autosomes, the two sex chromosomes (X and Y) and one mitochondrial chromosome (MT)

• Why is the number of chromosomes in this Ensembl dataset greater than 23 chromosome pairs? What do "MT", "X" and "Y" refer to?

  Because the dataset also lists the mitochondrial chromosome (MT), and it lists the sex chromosomes X and Y separately, although X and Y can be grouped into a single pair

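The chromosome names above come back as character strings, and their order ("1", "10", "11", ... "2", "20", ...) is what one gets when such strings are sorted as text rather than as numbers. Whether the database sorts them this way is an assumption; the sketch below only reproduces the same order locally, using method = "radix" so that the result does not depend on the locale.

```r
# The 25 chromosome names: 22 autosomes, two sex chromosomes, mitochondrial DNA
chrom <- c(as.character(1:22), "X", "Y", "MT")

# Text sorting compares character by character, so "10" comes before "2"
sorted <- sort(chrom, method = "radix")

print(length(sorted))
print(sorted)
```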

HW2: Pairwise alignments (classical Q&A)


HW2: Pairwise alignments (classical Q&A) Q1

• Align the following DNA sequences globally using the Needleman–Wunsch algorithm.

• Use the following scoring rules: a) gap -5; b) match between two bases +5; c) mismatch between two bases +3


HW2: Pairwise alignments (classical Q&A) Q3

• Do a local protein alignment of the sequences HEAGAWGHEE and PAWHAE using the BLOSUM62 matrix. The scoring rules are: gap -8; matches and mismatches are scored with the BLOSUM62 matrix.
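A local alignment of this kind can be checked with a small Smith-Waterman sketch in base R. The names `blosum` and `local_align` are made up for this example, and the matrix is only the part of BLOSUM62 needed for the six residues involved, typed in by hand, so it should be checked against a full copy of the matrix before relying on it. The linear gap penalty of -8 is the one given in the question.

```r
# Hand-typed subset of BLOSUM62 for the residues A, E, G, H, P, W (verify before use)
aa <- c("A", "E", "G", "H", "P", "W")
blosum <- matrix(c( 4, -1,  0, -2, -1, -3,
                   -1,  5, -2,  0, -1, -3,
                    0, -2,  6, -2, -2, -2,
                   -2,  0, -2,  8, -2, -2,
                   -1, -1, -2, -2,  7, -4,
                   -3, -3, -2, -2, -4, 11),
                 nrow = 6, byrow = TRUE, dimnames = list(aa, aa))

local_align <- function(A, B, S, d = -8) {
  a <- strsplit(A, "")[[1]]
  b <- strsplit(B, "")[[1]]
  # Fill the matrix; unlike the global case, no cell may drop below zero
  H <- matrix(0, length(a) + 1, length(b) + 1)
  for (i in seq_along(a)) {
    for (j in seq_along(b)) {
      H[i + 1, j + 1] <- max(0,
                             H[i, j] + S[a[i], b[j]],  # align a[i] with b[j]
                             H[i, j + 1] + d,          # a[i] against a gap
                             H[i + 1, j] + d)          # b[j] against a gap
    }
  }
  # Trace back from the highest-scoring cell until a zero is reached
  pos <- which(H == max(H), arr.ind = TRUE)[1, ]
  i <- pos[["row"]]
  j <- pos[["col"]]
  alA <- character(0)
  alB <- character(0)
  while (i > 1 && j > 1 && H[i, j] > 0) {
    if (H[i, j] == H[i - 1, j - 1] + S[a[i - 1], b[j - 1]]) {
      alA <- c(a[i - 1], alA); alB <- c(b[j - 1], alB)
      i <- i - 1; j <- j - 1
    } else if (H[i, j] == H[i - 1, j] + d) {
      alA <- c(a[i - 1], alA); alB <- c("-", alB)
      i <- i - 1
    } else {
      alA <- c("-", alA); alB <- c(b[j - 1], alB)
      j <- j - 1
    }
  }
  list(score = max(H),
       A = paste(alA, collapse = ""),
       B = paste(alB, collapse = ""))
}

res <- local_align("HEAGAWGHEE", "PAWHAE", blosum)
print(res)
```

With the values above, the sketch reports a score of 19 for AWGHEE aligned to AW-HAE. Only one of possibly several equally good alignments is returned.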


HW2: Pairwise alignments (classical Q&A) Q5

Produce a dot plot of the human and mouse p53 proteins from the previous question and paste the plot below. Complete the lines of R code to get the dot plot.

• Are both proteins similar? Yes, very similar, since we see a clear diagonal covering >90% of the sequence length
• Where does/do the region(s) of greatest variation occur? Between positions 50 and 100
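The idea behind a dot plot can be shown in a few lines of base R, independently of whichever plotting function was used in the course. The sketch below is an assumption-laden toy: it uses the two short sequences from Q3 rather than the p53 proteins, and marks a dot wherever two residues are identical (no window or threshold).

```r
# Toy sequences taken from Q3; the p53 sequences would be used the same way
seqA <- "HEAGAWGHEE"
seqB <- "PAWHAE"
a <- strsplit(seqA, "")[[1]]
b <- strsplit(seqB, "")[[1]]

# TRUE wherever residue i of seqA equals residue j of seqB
dots <- outer(a, b, "==")
dimnames(dots) <- list(a, b)
print(dots * 1)

# Draw the matrix; similar sequences show up as a run of dots along a diagonal
if (interactive()) {
  image(x = seq_along(a), y = seq_along(b), z = dots * 1,
        col = c("white", "black"), xlab = seqA, ylab = seqB)
}
```

For long sequences, a break or shift in the diagonal marks an insertion or a region of low similarity.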


HW2: Pairwise alignments (classical Q&A) Q7

• What global alignment score do you get for the two p53 proteins when you use the BLOSUM62 alignment matrix, a gap opening penalty of -10 and a gap extension penalty of -0.5? Answer: a score of 1556

    library(seqinr)     #provides query() and getSequence()
    library(Biostrings) #provides pairwiseAlignment()
    query("p53_HUMAN", "AC=P04637")
    p53_HUMAN_seq = getSequence(p53_HUMAN)
    query("p53_MOUSE", "AC=P02340")
    p53_MOUSE_seq = getSequence(p53_MOUSE)
    globalAlign <- pairwiseAlignment(p53_HUMAN_seq, p53_MOUSE_seq,
                                     substitutionMatrix = "BLOSUM62",
                                     gapOpening = -10, gapExtension = -0.5)

• Errors: the R code was not stated, and the protein IDs (such as the UniProt ID P04637) were not given
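With separate opening and extension penalties, the cost of a gap depends on its length. The sketch below assumes one common convention, in which a gap of length k costs opening + k * extension; alignment tools differ on whether the first gap position also pays the extension penalty, so this should be checked against the documentation of the tool used. The function name `gap_cost` is made up for the example.

```r
# Assumed convention: a gap of length k costs open + k * extend (zero if no gap)
gap_cost <- function(k, open = -10, extend = -0.5) {
  ifelse(k > 0, open + k * extend, 0)
}

one_long_gap     <- gap_cost(3)       # a single gap of length 3
three_short_gaps <- 3 * gap_cost(1)   # three separate gaps of length 1

print(one_long_gap)
print(three_short_gaps)
```

Under this convention one long gap is much cheaper than several short ones, which is the purpose of the separate opening penalty.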


HW2: Computer style

Implementation of NW algorithm in R


HW2: Computer style (NW algorithm) [1]

• Given the pseudo-code, implement the NW algorithm in R
  – The algorithm has two parts
    • Calculation of the alignment F-matrix
    • Finding the optimal path(s) through the matrix

    for i=0 to length(A)
      F(i,0) ← d*i
    for j=0 to length(B)
      F(0,j) ← d*j
    for i=1 to length(A) {
      for j=1 to length(B) {
        Match  ← F(i-1,j-1) + S(Ai, Bj)
        Delete ← F(i-1, j) + d
        Insert ← F(i, j-1) + d
        F(i,j) ← max(Match, Insert, Delete)
      }
    }

d = gap penalty score
i and j = positions in the A and B sequences


HW2: Computer style (NW algorithm) [2]

    Fmatrix = function(A,B){
      fmatrix = matrix(0, nrow = nchar(A)+1, ncol = nchar(B)+1)
      d = -8 #this is the gap penalty
      for(i in 0 : nchar(A)){
        fmatrix[i+1,1] = d * i #populates the first column with gap penalties
      }
      for(j in 0 : nchar(B)){
        fmatrix[1,j+1] = d * j #populates the first row with gap penalties
      }
      for(i in 1 : nchar(A)){
        for(j in 1 : nchar(B)){
          #get the score for the current pair of aa or nt
          score = rules(substr(A,i,i), substr(B,j,j))
          match = fmatrix[i,j] + score
          delete = fmatrix[i,j+1] + d
          insert = fmatrix[i+1,j] + d
          fmatrix[i+1,j+1] = max(match,delete,insert)
        }
      }
      #label the columns and rows with the residues (the first one is blank)
      colnames(fmatrix) = strsplit(paste(" ", B, sep=""), "")[[1]]
      rownames(fmatrix) = strsplit(paste(" ", A, sep=""), "")[[1]]
      return(fmatrix)
    }

HW2: Computer style (NW algorithm) [3]

    #substitution matrix: match +2, mismatch -1
    s.matrix <- matrix(rep(0,16), nrow = 4, ncol = 4, byrow = TRUE,
                       dimnames = list(c("A","C","G","T"), c("A","C","G","T")))
    s.matrix["A",] = c(2,-1,-1,-1)
    s.matrix["C",] = c(-1,2,-1,-1)
    s.matrix["G",] = c(-1,-1,2,-1)
    s.matrix["T",] = c(-1,-1,-1,2)

    #score for one pair of nucleotides, looked up by name
    rules = function(a,b){
      s.matrix[a,b]
    }

    > s.matrix
       A  C  G  T
    A  2 -1 -1 -1
    C -1  2 -1 -1
    G -1 -1  2 -1
    T -1 -1 -1  2


HW2: Computer style (NW algorithm) [4]

• Check the F-matrix (gap -8, match +2, mismatch -1)

    > fmatrix = Fmatrix("ATCG", "TG")
    > fmatrix
             T   G
        0   -8 -16
    A  -8   -1  -9
    T -16   -6  -2
    C -24  -14  -7
    G -32  -22 -12

• Start finding the optimal path(s) through the matrix

    A = "ATCG"
    B = "TG"
    d = -8 #same gap penalty as inside Fmatrix
    AlignmentA = ""
    AlignmentB = ""
    i = nchar(A) + 1
    j = nchar(B) + 1

    while(i > 1 && j > 1){
      CurrentScore = fmatrix[i,j] #get score at current position of F-matrix
      #what is around that F-matrix cell?
      ScoreDiag = fmatrix[i - 1, j - 1]
      ScoreUp = fmatrix[i, j - 1]
      ScoreLeft = fmatrix[i - 1, j]
      #...the body of the loop follows


HW2: Computer style (NW algorithm) [5]

• Selecting the bottom right cell and starting to trace back the path of the optimal alignment

    AlignmentA = ""
    AlignmentB = ""
    while(i > 1 && j > 1){
      #which cell of the F-matrix am I in now?
      CurrentScore = fmatrix[i,j]
      ScoreDiag = fmatrix[i - 1, j - 1]
      ScoreUp = fmatrix[i, j - 1]   #same row, previous column
      ScoreLeft = fmatrix[i - 1, j] #previous row, same column
      #considering the score came from the diagonal
      #on the diagonal path: diagonal cell + score of the current pair of residues
      #(cell [i,j] corresponds to residue i-1 of A and residue j-1 of B)
      if (CurrentScore == ScoreDiag + s.matrix[substr(A,i-1,i-1), substr(B,j-1,j-1)]){
        AlignmentA = paste(substr(A,i-1,i-1), AlignmentA, sep = "")
        AlignmentB = paste(substr(B,j-1,j-1), AlignmentB, sep = "")
        i = i - 1
        j = j - 1
      }


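HW2: Computer style (NW algorithm) [6]

      #considering if the score comes from ScoreLeft, i.e. fmatrix[i-1,j] (introducing a gap in B)
      else if(CurrentScore == ScoreLeft + d){
        AlignmentA = paste(substr(A,i-1,i-1), AlignmentA, sep = "")
        AlignmentB = paste("-", AlignmentB, sep = "")
        i = i - 1
      }
      #considering if the score comes from ScoreUp, i.e. fmatrix[i,j-1] (introducing a gap in A)
      else if(CurrentScore == ScoreUp + d){
        AlignmentA = paste("-", AlignmentA, sep = "")
        AlignmentB = paste(substr(B,j-1,j-1), AlignmentB, sep = "")
        j = j - 1
      }
    } #end of the while loop

    #the loop stops at the first row or column: align what is left against gaps
    while(i > 1){
      AlignmentA = paste(substr(A,i-1,i-1), AlignmentA, sep = "")
      AlignmentB = paste("-", AlignmentB, sep = "")
      i = i - 1
    }
    while(j > 1){
      AlignmentA = paste("-", AlignmentA, sep = "")
      AlignmentB = paste(substr(B,j-1,j-1), AlignmentB, sep = "")
      j = j - 1
    }

    print(AlignmentA)
    print(AlignmentB)
    finalScore = fmatrix[(nchar(A)+1),(nchar(B)+1)]
    cat("Final score :", finalScore, "\n")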


HW2: Computer style (NW algorithm) [7]

• The scoring matrices could have been accessed through character indices, which requires no conversion and makes the code faster

• How would one output more than one best possible alignment?

• Please use more comments in your R code
• It would be nice to see the trace-backs visually
• Also, the scoring rules were not stated clearly
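One possible answer to the question about several best alignments is to follow every direction that explains the score of a cell during the trace-back, instead of only the first one. The sketch below is one way to do this and is not taken from the homework solutions; the names `S`, `nw_matrix` and `all_alignments` are made up, and the scoring (match +2, mismatch -1, gap -8) is assumed from the earlier slides. The substitution matrix is indexed directly by character, as suggested in the first comment above.

```r
# Assumed scoring: match +2, mismatch -1, gap -8
nt <- c("A", "C", "G", "T")
S <- matrix(-1, 4, 4, dimnames = list(nt, nt))
diag(S) <- 2
d <- -8

# F-matrix of the global alignment (Needleman-Wunsch)
nw_matrix <- function(a, b) {
  Fm <- matrix(0, length(a) + 1, length(b) + 1)
  Fm[, 1] <- d * (0:length(a))
  Fm[1, ] <- d * (0:length(b))
  for (i in seq_along(a)) {
    for (j in seq_along(b)) {
      Fm[i + 1, j + 1] <- max(Fm[i, j] + S[a[i], b[j]],
                              Fm[i, j + 1] + d,
                              Fm[i + 1, j] + d)
    }
  }
  Fm
}

# Collect every alignment that reaches the optimal score
all_alignments <- function(A, B) {
  a <- strsplit(A, "")[[1]]
  b <- strsplit(B, "")[[1]]
  Fm <- nw_matrix(a, b)
  out <- character(0)
  walk <- function(i, j, alA, alB) {
    if (i == 1 && j == 1) {            # top-left corner: one alignment is complete
      out <<- c(out, paste(alA, alB, sep = "\n"))
      return(invisible(NULL))
    }
    # No 'else' between the tests: every move that explains the score is followed
    if (i > 1 && j > 1 && Fm[i, j] == Fm[i - 1, j - 1] + S[a[i - 1], b[j - 1]])
      walk(i - 1, j - 1, paste0(a[i - 1], alA), paste0(b[j - 1], alB))
    if (i > 1 && Fm[i, j] == Fm[i - 1, j] + d)
      walk(i - 1, j, paste0(a[i - 1], alA), paste0("-", alB))
    if (j > 1 && Fm[i, j] == Fm[i, j - 1] + d)
      walk(i, j - 1, paste0("-", alA), paste0(b[j - 1], alB))
    invisible(NULL)
  }
  walk(nrow(Fm), ncol(Fm), "", "")
  list(score = Fm[nrow(Fm), ncol(Fm)], alignments = out)
}

res_single <- all_alignments("ATCG", "TG")  # one optimal alignment
res_tie    <- all_alignments("AA", "A")     # two equally good alignments
cat(res_single$alignments, sep = "\n\n")
cat(res_tie$alignments, sep = "\n\n")
```

For "ATCG" and "TG" there is a single best alignment (ATCG over -T-G, score -12), while "AA" and "A" give two, because the single A can be paired with either position. The number of optimal alignments can grow quickly for long sequences, so in practice one may want to cap how many are collected.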