Bioinformatics Take Home Test #3 - Gogarten Lab | UConn (gogarten.uconn.edu/mcb3421_2013/TakeHomeExam03_2013.pdf)

Name: Are you a graduate or undergraduate student? Please circle one.

Bioinformatics Take Home Test #3
Due Date: Monday 10/7/2013 before class

(This is an open book exam based on the honor system -- you can use notes, lecture notes, online manuals, and textbooks. Teamwork is not allowed on the exams; write down your own answers, and do not cut and paste from webpages. If your answer uses a citation, give the source of the quoted text.)

Notes on Formatting Quizzes: Please make sure each answer is on only one page, by using page breaks. Splitting an answer across two pages tends to lead to grading errors. Please do not write or type in a font smaller than 12 point, and do not write in cursive. If you submit your quiz via email, please remove the instructions and extras (blank lines, alternative answers for multiple choice questions) from your document, so that only your answers, a minimal amount of white space, and optionally the questions, are left.

1. 1pt True/False Databanks with a gatekeeper are slow in incorporating new data and accurate, while databanks without a gatekeeper are fast and inaccurate.

2. 1pt True/False Without selection for function there would be no similarity in the pairwise comparison of homologous proteins from an Archaeon and a Eukaryote.

3. 1pt True/False Command line versions of the BLAST programs are available for all operating systems.

4. 1pt True/False When searching a database with a query sequence, the P-value is proportional to the size of the databank and can be larger than 1.
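
For reference, the usual relation between the two statistics can be sketched as follows. This assumes the Poisson model commonly used for BLAST statistics, in which the probability of at least one chance hit is P = 1 - exp(-E); the function name is illustrative and not part of any BLAST package.

```python
import math

def p_value_from_e_value(e_value):
    """Probability of at least one chance hit, assuming hits follow a
    Poisson distribution with mean equal to the E-value."""
    if e_value < 0:
        raise ValueError("E-values cannot be negative")
    # -expm1(-E) equals 1 - exp(-E) but stays accurate for very small E.
    return -math.expm1(-e_value)

if __name__ == "__main__":
    for e in (1e-15, 1e-5, 1.0, 12.0):
        print(e, p_value_from_e_value(e))
```

Under this assumption the two values are nearly identical when E is small and diverge as E grows.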

5. 1pt True/False RNA alone never has catalytic activity. To be catalytically active it always needs to collaborate with proteins. RNA only provides specificity due to base pairing.

6. 1pt In a BLAST search, what does the filter for low-complexity do? A) It allows retrieving of "Warning Sequences" that are part of the databank and alert you to the fact that your query is of low complexity. B) It replaces regions of low complexity in the databank with the symbol for any residue. C) It replaces regions of low complexity in the query sequence with the symbol for any residue. D) None of the above.

7. 1pt Usually E values smaller than a certain threshold are considered to demonstrate homology. This threshold is usually A) about 10^4, B) about 10^-4, C) about 10^-40

8. 1pt If you want to BLAST the non-redundant database using a new protein sequence as query, which is the BEST search program to use? A) blastp, B) blastn, C) tblastx, D) blastx, E) PRSS, F) blastq

9. 1pt Usually, a Z value of which magnitude is considered to demonstrate homology? A) larger than 3 B) smaller than 10^-4 C) smaller than 3 D) This can only be determined by the distribution of alignment scores when shuffling the data

10. 1pt If you load a multiple sequence FASTA formatted file into an alignment program and the program only recognizes a single sequence, what could have gone wrong? A) the program expects the sequences to be in Clustal format with the word “CLUSTAL” written in the header. B) the “>” signs at the beginning of the annotation line are not part of the ASCII code C) the text file used different end of line conventions than the alignment program D) the program expects the sequences to follow the extended NCBI convention for FASTA formatted sequences. This convention requires a * at the end of the sequence.

11. 1pt What is a GI number? A) A unique number that is given to every submitted sequence. If the sequence is changed, it retains this number. This makes it easy to track changes that occurred to a sequence. B) A unique number that is given to every submitted sequence. If the sequence is changed, a suffix is added to the number. This makes it easy to track changes that occurred to a sequence. C) A unique number that is given to every submitted sequence. If the sequence is changed, it receives a new GI number. D) The Genomic Isoform number is given to every type of enzyme. This number provides easy access to enzymes from different organisms that have the same or similar function.

12. 1pt What advantages does a command line interface have over a graphical user interface? A. One can point their way through programming. B. It offers less control and no one wants control. C. One can write a program to run on the command line that does your work for you. D. It identifies cases of homology with 100% accuracy, all the time, every time.

13. 1pt When aligning two sequences that are about 80% identical, which of the following scoring matrices would be most appropriate: (A) PAM 1 (B) PAM 8 (C) PAM 25 (D) PAM 200 (E) PAM 0.8

14. 1pt If you want to align two sequences that are about 90% identical, which of the following scoring matrices would be most appropriate: (A) Blosum 35 (B) Blosum 80 (C) Blosum 90 (D) Blosum 65 (E) Blosum 10

15. 1pt You do a databank search using FASTA with an amino acid sequence as query. The only reported match has an E-value of 0.00001. What does this mean for the homology of the two sequences? A) this proves (beyond reasonable doubt) that the target sequence is not homologous to the query B) the target sequence is a candidate for a homologous sequence, but an E-value of this magnitude does not prove homology C) This proves (beyond reasonable doubt) that the two sequences are homologs. D) None of the above

16. 1pt Using a random shuffling approach (using PRSS) you find that two sequences have an E value (assuming 10000 comparisons) of 710. This A) proves homology B) disproves homology C) does not exclude the possibility that the two sequences might be homologous D) proves sequence similarity, but not homology

17. 1pt Some students still have difficulty distinguishing between the terms homology (= shared ancestry) and significant similarity. Which of the following statements is correct: A) All complex sequences that show significant similarity in a pairwise sequence comparison are homologous. B) All homologous sequences show significant similarity in a pairwise sequence comparison. C) Both of the above statements are correct

18. 1pt If BLAST returns a match with an E-value of 3.7 e-15, what is the probability that this match represents a false positive? A) 0 B) 3.7 e-15 C) 3.7 × 10^-15 D) The rate of false positives cannot easily be estimated.

19. 1pt In the above example, what is the frequency of false negatives in the databank? A) 0 B) 3.7 e-15 C) 3.7 × 10^-15 D) The rate of false negatives cannot easily be estimated.

20. 1pt What are two of the most commonly used scoring matrices for data bank searches and for aligning protein sequences? A) Gonnet and JTT B) GTR and Blosum C) PAM and Blosum D) none of the above, explain:

21. 1pt In a multiple sequence FASTA file the A) conserved residues are indicated by a *, conservative substitutions are indicated by a double colon or a period B) end of a sequence is indicated by a ! C) a new sequence always begins following an annotation line; the annotation line begins with a ">" character D) Sequence names are not allowed to contain a question mark ? E) All of the above
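
For readers who want to inspect a multiple sequence FASTA file programmatically, the following is a minimal sketch of a reader. It assumes the common convention that each record starts with an annotation line beginning with ">" and that sequence lines may be wrapped; `parse_fasta` is a hypothetical helper, not a function from any alignment package. It relies on Python's `str.splitlines()`, which accepts Unix, Windows, and old Mac end-of-line conventions.

```python
def parse_fasta(text):
    """Return a list of (annotation, sequence) tuples from FASTA-formatted text."""
    records = []
    header = None
    chunks = []
    # splitlines() splits on "\n", "\r\n", and "\r" alike.
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new annotation line closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records

if __name__ == "__main__":
    example = ">seqA example\nMKTLLV\nAGG\n>seqB example\nMKSLIV\n"
    for name, seq in parse_fasta(example):
        print(name, len(seq))
```

Counting the returned records is a quick way to check whether a file contains as many sequences as expected.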

22. 1pt If you do a databank search using FASTA with an amino acid sequence as query and the only reported match has an E-value of 12, what does this mean for the homology of the two sequences? A) This proves (beyond reasonable doubt) that the two sequences are homologs. B) the target sequence is a candidate for a homologous sequence, but an E-value of this magnitude does not prove homology. C) this proves (beyond reasonable doubt) that the target sequence is not homologous to the query. D) None of the above.

23. 1pt Comparing sequence A to sequence B you obtain an alignment that matches sequences A and B over their whole length. The P-value for this alignment is < 10^-9. Sequence B also has a significant match to sequence C (P < 10^-12). You consider these P-values as sufficient proof for homology. A) This shows that sequence A is homologous to sequence C B) These findings cannot be used to infer homology between sequences A and C C) This is suggestive of homology between A and C, but to be sure you need to calculate the P-value for the match between A and C. D) None of the above.

24. 1pt You align two sequences that are about 20% identical. Which of the following scoring matrices would be most appropriate: (A) PAM 1 (B) PAM 8 (C) PAM 25 (D) PAM 200

25. 1pt You do a pairwise BLAST comparison of two protein sequences. You know that these sequences fulfill the same function in the two organisms, but BLAST2SEQ does not report any significant matches. What could you do to visualize the existing similarity? A) Increase the “expect value”. B) Use the encoding nucleotide sequence. C) Turn off the low complexity filter D) A + B; E) B + C; F) A + C

26. 2 pts Some of the following can be done with BLAST and some of them should NOT be done. Sort them into the correct bin below.

   1. Determine which type of gene a sequence belongs to (if there is a sufficiently strong match)
   2. Detect Horizontal Gene Transfers
   3. Find genes that might have been acquired by horizontal gene transfer
   4. Determine which species a sequence belongs to (if there is a sufficiently strong match)
   5. Change the substitution matrix used to calculate pairwise alignments
   6. Show homology between two sequences (if there is a sufficiently strong match)
   7. Prove that there is NO homology between two sequences (if there is a sufficiently weak match)
   8. Remove regions of low complexity or include them
   9. Create multiple sequence alignments (with more than 2 sequences)

A. Numbers corresponding to things that can be done with BLAST:

B. Numbers corresponding to things that should NOT be done with BLAST:

27. 2pt The following diagram gives the relation between the number of substitutions that have occurred during evolution (x axis) and the observed fraction of sequence differences. The depicted curve corresponds to the Jukes-Cantor correction for a nucleotide sequence. This correction is only valid if all sites have the same probability of undergoing a substitution, and if all nucleotides occur with the same frequency.

a. Provide a rough sketch of how this relationship would change if different sites had different substitution frequencies, and some sites only very rarely underwent a substitution.

b. Using a different color, indicate how the curve would change if the sequences have a strong compositional bias (e.g., 40% A, 40% T, 10% G, 10% C), but all sites have the same probability of undergoing a substitution event. (If combinatorics is not your expertise, it might help to think about a sequence that only consists of As and Ts.)
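
As a reference for the baseline curve only, the standard Jukes-Cantor relation can be sketched as below. It assumes equal base frequencies and equal rates at all sites, exactly the conditions stated in the question; the function names are illustrative.

```python
import math

def jc_observed_fraction(d):
    """Expected fraction of differing sites after d substitutions per site
    under the Jukes-Cantor model: p = 3/4 * (1 - exp(-4d/3))."""
    return 0.75 * (1.0 - math.exp(-4.0 * d / 3.0))

def jc_distance(p):
    """Jukes-Cantor corrected distance (substitutions per site) for an
    observed difference fraction p: d = -3/4 * ln(1 - 4p/3)."""
    if not 0.0 <= p < 0.75:
        raise ValueError("p must satisfy 0 <= p < 0.75 under Jukes-Cantor")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

if __name__ == "__main__":
    # The observed fraction saturates toward 0.75 as d increases.
    for d in (0.1, 0.5, 1.0, 2.0, 5.0):
        print(d, round(jc_observed_fraction(d), 4))
```

The two functions are inverses of each other, so `jc_distance(jc_observed_fraction(d))` returns `d` up to floating-point error.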

28. 1pt You have a collection of 10000 genes and you perform a databank search with each of them, aiming for an overall probability of identifying a false positive of 5%. Using the Bonferroni correction, which E-value do you apply to the individual databank search?
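
The general form of the Bonferroni correction can be sketched as follows: the desired overall error rate is divided by the number of independent tests. The function name and the numbers in the usage example are hypothetical and chosen only for illustration.

```python
def bonferroni_threshold(overall_alpha, n_tests):
    """Per-test significance threshold under the Bonferroni correction."""
    if n_tests <= 0:
        raise ValueError("n_tests must be a positive integer")
    if not 0.0 < overall_alpha <= 1.0:
        raise ValueError("overall_alpha must be in (0, 1]")
    return overall_alpha / n_tests

if __name__ == "__main__":
    # Hypothetical example: 200 searches with an overall error rate of 1%.
    print(bonferroni_threshold(0.01, 200))
```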

29. 1pt You compare two circular genomes from closely related organisms, each with a single origin of replication (ori), using a Genome Dot Plot based on BLAST hits (green dots) and top scoring BLAST hits (red Xs). See the plot comparing two Frankia genomes below. The numbering of the genomes begins with the origin of replication. You observe a red diagonal in the lower left and the upper right hand corner, but the central part of the plot looks pretty random. To which part of the circular genome does the central part of the plot correspond? (Indicate the region in the sketch below.)

Extra credit: Up to 3pt Describe a few processes (max two sentences each) that in your opinion go beyond the simplest definition of natural selection (offspring similar to parents but random inherited variation, more offspring than necessary for replacement, selection due to limited resources).

[Sketch for question 29: circular genome with the origin of replication (ori) marked]

For Graduate Students:

30. 1pt You want to find all transposons in a completely sequenced genome. You know the amino acid sequence of the transposase as well as the encoding nucleotide sequence. Which of the following programs (in real life you should consider psi-blast, which we will discuss later) would be the best choice to find all transposons in the genome, including those whose Open Reading Frames are decaying? (For each briefly indicate the advantage/disadvantage) A) blastp B) blastn C) blastx D) tblastx E) tblastn

31. 4pts Answer the following questions about LUCA and the late heavy bombardment:

a. What caused the late heavy bombardment? How do we know?

b. What impact did it have on Earth in terms of the habitability for life?

c. Was LUCA a thermophilic or mesophilic organism then? How do we know?

d. Was the Last Common Ancestor of the Bacteria a thermophilic or mesophilic organism? Was the Last Common Ancestor of the Archaea a thermophilic or mesophilic organism? Relate this back to the late heavy bombardment.