
Second bioinformatics lab:Exercise on disease:clasfaculty.ucdenver.edu/bstith/bioinformaticsprotocol2.doc · Web viewThen go to your Word file and make sure that you are on the stop


Second bioinformatics lab: Exercise on disease (developed in part by Sarah C. R. Elgin, Washington University)

It is well known that smoking increases the risk of lung cancer, but how does genetics play into that risk? The transformation of a normal cell into a cancerous cell can result from many causes. In one model, factors that increase the rate of mutation in DNA increase the chance that a proto-oncogene (the normal form of a gene) will be mutated into an oncogene (a cancer-causing gene), transforming a normal cell into a cancerous cell.

In this module, you will examine one proto-oncogene, K-Ras, which has been associated with many cancers, including lung cancer. You will examine a cDNA sequence for K-Ras that contains a mutation and analyze the mutant K-Ras protein using the bioinformatics tools presented in lab. You will investigate the mutation and find out what is known, if anything, about its biological impact.

See our textbook for a discussion of Ras: pages 407, 412, and 596-597 (6th ed). In summary, insulin or a growth factor binds to a receptor, and two bound receptors come together (dimerization) and activate each other. Each growth factor receptor is a tyrosine kinase that puts a phosphate on (i.e., phosphorylates) tyrosines (a particular amino acid) located on the other receptor’s tail. Thus, the receptors phosphorylate each other to activate each other. Next, adaptor proteins (GRB2, which has an SH2 domain that binds to phosphorylated tyrosines, and SOS, which is a guanine nucleotide exchange factor or GEF) come in and are activated by binding to phosphorylated tyrosines located on the receptor. The adaptor proteins then activate Ras. Ras is “off” when GDP is bound, but “on” when a new GTP comes in and displaces the old GDP. Ras activates Raf, which activates MEK, which activates MAP kinase (one type of MAP kinase is ERK, the kinase that we studied earlier). ERK turns on transcription factors that turn on certain genes required for cell division. Fig. 19-41 is shown (see also Fig. 14-18). In a cancerous cell, the mutant Ras cannot be shut off, and the cell divides again and again to form a tumor.


Working with Primary Protein Structure Information

Using the tools from the last lab, we will search for your K-Ras gene in the Gene database. Gene will contain the RefSeq sequence for your protein, which you will download in FASTA format. FASTA format is defined in your Glossary. Be sure to review the FASTA definition before you begin your work. QUESTIONS ARE POSTED AT THE END. TAKE GOOD NOTES AND TRY TO ANSWER SOME OF THEM AS YOU GO ALONG (or else you will have to redo some of this work). Ask me about the answers to these questions if you do not know an answer and have made a good attempt. Some of these questions may be on the final exam.

You will continue to learn about your protein using the SwissProt database. The SwissProt database is maintained by the Swiss Institute of Bioinformatics and contains entries for thousands of proteins. You can search for the K-Ras protein that you are studying by using the gene name given in Gene. The SwissProt entry contains some of the same information that you found in Gene, but it also contains a lot of information about the protein sequence, structure, and function, summarized in a short, easy-to-read format.

The ultimate goal for today’s lab is to create a multiple sequence alignment for your protein using ClustalW. You will use this alignment to identify the protein mutation, to observe regions of high sequence conservation, and to make an evolutionary tree for Ras. Identifying the protein mutation matters because it helps explain the link between the protein’s structure and cancer. The regions of high sequence conservation matter because they often correspond to regions that are important to the protein’s function (e.g., active site, regulatory region).

Part 1 – Obtaining the basics: Getting sequence information and viewing the SwissProt and GenBank entries for your protein.

Directions: Follow this guide sheet and answer the questions at the end of this file.

Translating your patient’s cDNA

1. Below is the mutant cDNA sequence for a patient with a cancer caused by a mutant K-Ras protein. Highlight the whole sequence and then hit control-C to copy it to the clipboard. BTW, we know that this is DNA since there are no U bases (only RNA has U; DNA uses T instead). Also, we always report the coding strand sequence (not the template strand that actually helps make the mRNA). Furthermore, this cDNA came from RT-PCR, since the sequence ends in polyA (polyA tails are found in mRNA).

GGCCGCGGCGGCGGAGGCAGCAGCGGCGGCGGCAGTGGCGGCGGCGAAGGTGGCGGCGGCTCGGCCAGTACTCCCGGCCCCCGCCATTTCGGACTGGGAGCGAGCGCGGCGCAGGCACTGAAGGCGGCGGCGGGGCCAGAGGCTCAGCGGCTCCCAGGTGCGGGAGAGAGGCCTGCTGAAAATGACTGAATATAAACTTGTGGTAGTTGGAGCTTGTGGCGTAGGCAAGAGTGCCTTGACGATACAGCTAATTCAGAATCATTTTGTGGACGAATATGATCCAACAATAGAGGATTCCTACAGGAAGCAAGTAGTAATTGATGGAGAAACCTGTCTCTTGGATATTCTCGACACAGCAGGTCAAGAGGAGTACAGTGCAATGAGGGACCAGTACATGAGGACTGGGGAGGGCTTTCTTTGTGTATTTGCCATAAATAATACTAAATCATTTGAAGATATTCACCATTATAGAGAACAAATTAAAAGAGTTAAGGACTCTGAAGATGTACCTATGGTCCTAGTAGGAAATAAATGTGATTTGCCTTCTAGA


ACAGTAGACACAAAACAGGCTCAGGACTTAGCAAGAAGTTATGGAATTCCTTTTATTGAAACATCAGCAAAGACAAGACAGGGTGTTGATGATGCCTTCTATACATTAGTTCGAGAAATTCGAAAACATAAAGAAAAGATGAGCAAAGATGGTAAAAAGAAGAAAAAGAAGTCAAAGACAAAGTGTGTAATTATGTAAATACAATTTGTACTTTTTTCTTAAGGCATACTAGTACAAGTGGTAATTTTTGTACATTACACTAAATTATTAGCATTTGTTTTAGCATTACCTAATTTTTTTCCTGCTCCATGCAGACTGTTAGCTTTTACCTTAAATGCTTATTTTAAAATGACAGTGGAAGTTTTTTTTTCCTCTAAGTGCCAGTATTCCCAGAGTTTTGGTTTTTGAACTAGCAATGCCTGTGAAAAAGAAACTGAATACCTAAGATTTCTGTCTTGGGGTTTTTGGTGCATGCAGTTGATTACTTCTTATTTTTCTTACCAATTGTGAATGTTGGTGTGAAACAAATTAATGAAGCTTTTGAATCATCCCTATTCTGTGTTTTATCTAGTCACATAAATGGATTAATTACTAATTTCAGTTGAGACCTTCTAATTGGTTTTTACTGAAACATTGAGGGAACACAAATTTATGGGCTTCCTGATGATGATTCTTCTAGGCATCATGTCCTATAGTTTGTCATCCCTGATGAATGTAAAGTTACACTGTTCACAAAGGTTTTGTCTCCTTTCCACTGCTATTAGTCATGGTCACTCTCCCCAAAATATTATATTTTTTCTATAAAAAGAAAAAAATGGAAAAAAATTACAAGGCAATGGAAACTATTATAAGGCCATTTCCTTTTCACATTAGATAAATTACTATAAAGACTCCTAATAGCTTTTCCTGTTAAGGCAGACCCAGTATGAAATGGGGATTATTATAGCAACCATTTTGGGGCTATATTTACATGCTACTAAATTTTTATAATAATTGAAAAGATTTTAACAAGTATAAAAAATTCTCATAGGAATTAAATGTAGTCTCCCTGTGTCAGACTGCTCTTTCATAGTATAACTTTAAATCTTTTCTTCAACTTGAGTCTTTGAAGATAGTTTTAATTCTGCTTGTGACATTAAAAGATTATTTGGGCCAGTTATAGCTTATTAGGTGTTGAAGAGACCAAGGTTGCAAGGCCAGGCCCTGTGTGAACCTTTGAGCTTTCATAGAGAGTTTCACAGCATGGACTGTGTCCCCACGGTCATCCAGTGTTGTCATGCATTGGTTAGTCAAAATGGGGAGGGACTAGGGCAGTTTGGATAGCTCAACAAGATACAATCTCACTCTGTGGTGGTCCTGCTGACAAATCAAGAGCATTGCTTTTGTTTCTTAAGAAAACAAACTCTTTTTTAAAAATTACTTTTAAATATTAACTCAAAAGTTGAGATTTTGGGGTGGTGGTGTGCCAAGACATTAATTTTTTTTTTAAACAATGAAGTGAAAAAGTTTTACAATCTCTAGGTTTGGCTAGTTCTCTTAACACTGGTTAAATTAACATTGCATAAACACTTTTCAAGTCTGATCCATATTTAATAATGCTTTAAAATAAAAATAAAAACAATCCTTTTGATAAATTTAAAATGTTACTTATTTTAAAATAAATGAAGTGAGATGGCATGGTGAGGTGAAAGTATCACTGGACTAGGAAGAAGGTGACTTAGGTTCTAGATAGGTGTCTTTTAGGACTCTGATTTTGAGGACATCACTTACTATCCATTTCTTCATGTTAAAAGAAGTCATCTCAAACTCTTAGTTTTTTTTTTTTACAACTATGTAATTTATATTCCATTTACATAAGGATACACTTATTTGTCAAGCTCAGCACAATCTGTAAATTTTTAACCTATGTTACACCATCTTCAGTGCCAGTCTTGGGCAAAATTGTGCAAGAGGTGAAGTTTATATTTGAATATCCATTCTCGTTTTAGGACTCTTCTTCC
ATATTAGTGTCATCTTGCCTCCCTACCTTCCACATGCCCCATGACTTGATGCAGTTTTAATACTTGTAATTCCCCTAACCATAAGATTTACTGCTGCTGTGGATATCTCCATGAAGTTTTCCCACTGAGTCACATCAGAAATGCCCTACATCTTATTTCCTCAGGGCTCAAGAGAATCTGACAGATACCATAAAGGGATTTGACCTAATCACTAATTTTCAGGTGGTGGCTGATGCTTTGAACATCTCTTTGCTGCCCAATCCATTAGCGACAGTAGGATTTTTCAAACCTGGTATGAATAGACAGAACCCTATCCAGTGGAAGGAGAATTTAATAAAGATAGTGCTGAAAGAATTCCTTAGGTA


ATCTATAACTAGGACTACTCCTGGTAACAGTAATACATTCCATTGTTTTAGTAACCAGAAATCTTCATGCAATGAAAAATACTTTAATTCATGAAGCTTACTTTTTTTTTTTGGTGTCAGAGTCTCGCTCTTGTCACCCAGGCTGGAATGCAGTGGCGCCATCTCAGCTCACTGCAACCTCCATCTCCCAGGTTCAAGCGATTCTCGTGCCTCGGCCTCCTGAGTAGCTGGGATTACAGGCGTGTGCCACTACACTCAACTAATTTTTGTATTTTTAGGAGAGACGGGGTTTCACCCTGTTGGCCAGGCTGGTCTCGAACTCCTGACCTCAAGTGATTCACCCACCTTGGCCTCATAAACCTGTTTTGCAGAACTCATTTATTCAGCAAATATTTATTGAGTGCCTACCAGATGCCAGTCACCGCACAAGGCACTGGGTATATGGTATCCCCAAACAAGAGACATAATCCCGGTCCTTAGGTAGTGCTAGTGTGGTCTGTAATATCTTACTAAGGCCTTTGGTATACGACCCAGAGATAACACGATGCGTATTTTAGTTTTGCAAAGAAGGGGTTTGGTCTCTGTGCCAGCTCTATAATTGTTTTGCTACGATTCCACTGAAACTCTTCGATCAAGCTACTTTATGTAAATCACTTCATTGTTTTAAAGGAATAAACTTGATTATATTGTTTTTTTATTTGGCATAACTGTGATTCTTTTAGGACAATTACTGTACACATTAAGGTGTATGTCAGATATTCATATTGACCCAAATGTGTAATATTCCAGTTTTCTCTGCATAAGTAATTAAAATATACTTAAAAATTAATAGTTTTATCTGGGTACAAATAAACAGGTGCCTGAACTAGTTCACAGACAAGGAAACTTCTATGTAAAAATCACTATGATTTCTGAATTGCTATGTGAAACTACAGATCTTTGGAACACTGTTTAGGTAGGGTGTTAAGACTTACACAGTACCTCGTTTCTACACAGAGAAAGAAATGGCCATACTTCAGGAACTGCAGTGCTTATGAGGGGATATTTAGGCCTCTTGAATTTTTGATGTAGATGGGCATTTTTTTAAGGTAGTGGTTAATTACCTTTATGTGAACTTTGAATGGTTTAACAAAAGATTTGTTTTTGTAGAGATTTTAAAGGGGGAGAATTCTAGAAATAAATGTTACCTAATTATTACAGCCTTAAAGACAAAAATCCTTGTTGAAGTTTTTTTAAAAAAAGCTAAATTACATAGACTTAGGCATTAACATGTTTGTGGAAGAATATAGCAGACGTATATTGTATCATTTGAGTGAATGTTCCCAAGTAGGCATTCTAGGCTCTATTTAACTGAGTCACACTGCATAGGAATTTAGAACCTAACTTTTATAGGTTATCAAAACTGTTGTCACCATTGCACAATTTTGTCCTAATATATACATAGAAACTTTGTGGGGCATGTTAAGTTACAGTTTGCACAAGTTCATCTCATTTGTATTCCATTGATTTTTTTTTTCTTCTAAACATTTTTTCTTCAAACAGTATATAACTTTTTTTAGGGGATTTTTTTTTAGACAGCAAAAACTATCTGAAGATTTCCATTTGTCAAAAAGTAATGATTTCTTGATAATTGTGTAGTAATGTTTTTTAGAACCCAGCAGTTACCTTAAAGCTGAATTTATATTTAGTAACTTCTGTGTTAATACTGGATAGCATGAATTCTGCATTGAGAAACTGAATAGCTGTCATAAAATGAAACTTTCTTTCTAAAGAAAGATACTCACATGAGTTCTTGAAGAATAGTCATAACTAGATTAAGATCTGTGTTTTAGTTTAATAGTTTGAAGTGCCTGTTTGGGATAATGATAGGTAATTTAGATGAATTTAGGGGAAAAAAAAGTTATCTGCAGATATGTTGAGGGCCCATCTCTCCCCCCACACCCCCACAGAGCTAACTGGGTTACAGTGTTTTATCCGAAAGTTTCCAATTCCACTGTCTT
GTGTTTTCATGTTGAAAATACTTTTGCATTTTTCCTTTGAGTGCCAATTTCTTACTAGTACTATTTCTTAATGTAACATGTTTACCTGGAATGTATTTTAACTATTTTTGTATAGTGTAAACTGAAACATGCACATTTTGTACATTGTGCTTTCTTTTGTGGGACATATGCAGTGTGATCCAGTTGTTTTCCATCATTTGGTTGCGCTGACCTAGGAATGTTGGTCATATCAAACATTAAAAATGACCACTCTTTTAATTGAAATTAACTTTTAAATGTTTATAGGAGTATGTGCTGTGAAGTGATCTAAAATTTGTAATATTTTTGTCATGAACTGTACTACTCCTAATTATTGTAATGTAATA


AAAATAGTTACAGTGACAAAAAAAAAAAAAAA
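The two observations in step 1 (no U bases, and a polyA run at the 3' end) can be checked with a few lines of Python. This is only a minimal sketch: the function names are made up for illustration, and the short test string is an illustrative 3' end, not the full patient cDNA.

```python
def looks_like_dna(seq: str) -> bool:
    """True if the sequence uses only A, C, G, T (no U, so it is not RNA)."""
    return set(seq.upper()) <= set("ACGT")


def poly_a_tail_length(seq: str) -> int:
    """Number of consecutive A bases at the 3' (right-hand) end."""
    s = seq.upper()
    return len(s) - len(s.rstrip("A"))


# Illustrative 3' end only (hypothetical test string).
tail = "GTTACAGTGAC" + "A" * 15
print(looks_like_dna(tail), poly_a_tail_length(tail))
```

To check the real sequence, paste the whole cDNA into one string with the line breaks removed and call the same two functions.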

2. Go to the Sequence Manipulation Suite (http://bioinformatics.org/sms/). We want to get the amino acid sequence from this cDNA sequence from the cancer patient. We will compare this amino acid sequence with the sequence from a person with the normal gene.

3. In the menu to the left, click on “show translation” found under the heading “DNA figures.” Paste the above sequence into the first box. Under this box, at “Show the translation for…”, click on the drop-down box and choose “reading frame 2.” After you hit submit, a new window pops up containing the original nucleotide base sequence with one-letter amino acid symbols. Note that each amino acid is listed above its three bases and that amino acid number 61 is M (which stands for methionine; see table to right from your text). This is the actual beginning of the protein (all proteins begin with methionine). Highlight the web page with the info and paste it into a Word file (remember to take the Word file with you or send it to yourself by email when done).

4. In the menu to the left, click on “Translate” found under the heading “DNA analysis.” Clear the search box, then paste your patient’s cDNA sequence into the search box. From the pull-down menu of reading frames, choose “Reading Frame 2.” Click “Submit.”

5. You should be able to find the sequence of your protein by finding the first methionine (M), then continuing until you see the first “*”, which is a stop codon. Copy the protein sequence in that region, starting with the first “M,” and paste it into the same Word file that you have started. Now you have saved the mutant protein sequence.
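What the web tool does in steps 3 to 5 can be sketched in Python. This is a minimal illustration under stated assumptions, not the Sequence Manipulation Suite's own code: the function names are invented here, the table is the standard genetic code written for the DNA coding strand (T instead of U), and any codon containing an unexpected letter is shown as "X".

```python
# Standard genetic code, built in the usual T, C, A, G ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(dna: str, frame: int = 1) -> str:
    """Translate the coding strand in reading frame 1, 2, or 3 ('*' = stop)."""
    seq = "".join(dna.split()).upper()      # drop spaces and line breaks
    start = frame - 1                       # frame 2 skips the first base
    codons = (seq[i:i + 3] for i in range(start, len(seq) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def first_protein(translation: str) -> str:
    """Residues from the first M up to (not including) the next stop."""
    start = translation.find("M")
    if start == -1:
        return ""
    stop = translation.find("*", start)
    return translation[start:] if stop == -1 else translation[start:stop]


# Short fragment of the patient cDNA around the start codon.
fragment = "AATGACTGAATATAAACTTGTGGTAGTTGGAGCT"
print(translate(fragment, frame=2))   # MTEYKLVVVGA
```

For the full exercise, `first_protein(translate(cdna, frame=2))` on the complete cDNA should give the same protein you copied from the web page; if it does not, recheck the reading frame.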

6. Using “Entrez Gene” on the NCBI website (http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?CMD=search&DB=gene), find the entry for the protein you are studying by searching with the protein name (note the other possible ways of searching). Copy and paste in the following gene name: Homo sapiens KRas2. After the search, you will find 5 entries; pick the one that is “KRAS” and from Homo sapiens. For this KRAS entry, answer the following:
a. What is the official symbol:
b. Note some other names (aliases):
c. Located on which chromosome (remember we have 23 pairs of homologous chromosomes, 46 total, with Mom giving you one of each pair of homologous chromosomes and Dad giving you the other):
d. Gene ID:


7. Click on the highlighted KRAS to open Entrez Gene for more info on the gene. Read the summary info; can you understand this information? When it says “Alternative splicing leads to variants encoding two isoforms that differ in the C-terminal region,” this means that one pre-mRNA is made from the gene, but the processing of the pre-mRNA can differ (different sections are cut out). So two proteins (isoform a and isoform b) result from the same gene/pre-mRNA. In this case, isoforms are different proteins from the same gene. Highlight everything on this web page, hit control-c to copy, and then paste into your Word file. When you are all done, print this Word file and put it in your notebook.

On this page, look at the Genomic Regions: the 5’ is the beginning and the 3’ is the end. Note that the short vertical red line means a section that is used for the protein (coding regions), and the blue section is not (called an untranslated region- UTR-- so, there is a 5’ UTR at the beginning and a 3’ UTR at the end of the sequence). Note the first three coding regions are the same for both isoform a and b, but the last one differs (in isoform b, the last coding region is right up against the 3’ UTR). Next, glance through the Genomic Context (again, note it is on chromosome 12) and then Bibliography (you will be asked to look at one paper on this list).

Go down to Pathways and look at KEGG pathway: Insulin signaling pathway 04910. In this path, note that insulin activates the receptor, which activates the insulin receptor substrate (IRS), which turns on GRB2, SOS, Ras, Raf, MEK, ERK1/2 (same path we looked at before in Xenopus oocytes). MAKE SURE AS MUCH OF THE TABLE AS POSSIBLE IS IN VIEW, then hit the “print screen” button, go to your Word file, and hit control-v to paste the pathway into your Word file. Note that the insulin path is a little different from the growth factor path (see text figure on page 1 of this protocol); the insulin receptor is already dimerized, and there is an insulin receptor substrate (IRS) inserted into the path before Grb2/SOS. The end results in this insulin path are

(1)___________________________________ and (2)___________________________. On the KRAS page, go down to RefSeq section. Compare variant a and b now (use text in web site; which exons are used by what isoform/variant)—note which isoform or variant is rare and which is common:

Click on the mRNA Sequence for “variant (b)” (not variant (a)). Use the RefSeq entries for the mRNA and protein sequences for K-Ras2 isoform b, also called “variant (b).” Go down to the protein sequence (/translation=") and save the sequence in FASTA format in your Word file. Remember that this is the un-mutated protein sequence; name it the “protein sequence for the normal proto-oncogene.” Use the font “Courier New” and get rid of all gaps. Also, go to the bottom of the web page and copy the gene: copy the sequence of bases that begins after ORIGIN and goes down to number 5281, and put it in your Word file for your lab notebook.


8. Go to the ExPASy website (http://us.expasy.org/) and search for the SwissProt entry for your protein using “kras2.” Be sure to select the human protein from the list of results. Make sure the information in the entry is the same as you saw in the Gene entry. If your protein is an enzyme, the EC number is a good way to double-check—it is _______________. You may want to record the SwissProt entry number (primary accession number P01116) in case you want to find this entry again. Note that we could get the normal gene here also (at bottom, directly in FASTA format)- you want to save this version also.

Part 2 – Protein-protein BLAST: Finding homologous (similar) proteins

9. Search for similar proteins with a BLAST search (from the NCBI home page or http://www.ncbi.nlm.nih.gov/BLAST) using the RefSeq or SwissProt protein FASTA sequence (the unmutated protein sequence). BLAST is a program that compares your input sequence to all the sequences in a database that you choose. The program aligns the most similar segments between two sequences using a scoring matrix similar to BLOSUM (see entry). This scoring method gives penalties for gaps and gives the highest score to identical residues. Substitutions are scored based on how conservative the change is (for example, a small nonpolar amino acid replaced by a large nonpolar amino acid is a conservative change). The output shows a list of sequences, with the highest-scoring sequence at the top. The significance of each hit is reported as an E-value: the lower the E-value, the better the match. Hits with E-values in the range of 10^-100 to 10^-50 are very similar (or even identical) sequences. Sequences with E-values of 10^-10 and higher need to be examined by other methods to determine homology. An E-value of 10^-10 can be interpreted as roughly “a 1 in 10^10 chance that the sequence was pulled from the database by chance alone (has no homology to the query sequence).”
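The scoring idea described above (highest score for identical residues, a smaller score for conservative substitutions, penalties for gaps) can be sketched as a toy Python function. Everything numeric here is an assumption made for illustration: the residue groups and the score values are invented, and they are not the BLOSUM matrix or the gap penalties that BLAST actually uses.

```python
# Hypothetical groups of chemically similar residues (illustration only).
SIMILAR_GROUPS = [set("AVLIMFWGP"), set("STCYNQ"), set("DE"), set("KRH")]


def pair_score(a, b, match=5, conservative=1, mismatch=-3, gap=-6):
    """Score one aligned pair; '-' marks a gap. All values are made up."""
    if a == "-" or b == "-":
        return gap
    if a == b:
        return match
    if any(a in group and b in group for group in SIMILAR_GROUPS):
        return conservative
    return mismatch


def alignment_score(x, y):
    """Sum the pair scores over two already-aligned, equal-length strings."""
    if len(x) != len(y):
        raise ValueError("aligned sequences must have the same length")
    return sum(pair_score(a, b) for a, b in zip(x, y))


for other in ("GAG", "GVG", "GDG", "G-G"):
    print(other, alignment_score("GAG", other))
```

With these toy numbers, the identical pair scores highest, the conservative change (A to V) scores a little lower, the non-conservative change (A to D) lower still, and the gap lowest, which is the ranking the paragraph describes.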

First, under Protein, select PSI-PHI BLAST. Then paste the FASTA formatted protein sequence in the search box. Select the nrprotein database. Click “BLAST” to begin.

You may need to wait a few minutes before the results page opens. On the next page that appears, you will see that putative conserved domains have been found. Select “Format.” After obtaining the results, choose 5 sequences from various positions in the results (under Sequences producing significant alignments). Be sure not to choose any sequences that are human (see SOURCE), since they are the same as your search sequence. Choose ones for “lower” animals: rat, Tetraodon nigroviridis, Xenopus, Rivulus marmoratus, Oryzias latipes (Japanese medaka), etc. The goal is to choose a variety of sequences that greatly differ in evolutionary distance from the human protein. Be sure to choose a good variety of sequences from the BLAST search. The more varied the sequences, the more interesting the resulting phylogram will be.

Be sure the wild type human (RefSeq) and mutant sequences only differ by one amino acid residue. If more differences are found, there may have been a mistake in the translation of the mutant sequence.
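The one-residue check above is easy to do by eye on short sequences, but a short Python sketch makes it systematic. The function name is invented for illustration, and the two peptides are toy sequences, not the real K-Ras proteins.

```python
def differences(wild_type: str, mutant: str):
    """List (position, wild-type residue, mutant residue); positions start at 1."""
    if len(wild_type) != len(mutant):
        raise ValueError("sequences differ in length; recheck the translation")
    return [
        (position, w, m)
        for position, (w, m) in enumerate(zip(wild_type, mutant), start=1)
        if w != m
    ]


# Toy peptides (hypothetical), differing at one position.
print(differences("MKTAYIAK", "MKTCYIAK"))   # [(4, 'A', 'C')]
```

If the list has more than one entry, or the lengths differ, that is the sign that something went wrong in the translation step.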

For each of the five sequences, click on the sequence name to view the GenBank entry for the sequence. Then view the sequence in FASTA format. Copy and paste all the FASTA formatted sequences into the same Word file (get rid of any gaps or numbers). At the beginning of this file, make sure that you have your mutant protein sequence (see very start of this exercise), also in FASTA format.

10. This Word file will be used to create the multiple sequence alignment, so the formatting is very important. Get rid of all gaps, especially those at the end of each line (go to the end of each line and hit delete until you start deleting amino acid symbols). You should end up with a Word file that contains the 5 sequences from the BLAST search plus the un-mutated human protein sequence and your mutant sequence, for a total of 7 sequences. Each sequence should be in FASTA format and contain a title line (starting with >, then text, then a return). Shorten the text to contain JUST the species information so it will fit on one line!! For example, you should erase the “gi” line and add in something simpler like “pig,” “cow,” etc. Your mutant sequence should read “>mutant”. At the end of each title, be sure to press return to separate it from the rest of the sequence.
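The clean-up in step 10 can also be sketched in Python, as an alternative to editing by hand. This is only an illustration under assumptions: the function name is made up, the short titles are supplied by you in the same order as the records, and the example record (including its "gi" title) is invented.

```python
def clean_fasta(text, short_names):
    """Rewrite FASTA records with short titles and gap-free sequences."""
    records = []                      # one list of sequence chunks per record
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue                  # skip blank lines
        if line.startswith(">"):
            records.append([])        # a title line starts a new record
        elif records:
            # keep letters only: drops spaces, digits, and other symbols
            records[-1].append("".join(ch for ch in line if ch.isalpha()))
    if len(records) != len(short_names):
        raise ValueError("need exactly one short name per FASTA record")
    return "".join(
        f">{name}\n{''.join(chunks)}\n"
        for name, chunks in zip(short_names, records)
    )


# Hypothetical input: a long title, position numbers, and stray spaces.
raw = """>gi|000000|ref|XX_000000.1| example protein [Sus scrofa]
  1 ACDEF GHIKL
 11 MNPQR

>mutant sequence from the patient
ACDEF GHIKL
"""
print(clean_fasta(raw, ["pig", "mutant"]))
```

The output has one short title line per sequence followed by the sequence on a single line, which is the layout the alignment step expects.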

Part 3 – Multiple Sequence Alignment

11. Go to the ClustalW website (http://www.ebi.ac.uk/clustalw/index.html) and enter (by using “copy” and “paste”) all your FASTA formatted sequences into the data entry box. The default parameters will work for us, except for the output order.
a. Select “input” for the Output order
b. Press “run”

12. When the results come up, click on Edit on the top bar of the Internet Explorer tool bar, and then choose Select All. Or you can highlight the central text (leaving out the heading at the top of the page) and copy. After items are highlighted, hit control-c to copy and then paste into your Word file. Next, let’s focus on the alignment: click on View Alignment File, copy all, and paste into your Word file. It may look broken up. Follow these steps to make it readable again.
a. Select the alignment text (highlight it with your mouse)
b. Change the font to size 10 and Courier New
c. Change the page set-up (first hit FILE) to landscape
d. Save the file to your desktop

13. To save the “cladogram tree,” make sure that the diagram is in the center of your monitor, then hit the “Print Screen/SysRq” key on your keyboard. Then go to your Word file, make sure that you are at the spot where you want the diagram, and hit Control-V or the paste icon (or use Paste under Edit) to put the picture in your Word file. Repeat the process, but first click on “Show as phylogram Tree.” From the web site: “Phylogram is a branching diagram (tree) assumed to be an estimate of a phylogeny, branch lengths are proportional to the amount of inferred evolutionary change. A Cladogram is a branching diagram (tree) assumed to be an estimate of a phylogeny where the branches are of equal length, thus cladograms show common ancestry, but do not indicate the amount of evolutionary "time" separating taxa. Tree distances can be shown, just click on the diagram to get a menu of options. The ".dnd" file is a file that describes the phylogenetic tree.” With your phylogram, you might note that Xenopus separated into a species about 400 million years ago. Other animals: birds arose about 170 million years ago; mammals about 220 million years ago; reptiles about 320 million years ago; amphibians about 400 million years ago; and fish about 500 million years ago.

Scroll through the file and alignment and make sure none of the blocks of sequences are separated by a page break. Save and print the alignment- it will be part of your lab notebook.

OMIM search

14. Search the OMIM database (http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=OMIM&cmd=search&term=) by typing in: “lung Cancer” KRAS2. You have to put quotation marks around the two words to make sure that they are linked (otherwise you get pages of results on breast cancer, etc.).

15. Click on the first two items that come up on lung cancer and Ras. Read through them. If you want, when a new page comes up, use Edit, Find (on this web page) and type in ras to go right to a note about ras. An outline for the entry is provided to the left of the window-go to the references and read the first article listed. Go to the Allelic Variants section. Scroll until you see the entry for the Gly12Cys mutation (or G12C). Answer appropriate questions at the end of this exercise.

Examination of K-Ras structure with Kinemage:

1. Open the program Kinemage (should be on C drive) or download it from the internet.

2. Download the file called: c14Recp.kin and save it next to the Kinemage program.

3. See figure below, small part on right: Ras is essentially a flat plane (made up of 6 BLUE beta-pleated sheets), with GREEN alpha helices sitting on both sides and on the edge—see lower right hand small figure in illustration below.

[Figure: structure of Ras. Labels mark amino acid #1 and Gly12 (glycine #12), which is in the P/G1 loop and helps bind GTP.]

Find the structures of Ras noted in the figure above with Kinemage. Open Mage program, open file c14Recp.kin and then you have to go from the first figure (that of Src) to the second (showing Ras) – you will not need the third figure.


4. Ras: View1 of 4. The backbone of the amino acid chain is in white, the bound GTP analogue in pink, and the Mg++ in yellow. Mg is captured by Ras to help the protein bind to the phosphate part of GTP; thus, Mg is a cofactor (also a “trace element,” required in trace amounts). Rotate the Ras so that you get the same arrangement as shown in the PowerPoint slide (GTP on lower right side, N terminus on the left side): the flat sheet of the 6 beta strands is up and down, with the GTP binding site on top (see figure above). Make sure you read the captions for each of the four views of Ras. Zoom out (top right slider bar) to find the N terminus; remember that the P loop is just down the chain from the N terminus (the beginning of the protein). In this orientation, which of the two G’s represents Glycine 12 (versus Glycine 13)? The G on the left or right?

With the GTP site on lower right, can you count down from the N terminus?

5. View2 (remember how you go from the first view to the second?): Glycine 12 (where the most common mutation occurs) and Glycine 13 are labeled in green and are found on what is called the G1 or P loop. These 2 Glycines are located in a critical part of one of the main GTP-binding loops (the G1 or P loop). They are the two major sites of mutations that convert this enzyme into an oncogene - when these Gly's mutate to Cys, the GTP cannot be broken down (GTPase activity of ras is reduced) so Ras stays in the "on" state more of the time- causing cancer.

Draw (block diagram only, not atoms and bonds) GTP structure, then point to and name the three parts of GTP

Can you identify the 3 parts of GTP in the kinemage image?

6. To see details of interactions at the binding site, go to View3 and turn on "interact" (click on its box). Now you can see the R group side chains in cyan and weak H bonds in purple. Remember that you can zoom in and alter the Z slab so that you can see atoms behind the plane of view. With View3 and the interactions shown, you can also click on the atoms to find the number and name of the amino acid and which atom of the amino acid you clicked (the beginning of the amino acid has an amino group: for glycine 12, click on parts of the white amino acid chain and look for “n”; the tail end of the amino acid has the carboxyl group, noted by “c”; and “ca” stands for the alpha carbon in the middle of the amino acid). See the figure to the right (Fig. 3-3 in sixth ed), showing how two amino acids are connected. In the Kinemage image, find the “amino” group NH at the beginning of glycine and at the beginning of the second amino acid, alanine. Note that the R group comes off of the alpha (or central) carbon. For glycine, the R group is a single H atom (so the alpha carbon carries two H atoms). The R group of alanine has a carbon and three hydrogen atoms (called a methyl group).

7. Glycine 13 interacts with what part of GTP? (see view three, and click on interactions to see blue lines representing weak H bonds; which of GTP’s 3 parts interacts with glycine 13?). 

8. Does glycine 12 (the one that mutates most commonly) have any weak interactions (blue line) with the GTP?   Yes or No (circle one)

How do you interpret this answer?

9. Which of the three parts of GTP interacts with the cofactor (in yellow)?

10. How many weak bonds link GTP to Ras?

11. Draw and compare the R group of glycine with that of cysteine; which R group is larger?

12. What type of amino acid are they (remember there are three types of amino acids based on the R group)?  Why might changing from a G to a C cause a problem?

Questions over bioinformatics exercise on Ras and cancer

1. In the first or second reference from step 7 on page 6 (Entrez Gene, bibliography) above, go to a paper and read the abstract: who is the first author on this first article and what is the reference for this article (journal name, volume, page numbers, year)?

2. How many patients were involved in the study, and what categories of patients were they?

3. What did the researchers find out about K-Ras mutations?


4. What conclusion(s) did the researchers come to about K-Ras mutations based on their data? (Summarize and put into your own words)

5. Using your Word file, look at the gene sequence for the normal K-Ras and for the mutant, cancer-causing K-Ras. Look for the point mutation in the genes. To do this, use your “The Sequence Manipulation Suite: Show Translation” output for the cancer patient’s mutant ras (Results for 5312 residue sequence starting "GGCCGCGGCG") and the equivalent for the normal gene (see hint below). Find the first M, or methionine (proteins begin with this amino acid), and then count down to the 12th amino acid.
What is the three-base code in the normal gene for this 12th amino acid (specifies G): ___
What is the three-base code in the mutant gene for the 12th amino acid (specifies C): ____

Hint: see my alignment below. Note that the amino acid is listed above the first base of the triplet that specifies the amino acid; start with M (methionine) and count down 12 amino acids.

NORMAL GENE:
181 aatgactgaa tataaacttg tggtagttgg agct ??? ggc gtaggcaaga gtgccttgac

Mutant gene:
 61  M  T  E  Y  K  L  V  V  V  G  A  C  G  V  G  K  S  A  L  T
181 AATGACTGAATATAAACTTGTGGTAGTTGGAGCT???GGCGTAGGCAAGAGTGCCTTGAC

Note that the M is above ATG; this is where the coding sequence actually starts. Look at your genetic code table; why does it say methionine is AUG? The genetic code table gives the codon for an amino acid as it appears in mRNA, not DNA. However, gene DNA sequences are given for the coding strand of DNA, not the template strand, because the coding strand and the mRNA are the same except that T is replaced with U in the mRNA.

DNA:
Coding strand: ---ATG---
Template strand: ---TAC---

mRNA: ---AUG---

protein: methionine

Note how A pairs with T (or U in mRNA) and G pairs with C.
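The strand relationships and the codon counting from the hint can be sketched in Python. This is a minimal illustration with invented function names. It makes two assumptions: the template strand is written base by base under the coding strand (as in the ATG/TAC example above), and the first ATG found in the sequence is the true start codon, which you should confirm against your translation.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def template_strand(coding: str) -> str:
    """Base-by-base complement, written in the same left-to-right order."""
    return coding.upper().translate(COMPLEMENT)


def transcribe(coding: str) -> str:
    """mRNA has the same sequence as the coding strand, with U in place of T."""
    return coding.upper().replace("T", "U")


def codon_number(coding: str, n: int) -> str:
    """Return the n-th codon (1-based), counting from the first ATG."""
    seq = "".join(coding.split()).upper()
    start = seq.find("ATG")
    if start == -1:
        raise ValueError("no ATG start codon found")
    codon = seq[start + 3 * (n - 1): start + 3 * n]
    if len(codon) < 3:
        raise ValueError("sequence is too short for that codon number")
    return codon


print(template_strand("ATG"), transcribe("ATG"))   # TAC AUG
print(codon_number("AATGACTGAA", 2))               # ACT
```

Calling `codon_number` with `n=12` on the normal and the mutant sequences gives the two triplets that question 5 asks for.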

6. List the steps from insulin to the final cellular events; use the KEGG pathway for insulin.


7. Elk-1, c-Jun, and c-Fos are transcription factors. In general, what do they do as a result of Ras activation?

8. If Ras were mutated to be always active, what part of the pathway becomes irrelevant?

Concerning its Entrez Gene entry

9. Fill in the following information:
a. Write the GeneID number here ________________.
b. What is the gene name?
c. Where in the human genome is this gene located?
d. What is the RefSeq number for the mRNA sequence for isoform b?
e. What is the RefSeq number for the protein sequence for isoform b?

Concerning Swiss-Prot Entry

10. How many splice variants are there of K-Ras2 and what are they called?

11. Describe how K-Ras2 is activated and inactivated.

12. What proteins does K-Ras2 interact with? (Hint: GDP and GTP are not proteins)

Concerning Multiple Sequence Alignment with ClustalW:

13. What is the mutation in the mutant amino acid sequence? Write it in the following format: “Res123Res,” where the first Res is the three-letter code for the amino acid in the un-mutated (wild type) protein and the second Res is the amino acid in the mutated protein. In place of “123” put the amino acid residue number of the mutation.

14. Is the mutation in a region of conservation? Look for columns marked with many * or : symbols (blanks mean no conservation; what does the colon mean?). YES NO
If a sequence stays the same throughout the evolutionary path, the sequence (or part of a gene/protein) is said to be conserved. This probably means that the sequence plays a crucial role in the active site (or in a regulatory region).

15. Based on the alignment, what span of amino acids is LEAST conserved? Does this correlate with the region specified in the Swiss-Prot entry as “hypervariable”? These sections or sequences can vary over evolutionary time because they may not be important to the functioning of the protein (they are probably not in the active site, although the sequence could play a role in regulation).
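Questions 14 and 15 ask you to read conservation off the alignment by eye. The idea behind the * symbol (every sequence has the same residue in that column) can be sketched in Python. This is an illustration only, with invented function names and toy sequences; it does not reproduce how ClustalW assigns its other symbols.

```python
from collections import Counter


def column_identity(aligned):
    """Per column: fraction of sequences sharing the most common residue."""
    if len({len(s) for s in aligned}) != 1:
        raise ValueError("aligned sequences must all have the same length")
    fractions = []
    for column in zip(*aligned):
        residues = [r for r in column if r != "-"]        # ignore gaps
        top = Counter(residues).most_common(1)[0][1] if residues else 0
        fractions.append(top / len(column))
    return fractions


def star_line(aligned):
    """'*' under columns where every sequence has the same residue."""
    return "".join("*" if f == 1.0 else " " for f in column_identity(aligned))


# Toy alignment (hypothetical sequences).
toy = ["MKTAY", "MKSAY", "MRTAY"]
for seq in toy:
    print(seq)
print(star_line(toy))
```

Stretches where the fractions stay low for many columns in a row are candidates for the least conserved span asked about in question 15.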
