Web Technologies in Bioinformatics T.J. Esposito April 28, 2005 Advanced Bioinformatics Computing

Web Technologies in Bioinformatics


Page 1: Web Technologies in Bioinformatics

Web Technologies in Bioinformatics

T.J. Esposito

April 28, 2005

Advanced Bioinformatics Computing

Page 2: Web Technologies in Bioinformatics

Project Goal

• To make the normalized Frisina data easy and convenient to work with

• To avoid having to work with enormous text files of seemingly meaningless numbers

Page 3: Web Technologies in Bioinformatics

Project Goals

• This will be accomplished by:

- Putting the data into a database

- Making the database easy to interact with as well

- Making the database available to whoever needs it

- Giving the data some sort of context

Page 4: Web Technologies in Bioinformatics

Methods

• One of the most convenient ways of doing this is to:

- Use a relational database to store the data

- Give the database a web interface, which is convenient to use and readily available

- Link that data to other available data from Affymetrix and other sources

Page 5: Web Technologies in Bioinformatics

Methods

• These goals will be reached using current database and web technology.

• For the back-end database, MySQL will be used.

• For the web interface, JSP (JavaServer Pages) will be used.

Page 6: Web Technologies in Bioinformatics

Reasons for MySQL

• MySQL will be used due to its speed.

• Competing systems, like Postgres, were considered; however, more fully featured (yet slower) systems were not necessary.

- The data will be queried using only SELECT statements

- Having fewer features than other systems makes MySQL faster and thus better suited for use in web applications
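As a rough illustration of the SELECT-only access pattern described above, the sketch below loads a few expression values into a relational table and reads them back. It uses Python's built-in sqlite3 module as a stand-in for MySQL, and the table and column names (expression, probe_id, sample, level) are assumptions made for illustration, not the project's actual schema. The values are borrowed from the example expression data shown later in these slides.

```python
import sqlite3

# In-memory SQLite database, used here only as a stand-in for MySQL.
conn = sqlite3.connect(":memory:")

# Hypothetical schema: one row per (probe, sample) pair.
conn.execute(
    """
    CREATE TABLE expression (
        probe_id TEXT NOT NULL,
        sample   TEXT NOT NULL,
        level    REAL NOT NULL,
        PRIMARY KEY (probe_id, sample)
    )
    """
)

rows = [
    ("1415672_at", "X16_Frisina_S2_M430A.CEL", 14.2636987581270),
    ("1415672_at", "X17_Frisina_S2_M430A.CEL", 14.8166925938434),
    ("1415673_at", "X16_Frisina_S2_M430A.CEL", 10.6382802704383),
    ("1415673_at", "X17_Frisina_S2_M430A.CEL", 10.8947849214261),
]
conn.executemany("INSERT INTO expression VALUES (?, ?, ?)", rows)
conn.commit()


def levels_for_probe(connection, probe_id):
    """Return (sample, level) pairs for one probe, ordered by sample name."""
    cursor = connection.execute(
        "SELECT sample, level FROM expression WHERE probe_id = ? ORDER BY sample",
        (probe_id,),
    )
    return cursor.fetchall()


for sample, level in levels_for_probe(conn, "1415673_at"):
    print(sample, level)
```

Once the data is loaded, every lookup from the web interface could be a parameterized SELECT like the one in levels_for_probe.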

Page 7: Web Technologies in Bioinformatics

Reasons for JSP

• JSP has well-known advantages; it is:

- Efficient

- Convenient

- Powerful

- Inexpensive

- Portable

- Secure

- Java based

Page 8: Web Technologies in Bioinformatics

JSP

• Perl and CGI were considered, but JSP was chosen due to:

- Its being a current web technology utilized by many major corporations

- It seems more convenient and full-featured compared to a Perl/CGI approach

- JSP fits current multi-tier database architectures better than CGI, because the Java API and JSP were developed with such architectures in mind

- I will be working with JSP on co-op, so I wanted to brush up (or rather, learn it) before then

Page 9: Web Technologies in Bioinformatics

Data Expansion

• Once the data has been entered into a MySQL database and given a moderately flexible web interface, it will also be linked to other sources

- Affymetrix data from their site

- Other sites like NCBI or GenBank?

- Linking data to new sources as needed should be fairly easy

Page 10: Web Technologies in Bioinformatics

Finally…

• In the end, an expandable system will have been created that hopefully can be used in a real world application.

• Even if it isn’t, at least I will have gotten the experience in developing such a system with a new technology (JSP), and continued in the Java nature of the course.

Page 11: Web Technologies in Bioinformatics

Questions

Any questions?

Page 12: Web Technologies in Bioinformatics

Visualization of Frisina’s Research Data Using University of Maryland’s Treemap 4.1

John Boutell and Tom Maxon

Page 13: Web Technologies in Bioinformatics

Procedure

• Transform Frisina flat files into Treemap flat files or Excel files

• Determine relationships

• Determine organization / visualization preferences

Page 14: Web Technologies in Bioinformatics

File Transformation

• Treemap file considerations: the file begins with a line listing the variables to be considered. The next line gives the definitions of those variables. The subsequent lines consist of data, with each row of data followed by its relationships.
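The layout described above (a line of variable names, a line of definitions, then data rows that end with their relationships) can be sketched as follows. This is a minimal sketch under stated assumptions: fields are taken to be tab-separated, the type keywords STRING and FLOAT should be checked against the Treemap documentation, and the function name, column names, and the assignment of the probe to the "Young" layer are hypothetical.

```python
def build_treemap_lines(names, types, records):
    """Build the lines of a Treemap-style flat file.

    names   -- variable names (first line of the file)
    types   -- one definition per variable (second line of the file)
    records -- list of (values, hierarchy_path) pairs; the path lists the
               layers from the top of the hierarchy down to the item
    """
    if len(names) != len(types):
        raise ValueError("each variable needs exactly one definition")
    lines = ["\t".join(names), "\t".join(types)]
    for values, path in records:
        if len(values) != len(names):
            raise ValueError("row length does not match the variable list")
        lines.append("\t".join([str(v) for v in values] + list(path)))
    return lines


# Hypothetical example: one probe placed under an age-group layer.
lines = build_treemap_lines(
    names=["probe", "level"],
    types=["STRING", "FLOAT"],  # assumed type keywords
    records=[(["1415672_at", 14.2636987581270], ["Young", "1415672_at"])],
)
print("\n".join(lines))
```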

Page 15: Web Technologies in Bioinformatics

Determining Relationships

• A maximum of four layers can be used, so we’ll need to determine what the four layers should be. Example: Middle-aged vs. Young vs. Old could be one layer.

Page 16: Web Technologies in Bioinformatics

Organization and Visualization Determination

• This step will consist of ordering data and arranging coloration and spacing to ensure that the visualization is easily understood.

Page 17: Web Technologies in Bioinformatics

Obtaining Information Regarding Mouse Array Genes

Chris Parkin

April 28, 2005

Page 18: Web Technologies in Bioinformatics

Overview:

• Research involves expression data from the Affymetrix mouse chip 430A

• Thousands of genes found on this gene chip, any of which could be of importance

Page 19: Web Technologies in Bioinformatics

Overview:

• Each gene in the expression data is given an accession number

Example Expression Data:

            X16_Frisina_S2_M430A.CEL  X17_Frisina_S2_M430A.CEL  X25_Frisina_S2_M430A.CEL  X36_b_Frisina_S2_M430A.CEL
1415672_at  14.2636987581270          14.8166925938434          14.7202558244306          14.7153893835085
1415673_at  10.6382802704383          10.8947849214261          9.7992056002344           10.0489561960792
1415675_at  12.6363495581221          12.310695824458           11.7665991587842          11.7192886280750
1415677_at  11.9224599733792          11.6230373622742          11.0882276072649          11.1584524620751
1415678_at  14.3403000148085          14.3258513901380          14.2753594390197          14.3758483552046
1415679_at  15.0959031716503          14.8066829033559          14.6876918364335          14.5911816158065
1415680_at  11.4203757035264          11.4120007012393          11.2384462748424          11.3684779023244
1415681_at  12.3004566771331          11.7383490484824          11.4995261583693          11.3078357750632
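To show how a table like this might be read into a program, here is a minimal Python sketch. It assumes a whitespace-separated layout with one header line of sample names followed by one row per identifier; the function name parse_expression_table is hypothetical, and only two rows of the slide's data are reproduced.

```python
EXAMPLE = """\
X16_Frisina_S2_M430A.CEL X17_Frisina_S2_M430A.CEL X25_Frisina_S2_M430A.CEL X36_b_Frisina_S2_M430A.CEL
1415672_at 14.2636987581270 14.8166925938434 14.7202558244306 14.7153893835085
1415673_at 10.6382802704383 10.8947849214261 9.7992056002344 10.0489561960792
"""


def parse_expression_table(text):
    """Parse a header line of sample names plus rows of 'id value value ...'.

    Returns (samples, table), where table[identifier][sample] is a float.
    """
    lines = [line for line in text.splitlines() if line.strip()]
    samples = lines[0].split()
    table = {}
    for line in lines[1:]:
        identifier, *values = line.split()
        if len(values) != len(samples):
            raise ValueError(f"row {identifier} has {len(values)} values, "
                             f"expected {len(samples)}")
        table[identifier] = dict(zip(samples, (float(v) for v in values)))
    return samples, table


samples, table = parse_expression_table(EXAMPLE)
print(table["1415672_at"]["X16_Frisina_S2_M430A.CEL"])
```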

Page 20: Web Technologies in Bioinformatics

Overview:

• Gene information based on accession # is available at the Affymetrix website, but looking it up is a tedious process

• Some of the information may not be that useful for this particular research

Page 21: Web Technologies in Bioinformatics

Project Goal:

• Develop a useful online tool for obtaining information about genes on the mouse chip

• Two powerful tools to be used in developing this: Perl & NCBI

Page 22: Web Technologies in Bioinformatics

Information to Include:

• Nucleotide sequence & amino acid translation

• NCBI Definition: what metabolic role this sequence plays a part in

• Any available links to PubMed articles

• Homology groups (using NCBI’s “HomoloGene”)

• Any available information in NCBI’s “Gene” database (descriptions, lineage, ontology…)
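One way the nucleotide sequence could be requested from NCBI is through the Entrez E-utilities efetch endpoint. The sketch below only builds the request URL and does not contact NCBI. It is written in Python rather than the Perl planned for the project, and the function name and the placeholder accession are hypothetical; how a chip identifier is mapped to an NCBI accession is not covered here.

```python
from urllib.parse import urlencode

EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def efetch_fasta_url(accession, db="nuccore"):
    """Build an efetch URL asking for one record as FASTA text."""
    query = urlencode({
        "db": db,            # Entrez database to query
        "id": accession,     # accession or UID of the record
        "rettype": "fasta",  # return the sequence in FASTA format
        "retmode": "text",   # as plain text
    })
    return f"{EUTILS_BASE}/efetch.fcgi?{query}"


# Placeholder value; substitute a real accession before fetching.
url = efetch_fasta_url("EXAMPLE_ACCESSION")
print(url)
```

The resulting URL could then be fetched with any HTTP client, keeping within NCBI's usage guidelines.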

Page 23: Web Technologies in Bioinformatics

Questions?

Page 24: Web Technologies in Bioinformatics

Gene Group Correlation

• Presented by – Andrew Darling

Page 25: Web Technologies in Bioinformatics

Outline of Presentation

• Problem Statement

• Gene Group Correlation

• Methods

• Results

• Discussion

• Conclusion

Page 26: Web Technologies in Bioinformatics

Problem Statement

• Using ~20,000 expression levels taken from ~40 mice of various ages, find the genes responsible for progressive age related hearing loss in mice.

Page 27: Web Technologies in Bioinformatics

Gene Group Correlation

• Search for genes with expression levels

– Grouping similarly to the 4 mouse test groups

– Corresponding to the severity of the hearing impairment

– Exclude genes not associated with hearing impairment

Page 28: Web Technologies in Bioinformatics

Methods

• For each “gene”

– Gather expression levels for each mouse

– Segregate each expression level by mouse group

– Apply mean and deviation calculations for each group

– Calculate metric for quality of segregation

• Do expression levels segregate by mouse group?

• Repeat for each gene

• Sort for highly segregated (by group) expression values

Page 29: Web Technologies in Bioinformatics

Methods – examples 1 & 2

• Gene 1

– Young mice levels = 1, 1, 1, 1, 1, 1, 1, 1

– Middle mice levels = 3, 3, 3, 3, 3, 3, 3, 3

– Old mice levels = 6, 6, 6, 6, 6, 6, 6, 6

– Severe mice levels = 9, 9, 9, 9, 9, 9, 9, 9

– Conclusion – highly segregated by group in order of severity

• Gene 2

– Young mice levels = 1, 1, 2, 2, 3, 3, 4, 4

– Middle mice levels = 3, 3, 4, 4, 5, 5, 6, 6

– Old mice levels = 5, 5, 6, 6, 7, 7, 8, 8

– Severe mice levels = 6, 6, 7, 7, 8, 8, 9, 9

– Conclusion – mostly segregated by group in order of severity

Page 30: Web Technologies in Bioinformatics

Methods – examples 3 & 4

• Gene 3

– Young mice levels = 1, 2, 3, 4, 5, 6, 7, 8

– Middle mice levels = 1, 2, 3, 4, 5, 6, 7, 8

– Old mice levels = 1, 2, 3, 4, 5, 6, 7, 8

– Severe mice levels = 1, 2, 3, 4, 5, 6, 7, 8

– Conclusion – not segregated by group

• Gene 4

– Young mice levels = 1, 1, 1, 1, 2, 2, 2, 2

– Middle mice levels = 7, 7, 7, 7, 8, 8, 8, 8

– Old mice levels = 5, 5, 5, 5, 6, 6, 6, 6

– Severe mice levels = 3, 3, 3, 3, 4, 4, 4, 4

– Conclusion – mostly segregated by group, but not in order of severity
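The four example genes can be scored with a simple segregation metric. The slides do not fix a metric, so the one below is only an assumption chosen for illustration: the spread of the group means divided by the average spread within the groups, plus a separate check that the group means increase in order of severity. The names segregation_score and ordered_by_severity are hypothetical.

```python
from statistics import mean, pstdev

GROUP_ORDER = ["young", "middle", "old", "severe"]

# Expression levels copied from the four example genes on the slides.
GENES = {
    "gene1": {"young": [1] * 8, "middle": [3] * 8,
              "old": [6] * 8, "severe": [9] * 8},
    "gene2": {"young": [1, 1, 2, 2, 3, 3, 4, 4], "middle": [3, 3, 4, 4, 5, 5, 6, 6],
              "old": [5, 5, 6, 6, 7, 7, 8, 8], "severe": [6, 6, 7, 7, 8, 8, 9, 9]},
    "gene3": {group: [1, 2, 3, 4, 5, 6, 7, 8] for group in GROUP_ORDER},
    "gene4": {"young": [1, 1, 1, 1, 2, 2, 2, 2], "middle": [7, 7, 7, 7, 8, 8, 8, 8],
              "old": [5, 5, 5, 5, 6, 6, 6, 6], "severe": [3, 3, 3, 3, 4, 4, 4, 4]},
}


def segregation_score(groups):
    """Spread of group means relative to the average within-group spread."""
    means = [mean(groups[g]) for g in GROUP_ORDER]
    between = pstdev(means)
    within = mean([pstdev(groups[g]) for g in GROUP_ORDER])
    if between == 0:
        return 0.0           # all groups share the same mean
    if within == 0:
        return float("inf")  # perfectly tight groups with distinct means
    return between / within


def ordered_by_severity(groups):
    """True when the group means strictly increase from young to severe."""
    means = [mean(groups[g]) for g in GROUP_ORDER]
    return all(a < b for a, b in zip(means, means[1:]))


# Sort genes from most to least segregated.
ranking = sorted(GENES, key=lambda name: segregation_score(GENES[name]),
                 reverse=True)
for name in ranking:
    print(name, segregation_score(GENES[name]), ordered_by_severity(GENES[name]))
```

Under this particular metric Gene 4 outranks Gene 2 because its groups are tighter, even though its means are not in order of severity, which is why the ordering check is kept separate.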

Page 31: Web Technologies in Bioinformatics

Results

• Coding still in process

• Working out a few parameters

– Whether to sort by

• Distance of group means from each other

• Size of sigma for each group

• Mutually exclusive grouping

• Ordering of group means by severity

Page 32: Web Technologies in Bioinformatics

Discussion

• Quality of prediction of related genes based on quality of correlation theory

– Presumes related gene expression is progressive and consistent

– Presumes a quality of gene expression level measurement

• Further validation possible by sorting for redundant hits

– Sequences referenced by several probes on the chip

– Several similar probes each correlating highly

Page 33: Web Technologies in Bioinformatics

Conclusion

• If this works, it’s a freaking miracle

Page 34: Web Technologies in Bioinformatics

Gene Selection

What level

Of what gene

Does what?

Page 35: Web Technologies in Bioinformatics

Clustering

• Radial Basis Neural Network

• Develop clustering using 2 “old” data sets

• Test with all 4 data sets to verify that it clusters correctly

• Generates weights to form the clusters

Page 36: Web Technologies in Bioinformatics

ANFIS

• Tool to extract the neural network “rules”

• Gives a formula based on all the inputs to show given any set of input what value it will generate

• It is possible to extract the exact impact of each input from this formula.

Page 37: Web Technologies in Bioinformatics

ANFIS (cont.)

• However

• Computationally very expensive

• Training time for this type of network increases by a factor of 3 for each added line of input.

• Time to train would be on the order of 10 × 3^22680 seconds (3^24 seconds ≈ 10,000 years)
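The scale of that estimate can be checked with a few lines of arithmetic. The sketch assumes the slide's figures are powers of three, that is 10 × 3^22680 seconds for the full input and 3^24 seconds for 24 inputs, with the exponents flattened when the slide text was extracted; it also assumes a 365.25-day year. The full figure is handled through logarithms because it is far too large for a floating-point number.

```python
import math

SECONDS_PER_YEAR = 365.25 * 24 * 3600  # about 3.16e7 seconds

# 24 inputs: 3**24 seconds, small enough to compute directly.
seconds_24_inputs = 3 ** 24
years_24_inputs = seconds_24_inputs / SECONDS_PER_YEAR

# Full input: 10 * 3**22680 seconds, expressed as a power of ten.
log10_seconds_full = math.log10(10) + 22680 * math.log10(3)
log10_years_full = log10_seconds_full - math.log10(SECONDS_PER_YEAR)

print(f"3^24 seconds is about {years_24_inputs:,.0f} years")
print(f"10 * 3^22680 seconds is about 10^{log10_years_full:,.0f} years")
```

Under these assumptions 3^24 seconds comes to roughly nine thousand years, consistent with the slide's round figure, and the full training time has an exponent in the ten-thousands.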

Page 38: Web Technologies in Bioinformatics

Weights

• Data values influence the weights

• To eliminate those influences the values must be converted to binary values.

• A set of threshold values is needed
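A minimal sketch of the binary conversion, assuming a value at or above the threshold maps to 1 and anything below it maps to 0 (the slides do not say which side the boundary falls on). Only the median and mean thresholds are shown; the function names are hypothetical.

```python
from statistics import mean, median


def binarize(values, cutoff):
    """Map each value to 1 if it is at or above the cutoff, otherwise 0."""
    return [1 if value >= cutoff else 0 for value in values]


def binarized_sets(values):
    """Binarize the same values once per threshold."""
    return {
        "median": binarize(values, median(values)),
        "mean": binarize(values, mean(values)),
    }


# Hypothetical expression levels for one variable.
levels = [1.0, 2.0, 3.0, 10.0]
sets = binarized_sets(levels)
print(sets)
```

Each threshold yields a different binary view of the same data, so a skewed variable like the one above looks different under the median than under the mean.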

Page 39: Web Technologies in Bioinformatics

Input

• For each variable these thresholds are used

– Median, Mean

– 25/75, 75/25

– 10/90, 90/10

– 0/100, 100/0

• All of those data sets are combined into one large training set.

Page 40: Web Technologies in Bioinformatics

Where I’m going with this

• What the network will learn is to classify the data by each of those sets

– Does this already

• except for the all or nothing case

Page 41: Web Technologies in Bioinformatics

Where I’m going with this

• Analyze the weights

– By distance between weights of opposite categories

Page 42: Web Technologies in Bioinformatics

What does a large differentiation mean?

• Should point at

– The gene of importance

– The level of expression where the change occurs

Page 43: Web Technologies in Bioinformatics

Data Set

• All of those data sets are combined into one large training set.

Page 44: Web Technologies in Bioinformatics

Identify Classifying Genes of Presbycusis

Alex Haugh

Page 45: Web Technologies in Bioinformatics

Project Outline

• Step 1 – Calculate the mean of each of the datasets (Young, Midage, Mild, Severe).

• Step 2 – Find a set of genes that have unique expressions for each type.

• Step 3 – Test the ability of these genes to classify each type from training sets.

• Step 4 – Plot the expression levels of these genes throughout the mouse life cycle.

Page 46: Web Technologies in Bioinformatics

Step 1: Getting the Mean

1. Parse the files given to us by Tex.

2. Take those values and get a ‘Pre’ average.

3. Calculate the standard deviation

4. Remove any values that are not contained within the 95% range

5. Calculate the ‘Post’ average with removed expression levels

6. Record them in a new condensed file format:

Gene Expression

at17186 10.56574

at17187 8.96768
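Steps 2 through 5 above can be sketched as a single function. The slide does not define the 95% range, so this sketch assumes it means values within 1.96 standard deviations of the 'Pre' average; that cutoff, the function name, and the sample readings are all assumptions.

```python
from statistics import mean, pstdev


def post_average(levels, z=1.96):
    """Average the levels after dropping values far from the 'Pre' average.

    z=1.96 is an assumed reading of the slide's 95% range.
    """
    pre = mean(levels)    # step 2: 'Pre' average
    sd = pstdev(levels)   # step 3: standard deviation
    kept = [x for x in levels if abs(x - pre) <= z * sd]  # step 4
    return mean(kept) if kept else pre                    # step 5


# Hypothetical readings for one gene, with one obvious outlier.
readings = [10.0, 10.2, 9.8, 10.1, 9.9, 10.0, 10.1, 9.9, 10.0, 30.0]
print(post_average(readings))
```

With very few samples per gene a single outlier may not fall outside the cutoff, so the rule is more useful on larger groups.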

Page 47: Web Technologies in Bioinformatics

Step 2: Calculating Classifying Genes

1. Read in each of the newly condensed files.

2. Place all of the values into a data structure.

3. Compare all of the values of a gene against all other types and record those genes which are greater than or less than a given threshold value.

4. Narrow down genes to much smaller set

5. Record the genes in a file for use later:

--------HIGHER---------    --------LOWER--------
at17186  10.56574          at15686  5.68869
at17187  8.96768           at17122  7.76859
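The comparison in step 3 could look like the sketch below. It reads the step as: a gene classifies a type when its average in that type is higher (or lower) than its average in every other type by more than the threshold. That reading, the function name, and the gene names and values are assumptions for illustration.

```python
def classifying_genes(means_by_type, target, threshold):
    """Find genes whose level in `target` stands apart from all other types.

    means_by_type maps type name -> {gene: average expression}.
    Returns (higher, lower) dictionaries of gene -> level in the target type.
    """
    higher, lower = {}, {}
    others = [t for t in means_by_type if t != target]
    for gene, value in means_by_type[target].items():
        other_values = [means_by_type[t][gene] for t in others
                        if gene in means_by_type[t]]
        if not other_values:
            continue  # nothing to compare against
        if all(value - v > threshold for v in other_values):
            higher[gene] = value
        elif all(v - value > threshold for v in other_values):
            lower[gene] = value
    return higher, lower


# Hypothetical condensed averages for three genes in the four datasets.
means_by_type = {
    "Young":  {"gene_a": 10.0, "gene_b": 7.0, "gene_c": 5.0},
    "Midage": {"gene_a": 8.0,  "gene_b": 7.1, "gene_c": 6.5},
    "Mild":   {"gene_a": 8.2,  "gene_b": 6.9, "gene_c": 6.4},
    "Severe": {"gene_a": 7.9,  "gene_b": 7.0, "gene_c": 6.6},
}
higher, lower = classifying_genes(means_by_type, "Young", threshold=1.0)
print(higher, lower)
```

Raising the threshold is one way to narrow the result down to a much smaller set, as step 4 calls for.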

Page 48: Web Technologies in Bioinformatics

Step 3: Testing Classifying Genes

1. Read in the classifying genes for each type

2. Read in the unknown dataset

3. Subtract the unknown expression value from the classifying gene’s value and take the absolute value.

4. If the difference is less than the threshold value, record a plus one for that type.

5. Report the type with the most genes within the threshold.

Note: Given 100 Classifying genes per type and a threshold value of 0.35 there is a very high rate of accuracy.
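The voting procedure in steps 3 through 5 can be sketched as follows. The default threshold of 0.35 comes from the note above; the function name, gene names, and expression values are hypothetical.

```python
def classify(unknown, classifiers, threshold=0.35):
    """Vote for the type whose classifying genes best match the unknown.

    unknown     -- {gene: expression level} for the unknown dataset
    classifiers -- {type: {gene: expected expression level}}
    Returns (best_type, votes).
    """
    votes = {}
    for mouse_type, genes in classifiers.items():
        votes[mouse_type] = sum(
            1
            for gene, expected in genes.items()
            if gene in unknown and abs(unknown[gene] - expected) < threshold
        )
    best_type = max(votes, key=votes.get)
    return best_type, votes


# Hypothetical classifying genes for two of the four types.
classifiers = {
    "Young":  {"gene_a": 10.0, "gene_c": 5.0},
    "Severe": {"gene_a": 7.9,  "gene_c": 6.6},
}
unknown = {"gene_a": 9.8, "gene_c": 5.2}
best_type, votes = classify(unknown, classifiers)
print(best_type, votes)
```

Ties go to whichever type appears first in the dictionary, so a real run would probably want to report ties explicitly.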

Page 49: Web Technologies in Bioinformatics

Step 4: Tracking Levels

1. After testing the classifying genes from each type empirically, record these (hopefully about 20)

2. Record the average value for the gene from all types.

3. Graph the values

4. Observe and record the trends in each gene.

5. Report any genes that don’t follow the given trends.

Page 50: Web Technologies in Bioinformatics

Expectations

• I expect to find about 20 genes per type that classify ‘unknown’ datasets very well.

• I expect those genes to generally follow similar trends.

• I expect to have a program that can read in datasets and produce reliable results that can assist research by quickly identifying those genes which are outliers and unique.

Page 51: Web Technologies in Bioinformatics

ArrayView

Coherent visualization of clustered microarray data.

Madhu and Julia

Page 52: Web Technologies in Bioinformatics

Eisen Lab Software

• Cluster

– TreeView

– MapleTree

• FuzzyK

– FuzzyExplorer

– MapleTree

Page 53: Web Technologies in Bioinformatics

ArrayView Input

• Output from Cluster, FuzzyK

– Convert to ArrayView datafile (XML)

• Attribute MySQL database

– Gene title

– Gene symbol

– Public DB identifiers

– Protein families, domains

– Gene Ontology

– Metabolic pathways

Page 54: Web Technologies in Bioinformatics

ArrayView Output

• Hierarchical

– Tree filter

• Possible layouts:

– BalloonTree, RadialTree, SquarifiedTreeMapLayout, TopDownTreeLayout, VerticalTreeLayout

• k-means

– Graph filter

• Possible layouts:

– ForceDirected, Random

Page 55: Web Technologies in Bioinformatics

Controls

• Change focus

• Rotate display

• Tool tips

• Zoom

• Filter

• Color code

Page 56: Web Technologies in Bioinformatics

Experimental data

• Cluster Frisina’s data

– Cluster

– FuzzyK

• View clustered Frisina data in ArrayView

Page 57: Web Technologies in Bioinformatics

Questions

Page 58: Web Technologies in Bioinformatics

Advanced Bioinformatics Computing Project

Kyle Shenk &

Laura Grell

Page 59: Web Technologies in Bioinformatics

Overview

• TIGR MultiExperiment Viewer (MeV) is a powerful analysis tool for microarray data.

– Clustering

– Classification

– Visualization

– Statistical Analysis

• We hope to use some of these tools to perform some analysis on the Frisina data

Page 60: Web Technologies in Bioinformatics

The Input File

• MeV requires an Affymetrix.txt file for input

– Columns represent each individual sample, so in this case each mouse/experiment

– Rows represent the individual genes

– Data points are the normalized expression values

GeneName    Sample1   Sample2  Sample3     Sample4
MouseType   young     young    old_severe  old_mild
1415670_at  10.47015  13.195   9.620273    11.5090

Page 61: Web Technologies in Bioinformatics

Problem

• Dr. Frisina has provided us with four files, each representative of a different age group of mice

• The Affymetrix.txt file contains expression data from all samples

• We have to convert these four files into one large file that MeV can read and recognize

Page 62: Web Technologies in Bioinformatics

Solution

• Perl is an ideal language for editing/parsing text and generating files

• The program we developed reads in all four files and creates one large Affymetrix.txt file

• Basically the program consists of reading each file line by line and concatenating the line from one file onto the next
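The line-by-line concatenation described above can be sketched as follows. The project's program was written in Perl; this is a Python sketch of the same idea, not that program. It assumes every file is tab-separated, starts each row with the gene identifier, and lists the genes in the same order; the function name and the sample column names are hypothetical.

```python
from itertools import zip_longest


def merge_expression_files(sources):
    """Join corresponding lines of several files into one wide table.

    sources -- iterables of text lines (open file handles work too)
    Returns the merged lines, keeping the first column only once.
    """
    merged = []
    for rows in zip_longest(*sources):
        if any(row is None for row in rows):
            raise ValueError("input files have different numbers of lines")
        fields = [row.rstrip("\r\n").split("\t") for row in rows]
        key = fields[0][0]
        if any(f[0] != key for f in fields):
            raise ValueError(f"row identifiers do not line up at {key!r}")
        merged.append("\t".join([key] + [v for f in fields for v in f[1:]]))
    return merged


# Two small in-memory "files"; real use would pass open file handles.
young = ["GeneName\tY1\tY2\n", "1415670_at\t10.47015\t13.195\n"]
old = ["GeneName\tO1\n", "1415670_at\t9.620273\n"]
merged = merge_expression_files([young, old])
print("\n".join(merged))
```

Checking that the identifiers match on every row guards against silently pairing values from different genes when the files are not in the same order.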

Page 63: Web Technologies in Bioinformatics

Kyle’s Solution

• page + page + page + page = BIG PAGE!!

Page 64: Web Technologies in Bioinformatics

TIGR MeV

• The next step is to utilize the TIGR MeV tools and analyze the results.

– Expression Viewer

– Expression Graphs

• http://www.tm4.org/mev.html

Page 65: Web Technologies in Bioinformatics

PRINCIPAL COMPONENT ANALYSIS OF THE FRISINA MICROARRAY DATA

Presented by Lee Edsall

April 28, 2005

Page 66: Web Technologies in Bioinformatics

OUTLINE

• What is Principal Component Analysis?

• Method

• Goals

Page 67: Web Technologies in Bioinformatics

WHAT IS PRINCIPAL COMPONENT ANALYSIS?

• Also referred to as “PCA”

• Analysis of the variation in the data to find a new set of variables to describe the data

• Goal is to decrease the number of variables required

Page 68: Web Technologies in Bioinformatics

METHOD

• Library research and literature review to understand method and determine appropriate parameters

• Use Minitab to determine the new variables for:

• Young data

• Middle age data

• Old with mild hearing loss data

• Old with severe hearing loss data

• Compare the four sets of variables to see if any of them are specific to a set of data

Page 69: Web Technologies in Bioinformatics

GOALS

• Determine if any genes uniquely identify a set of data

• Provide a much smaller number of genes to be used in future analysis

Page 70: Web Technologies in Bioinformatics

Comparing and analyzing the tools for the microarray data

Shruti Sharma/Jennifer D’Souza

Page 71: Web Technologies in Bioinformatics

GEPAS - "Gene Expression Pattern Analysis Suite"

• Normalization

• Preprocessing

• Viewers

• Clustering

• Differential Expression

• Supervised Classification

• Data Mining & Analysis

Page 72: Web Technologies in Bioinformatics

MIAME – “Minimum Information About a Microarray Experiment”

• Interpret the results

• Reproduce the experiment.

Page 73: Web Technologies in Bioinformatics

EPCLUST – Expression Profile data CLUSTering and analysis

Tool for

• Clustering

• Visualization

• Analysis

for gene expression data as well as sequence data.

Page 74: Web Technologies in Bioinformatics

Cluster

• Performs cluster analysis

– Hierarchical clustering

– Self-organizing maps (SOMs)

– k-means clustering

– Principal component analysis

• Processes large microarray datasets

Page 75: Web Technologies in Bioinformatics

Links to the tools

• http://ep.ebi.ac.uk/EP/EPCLUST/

• http://www.mged.org/Workgroups/MIAME/miame.html

• http://gepas.bioinfo.cnio.es/tools.html

Page 76: Web Technologies in Bioinformatics