Web Technologies in Bioinformatics

T.J. Esposito

April 28, 2005

Advanced Bioinformatics Computing

Project Goal

• To make the normalized Frisina data easy and convenient to work with

• To avoid having to work with enormous text files of seemingly meaningless numbers

Project Goals

• This will be accomplished by:

- Putting the data into a database

- Making the database easy to interact with as well

- Making the database available to whoever needs it

- Giving the data some sort of context

Methods

• One of the most convenient ways of doing this is to:

- Use a relational database to store the data

- Give the database a web interface, which is convenient to use and readily available

- Link that data to other available data from Affymetrix and other sources

Methods

• These goals will be reached using current database and web technology.

• For the back-end database, MySQL will be used.

• For the web interface, JSP (Java Server Pages) will be used.

Reasons for MySQL

• MySQL will be used due to its speed.

• Competing systems, like Postgres, were considered; however, a more fully featured (yet slower) system was not necessary.

- The data will be manipulated using only SELECT statements

- MySQL has fewer features than other systems, which makes it faster and thus better suited for use in web applications
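A minimal sketch of the read-only access pattern described above. It uses Python's built-in `sqlite3` module as a stand-in for MySQL so that it runs without a server; the table name `expression`, its columns, and the helper `levels_for_probe` are hypothetical and not taken from the project.

```python
import sqlite3

# In-memory SQLite database standing in for the MySQL back end (assumption).
conn = sqlite3.connect(":memory:")

# Hypothetical long-format schema: one row per (probe, array) measurement.
conn.execute(
    """
    CREATE TABLE expression (
        probe_id TEXT NOT NULL,
        array_id TEXT NOT NULL,
        level    REAL NOT NULL,
        PRIMARY KEY (probe_id, array_id)
    )
    """
)

# Sample rows in the shape of the normalized expression data.
rows = [
    ("1415672_at", "X16_Frisina_S2_M430A.CEL", 14.2636987581270),
    ("1415672_at", "X17_Frisina_S2_M430A.CEL", 14.8166925938434),
    ("1415673_at", "X16_Frisina_S2_M430A.CEL", 10.6382802704383),
]
conn.executemany("INSERT INTO expression VALUES (?, ?, ?)", rows)
conn.commit()


def levels_for_probe(connection, probe_id):
    """Return (array_id, level) pairs for one probe using a plain SELECT."""
    cursor = connection.execute(
        "SELECT array_id, level FROM expression "
        "WHERE probe_id = ? ORDER BY array_id",
        (probe_id,),
    )
    return cursor.fetchall()


print(levels_for_probe(conn, "1415672_at"))
```

Once the data is loaded, every page of the web interface would only need parameterized SELECT queries like this one, which is the workload the slide has in mind.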

Reasons for JSP

• JSP has well-known advantages; it is:

- Efficient

- Convenient

- Powerful

- Inexpensive

- Portable

- Secure

- Java based

• Perl and CGI were considered, but JSP was chosen because:

- It is a current web technology used by many major corporations

- It seems more convenient and full-featured than a Perl/CGI approach

- It fits current multi-tier database architectures better than CGI, since JSP and the Java API were developed with them in mind

- I will be working with JSP on co-op, so I wanted to brush up on it (or rather, learn it) before then

Data Expansion

• Once the data has been entered into a MySQL database and given a moderately flexible web interface, it will also be linked to other sources

- Affymetrix data from their site

- Other sites like NCBI or GenBank?

- Linking data to new sources as needed should be fairly easy

Finally…

• In the end, an expandable system will have been created that hopefully can be used in a real world application.

• Even if it isn’t, at least I will have gotten the experience in developing such a system with a new technology (JSP), and continued in the Java nature of the course.

Questions

Any questions?

Visualization of Frisina’s Research Data Using University of Maryland’s Treemap 4.1

John Boutell and Tom Maxon

Procedure

• Transform Frisina flat files into Treemap flat files or Excel files

• Determine relationships

• Determine organization / visualization preferences

File Transformation

• Treemap file considerations – The file begins with a line listing the variables to be considered. The next line gives the definition of each variable. The subsequent lines contain the data, each followed by the relationships for that row of data.
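A sketch of writing a file with the layout described above: a names line, a definitions line, then one row per item followed by its relationship path. The tab delimiter, the type keywords `STRING` and `FLOAT`, the column names, and the function `write_treemap_rows` are assumptions for illustration; the Treemap 4.1 documentation should be checked for the exact format.

```python
import csv
import io


def write_treemap_rows(out, names, types, records):
    """Write a names line, a types line, then data rows with hierarchy paths.

    Each record is (values, path): values line up with names, and path lists
    the layers (outermost first) under which the row is grouped.
    """
    if len(names) != len(types):
        raise ValueError("names and types must have the same length")
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(names)
    writer.writerow(types)
    for values, path in records:
        if len(values) != len(names):
            raise ValueError("each row needs one value per variable")
        writer.writerow(list(values) + list(path))


buffer = io.StringIO()
write_treemap_rows(
    buffer,
    names=["probe", "level"],
    types=["STRING", "FLOAT"],              # assumed type keywords
    records=[
        (["probe_a", 14.26], ["Young"]),    # hypothetical values
        (["probe_b", 10.64], ["Old"]),
    ],
)
print(buffer.getvalue())
```

Keeping the hierarchy path as a separate list per row makes it easy to try different layer choices without touching the measured values.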

Determining Relationships

• A maximum of four layers can be used, so we’ll need to determine what the four layers should be. Example: Middle-aged vs. Young vs. Old could be one layer.

Organization and Visualization Determination

• This step will consist of ordering data and arranging coloration and spacing to ensure that the visualization is easily understood.

Obtaining Information Regarding Mouse Array Genes

Chris Parkin

April 28, 2005

Overview:

• Research involves expression data from Affymetrix mouse chip 430a

• Thousands of genes found on this gene chip, any of which could be of importance

Overview:

• Each gene in the expression data is given an accession number

Example Expression Data:

            X16_Frisina_S2_M430A.CEL  X17_Frisina_S2_M430A.CEL  X25_Frisina_S2_M430A.CEL  X36_b_Frisina_S2_M430A.CEL
1415672_at  14.2636987581270          14.8166925938434          14.7202558244306          14.7153893835085
1415673_at  10.6382802704383          10.8947849214261          9.7992056002344           10.0489561960792
1415675_at  12.6363495581221          12.310695824458           11.7665991587842          11.7192886280750
1415677_at  11.9224599733792          11.6230373622742          11.0882276072649          11.1584524620751
1415678_at  14.3403000148085          14.3258513901380          14.2753594390197          14.3758483552046
1415679_at  15.0959031716503          14.8066829033559          14.6876918364335          14.5911816158065
1415680_at  11.4203757035264          11.4120007012393          11.2384462748424          11.3684779023244
1415681_at  12.3004566771331          11.7383490484824          11.4995261583693          11.3078357750632
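A sketch of reading data in this shape, assuming whitespace-separated columns with a header line of array names followed by one line per probe. The function name `parse_expression_table` is hypothetical, and only two arrays and two probes from the example are used to keep it short.

```python
def parse_expression_table(text):
    """Parse whitespace-delimited expression text into {probe: {array: level}}."""
    lines = [line for line in text.splitlines() if line.strip()]
    arrays = lines[0].split()
    table = {}
    for line in lines[1:]:
        fields = line.split()
        probe, values = fields[0], fields[1:]
        if len(values) != len(arrays):
            raise ValueError(
                f"{probe}: expected {len(arrays)} values, got {len(values)}"
            )
        table[probe] = {a: float(v) for a, v in zip(arrays, values)}
    return table


sample = """\
X16_Frisina_S2_M430A.CEL X17_Frisina_S2_M430A.CEL
1415672_at 14.2636987581270 14.8166925938434
1415673_at 10.6382802704383 10.8947849214261
"""
table = parse_expression_table(sample)
print(table["1415673_at"]["X17_Frisina_S2_M430A.CEL"])
```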

Overview:

• Gene information based on the accession # is available at the Affymetrix website, but retrieving it is a tedious process

• Some of the information may not be that useful for this particular research

Project Goal:

• Develop a useful online tool for obtaining information about genes on the mouse chip

• Two powerful tools to be used in developing this: Perl & NCBI

Information to Include:

• Nucleotide sequence & amino acid translation

• NCBI Definition: what metabolic role this sequence plays a part in

• Any available links to PubMed articles

• Homology groups (using NCBI’s “HomoloGene”)

• Any available information in NCBI’s “Gene” database (descriptions, lineage, ontology…)
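As one illustration of how the nucleotide sequence could be requested, the sketch below builds a URL for NCBI's E-utilities `efetch` service. The project planned to use Perl; Python is used here only for illustration. The function name `build_fasta_url` and the placeholder accession are hypothetical, and no network request is made.

```python
from urllib.parse import urlencode

EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_fasta_url(accession):
    """Build an efetch URL that requests one nucleotide record as FASTA text."""
    params = {
        "db": "nuccore",      # NCBI nucleotide database
        "id": accession,
        "rettype": "fasta",
        "retmode": "text",
    }
    return EFETCH_URL + "?" + urlencode(params)


# "ACCESSION_PLACEHOLDER" is a made-up identifier; substitute a real GenBank
# accession before fetching anything.
print(build_fasta_url("ACCESSION_PLACEHOLDER"))
```

The other items in the list (PubMed links, HomoloGene groups, Gene records) would need their own lookups; this sketch covers only the sequence request.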

Questions?

Gene Group Correlation

• Presented by – Andrew Darling

Outline of Presentation

• Problem Statement

• Gene Group Correlation

• Methods

• Results

• Discussion

• Conclusion

Problem Statement

• Using ~20,000 expression levels taken from ~40 mice of various ages, find the genes responsible for progressive age-related hearing loss in mice.

Gene Group Correlation

• Search for genes with expression levels

– Grouping similarly to the 4 mouse test groups

– Corresponding to the severity of the hearing impairment

– Excluding genes associated with something other than hearing impairment

Methods

• For each “gene”

– Gather expression levels for each mouse

– Segregate each expression level by mouse group

– Apply mean and deviation calculations for each group

– Calculate metric for quality of segregation

• Do expression levels segregate by mouse group

• Repeat for each gene

• Sort for highly segregated (by group) expression values
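The first steps above (gather levels, split by mouse group, compute mean and deviation) can be sketched as follows. The mouse IDs, group names, and levels are hypothetical, and the population standard deviation (`pstdev`) is one possible choice of "deviation"; the slides do not say which was intended.

```python
from collections import defaultdict
from statistics import mean, pstdev


def group_stats(levels, groups):
    """Return {group: (mean, standard deviation)} for one gene.

    levels: {mouse_id: expression level}; groups: {mouse_id: group name}.
    """
    by_group = defaultdict(list)
    for mouse, level in levels.items():
        by_group[groups[mouse]].append(level)
    return {g: (mean(v), pstdev(v)) for g, v in by_group.items()}


# Hypothetical mice and levels for a single gene.
groups = {"m1": "young", "m2": "young", "m3": "old", "m4": "old"}
levels = {"m1": 1.0, "m2": 3.0, "m3": 6.0, "m4": 8.0}
print(group_stats(levels, groups))
```

Running this once per gene gives the per-group summaries from which a segregation metric could then be calculated.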

Methods – examples 1 & 2

• Gene 1

– Young mice levels = 1, 1, 1, 1, 1, 1, 1, 1

– Middle mice levels = 3, 3, 3, 3, 3, 3, 3, 3

– Old mice levels = 6, 6, 6, 6, 6, 6, 6, 6

– Severe mice levels = 9, 9, 9, 9, 9, 9, 9, 9

– Conclusion – highly segregated by group in order of severity

• Gene 2

– Young mice levels = 1, 1, 2, 2, 3, 3, 4, 4

– Middle mice levels = 3, 3, 4, 4, 5, 5, 6, 6

– Old mice levels = 5, 5, 6, 6, 7, 7, 8, 8

– Severe mice levels = 6, 6, 7, 7, 8, 8, 9, 9

– Conclusion – mostly segregated by group in order of severity

Methods – examples 3 & 4

• Gene 3

– Young mice levels = 1, 2, 3, 4, 5, 6, 7, 8

– Middle mice levels = 1, 2, 3, 4, 5, 6, 7, 8

– Old mice levels = 1, 2, 3, 4, 5, 6, 7, 8

– Severe mice levels = 1, 2, 3, 4, 5, 6, 7, 8

– Conclusion – not segregated by group

• Gene 4

– Young mice levels = 1, 1, 1, 1, 2, 2, 2, 2

– Middle mice levels = 7, 7, 7, 7, 8, 8, 8, 8

– Old mice levels = 5, 5, 5, 5, 6, 6, 6, 6

– Severe mice levels = 3, 3, 3, 3, 4, 4, 4, 4

– Conclusion – mostly segregated by group not in order of severity

Results

• Coding still in process

• Working out a few parameters

– Whether to sort by

• Distance of group means from each other

• Size of sigma for each group

• Mutually exclusive grouping

• Ordering of group means by severity

Discussion

• Quality of prediction of related genes based on quality of correlation theory

– Presumes related gene expression is progressive and consistent

– Presumes a quality of gene expression level measurement

• Further validation possible by sorting for redundant hits

– Sequences referenced by several probes on the chip

– Several similar probes each correlating highly

Conclusion

• If this works, it’s a freaking miracle

Gene Selection

What level

Of what gene

Does what?

Clustering

• Radial Basis Neural Network

• Develop clustering using 2 “old” data sets

• Test with all 4 data sets to verify that it clusters correctly

• Generates weights to form the clusters

• Tool to extract the neural network “rules”

• Gives a formula based on all the inputs to show given any set of input what value it will generate

• It is possible to extract the exact impact of each input from this formula.

ANFIS, cont’d

• However

• Computationally very expensive

• Training time for this type of network increases by a factor of 3 for each added line of input.

• Time to train would be on the order of 10 * 3^22680 seconds (3^24 seconds ≈ 10,000 years)
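A quick check of the arithmetic, reading the figures on this slide as powers of three (an interpretation: the exponents appear to have been flattened when the slide was extracted to text). A year is taken as 365.25 days.

```python
import math

SECONDS_PER_YEAR = 365.25 * 24 * 3600

# 3**24 seconds expressed in years; the slide rounds this to about 10,000.
years_for_24 = 3**24 / SECONDS_PER_YEAR
print(round(years_for_24))

# 10 * 3**22680 is far too large for a float, so report its order of magnitude.
log10_seconds = math.log10(10) + 22680 * math.log10(3)
print(f"about 10^{log10_seconds:.0f} seconds")
```

Because the cost triples with each added input, even 24 inputs already puts training time in the thousands of years, so the full input set is out of reach for this approach.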

Weights

• Data values influence the weights

• To eliminate those influences the values must be converted to binary values.

• A set of threshold values is needed

• For each variable these thresholds are used

– Median, Mean

– 25/75, 75/25

– 10/90, 90/10

– 0/100, 100/0

• All of those data sets are combined into one large training set.
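A sketch of the thresholding step under one possible reading of the list above: each variable is cut at its median, its mean, and at the 25th, 75th, 10th, 90th, 0th and 100th percentiles, and a value becomes 1 if it lies above the cut and 0 otherwise. The nearest-rank percentile, the "above means 1" rule, and all function names are assumptions; the slide does not define them.

```python
from statistics import mean, median


def percentile(values, pct):
    """Nearest-rank percentile of values for an integer pct in 0..100."""
    ordered = sorted(values)
    rank = max(1, -(-len(ordered) * pct // 100))  # ceiling division
    return ordered[rank - 1]


def binarize(values, threshold):
    """Map each value to 1 if it is above the threshold, else 0."""
    return [1 if v > threshold else 0 for v in values]


def training_rows(values):
    """Binarize one variable at each threshold and collect the results."""
    thresholds = {"median": median(values), "mean": mean(values)}
    for pct in (25, 75, 10, 90, 0, 100):
        thresholds[f"p{pct}"] = percentile(values, pct)
    return {name: binarize(values, t) for name, t in thresholds.items()}


levels = [1, 2, 3, 4, 5, 6, 7, 20]   # hypothetical expression levels
for name, row in training_rows(levels).items():
    print(name, row)
```

Stacking the rows produced for every variable would give the single large training set the slide describes.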

Where I’m going with this

• What the network will learn is to classify the data by each of those sets

– Does this already

• Except for the all-or-nothing case

Where I’m going with this

• Analyze the weights

– By distance between weights of opposite
