BIOINFORMATICS COMPUTATION
AND VISUALIZATION AT THE
UNIVERSITY OF LOUISVILLE
Eric C. Rouchka and Adel S. Elmaghraby
Computer Engineering and Computer Science Department
November 16, 2010
Abstract
Current high-throughput molecular biology techniques are providing researchers with data growing at a rate equal to or faster than Moore’s law. The ability to store, manipulate and analyze this “Big Data” requires intelligent utilization of HPC hardware and software resources. At the University of Louisville, we are specifically interested in understanding gene expression for a variety of disease and disorder states through analysis of microarray data, next-generation sequencing of transcriptomes and visualization of high-resolution in-situ hybridization images of the central nervous system. We use a variety of approaches, including GPU computing and a Dell Visualization Cluster, to help achieve faster results.
HPC, GPU, Clusters
GENE EXPRESSION VISUALIZATION
Analyzing in-situ hybridization images of the central nervous system.
Statement of Problem
A multitude of high-resolution biological imaging techniques are available, including:
◦ Magnetic resonance
◦ Ultrasound
◦ Computed tomography
◦ X-ray
◦ Histological
University of Louisville Database
View genes involved in CNS
◦ neurotransmitters
◦ neuroreceptors
Ages:
◦ E13.5 (Embryonic day 13.5)
◦ P0 (Postnatal day 0, newborn)
◦ P7 (Postnatal day 7)
Typical Size (TIFF Format)
◦ 6000 x 6000 pixels
◦ ranges from 3000 x 3000 to 30,000 x 30,000
◦ 30 MB to 800 MB per image
CNS Image Types
Whole Brain
Eye (Retina)
Spinal Cord
UofL In-Situ Hybridization Database
GOALS
◦ Tie in-situ database into gene expression (microarray; rtPCR) experiments
◦ Link to other existing information
◦ Localize and quantify signal
Purpose
Search Images of Interest
◦ by gene name
◦ by developmental stage
◦ by tissue type
Share Images
◦ publicly
◦ private groups
Annotate Images
Store Images
Partitioning for Web Viewing
Extending the image display on multiple tiles (15,000 x 4,800 available display pixels)
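The partitioning step can be illustrated with a minimal sketch. The slides do not say how images are split for web viewing, so the fixed square tile size of 256 pixels and the function name `tile_boxes` are assumptions made for illustration only.

```python
from typing import Iterator, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def tile_boxes(width: int, height: int, tile: int = 256) -> Iterator[Box]:
    """Yield pixel boxes that cover a width x height image, row by row.

    The tile size is a hypothetical choice; edge tiles are clipped to the
    image bounds, so they may be smaller than `tile`.
    """
    for top in range(0, height, tile):
        for left in range(0, width, tile):
            yield (left, top, min(left + tile, width), min(top + tile, height))


# Typical image size quoted in the slides: 6000 x 6000 pixels.
boxes = list(tile_boxes(6000, 6000, tile=256))
print(len(boxes), boxes[0], boxes[-1])
```

Each box could then be cropped from the source TIFF and served on demand, so a browser only loads the tiles that are currently in view.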
High Resolution in-situ hybridization of mouse retina
Utilizing the Dell Video Wall
Research Areas
Control of gene expression
Sources of Variability
DNA and Protein Sequence Analysis
Other
FUNCTIONAL GENOMICS
• TSS classification
• Transcription factor detection
• Translational control
• Primer design
• Gene structure prediction
• SNP analysis
• Repeat analysis
• Alternative splicing
• 2nd level microarray analysis
• Gene interactions
• In-situ hybridization
• Machine learning
• DNA computing
Moore’s Law vs. GenBank
[Figure: Log Growth of GenBank. x-axis: Year; y-axes: Number of Bases in GenBank (log2) and Predicted Number of Transistors (log2). Annotation: 286 Processor, 134,000 Transistors.]
Hard Drive Storage vs. NGS
Stein L. (2010) Genome Biology 11:207.
The LINE-1 Retrotransposon
Adapted from: Babushok DV, Kazazian HH, Jr. Progress in understanding the biology of the human mutagen LINE-1. Hum Mutat. 2007 Jun;28(6):527-39.
[Figure: LINE-1 structure on a 1K to 6K scale, labeled 5’ UTR, Antisense Promoter, ORF 1, ORF 2, 3’ UTR, Poly-A]
• Long Interspersed Nuclear Element-1
• A repeat sequence found pervasively throughout the genome.
• Each copy may or may not be capable of transcription or retrotransposition.
What is the LINE-1 Life Cycle?
Ribosomes
ORF 2: endonuclease, reverse transcriptase
ORF1: Zipper domain
• Comprises ~21% of the genome
• As many as 100 copies are estimated to be active and capable of retrotransposition.
• Most copies of LINE-1, however, are truncated at the 3’ end or otherwise not intact, and are inactive.
[Figure: histogram of Number of Observations (Counts) by Distance from 3’ end]
How does LINE-1 affect cellular function?
Down regulation
Splice isoforms
Ectopic expression
LINE1 5’ Signature: GGGGAGGAGCCAAG
Reverse Complement: CTTGGCTCCTCCCC
String search for exact match in fastq records
SRR000921.50547 …TGAGTAAATAATGGA*GGGGAGGAGCCAAGAT…
SRR003709.200687 …CTTGGCTCCTCCCCC*AAAAGGAATCATTTTAAA…
Identify and collect those sequences that contain the 5’ LINE1 signature element
and at least 25 nucleotides of additional sequence that flanks the LINE1 element
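The signature search and flank collection described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: the function names are hypothetical, only the first match in a read is considered, and the flank is taken to be everything on the outer side of the 14-nucleotide signature match.

```python
SIGNATURE = "GGGGAGGAGCCAAG"  # LINE1 5' signature quoted in the slides
MIN_FLANK = 25                # minimum flanking nucleotides required

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def revcomp(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_flank(read: str, signature: str = SIGNATURE, min_flank: int = MIN_FLANK):
    """Return the sequence flanking an exact signature match, or None.

    Forward-strand hit: the flank is the part of the read before the signature.
    Reverse-strand hit: the flank is the part of the read after the
    reverse-complemented signature.
    Reads without a hit, or with a flank shorter than min_flank, give None.
    """
    read = read.upper()
    pos = read.find(signature)
    if pos != -1:
        flank = read[:pos]
    else:
        rc = revcomp(signature)
        pos = read.find(rc)
        if pos == -1:
            return None
        flank = read[pos + len(rc):]
    return flank if len(flank) >= min_flank else None


# Synthetic reads for illustration only (not real SRA records).
forward_read = "ACGTTGCA" * 4 + SIGNATURE + "AT"
reverse_read = revcomp(SIGNATURE) + "TTGACCAG" * 4
print(find_flank(forward_read))
print(find_flank(reverse_read))
```

Only the flank is kept for the later alignment step, since the LINE1 portion itself occurs throughout the genome and would not map uniquely.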
Align against reference human sequence with BLAT
Identify and collect those sequences whose flanking sequence maps uniquely to the genome but whose alignment does not extend to cover the LINE1 element. Isolate the flanking sequence and create a blastable database comprised of the flanks.
Flanking sequence LINE1 element
Flanking sequence LINE1 element
>SRR000921.50547
…TGAGTAAATAATGGA
>SRR003709.200687
AAAAGGAATCATTTTAAA…
Convert all fastq to fasta and BLAST against flanking sequence database
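The fastq-to-fasta conversion can be sketched with the standard library alone. The sketch assumes the common four-line FASTQ layout (header, sequence, `+` line, quality) with no wrapped sequence lines; the function name and the example records are hypothetical, and the BLAST step itself is not shown.

```python
import io


def fastq_to_fasta(fastq_handle, fasta_handle) -> int:
    """Write each 4-line FASTQ record as a 2-line FASTA record.

    Returns the number of records converted. Quality lines are discarded.
    """
    count = 0
    while True:
        header = fastq_handle.readline()
        if not header:            # end of file
            break
        if not header.strip():    # tolerate blank lines between records
            continue
        if not header.startswith("@"):
            raise ValueError("expected '@' at start of FASTQ header: %r" % header)
        sequence = fastq_handle.readline().rstrip("\n")
        fastq_handle.readline()   # '+' separator line, ignored
        fastq_handle.readline()   # quality string, ignored
        fasta_handle.write(">" + header[1:].rstrip("\n") + "\n" + sequence + "\n")
        count += 1
    return count


# Two made-up records for illustration.
example = io.StringIO("@read1\nACGT\n+\nIIII\n@read2\nGGCC\n+\nIIII\n")
converted = io.StringIO()
n_records = fastq_to_fasta(example, converted)
print(n_records)
print(converted.getvalue(), end="")
```

For real files, the two handles would be opened with `open(path)` and `open(path, "w")` instead of `io.StringIO`.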
GPU AND CUDA
Typical Computational Operations
RNA folding using the Nussinov algorithm, based on dynamic programming, which is O(n^3).
Clustering of gene data and textual information.
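To make the O(n^3) cost of Nussinov folding concrete, here is a minimal single-threaded reference sketch that returns the maximum number of base pairs. It is not the GPU implementation mentioned in these slides. Allowing G-U wobble pairs and the `min_loop` parameter (the minimum number of unpaired bases enclosed by a pair) are assumptions of this sketch.

```python
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def nussinov(seq: str, min_loop: int = 0) -> int:
    """Return the maximum number of nested base pairs for an RNA sequence.

    dp[i][j] holds the best score for the subsequence seq[i..j]. Either
    position j is unpaired, or j pairs with some k in [i, j), which splits
    the problem into seq[i..k-1] and seq[k+1..j-1]. Three nested loops over
    span, i and k give the O(n^3) running time.
    """
    n = len(seq)
    if n == 0:
        return 0
    dp = [[0] * n for _ in range(n)]
    for span in range(1, n):
        for i in range(n - span):
            j = i + span
            best = dp[i][j - 1]                 # case: j is unpaired
            for k in range(i, j - min_loop):    # case: k pairs with j
                if (seq[k], seq[j]) in PAIRS:
                    left = dp[i][k - 1] if k > i else 0
                    inner = dp[k + 1][j - 1] if k + 1 <= j - 1 else 0
                    best = max(best, left + 1 + inner)
            dp[i][j] = best
    return dp[0][n - 1]


print(nussinov("GGGAAACCC", min_loop=3))
```

All cells on one anti-diagonal (fixed `span`) depend only on shorter spans, so they can be computed independently of each other, which is what makes the recurrence a natural fit for parallel hardware.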
A binary matrix representation of a secondary structure of an RNA sequence.
Hamada M et al. Bioinformatics 2009;25:465-473
VaIG Lab
Use Dell Alienware with Nvidia GPUs for
◦ Hierarchical clustering of DNA microarray data, 48 times speedup over single core CPU, using Tesla C870
◦ Nussinov RNA folding, 290 times speedup, using Tesla C2050
◦ Processing of PubMed abstracts, ongoing
◦ SAT (propositional logic) as applied to haplotype inference, ongoing
◦ Semi-supervised support vector machine (S3VM), ongoing
Sample Speedup Using GPU
Compute pairwise Manhattan distance and Pearson correlation coefficient of data points with GPU
Dar-Jen Chang, Ahmed H. Desoky, Ming Ouyang, Eric C. Rouchka, 2009 10th ACIS International Conference on Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed Computing
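For reference, the two pairwise measures can be written as a plain CPU sketch. This is not the GPU code from the cited paper; the function names are hypothetical, and returning NaN for a constant vector in `pearson` is a choice made here for illustration.

```python
import math


def manhattan(a, b) -> float:
    """Sum of absolute coordinate differences between two points."""
    return sum(abs(x - y) for x, y in zip(a, b))


def pearson(a, b) -> float:
    """Pearson correlation coefficient of two equal-length vectors."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return float("nan")  # correlation is undefined for constant vectors
    return cov / math.sqrt(var_a * var_b)


def pairwise(points, metric):
    """Return the full symmetric matrix of metric(points[i], points[j])."""
    n = len(points)
    out = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            out[i][j] = out[j][i] = metric(points[i], points[j])
    return out


# Made-up data points for illustration.
data = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [3.0, 2.0, 1.0]]
distances = pairwise(data, manhattan)
correlations = pairwise(data, pearson)
print(distances)
print(correlations)
```

Every entry of the matrix is independent of the others, so the n(n-1)/2 pair computations can be spread across many threads, which is why this workload maps well onto a GPU.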
UNIVERSITY OF LOUISVILLE AND DELL
A positive experience
J.B. Speed School Industry Affiliates