ILRI/BECA Bioinformatics Platform
Introduction
Etienne de Villiers
ILRI - Kenya
Outline
• ILRI/BECA Bioinformatics Platform
• Hardware
• Specialized software:
– Database searching
– Assembly software
• CGIAR Bioinformatics Grid
International Livestock Research Institute
A lab in Africa at the foot of Kenya’s Ngong Hills
ILRI Research Objectives
• Overall mandate is livestock research for poverty alleviation in Africa and South East Asia.
• Undertakes a balance of fundamental and applied research with long, medium and short term objectives.
• Livestock health, genetics, and management.
ILRI Facilities
• State-of-the-art laboratories (2500 m²)
• Large and small animal facilities
– Level-2/3 biosafety facility for cattle and sheep
• Bioinformatics unit
– 64 CPU Paracel 64-bit HPC cluster
• Sequencing unit
– ABI 3730 and ABI 3100
• Microarray facility
• Proteomics facility
• Oligonucleotide synthesis unit
• FACS analysis facility
• Tick unit
BECA - Biosciences East and Central Africa
• Under NEPAD several centers of excellence are being established in Africa.
• One center is being established at ILRI: Biosciences East and Central Africa (BECA).
• The center will provide state-of-the-art facilities for scientists in the region.
• Facilities include:
– Genetics and Genomics lab with high throughput sequencers
– Microarray laboratory
– Proteomics laboratory
– Immunology and molecular biology laboratories
– Bioinformatics Platform
ILRI/BECA – Bioinformatics Platform
• Provides all East and Central African scientists with access to bioinformatics applications, large-volume data storage, a local mirror of all relevant databases, basic training and helpdesk support.
• EMBNet node for East and Central Africa
IBBP services
• Access to bioinformatics tools through either:
– web-based bioinformatics tools through the BBP website
– secure shell (ssh) access for registered users
• Facilities for storage of large datasets
• Systems administration and backup of datasets
• Training and support in the use of BBP resources
• Graduate and Post-graduate Fellowships in Bioinformatics
IBBP Facilities
• Training room
– 18 computers with MS Windows and Linux
– High speed internet connection
• Servers
– 66 CPU Beowulf Linux cluster
– High availability Web server
IBBP Website
www.becabioinfo.org
Selection of available tools on IBBP
• Paracel Blast
• GeneMatcher2
• PTA
• Oligocheck
• EMBOSS: 200+ bioinformatics tools
• ClustalW multiple alignment software
• T-coffee multiple alignment software
• FastA sequence alignment tool
• HMMER multiple alignment and sequence searching software
• Staden sequence assembly and analysis package
• Primer3 primer design package
• Paup tree-inference package
• Phylip tree-inference package
• Phred/Phrap DNA editing and assembly tools
• R statistical package
• Rosetta – ab initio protein prediction
• SRS – sequence retrieval tool
• Etc.
IBBP Hardware Systems
Paracel Blast Machine: parallel NCBI-Blast (20 CPU)
• Blast
• PSI-Blast
• Mega-Blast
GeneMatcher2: 6,144 CPU supercomputer
• HMM
• Smith-Waterman
• GeneWise
• Profile
HPC Linux cluster
• 66 CPUs (AMD 64-bit)
• 72 Gigabyte RAM
• 3 Terabyte disk storage
Linux cluster
• Rocks 4.1 (RedHat) operating system
• Platform LSF batch queuing
– shares resources equally between users
• MPI libraries
– Parallel computations
Application Software (e.g. BLAST, EMBOSS, Rosetta)
Middleware (Platform LSF)
Operating System (Red Hat - ROCKS)
Node Node Node Node Node
Network (GigE)
Application Integration
Batch Queue Setup
Cluster Build and Configuration
Turnkey HPC Integration
Database searching
• Heuristic algorithms (FASTA and BLAST)
– Gapped BLAST
– Traditional ungapped BLAST
These are fast but give approximate alignments.
• Dynamic programming algorithms
– Global: Needleman-Wunsch
– Local: Smith-Waterman
These give optimal alignments but are very slow.
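To make the dynamic programming approach concrete, here is a minimal sketch of Smith-Waterman local alignment scoring. The scoring values (match, mismatch, and a linear gap penalty) are illustrative assumptions, not the parameters used by any tool on the platform. The nested loop over both sequences is what makes the method slow compared with heuristics.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score of strings a and b.

    Uses a linear gap penalty and keeps only two rows of the
    dynamic programming matrix. All scores are assumed values.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            # Score for aligning a[i-1] with b[j-1].
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: a cell never drops below zero.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


print(smith_waterman_score("TTACGTTT", "GGACGGG"))
```

The run time grows with the product of the two sequence lengths, which is why full Smith-Waterman searches against large databases are usually run on parallel or specialized hardware.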
Paracel Blast Server
• Paracel BLAST is the most advanced BLAST software written specifically for large-scale cluster systems
• 20 CPU parallel NCBI-Blast
• 20x faster than NCBI-Blast server
Blastn – Paracel Blast vs. NCBI Blast
Query – Chromosome 8: 1 sequence, 150,000,000 bases
Database – Human Ref. Seq: 10,300 sequences, 24,300,000 bases
Paracel Blast – 1h 9m 56s
NCBI – 6 days 2h 20m 34s
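As a worked check, the speedup for this single Blastn comparison can be computed directly from the two reported run times. The sketch below uses only the timings given on the slide; the variable names are illustrative.

```python
from datetime import timedelta

# Run times as reported for the Blastn comparison.
paracel = timedelta(hours=1, minutes=9, seconds=56)
ncbi = timedelta(days=6, hours=2, minutes=20, seconds=34)

# Dividing two timedeltas yields a plain float ratio.
speedup = ncbi / paracel
print(f"Paracel Blast finished about {speedup:.1f}x sooner")
```

For this one query the ratio is well above the general 20x figure, so the 20x claim may refer to a different or more typical workload.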
BioView Viewer
Gene Structure Determination
• To compare a cDNA or EST database to a genomic database, one must allow introns
• Two approaches:
– Double-affine Smith-Waterman (separate gap penalty for introns)
– Genewise: protein or HMM versus genomic DNA (models the important features of protein families better)
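The idea of a separate gap penalty for introns can be sketched as a two-piece gap cost: short gaps pay an ordinary open-plus-extend penalty, while long gaps pay a high opening cost but almost nothing per extra base. The function name and all penalty values below are assumptions chosen for illustration, not GeneMatcher2 settings.

```python
def double_affine_gap_cost(length, open_short=10, extend_short=2,
                           open_intron=40, extend_intron=0):
    """Cost of a gap of the given length under two affine regimes.

    The cheaper regime wins, so long gaps (candidate introns)
    stop being penalized in proportion to their length.
    All penalty values are illustrative assumptions.
    """
    if length <= 0:
        return 0
    short_cost = open_short + extend_short * length
    intron_cost = open_intron + extend_intron * length
    return min(short_cost, intron_cost)


for gap_length in (1, 3, 15, 1000):
    print(gap_length, double_affine_gap_cost(gap_length))
```

With these values a 1000-base gap costs no more than a 15-base gap, which lets a cDNA alignment span an intron without being rejected.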
How to get more distant homologs
• Use dynamic programming algorithms
• Use position-specific or HMM profiles
• Do iterated searches
• Use translated searches
Must be careful in interpretation (statistics)
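A translated search compares sequences at the protein level by translating DNA in all six reading frames first. The following is a minimal sketch of that translation step using the standard genetic code; the function names are illustrative, and ambiguous or incomplete codons are simply reported as "X".

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an uppercase DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X")
        for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translations(dna):
    """Map frame labels (+1..+3, -1..-3) to protein strings."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


for frame, protein in six_frame_translations("ATGGCCTAA").items():
    print(frame, protein)
```

Searching all six frames multiplies the number of comparisons, which is one reason the statistics of translated searches need careful interpretation.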
GeneMatcher2
• Do things you either can’t or wouldn’t attempt at NCBI (100x faster)
• Is a computer specialized for executing calculation-intensive methods in bioinformatics:
– Especially fast in performing the very sensitive Smith-Waterman pairwise alignment method (compensates for frame shifts)
– GeneWise: intron- and frameshift-tolerant search method
– Needleman-Wunsch alignments
– HMM searches
• 6,144 parallel processor computer
Why GeneMatcher2?
• Comparison of sensitivity and selectivity of various sequence search methods
• Blue denotes a software method
• Yellow denotes a hardware accelerated method
[Figure axes: fewer false positives; more true positives]
GeneMatcher2 - Performance
• Time-to-completion comparison of original methods and methods on GeneMatcher2
• TBLASTX improvement is 20-fold
• Other methods at least 100-fold
Source: Genome Canada Bioinformatics Platform Project
[Figure: Runtime for an average query, in seconds (axis 0 to 1000), by method: NCBI TBLASTX, Paracel TBLASTX, Decypher TBLASTX, WUSTL HMM cluster, Decypher HMM, FASTA Smith-Waterman, GeneMatcher2 SW, EBI GeneWise, Paracel GeneWise]
BioView Workbench
BioView Viewer
Assembly Software
• Paracel Transcript Assembler (PTA)
– High capacity solution for EST based transcript reconstruction
– Can assemble large numbers of ESTs, allowing for splice variants
– Complete pipeline for: sequence cleaning, clustering and assembly
– Detection, alignment and visualization of alternative splice forms
– Visualization through intuitive graphical interfaces
Scientific problems for PTA
• Proteomics
• Gene discovery
• Verify gene predictions for genome assembly
• Detecting splice variants
• Patterns of expression, tissue specificity
• SNP detection
• Combinations of all the above...
PTA – Contig view
PTA – Splice variant alignment
Paracel Oligocheck
• Oligocheck uses the sensitive Smith-Waterman alignment routine of GeneMatcher2
• Searches oligos fast against a whole genome
• Software used by companies designing and synthesizing oligonucleotides, e.g. MWG
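The purpose of checking an oligo against a whole genome is to find unintended near-matches. The sketch below illustrates that idea with a much simpler technique than Oligocheck's Smith-Waterman routine: a plain sliding-window scan that allows a fixed number of mismatches and no gaps. The function name and the mismatch threshold are assumptions for illustration.

```python
def find_oligo_hits(genome, oligo, max_mismatches=1):
    """Return (position, mismatches) for each near-match of oligo.

    Simple ungapped scan of one strand; not the alignment method
    used by Oligocheck, only an illustration of the search goal.
    """
    hits = []
    n = len(oligo)
    for start in range(len(genome) - n + 1):
        mismatches = 0
        for g, o in zip(genome[start:start + n], oligo):
            if g != o:
                mismatches += 1
                if mismatches > max_mismatches:
                    break  # too different, skip this window
        else:
            # Loop finished without exceeding the threshold.
            hits.append((start, mismatches))
    return hits


print(find_oligo_hits("ACGTACGAACGT", "ACGT", max_mismatches=1))
```

More than one hit for a candidate oligo suggests it may bind at unintended sites. A real check would also scan the reverse strand and allow gaps.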
Ensembl mirror
• Ensembl is a joint project between EMBL-EBI and the Sanger Institute.
• A software system which produces and maintains automatic annotation on selected eukaryotic genomes.
• Our site provides free access to selected areas of the data and software from the Ensembl project.
CGIAR – HPC GRID computing
ILRI, Kenya
IRRI, Philippines
ICRISAT, India
CIP, Peru
49 nodes, 89 CPUs
33 nodes, Genematcher2, 4 nodes
8 nodes, 4 nodes
BECA/Partners
Thank you