Challenges in developing and implementing standards-based approaches in bioinformatics

South African National Bioinformatics Institute
Electric Genetics
University of the Western Cape


Impact of open standards is a lot like that of open source software

Free software

Open source software

Myriad of licenses

Low or no cost access

Tools

• Existing and growing numbers of initiatives
• Applications: EMBOSS, BLAST
• Environments, vocabularies, databases

ACCESS and support of OS tools

• Understanding
– How do I install and use this system/application?
– What if I have never used a non-Windows environment?
– Who else is using this that I can share my questions with?
– Is anyone out there going to help me?
– Is there a credible user base?

Impact

• Commercial
– Are there any legal/regulatory hurdles to employing open source tools?
– Is it a time sink?
– Any impact on early adopters within the company?
– How is this supported in terms of impact on the enterprise?

Impact

• Academic
– Funding threat
• Calls for public funds to be spent upon development that should be available back to the public
– Requirement to distribute freely
– If a developer wishes to do OS projects, sometimes a requirement to commercialise as part of funding

Population Genetics of Open Source

– Longevity of use as a function of penetrance
– Most new mutations, even if they are not selected against, never succeed in entering the population.
– Where N is the total finite population size, 1/(2N) is the probability that the mutation will become fixed.
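The 1/(2N) figure is the classical fixation probability of a single new neutral mutation among the 2N gene copies of a diploid population of size N. The sketch below is a minimal illustration, assuming a Wright-Fisher model with no selection; the function names, parameters and seed are illustrative and do not come from the slides.

```python
import random


def neutral_fixation_probability(n):
    """Analytic probability that one new neutral mutation fixes among 2N copies."""
    return 1.0 / (2 * n)


def simulate_fixation(n, trials, seed=0):
    """Estimate the same probability by simulating neutral drift (assumed Wright-Fisher model)."""
    rng = random.Random(seed)
    copies = 2 * n
    fixed = 0
    for _ in range(trials):
        count = 1  # a single new mutant copy enters the population
        while 0 < count < copies:
            freq = count / copies
            # Every copy in the next generation is drawn independently from the current one
            count = sum(1 for _ in range(copies) if rng.random() < freq)
        if count == copies:
            fixed += 1
    return fixed / trials


if __name__ == "__main__":
    print(neutral_fixation_probability(10))
    print(simulate_fixation(10, trials=2000, seed=1))
```

In the analogy of the slide, a new tool with no selective advantage (no packaging, support or user base) is most likely to be lost by chance alone.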

Software zygosity

• Two possible forms of software or data
– Open
– Controlled access
• Heterozygous
– Both are used
• Homozygous
– Only one is used

No selection

Artificial Selection

Software selection - packaging

• Opportunity to use
• Support and documentation
• Distribution and marketing
• Training
• User base
• Knowledge of users
• Repeat uses = impact
• Funded/stable development
• Commercial or open source support or both

Artificial Selection

[Figure: mean number of users (y-axis) against number of choice instances (x-axis), comparing "No packaging" with "Packaging".]

Effects of software selection

• Same selection can have different outcomes
– Robertson and Reeve
• Change in wing size in Drosophila
• Number of cells
• Size of cells
• Selection for a web browser
– Mosaic *
– Netscape $ > *
– Mozilla *
– Internet Explorer *
– Opera $ > *

[Figure labels: 1997, 2000, 2001, 2002]

Africa Regional Training Course: Participants

Manifesto for bioinformatics

• open source

• open standards

• open annotation

• open data

• open health care

North – South Divide

• Generation of genome data has been performed mostly in the developed west

• Major laboratories and researchers are not in developing countries

• Researchers at ‘site of infection’ have to compete with developed country researchers for access to genome projects

• Developing countries lack resources for large scale projects

• Developing countries provide the genetic material

Lessons so far

• Sharing knowledge is key to developing knowledge
• Sharing is difficult if there is an impediment to access
• Open philosophies provide access to those with limited resources but a need for knowledge
• Standards improve sharing
• Those who would benefit from access to knowledge should contribute to standards and sharing

Why did SANBI get involved in controlled vocabularies?

Legacy expertise in gene expression data

RP1 expression product is unique to retina

ESTs have nasty annotations

Genome imminent

Leverage

Looking at ESTs across libraries

• Library descriptions are diverse and in many cases non-informative

• NCI_CGAP_Lip2
• UT0117 (75% of all EST libraries)
• Soares foetal %^&*()
• What were the actual expression states that these libraries captured?

eVOC: Controlled Vocabulary for Unifying Gene Expression Data

• Consistent description of different libraries
• Mapped orthogonal vocabularies
• Anatomy, Cell type, Pathology, Development
• 7016 EST libraries classified + 104 SAGE
• 700 controlled terms
• Applying terms to SAGE and EST libraries allows cross comparisons for the first time; microarray to follow…

Uses of eVOC

• Provided as an integrated public resource which allows:
– Linking libraries, transcripts and genes with expression terms
– Analysis of expression level and tissue expression profiles
– Comparison of expression between species
– Linkage of genome sequence with expression phenotype information

Data Structure

• 4 orthogonal, mutually exclusive knowledge domains
• Independent pure hierarchies
– One parent but multiple children
• Advantages of pure hierarchies over more complex data structures
– Easily maintained
– Easily expanded
– Easily visualised
– Human and computer readable
– Powerful simple querying
• Each node has a specific concept
– One or more synonymous terms
• Nasal
• Nose
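The structure described above (one parent per node, several children, synonyms attached to a concept) can be sketched in a few lines. This is an illustration only, not the eVOC implementation; the class, the `find` helper and the placement of "nose" under "respiratory" are assumptions, and only the nasal/nose synonym pair comes from the slide.

```python
class Node:
    """One concept in a pure hierarchy: exactly one parent, any number of children."""

    def __init__(self, name, synonyms=(), parent=None):
        self.name = name
        self.synonyms = set(synonyms)
        self.parent = parent
        self.children = []
        if parent is not None:
            parent.children.append(self)

    def path(self):
        """Return the terms from the root down to this node."""
        node, terms = self, []
        while node is not None:
            terms.append(node.name)
            node = node.parent
        return list(reversed(terms))

    def descendants(self):
        """Yield this node and everything below it."""
        yield self
        for child in self.children:
            yield from child.descendants()


def find(root, term):
    """Look a concept up by its name or any synonym (case-insensitive)."""
    wanted = term.lower()
    for node in root.descendants():
        if wanted == node.name.lower() or wanted in {s.lower() for s in node.synonyms}:
            return node
    return None


# Hypothetical fragment of a hierarchy, for illustration only.
root = Node("anatomical system")
respiratory = Node("respiratory", parent=root)
nose = Node("nose", synonyms=["nasal"], parent=respiratory)
```

Because every node has a single parent, `path()` is unambiguous, which is part of what makes such hierarchies easy to maintain and query.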

No More Tangles?

• Where multiple parents/relationship types exist and could be represented in a DAG, these can often be “untangled” into more than one hierarchy

• Untangling a tangled ontology. A complex mixed ontology can be simplified by creating simpler ontologies representing distinct domains.

Untangled Ontology

[Figure: three separate hierarchies.
Entities: Person, Body Substance, Steroid, Organic Ion, Testosterone, Glutamate.
Roles: Clinical Role (Patient, Doctor), Physiological Role (Neurotransmitter, Hormone).
Value Types: Sex (Male, Female), Age (Adult, Child).]

Tangled Ontology

[Figure: a single mixed hierarchy rooted at System, with the nodes Person, Body Substance, Man, Woman, Doctor, Patient, Female doctor, Male doctor, Steroid, Hormone, Neurotransmitter, Organic Ion, Testosterone, Glutamate.]
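One way to read the untangling idea: a multi-parent node such as "Female doctor" in the tangled ontology is replaced by a combination of one term from each simple hierarchy. The sketch below assumes each hierarchy is stored as a term-to-parent mapping; the axis names and helper functions are hypothetical, and the parentage shown is inferred from the figure labels.

```python
# Each untangled hierarchy maps a term to its single parent (None for the root).
ENTITIES = {"Entities": None, "Person": "Entities", "Body Substance": "Entities"}
ROLES = {
    "Roles": None,
    "Clinical Role": "Roles",
    "Doctor": "Clinical Role",
    "Patient": "Clinical Role",
}
VALUE_TYPES = {"Value Types": None, "Sex": "Value Types", "Female": "Sex", "Male": "Sex"}

HIERARCHIES = {"entity": ENTITIES, "role": ROLES, "value": VALUE_TYPES}


def describe(**terms):
    """Check each term against its own hierarchy and return the combined description."""
    for axis, term in terms.items():
        if term not in HIERARCHIES[axis]:
            raise KeyError(f"{term!r} is not a term in the {axis} hierarchy")
    return dict(terms)


def matches(description, axis, ancestor):
    """True if the description's term on this axis is the ancestor or lies below it."""
    term = description.get(axis)
    parents = HIERARCHIES[axis]
    while term is not None:
        if term == ancestor:
            return True
        term = parents[term]
    return False


# "Female doctor" expressed without any multi-parent node.
female_doctor = describe(entity="Person", role="Doctor", value="Female")
```

Each hierarchy stays a pure tree, and the combination carries the information that the tangled graph encoded with multiple parents.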

Relationships

• Single type of relationship between nodes
• Anatomical System
– part-of
• Cell Type + Pathology
– subclass
• Developmental Stage
– is-a

Anatomical System Ontology

• Untangling of Computational Biology and Informatics Laboratory’s (CBIL) terms (ICDM9)

• removal of all references to tissue type, cell type or developmental stage

• digestive system > pancreatic islets – Anatomical Site (spatial position)

• 372 terms

Cell Type ontology

• fine-grained description of where a gene is expressed.

• listing of human cell types extracted from Gray’s Anatomy (Gray, H. L., Bannister, L. H, Williams, P. L, Collins, P., and Berry, M. M 1995).

• 154 different cell types.

Developmental Stage ontology

• Ordered timeline of human development for the description of gene expression in temporal space

• Examples: “embryo” and “adult”.
• Embryogenesis is further divided into the standard Carnegie stages (www.ana.ed.ac.uk/anatomy/database/humat/) – first two months of human development.

• further divided into weekly and yearly categories

• 133 terms

Pathology Ontology

• WHO ICD-9-CM basis
• Classification of morbidity and mortality information
– Stats and indexing of hospital records by disease and surgery performed
• First two levels
• Sample description
• 141 terms

[Figure: Venn diagram within the total cDNA library collection. Libraries mapped to "liver" (Anatomical System) overlap with libraries mapped to "neoplasia" (Pathology).]

Query: “liver AND neoplasia”

Result: Intersection of libraries mapped to liver and to neoplasia
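The "liver AND neoplasia" query amounts to a set intersection across two ontologies. A minimal sketch, assuming each term maps to a set of library IDs; the IDs and the `libraries_for` helper are hypothetical.

```python
# Hypothetical library IDs mapped to terms in two of the ontologies.
anatomical_system = {
    "liver": {101, 102, 103, 104},
    "brain": {105, 106},
}
pathology = {
    "neoplasia": {103, 104, 106, 107},
    "normal": {101, 102, 105},
}


def libraries_for(anatomy_term, pathology_term):
    """Libraries annotated with both terms (logical AND = set intersection)."""
    by_anatomy = anatomical_system.get(anatomy_term, set())
    by_pathology = pathology.get(pathology_term, set())
    return by_anatomy & by_pathology


result = libraries_for("liver", "neoplasia")
```

Because the four ontologies are orthogonal, a query can combine one term from each without the term sets interfering with one another.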

Ontology              Total cDNA Libraries   Annotated Libraries   Not Annotated
Anatomical System     7016                   6752                  5.2%
Cell type             7016                   410                   94.2%
Developmental Stage   7016                   5891                  17.3%
Pathology             7016                   6401                  10.1%

Most libraries can be annotated with Anatomical System terms as these are generally present in the library record. Less information is available for Cell Type and Developmental Stages as these are not consistently captured during the capture of library information.
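The Not Annotated percentages can be rechecked from the Annotated Libraries counts. The sketch below recomputes them for two denominators. As an observation rather than a correction, the stated percentages are reproduced with a total of 7120 (the 7016 EST libraries plus the 104 SAGE libraries mentioned on the eVOC slide) rather than 7016; whether the authors used that total is an assumption.

```python
annotated = {
    "Anatomical System": 6752,
    "Cell type": 410,
    "Developmental Stage": 5891,
    "Pathology": 6401,
}


def percent_not_annotated(count, total):
    """Share of libraries without a term in the given ontology, as a percentage."""
    return round(100.0 * (total - count) / total, 1)


if __name__ == "__main__":
    for total in (7016, 7016 + 104):
        row = {name: percent_not_annotated(n, total) for name, n in annotated.items()}
        print(total, row)
```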

[Figure: three linked columns, Ontologies, Clone Libraries and ESTs.
Ontologies: Anatomical System: foreskin; Pathology: not classified; Developmental Stage: not classified; Cell Type: fibroblast.
Clone Libraries: Human TNF-treated BG9 fibroblasts (ID:1260); Homo sapiens foreskin fibroblast (ID:1620).
ESTs: U30152, U30154, U30159, U30162, U30163, U30164, U58979.]

The four expression ontologies are used to annotate cDNA clone libraries. ESTs can be transitively associated with ontology terms via their association with a unique clone library.
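The transitive association can be sketched as two lookups: EST to clone library, then clone library to ontology terms. The library annotation below follows the example in the figure on the assumption that those terms belong to library 1620; the EST identifiers, their links to that library and the helper functions are hypothetical.

```python
# Ontology terms attached to a clone library (None = not classified).
library_terms = {
    1620: {  # Homo sapiens foreskin fibroblast
        "Anatomical System": "foreskin",
        "Cell Type": "fibroblast",
        "Pathology": None,
        "Developmental Stage": None,
    },
}

# Each EST comes from exactly one clone library (hypothetical EST identifiers).
est_library = {"EST_A": 1620, "EST_B": 1620}


def terms_for_est(est_id):
    """Return the ontology terms an EST inherits from its clone library."""
    library_id = est_library[est_id]
    return {k: v for k, v in library_terms[library_id].items() if v is not None}


def ests_for_term(ontology, term):
    """Reverse lookup: all ESTs whose library carries the given term."""
    return sorted(
        est for est, lib in est_library.items()
        if library_terms[lib].get(ontology) == term
    )
```

Only the libraries are curated; the ESTs pick up their terms through the library link, so reannotating one library updates every EST derived from it.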

Browsing, Querying and Curation

An interface for browsing, curating and querying the ontologies is under development by Electric Genetics (see poster by Visagie et al. this meeting).

   

Curation

• Central, versioned database of the eVOC ontologies

• Curators who are domain experts add and delete terms or synonyms and make changes to the hierarchies on an ongoing basis

• Groups that modify the ontologies are encouraged to contribute these modifications back to eVOC

Applications

What happens when you link libraries (cDNA/SAGE) or microarray probes to terms in each ontology?

– Expression profile selection of libraries
– Terms > Libraries > Transcripts > Genes
– Genes > Terms
– Breadth of expression
• Assess differential expression levels (SAGE)
• Assess differential tissue expression (cDNA & SAGE)
– Physical distribution of expression across the genome
• Expression profile prioritisation of disease candidates
• Link genome to standardised controlled terms
– Assess expression clustering
– Cross species expression comparison
– Comparison of local data with whole picture
– Choice of libraries by, for instance, molecular pathology: Neoplasia
– Transitive integration with GO

Integration

Current

ICL Candidate Gene Profiler
A disease gene candidate identification system which integrates genomic data with the GO and eVOC ontologies to identify and rank genes which are candidates for known diseases.

Swiss Institute of Bioinformatics Transcriptome Database

Future

Ensembl Datamart: select expression profile in a defined genome region

GOBO apply for incorporation

MGED apply for inclusion as an MGED-approved expression ontology

Human Transcriptome Database

• H-Invitational, Odaiba, Japan
• Human FLcDNA annotation jamboree
• Non-redundant set of mapped, manually curated, expression profiled, classified cDNAs
• eVOC terms used to describe mRNA expression

High resolution of eVOC

• Xu et al. carried out genome-wide detection of alternatively spliced transcripts and identified those which show tissue specificity (Xu, Q., Modrek, B., and Lee, C. 2002)
• Flat list of 46 human tissue classes
• Isoform-specific EST lists provided for a subset of the genes

Gene Name: IRP3
Isoform 1, Xu et al.: Brain-specific
Isoform 1, eVOC: 5 nervous > brain; 1 respiratory > lung
Isoform 2, Xu et al.: No specificity
Isoform 2, eVOC: 2 urogenital > genital > female > uterus; 1 urogenital > genital > female > placenta; 1 haematological > blood
Developmental stage: 4 infant; 3 adult

Gene Name: WNK1
Isoform 1, Xu et al.: Kidney-specific
Isoform 1, eVOC: 7 urinary > kidney
Isoform 2, Xu et al.: No specificity
Isoform 2, eVOC: 2 urogenital > genital > male > penis; 1 alimentary > pancreas

eVOC extends the expression information that can be obtained from other sources. IRP3, described by Xu et al. as having a brain-specific isoform, was shown to be infant brain specific by combining information gathered from the eVOC ontologies. The ESTs for each isoform were submitted to eVOC and the associated terms in each of the four ontologies were examined to identify expression state specificity.

[Figure: workflow diagram with the components GANESH deep annotation engine, ENSEMBL annotation engine, Controlled Expression Vocabulary, Candidate Gene Profiler and candidate gene enrichment. Labels: "Annotation using sequences generated in the lab, and using local domain expertise"; "Annotation served using DAS".]

Exon Skipping in Cancer:

• Determine chromosomal location of 1011 gene set on human genome sequence

• Assess the frequency and tissue distribution of exon skipping

• Determine functional significance of exon skipping

• Can the presence of transcripts demonstrating exon skipping be used as diagnostic/prognostic markers?

• Can the biological effect of the skip on the resulting protein be explained?

Genes with exon skipped transcripts found uniquely in cancer tissues

GENE: CD53 antigen
NORMAL FUNCTION: Panleukocyte marker; may function in the transduction of CD2-generated signals in T cells and natural killer (NK) cells
EFFECT OF SKIP ON PROTEIN: Reading frame remains intact. Prenyl group removed – effect unknown

GENE: Human trans-Golgi p230 (GOLGA4)
NORMAL FUNCTION: Peripheral membrane protein
EFFECT OF SKIP ON PROTEIN: Reading frame remains intact. No known functional motif affected

GENE: PTPN13 (protein tyrosine phosphatase)
NORMAL FUNCTION: Signaling molecule that regulates a variety of cellular processes including cell growth, differentiation, mitotic cycle, and oncogenic transformation.
EFFECT OF SKIP ON PROTEIN: Reading frame remains intact. PDZ domain removed – possible effect on intracellular signalling cascade

Case Study: TRANS-GOLGI P230

• Trans-Golgi p230 gene on chr3
• Membrane protein implicated in vesicular transport from cytoplasmic face of the Golgi
• 17 ESTs confirm skip of exon 2
• Other distinct exon skipping events previously described
• What were the expression terms associated with the exon 2 skip?

Tissue Expression Profile

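[Figure: expression terms for the constitutive exon compared with the skipped exon, grouped by ontology.
Pathological: Normal, Neoplastic.
Anatomical: Alimentary, Dermal, Reproductive, Lymphoreticular, Multisystem, Nervous.
Developmental: Adult, Fetus.
Cell Type: Glioblast, Epithelial, Melanocyte.]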

Encouraging distribution

• Broad acceptance
• Supported offering
– Open source to maximise
• Community involvement
• Acceptance
• Speed of development and resources
• Quality of science
– Commercial model to provide
• Support
• Documentation
• Customisation
• Distribution
• Accountability and interface to other commercial entities

Availability

– Can be used and modified without restriction under a BSD-style license

– Mailing list for comments, questions and suggestions: [email protected]

– Commercial support with Electric Genetics commercial grade software, customisation and enhanced mappings to commercial clones, libraries, proprietary data

Anatomy and Gene Expression Workshop

• Resource repository site @ NCI and mirrors
• Resource name
• Access type (FTP)
• Language/access software
• Purpose
• Domain
• Level of commitment
• Corresponding author (and when will respond by?)
• Status (dev/production)
• Used by? - applications

SANBI: Janet Kelso, Alan Christoffels, Soraya Bardien

Electric Genetics: Johann Visagie, Darren Otgaar, Gary Greyling, Tania Hide

Imperial College London: Damian Smedley, Mark McCarthy

Swiss Institute of Bioinformatics: Gregory Theiler, Victor Jongeneel

ENSEMBL: Arek Kasprzyk and Datamart team

SANBI

www.sanbi.ac.za

electric genetics