Literature Based Framework for Semantic Descriptions of e-Science resources
Hammad Afzal
PhD, Computer Science
The University of Manchester, UK
Seminar at: National University of Sciences and Technology,
Islamabad
Dated: April, 2010
A Literature based framework for semantic descriptions of e-Science
resources
Hammad.Afzal@deri.org
Who am I
A former PhD student at the University of Manchester (finished in Dec 2009).
A former Research Fellow at the Digital Enterprise Research Institute (DERI), National University of Ireland (finished in Dec 2009).
At the University of Manchester: Text Mining Group. Worked to automate the semantic description of bioinformatics resources using Natural Language Processing (NLP) techniques on the large amount of relevant literature available online.
At DERI: Unit of Natural Language Processing. Worked on methods for the automatic or semi-automatic generation of multilingual lexicons for domain ontologies, exploiting Web-based and language resources.
e-Science Perspective
The development of the Web has changed the way research is done.
Resources are now mostly outside a researcher’s office: scientific data, knowledge and computational resources are typically distributed over the Internet. This paradigm is largely known as e-Science.
e-Science is an infrastructure for the systematic development of research methods that involve distributed resources (Web services, data and knowledge resources, and computational resources) and their application to research.
e-Science Resources
The resources involved in e-Science are known as e-Science resources. They include:
Scientific literature databases (e.g. PubMed, PubChem).
Tool repositories (e.g. bioinformatics tools and services provided by the European Bioinformatics Institute (EBI)).
Social-network-like portals where scientists can exchange knowledge and comments (e.g. myExperiment, F1000 Biology).
Semantic Web
Provides machine understandability by adding machine-processable semantics to the conventional Web infrastructure.
Has revolutionised the paradigms of resource sharing and service provision by adding meaning to resources (services, data) through associated semantics (formal descriptions of their meaning).
Semantic Web
Ontologies
An integral part of the Semantic Web.
The specification of conceptualisations, used to help programs and humans share knowledge (Gruber, 1995).
Capture and computationally represent knowledge shared by people in a certain community (Hadzic and Chang, 2004).
Represent a set of concepts (typically with precise definitions), which are mutually linked through a number of relationships.
Examples:
Bioinformatics e-Resources
Bioinformatics: a pioneer adopter of e-Science.
The use of computational and mathematical techniques to store, manage, and analyse the data from molecular biology in order to answer questions about biological phenomena (Lord et al., 2004).
Emerged from molecular biology laboratories: an enormous amount of data is produced, along with various tools (Web services) that operate on that data.
Bioinformaticians typically decompose high-level tasks into simpler modules and choose the most appropriate class of service to accomplish each sub-task using different data resources, many of which are distributed (Wroe et al., 2004).
Semantic Descriptions of Bioinformatics e-Resources
A number of bioinformatics tools and resources are available for service use and composition.
A guesstimate is that 3000+ Web services are publicly available.
How to find a service? What is out there to use? Provenance?
Efficient use of resources requires making them discoverable by potential users.
Their functional capabilities need to be described, so that they are accessible not only to humans but also to machines (resource crawlers, software agents etc).
BioCatalogue
Beta version at http://beta.biocatalogue.org/
Launch: June 2009 at ISMB
Semantic annotation of bioinformatics services
annotate functional capabilities, e.g. Taverna, myGrid, myExperiment, EBI
Not only services and tools: also databases, repositories, corpora
Manual curation, e.g. myGrid, BioCatalogue etc.
e.g. Taverna/Feta: only ~15-20% functionally described
backlog, and the number of services is growing
Semantic Descriptions in Bioinformatics Domain
Our approach – Mine the literature
Literature: Still the largest and most popular source of knowledge.
Hypothesis: The semantic profiles of entities and events can be extracted from the domain literature.
Example: Semantically Annotated Web Service
Annotations combine textual descriptions and ontological mappings
Detailed approach
The rest of the talk
Methodology
A literature-based methodology to develop and maintain existing domain knowledge representations, in particular controlled vocabularies and ontologies.
An integrated literature-based methodology for the extraction of resource description profiles.
Building semantic networks of resources from their descriptions.
What next?
1st Module: Building Controlled Vocabulary from Literature
Terminology Building
First step towards knowledge acquisition from unstructured text.
Structurally organised terms help in Information Retrieval (IR), Information Extraction (IE), document summarisation etc.
Used in annotation tasks: predefined and authorised terms, known as controlled vocabularies (CVs), provide domain-specific tags to enrich data or textual resources.
Terms provide the basis for ontologies, controlled vocabularies and taxonomies used in the Semantic Web.
Terms are automatically identified in literature using Automatic Term Recognition (ATR) techniques.
Controlled Vocabulary Building – a challenging task
In dynamic domains, new terms representing new domain concepts are continuously introduced.
Generic ATR techniques fail to differentiate between terms related to a specific task and generic domain terms in heterogeneous text (in particular scientific articles in cross-disciplinary domains)
Term classification: assigning terms to domain-specific classes.
Narrows down the specific meaning of a concept described by a given term.
For example, in biomedicine, terms can be assigned to classes such as genes, proteins, mRNAs, diseases, etc.
Can help in building controlled vocabularies by classifying instances of specific and focused sub-classes of interest.
Controlled Vocabulary Building – Solution
Building controlled vocabulary from literature
Term Classification driven approach
1) get a corpus
2) get all terms
3) get seed examples
4) find relevant ones using term profiling and comparison to seed examples
Learn bioinformatics terms from literature
Bioinformatics terminology
Use seed terms to bootstrap
e.g. known descriptors used in existing service descriptions, either in literature or service repositories
250 terms identified, with manual pruning after automatic term recognition
examples of lexical constituents and textual behaviour (pragmatics):
lexical profiling
contextual profiling
Bioinformatics terminology
Lexical profiling: what is in the name
Contextual profiling: characterise sentences in which terms appear (nouns, verbs and context patterns)
Comparing candidate term profiles to seed terms: average, best match
Lexical Profile
Term (t): Lexical Profile LP(t)
protein: (1) protein
protein sequence: (1) protein, (2) sequence, (3) protein sequence
protein sequence alignment: (1) protein, (2) sequence, (3) alignment, (4) protein sequence, (5) sequence alignment, (6) protein sequence alignment
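The lexical profile in the table above can be read as the set of all contiguous word sequences inside a term. The sketch below is one minimal way to compute it, assuming whitespace tokenisation and lower-casing; the function name `lexical_profile` is illustrative and not taken from the talk.

```python
def lexical_profile(term: str) -> list[str]:
    """Return every contiguous word sequence of a term, shortest first.

    Assumption: words are separated by whitespace and case is ignored.
    """
    words = term.lower().split()
    profile = []
    for size in range(1, len(words) + 1):            # sequence length
        for start in range(len(words) - size + 1):   # sliding window
            profile.append(" ".join(words[start:start + size]))
    return profile


for t in ("protein", "protein sequence", "protein sequence alignment"):
    print(t, "->", lexical_profile(t))
```

For "protein sequence alignment" this yields the six entries listed in the table, in the same order.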
Contextual Profile
Sentence: Genscan program node can produce a list of nucleotide FASTAs of predicted transcripts
Verb profile: produce
Noun profile: genscan, program, list, transcript
Left pattern (LP), class-level (LP1): <Term>, produce, <NP>, of
Right pattern (RP), class-level (RP1): of, <NP>
Profile Comparisons
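Overall_Relevance(t) combines LR(t), CNR(t), CVR(t) and CPR(t).

\[ \mathrm{CNR}_{\mathrm{avg}}(t) = \frac{1}{n} \sum_{i=1}^{n} \frac{2\,|\mathrm{CNP}(t) \cap \mathrm{CNP}(S_i)|}{|\mathrm{CNP}(t)| + |\mathrm{CNP}(S_i)|} \]

\[ \mathrm{CNR}_{\max}(t) = \max_{S_i \in STS} \frac{2\,|\mathrm{CNP}(t) \cap \mathrm{CNP}(S_i)|}{|\mathrm{CNP}(t)| + |\mathrm{CNP}(S_i)|} \]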
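As a rough illustration of the profile comparison, the sketch below scores a candidate term against seed terms. It assumes that the context-noun relevance is a Dice-style overlap between the candidate's context noun profile CNP(t) and each seed profile CNP(S_i), taken either as the average or as the best match over the seeds. The function names and the toy profiles are hypothetical.

```python
def dice(a: set, b: set) -> float:
    """Dice overlap 2|A ∩ B| / (|A| + |B|); 0.0 when both sets are empty."""
    if not a and not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))


def cnr_avg(candidate: set, seeds: list) -> float:
    """Average overlap between the candidate profile and all seed profiles."""
    return sum(dice(candidate, s) for s in seeds) / len(seeds)


def cnr_max(candidate: set, seeds: list) -> float:
    """Best-match overlap between the candidate profile and any seed profile."""
    return max(dice(candidate, s) for s in seeds)


# Toy context-noun profiles (invented for illustration only).
candidate_profile = {"program", "list", "transcript"}
seed_profiles = [{"program", "tool"}, {"list", "transcript", "sequence"}]

print(cnr_avg(candidate_profile, seed_profiles))
print(cnr_max(candidate_profile, seed_profiles))
```

The same two functions could be reused for verb and pattern profiles, since they only depend on set overlap.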
Bioinformatics terminology
Comparison between Profile based term classification and generic Term Recognition (c-Value method)
Statistics about textual corpus
Full-text articles
# of documents: 2,691
# of distinct candidate terms: 113,280
# of candidate term occurrences: 533,418
# of distinct sentences: 294,614
# of distinct context noun stems: ~79,000
# of distinct context verb stems: ~2,500
The Bioinformatics Controlled Vocabulary
Number of terms
ATR (C-Value), total number of candidate terms: 113,280
Number of terms with lexical similarity to resource terms: 95,437
Number of terms with context noun similarity to resource terms: 103,104
Number of terms with context verb similarity to resource terms: 73,478
Number of terms with context pattern similarity to resource terms: 21,182
Number of terms with combined contextual similarity (Nouns ∪ Verbs ∪ Patterns): 98,307
2nd Module: Mining Semantic Descriptions from Literature
Mining service descriptions
Informatics concepts: general concepts of data, data structures, databases, metadata
Bioinformatics concepts: domain-specific data sources and algorithms for searching and analysing data, e.g. Smith-Waterman algorithm
Semantic classes – myGrid Ontology
Molecular biology concepts: higher-level concepts used to describe bioinformatics data types, used as inputs and outputs in services, e.g. protein sequence, nucleic acid sequence
Task concepts: generic tasks a service operation can perform, e.g. retrieving, displaying, aligning
Semantic classes – myGrid Ontology
Semantic classes identification
Engineered from MyGrid bioinformatics sub-ontology
Semantic class: Typical terminological heads
Application: application, tool, service, software, system, program
Algorithm: algorithm, method, approach, procedure, analysis, alignment
Data: data, record, report, sequence, structure
Data Resource: resource, database, dataset, repository
Resource mentions
Named-entity recognition (NER) task
Recognition of service mentions using terminological (semantic) heads of automatically recognised terms
Apollo2Go Web Service is an Application
BIND database is a Data source
assign the corresponding semantic class
Hearst patterns (co-ordinations, appositions, enumerations, etc.)
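The head-based assignment of semantic classes can be sketched as a dictionary lookup. The head lists below are the ones given for the four semantic classes; treating the last word of a term as its head is a simplifying assumption for English noun phrases, and `semantic_class` is an illustrative name rather than part of the described system.

```python
# Typical terminological heads per semantic class (from the slides).
HEADS = {
    "Application": {"application", "tool", "service", "software", "system", "program"},
    "Algorithm": {"algorithm", "method", "approach", "procedure", "analysis", "alignment"},
    "Data": {"data", "record", "report", "sequence", "structure"},
    "Data Resource": {"resource", "database", "dataset", "repository"},
}


def semantic_class(term: str):
    """Return the semantic class whose head list contains the term's head.

    Assumption: the head is the last whitespace-separated word of the term.
    Returns None when the head is not listed for any class.
    """
    words = term.lower().split()
    if not words:
        return None
    head = words[-1]
    for cls, heads in HEADS.items():
        if head in heads:
            return cls
    return None


for mention in ("Apollo2Go Web Service", "BIND database", "Smith-Waterman algorithm"):
    print(mention, "->", semantic_class(mention))
```

A real system would take heads from a parser or term recogniser instead of the last token, and would need to handle plurals and name variation.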
Semantic classes and instances
Extraction/functional rules
Manually designed predicate-driven rules: Subject (Arg) – Verb (Predicate) – Object (Arg)
Applied on dependency-parsed sentences (Stanford parser)
no phrase structures
complex sentences: information in sub-clauses
“Matrix Global Alignment Tool MatGAT generates similarity/identity matrices for DNA or protein sequences”
“Term_App generates similarity/identity matrices for DNA or protein sequences”
Extraction/functional rules
Phrase structures identified and integrated with the dependency parse
Predicate-dependent rules applied to extract specific ‘content’ and profile the services
Profiles collated for all mentions (service name variation)
“Matrix Global Alignment Tool MatGAT generates similarity/identity matrices for DNA or protein sequences”
“Term_App generates similarity/identity matrices for DNA or protein sequences”
Extraction/functional rules
Predicate-driven rules: each verb associated with the type of “information content” it provides
Function: Associated verbs
Generic functionality / task specification: applied, access, achieve, align, allow, based, developed, implemented, present, provide, used, is a, called
Inputs, outputs: accept, applied, create, provide, query, retrieve, starts with, take, used, generate
Comparison: outperform, perform, compare
Implementation technique, programming language: implement(ed)
Composition, subtasks: contain(ed), construct(ed), generate(d)
Availability: available
Information Extraction
SC instance (resource): Matrix Global Alignment Tool MatGAT
SC: Application
Task: Generate
Predicted input: DNA or protein sequences
Predicted output: similarity/identity matrices
Descriptors: similarity/identity matrices, DNA or protein sequences
Input sentence: “Matrix Global Alignment Tool MatGAT generates similarity/identity matrices for DNA or protein sequences”
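To make the predicate-driven step concrete, the sketch below fills a resource profile from an already parsed clause. Everything about the data structures is an assumption: `Clause` stands in for the output of a dependency parser, and the `RULES` table (which argument of which verb is treated as input or output) is a hypothetical illustration, not the rule set used in the work.

```python
from dataclasses import dataclass, field


@dataclass
class Clause:
    """Hypothetical container for one parsed subject-verb-object clause."""
    subject: str
    verb: str                      # verb lemma, e.g. "generate"
    obj: str
    prep_objects: dict = field(default_factory=dict)  # preposition -> phrase


# Hypothetical rules: where each verb's predicted input/output is found.
RULES = {
    "generate": {"output": "obj", "input": "for"},
    "accept": {"input": "obj"},
    "take": {"input": "obj"},
}


def _slot(clause: Clause, source):
    """Resolve a rule source ('obj' or a preposition) to a phrase, if any."""
    if source is None:
        return None
    if source == "obj":
        return clause.obj
    return clause.prep_objects.get(source)


def build_profile(clause: Clause, semantic_class: str):
    """Build a profile dictionary, or None if no rule covers the verb."""
    rule = RULES.get(clause.verb)
    if rule is None:
        return None
    predicted_input = _slot(clause, rule.get("input"))
    predicted_output = _slot(clause, rule.get("output"))
    return {
        "SC instance (resource)": clause.subject,
        "SC": semantic_class,
        "Task": clause.verb.capitalize(),
        "Predicted input": predicted_input,
        "Predicted output": predicted_output,
        "Descriptors": [d for d in (predicted_output, predicted_input) if d],
    }


clause = Clause(
    subject="Matrix Global Alignment Tool MatGAT",
    verb="generate",
    obj="similarity/identity matrices",
    prep_objects={"for": "DNA or protein sequences"},
)
profile = build_profile(clause, "Application")
for key, value in profile.items():
    print(f"{key}: {value}")
```

Keeping the rules in a table keyed by verb mirrors the idea that each predicate is associated with the type of information it provides, so new verbs can be added without touching the extraction code.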
Experiments
2657 BMC Bioinformatics articles (full-text articles before March 2008)
108 predicates used
Semantic Class Total # of instances
Algorithm 5,722
Application 2,076
Data 2,662
Data Resource 1,992
Total 12,452
Example – GeneClass
1) Resource descriptors
Descriptor: Frequency of co-occurrence
motif data: 4
differential gene expression: 3
reliable predictive model: 2
genome-wide protein-DNA binding data: 2
transcriptional gene regulation: 2
gene expression data: 1
2) MyGrid terms
BIND
3) Related resources
Robust GeneClass Algorithm
Example – GeneClass
Functional content: Input/Output
Predicate (Task): predict
Subject: GeneClass Algorithm
Functional description: predicting differential gene expression starts with a candidate set of motifs μ
Example – GeneClass
Sentences
1. We also show how to incorporate genome-wide protein-DNA binding data from ChIP chip experiments into the GeneClass algorithm, and we use an improved noise model for gene expression data [PMC 1810316].
2. The GeneClass algorithm for predicting differential gene expression starts with a candidate set of motifs μ representing known or putative regulatory element sequence patterns and a candidate set of regulators or parents [PMC 1810316].
3. Target set: We extend the original GeneClass algorithm to use all target genes for which both motif and expression data is available [PMC 1810316].
Evaluated for their capability to be used for semantic description of a given bioinformatics resource
(0) irrelevant
(1) partially useful
(2) useful
HeatMapper: The HeatMapper tool has already proven to be very useful in several studies
Kalign: To compare Kalign to other MSA programs, the following test sets were used.
Cognitor: To add a new species to the COG system, the annotated protein sequences from the respective genome were compared to the proteins in the COG database by using the BLAST program and assigned to pre-existing COGs by using the COGNITOR program
Evaluation of semantic profiles
Two experiments:
15 well-known resources with descriptions already available
15 new resources
Evaluation of semantic profiles
Quality comparison of various components of resource description profiles from the two experiments
3rd Module: Mining Semantic Networks from Literature
What next?
Good recall, poor precision: context needs a better model
Mining parameter values: sub-language of parameters
Candidate service/resource mentions: an entity whose profile looks like a service
comparison of semantic profiles
network of services [ISMB 2009]
Do we have good service ontologies?
What Next? (Proposed in BioHackathon2010)
Phylogenetic trees are then generated by the ClustalW program by the neighbour-joining method [PMC1973088].
We also used the CLUSTALW program for multialignment as a control process [PMC434493].
[Figure: relations from the two sentences held in an RDF store. Nodes: Phylogenetic Tree (Resource1), ClustalW Program (Resource2), Multialignment (Resource3). Edges: Phylogenetic Tree "Generated by" ClustalW Program; ClustalW Program "Is used for" Multialignment. Class labels shown: #Data, #Task.]
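The two relations above (Phylogenetic Tree generated by ClustalW Program; ClustalW Program is used for Multialignment) could be written to an RDF store as triples. The sketch below serialises them as N-Triples using plain string formatting. The base URI `http://example.org/escience/` and the way labels are turned into URIs are hypothetical choices made only for this illustration.

```python
BASE = "http://example.org/escience/"  # hypothetical namespace


def uri(label: str) -> str:
    """Turn a label into an N-Triples URI reference under the assumed base."""
    return "<" + BASE + label.strip().replace(" ", "_") + ">"


def to_ntriples(relations) -> str:
    """Serialise (subject, predicate, object) label triples as N-Triples."""
    return "\n".join(f"{uri(s)} {uri(p)} {uri(o)} ." for s, p, o in relations)


relations = [
    ("Phylogenetic Tree", "generated by", "ClustalW Program"),
    ("ClustalW Program", "is used for", "Multialignment"),
]

print(to_ntriples(relations))
```

In practice an RDF library and an agreed ontology for the predicates would replace the hand-built URIs.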
Conclusion
Literature mining approach to service description and annotation
Aims:
reduce curation efforts
provide semantic synopses of services for the Semantic Web
Potential of text mining:
integration with other annotation approaches
extracting the entire service context is still challenging
Related Selected Publications
Hammad Afzal, James Eales, Robert Stevens, Goran Nenadic (2010): Mining Semantic Networks of Bioinformatics Web Resources from the Literature, Journal of Biomedical Semantics.
Hammad Afzal, Robert Stevens, Goran Nenadic (2009): Mining Semantic Descriptions of Bioinformatics Web Resources from the Literature, 6th European Semantic Web Conference (ESWC) on the Semantic Web: Research and Applications, Heraklion, Crete, Greece, Springer-Verlag.
Hammad Afzal, Robert Stevens, Goran Nenadic (2008): Towards Semantic Annotation of Bioinformatics Services: Building a Controlled Vocabulary, Third International Symposium on Semantic Mining in Biomedicine (SMBM 2008).
Thanks