The Progress on Sagace and Data Integration
Maori Ito
Two main topics
• Sagace: Cross Search
• RDF: Data Integration
Sagace
• Search for Biomedical Data & Resources in Japan
Features
• Focus on biomedical databases
• Manual and semi-automated ranking
• Refining search results with facets
• More informative search results with metadata
Mechanism of Search Engine
1. Crawling
2. Indexing
3. Query Processing
4. Scoring
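The indexing, query processing, and scoring steps listed above can be illustrated with a toy inverted index. This is a generic sketch and not Sagace's implementation: the function names, the whitespace tokenizer, and the term-count scoring are all assumptions made for illustration, and the crawled pages are passed in as a plain dictionary.

```python
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def build_index(pages):
    """Indexing: map each term to a counter of {page id: occurrences}.

    `pages` is a dict of page id -> crawled text (hypothetical input).
    """
    index = defaultdict(Counter)
    for page_id, text in pages.items():
        for term in tokenize(text):
            index[term][page_id] += 1
    return index


def search(index, query):
    """Query processing and scoring: sum term counts per page, best score first."""
    scores = Counter()
    for term in tokenize(query):
        for page_id, count in index.get(term, {}).items():
            scores[page_id] += count
    return [page_id for page_id, _ in scores.most_common()]


# Hypothetical crawled pages, used only to exercise the sketch.
pages = {
    "db1": "acetaminophen drug target",
    "db2": "acetaminophen acetaminophen dosage",
    "db3": "glycan structure",
}
index = build_index(pages)
hits = search(index, "acetaminophen")
```

A real cross search engine would add a crawler in front of this and a far richer scoring function behind it; the point here is only how the four steps fit together.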
Crawling
Databases
Crawling Program
Indexing
• Split the data into conveniently sized pieces and store them on our own server
Internal Server
Indexing Data
Query Processing and Scoring
NIBIO
MEDALS
JCGGDB
NBDC / DBCLS
AgriTogo
Collaborate by using P2P architecture
Search System
What is the most important thing in cross search?
Speed and accuracy!
Features
• Focus on biomedical database
• Semi-automated Ranking
Log analysis, reflected in the search results
• The members of the top 8 databases are almost always the same.
– Patents
– KEGG MEDICUS
– Medicine and pharmaceutical proceedings
– Drug emergency call
– Ingredients information of health food
– Merck Manual
– Medical Information Network Distribution Service
– The Encyclopedia of Psychoactive Drugs
Comparison of databases
• Popular databases are medical or pharmaceutical “text-rich” databases.
• The top databases take almost all of the clicks!
• More than half of the databases have never been clicked!
Log data has been reflected in ranking.
• Original score -> A: 12,000, B: 8,000
• Gather click data
• Eliminate duplicate clicks on the same database within the same day and pick the lowest displayed rank.
– If the database score is lower than 12,400, add 200.
– Otherwise add 100 by default, but if the database's displayed rank is lower than 10, add 200.
• The Patents score is fixed at 8,100.
• The maximum score is 30,000.
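The score update rules above could be sketched as follows. This is one possible reading, not the actual Sagace code: it assumes that a "lower" rank means a position further down the result list (a larger rank number), so "lower than 10" is treated as `rank > 10` and the "lowest" rank of the day is the largest number. All function and variable names are hypothetical.

```python
MAX_SCORE = 30_000      # maximum score stated in the slides
PATENTS_SCORE = 8_100   # the Patents score is fixed


def dedupe_clicks(clicks):
    """Keep one click per (database, day).

    `clicks` is an iterable of (database, day, rank) tuples.
    Assumption: the "lowest" rank is the largest rank number.
    """
    kept = {}
    for database, day, rank in clicks:
        key = (database, day)
        kept[key] = max(rank, kept.get(key, rank))
    return kept


def updated_score(database, score, rank):
    """Apply one click to a database score."""
    if database == "Patents":
        return PATENTS_SCORE
    # +200 for low-scoring databases or clicks far down the list, else +100.
    bonus = 200 if score < 12_400 or rank > 10 else 100
    return min(score + bonus, MAX_SCORE)


def apply_clicks(scores, clicks):
    """Return a new score table after applying the deduplicated clicks."""
    new_scores = dict(scores)
    for (database, _day), rank in dedupe_clicks(clicks).items():
        new_scores[database] = updated_score(database, new_scores[database], rank)
    return new_scores


# Hypothetical scores and click log, used only to exercise the sketch.
scores = {"A": 12_000, "B": 20_000, "C": 29_950, "Patents": 8_100}
clicks = [("A", "d1", 3), ("A", "d1", 12), ("B", "d1", 1),
          ("C", "d1", 15), ("Patents", "d1", 2)]
new_scores = apply_clicks(scores, clicks)
```

If "lower than 10" was instead meant as a rank number below 10, only the `rank > 10` comparison and the `max` in `dedupe_clicks` would need to change.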
Unpopular databases
• Sagace started service in March 2012.
• Some databases have never been clicked since then.
• Eliminate these databases.
• Databases
– 272 DB -> 122 DB
Results
• Accuracy for users should have improved.
• Reducing the number of databases also sped up the search.
Specific databases in life science
• Some life science databases lack textual (“literal”) information.
• A cross search engine is best suited to showing textual information.
• Metadata will help these databases.
Metadata
• If the developers mark up data with metadata…
Metadata
• Textual information can be added to the search results!
Results Image
How to mark up and reflect the results?
【 HTML 】
<div itemscope itemtype="http://schema.org/BiologicalDatabaseEntry">
  <span itemprop="dateModified">2012-10-24</span>
</div>
【 Result 】
Declare the scope (itemscope) and itemtype on an ordinary HTML tag
Select the property (itemprop); the tag's content is its value
Win Win Win!
• Database developers can showcase rich database information.
• Users can find valuable information easily.
• Crawler programs can find this metadata reliably.
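To illustrate how a crawler might pick up such markup, here is a minimal sketch using Python's standard `html.parser`. It is not the Sagace crawler: the class name is hypothetical, and it only handles flat items whose property values are plain text (no nested `itemscope`, no `content` or `href` attributes).

```python
from html.parser import HTMLParser


class MicrodataParser(HTMLParser):
    """Collect flat microdata items as {"type": ..., "properties": {...}}."""

    def __init__(self):
        super().__init__()
        self.items = []
        self._current_prop = None  # itemprop currently being read, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "itemscope" in attrs:  # boolean attribute, its value is None
            self.items.append({"type": attrs.get("itemtype"), "properties": {}})
        prop = attrs.get("itemprop")
        if prop and self.items:
            self._current_prop = prop
            self.items[-1]["properties"][prop] = ""

    def handle_data(self, data):
        # Text inside an itemprop element becomes that property's value.
        if self._current_prop is not None:
            self.items[-1]["properties"][self._current_prop] += data

    def handle_endtag(self, tag):
        self._current_prop = None


html = (
    '<div itemscope itemtype="http://schema.org/BiologicalDatabaseEntry"> '
    '<span itemprop="dateModified">2012-10-24</span></div>'
)
parser = MicrodataParser()
parser.feed(html)
items = parser.items
```

With the snippet from the slide, `items` holds a single entry whose `dateModified` property is the date string, which a search engine could then show alongside the hit.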
What is schema.org?
• “Schema.org is a set of extensible schemas that enables webmasters to embed structured data on their web pages for use by search engines and other applications.”
• “Search engines including Bing, Google, Yahoo! and Yandex rely on this markup to improve the display of search results, making it easier for people to find the right web pages.”
(http://schema.org/)
Microdata
“You use the schema.org vocabulary, along with the microdata format, to add information to your HTML content.”
(http://schema.org/docs/gs.html)
• Finalizing the proposed schema.org extension is required before major search engines will show “rich” results.
Current Situation
• We have defined our own properties (entryID, isEntryOf, taxon, seeAlso, reference).
• Please refer to
– http://sagace.nibio.go.jp/press/metadata/markup/
6 DBs, 1 catalog and 1 DB archive have applied microdata!
• DoBISCUIT (Database Of BIoSynthesis clusters CUrated and InTegrated)
• JCRB Cell Bank
• Functional Glycomics with KO mice database
• Glyco-Disease Genes Database
• JCGGDB Report
• MEDALS
• Integbio Database Catalog
• Life Science Database Archive
To add biological database vocabularies to schema.org,
• “Need more people who think it is a good idea.” (by organizers @ schema.org)
– [email protected] (<- mailing list. Let's join!)
• We need more databases and web pages that are marked up with microdata.
• I want your opinion on microdata.
• Let's talk!
http://www.mkbergman.com/968/a-new-best-friend-gephi-for-large-scale-networks/
Data Integration with RDF
http://www.cytoscape.org/what_is_cytoscape.html
What is RDF?
• Resource Description Framework
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix drugbank: <http://bio2rdf.org/drugbank:> .
drugbank:DB00316 rdfs:label "Acetaminophen" .
RDF
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix drugbank: <http://bio2rdf.org/drugbank:> .
@prefix drugbank_vocab: <http://bio2rdf.org/drugbank_vocabulary:> .
@prefix drugbank_target: <http://bio2rdf.org/drugbank_target:> .
drugbank:DB00316 rdfs:label "Acetaminophen" ;
    drugbank_vocab:target drugbank_target:290 .
drugbank_target:290 rdfs:label "Prostaglandin G/H synthase 2" .
RDF
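[Figure: the triples above drawn as a graph]
subject | predicate | object
drugbank:DB00316 | rdfs:label | "Acetaminophen"
drugbank:DB00316 | drugbank_vocab:target | drugbank_target:290
drugbank_target:290 | rdfs:label | "Prostaglandin G/H synthase 2"
(drugbank_target:290 is the object of one triple and the subject of another.)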
SPARQL (SPARQL Protocol and RDF Query Language)
• “SPARQL (pronounced "sparkle", a recursive acronym for SPARQL Protocol and RDF Query Language) is an RDF query language, that is, a query language for databases, able to retrieve and manipulate data stored in Resource Description Framework format.”
(http://en.wikipedia.org/wiki/SPARQL)
How to use?
What is the target of “Acetaminophen”?
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix drugbank: <http://bio2rdf.org/drugbank:> .
@prefix drugbank_vocab: <http://bio2rdf.org/drugbank_vocabulary:> .
@prefix drugbank_target: <http://bio2rdf.org/drugbank_target:> .
drugbank:DB00316 rdfs:label "Acetaminophen" ;
    drugbank_vocab:target drugbank_target:290 .
drugbank_target:290 rdfs:label "Prostaglandin G/H synthase 2" .
RDF
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX drugbank: <http://bio2rdf.org/drugbank_vocabulary:>
# DISTINCT excludes duplicate results
SELECT DISTINCT ?v WHERE {
  ?s rdfs:label "Acetaminophen" ;
     drugbank:target ?t .
  ?t rdfs:label ?v .
}
SPARQL
Results!
"Prostaglandin G/H synthase 2"
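To make the pattern matching behind the query concrete, the same lookup can be written by hand over a list of triples. This is only a sketch of what a SPARQL engine does for this one query, not a general implementation; the function name is hypothetical, and prefixed names are kept as plain strings for brevity.

```python
RDFS_LABEL = "rdfs:label"
TARGET = "drugbank_vocab:target"

# The three triples from the Turtle example, as (subject, predicate, object).
triples = [
    ("drugbank:DB00316", RDFS_LABEL, "Acetaminophen"),
    ("drugbank:DB00316", TARGET, "drugbank_target:290"),
    ("drugbank_target:290", RDFS_LABEL, "Prostaglandin G/H synthase 2"),
]


def target_labels(triples, drug_label):
    """Return the distinct labels of every target of the drug with this label."""
    # ?s rdfs:label "<drug_label>"
    subjects = {s for s, p, o in triples if p == RDFS_LABEL and o == drug_label}
    # ?s drugbank_vocab:target ?t
    targets = {o for s, p, o in triples if p == TARGET and s in subjects}
    # ?t rdfs:label ?v  (the set removes duplicates, like DISTINCT)
    return sorted({o for s, p, o in triples if p == RDFS_LABEL and s in targets})


labels = target_labels(triples, "Acetaminophen")
```

Each comprehension corresponds to one triple pattern in the WHERE clause, and the shared variables (`?s`, `?t`) become the sets passed from one step to the next.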
SPARQL Endpoint
What is the target of “Acetaminophen”?
e.g. http://drugbank.bio2rdf.org/sparql
Results
• You can get results from the endpoint.
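Querying an endpoint from a program might look like the sketch below. It follows the standard SPARQL protocol (a GET request with a `query` parameter) and the standard SPARQL JSON results format. The helper names are hypothetical, and whether the Bio2RDF endpoint is reachable or accepts this exact request is an assumption, so the code only builds the request and parses a response rather than sending anything.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

ENDPOINT = "http://drugbank.bio2rdf.org/sparql"  # endpoint named in the slides

QUERY = """
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX drugbank: <http://bio2rdf.org/drugbank_vocabulary:>
SELECT DISTINCT ?v WHERE {
  ?s rdfs:label "Acetaminophen" ;
     drugbank:target ?t .
  ?t rdfs:label ?v .
}
"""


def build_request(endpoint, query):
    """Build (but do not send) a SPARQL protocol GET request asking for JSON."""
    url = endpoint + "?" + urlencode({"query": query})
    return Request(url, headers={"Accept": "application/sparql-results+json"})


def extract_values(results_json, variable):
    """Pull the values bound to one variable out of a SPARQL JSON result."""
    data = json.loads(results_json)
    bindings = data["results"]["bindings"]
    return [b[variable]["value"] for b in bindings if variable in b]


request = build_request(ENDPOINT, QUERY)
```

To actually run the query, pass `request` to `urllib.request.urlopen`, decode the response body, and hand it to `extract_values(body, "v")`.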
RDFization in life science
• Much data has already been RDFized.
• Affymetrix, DrugBank, GO, OMIM, KEGG, PDB, UniProt, PubMed...
Let’s try!
• Bio2RDF
– http://bio2rdf.org/
• EBI RDF Platform
– http://www.ebi.ac.uk/rdf/
• SPARQL endpoint
– e.g. http://drugbank.bio2rdf.org/sparql
• How to learn?
– Learning SPARQL
Pros of RDF
• Well suited to life science data
• Comparison to RDB
– Easily expanded
– RDB RDF
• Works well with NoSQL too
– key value
Cons of RDF
• A bit hard to make RDF
• A bit hard to set up development environments
• Speed of SPARQL
Current situation in NIBIO
• Toxygates
– Johan-san and Igarashi-san have been developing it.
• Orphan Drug Data
Toxygates
• RDFization of Open TG-Gates data.
– microarray data, pathological data (kidney, liver, grade, ...)
• Linked to other databases by using RDF
– KEGG pathway
– GO terms
– CHEMBL
– DrugBank
http://toxygates.nibio.go.jp/
Orphan Drug
• RDFize the orphan drug information held at NIBIO.
<http://www.nibio.go.jp/orphanDrugTarget#80>
    drgn:designationFiscalYear "1996" ;
    drgn:designationDate "1996/4/1" ;
    drgn:number "(8yaku A) No. 81" ;
    drgb:name "Imiglucerase" ;
    dc:description "Improvement of symptoms of anaemia, thrombocytopenia, hepatosplenomegaly, bone symptoms, etc. in patients with Gaucher's disease" ;
    drgn:designationApplicant "Genzyme Japan K.K." ;
    drgb:pharmacology "Improvement of symptoms of anaemia, thrombocytopenia, hepatosplenomegaly, bone symptoms, etc. in patients with Gaucher's disease" ;
    drgb:manufacturer "Genzyme Japan K.K." ;
    eob:approvalDate "1998/3/6" ;
    drgb:product "Cerezyme injection 200U" ;
    drgb:brand "CEREZYME_ injection" ;
    drgn:approvedName "Imiglucerase (Genetical Recombination)" ;
    drgn:status "Approved" .
Let’s try, and give me your ideas!
• RDF will spread to many kinds of data in life science.
• NBDC has encouraged this movement.
Future Perspective
• RDFize other databases in NIBIO
– E.g. bioresource
• Examine the benefit
• Spread RDF to many scientists
• Make useful environments for those who are not familiar with computers