Cross Search Service for Life Science and Semantic web
National Institute of Biomedical Innovation Maori Ito
Presentation Materials http://l.bitcasa.com/ayav_jSQ
Sagace
Search for Biomedical Data & Resources in Japan
Features
• Focus on biomedical databases
• Semi-automated ranking
• Refining search results with facets
• More informative search results with metadata
http://integbio.jp/en/
Mechanisms of Search Engine
1. Crawling
2. Indexing
3. Query Processing
4. Scoring
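The four stages can be sketched as a toy pipeline. This is only an illustration under stated assumptions, not Sagace's implementation: the in-memory `PAGES` dictionary and its example.org URLs are hypothetical stand-ins for crawled databases, and summing term counts is a placeholder for the real scoring.

```python
from collections import defaultdict

# Hypothetical stand-in for pages fetched by a crawler (URL -> text).
PAGES = {
    "http://example.org/db1/entry1": "kinase inhibitor drug target",
    "http://example.org/db2/entry7": "kinase structure x-ray diffraction",
    "http://example.org/db3/entry3": "health food ingredients",
}

def crawl(pages):
    """1. Crawling: yield (url, text) pairs; a real crawler would fetch over HTTP."""
    yield from pages.items()

def build_index(documents):
    """2. Indexing: map each term to {url: term count}."""
    index = defaultdict(dict)
    for url, text in documents:
        for term in text.lower().split():
            index[term][url] = index[term].get(url, 0) + 1
    return index

def process_query(query):
    """3. Query processing: normalise the query into lowercase terms."""
    return query.lower().split()

def score(index, terms):
    """4. Scoring: sum term counts per URL, highest total first."""
    totals = defaultdict(int)
    for term in terms:
        for url, count in index.get(term, {}).items():
            totals[url] += count
    return sorted(totals.items(), key=lambda item: (-item[1], item[0]))

index = build_index(crawl(PAGES))
results = score(index, process_query("kinase drug"))
```

Here `results` ranks the first page above the second because it matches both query terms.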
Crawling
Databases
Crawling Program
Indexing
• Split the data into conveniently sized pieces and store them on our own server
Internal Server
Indexing Data
Query Processing and Scoring
NIBIO
MEDALS
JCGGDB
NBDC / DBCLS
AgriTogo
Collaborate by using P2P architecture
Search System
Log Analysis and Reflection in Search Results
• The members of the top 8 databases are almost the same.
– Patents
– KEGG MEDICUS
– Medicine and pharmaceutical proceedings
– Drug emergency call
– Ingredients information of health food
– Merck Manual
– Medical Information Network Distribution Service
– The Encyclopedia of Psychoactive Drugs
Comparison of Databases
• Popular databases are medical or pharmaceutical “literal rich” databases.
• Top databases run away with the winnings!
• More than half of the databases have never been clicked!
Unpopular databases
• Sagace started its service in March 2012.
• Some databases have never been clicked since then.
• Eliminate these databases.
• Databases
– 272 DB -> 122 DB
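The pruning step can be illustrated with a small sketch. It assumes a click log that holds one database name per clicked search result; the log format, the database names, and the function name `prune_unclicked` are all hypothetical, since the slides do not describe how the logs are stored.

```python
from collections import Counter

def prune_unclicked(databases, click_log):
    """Return (kept, removed): databases with at least one click, and those with none."""
    clicks = Counter(click_log)
    kept = [db for db in databases if clicks[db] > 0]
    removed = [db for db in databases if clicks[db] == 0]
    return kept, removed

# Hypothetical data: database names and one log entry per clicked result.
databases = ["DB-A", "DB-B", "DB-C", "DB-D"]
click_log = ["DB-A", "DB-A", "DB-C", "DB-A"]

kept, removed = prune_unclicked(databases, click_log)

# The same counter also gives the most-clicked databases.
top = Counter(click_log).most_common(2)
```

The same counts serve both purposes described above: ranking the popular databases and identifying the ones that can be dropped from the search target.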
Results
• Accuracy for users has presumably improved.
• Reducing the number of databases also sped up the search.
Specific databases in life science
• Some life science databases lack “literal information”.
• A cross-search engine is well suited to showing literal information.
• The Semantic Web will help these databases.
Semantic Web?
What is the Semantic Web?
The Semantic Web is a web of meaningful, machine-understandable data.
Web of Document
http://pdbj.org/mine/summary/2yi1
Search Engine Results
Query: “2yi1 pdbj”, searched on Google
A search engine can reflect only text data.
Web of Document to Web of Data
[Diagram: the page http://pdbj.org/mine/summary/2yi1 shown as a set of separate “Data” items]
How should the computer recognize these data?
A. (Focusing on search services) Markup with metadata by database developers
What is metadata?
• Data about Data
Entry ID: 2YI1
Species: HOMO SAPIENS
Reference: PubMed ID 22343627
See Also: 2YHY, 2YHW
Experimental method: X-RAY DIFFRACTION
Image: http://pdbj.org/pdb_images/2yi1.jpg
[Figure labels: See Also, Keywords, Reference, Species, Experimental method, Entry ID, Image]
Reflect Search Results
• Metadata helps users and databases find each other
How to mark up? (Microdata)
• Add metadata to HTML tags
http://schema.org/BiologicalDatabaseEntry/entryID
http://pdbj.org/mine/summary/2yi1 2YI1
<div itemscope="" itemtype="http://schema.org/BiologicalDatabaseEntry">
  <span itemprop="entryID">2YI1</span>
</div>
Declare Vocabulary
Content (Object)
Property (Predicate)
How to reflect?
• The crawler program can find metadata easily!
• Add it to the indexed data
• Reflect it in search results
@BiologicalDatabaseEntry_entryID=2YI1
<div itemscope="" itemtype="http://schema.org/BiologicalDatabaseEntry">
  <span itemprop="entryID">2YI1</span>
</div>
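How a crawler might turn this markup into an index field can be sketched with Python's standard `html.parser`. This is a simplified illustration, not the Sagace crawler: the class name `MicrodataIndexer` is hypothetical, and the sketch assumes flat items with text-only property values (no nested items, and no tracking of where an item's scope ends). The output format follows the `@BiologicalDatabaseEntry_entryID=2YI1` field shown above.

```python
from html.parser import HTMLParser

class MicrodataIndexer(HTMLParser):
    """Collect itemprop values and label them with the enclosing itemtype."""

    def __init__(self):
        super().__init__()
        self.item_type = None      # short type name of the current item
        self.current_prop = None   # itemprop whose text is being read
        self.fields = []           # index fields such as "@Type_prop=value"

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "itemscope" in attrs and attrs.get("itemtype"):
            # Keep only the last path segment of the vocabulary URL.
            self.item_type = attrs["itemtype"].rstrip("/").rsplit("/", 1)[-1]
        if "itemprop" in attrs:
            self.current_prop = attrs["itemprop"]

    def handle_data(self, data):
        value = data.strip()
        if self.current_prop and self.item_type and value:
            self.fields.append(f"@{self.item_type}_{self.current_prop}={value}")
            self.current_prop = None

    def handle_endtag(self, tag):
        # Stop reading a property once any element closes.
        self.current_prop = None

html = '''<div itemscope="" itemtype="http://schema.org/BiologicalDatabaseEntry">
  <span itemprop="entryID">2YI1</span>
</div>'''

indexer = MicrodataIndexer()
indexer.feed(html)
fields = indexer.fields
```

Because the type name is taken from `itemtype`, the same `entryID` property on a different kind of item would produce a differently named field.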
Machine Understandable Data
• Declaring the vocabulary is important.
E.g. entryID of what? A book? A product? A biological entry? A recipe?
E.g. entryID=2YI1 belongs to a BiologicalDatabaseEntry!!
<div itemscope="" itemtype="http://schema.org/BiologicalDatabaseEntry">
  <span itemprop="entryID">2YI1</span>
</div>
What is schema.org?
• “Schema.org is a set of extensible schemas that enables webmasters to embed structured data on their web pages for use by search engines and other applications.” (http://schema.org/)
It’s not only in Sagace.
• “Search engines including Bing, Google, Yahoo! and Yandex rely on this markup to improve the display of search results, making it easier for people to find the right web pages.” (http://schema.org/)
• Google supports these content types:
– Reviews
– People
– Products
– Businesses and organizations
– Recipes
– Events
– Music
Current Situation
• We defined original properties for Biological Database and Biological Database Entry for schema.org
– entryID, isEntryOf, taxon, seeAlso, reference
– Schema.org proposal
– http://www.w3.org/wiki/WebSchemas/BioDatabases
• Sagace can reflect them in search results.
• Search collaboration organizations will also reflect them in search results.
– NBDC
– MEDALS (molprof)
• How to mark up, with search result examples in Sagace
– http://sagace.nibio.go.jp/press/metadata/markup/
Sagace reflects these properties
• image
• isEntryOf (Database name)
• entryID
• taxon (Species)
• disease
• seeAlso (Reference database entry)
• dateModified (last modified)
• reference (Reference article)
To have biological data reflected in major search engines, these properties need to be added to schema.org.
schema.org Proposal
schema.org
Reflect Search Results
Biological Database and Biological Database Entry
• To get our proposal added to schema.org: “Need more people who think it is a good idea.” (organizers at schema.org)
• We need more databases!
9 DBs have applied microdata!
• DoBISCUIT (Database Of BIoSynthesis clusters CUrated and InTegrated)
• JCRB Cell Bank
• Functional Glycomics with KO mice database
• Glyco-Disease Genes Database
• Carbohydrate Interaction Database (Carint)
• JCGGDB Report
• MEDALS
• Integbio Database Catalog
• Life Science Database Archive
Search Results Example 1
Search Results Example 2
Issues (Cons) for Microdata
• Microdata strongly recommends using the schema.org vocabulary.
• Microdata is a W3C working group document, not a W3C Recommendation.
• If we integrate RDF data, we have to reconsider which vocabularies are suitable.
RDFa Lite
• RDFa Lite is a minimal subset of RDFa, the Resource Description Framework in attributes (http://www.w3.org/TR/rdfa-lite/)
– Influenced by Microdata
– W3C Recommendation, 07 June 2012
• Ability to specify more than one vocabulary (not only schema.org)
• Easy to mark up
How to mark up? (RDFa Lite)
• Add metadata to HTML tags
http://schema.org/BiologicalDatabaseEntry/entryID
http://pdbj.org/mine/summary/2yi1 2YI1
<div vocab="http://schema.org/" typeof="BiologicalDatabaseEntry">
  <span property="entryID">2YI1</span>
</div>
Declare Vocabulary
Property (Predicate)
Content (Object)
If you use PDBo as an extension vocabulary
<div prefix="PDBo: http://rdf.wwpdb.org/schema/pdbx-v40.owl#">
  <span property="PDBo:exptl.method">X-RAY DIFFRACTION</span>
</div>
Image
Declare Vocabulary
Content (Object)
Property (Predicate)
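How a consumer expands RDFa Lite property names into full IRIs can be sketched as follows. This is a simplified illustration rather than a conforming RDFa processor: the helper names `parse_prefixes` and `resolve` are hypothetical, prefixes are matched exactly, and only the `vocab` and `prefix` mechanisms are covered. In RDFa, a plain term is resolved by appending it to the `vocab` IRI, and a prefixed name by appending the part after the colon to the IRI mapped to that prefix.

```python
def parse_prefixes(prefix_attr):
    """Parse an RDFa `prefix` attribute ("name: IRI name2: IRI2") into a dict."""
    tokens = prefix_attr.split()
    # Tokens alternate between "name:" and the IRI it maps to.
    return {name.rstrip(":"): iri for name, iri in zip(tokens[0::2], tokens[1::2])}

def resolve(term, vocab=None, prefixes=None):
    """Expand a property value to a full IRI using prefix mappings or vocab."""
    prefixes = prefixes or {}
    if ":" in term:
        prefix, reference = term.split(":", 1)
        if prefix in prefixes:
            return prefixes[prefix] + reference
        return term  # already a full IRI, or an unknown prefix: leave unchanged
    if vocab is not None:
        return vocab + term
    return term

prefixes = parse_prefixes("PDBo: http://rdf.wwpdb.org/schema/pdbx-v40.owl#")
method_iri = resolve("PDBo:exptl.method", prefixes=prefixes)
entry_iri = resolve("entryID", vocab="http://schema.org/")
```

Because resolution is plain concatenation, the `vocab` value needs its trailing slash and the prefix IRI its trailing `#`.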
If metadata is added to databases...
• Search engines can pick up much important data.
• Database developers can promote their services more effectively.
• Users can easily find what they are looking for.
Current Situation
• KNApSAcK has applied RDFa Lite.
• We’d like to reflect more information by using RDFa Lite.
• If you add metadata to your databases, please contact NBDC or me ([email protected])
• Please collaborate with us!
• Please tell me what kind of information is suitable to show and refine.
Acknowledgement
• National Institute of Biomedical Innovation
– Mizuguchi Kenji
– Morita Mizuki
– Igarashi Yoshinobu
– Sakate Ryuichi
– Nagao Chioko
– Chen Yi-an
– Akiko Fukagawa
– Tohru Masui
– Johan Nystrom-Persson
• This project is supported by a collaboration "Database integration in NIBIO and cooperation with outside organizations" with the NBDC.
• National Bioscience Database Center (NBDC)
• National Institute of Agrobiological Sciences database (NIAS)
• Molecular Profiling Research Center for Drug Discovery (molprof)
• Japan Consortium for Glycobiology and Glycotechnology DataBase (JCGGDB)
Web of Data (Concept)
http://schema.org/BiologicalDatabaseEntry/entryID
http://pdbj.org/mine/summary/xxxx xxxx
PDBj
PubMed:xxxxxxx
http://schema.org/BiologicalDatabaseEntry/reference
http://schema.org/BiologicalDatabaseEntry/isEntryOf
http://schema.org/BiologicalDatabaseEntry/reference
http://schema.org/BiologicalDatabaseEntry/isEntryOf
Database A http://databaseA.org/publication
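The concept can be illustrated by storing the statements as (subject, predicate, object) triples and following the links between them. This is a minimal sketch, not a real RDF store: the values are taken from the 2YI1 example earlier in the slides, the `PubMed:` notation follows the diagram, and the helper names `objects` and `subjects` are hypothetical.

```python
# Triples as (subject, predicate, object); values come from the 2YI1 example.
SCHEMA = "http://schema.org/BiologicalDatabaseEntry/"
ENTRY = "http://pdbj.org/mine/summary/2yi1"

triples = [
    (ENTRY, SCHEMA + "entryID", "2YI1"),
    (ENTRY, SCHEMA + "isEntryOf", "PDBj"),
    (ENTRY, SCHEMA + "reference", "PubMed:22343627"),
]

def objects(triples, subject, predicate):
    """Return every object linked from `subject` by `predicate`."""
    return [o for s, p, o in triples if s == subject and p == predicate]

def subjects(triples, predicate, obj):
    """Return every subject that points at `obj` via `predicate` (reverse lookup)."""
    return [s for s, p, o in triples if p == predicate and o == obj]

database = objects(triples, ENTRY, SCHEMA + "isEntryOf")
citing_entries = subjects(triples, SCHEMA + "reference", "PubMed:22343627")
```

The reverse lookup is what connects databases: once several databases publish `reference` triples, entries that cite the same article can be found from the article identifier alone.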