Information and Communications Univ.
Bioinformatics & Software Systems Lab.
Woo-Hyuk Jang
Integration of Biological XML data
Ph. D. Lecture
ICE0534 - Web-Based Software Development, Summer 2005 2
Where are we?
Server-Side Info. Management
Client-Side Info. Management
Business related Issues
Web Services
Internationalization and Privacy
XML & XML Processing
HTML, JavaScript, Plug-in, Applet…
WWW Concepts & Web-based Info. Management
S.S. Info. Management Concept
CGI, Java Servlets
JDBC, MySQL
App. Of Web-based tech.
Semantic Web
This Lecture ?
Can you remember?
• Problems in Integrating Heterogeneous Information
  - Heterogeneity of formats, data types, units, or semantics.
• Information Mediation
Fig 1. Mediator in Lecture 7.
This Lecture Contains…
• Information Integration in Bioinformatics
  - Bioinformatics Overview
    • Is there any relationship between Web and Bioinformatics?
  - Difficulties in handling Biological XML data
  - What is it? Why?
    • Cultures: Schema-driven, Data-driven
    • Models: Federation, Warehousing, Mediation
• Integration of XML format data
  - Problems
  - Issues
• Summary
Bioinformatics
• A narrow sense
  - The application of information technology to life science research
    • Modeling (abstraction)
    • Analysis and collection
    • Data integration and information retrieval
  - Enables the discovery and analysis of biomolecules and their properties (structure, function, interactions)
• A wide sense
  - The use of computers to collect, analyze, and interpret biological information at the molecular level
Web and Bioinformatics
[Figure: Biologists experiment and publish Biological Data, and use Bio Applications. Computer Scientists make and publish Bio Applications, and use Biological Data.]
Difficulties in Handling Biological XML data
• Lack of standards
  - Different data models and schemas
  - Different handling methods are needed
  - Different formats
• Monstrous volume of data
  - It is growing exponentially
  - Data are updated very frequently
    • Newly introduced data, corrected data
Why Integration?
• In the post-human genome sequencing era, many analyses on the genome scale are possible
• The majority of human diseases are the product of multi-step pathophysiological processes
• The biggest challenge in interpreting the results of these analyses lies in the data integration problem
Two Cultures of Integration
• Database Integration
  - Schema-level view
  - Focus on the outside of the data
• Data Integration
  - Data-level view
  - Focus on the inside of the data
[Figure: Schema 1 to Schema 4 (database integration) and Data 1 to Data 4 (data integration).]
Two Cultures of Integration
• Schema-driven (computer scientists)
  - Schemas are much smaller than the data, with (hopefully) well-defined elements
  - Resolve redundancy and heterogeneity at the schema level
  - High degree of automation once the system is set up
  - Focus on methods: you rarely publish a "data paper"
• Data-driven (biologists)
  - Value is in the data; abstraction is a result of analysis
  - Don't bother with schemas
    • Abstraction is volatile and depends on experimental technique
  - Manual integration at the data level, constant high effort
  - You rarely publish a (database) "method paper"
Models of Integration
• Federation (Multi-database)
• Warehousing (Materialized in house)
• Mediation (Virtual integration)
Models of Integration
• Federation (Multi-database)
  - K2/BioKleisli, Entrez
Models of Integration
• Warehousing (Materialized in house)
  - GUS (Genome Unified Schema), SRS (Sequence Retrieval System)
[Figure: warehousing architecture. Components shown: User Query, Global Schema, Local Schemas, Repository, Data Extraction, Wrappers over an O/RDB and a Web Source; Integration & Storage, Warehouse, Decision Support & Mining, Local Operational data, Network/Internet.]
Models of Integration
• Mediation (Virtual integration)
  - TAMBIS (Transparent Access to Multiple Bioinformatics Information Sources)
[Figure: mediation architecture. Components shown: User Query, Integration System with Mediated Schema, Mediator and Query Translation, Query 1 and Query 2, Local Schemas, Wrappers over the Data Sources (an O/RDB and a Web Source), Network/Internet.]
Models of Integration
• Federation represents a more "static" approach, using agreed couplings to allow view creation.
• Warehousing and Mediation address integration in a more "dynamic" way, using extraction, transformation and integration processes.
Warehousing vs. Mediation
• Warehouse
  - Update-driven: updates are applied in the warehouse repository.
  - Heterogeneous data is integrated in advance and stored in-house for direct query and analysis.
• Mediation
  - Wrapper and mediator layer on top of the source DBs.
  - Query-driven: a query to the mediated schema is translated into queries appropriate to the sources.
  - Results are integrated into a global answer set.
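The contrast between the two models can be sketched in a few lines of Python. This is only an illustration under assumptions: the two in-memory sources, their field names, and the wrapper functions are hypothetical and do not come from GUS, SRS, or TAMBIS.

```python
# Two toy sources with heterogeneous local schemas (hypothetical data).
SOURCE_A = [{"acc": "P1", "protein_name": "BAXA_HUMAN"}]
SOURCE_B = [{"id": "P1", "organism": "Homo sapiens"}]


def wrap_a(row):
    """Wrapper: translate source A's local schema into the global schema."""
    return {"id": row["acc"], "name": row["protein_name"]}


def wrap_b(row):
    """Wrapper: source B already uses the global field names."""
    return {"id": row["id"], "organism": row["organism"]}


class Warehouse:
    """Update-driven: integrate in advance, answer from the in-house copy."""

    def __init__(self):
        self.repository = {}

    def refresh(self):
        # Runs on each update cycle, not at query time.
        for row in map(wrap_a, SOURCE_A):
            self.repository.setdefault(row["id"], {}).update(row)
        for row in map(wrap_b, SOURCE_B):
            self.repository.setdefault(row["id"], {}).update(row)

    def query(self, protein_id):
        return self.repository.get(protein_id)


class Mediator:
    """Query-driven: translate each query into per-source queries on demand."""

    def query(self, protein_id):
        answer = {}
        for row in SOURCE_A:
            if row["acc"] == protein_id:  # query expressed in A's schema
                answer.update(wrap_a(row))
        for row in SOURCE_B:
            if row["id"] == protein_id:  # query expressed in B's schema
                answer.update(wrap_b(row))
        return answer or None  # merged global answer set


warehouse = Warehouse()
warehouse.refresh()
mediator = Mediator()
```

In this sketch the warehouse answers from its stored copy, so a row added to a source after `refresh()` is visible to the mediator but not to the warehouse until the next refresh.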
Now let’s study the…
• Information Integration in Bioinformatics
  - Bioinformatics Overview
    • Is there any relationship between Web and Bioinformatics?
  - Difficulties in handling Biological XML data
  - Why Integration?
    • Cultures: Schema-driven, Data-driven
    • Models: Federation, Warehousing, Mediation
• Integration of XML format data
  - Problems
  - Issues
• Discussion about Reading Question #6
Integration of XML format data
• Why XML?
  - Biology is a complex discipline
  - Wide variety of data resources and repositories
    • No standard protocol exists to interrogate biological data stores
    • No standard data format exists to exchange biological data
    • No standard data model exists
  - Difficulties in using and exchanging data
    • There exist various tools that can support XML handling
Integration of XML format data
• Problems
  - We focus on schema-driven integration
  - The warehousing model is efficient
    • Have to analyze data
    • Performance
    • Implementing a perfect mediation model is extremely difficult
  - XML data should be converted into an RDB
  - We want to make our own DB schema accommodating the data from XML files
  - We need to design the DB schema with efficiency and our own purpose in mind
  - Heterogeneity and large scale
Integration of XML format data
• PreSPI (Prediction System for Protein Interaction)
[Figure: PreSPI architecture. Components shown: General XML Wrapper (SAX); XML inputs for Sequence, Structure, Function, Domain, ...; Integration Rule; Local DB1, Local DB2, Local DB3; Warehouse; Local and Web sources.]
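A SAX-based wrapper of the kind named in the figure can be sketched with Python's standard `xml.sax` and `sqlite3` modules. This is a minimal illustration, not the PreSPI implementation: the `<protein>` element, its attributes, and the `protein` table are assumptions made for the example.

```python
import sqlite3
import xml.sax


class ProteinHandler(xml.sax.ContentHandler):
    """Streams <protein> elements into a table without building a tree."""

    def __init__(self, conn):
        super().__init__()
        self.conn = conn

    def startElement(self, name, attrs):
        # Called once per opening tag; only one element is held at a time.
        if name == "protein":
            self.conn.execute(
                "INSERT INTO protein (id, name) VALUES (?, ?)",
                (attrs.get("id"), attrs.get("name")),
            )


def load(xml_text, conn):
    """Parse xml_text as a stream and store every protein in the local DB."""
    conn.execute("CREATE TABLE IF NOT EXISTS protein (id TEXT, name TEXT)")
    xml.sax.parseString(xml_text.encode("utf-8"), ProteinHandler(conn))
    conn.commit()


# Hypothetical sample document.
SAMPLE = (
    '<proteins>'
    '<protein id="ID_A" name="PROTEIN_A"/>'
    '<protein id="ID_X" name="PROTEIN_X"/>'
    '</proteins>'
)

conn = sqlite3.connect(":memory:")
load(SAMPLE, conn)
rows = conn.execute("SELECT id, name FROM protein ORDER BY id").fetchall()
```

Because the handler reacts to events instead of holding the whole document, memory use stays roughly constant as the file grows, which is the usual reason to prefer SAX for very large XML files.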
Issues of Using XML Biological data
• Structure
  - Semi-structured: can be expressed as trees or graphs
  - Theoretically, it is ideal to map them into a DB according to their structural features
• Methods for storing XML
  - File system
    • Has overhead for queries
    • Text file, inverted list, compressed file
  - Specific storing method
    • Uses XML's own structure
  - DB system
    • Mapping into an RDB, in particular, has been researched a lot
    • Has overhead for converting into the appropriate model
Issues of Using XML Biological data
Object view of the XML using DOM
A class can be mapped into a table; PCDATA or an ATTRIBUTE can be a column

       XML               Objects              Tables
  =============       ============       ==============
                                          Table A
  <A>                 object A {          --------------
     <B>bbb</B>          B = "bbb"        B    C    D
     <C>ccc</C>   <=>    C = "ccc"   <=>  ---  ---  ---
     <D>ddd</D>          D = "ddd"        ...  ...  ...
  </A>                }                   bbb  ccc  ddd
                                          ...  ...  ...

XML-view
  CREATE XMLVIEW xview_1( id char(20), email char(30) )
  AS ( 'select p.personnel.person@id, p.personnel.person@email
        from "file:/home/user1/personal.xml", p;' );

"A generic load/extract utility for data transfer between XML documents and relational databases," Bourret, R.; Bornhovd, C.; Buchmann, A.; Advanced Issues of E-Commerce and Web-Based Information Systems, 2000.
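The XML-to-object-to-table mapping above can be tried out with Python's `xml.dom.minidom` and `sqlite3`. This is a small sketch of the idea only, not the utility described in the cited paper; the helper name `element_to_row` is made up for the example.

```python
import sqlite3
from xml.dom import minidom

DOC = "<A><B>bbb</B><C>ccc</C><D>ddd</D></A>"


def element_to_row(element):
    """Map each child element that holds PCDATA to a column name and value."""
    row = {}
    for child in element.childNodes:
        if child.nodeType == child.ELEMENT_NODE:
            text = "".join(
                n.data for n in child.childNodes if n.nodeType == n.TEXT_NODE
            )
            row[child.tagName] = text
    return row


root = minidom.parseString(DOC).documentElement
row = element_to_row(root)  # the "object view" of element A

conn = sqlite3.connect(":memory:")
# Element name becomes the table, child names become the columns.
# Names are interpolated directly, so this assumes a trusted document.
conn.execute("CREATE TABLE {} ({})".format(root.tagName, ", ".join(row)))
conn.execute(
    "INSERT INTO {} VALUES ({})".format(root.tagName, ", ".join("?" for _ in row)),
    list(row.values()),
)
stored = conn.execute("SELECT B, C, D FROM A").fetchall()
```

Note that DOM loads the whole document into memory first, so this style suits small files.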
Issues of Using XML Biological data
• Direct method
[Figure: the XML Saver takes an XML Document and a Mapping Rule as input, then outputs and executes Insert Statements.]
"A direct method of data exchange between XML and relational database," Bei Jia; Cai Fei; Tao Lie-Jun; Pan Jin-Gui; Information Technology Interfaces, 2004. 26th International Conference on, 2004, Page(s): 127-132, Vol. 1
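The flow in the figure (XML document plus mapping rule in, INSERT statements out) can be illustrated as follows. The mapping-rule format here is a hypothetical dictionary invented for the sketch; the cited paper defines its own rule language, which is not reproduced.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical mapping rule: element name -> target table and
# attribute-to-column map.
MAPPING_RULE = {
    "protein": {"table": "protein", "columns": {"id": "id", "name": "name"}},
}


def xml_saver(xml_text, rule, conn):
    """Generate one parameterized INSERT per mapped element and execute it."""
    statements = []
    for element in ET.fromstring(xml_text).iter():
        target = rule.get(element.tag)
        if target is None:
            continue  # element is not covered by the mapping rule
        columns = list(target["columns"].values())
        values = [element.get(attr) for attr in target["columns"]]
        sql = "INSERT INTO {} ({}) VALUES ({})".format(
            target["table"], ", ".join(columns), ", ".join("?" for _ in columns)
        )
        conn.execute(sql, values)
        statements.append((sql, values))
    return statements


DOC = '<proteins><protein id="ID_A" name="PROTEIN_A"/></proteins>'
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE protein (id TEXT, name TEXT)")
executed = xml_saver(DOC, MAPPING_RULE, conn)
```

Each element maps to exactly one row here, so the resulting tables mirror the XML structure.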
Issues of Using XML Biological data
• Current methods force the DB to follow the XML schema
• Complex structured XML
  - Elements share the same name even though they should be different columns in the DB (DIP, InterPro…)
• Large files: we cannot use DOM
• XML is updated frequently, so the process should be easy

  <protein id="ID_A" name="PROTEIN_A">
    <ref db="B" id="ID_B" />
    <ref db="C" id="ID_C" />
    ………..

We want one row per protein:

  ID    NAME       B     C     D     E
  ID_A  PROTEIN_A  ID_B  ID_C  ID_D  ID_E

rather than one row per reference:

  ID    DB  ID
  ID_A  B   ID_B
  ID_A  C   ID_C
  ID_A  D   ID_D
  ID_A  E   ID_E
Issues of Using XML Biological data
• The direct method cannot cover the following type of XML
• It cannot integrate two or more files; constraints are needed

  <node id="G:1" uid="DIP:232N" name="BAXA_HUMAN" class="protein">
    <xref db="DIP" id="232N" type="src"/>
    <feature name="swp_ref" class="cref">
      <src>SwissProt</src>
      <val>SWP:Q07812</val>
      <xref db="SWP" id="Q07812" type="src"/>
    </feature>
    <feature name="pir_ref" class="cref">
      <src>PIR</src>
      <val>PIR:A47538</val>
      <xref db="PIR" id="A47538" type="src"/>
    </feature>
    <feature name="gi_ref" class="cref">
      <src>NCBI</src>
      <val>GI:539664</val>
      <xref db="gi" id="539664" type="src"/>
    </feature>
    <att name="descr">
      <val>bcl-2-associated protein x, alpha splice form</val>
    </att>
    <att name="organism">
      <val>Homo sapiens</val>
      <xref db="TXID" id="9606" type="ont"/>
    </att>
  </node>

We want:

  ID   NAME        DIP_ID    SWP_ID  PIR_ID  GI_ID
  G:1  BAXA_HUMAN  DIP:232N  Q07812  A47538  539664

But the direct method gives:

  ID   DB   Ref_ID
  G:1  DIP  DIP:232N
  G:1  SWP  Q07812
  G:1  PIR  A47538
  G:1  gi   539664
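The difference between the two table shapes can be reproduced on the DIP node with Python's `xml.etree.ElementTree`. This is a sketch under assumptions: it reads ids from the `xref` elements with `type="src"`, so the DIP value comes out as `232N` rather than the `uid` form `DIP:232N` used in the tables, and the sample omits the `descr` attribute for brevity.

```python
import xml.etree.ElementTree as ET

NODE_XML = """
<node id="G:1" uid="DIP:232N" name="BAXA_HUMAN" class="protein">
  <xref db="DIP" id="232N" type="src"/>
  <feature name="swp_ref" class="cref">
    <src>SwissProt</src><val>SWP:Q07812</val>
    <xref db="SWP" id="Q07812" type="src"/>
  </feature>
  <feature name="pir_ref" class="cref">
    <src>PIR</src><val>PIR:A47538</val>
    <xref db="PIR" id="A47538" type="src"/>
  </feature>
  <feature name="gi_ref" class="cref">
    <src>NCBI</src><val>GI:539664</val>
    <xref db="gi" id="539664" type="src"/>
  </feature>
  <att name="organism">
    <val>Homo sapiens</val>
    <xref db="TXID" id="9606" type="ont"/>
  </att>
</node>
"""

node = ET.fromstring(NODE_XML)

# Long form: one (ID, DB, Ref_ID) row per source cross-reference.
# The type="src" test is the constraint that keeps the ontology xref out.
long_rows = [
    (node.get("id"), xref.get("db"), xref.get("id"))
    for xref in node.iter("xref")
    if xref.get("type") == "src"
]

# Wide form: one row per node, one column per referenced database.
wide_row = {"ID": node.get("id"), "NAME": node.get("name")}
for _, db, ref_id in long_rows:
    wide_row[db.upper() + "_ID"] = ref_id
```

The pivot needs to know which `db` value feeds which column, which is exactly the kind of per-column condition a purely structural mapping has no place to express.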
Issues of Using XML Biological data
• Make a data set for each tuple, ignoring sub-document tree nodes
• Define an SQL-like syntax
  - WHERE condition on each column for constraints
  - Multiple files can be populated into one table by manipulation

  CREATE TABLE PROTEIN_IDs(
      ID_A CHAR(20), NAME CHAR(20), B CHAR(20),
      C CHAR(20), D CHAR(20), E CHAR(20) )
  AS ( SELECT ( FILE.protein@id,
                FILE.protein@name,
                [FILE.protein.ref]@id WHERE @db = B,
                [FILE.protein.ref]@id WHERE @db = C,
                [FILE.protein.ref]@id WHERE @db = D,
                [FILE.protein.ref]@id WHERE @db = E,
                [FILE_2.ELEMENT]@value WHERE @id=ID_A )
       FROM "file/protein.xml" AS FILE, "file/file.xml" AS FILE_2 );
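One way to read the statement above is as a list of column selectors, each with its own WHERE condition, evaluated once per protein. The Python sketch below imitates that reading; it is an interpretation, not the authors' implementation. The sample documents, the `select` helper, and the extra `VALUE` column for the second file are assumptions.

```python
import xml.etree.ElementTree as ET

PROTEIN_XML = """
<proteins>
  <protein id="ID_A" name="PROTEIN_A">
    <ref db="B" id="ID_B"/>
    <ref db="C" id="ID_C"/>
    <ref db="D" id="ID_D"/>
    <ref db="E" id="ID_E"/>
  </protein>
</proteins>
"""
# Hypothetical second file, keyed by protein id.
FILE_2_XML = '<elements><ELEMENT id="ID_A" value="VALUE_A"/></elements>'


def select(parent, tag, attr, where):
    """Return attr of the first tag element under parent matching where."""
    for element in parent.iter(tag):
        if all(element.get(key) == value for key, value in where.items()):
            return element.get(attr)
    return None


def build_rows(file_1, file_2):
    """Build one wide row per protein, pulling one column from a second file."""
    second = ET.fromstring(file_2)
    rows = []
    for protein in ET.fromstring(file_1).iter("protein"):
        row = {"ID_A": protein.get("id"), "NAME": protein.get("name")}
        for db in ("B", "C", "D", "E"):
            # [FILE.protein.ref]@id WHERE @db = <db>
            row[db] = select(protein, "ref", "id", {"db": db})
        # [FILE_2.ELEMENT]@value WHERE @id = ID_A
        row["VALUE"] = select(second, "ELEMENT", "value", {"id": row["ID_A"]})
        rows.append(row)
    return rows


rows = build_rows(PROTEIN_XML, FILE_2_XML)
```

A column whose condition matches nothing is left as `None`, which corresponds to a NULL in the target table.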
Summary
• Integration of biological data is a kind of Web-based information management
• Integration is very important in bioinformatics because a comprehensive view lets us find more valuable biological information
• Biological XML data have some properties that hinder integration, so the schema-driven approach and the warehousing model are usually used for integration
Thank you