Information and Communications Univ. Bioinformatics & Software Systems Lab. Woo-Hyuk Jang Integration of Biological XML data Ph. D. Lecture


Information and Communications Univ.

Bioinformatics & Software Systems Lab. Woo-Hyuk Jang

Integration of Biological XML data

Ph. D. Lecture


ICE0534 - Web-Based Software Development, Summer 2005

Where are we?

Server-Side Info. Management

Client-Side Info. Management

Business related Issues

Web Services

Internationalization and Privacy

XML & XML Processing

HTML, JavaScript, Plug-in, Applet…

WWW Concepts & Web-based Info. Management

S.S. Info. Management Concept

CGI, Java Servlets

JDBC, MySQL

App. Of Web-based tech.

Semantic Web

This Lecture ?


Can you remember?

• Problems in Integrating Heterogeneous Information
  - Heterogeneity of formats, data types, units, or semantics.

• Information Mediation

Fig 1. Mediator in Lecture 7.


This Lecture Contains…

• Information Integration in Bioinformatics
  - Bioinformatics Overview
    • Is there any relationship between Web and Bioinformatics?
  - Difficulties in handling Biological XML data
  - What is it? Why?
    • Cultures: Schema-driven, Data-driven
    • Models: Federation, Warehousing, Mediation
• Integration of XML format data
  - Problems
  - Issues

• Summary


Bioinformatics

• A narrow sense
  - The application of information technology to life science research
    • Modeling (abstraction)
    • Analysis and collection
    • Data integration and information retrieval
  - Enables the discovery and analysis of biomolecules and their properties (structure, function, interactions)
• A wide sense
  - The use of computers to collect, analyze, and interpret biological information at the molecular level


Web and Bioinformatics

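[Figure: the Web connecting two roles and two resources. Roles: Biologist, Computer Scientist. Resources: Biological Data, Bio Applications. Arrow labels: "Experiment, Publish", "Make, Publish", and "Use" (twice).]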


Difficulties in Handling Biological XML Data

• Lack of standards
  - Different data models and schemas
  - Different handling methods are needed
  - Different formats
• Monstrous volume of data
  - It is growing exponentially
  - Data are updated very frequently
    • Newly introduced data, error-fixed data


Why Integration?

• In the post-human genome sequencing era, many analyses on the genome scale are possible

• The majority of human diseases are the product of multi-step pathophysiological processes

• The biggest challenge in interpreting the results of these analyses lies in the data integration problem


Two Cultures of Integration

• Database Integration
  - Schema level view
  - Focus on outside of data
• Data Integration
  - Data level view
  - Focus on inside of data

[Figure: Schema 1, Schema 2, Schema 3, Schema 4 (database integration); Data 1, Data 2, Data 3, Data 4 (data integration).]


Two Cultures of Integration

• Schema-driven (computer scientists)
  - Schemas are much smaller than the data, with (hopefully) well-defined elements
  - Resolve redundancy and heterogeneity at the schema level
  - High degree of automation once the system is set up
  - Focus on methods: you rarely publish a "data paper"
• Data-driven (biologists)
  - Value is in the data; abstraction is a result of analysis
  - Don't bother with schemas
    • Abstraction is volatile and depends on experimental technique
  - Manual integration at data level, constant high effort
  - You rarely publish a (database) "method paper"


Models of Integration

• Federation (Multi-database)
• Warehousing (Materialized in house)
• Mediation (Virtual integration)


Models of Integration

• Federation (Multi-database)
  - K2/BioKleisli, Entrez


Models of Integration

• Warehousing (Materialized in house)
  - GUS (Genomics Unified Schema), SRS (Sequence Retrieval System)

[Figure: warehousing architecture. Labels: User Query, Global Schema, Repository, Data Extraction, Wrapper (one per source), Local Schema (one per source), O/RDB, Web Source; Local, Operational, Warehouse, Decision Support & Mining, Integration & Storage, Network, Internet.]


Models of Integration

• Mediation (Virtual integration)
  - TAMBIS (Transparent Access to Multiple Bioinformatics Information Sources)

[Figure: mediation architecture. Labels: User Query, Integration System, Mediator, Mediated Schema, Query Translation, Query 1, Query 2, Wrapper (one per source), Local Schema (one per source), Data Sources (O/RDB, Web Source), Network, Internet.]


Models of Integration

• Federation represents a more "static" approach, using agreed couplings to allow view creation.

• Warehousing and Mediation address integration in a more "dynamic" way, using extraction, transformation and integration processes.


Warehousing vs. Mediation

• Warehouse
  - Update-driven: source updates are propagated into the warehouse repository.
  - Heterogeneous data is integrated in advance and stored in-house for direct query and analysis.
• Mediation
  - Wrapper and Mediator layer on top of source DBs.
  - Query-driven: a query against the mediated schema is translated into queries appropriate to the sources.
  - Results are integrated into a global answer set.
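
The contrast between the two models can be made concrete with a toy sketch. Everything here is an assumption for illustration: the two in-memory sources, their field names, and the `Warehouse` and `Mediator` classes are hypothetical and do not come from any of the systems named in this lecture.

```python
# Two hypothetical sources with different local schemas.
SOURCE_A = [{"acc": "P1", "protein_name": "BAX"}]
SOURCE_B = [{"id": "P2", "label": "BCL2"}]


def wrap_a(record):
    """Wrapper: translate a SOURCE_A record into the global schema."""
    return {"id": record["acc"], "name": record["protein_name"]}


def wrap_b(record):
    """Wrapper: translate a SOURCE_B record into the global schema."""
    return {"id": record["id"], "name": record["label"]}


class Warehouse:
    """Update-driven: data is integrated in advance and stored in-house."""

    def __init__(self):
        self.repository = []

    def refresh(self):
        # Re-extract and integrate everything from the sources.
        self.repository = [wrap_a(r) for r in SOURCE_A] + [wrap_b(r) for r in SOURCE_B]

    def query(self, name):
        # Queries run against the local repository only.
        return [row for row in self.repository if row["name"] == name]


class Mediator:
    """Query-driven: each query is translated and sent to the sources."""

    def query(self, name):
        # One translated sub-query per source, expressed in its local schema.
        from_a = [wrap_a(r) for r in SOURCE_A if r["protein_name"] == name]
        from_b = [wrap_b(r) for r in SOURCE_B if r["label"] == name]
        return from_a + from_b  # integrate into a global answer set


warehouse = Warehouse()
warehouse.refresh()
mediator = Mediator()

# A source changes after the warehouse was loaded.
SOURCE_B.append({"id": "P3", "label": "TP53"})

stale = warehouse.query("TP53")  # the warehouse has not been refreshed yet
fresh = mediator.query("TP53")   # the mediator reads the source at query time
```

The trade-off shown is the usual one: the warehouse answers from local storage but can be out of date until the next refresh, while the mediator is always current but pays the translation and source-access cost on every query.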


Now let’s study the…

• Information Integration in Bioinformatics
  - Bioinformatics Overview
    • Is there any relationship between Web and Bioinformatics?
  - Difficulties in handling Biological XML data
  - Why Integration?
    • Cultures: Schema-driven, Data-driven
    • Models: Federation, Warehousing, Mediation
• Integration of XML format data
  - Problems
  - Issues

• Discussion about Reading Question #6


Integration of XML format data

• Why XML?
  - Biology is a complex discipline
  - Wide variety of data resources and repositories
    • No standard protocol exists to interrogate biological data stores
    • No standard data format exists to exchange biological data.
    • No standard data model exists.
  - Difficulties in using and exchanging data
    • There exist various tools that can support XML handling


Integration of XML format data

• Problems
  - We focus on schema-driven integration
  - The warehousing model is efficient
    • Have to analyze data
    • Performance
    • Implementing a perfect mediation model is extremely difficult
  - XML data should be converted into an RDB
  - We want to make our own DB schema accommodating the data from XML files
  - We need to design the DB schema with efficiency and our own purpose in mind
  - Heterogeneity and large scale


Integration of XML format data

• PreSPI (Prediction System for Protein Interaction)

[Figure: PreSPI architecture. Labels: Web (XML sources: Sequence, Structure, Function, Domain, ...), General XML Wrapper (SAX), Local (Local DB1, Local DB2, Local DB3), Integration Rule, Warehouse.]


Issues of Using XML Biological data

• Structure
  - Semi-structured: can be expressed as trees or graphs
  - Theoretically, it is ideal to map them into a DB according to their structural features
• Method for storing XML
  - File system
    • Has overhead for query
    • Text file, inverted list, compressed file
  - Specific storing method
    • Use XML's own structure
  - DB system
    • Especially, mapping into RDB has been researched a lot
    • Has overhead for converting into the appropriate model


Issues of Using XML Biological data

• Object view of the XML (using DOM)
  - A class can be mapped into a table; PCDATA or an ATTRIBUTE can be a column

    XML                  Objects              Tables
    =============        ============         ==============
                                              Table A
    <A>                  object A {           --------------
      <B>bbb</B>           B = "bbb"          B    C    D
      <C>ccc</C>   <=>     C = "ccc"   <=>    ---  ---  ---
      <D>ddd</D>           D = "ddd"          ...  ...  ...
    </A>                 }                    bbb  ccc  ddd
                                              ...  ...  ...

• XML-view

    CREATE XMLVIEW xview_1( id char(20), email char(30) )
    AS ( 'select p.personnel.person@id, p.personnel.person@email
          from "file:/home/user1/personal.xml", p;' );

"A generic load/extract utility for data transfer between XML documents and relational databases", Bourret, R.; Bornhovd, C.; Buchmann, A.; Advanced Issues of E-Commerce and Web-Based Information Systems, 2000.
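
The element-to-table mapping on this slide can be sketched with Python's standard `xml.dom.minidom` and `sqlite3` modules. This is only an illustration of the idea, not the utility described in the cited paper; the helper `element_to_row` and the in-memory database are assumptions made for the example.

```python
import sqlite3
from xml.dom import minidom

XML_TEXT = "<A><B>bbb</B><C>ccc</C><D>ddd</D></A>"


def element_to_row(element):
    """Map each child element's PCDATA to a column named after the child."""
    row = {}
    for child in element.childNodes:
        if child.nodeType == child.ELEMENT_NODE:
            text = "".join(
                node.data
                for node in child.childNodes
                if node.nodeType == node.TEXT_NODE
            )
            row[child.tagName] = text
    return row


document = minidom.parseString(XML_TEXT)
root = document.documentElement  # the <A> element plays the role of class A
row = element_to_row(root)
columns = list(row)              # child element names become column names

connection = sqlite3.connect(":memory:")
column_defs = ", ".join(f'"{name}" TEXT' for name in columns)
connection.execute(f'CREATE TABLE "{root.tagName}" ({column_defs})')
placeholders = ", ".join("?" for _ in columns)
connection.execute(
    f'INSERT INTO "{root.tagName}" VALUES ({placeholders})',
    [row[name] for name in columns],
)
stored = connection.execute('SELECT B, C, D FROM "A"').fetchall()
```

Because DOM loads the whole document into memory before any row is produced, this style only suits small files.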


Issues of Using XML Biological data

• Direct method

[Figure: the XML Document and the Mapping Rule are the inputs to the XML Saver, which outputs and executes Insert Statements.]

"A direct method of data exchange between XML and relational database", Bei Jia; Cai Fei; Tao Lie-Jun; Pan Jin-Gui; Information Technology Interfaces, 2004. 26th International Conference on, 2004, pp. 127-132, Vol. 1.
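
The flow in the figure (document plus mapping rule in, insert statements out and executed) might look roughly like the sketch below. The rule format in `MAPPING_RULE`, the sample document, and the function `generate_inserts` are hypothetical; the cited paper's actual rule language is not reproduced here.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Made-up sample document for the illustration.
XML_TEXT = """
<personnel>
  <person id="p1" email="kim@example.org"/>
  <person id="p2" email="lee@example.org"/>
</personnel>
"""

# Hypothetical mapping rule: which element becomes a row of which table,
# and which XML attribute feeds which column.
MAPPING_RULE = {
    "table": "person",
    "row_element": "person",
    "columns": {"id": "id", "email": "email"},  # column name -> attribute name
}


def generate_inserts(xml_text, rule):
    """Yield one (sql, parameters) pair per element matched by the rule."""
    root = ET.fromstring(xml_text)
    column_names = list(rule["columns"])
    sql = "INSERT INTO {} ({}) VALUES ({})".format(
        rule["table"],
        ", ".join(column_names),
        ", ".join("?" for _ in column_names),
    )
    for element in root.iter(rule["row_element"]):
        yield sql, [element.get(rule["columns"][name]) for name in column_names]


connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE person (id TEXT, email TEXT)")
statements = list(generate_inserts(XML_TEXT, MAPPING_RULE))
for sql, parameters in statements:
    connection.execute(sql, parameters)  # the "output & execute" step
rows = connection.execute("SELECT id, email FROM person ORDER BY id").fetchall()
```

Note that the rule maps one element type to one table one-to-one, so the table layout necessarily mirrors the XML layout.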


Issues of Using XML Biological data

• Direct Method (cont'd)


Issues of Using XML Biological data

• Current methods force the DB to follow the XML schema
• Complex structured XML
  - Elements share the same name even though they should be different columns in the DB (DIP, InterPro…)
• Large size of file; we cannot use DOM
• XML updated frequently; the process should be easy

    <protein id="ID_A" name="PROTEIN_A">
      <ref db="B" id="ID_B" />
      <ref db="C" id="ID_C" />
      ………..

We want:

    ID    NAME       B     C     D     E
    ID_A  PROTEIN_A  ID_B  ID_C  ID_D  ID_E

Rather than:

    ID    DB  ID
    ID_A  B   ID_B
    ID_A  C   ID_C
    ID_A  D   ID_D
    ID_A  E   ID_E
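
One way to get the wide row without DOM is a streaming SAX handler that turns each `<ref>` into a column as it is read. This is a minimal sketch: the `<proteins>` root element, the `ProteinRowHandler` class, and the completed `D` and `E` references are assumptions added so that the truncated fragment above becomes a parseable document.

```python
import xml.sax

# The slide's fragment, completed into a well-formed document (assumed wrapper).
XML_TEXT = """<proteins>
  <protein id="ID_A" name="PROTEIN_A">
    <ref db="B" id="ID_B"/>
    <ref db="C" id="ID_C"/>
    <ref db="D" id="ID_D"/>
    <ref db="E" id="ID_E"/>
  </protein>
</proteins>"""


class ProteinRowHandler(xml.sax.ContentHandler):
    """Build one wide row per <protein>, with one column per referenced DB."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._current = None

    def startElement(self, name, attrs):
        if name == "protein":
            self._current = {"ID": attrs.get("id"), "NAME": attrs.get("name")}
        elif name == "ref" and self._current is not None:
            # The db attribute selects the column; the id attribute is the value.
            self._current[attrs.get("db")] = attrs.get("id")

    def endElement(self, name):
        if name == "protein":
            self.rows.append(self._current)
            self._current = None


handler = ProteinRowHandler()
xml.sax.parseString(XML_TEXT.encode("utf-8"), handler)
wide_rows = handler.rows
```

Only the row currently being built is held in memory, which is why an event-based parser copes with files that are too large for DOM. In a real loader each completed row would be written to the database in `endElement` instead of being collected in a list.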


Issues of Using XML Biological data

• The Direct Method cannot cover the following XML type
• It cannot integrate two or more files; it needs constraints

    <node id="G:1" uid="DIP:232N" name="BAXA_HUMAN" class="protein">
      <xref db="DIP" id="232N" type="src"/>
      <feature name="swp_ref" class="cref">
        <src>SwissProt</src>
        <val>SWP:Q07812</val>
        <xref db="SWP" id="Q07812" type="src"/>
      </feature>
      <feature name="pir_ref" class="cref">
        <src>PIR</src>
        <val>PIR:A47538</val>
        <xref db="PIR" id="A47538" type="src"/>
      </feature>
      <feature name="gi_ref" class="cref">
        <src>NCBI</src>
        <val>GI:539664</val>
        <xref db="gi" id="539664" type="src"/>
      </feature>
      <att name="descr">
        <val>bcl-2-associated protein x, alpha splice form</val>
      </att>
      <att name="organism">
        <val>Homo sapiens</val>
        <xref db="TXID" id="9606" type="ont"/>
      </att>
    </node>

We want:

    ID   NAME        DIP_ID    SWP_ID  PIR_ID  GI_ID
    G:1  BAXA_HUMAN  DIP:232N  Q07812  A47538  539664

But, we get:

    ID   DB   Ref_ID
    G:1  DIP  DIP:232N
    G:1  SWP  Q07812
    G:1  PIR  A47538
    G:1  gi   539664


Issues of Using XML Biological data

• Make a data set for a tuple, which ignores sub-document tree nodes
• Define an SQL-like syntax
  - WHERE condition on each column for constraints
  - Multiple files can be populated into one table by manipulation

    CREATE TABLE PROTEIN_IDs(ID_A CHAR(20), NAME CHAR(20), B CHAR(20),
                             C CHAR(20), D CHAR(20), E CHAR(20) )
    AS ( SELECT ( FILE.protein@id,
                  FILE.protein@name,
                  [FILE.protein.ref]@id WHERE @db = B,
                  [FILE.protein.ref]@id WHERE @db = C,
                  [FILE.protein.ref]@id WHERE @db = D,
                  [FILE.protein.ref]@id WHERE @db = E,
                  [FILE_2.ELEMENT]@value WHERE @id=ID_A )
         FROM "file/protein.xml" AS FILE, "file/file.xml" AS FILE_2);
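
The intended semantics of this statement can be approximated in plain Python. The sketch below is one possible reading, not an implementation of the proposed syntax: it assumes that `WHERE @id=ID_A` means "join on this row's ID_A column", it stores the seventh selected expression in a hypothetical `VALUE` column (the CREATE TABLE above declares only six), and the contents of both files are made up.

```python
import xml.etree.ElementTree as ET

# Made-up stand-ins for "file/protein.xml" (FILE) and "file/file.xml" (FILE_2).
FILES = {
    "FILE": """<root>
      <protein id="ID_A" name="PROTEIN_A">
        <ref db="B" id="ID_B"/><ref db="C" id="ID_C"/>
        <ref db="D" id="ID_D"/><ref db="E" id="ID_E"/>
      </protein>
    </root>""",
    "FILE_2": '<root><ELEMENT id="ID_A" value="VALUE_A"/></root>',
}


def first_attribute(parent, tag, attribute, where):
    """Return `attribute` of the first <tag> under `parent` matching `where`."""
    for element in parent.iter(tag):
        if all(element.get(key) == value for key, value in where.items()):
            return element.get(attribute)
    return None


def build_rows(files):
    """Populate one wide row per <protein>, pulling values from both files."""
    proteins = ET.fromstring(files["FILE"])
    second_file = ET.fromstring(files["FILE_2"])
    rows = []
    for protein in proteins.iter("protein"):
        row = {"ID_A": protein.get("id"), "NAME": protein.get("name")}
        for db in ("B", "C", "D", "E"):
            # [FILE.protein.ref]@id WHERE @db = <db>
            row[db] = first_attribute(protein, "ref", "id", {"db": db})
        # [FILE_2.ELEMENT]@value WHERE @id = ID_A (assumed join on the row key)
        row["VALUE"] = first_attribute(
            second_file, "ELEMENT", "value", {"id": row["ID_A"]}
        )
        rows.append(row)
    return rows


rows = build_rows(FILES)
```

The per-column condition is what lets same-named `<ref>` elements land in different columns, and the join on the row key is what lets a second file contribute to the same tuple.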


Summary

• Integration of biological data is a kind of Web-based information management

• Integration in bioinformatics is very important because a comprehensive view lets us discover more valuable biological information

• Biological XML data have some properties that hinder integration, so the schema-driven approach and the warehousing model are usually used for integration

Thank you~~~