Information and Communications Univ. Bioinformatics & Software Systems Lab. Woo-Hyuk Jang Integration of Biological XML data Ph. D. Lecture


ICE0534 - Web-Based Software Development, Summer 2005

Where are we?

[Figure: course roadmap. Topics shown:]

• Server-Side Info. Management
• Client-Side Info. Management
• Business related Issues
• Web Services
• Internationalization and Privacy
• XML & XML Processing
• HTML, JavaScript, Plug-in, Applet…
• WWW Concepts & Web-based Info. Management
• S.S. Info. Management Concept
• CGI, Java Servlets
• JDBC, MySQL
• App. of Web-based tech.
• Semantic Web
• This Lecture ?


Can you remember?

• Problems in Integrating Heterogeneous Information
  - Heterogeneity of formats, data types, units, or semantics.

• Information Mediation

Fig 1. Mediator in Lecture 7.


This Lecture Contains…

• Information Integration in Bioinformatics
  - Bioinformatics Overview
    • Is there any relationship between Web and Bioinformatics?
  - Difficulties to handle Biological XML data
  - What is it? Why?
    • Cultures: Schema-driven, Data-driven
    • Models: Federation, Warehousing, Mediation

• Integration of XML format data
  - Problems
  - Issues

• Summary


Bioinformatics

• A narrow sense
  - The application of information technology to life science research
    • Modeling (abstraction)
    • Analysis and collection
    • Data integration and information retrieval
  - Enables the discovery and analysis of biomolecules and their properties (structure, function, interactions)

• A wide sense
  - The use of computers to collect, analyze, and interpret biological information at the molecular level


Web and Bioinformatics

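[Figure: cycle between the Biologist and the Computer Scientist. The Biologist side is labeled "Experiment, Publish" and "Use"; the Computer Scientist side is labeled "Make, Publish" and "Use"; the two shared resources are Biological Data and Bio Applications.]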


Difficulties to Handle Biological XML data

• Lack of standards
  - Different data models and schemas
  - Different handling methods are needed
  - Different formats

• Monstrous volume of data
  - It is growing exponentially
  - Data are updated very frequently
    • Newly introduced data, error-fixed data


Why Integration?

• In the post-human genome sequencing era, many analyses on the genome scale are possible

• Majority of human diseases are the product of multi-step pathophysiological processes

• The biggest challenge in interpreting the results of these analyses lies in the data integration problem


Two Cultures of Integration

• Database Integration
  - Schema level view
  - Focus on outside of data

• Data Integration
  - Data level view
  - Focus on inside of data

[Figure: four sources, each shown with its schema (Schema 1 to Schema 4) and its data (Data 1 to Data 4).]


Two Cultures of Integration

• Schema-driven (computer scientists)
  - Schemas are much smaller than data, with (hopefully) well-defined elements
  - Resolve redundancy and heterogeneity at the schema level
  - High degree of automation once the system is set up
  - Focus on methods: you rarely publish a "data paper"

• Data-driven (biologists)
  - Value is in the data; abstraction is a result of analysis
  - Don't bother with schemas
    • Abstraction is volatile and depends on experimental technique
  - Manual integration at data level, constant high effort
  - You rarely publish a (database) "method paper"


Models of Integration

• Federation (Multi-database)
• Warehousing (Materialized in house)
• Mediation (Virtual integration)



• Federation (Multi-database)
  - K2/BioKleisli, Entrez

• Warehousing (Materialized in house)
  - GUS (Genome Unified Schema), SRS (Sequence Retrieval System)

[Figure: federation and warehousing architectures. Labels: User Query, Global Schema, Local Schema, Wrapper, O/RDB, Web Source, Repository, Data Extraction, Integration & Storage, Local Operational, Warehouse, Decision Support & Mining, Network / Internet.]



• Mediation (Virtual integration)
  - TAMBIS (Transparent Access to Multiple Bioinformatics Information Sources)

[Figure: mediation architecture. Labels: User Query, Mediated Schema, Mediator, Query Translation, Query 1, Query 2, Integration System, Wrapper, Data Sources (O/RDB, Web Source), Network / Internet.]



• Federation represents a more “static” approach – using agreed couplings to allow view creation.

• Warehousing and Mediation address integration in a more “dynamic” way – using extraction, transformation and integration processes.


Warehousing vs. Mediation

• Warehouse
  - Update-driven, i.e. in the warehouse repository
  - Heterogeneous data is integrated in advance and stored in-house for direct query and analysis.

• Mediation
  - Wrapper and Mediator layer on top of source DBs.
  - Query-driven: a query to the mediated schema is translated into queries appropriate to the sources.
  - Results are integrated into a global answer set.
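The contrast between the two models can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two in-memory "sources", their field names, and the functions `wrap_a`, `wrap_b`, `mediate`, `build_warehouse` and `query_warehouse` are all hypothetical and do not correspond to any system named in this lecture.

```python
from typing import Dict, List

Row = Dict[str, str]

# Two hypothetical sources with different local schemas (invented data).
SOURCE_A: List[Row] = [{"acc": "P1", "protein_name": "BAX"}]
SOURCE_B: List[Row] = [{"id": "P2", "label": "BCL2"}]


def wrap_a(name: str) -> List[Row]:
    """Wrapper for source A: answer a mediated-schema query in A's own terms."""
    return [{"id": r["acc"], "name": r["protein_name"]}
            for r in SOURCE_A if r["protein_name"] == name]


def wrap_b(name: str) -> List[Row]:
    """Wrapper for source B: same mediated schema, different local field names."""
    return [{"id": r["id"], "name": r["label"]}
            for r in SOURCE_B if r["label"] == name]


def mediate(name: str) -> List[Row]:
    """Query-driven: ask every source at query time, then merge the answers."""
    answer: List[Row] = []
    for wrapper in (wrap_a, wrap_b):
        answer.extend(wrapper(name))
    return answer


def build_warehouse() -> List[Row]:
    """Update-driven: extract and integrate everything in advance."""
    rows = [{"id": r["acc"], "name": r["protein_name"]} for r in SOURCE_A]
    rows += [{"id": r["id"], "name": r["label"]} for r in SOURCE_B]
    return rows


def query_warehouse(warehouse: List[Row], name: str) -> List[Row]:
    """Direct query against the in-house copy; the sources are not contacted."""
    return [r for r in warehouse if r["name"] == name]
```

In this sketch, a record added to a source after `build_warehouse()` has run is visible to `mediate()` immediately, but not to `query_warehouse()` until the warehouse is rebuilt, which is the practical meaning of update-driven versus query-driven.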


Now let’s study the…

• Information Integration in Bioinformatics
  - Bioinformatics Overview
    • Is there any relationship between Web and Bioinformatics?
  - Difficulties to handle Biological XML data
  - Why Integration?
    • Cultures: Schema-driven, Data-driven
    • Models: Federation, Warehousing, Mediation

• Integration of XML format data
  - Problems
  - Issues

• Discussion about Reading Question #6


Integration of XML format data

• Why XML?
  - Biology is a complex discipline
  - Wide variety of data resources and repositories
    • No standard protocol exists to interrogate biological data stores
    • No standard data format exists to exchange biological data
    • No standard data model exists
  - Difficulties in using and exchanging data
    • There exist various tools that can support XML handling



• Problems
  - We focus on schema-driven integration
  - Warehousing model is efficient
    • Have to analyze data
    • Performance
    • Implementing a perfect mediation model is extremely difficult
  - XML data should be converted into RDB
  - We want to make our own DB schema accommodating the data from XML files
  - We need to design the DB schema with regard to efficiency and our own purpose
  - Heterogeneity and large scale



• PreSPI (Prediction System for Protein Interaction)

[Figure: PreSPI architecture. Labels: XML sources for Sequence, Structure, Function, Domain, ... (Local and Web); General XML Wrapper (SAX); Local DB1, Local DB2, Local DB3; Integration Rule; Warehouse.]


Issues of Using XML Biological data

• Structure
  - Semi-structured: can be expressed as trees or graphs
  - Theoretically, it is ideal to map them into a DB according to their structural features

• Methods for storing XML
  - File system
    • Has overhead for queries
    • Text file, inverted list, compressed file
  - Specific storing method
    • Uses XML's own structure
  - DB system
    • Mapping into RDB in particular has been researched a lot
    • Has overhead for converting into the appropriate model


Issues of Using XML Biological data

Object view of the XML using DOM

A class can be mapped into a table; PCDATA or an ATTRIBUTE can be a column

XML              Objects             Tables
=============    ============        ==============
                                     Table A
<A>              object A {          ----------------------
  <B>bbb</B>       B = "bbb"          B     C     D
  <C>ccc</C> <=>   C = "ccc"    <=>   ---   ---   ---
  <D>ddd</D>       D = "ddd"          ...   ...   ...
</A>             }                    bbb   ccc   ddd
                                      ...   ...   ...

XML-view

CREATE XMLVIEW xview_1( id char(20), email char(30) )
AS ( 'select p.personnel.person@id, p.personnel.person@email
      from "file:/home/user1/personal.xml", p;' );

"A generic load/extract utility for data transfer between XML documents and relational databases", Bourret, R.; Bornhovd, C.; Buchmann, A.; Advanced Issues of E-Commerce and Web-Based Information Systems, 2000.
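The object view above (element A becomes table A; its PCDATA children B, C, D become columns) can be tried out with the Python standard library. The following is a minimal sketch, not the utility from the cited paper: the function name `load_object_view` is hypothetical, every column is assumed to be text, and the element names are interpolated into SQL, so it is only suitable for trusted input.

```python
import sqlite3
from xml.dom import minidom

XML = "<A><B>bbb</B><C>ccc</C><D>ddd</D></A>"


def load_object_view(xml_text, conn):
    """Map the root element to a table and its PCDATA children to columns."""
    doc = minidom.parseString(xml_text)       # DOM: whole document in memory
    root = doc.documentElement                # <A> -> table "A"
    table = root.tagName
    children = [n for n in root.childNodes if n.nodeType == n.ELEMENT_NODE]
    columns = [c.tagName for c in children]   # <B>, <C>, <D> -> columns
    values = [c.firstChild.data if c.firstChild else None for c in children]

    # Assumption: all columns are TEXT; names come straight from the XML.
    col_sql = ", ".join('"{}" TEXT'.format(c) for c in columns)
    conn.execute('CREATE TABLE IF NOT EXISTS "{}" ({})'.format(table, col_sql))
    placeholders = ", ".join("?" for _ in columns)
    conn.execute('INSERT INTO "{}" VALUES ({})'.format(table, placeholders),
                 values)
    return table, columns


conn = sqlite3.connect(":memory:")
table, columns = load_object_view(XML, conn)
rows = conn.execute('SELECT B, C, D FROM "A"').fetchall()
```

Because DOM builds the whole tree in memory before anything is stored, this style suits small documents; it also makes the table layout follow the XML layout exactly.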


Issues of Using XML Biological data

• Direct method

[Figure: the XML Document and the Mapping Rule are inputs to the XML Saver, which outputs and executes Insert Statements.]

"A direct method of data exchange between XML and relational database", Bei Jia; Cai Fei; Tao Lie-Jun; Pan Jin-Gui; Information Technology Interfaces, 2004. 26th International Conference on, 2004, Page(s): 127 - 132, Vol. 1
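The flow in the figure (XML document plus mapping rule in, INSERT statements out, then executed) might look roughly like the sketch below. It is not the algorithm of the cited paper: the dictionary format of `MAPPING_RULE`, the function `make_insert_statements`, the `<name>` child element and the sample data are all assumptions made for illustration.

```python
import sqlite3
import xml.etree.ElementTree as ET

XML_DOC = """<personnel>
  <person id="p1" email="a@example.org"><name>Kim</name></person>
  <person id="p2" email="b@example.org"><name>Lee</name></person>
</personnel>"""

# Hypothetical mapping rule: which element forms a row, and where each
# column value comes from ("@x" = attribute x, otherwise a child element).
MAPPING_RULE = {
    "table": "person",
    "row_path": "person",
    "columns": {"id": "@id", "email": "@email", "name": "name"},
}


def make_insert_statements(xml_text, rule):
    """Return one (sql, params) pair per row element in the document."""
    root = ET.fromstring(xml_text)
    cols = list(rule["columns"])
    sql = "INSERT INTO {} ({}) VALUES ({})".format(
        rule["table"], ", ".join(cols), ", ".join("?" for _ in cols))
    statements = []
    for elem in root.findall(rule["row_path"]):
        params = []
        for source in rule["columns"].values():
            if source.startswith("@"):
                params.append(elem.get(source[1:]))    # attribute value
            else:
                params.append(elem.findtext(source))   # child element text
        statements.append((sql, tuple(params)))
    return statements


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (id TEXT, email TEXT, name TEXT)")
for sql, params in make_insert_statements(XML_DOC, MAPPING_RULE):
    conn.execute(sql, params)   # the "output & execute" step
```

Keeping the rule as data rather than code means a changed XML layout only requires a new rule, which matters when the XML files are updated frequently.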


Issues of Using XML Biological data

• Direct Method (cont'd)


Issues of Using XML Biological data

• Current methods force the DB to follow the XML schema

• Complex structured XML
  - Elements share the same element name even though they should be different columns in the DB (DIP, InterPro…)

• Large file size; we cannot use DOM
• XML is updated frequently; the process should be easy

... B C ...

… ID_B ID_B …

<protein id="ID_A" name="PROTEIN_A">
  <ref db="B" id="ID_B" />
  <ref db="C" id="ID_C" />
  ………..

One row per protein:

ID     NAME        B      C      D      E
ID_A   PROTEIN_A   ID_B   ID_C   ID_D   ID_E

Rather than one row per reference:

ID     DB   ID
ID_A   B    ID_B
ID_A   C    ID_C
ID_A   D    ID_D
ID_A   E    ID_E
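Since large files rule out DOM, the pivot from repeated `<ref>` elements into one row per protein can be done in a streaming fashion with SAX. The sketch below is one possible way to do it, not the PreSPI wrapper itself: the class name `ProteinPivot`, the enclosing `<proteins>` root element and the completed sample document are assumptions.

```python
import xml.sax

# Assumed sample: the slide's fragment, closed and wrapped in a root element.
XML = b"""<proteins>
  <protein id="ID_A" name="PROTEIN_A">
    <ref db="B" id="ID_B" />
    <ref db="C" id="ID_C" />
    <ref db="D" id="ID_D" />
    <ref db="E" id="ID_E" />
  </protein>
</proteins>"""


class ProteinPivot(xml.sax.ContentHandler):
    """Turn same-named <ref> elements into separate columns, keyed by @db."""

    def __init__(self, dbs):
        super().__init__()
        self.dbs = dbs          # databases that become columns
        self.rows = []          # finished rows, one dict per protein
        self.current = None     # row being built, or None outside <protein>

    def startElement(self, name, attrs):
        if name == "protein":
            self.current = {"ID": attrs.get("id"), "NAME": attrs.get("name")}
            for db in self.dbs:
                self.current[db] = None
        elif name == "ref" and self.current is not None:
            db = attrs.get("db")
            if db in self.dbs:   # the constraint: column chosen by @db
                self.current[db] = attrs.get("id")

    def endElement(self, name):
        if name == "protein" and self.current is not None:
            self.rows.append(self.current)
            self.current = None


handler = ProteinPivot(["B", "C", "D", "E"])
xml.sax.parseString(XML, handler)
```

Only the row under construction needs to be held while parsing. This sketch collects the finished rows in a list for inspection; a loader for very large files would presumably write each row to the database in `endElement` instead.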


Issues of Using XML Biological data

• The direct method cannot cover the following type of XML
• It cannot integrate two or more files; constraints are needed

<node id="G:1" uid="DIP:232N" name="BAXA_HUMAN" class="protein">
  <xref db="DIP" id="232N" type="src"/>
  <feature name="swp_ref" class="cref">
    <src>SwissProt</src>
    <val>SWP:Q07812</val>
    <xref db="SWP" id="Q07812" type="src"/>
  </feature>
  <feature name="pir_ref" class="cref">
    <src>PIR</src>
    <val>PIR:A47538</val>
    <xref db="PIR" id="A47538" type="src"/>
  </feature>
  <feature name="gi_ref" class="cref">
    <src>NCBI</src>
    <val>GI:539664</val>
    <xref db="gi" id="539664" type="src"/>
  </feature>
  <att name="descr">
    <val>bcl-2-associated protein x, alpha splice form</val>
  </att>
  <att name="organism">
    <val>Homo sapiens</val>
    <xref db="TXID" id="9606" type="ont"/>
  </att>
</node>

We want:

ID    NAME         DIP_ID     SWP_ID   PIR_ID   GI_ID
G:1   BAXA_HUMAN   DIP:232N   Q07812   A47538   539664

But, with the direct method:

ID    DB    Ref_ID
G:1   DIP   DIP:232N
G:1   SWP   Q07812
G:1   PIR   A47538
G:1   gi    539664


Issues of Using XML Biological data

• Make a data set for a tuple, which ignores sub-document tree nodes
• Define an SQL-like syntax
  - WHERE condition on each column for constraints
  - Multiple files can be populated into one table by manipulation

CREATE TABLE PROTEIN_IDs(ID_A CHAR(20), NAME CHAR(20), B CHAR(20),
                         C CHAR(20), D CHAR(20), E CHAR(20) )
AS ( SELECT ( FILE.protein@id,
              FILE.protein@name,
              [FILE.protein.ref]@id WHERE @db = B,
              [FILE.protein.ref]@id WHERE @db = C,
              [FILE.protein.ref]@id WHERE @db = D,
              [FILE.protein.ref]@id WHERE @db = E,
              [FILE_2.ELEMENT]@value WHERE @id=ID_A )
     FROM "file/protein.xml" AS FILE, "file/file.xml" AS FILE_2);
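One way to read the proposed syntax is that every column carries its own selection path and its own WHERE constraint, and that the last expression joins a second file on the protein ID. The Python sketch below imitates that reading on two small in-memory documents. It is an interpretation, not an implementation of the syntax: the helper `first_attr`, the contents of both documents, and the extra column name `EXTRA_VALUE` (the slide's column list does not name a column for the `FILE_2` expression) are assumptions.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Assumed stand-ins for "file/protein.xml" and "file/file.xml".
PROTEIN_XML = """<FILE>
  <protein id="ID_A" name="PROTEIN_A">
    <ref db="B" id="ID_B"/><ref db="C" id="ID_C"/>
    <ref db="D" id="ID_D"/><ref db="E" id="ID_E"/>
  </protein>
</FILE>"""
SECOND_XML = """<FILE_2>
  <ELEMENT id="ID_A" value="extra"/>
  <ELEMENT id="ID_X" value="unrelated"/>
</FILE_2>"""


def first_attr(parent, path, attr, where):
    """Value of `attr` on the first element at `path` matching all of `where`."""
    for elem in parent.findall(path):
        if all(elem.get(k) == v for k, v in where.items()):
            return elem.get(attr)
    return None


def build_rows(protein_root, second_root):
    """One tuple per <protein>; each column has its own WHERE constraint."""
    rows = []
    for p in protein_root.findall("protein"):
        pid = p.get("id")
        rows.append((
            pid,
            p.get("name"),
            first_attr(p, "ref", "id", {"db": "B"}),   # WHERE @db = B
            first_attr(p, "ref", "id", {"db": "C"}),   # WHERE @db = C
            first_attr(p, "ref", "id", {"db": "D"}),   # WHERE @db = D
            first_attr(p, "ref", "id", {"db": "E"}),   # WHERE @db = E
            # Second file, joined on the protein ID (WHERE @id = ID_A).
            first_attr(second_root, "ELEMENT", "value", {"id": pid}),
        ))
    return rows


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE PROTEIN_IDs (ID_A TEXT, NAME TEXT, B TEXT, "
             "C TEXT, D TEXT, E TEXT, EXTRA_VALUE TEXT)")
rows = build_rows(ET.fromstring(PROTEIN_XML), ET.fromstring(SECOND_XML))
conn.executemany("INSERT INTO PROTEIN_IDs VALUES (?, ?, ?, ?, ?, ?, ?)", rows)
```

The point of the per-column constraint is that the table layout is chosen by the database designer rather than dictated by the XML structure, and that a single row can draw on more than one file.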


Summary

• Integration of biological data is a kind of Web-based information management

• Integration in bioinformatics is very important work because a comprehensive view lets us find more valuable biological information

• Biological XML data have some properties that hinder integration, so the schema-driven culture and the warehousing model are usually used for integration

Thank you~~~
