Exploitation of Structural Similarity in Semi-Structured Bioinformatics
Data for Efficient Storage Construction
Dongkyoo Shin ([email protected])
Sejong University, InCob2007
Multimedia & Internet Laboratory, Sejong University 2/20
Table of contents
• Abstract
• Background
• Methods
• Results
• Conclusions
Abstract (1)
• Background
  – Many studies address storing XML data in database tables
    • They aim to reduce the number of joins between tables
    • They are not well suited to microarray data, which has a distinctive hierarchy
  – Hierarchical feature of the microarray data model
    • A few core values occur repeatedly
  – New approach for capturing this feature
    • Classify elements with similar structure into a group
    • Design a common database table for each group
Abstract (2)
• Results
  – The database schema created by our approach
    • Reduces the number of table joins substantially
    • Improves the performance of storing and loading XML-based microarray data
• Conclusions
  – Mining the structural similarity of elements is an efficient way to improve the performance of storing and loading microarray data
Background (1)
• DTD (Document Type Definition)-dependent base
  – Map one element into one table

For each e ∈ E, #(S) ≥ 1 OR #(A) ≥ 1 -> define_Class(e)
For each Se ∈ S -> Add_attributes_of_Class(e)
Se ∈ SequenceType -> Define_multivalued_att(Se, e)

<?xml?>
<!ELEMENT Dept (A*)>
<!ATTLIST Dept dept_id ID #REQUIRED>
<!ELEMENT Employee (Name, Enroll*)>
<!ATTLIST Employee employee_id ID #REQUIRED>
<!ELEMENT Name (#PCDATA)>
<!ELEMENT Enroll (#PCDATA)>
The Dept table
ParentID  ID  dept_id
1         2   "dept1"

The Employee table
ParentID  ID  Employee_id
2         3   "123"
2         4   "124"

The Name table
ParentID  ID  TEXT
3         5   "St1"
3         6   "St2"

The Enroll table
ParentID  ID  TEXT
3         7   "CS10"
3         8   "CS20"
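The one-element-to-one-table mapping can be sketched in a few lines of Python. This is only an illustration, not the authors' implementation: the `ELEMENTS` dictionary is a hand-written view of the DTD above, and the function name `table_ddl`, the column types, and the `TEXT_VALUE` column name (standing in for the TEXT column of the slide) are assumptions.

```python
# Hypothetical hand-written view of the DTD: element -> (attributes, has_text).
ELEMENTS = {
    "Dept": (["dept_id"], False),
    "Employee": (["employee_id"], False),
    "Name": ([], True),
    "Enroll": ([], True),
}


def table_ddl(name, attributes, has_text):
    """Build one CREATE TABLE statement for one element."""
    # Every table carries its own ID and the ID of its parent row.
    columns = ["ParentID INTEGER", "ID INTEGER PRIMARY KEY"]
    columns += [f"{attr} TEXT" for attr in attributes]
    if has_text:
        # Elements with #PCDATA content store it in a text column.
        columns.append("TEXT_VALUE TEXT")
    return f"CREATE TABLE {name} ({', '.join(columns)});"


# One table per element: the schema grows with the DTD.
ddl = [table_ddl(name, attrs, text) for name, (attrs, text) in ELEMENTS.items()]
```

Because every element yields its own table, reading one Employee with its Name and Enroll values already needs joins across three tables, which is the cost the later slides try to reduce.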
Background (2)
• Inline technique base
  – Reduce the complexity of the DTD (Document Type Definition)

For each e, #(S) == 1 AND Se ∈ SequenceType -> Add_Multi-valued_attribute_of_ParentClass(e)

<?xml?>
<!ELEMENT A (B, C)>
<!ELEMENT B (#PCDATA)>
<!ELEMENT C (D)*>
<!ELEMENT D (E, F*, G)>
<!ELEMENT E (H)>
<!ELEMENT F (#PCDATA)>
<!ELEMENT G (#PCDATA)>
<!ELEMENT H (#PCDATA)>
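[Figure: element tree before and after inlining. Before: A has children B and C; C has D*; D has E, F*, and G; E has H. After: D is inlined, so E, F*, and G hang directly under C.]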
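A rough Python sketch of the inlining step on this example follows. It rests on an assumed reading of the rule: an element with exactly one sub-element, where that sub-element is itself a sequence of sub-elements, absorbs the sub-element's children. The `DTD` dictionary, the function name `inline`, and the choice to ignore the `*` cardinality markers are all illustrative assumptions.

```python
# Hypothetical view of the DTD as element -> list of sub-elements
# (cardinality markers such as * are ignored in this sketch).
DTD = {
    "A": ["B", "C"], "B": [], "C": ["D"], "D": ["E", "F", "G"],
    "E": ["H"], "F": [], "G": [], "H": [],
}


def inline(children, root):
    """Fold single sub-elements that are sequences into their parent."""
    result = {name: list(subs) for name, subs in children.items()}
    for name in list(result):
        subs = result[name]
        # Exactly one sub-element, and that sub-element has children itself.
        if len(subs) == 1 and result[subs[0]]:
            result[name] = list(result[subs[0]])
    # Keep only the elements still reachable from the root.
    reachable, stack = set(), [root]
    while stack:
        node = stack.pop()
        if node not in reachable:
            reachable.add(node)
            stack.extend(result[node])
    return {name: subs for name, subs in result.items() if name in reachable}


inlined = inline(DTD, "A")
```

On this input D disappears and C takes over E, F, and G, while E keeps H because H has no sub-elements of its own.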
Background (3)
• Drawbacks of previous approaches
  – DTD-dependent
    • The database schema has the same complexity as the DTD
  – Inline technique
    • Depends strongly on the number of omissible elements
• New design approach for microarray databases
  – Capture similar structural features of microarray data
  – Needs a fast and simple way to mine the structural features
Background (5)
• Microarray data and MAGE (Microarray Gene Expression) standards
  – Research groups share microarray data with others and use it to answer their biological questions
  – MGED Society's standard definitions
    • MIAME (Minimum Information About a Microarray Experiment)
    • MAGE-OM and MAGE-ML
      – Exchange object model and format for MIAME
  – Structural feature of MAGE-OM
    • A wide variety of objects define the same data types, including complex types
Background (6)
• Decision tree
  – A simple model that makes classification rules, correlations, and effects between variables easy to understand
  – Suitable for mining structural features of the MAGE-ML DTD itself (not MAGE-ML instances)
• All elements can be classified into three levels:
  – A root, a mediators group, and a bottoms group
[Figure: example tree with root A, nodes B and C, and leaves D, E, and F, annotated with the terms root, node, leaf, and depth.]
Methods (1)
• Classification of core features using a decision tree
  – Terminology for expressing a complexType
    • e: an element defined in the XML schema
    • E: the set of elements e
    • SE: the set of sub-elements of e
    • a: an attribute of e
    • A: the set of attributes of e
    • SA: the set of attributes for all sub-elements of e
    • complexType: structural information that consists of SE and (or) A of e
    • Lowest child: an element without a sub-element
    • Lowest parent: an element with a sub-element that is one of the lowest child elements
    • PG (Parent Group): a set of candidate elements to be parents of a Lowest Child
    • LPCG (The Lowest Parent Candidate Group): a set of candidates to be Lowest Parent
    • LCG (The Lowest Child Group): a set of Lowest Child elements
    • LPG (The Lowest Parent Group): a set of Lowest Parent elements
    • ULPG (Upper Level Parent Group): a set of upper level parents, including elements that are neither Lowest Child nor Lowest Parent

<xsd:element name="A">
  <xsd:complexType>
    <xsd:sequence>
      <xsd:element ref="B"/>
      <xsd:element ref="C"/>
    </xsd:sequence>
  </xsd:complexType>
</xsd:element>
<xsd:element name="B">
  <xsd:complexType>
    <xsd:sequence>
      <xsd:element ref="D"/>
    </xsd:sequence>
  </xsd:complexType>
</xsd:element>
<xsd:element name="C">
  <xsd:complexType>
    <xsd:sequence>
      <xsd:element ref="E"/>
      <xsd:element ref="F"/>
    </xsd:sequence>
  </xsd:complexType>
</xsd:element>
<xsd:element name="D">
  <xsd:complexType>
    <xsd:sequence>
      <xsd:element ref="G"/>
      <xsd:element ref="H"/>
    </xsd:sequence>
  </xsd:complexType>
</xsd:element>
<xsd:element name="E">
  <xsd:complexType>
    <xsd:sequence>
      <xsd:element ref="I"/>
    </xsd:sequence>
  </xsd:complexType>
</xsd:element>
<xsd:element name="F">
  <xsd:complexType>
    <xsd:sequence>
      <xsd:element ref="I"/>
      <xsd:element ref="J"/>
    </xsd:sequence>
  </xsd:complexType>
</xsd:element>
<xsd:element name="G">
  <xsd:complexType>
    <xsd:attribute name="identifier" use="required"/>
    <xsd:attribute name="name"/>
  </xsd:complexType>
</xsd:element>
<xsd:element name="H">
  <xsd:complexType>
    <xsd:attribute name="identifier" use="required"/>
    <xsd:attribute name="name"/>
  </xsd:complexType>
</xsd:element>
<xsd:element name="I">
  <xsd:complexType>
    <xsd:attribute name="width" use="required"/>
    <xsd:attribute name="height" use="required"/>
    <xsd:attribute name="length" use="required"/>
  </xsd:complexType>
</xsd:element>
<xsd:element name="J">
  <xsd:complexType>
    <xsd:attribute name="identifier" use="required"/>
    <xsd:attribute name="name"/>
  </xsd:complexType>
</xsd:element>
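[Figure: element tree for the schema above. A is the parent of B and C; B of D; C of E and F; D of G and H; E of I; F of I and J. A, B, and C are upper level parents; D, E, and F are lowest parents; G, H, I, and J are lowest children. The parent group consists of A through F.]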
Methods (2)
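• Expression of a complexType
  – A complexType defines the structural information of elements
    • A set of arrays including data types
• Definition of structural similarity
  – Elements ele_x and ele_y are structurally similar when their complexTypes are equal

$SE_{ele_x} = \{e_1, e_2, \ldots, e_n\}, \quad SA_{ele_x} = \{A_{e_1}, A_{e_2}, \ldots, A_{e_n}\}$
$complexType(ele_x) = \{SE_{ele_x}, SA_{ele_x}\}$
$complexType(ele_x) = complexType(ele_y)$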
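The equality test can be illustrated with a small Python sketch. It is a simplification under stated assumptions: the signature here holds the sub-element names (SE) and the element's own attribute names (A), following the "SE and (or) A" wording of the terminology slide, and it leaves out the per-sub-element attribute sets SA. The names `ComplexType`, `complex_type`, and `structurally_similar` are hypothetical.

```python
from typing import FrozenSet, NamedTuple


class ComplexType(NamedTuple):
    """Structural signature of an element (a simplification of {SE, SA})."""
    sub_elements: FrozenSet[str]  # SE: names of the sub-elements
    attributes: FrozenSet[str]    # A: names of the element's own attributes


def complex_type(sub_elements, attributes):
    """Build an order-independent signature from two name lists."""
    return ComplexType(frozenset(sub_elements), frozenset(attributes))


def structurally_similar(x, y):
    """Two elements are similar when their signatures are equal."""
    return x == y


# Elements G, H, and I from the example schema.
g = complex_type([], ["identifier", "name"])
h = complex_type([], ["identifier", "name"])
i = complex_type([], ["width", "height", "length"])
```

With these signatures G and H compare equal, while I does not match either of them.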
Methods (3)
• Decision tree for recognizing the core features
  – Condition 1: If rule 1 is satisfied, then e arrives at LCG. Otherwise, it arrives at PG.
  – Condition 2: If rule 2 is satisfied, then e and the elements similar to e arrive at a new LCG.
  – Condition 3: If rule 3 is satisfied, then e arrives at LPG. Otherwise, it arrives at ULPG.
  – Condition 4: If rule 4 is satisfied, then e and the elements similar to e arrive at a new LPG.
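[Figure: decision tree. Elements enter at Condition 1: yes leads to LCG, no leads to PG. From LCG, Condition 2 yields the groups LCGi. From PG, Condition 3 leads to LPG (yes) or ULPG (no). From LPG, Condition 4 yields the groups LPGi.]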
Methods (4)
• Classification rules
  – Rule 1
    • Decide whether an element belongs to group LCG or PG

For each ei ∈ E {
    if (number of elements in SEei == 0) {
        ei is classified into LCG;
    } else {
        ei is classified into PG;
    }
}
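Rule 1 translates almost directly into Python. The sketch below applies it to the example schema with elements A to J from the Methods (1) slide; the `SUB_ELEMENTS` dictionary and the function name `rule1` are illustrative assumptions.

```python
# Sub-elements (SE) of each element in the example schema A..J.
SUB_ELEMENTS = {
    "A": ["B", "C"], "B": ["D"], "C": ["E", "F"],
    "D": ["G", "H"], "E": ["I"], "F": ["I", "J"],
    "G": [], "H": [], "I": [], "J": [],
}


def rule1(sub_elements):
    """Split elements into LCG (no sub-elements) and PG (at least one)."""
    lcg, pg = [], []
    for element, children in sub_elements.items():
        if len(children) == 0:
            lcg.append(element)
        else:
            pg.append(element)
    return lcg, pg


lcg, pg = rule1(SUB_ELEMENTS)
```

Here G, H, I, and J land in LCG and A through F land in PG, matching the lowest-child level of the example tree.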
Methods (5)
• Classification rules
  – Rule 2
    • Classify multiple sets of LCG

p = 0;
For each ei ∈ LCG0 {
    Flag = 0;
    If (p > 0) {
        For q = 1 to p
            If (complexType(ei) == complexType(element in LCGq)) {
                ei is classified into LCGq;
                Flag = 1;
            }
    }
    If (Flag == 0) {
        For each ej ∈ LCG0
            if (complexType(ei) == complexType(ej)) {
                p = p + 1;
                ei and ej are classified into a new group of LCGp;
            }
    }
}
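The grouping in Rule 2 can be sketched in Python with a dictionary keyed by structural signature, which replaces the pairwise scan of the pseudocode with a single pass. This is a swapped-in technique, not the slide's loop. Two further assumptions: lowest children are compared by their attribute names only, and an element with no similar partner stays in LCG0. `ATTRIBUTES` and `rule2` are hypothetical names.

```python
from collections import defaultdict

# Attribute names of the lowest children G, H, I, J in the example schema.
ATTRIBUTES = {
    "G": ("identifier", "name"),
    "H": ("identifier", "name"),
    "I": ("width", "height", "length"),
    "J": ("identifier", "name"),
}


def rule2(lcg0, attributes):
    """Group structurally similar lowest children into LCG_1, LCG_2, ..."""
    by_type = defaultdict(list)
    for element in lcg0:
        # The signature of a lowest child is its set of attribute names.
        by_type[frozenset(attributes[element])].append(element)
    # Assumption: only signatures shared by two or more elements form a group.
    groups = [members for members in by_type.values() if len(members) > 1]
    remaining = [members[0] for members in by_type.values() if len(members) == 1]
    return groups, remaining


groups, remaining = rule2(["G", "H", "I", "J"], ATTRIBUTES)
```

G, H, and J end up in one group that can share a table, while I remains on its own.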
Methods (6)
• Classification rules
  – Rule 3
    • Separate elements in PG into two groups: LPG and ULPG

For each ei ∈ PG {
    if (SEei contains an element of LCG) {
        ei is classified into LPG;
    } else {
        ei is classified into ULPG;
    }
}
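A Python sketch of Rule 3 on the example schema follows. The relation between SE and LCG in the condition is not fully legible on the slide, so the sketch follows the terminology slide's definition of a lowest parent (at least one sub-element is a lowest child) and offers the stricter reading (all sub-elements are lowest children) through a flag. The names `rule3` and `require_all` are assumptions for illustration.

```python
# Sub-elements of the parent group (PG) in the example schema.
SUB_ELEMENTS = {
    "A": ["B", "C"], "B": ["D"], "C": ["E", "F"],
    "D": ["G", "H"], "E": ["I"], "F": ["I", "J"],
}
LCG = ["G", "H", "I", "J"]


def rule3(pg, lcg, sub_elements, require_all=False):
    """Split PG into LPG (parents of lowest children) and ULPG (the rest)."""
    test = all if require_all else any
    lowest = set(lcg)
    lpg, ulpg = [], []
    for element in pg:
        if test(child in lowest for child in sub_elements[element]):
            lpg.append(element)
        else:
            ulpg.append(element)
    return lpg, ulpg


lpg, ulpg = rule3(list(SUB_ELEMENTS), LCG, SUB_ELEMENTS)
lpg_strict, ulpg_strict = rule3(list(SUB_ELEMENTS), LCG, SUB_ELEMENTS, require_all=True)
```

On this example both readings agree: D, E, and F are lowest parents and A, B, and C are upper level parents. They would differ only for an element that mixes lowest children with deeper sub-elements.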
Methods (7)
• Classification rules
  – Rule 4
    • Classify multiple sets of LPG

p = 0;
For each ei ∈ LPG0 {
    Flag = 0;
    If (p > 0) {
        For q = 1 to p
            If (complexType(ei) == complexType(element in LPGq)) {
                ei is classified into LPGq;
                Flag = 1;
            }
    }
    If (Flag == 0) {
        For each ej ∈ LPG0
            if (complexType(ei) == complexType(ej)) {
                p = p + 1;
                ei and ej are classified into a new group of LPGp;
            }
    }
}
Result (1)
• Database design by the proposed decision tree

[Figure: schema before and after classification. Before: the upper level elements, Parent_1 to Parent_N, and Lowest child_1 to Lowest child_N each map to their own table. After: structurally similar lowest parents share one LPG table and structurally similar lowest children share one LCG table, arranged in three levels: upper level parent, lowest parent group, lowest child group.]
Lowest parent group (LPG) table
identifier  Object_type        Parent_id
1354979     Array_assnlist     1256789
4661459     Bioassay_assnlist  1264564
...         ...                ...

Lowest child group (LCG) table
identifier             Name  Object_type   Parent_id
R:A-UMCU-2:h000001_01  null  Array_ref     1354979
R:A-UMCU-2:h000001_02  null  Bioassay_ref  4661459
...                    ...   ...           ...
Result (2)
• Database space complexity

                 Raw schema   Classified schema
Total classes    455          314
Total tables     455          314
Total records    2012         160
Total DB size    710 Kb       27 Kb

• Time complexity

                     Raw schema   Classified schema
Storing time (ms)    1515         528
Loading time (ms)    15890        6362
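The relative savings implied by these figures can be worked out with a short Python calculation. The numbers are copied from the slide; the dictionary keys and the helper `reduction_percent` are illustrative names only.

```python
# Figures reported on the slide for the raw and the classified schema.
RAW = {"tables": 455, "records": 2012, "db_size_kb": 710,
       "storing_ms": 1515, "loading_ms": 15890}
CLASSIFIED = {"tables": 314, "records": 160, "db_size_kb": 27,
              "storing_ms": 528, "loading_ms": 6362}


def reduction_percent(before, after):
    """Relative reduction from `before` to `after`, in percent."""
    return 100.0 * (before - after) / before


reductions = {
    key: round(reduction_percent(RAW[key], CLASSIFIED[key]), 1)
    for key in RAW
}
```

By this calculation the classified schema uses about 31% fewer tables and about 96% less database space, and cuts storing time by about 65% and loading time by about 60%.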
Result (3)
• Reconstructing the XML document

XML under the simplified XML schema:
...
<LPG>
  <LCG identifier="R:A-UMCU-2:h000001_01" name="" objectType="Array_ref" parent_id="1354979"/>
</LPG>
...

XML under the original XML schema:
...
<Array_assnlist>
  <Array_ref identifier="R:A-UMCU-2:h000001_01" name=""/>
  ...
</Array_assnlist>
...

[Figure: reconstruction pipeline with the components relational database (tuples), JAXB (objects), simplified XML schema, XSLT processor, and original XML schema.]
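The renaming step from the simplified form back to the original form can be illustrated in Python. The slides use an XSLT processor for this; the sketch swaps in Python's standard `xml.etree.ElementTree` purely to show the idea. The `PARENT_TYPES` lookup, which maps a parent identifier to its original object type, and the function name `restore` are assumptions, since the simplified fragment on the slide does not show where the parent's type is stored.

```python
import xml.etree.ElementTree as ET

# Simplified fragment as shown on the slide.
SIMPLIFIED = (
    '<LPG><LCG identifier="R:A-UMCU-2:h000001_01" name="" '
    'objectType="Array_ref" parent_id="1354979"/></LPG>'
)
# Hypothetical lookup: parent identifier -> original element name.
PARENT_TYPES = {"1354979": "Array_assnlist"}


def restore(simplified_xml, parent_types):
    """Rename generic LPG/LCG elements back to their original tag names."""
    lpg = ET.fromstring(simplified_xml)
    # This sketch assumes at least one LCG child sharing a single parent.
    first = lpg.find("LCG")
    parent = ET.Element(parent_types[first.get("parent_id")])
    for lcg in lpg.findall("LCG"):
        # objectType carries the original tag of each lowest child.
        child = ET.SubElement(parent, lcg.get("objectType"))
        child.set("identifier", lcg.get("identifier"))
        child.set("name", lcg.get("name"))
    return ET.tostring(parent, encoding="unicode")


restored = restore(SIMPLIFIED, PARENT_TYPES)
```

The bookkeeping attributes objectType and parent_id are consumed during the rename, so the restored fragment carries only identifier and name, as in the original document.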
Conclusions
• Proposed approach
  – Mine elements with structural similarity from the XML Schema for biological information
  – Experimental result
    • Mining the structural similarity of the object model suits microarray data and is more efficient than previous approaches
• Future work
  – Extend the current classification rules to the root, LCG, LPG, and ULPG respectively