CRL: A Rule Language for Table Analysis and Interpretation*
in Unstructured Tabular Data Integration
Alexey Shigarov
Matrosov Institute for System Dynamics and Control Theory of SB RAS
17th International Conference on
Data Analytics and Management in Data Intensive Domains
Obninsk, Russia
October 13-16, 2015
* This work was financially supported by the Russian Foundation for Basic Research (Grant No. 15-37-20042)
and the Council for grants of the President of the Russian Federation (Scholarship No. SP-3387.2013.5)
Unstructured vs Structured
Unstructured Tabular Data
Arbitrary Tables in ASCII-text, Spreadsheets, PDF Documents, Web-Pages
• For Humans To Understand
• No Explicit Semantics
• We Can Read, Write, and Edit
Structured Data
Relational Databases
• For Computers To Understand
• Formal Data Model (Semantics)
• We Can Query (SQL) and Analyse (DM, OLAP)
From Structured to Unstructured: Easy Way
From Unstructured to Structured: Hard Way
Hard Way Back to Structured Data World
Sources: ASCII-text, Untagged PDF Documents, Image Documents, Spreadsheets, Web Pages, Word Documents
Steps: Table Detection*, Table Recognition*, Table Analysis*, Table Interpretation*
Targets: Canonical Forms, XML, Databases
Tools: OCR, ETL
* Hurst M. Layout and language: Challenges for table understanding on the web // Proc. 1st Int. Workshop on Web Document Analysis. 2001. pp. 27-30
Our purpose
Globally: to automate unstructured tabular data integration
(from Arbitrary Tables in Spreadsheets to Databases)
Currently: to automate table analysis and interpretation
(from Arbitrary Tables in Spreadsheets to Tables in Canonical Form)
Ok, We Have Initially an Arbitrary Tagged Table
We know
• structure (rows, columns, cells)
• style settings (fonts, colors, alignments, etc.)
• textual content
All We Need Is To Recover Semantics
Relationships like entry-label, label-label, label-category*
* Our terminology is inspired by X. Wang's abstract table model [Wang X. Tabular Abstraction, Editing, and Formatting, PhD Thesis. 1996]
When We Know Semantics We Can Generate a Canonical Table
It can be loaded into a database by ETL tools
Challenges on the Hard Way Back
• Too many layouts to create a table
• Anyone can invent a new one
• Messy data
• No guarantee that your tabular data are clean and standardized
• Natural Language
• Table understanding requires knowledge
Our Idea
When
• A table creator (e.g. a company, a government agency, ad-hoc software)
uses a set of rules for table generation
• Tables have similar structure, style, and content
within a set of generating rules
Then
• We can define a set of rules for table analysis and interpretation
• We can use a rule engine to execute these rules
Table Analysis and Interpretation Rules
• Rules can be expressed in
• Drools Rule Language* (DRL)
General-purpose language for expressing production rules in the Drools* rule engine
• Cells Rule Language (CRL)
Our domain-specific language for expressing table analysis and interpretation rules
• Rules can be executed with the Drools* rule engine
*http://drools.org
CRL Rules
Rules map known table data to unknown ones
rule
when
Left hand side defines conditions using available facts (cells, categories)
then
Right hand side defines actions to recover unknown semantics
(entries, labels, categories, entry-label, label-label, label-category)
end
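The when/then structure above can be imitated in plain Python to show the idea of a production rule. This is only an illustrative sketch and not how CRL or Drools works internally; the `Cell` and `Rule` classes, the `run` function, and the sample rule are hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Cell:
    cl: int         # column index of the cell (1-based)
    rt: int         # row index of the cell (1-based)
    text: str = ""  # textual content; an empty string means a blank cell


@dataclass
class Rule:
    when: Callable[[Cell], bool]             # left hand side: condition on a fact
    then: Callable[[Cell], Tuple[str, str]]  # right hand side: recovered item


def run(rules: List[Rule], cells: List[Cell]) -> List[Tuple[str, str]]:
    """Fire each rule once for every cell that satisfies its condition."""
    return [rule.then(c) for rule in rules for c in cells if rule.when(c)]


cells = [Cell(1, 1), Cell(2, 1, "a"), Cell(1, 2, "b"), Cell(2, 2, "1")]

# Hypothetical rule: a non-blank cell in the first row becomes a label.
first_row_labels = Rule(
    when=lambda c: c.rt == 1 and c.text != "",
    then=lambda c: ("label", c.text),
)

recovered = run([first_row_labels], cells)
print(recovered)
```

A real rule engine also matches combinations of several facts and reacts to facts created by other rules; the sketch checks one cell at a time only.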
CRL: Left Hand Side
factType $variable : Java boolean expressions
cell $cell : constraints
entry $entry : constraints
label $label : constraints
category $category : constraints
CRL: Right Hand Side
[Figure: merged cells and split cells]
Cell splitting
To split an n-tile cell into n cells
split $cell
Cell merging
To merge two cells into one
merge $cell1 -> $cell2
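A minimal Python sketch of the two operations, assuming cells are described by the cl, rt, cr, rb coordinates used in the rules on later slides. The `Cell` class is hypothetical, and two policies are assumptions made here: every tile produced by a split keeps a copy of the text, and a merge joins the non-blank texts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cell:
    cl: int    # left column
    rt: int    # top row
    cr: int    # right column
    rb: int    # bottom row
    text: str


def split(cell):
    """Split a cell covering n tiles into n single-tile cells (text is copied)."""
    return [
        Cell(c, r, c, r, cell.text)
        for r in range(cell.rt, cell.rb + 1)
        for c in range(cell.cl, cell.cr + 1)
    ]


def merge(a, b):
    """Merge two cells into one covering their bounding box (non-blank texts joined)."""
    text = " ".join(t for t in (a.text, b.text) if t)
    return Cell(min(a.cl, b.cl), min(a.rt, b.rt), max(a.cr, b.cr), max(a.rb, b.rb), text)


merged = Cell(1, 1, 2, 1, "a1")   # one cell spanning two columns
parts = split(merged)             # two cells, each holding "a1"
```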
CRL: Right Hand Side
Cell marking
set mark @mark -> $cell
where @mark is a word with @ starting character
Using marks in conditions
cell $cell : mark == @mark, constraints
Short form
cell@mark $cell : constraints
CRL: Right Hand Side
Entry creating
Using a cell value
new entry $cell
Using a specified value
new entry value -> $cell
Label creating
Using a cell value
new label $cell
Using a specified value
new label value -> $cell
CRL: Right Hand Side
Label categorizing
To associate a label with a category
set category $category -> $label
Trying to find or create a category with a specified name
set category category_name -> $label
CRL: Right Hand Side
Label associating
set parent label $label1 -> $label2
• Labels can be organized in a tree
• We can build hierarchical categories
• We can build compound label values like label1|label2|…|labelN
CRL: Right Hand Side
Label grouping
group $label1 -> $label2
• A label group constitutes an anonymous category
• We can divide labels into categories without knowing categories
• We can entirely categorize a label group
CRL: Right Hand Side
Entry associating
To associate an entry with a label
add label $label -> $entry
Trying to find or create a label in the category with specified value
add label label_value from $category -> $entry
Trying to find or create a category with specified name
add label label_value from category_name -> $entry
Canonical Form Generation
<entries>={1,2,3,4,5,6,7,8}
<labels>={a1,a11,a12,a2,a21,a22,b1,b2}
<categories>={A,B}
<entry-label pairs>={(1,a11),(1,b1),(2,a12),
(2,b1),(3,a21),(3,b1),(4,a22),(4,b1),(5,a11),
(5,b2),(6,a12),(6,b2),(7,a21),(7,b2),(8,a22),
(8,b2)}
<label-label pairs>={(a11,a1),(a12,a1),
(a21,a2),(a22,a2)}
<label-category pairs>={(a1,A),(a11,A),
(a12,A),(a2,A),(a21,A),(a22,A),(b1,B),(b2,B)}
Canonical form:
DATA  A          B
1     a1 | a11   b1
2     a1 | a12   b1
3     a2 | a21   b1
4     a2 | a22   b1
5     a1 | a11   b2
6     a1 | a12   b2
7     a2 | a21   b2
8     a2 | a22   b2
Source table (column labels belong to category A, row labels to category B):
      a1         a2
      a11  a12   a21  a22
b1    1    2     3    4
b2    5    6     7    8
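The step from the recovered sets to the canonical table can be sketched in Python. The data are copied from the sets on this slide; the function names and the row format are assumptions for illustration and are not taken from the prototype.

```python
entries = [1, 2, 3, 4, 5, 6, 7, 8]
entry_label = [
    (1, "a11"), (1, "b1"), (2, "a12"), (2, "b1"), (3, "a21"), (3, "b1"),
    (4, "a22"), (4, "b1"), (5, "a11"), (5, "b2"), (6, "a12"), (6, "b2"),
    (7, "a21"), (7, "b2"), (8, "a22"), (8, "b2"),
]
label_parent = {"a11": "a1", "a12": "a1", "a21": "a2", "a22": "a2"}  # child -> parent
label_category = {
    "a1": "A", "a11": "A", "a12": "A", "a2": "A", "a21": "A", "a22": "A",
    "b1": "B", "b2": "B",
}
categories = ["A", "B"]


def compound(label):
    """Join a label with its ancestors, root first, e.g. 'a1 | a11'."""
    path = [label]
    while path[-1] in label_parent:
        path.append(label_parent[path[-1]])
    return " | ".join(reversed(path))


def canonical_rows():
    """One row per entry: its value followed by one label per category."""
    rows = []
    for e in entries:
        by_category = {
            label_category[label]: compound(label)
            for ent, label in entry_label
            if ent == e
        }
        rows.append([e] + [by_category[c] for c in categories])
    return rows


rows = canonical_rows()
for row in rows:
    print(row)
```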
Applying CRL: Critical Cells*
[Figure: example table with labels a to l and entries 1 to 7]
* Nagy G. Learning the Characteristics of Critical Cells from Web Tables // In Proc. of the 21st Int. Conf. on Pattern Recognition, Tsukuba, Japan, IEEE Comp. Soc., 2012, pp. 1554-1557
when
cell $cc : cl==1, rt==1, blank
cell $ec : cl>$cc.cr, rt>$cc.rb
then
new entry $ec
-> <entries> = {1,2,3,4,5,6,7}
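The condition of this rule can be read as a filter over cells. Below is a Python sketch on a small hypothetical 3x3 table (not the table from the slide); the `Cell` class and `find_entries` are names introduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cell:
    cl: int    # left column
    rt: int    # top row
    cr: int    # right column
    rb: int    # bottom row
    text: str


def find_entries(cells):
    """Entries are the cells strictly to the right of and below the blank corner cell."""
    corner = next(c for c in cells if c.cl == 1 and c.rt == 1 and c.text == "")
    return [c.text for c in cells if c.cl > corner.cr and c.rt > corner.rb]


# Hypothetical table: blank corner, column labels a and b, row labels c and d.
table = [
    Cell(1, 1, 1, 1, ""), Cell(2, 1, 2, 1, "a"), Cell(3, 1, 3, 1, "b"),
    Cell(1, 2, 1, 2, "c"), Cell(2, 2, 2, 2, "1"), Cell(3, 2, 3, 2, "2"),
    Cell(1, 3, 1, 3, "d"), Cell(2, 3, 2, 3, "3"), Cell(3, 3, 3, 3, "4"),
]
entries = find_entries(table)
```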
Applying CRL: Label Groups
when
cell $cc : cl == 1, rt == 1, blank
cell $clc : cl > $cc.cr, rb <= $cc.rb
then
set mark @ColLabel -> $clc
new label $clc
when
cell@ColLabel $c1
cell@ColLabel $c2 : rt == $c1.rt
then
group $c1.label -> $c2.label
[Figure: example table with labels a to l and entries 1 to 7]
-> <labels>={a,b,c,d,e,f,g,...}
-> <groups>={{a,b},{c,d,e},{f,g},...}
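The effect of the grouping rule, putting column labels that share a row into one group, can be sketched in Python. The row positions below are hypothetical, since the slide table is not reproduced here, and `group_by_row` is a name introduced for illustration.

```python
from collections import defaultdict


def group_by_row(column_labels):
    """column_labels: (label, rt) pairs for cells marked @ColLabel."""
    groups = defaultdict(list)
    for label, rt in column_labels:
        groups[rt].append(label)
    # One group per row, ordered by row index; repeated labels collapse.
    return [set(g) for _, g in sorted(groups.items())]


# Hypothetical positions: a and b in row 1; c, d, e in row 2.
groups = group_by_row([("a", 1), ("b", 1), ("c", 2), ("d", 2), ("e", 2)])
```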
Applying CRL: Row Label Hierarchies
when
cell $c1 : cl==1, $l1 : label
cell $c2 : cl==1, rt>$c1.rt,
indent==$c1.indent+2, $l2 : label
no cells : cl==1, rt>$c1.rt,
rt<$c2.rt, indent==$c1.indent
then
set parent label $c1.label -> $c2.label
-> <label-label pairs> =
{(c1,c),(c11,c1),(c12,c1),(c2,c),
(c21,c2),(d1,d),(d11,d1)}
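The indentation rule can be traced in Python. The indents below are an assumption chosen to be consistent with the label-label pairs listed above (the slide table itself is not available), and `parent_pairs` is a hypothetical helper.

```python
def parent_pairs(rows, step=2):
    """rows: (label, indent) for leftmost-column cells, top to bottom.

    Returns (child, parent) pairs: a cell is a child of an earlier cell when its
    indent is larger by `step` and no cell with the parent's indent lies between.
    """
    pairs = []
    for i, (p_label, p_indent) in enumerate(rows):
        for c_label, c_indent in rows[i + 1:]:
            if c_indent == p_indent:
                break  # a cell with the parent's indent closes the block
            if c_indent == p_indent + step:
                pairs.append((c_label, p_label))
    return pairs


rows = [
    ("c", 0), ("c1", 2), ("c11", 4), ("c12", 4), ("c2", 2), ("c21", 4),
    ("d", 0), ("d1", 2), ("d11", 4),
]
pairs = parent_pairs(rows)
```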
Applying CRL: YAML* Specified categories
Category YAML specification
# category YEAR
name: Year
description: years from 1982 to 2015
constraints:
  - "198[2-9]"
  - "200[1-9]"
  - "201[0-5]"
when
category $c : name == "Year"
label $l : $c.canHaveLabel(value)
then
set category $c -> $l
Category YAML specification
# category COUNTRY_CODE
name: CountryCode
description: ISO 3166 2-letter country codes
labels:
  - AD
  - AE
  - ...
  - ZW
when
category $c : name == "CountryCode"
label $l : $c.hasLabel(value)
then
set category $c -> $l
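A Python sketch of how the two category checks might behave. The categories are written as plain dictionaries instead of being loaded from YAML, and the semantics are assumptions: `can_have_label` requires a full regular-expression match against a constraint, and `has_label` tests membership in the enumerated labels. The function names only mirror `canHaveLabel` and `hasLabel` from the rules.

```python
import re

year = {
    "name": "Year",
    "constraints": ["198[2-9]", "200[1-9]", "201[0-5]"],
}
country_code = {
    "name": "CountryCode",
    "labels": ["AD", "AE", "ZW"],  # shortened; the slide elides the rest of the list
}


def can_have_label(category, value):
    """True if the value fully matches at least one constraint pattern."""
    return any(re.fullmatch(p, value) for p in category.get("constraints", []))


def has_label(category, value):
    """True if the value is one of the enumerated labels."""
    return value in category.get("labels", [])


print(can_have_label(year, "1985"), has_label(country_code, "AE"))
```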
*http://yaml.org
Applying CRL: Category Names
when
cell $cc : cl == 1, rt == 1
cell $c : mark == "@ColLabel"
then
set category token($cc, 0) -> $c.label
A
B    a1  a2  a3
b1   1   2   3
b2   4   5   6
-> <categories> = {A,...}
-> <labels> = {a1,a2,a3,...}
-> <label-category pairs> = {(a1,A),(a2,A),(a3,A),...}
Applying CRL: Multi-Valued Cells
Bilingual Tables
α β
阿爾法 公測
γ 1 2
伽馬 一 二
δ 3 4
三角洲 三 四
when
cell $c : cl==1 || rt==1, !blank
then
new label token($c, 0) -> $c
new label token($c, 1) -> $c
Key=Value Cells
C1 C2 C3
a = 1 b = 2 c = 3
d = 4 e = 5 f = 6
g = 7 h = 8 i = 9
when
cell $c : rt>1
then
new label left($c, '=') -> $c
new entry right($c, '=') -> $c
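The helper functions used in these two rules can be approximated in Python. Their exact CRL semantics are not given on the slide, so the behavior below is an assumption: `token` splits a cell text on whitespace (including line breaks), and `left` and `right` split on the first occurrence of the separator.

```python
def token(text, index):
    """Return the index-th whitespace-separated token of a cell text."""
    return text.split()[index]


def left(text, sep):
    """Part of the text before the first separator."""
    return text.split(sep, 1)[0].strip()


def right(text, sep):
    """Part of the text after the first separator."""
    return text.split(sep, 1)[1].strip()


bilingual = "α\n阿爾法"   # one header cell holding two lines
key_value = "a = 1"       # one body cell holding a key and a value

labels = [token(bilingual, 0), token(bilingual, 1)]
pair = (left(key_value, "="), right(key_value, "="))
```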
Applying CRL: Footnotes
when
cell $footer : onLastRow, $notes : text
entry $e : cell.text matches ".+\\*+",
$ref : extract(cell.text, "\\*+")
then
add label between($notes, $ref, '\n')
from "footnotes" -> $e
     a         b
     c    d    c    d
e    1*   2**  3    4
f    5    6    7    8
g    9    10   11   12
* x
** y
-> <labels>={x,y,...}
-> <categories>={"footnotes",...}
-> <entry-label pairs>={(1,x),(2,y),...}
-> <label-category pairs>={(x,"footnotes"), (y,"footnotes"),...}
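The footnote rule can be imitated in Python with the `re` module. `extract` and `between` are hypothetical stand-ins for the CRL functions of the same name; in particular, `between` is assumed here to look up the note whose marker equals the reference, one note per line.

```python
import re


def extract(text, pattern):
    """Return the first substring matching the pattern, or None."""
    m = re.search(pattern, text)
    return m.group(0) if m else None


def between(text, start, end):
    """Return the note that follows the marker `start`, notes being separated by `end`."""
    for chunk in text.split(end):
        head, _, tail = chunk.strip().partition(" ")
        if head == start:
            return tail.strip()
    return None


notes = "* x\n** y"   # text of the footer cell
pairs = []
for entry_text in ["1*", "2**", "3"]:
    if re.fullmatch(r".+\*+", entry_text):        # entry carries a footnote reference
        ref = extract(entry_text, r"\*+")
        pairs.append((entry_text.rstrip("*"), between(notes, ref, "\n")))
```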
Applying CRL: Colored Tables
when
cell $lc : style.bgColor == "#4f81bd"
cell $ec : style.bgColor == null, rt >= $lc.rt, cl > $lc.cr
no cells : style.bgColor == "#4f81bd", cl > $lc.cr, cr < $ec.cl
then
add label $lc.label -> $ec.entry
[Figure: example colored table with cells l1 to l8, e1 to e9, c1, and c2; cell layout not recoverable]
Prototype of Spreadsheet Data Extraction and Transformation System
Experimental Evaluation
Our purpose is to evaluate the recovery of entries, labels, entry-label and label-label relationships
Dataset
• We use the TANGO dataset (http://tango.byu.edu/data), which
• is a part of the TANGO (Table ANalysis for Generating Ontologies) project (http://tango.byu.edu)
• is intended for testing table interpretation methods
• has 200 arbitrary tables collected from 10 statistical sites in spreadsheet format in 2009
Experimental Evaluation
Layouts of TANGO Tables
[Figure: table regions (category name cells, column label cells, row label cells, entry cells) and the layouts found in TANGO tables: one-row plain, multi-row plain, multi-row hierarchical, one-column plain, multi-column plain, one-column hierarchical, one-column & one-row, multi-column & one-row, multi-column & multi-row. Shares shown: 47.5%, 47%, 5.5%, 100%, 94.5%, 5.5%, 65.5%, 26%, 8.5%; their pairing with the layouts is not recoverable]
We developed two sets of CRL rules to define two table types
• TANGO-200: all tables
• TANGO-SUB: without tables having hierarchical layout in the leftmost column
Experimental Evaluation
Measures
• Recall
• a table is processed successfully when all entries, labels, entry-label pairs, and label-label pairs that are implicitly contained in its source form are explicitly included in its canonical form
• Precision
• a table is processed successfully when all entries, labels, entry-label pairs, and label-label pairs that are explicitly included in its canonical form are implicitly contained in its source form
Process
• Two experts independently compare the source forms of tables with their automatically generated canonical forms
• They judge whether each table is processed successfully in terms of recall and precision
• When they disagree on a table, the final decision is made by a third expert
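The two per-table criteria can be stated as set inclusions. The Python sketch below assumes the reading that recall and precision are then the shares of tables judged successful under each criterion. In the experiment the judgments were made by experts, so the automatic comparison and the three sample tables are purely hypothetical.

```python
def recall_ok(source_items, canonical_items):
    """Everything implied by the source table appears in the canonical form."""
    return source_items <= canonical_items


def precision_ok(source_items, canonical_items):
    """Everything in the canonical form is implied by the source table."""
    return canonical_items <= source_items


def share(flags):
    """Fraction of tables judged successful."""
    return sum(flags) / len(flags)


# Hypothetical tables: (items implied by the source, items in the canonical form).
tables = [
    ({"1", "a", ("1", "a")}, {"1", "a", ("1", "a")}),  # exact match
    ({"1", "a", ("1", "a")}, {"1", "a"}),              # an entry-label pair is missing
    ({"1", "a"}, {"1", "a", "b"}),                     # a spurious label was added
]
recall = share([recall_ok(s, c) for s, c in tables])
precision = share([precision_ok(s, c) for s, c in tables])
```

In this sample both measures equal 2/3: the second table fails the recall criterion and the third fails the precision criterion.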
Experimental Evaluation
Results
Rule Set / Table Type   TANGO-200   TANGO-SUB
Tables                  200         105
Cells                   22757       10893
Rules                   16          13
Recall                  87%         95%
Precision               89%         95%
For TANGO-200
• 33 tables are processed with errors
• 85% of errors arise in the leftmost column with one-column hierarchical layout
• Two main causes:
1) ambiguity among style characteristics
2) hierarchical relationships expressed by natural language only
Comparison with others
Methods for Table Analysis and Interpretation
Fixed Table Types
• Domain-specific: Douglas, 1995; Tijerino, 2005; Embley, 2005; WangJ, 2012
• Domain ontologies
• Taxonomies like ProBase, FreeBase
• Domain-independent: Gatterbauer, 2007; Pivk, 2005, 2006, 2007; Kim, 2008; Chen&Cafarella, 2013, 2014; Embley, 2014; Nagy, 2014
• Spatial, style, and textual data
• Several typical table types
Programmable Table Types
• We are here! 2014, 2015
• Rule language (CRL, DRL)
• Relative cell addressing
• Fixed target schema
• Spatial, style, and textual data
• Hung, 2011
• Spreadsheet-like formula mapping language (TranSheet)
• Absolute cell addressing
• Programmable target schema
• Spatial and textual data
Conclusions
• Our methodology is oriented mainly toward unstructured tabular data integration
• We expect it to be useful when data from a large number of tables belonging to a few table types are required for populating a database
• One set of rules can be suitable for processing a wide range of arbitrary tables with high accuracy
• The experiment demonstrates that narrowing a table type can simplify the rules and increase recall and precision in table canonicalization
Further Work
• Table Layouts
to develop techniques for widely used table features, e.g. for recovering a row label hierarchy in the leftmost column
• Messy Tabular Data
to incorporate data cleansing techniques into table understanding
• Natural Language
to add knowledge, global taxonomies (e.g. FreeBase, DBpedia) and domain ontologies
Supplementary Materials
CRL language specification
Examples of CRL rules
All details of our experiment
http://cells.icc.ru/pub/crl
Source code of our prototype licensed under Apache License 2.0
https://github.com/shigarov/cells-ssdc
Thanks!
This presentation is available on SlideShare.net
http://www.slideshare.net/shig
Alexey Shigarov
http://cells.icc.ru