32
BY ERIC CHU, AKANKSHA BAID, TING CHEN, ANHAI DOAN, AND JEFFREY F. NAUGHTON", PROCEEDINGS OF THE 33RD INTERNATIONAL CONFERENCE ON VERY LARGE DATA BASES (VLDB'07), VIENNA, AUSTRIA, SEPTEMBER 2007, 1045- 1056. A Relational Approach to Incrementally Extracting and Querying Structure in Unstructured Data Presentation by Andrew Zitzelberger

A Relational Approach to Incrementally Extracting and Querying Structure in Unstructured Data

  • Upload
    vic

  • View
    21

  • Download
    0

Embed Size (px)

DESCRIPTION

A Relational Approach to Incrementally Extracting and Querying Structure in Unstructured Data. - PowerPoint PPT Presentation

Citation preview

Page 1: A Relational Approach to Incrementally Extracting and Querying Structure in Unstructured Data

BY ERIC CHU, AKANKSHA BAID, TING CHEN, ANHAI DOAN, AND JEFFREY F. NAUGHTON", PROCEEDINGS

OF THE 33RD INTERNATIONAL CONFERENCE ON VERY LARGE DATA BASES (VLDB'07), VIENNA,

AUSTRIA, SEPTEMBER 2007, 1045-1056.

A Relational Approach to Incrementally Extracting and

Querying Structure in Unstructured Data

Presentation by Andrew Zitzelberger

Page 2: A Relational Approach to Incrementally Extracting and Querying Structure in Unstructured Data

Problem

To find information from unstructured data users have to rely on a combination of keyword search, browsing, and possibly predefined search options.

Current search mechanisms do not leverage the potentially rich set of structures embedded in text.

Page 3: A Relational Approach to Incrementally Extracting and Querying Structure in Unstructured Data

Vision

Allow users to load a set of documents without any pre-processing and immediately allow the user to begin making queries.

At first the system provides no benefit above that provided by a traditional Information Retrieval, but information is extracted in the background providing incrementally better search.

Page 4: A Relational Approach to Incrementally Extracting and Querying Structure in Unstructured Data

Contributions

Provide a way to store the evolving set of documents and

structures tools that can be used to query and to incrementally

process the data a way to handle changes in our understanding of the

data set as it is processed

Page 5: A Relational Approach to Incrementally Extracting and Querying Structure in Unstructured Data

Schema and Data Representations

Continuing extraction of heterogeneous structures will gradually lead to sparse data set. Sparse – comprises a large number of attributes, but

most entities have non-null values for only small fraction of all these attributes.

Page 6: A Relational Approach to Incrementally Extracting and Querying Structure in Unstructured Data

Wide Tables

Wide Tables Forego schema design and store all objects in a single

horizontal table using the interpreted storage format. System uses an attribute catalog to record the name,

id, type, and size of each attribute. Each tuple starts with a header with fields such as

relation-id, tuple-id, and record length For each non-null attribute the tuple stores the

attribute’s identifier, length field, and value Attributes that are not in the tuple are implicitly null.

Page 7: A Relational Approach to Incrementally Extracting and Querying Structure in Unstructured Data

Interpreted Schema

Attribute Catalog

Tuples

Page 8: A Relational Approach to Incrementally Extracting and Querying Structure in Unstructured Data

Wide Tables

1NF restriction generally used in databases is removed. A state has many lakes; unreasonable to store them all

in a separate column. Creating a new table for each structure would also

grow too large and a query would require too many joins.

Allow complex attributes Lists, arrays, tables, set of tuples, etc. The administrators must decide how to deal with

complex attributes (user defined functions), the system does not attempt to handle them.

Page 9: A Relational Approach to Incrementally Extracting and Querying Structure in Unstructured Data

Wide Table

Page 10: A Relational Approach to Incrementally Extracting and Querying Structure in Unstructured Data

Mapping Table

Maps concepts together Temperatures in Fahrenheit on one page and Celsius

on another. We want to be able to compare the two. When the system determines that an attribute can

map to another, it rewrites the query appropriately.

Host name column just for presentation, not part of actual system

Page 11: A Relational Approach to Incrementally Extracting and Querying Structure in Unstructured Data

Relationship Table

Records complex structures that comprise multiple attributes Headquarter (city, company)

This table is used to keep track of which attributes belong to which complex structure.

Page 12: A Relational Approach to Incrementally Extracting and Querying Structure in Unstructured Data

Operators

Extract – extracts structureIntegrate – identifying attributes that

correspond to the same real-world conceptCluster – clustering a set of different

attributes based on some similarity structure

Page 13: A Relational Approach to Incrementally Extracting and Querying Structure in Unstructured Data

Operator Requirements

1) Each operator should be able to use different algorithms.

2) Database administrators should be able to specify a scope for the input on which chosen methods operate.

3) Given the output database administrators should be able to specify what they want to do with it.

Page 14: A Relational Approach to Incrementally Extracting and Querying Structure in Unstructured Data

Extract

Two types Detects structure such as entities and relationships from

natural language (DIRPE, Snowball, KnowItAll). Extracts structured data embedded in text of known

format such as LaTex, XML, and wiki markup text.

Output is a set of structures that can be stored in a wide table or run through one of the other operators.

Database administrators need to decide how to apply the extract operator. Can apply to subsets of the documents or columns of

already of extracted information to get a finer granularity.

Page 15: A Relational Approach to Incrementally Extracting and Querying Structure in Unstructured Data

Integrate

Input: a set of structures from the wide table or a previous operator

Output: one or more sets of mappings over attributes that correspond to the same real world concept

Database administrators decide what to do with each set of mappings Store in mapping table Consider collapsing the table (put attributes that map

to one another under a single column)

Page 16: A Relational Approach to Incrementally Extracting and Querying Structure in Unstructured Data

Cluster

Input: a set of documents or a set of attributes

Output: classification of documents or attributes into one or more clusters

Clustering can help database administrators: To know what views to build into the database To consider splitting wide tables into different clusters

Page 17: A Relational Approach to Incrementally Extracting and Querying Structure in Unstructured Data

Operator Interaction

Operators can be combined synergistically in a “whole is greater than the sum of the parts” fashion.

Six possible pairwise combinations of distinct operators: Integrate-Extract Cluster-Extract Extract-Cluster Integrate-Cluster Extract-Integrate Cluster-Integrate

Page 18: A Relational Approach to Incrementally Extracting and Querying Structure in Unstructured Data

Operator Interaction

Integrate-Extract Integrate helps find new targets for extract Example: Integrate finds a mapping between

“address” and “sent-to”, the extractor can then be used on sent-to instances

Cluster-Extract Cluster allows database administrators to find

documents in a specific domain, allowing them to use domain specific extractors for better results than using domain-independent extractors.

Page 19: A Relational Approach to Incrementally Extracting and Querying Structure in Unstructured Data

Operator Interaction

Extract-Cluster The extracted set of structures may provide more

information that Cluster can use to group together documents or attributes.

Example: clustering pages about cities on Wikipedia based on section names misses short pages because they don’t have a section name. However, after extracting the “city info-box” structure in some of the pages, Cluster was able to recognize them and put them in the city cluster.

Page 20: A Relational Approach to Incrementally Extracting and Querying Structure in Unstructured Data

Operator Interaction

Integrate-Cluster Integrate can prevent cluster from creating multiple

clusters where logically a single cluster would be better.

Example: Given a data set with attributes {C#, Company, FirstName, LastName, CustID, Contact, Cname}, Cluster may find two clusters {C#, Cname, FirstName, LastName} and {CustID, Company, Contact}. However, if we run Integrate first, we will have the

mapping {C# = CustID, Cname = Company, FirstName + LastName = Contact} and only end up with one cluster.

Page 21: A Relational Approach to Incrementally Extracting and Querying Structure in Unstructured Data

Operator Interaction

Extract-Integrate The primary tool. Need to extract before integrating.

Cluster-Integrate Cluster can narrow the scope of Integrate in two

ways: 1) Cluster may identify a domain for a set of structures,

and then the database administrators can apply a domain-specific schema matchers.

2) Cluster may identify two overlapping sets of attributes (e.g., {CustID, Cname} and {CustID, Company}), and then database administrators may want to look for possible mappings in the difference between the two sets of attributes (e.g., Cname=Company)

Page 22: A Relational Approach to Incrementally Extracting and Querying Structure in Unstructured Data

Case Study: Wikipedia

Page 23: A Relational Approach to Incrementally Extracting and Querying Structure in Unstructured Data

Case Study: Wikipedia

Desirable Qualities: Contents are embedded in wiki markup text. There are guidelines on how to create and edit a wiki page.

Pages generally have a consistent organizational structure

Downloaded database dump on December 5, 2005 Dump contained 4 million XML files (8.5 GB) Each XML file contained a blob that is the content of the

page and metadata bout the page (page-id, title, revision-id, contributor-username, etc.)

Used a control set of major American Cities (254 files), major universities from the states of Wisconsin, New York, and California (255 files), and top male tennis players on the ATP tour in the “Open Era” (373 files)

Page 24: A Relational Approach to Incrementally Extracting and Querying Structure in Unstructured Data

Incremental Processing

Stage 1: Initial Loading Parsed and loaded the XML files into a single table

which initially had five columns: PageId, PageText, RevisionId, ContributorUserName, and LastModificationDate PageId = page title PageText = content of page

Each page corresponds to a single row in the database.

With the full-text index on PageText, users can already query the documents using the keyword searches, even though data processing has not yet begun.

Page 25: A Relational Approach to Incrementally Extracting and Querying Structure in Unstructured Data

Incremental Processing

Stage 2: Extracting SectionName(text) Extracted SectionName(text) where SectionName represents

the name of a first-level section in the page and text is the content of that section.

Extracting this structure allows for “focused” keyword search. e.g., Search for “World No. 1 tennis player”

Searching over all page text gives us 86 players, including all 23 who have been ranked No. 1 (text may mention players who defeated a No. 1 player, etc.)

Assume that this fact is generally mentioned in the introduction, we search only introduction sections and get 67 players, 21 of which have been No 1 (better precision, worse recall)• In the two that were missed this fact was express as “world number

one” instead of “world No. 1”

Page 26: A Relational Approach to Incrementally Extracting and Querying Structure in Unstructured Data

Incremental Processing

Page 27: A Relational Approach to Incrementally Extracting and Querying Structure in Unstructured Data

Incremental Processing

Stage 3: Extracting info box as a blob Ideally we would want to extract attribute-value pairs,

but initially we store it as a blob to allow focused keyword searches. e.g., if we want to find out which universities have

“cardinal” as their school color, we can pose the keyword query “cardinal” over the university info boxes which return 7 results, 6 of which are correct (one has the mascot “Cardinal Burghy”) If we were to run “cardinal university” over the

PageText column 51 pages are returned.

Page 28: A Relational Approach to Incrementally Extracting and Querying Structure in Unstructured Data

Incremental Processing

Stage 4: Extracting structured data from info boxes and wiki tables Gather attribute-value pairs from info boxes and

tables Have wide tables store pointers to tables containing

the information.

Page 29: A Relational Approach to Incrementally Extracting and Querying Structure in Unstructured Data

Incremental Processing

Page 30: A Relational Approach to Incrementally Extracting and Querying Structure in Unstructured Data

Queries

Find the average temperature for the winter months in Madison, Wisconsin.

Find all tennis players who have been ranked number one (according to the info box).

Find universities who have temperature below 20o F in January.

Page 31: A Relational Approach to Incrementally Extracting and Querying Structure in Unstructured Data

Future Work

Deciding when to split wide tables as more understanding is gained about the concept.

Repository of user defined programs for the operators and handling complex objects.

Questions to Answer: How to handle updates to the unstructured data? How to record the evolution of data? How to help users write queries that exploit the

relational structure? How to optimize queries when they involve attributes

that have many mappings?

Page 32: A Relational Approach to Incrementally Extracting and Querying Structure in Unstructured Data

Analysis

Negatives Requires a lot of work from the database

administrators Not tailored to the average users (SQL queries)

Positives Pages can be queries immediately Highly customizable Allows for focused searches Provide compelling idea for a personal search

assistant