Integrating Structured & Unstructured Data



Goals

Identify some applications that have a crucial requirement for integration of unstructured and structured data

Identify key technical issues in integrating unstructured and structured data

Identify potential approaches


Definitions (simplified)

Structured object:
– <oid, {<name, value>}>

Unstructured object:
– <oid, {word}>
– <oid, unknown/complex structure>

Semi-structured object:
– <oid, {<name, value>}, {word}>
– <name, value> pairs may be
  • Given (e.g. author, title)
  • Extracted (e.g. Date, Zipcode)
  • Inferred (e.g. Topic)
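The three definitions above can be sketched as plain Python data classes. This is only an illustration: the class names, the `provenance` field, and the sample values are assumptions made for the example, not part of the slides.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class StructuredObject:
    """<oid, {<name, value>}>"""
    oid: str
    attrs: Dict[str, Any] = field(default_factory=dict)


@dataclass
class UnstructuredObject:
    """<oid, {word}>"""
    oid: str
    words: List[str] = field(default_factory=list)


@dataclass
class SemiStructuredObject:
    """<oid, {<name, value>}, {word}>"""
    oid: str
    attrs: Dict[str, Any] = field(default_factory=dict)
    words: List[str] = field(default_factory=list)
    # Hypothetical bookkeeping: how each attribute was obtained
    # ("given", "extracted", or "inferred").
    provenance: Dict[str, str] = field(default_factory=dict)


# Made-up sample object: a paper with one attribute of each kind.
paper = SemiStructuredObject(
    oid="p1",
    attrs={"author": "A. Author", "Date": "2001-05-21", "Topic": "privacy"},
    words="privacy preserving data mining".split(),
    provenance={"author": "given", "Date": "extracted", "Topic": "inferred"},
)
```

Recording provenance per attribute is one possible way to keep given, extracted, and inferred pairs distinguishable at query time.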


Representative Applications

BPI: messages (unstructured)
Web applications: unstructured pages
Corporate portals: DSS involving combination of simulation with database system
News syndication: author etc. + story
Call centers: customer interaction + structured component of complaint
Mail systems/document systems
Tourist information systems
Product catalogs/engineering spec sheets
Patents/chemistry documents
Matching legal documents (with cross citations) with building codes


Key Technical Issues

Query language & data model
– Sharp vs fuzzy / complete vs best-effort
– Boolean vs similarity queries (relationship to “value”)

Integration strategies
– Loose vs. tight coupling

Architectures (many possibilities)
– Search engine into DBMS or DBMS into search engine
– Late & early binding (warehousing vs virtual)
– Integration vs articulation (union vs intersection)

Feature extraction from unstructured data

Role of meta data & integrity constraints

Inconsistency of data sources
– Priority rules for mediation

Management & data organization issues
– Version management, freshness, security

Continuous queries over streams
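To make the "Boolean vs similarity queries" distinction concrete, here is a minimal sketch. The toy documents are invented, and Jaccard overlap stands in for a similarity measure; the slides do not name one.

```python
def tokenize(text):
    """Lower-case whitespace tokenization into a set of words."""
    return set(text.lower().split())


def boolean_query(docs, required_terms):
    """Sharp answer: ids of documents containing every required term."""
    required = {t.lower() for t in required_terms}
    return [oid for oid, text in docs.items() if required <= tokenize(text)]


def similarity_query(docs, query, k=2):
    """Best-effort answer: top-k documents ranked by Jaccard similarity."""
    q = tokenize(query)
    scored = []
    for oid, text in docs.items():
        words = tokenize(text)
        union = q | words
        score = len(q & words) / len(union) if union else 0.0
        scored.append((oid, score))
    # Highest score first; ties broken by oid for a stable order.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:k]


# Made-up documents for illustration only.
DOCS = {
    "d1": "privacy preserving data mining",
    "d2": "data mining for streams",
    "d3": "query optimization",
}

exact = boolean_query(DOCS, ["privacy", "mining"])
ranked = similarity_query(DOCS, "privacy data mining", k=2)
```

The Boolean query either includes a document or excludes it, while the similarity query always returns a ranked list, even when no document matches every term.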


Structured: People(firstname, lastname, company, location)
Semi-structured: Papers(title, {authors}, text)
Unstructured: Reviews

Q1: Reviews of papers by Almaden authors on II
– Search reviews using Join(People.<fn,ln>, Papers.authors).keywords

Q2: Folks in Almaden and Watson working on the same topic
– Join of Papers.text, followed by a join with names in People

Q3: Papers on privacy & data mining by Agarwal in Watson
– Combine ranks of results from People and Papers

Q4: Almaden authors whose papers had negative reviews
– Infer sentiment of a review and interesting joins

Q5: Current research topics in Almaden
– Join People and Papers followed by clustering
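As a rough illustration of Q1, the sketch below joins People with Papers on author name and then uses each matching paper's title as the keywords for a substring search over reviews. All rows are made up, using the paper title as the keyword is an assumption, and the topic restriction ("on II") is left out for brevity.

```python
# Hypothetical sample data following the People / Papers / Reviews shapes.
PEOPLE = [
    {"firstname": "Ann", "lastname": "Lee", "company": "ExampleCorp", "location": "Almaden"},
    {"firstname": "Bob", "lastname": "Roy", "company": "ExampleCorp", "location": "Watson"},
]
PAPERS = [
    {"title": "Federated Query Mediation", "authors": [("Ann", "Lee")],
     "text": "mediation across autonomous sources"},
    {"title": "Stream Sketching", "authors": [("Bob", "Roy")],
     "text": "summaries for continuous queries"},
]
REVIEWS = [
    "Federated Query Mediation is a solid contribution.",
    "Stream Sketching lacks experiments.",
]


def papers_by_location(people, papers, location):
    """Join People.<firstname, lastname> with Papers.authors for one location."""
    names = {(p["firstname"], p["lastname"]) for p in people
             if p["location"] == location}
    return [paper for paper in papers if names & set(paper["authors"])]


def search_reviews(reviews, keywords):
    """Naive keyword search: case-insensitive substring match."""
    needle = keywords.lower()
    return [r for r in reviews if needle in r.lower()]


def q1(people, papers, reviews, location="Almaden"):
    """Reviews of papers written by authors at the given location."""
    hits = []
    for paper in papers_by_location(people, papers, location):
        hits.extend(search_reviews(reviews, paper["title"]))
    return hits


almaden_reviews = q1(PEOPLE, PAPERS, REVIEWS)
```

The structured join narrows the set of papers first, and only then is the unstructured review text searched.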


Combining Scores

DB:
– Aggarwal, Watson, s1
– Agarwal, Almaden, s2
– Agrawal, Almaden, s3

IR:
– Sigmod 00 paper, r2
– PODS 01 papers, r1
– KDD00 paper, r3

[Diagram: Query ("Papers on privacy & data mining by Agarwal in Watson") → Chopper → DB, IR → Combiner → Result]
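One possible combiner is a weighted sum of the DB score and the IR score for each (person, paper) pair. The slides do not specify a combination function, so the weighting, the numeric scores standing in for s1..s3 and r1..r3, and the authorship links below are all assumptions for illustration.

```python
def combine(db_scores, ir_scores, authorship, w_db=0.5):
    """Rank (person, paper) pairs by a weighted sum of DB and IR scores.

    db_scores:  person -> score of the structured match
    ir_scores:  paper  -> score of the text match
    authorship: iterable of (person, paper) links (hypothetical here)
    """
    combined = []
    for person, paper in authorship:
        if person in db_scores and paper in ir_scores:
            score = w_db * db_scores[person] + (1.0 - w_db) * ir_scores[paper]
            combined.append((person, paper, score))
    combined.sort(key=lambda item: -item[2])  # best combined score first
    return combined


# Invented numbers standing in for s1..s3 and r1..r3.
DB_SCORES = {
    ("Aggarwal", "Watson"): 0.9,
    ("Agarwal", "Almaden"): 0.6,
    ("Agrawal", "Almaden"): 0.5,
}
IR_SCORES = {"PODS 01": 0.8, "Sigmod 00": 0.7, "KDD00": 0.4}

# Invented authorship links; the slide does not say who wrote which paper.
AUTHORSHIP = [
    (("Aggarwal", "Watson"), "KDD00"),
    (("Agarwal", "Almaden"), "PODS 01"),
    (("Agrawal", "Almaden"), "Sigmod 00"),
]

ranking = combine(DB_SCORES, IR_SCORES, AUTHORSHIP)
```

With equal weights, a strong text match can outrank the best structured match. Changing `w_db` shifts that balance, which is exactly the tuning question a combiner has to answer.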


Query Processing

[Two diagrams, each with the elements: Query → Chopper & Router → DB, IR → Result]
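A chopper and router can be sketched as splitting a mixed query into a structured part and a keyword part, then sending each part to the matching backend. The query shape (a dict with a `keywords` entry) and the toy backends are assumptions made for this example; the slides show only the boxes.

```python
def chop(query):
    """Split a mixed query into structured predicates and keywords."""
    structured = {k: v for k, v in query.items() if k != "keywords"}
    keywords = list(query.get("keywords", []))
    return structured, keywords


def route(query, db_backend, ir_backend):
    """Send each part of the query to the backend that can answer it."""
    structured, keywords = chop(query)
    db_hits = db_backend(structured) if structured else None
    ir_hits = ir_backend(keywords) if keywords else None
    return db_hits, ir_hits


# Toy backends over made-up data.
PEOPLE = [
    {"lastname": "Agarwal", "location": "Watson"},
    {"lastname": "Agrawal", "location": "Almaden"},
]
PAPERS = {
    "Paper A": "privacy and data mining",
    "Paper B": "query optimization",
}


def db_backend(predicates):
    """Exact-match selection over PEOPLE."""
    return [row for row in PEOPLE
            if all(row.get(k) == v for k, v in predicates.items())]


def ir_backend(keywords):
    """Titles of papers whose text contains every keyword."""
    return [title for title, text in PAPERS.items()
            if all(kw in text for kw in keywords)]


QUERY = {"lastname": "Agarwal", "location": "Watson",
         "keywords": ["privacy", "data mining"]}
db_hits, ir_hits = route(QUERY, db_backend, ir_backend)
```

Here both backends are queried independently. A router could instead feed the result of one backend into the other, which may be the difference between the two diagrams.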


Approaches (1)

Query Languages
– XML-based extensions for queries
  • W3C working group on XQuery considering extension for full text
  • XXL (Weikum), XIRQL (Fuhr)
– Specialized languages for highly structured data (e.g. chemical molecules)?
– Graph-based models & languages (RDF, Protégé – Stanford)
– Extended relational (e.g. SQL/MM)
– Inverse queries on business events
– Reasoning systems
– Statistical approaches (approximate/data mining)


Approaches (2)

Pluses of tight coupling
– Enforcement of ontologies, schemas
– Security, management, query optimization, integrity constraints

Negatives of tight coupling
– Does not address federation issues/autonomy

Pluses of loose coupling
– Flexibility

Negatives of loose coupling

And the dinner bell rings …


Concluding Remarks

We need further discussion on issues and approaches during the rest of the workshop