
Data-oriented Content Query System: Searching for Data into Text on the Web

Mianwei Zhou, Tao Cheng, Kevin Chen-Chuan Chang
WSDM 2010, New York, USA

Many Web applications try to exploit the “content” of Web pages.

• Web Info Extraction
• Typed Entity Search
• Web-based Q/A

In most cases, what we really want are not pages, but the information units inside.


Specialized Information Extractors

Web Information Extraction (WIE) (Marius 2006, Cafarella 2005, Etzioni 2004)

Pattern: “X is CEO of Y”

Company    CEO
Google     Eric Schmidt
IBM        S. Palmisano
…          …

Limitation
• Focus on simple patterns.
• Lack of interactivity.
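As a rough illustration of the pattern-based extraction described above, the sketch below renders “X is CEO of Y” as a regular expression and fills a Company/CEO table. The regex, the function name `extract_ceos`, and the two input sentences are assumptions made for this example; they are not taken from the cited WIE systems.

```python
import re

# Hypothetical regex rendering of the pattern "X is CEO of Y".
# X: one or more capitalized tokens (initials such as "S." allowed); Y: one capitalized token.
CEO_PATTERN = re.compile(
    r"(?P<ceo>[A-Z][\w.]*(?: [A-Z][\w.]*)*) is CEO of (?P<company>[A-Z]\w*)"
)

def extract_ceos(texts):
    """Return (company, ceo) rows for every pattern match in the given texts."""
    rows = []
    for text in texts:
        for match in CEO_PATTERN.finditer(text):
            rows.append((match.group("company"), match.group("ceo")))
    return rows

# Illustrative input sentences (made up for this sketch).
rows = extract_ceos([
    "Eric Schmidt is CEO of Google.",
    "S. Palmisano is CEO of IBM.",
])
for company, ceo in rows:
    print(f"{company:10} {ceo}")
```

A fixed pattern like this is cheap to run over a large corpus, but it only captures one phrasing and cannot be changed at query time, which is the limitation the slide points out.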

Web-based Question Answering (WQA) (Wu 2007, Lin 2003, Brill 2002)

Who is CEO of Dell? → Keywords: “CEO Dell” → Parse top-k results → Michael Dell

Limitation
• Relies only on the top-k pages to retrieve the answer.

Typed-Entity Search (TES) (Cheng 2007, Cafarella 2007, Chakrabarti 2006)

Amazon Phone → Ranked Entity List (scores 0.90, 0.80, 0.60, …)

But … where is CEO?

Limitation
• Limited number of data types
• Lack of flexibility

Data-oriented Content Query System

• Web Info Extraction
• Typed Entity Search
• Web-based QA

Requirements
1. Extensible Data Types
2. Flexible Contextual Patterns
3. Customizable Scoring

Input: CQL (Content Query Language)

Output

Entity Search

Web QA

Data-oriented Content Query System

http://wisdm.cs.uiuc.edu/demos/docqs

What we need Relational Model

Person
Organization
Location
Number

What we need Relational Model

Find the population of China

FROM #number
WHERE pattern(…)
GROUP BY #number
ORDER BY conf()

China has a population of 1.3 billion

China with its population of 1.3 billion people

China is established in 1949.

Shanghai is the largest city with 15 million inhabitants in China

1.3 billion 15 million

1. 1.3 billion

2. 15 million

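The example above can be traced with a small sketch: match a pattern against each sentence, group the matched #number instances, and order the groups by confidence. The slides elide the actual pattern (`pattern(…)`) and do not define `conf()`, so everything specific here is an assumption: the `NUMBER` recognizer, the context keywords, and the use of the match count as the confidence score.

```python
import re
from collections import Counter

SENTENCES = [
    "China has a population of 1.3 billion",
    "China with its population of 1.3 billion people",
    "China is established in 1949.",
    "Shanghai is the largest city with 15 million inhabitants in China",
]

# Assumed recognizer for the #number data type: a numeral followed by a magnitude word.
NUMBER = re.compile(r"\b\d+(?:\.\d+)?\s(?:million|billion)\b")
# Assumed context keywords standing in for the elided pattern(...).
CONTEXT = ("population", "inhabitants")

def matches(sentence):
    """Return the #number instances in a sentence that satisfies the assumed pattern."""
    lowered = sentence.lower()
    if "china" not in lowered or not any(word in lowered for word in CONTEXT):
        return []
    return NUMBER.findall(sentence)

def run_query(sentences):
    """FROM #number WHERE pattern GROUP BY #number ORDER BY conf()."""
    groups = Counter()
    for sentence in sentences:
        groups.update(matches(sentence))  # GROUP BY #number
    return groups.most_common()           # ORDER BY conf(), here the match count

ranked = run_query(SENTENCES)
for rank, (value, support) in enumerate(ranked, start=1):
    print(f"{rank}. {value} (supporting sentences: {support})")
```

Under these assumptions "1.3 billion" is supported by two sentences and "15 million" by one, and the sentence about 1949 contributes nothing, which reproduces the ranking on the slide.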

What we need Relational Model

Table      View
Number     Population, Phone, Price
Location   Capital, Headquarter
Person     Professor, CEO, President
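One way to read the table/view analogy is that a base data type recognizes raw instances, while a view narrows a base type to instances that appear in a particular context. The sketch below follows that reading. The class names `DataType` and `View`, the keyword lists, and the number regex are hypothetical and are not the system's actual definitions.

```python
import re
from dataclasses import dataclass

@dataclass(frozen=True)
class DataType:
    """A base type ("table"): a name plus a recognizer for its instances."""
    name: str
    recognizer: re.Pattern

@dataclass(frozen=True)
class View:
    """A view over a base type: keeps instances only when a context keyword is present."""
    name: str
    base: DataType
    keywords: tuple

    def extract(self, sentence):
        lowered = sentence.lower()
        if not any(keyword in lowered for keyword in self.keywords):
            return []
        return self.base.recognizer.findall(sentence)

# Assumed recognizer: a numeral with an optional magnitude word.
NUMBER = DataType("number", re.compile(r"\d+(?:\.\d+)?(?:\s(?:million|billion))?"))

# Two hypothetical views defined over the same base type.
population = View("population", NUMBER, ("population",))
price = View("price", NUMBER, ("price", "costs"))

print(population.extract("China has a population of 1.3 billion"))
print(price.extract("China has a population of 1.3 billion"))
print(price.extract("The price is 299 dollars"))
```

Adding a new view needs only a new keyword list over an existing base type, which is one way the "Extensible Data Types" requirement could be met.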

INPUT: SELECT … FROM … WHERE …

OUTPUT

Parsing Layer
Index Selection Module
Execution Tree
Index Layer

Data Type Repository
Data Type Definition

Index Design
Special Inverted Index
• Contextual Index
• Join Index

Query Optimization
Graph Coverage Problem
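The slides name the index structures without describing them, so the following is only a generic positional inverted index, not the Contextual Index or Join Index themselves. It shows the basic idea of storing postings for keywords and for data type instances side by side, then combining them by proximity. The function names, the character-offset window, and the number regex are assumptions.

```python
import re
from collections import defaultdict

# Assumed recognizer for #number instances.
NUMBER = re.compile(r"\d+(?:\.\d+)?(?:\s(?:million|billion))?")

def build_indexes(docs):
    """Build keyword postings (word -> doc id -> offsets) and type postings (doc id -> [(offset, value)])."""
    keyword_index = defaultdict(lambda: defaultdict(list))
    type_index = defaultdict(list)
    for doc_id, text in enumerate(docs):
        for match in re.finditer(r"[A-Za-z]+", text):
            keyword_index[match.group().lower()][doc_id].append(match.start())
        for match in NUMBER.finditer(text):
            type_index[doc_id].append((match.start(), match.group()))
    return keyword_index, type_index

def numbers_near(keyword, keyword_index, type_index, window=30):
    """Return (doc id, number) pairs where a number lies within `window` characters of the keyword."""
    hits = []
    for doc_id, offsets in keyword_index.get(keyword, {}).items():
        for position, value in type_index.get(doc_id, []):
            if any(abs(position - offset) <= window for offset in offsets):
                hits.append((doc_id, value))
    return hits

docs = [
    "China has a population of 1.3 billion",
    "China with its population of 1.3 billion people",
    "China is established in 1949.",
    "Shanghai is the largest city with 15 million inhabitants in China",
]
keyword_index, type_index = build_indexes(docs)
hits = numbers_near("population", keyword_index, type_index)
print(hits)
```

Only documents that contain the keyword are visited, so the type postings for the other documents are never scanned.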

Experimental Result
• Speed improvement: 6-10 times
• Space overhead: around 2 times the original corpus size

Data-oriented Content Query System

• Web Info Extraction
• Typed Entity Search
• Web-based Q/A