From Web Search to Semantic Search -...

Preview:

Citation preview

[1]

X i a n g Z h a n g

x . z h a ng @seu . e d u . c n

h t t p : / / c s e . s e u . e d u . c n / Pe r s ona l Pa g e / x . z h a ng /

From Web Search to Semantic Search

[2]

Content 2

The history of Web and Search

Two faces of Semantic Web

Falcons

Watson

Wolfram Alpha

Knowledge Graph

[3]

As We May Think

"A memex is a device in which an

individual stores all his books,

records, and communications, and

which is mechanized so that it may

be consulted with exceeding speed

and flexibility. It is an enlarged

intimate supplement to his memory."

3

[4]

As We May Think

“Our wearable data collection system lets users collect their experiences into a

continually growing and adapting multimedia diary. The system—called

inSense—uses the patterns in sensor readings from a camera, microphone,

and accelerometers to classify the user's activities and automatically collect

multimedia clips when the user is in an "interesting" situation.”

4

[5]

As We May Think

Memex is internally complex in mechenics

dry photography, microphotography…

photocells, thermionic tubes, cathode ray tubes…

Memex is externally simple for human usage

associative indexing

tying two items together and name it

linking enormous items to form a trail and name it

5

[6]

Building Internet in the Cold War 6

[7]

Building Internet in the Cold War

The idea of package switching (early 1960s)

APARNet project launched (1960s) – US Gov

robust / fault-tolerant / distributed

APARNet with 4 nodes(Univ. Nodes) (1969)

The idea of OSI (Open System Interconnection) (1970s)

TCP/IP (1974-1982)

NSFNet (National Science Foundation) (1986)

APARNet closed (1990)

>1 million hosts in Internet (1992)

7

[8]

Building Internet in the Cold War

why Internet becomes a successful creation

divide-and-conquer

package switching

abstraction

OSI

Scalability / reliability > speed / efficiency

TCP / IP

8

[9]

Ted Nelson and His Hypertext

I don’t like Tim Berners-Lee and his

WWW. Too Simple!!!

“HTML is precisely what we were

trying to PREVENT— ever-breaking

links, links going outward only,

quotes you can't follow to their

origins, no version management, no

rights management.

9

[10]

Ted Nelson and His Hypertext

his first job is a photographer and film editor

“Hypertext” is corned by him in 1963

based on files

data was stored once

no deletion

information was accessible by a link from anywhere

10

[11]

Ted Nelson and His Hypertext

other achievement of Ted

Project Xanadu (1960)

found HTTP Company

produced the first bag for the first laptop

now producing 47% bags for laptop on the world

11

[12]

Doug Engelbart and His…(in 1968) 12

[13]

Doug Engelbart

other achievement of Doug (Tuning Award)

mouse (commercially implemented in LISA 1983, anybody

knows LISA??)

Windows

shared-screen teleconferencing

hypermedia

groupware

13

[14]

LISA是史蒂夫乔布斯女儿的

名字,也被乔布斯用于命名图形界面计算机和其上的操作系统。 没有Lisa就没有Macintosh,在Mac的开发

早期,很多系统软件都是在Lisa上设计的。她具有16位CPU,鼠标,硬盘,以及支

持图形用户界面和多任务的操作系统。并且随机捆绑了7个商用软件。LISA电脑售价$9,935。

14

[15]

Windows 1.0

(1985)

Windows 3.1 (1992)

15

[16]

Ted and Doug

they are both really avant-garde

every technology was in place in 1970s

but it took the PC revolution and widespread

internet to inspire the WWW

the process lasted for almost 20 years!

16

[17]

Apple II and 中华学习机

17

[18]

Sir Tim Berners-Lee and His Web Dream

18

[19]

In March 1989, Tim Berners-Lee

submitted a proposal for an

information management system to

his boss, Mike Sendall. ‘Vague, but

exciting’, were the words that

Sendall wrote on the proposal,

allowing Berners-Lee to continue.

http://info.cern.ch/Proposal.html

19

[20]

Tim Berners-Lee's original World Wide Web browser: 1990

20

[21]

Tim and His Web Dream

Why we use HTML for hypertext representation?

HTML vs. XML

21

[22]

Tim and His Web Dream

HTTP merits

quite simple (get or post)

flexible (content-type)

weak coupling (no connection / no state)

22

[23]

TED Talk: Tim Berners-Lee : The Magna Carta for the Web. (2014)

http://www.ted.com/talks/tim_berners_lee_a_magna_carta_for_the_web

23

[24]

CCTV

Documentary

互联网时代

The Internet Age (2014)

https://movie.douban.co

m/subject/25772505/

24

[25]

Semantic Web: A Glance

25 http://ws.nju.edu.cn/falcons

[26]

Architecture of Semantic Web 26

[27]

Why Semantic Web?

Web Intelligence

Machine-readable

Reasoning

Web of Data

Identification and Interlinks

Programmable Web

27

[28]

Ontology and RDF 28

[29]

Instantiation of Ontology 29

[30] 30

[31]

XML Representation of Ontology 31

[32]

The Logic Face of Semantic Web 32

[33]

Description Logic and Reasoning

33

[34]

The Data Face of Semantic Web

34

[35]

Applications of Semantic Web

Biomedicine

Transportation Engineering

Homeland Security

Software Design

Travel Planning

Job Finding

Online Dating…

35

[36]

TED Talk: Tim Berners-Lee : The Next Web. (2009)

http://www.ted.com/talks/tim_berners_lee_on_the_next_web

36

[37]

TED Talk: Tim Berners-Lee : The year open data went worldwide. (2010) http://www.ted.com/talks/tim_berners_lee_the_year_open_data_went_worldwide

37

[38]

Semantic Web in a Nutshell

Similar but different with

Relational Database

Object-oriented System

Shifted from logic to data

To keep it simple

To grow into large scale

38

[39]

Web Search

1993: Web Crawler

1994: Yahoo

1994: Lycos

1995: Altavista

1998: Google

39

[40]

[41]

Book: How Google Works

https://book.douban.com/subject/26008422/

[42]

Google

Larry Page and Sergey Brin from Stanford

From “What Box”to 10100

Refused by InfoSeek

100,000 dollars from Sun

Yahoo, Excite.com, InfoSeek, Lycos

Marissa Mayer

Went to Yahoo in 2012

42

[43]

The two co-founders

of Google working in

the garage with a

monthly rent of $1700

[44]

Google Search Stats

search per second: 2.3M

Unique searchers per month: 1.17B

Number of pages indexed: 60T

Lines of codes for all Google services: 2B

Alphabet market capitalization: $570B

http://expandedramblings.com/index.php/by-the-numbers-a-gigantic-list-

of-google-stats-and-facts/

[45]

More about Google

Pagerank

Avg. 0.25 response time

45

Google-designed Server

Servers in Container

[47]

Google to Alphabet 47

[48]

23andme.com

By Sergey Brin`s wife, Anne Wojcicki

DNA analysis for $99 ($100M in 2000)

Angelina to avoid her breast cancer

An introduction: http://36kr.com/p/5035340.html

48

[49]

Web Search

Challenge of Web Search

Distributed Data

Volatile Data – changed or dead data

Large Volume

Unstructured and Redundant Data

Data Quality

Data Heterogeneous

49

[50]

Semantic Search

What is Semantic Search

Natural Language or Pseudo-natural Language search

Finding answers (data, facts, information, knowledge)

Exploiting domain knowledge to process search requests

50

[51]

Semantic Search

Different with Web Search

RDF Graphs vs. Bag of Words

Ontology, RDF document, Entity(object) vs. Pages

Ranking

Summarization / Snippet / Recommendation

Similar with Web Search

Keyword-based

51

[52]

Semantic Search 52

Tim

Jack

Mary

Southeast University

Nanjing

hasWife

worksAt

worksAt knows

located

Tim.foaf.xml

What can we get when searching “tim”

[53]

More examples: Who is Bob Marley? From Falcons

[54]

More examples: Who is Bob Marley? From Falcons

[55]

More examples: Who is Bob Marley? From Google and Baidu

[56]

Bob Marley:

No Woman, No Cry

Bob Marley:

Three Little Birds

[57]

More examples: Who is the tallest player in NBA ? From Falcons

[58]

More examples: Who is the tallest player in NBA ? From Google and Baidu

[59] 59

[60] 60

[61] 61

More examples: Who is chris@bizer.de? He knows who? From Falcons

[62] 62

More examples: Who knows Chris Bizer? From Falcons

[63]

Semantic Search

2003: TAP (Guha, McCool, & Miller, 2003)

Earliest semantic web search engine

Keyword query

Return an object and its surrounding subgraph using

label information

Selection is based on popularity, user profile, and search

context

63

[64]

Semantic Search

2005: Swoogle (Ding et al., 2005)

One of the most popular semantic web search engine

Class/property search and ontology search

Pagerank-like ranking algorithm

Provides statistical metadata of results

64

[65] 65

[66] 66

Ontology Search of Swoogle

[67] 67

Document Search of Swoogle

[68]

Term Search of Swoogle

[69] 69

Metadata of foaf:knows in Swoogle

[70]

Semantic Search

2007: SWSE (Harth et al., 2007)

Keyword-based

Pagerank-liked ranking

Combining RDF graph ranks with data source ranks

Filtering results by specifying a class

70

[71]

Semantic Search

2007: Watson (d’Aquin et al., 2007)

Organize resulting objects by documents / ontologies

User can specify the searching scope

Local name

Labels

All literals

71

[72]

Searching Steve Jobs in Watson

[73]

Another Watson

An open-domain QA system

IBM DeepQA Project

MIT: Adaptive View-Based Appearance Model

UT: Reasoning and General Knowledge

USC: Information Extraction and Analysis

RPI: Virtualization Tools

University at Albany: Assurance of capacity for massive system

University of Trento: Self-study, Conversation

University of Massachusetts: Information Retrieval

Carnegie Mellon: QA basic algorithms

73

[74]

How Watson Works

[75]

Another Watson

Feature:

Unstructured text;

5 secs into more than 200 million pages;

Understanding human language, subtle meaning,

paddles, humor, satire…

Determing the certainty of the answer

No help from the WWW and engineers

Beat ken and Brad on Jeopardy in 2011;

75

[76]

Jeopardy 2011

[77]

Ken’s Talk in TEDTalk

[78]

Another Watson

Future

Business Decision-Making

Medical Treatment

78

[79]

Semantic Search

2008: Sindice (Oren et al., 2008)

Property-value pair look-up knowing a property of an

object

Keyword-based RDF document and Microformat search

Not focused on linked objects

79

[80]

Searching documents in Sindice

[81]

Searching documents

using property-value pair

in Sindice

[82]

82

SPARQL Query in Sindice

[83]

Semantic Search

2009: Falcons (Cheng et al., 2009 )

http://iws.seu.edu.cn/services/falcons/

Searching entities / linked objects / RDF documents

Keyword-based

Searching based on Virtual Document

Ranking results based on relevance and popularity

Class-based query refinement

83

[84]

Falcons – Scenario 1 84

“Who is the demo chair of ISWC 2008?”

query with “ISWC2008” AND “demo chair”

[85]

Falcons – Scenario 2 85

“Find relations between

Chris Bizer and Tom Heath”

query with “Chris Bizer”

AND “Tom Heath””

[86]

Falcons – Scenario 2

86

After type

refinement

[87]

Falcons – System Architecture

87

[88]

Falcons- Data Crawling

Seeding

Searching filetype:rdf / filetype:owl in Google with keywords

from Open Directory Project

Sampling from pingthesemanticweb.com / schemaweb.info / …

Sampling from Linked Open Data Project

Crawling

300,000 RDF documents per day

20M RDF documents total (55% well-formed) on August 2008

600M quadruples

88

[89]

Falcons – Data Crawling 89

The distribution of

the number of RDF

documents on pay-

level domains

[90]

Falcons – Constructing Virtual Documents

Virtual Documents of an object

local name

rdfs:label

rdfs:comment

other annotations

Virtual documents of neighbors

90

[91]

Falcons – Virtual Documents 91

VD(eg1:Journal) = ”journal” + ”article” + ”magazine”

[92]

Falcons – Ranking Objects

Query Relevance

Cosine similarity between query and virtual document

Popularity

The occurrence of the object in the document set

92

[93]

Falcons – Other Features

Query-relevant Snippet Generation

PD-Thread

Query Refinement with Class Hierarchies

Filtering the resulting objects with class Restrictions

Discovering implicit typing information by class-inclusion

reasoning

Recommending subclasses for incremental refinement

93

[94]

Wolfram Alpha

www.wolframalpha.com

Computational Knowledge Engine

2009.5.18

Wolfram Alpha: Semantic Search Is

Born – InfoToday

The Knowledge engine behind Siri

94

[95]

Wolfram Alpha

Chat with Siri:

Somebody:我的孩子总喜欢缠着我问我离圣诞节还有几天。

Siri: 让我查查……稍等……我为你找到了答案。

Siri:82 天,也就是 2 个月又 21 天,也就是 11 周零 5 天,

也就是 58 个工作日,也就是 0.22 年。

How Siri knows all these?

95

[96]

[97]

[98]

[99]

[10

0]

[10

1]

[10

2]

[10

3]

[10

4]

[10

5]

[10

6]

106

[10

7]

[10

8]

108

[10

9]

[110]

Wolfram Alpha

WA is computation + knowledge

Holding a vast of knowledge-bases

Using computational linguistics

Steven Wolfram was a quantum physicist!

First version of WA

15,000,000 lines of Mathmatica codes

10,000 CPU

The Technology behind Wolfram|Alpha

110

[111

]

Stephen Wolfram:计算万物的理论

[112]

Knowledge Graph 知识图谱

“The Knowledge Graph is a knowledge

base used by Google to enhance its

search engine's search results with

semantic-search information gathered

from a wide variety of sources.”

– From Wikipedia

112

[11

3]

113

[11

4]

[11

5]

Introducing to Knowledge Graph

[116]

Satori

Adopted for Bing.com by Microsoft

From 2013.12.12 TED讲座:当用户搜索某一个人名,如果这个人曾在TED做过演讲,那么用户将可以直接在搜索结果的

Snapshot pane中播放TED视频。

著名演讲及国歌:不仅仅是TED,其他一些著名人物的演讲也将会直接呈现在搜索结果中,另外也包括

国歌这样的内容,用户同意可以直接在Snapshot pane点击播放。

在线课程:搜索一所大学名称,Bing会在搜索结果中直接展现该学校的热门的在线课程。

科学知识:当用户搜索某个科学名词时,Bing将会突出展示来自维基百科的该词条基本内容。

历史事件:Bing会直接提供历史事件的概要预览,包括简单描述、开始时间、结束时间等。

相关人员:相关人员功能将会在用户搜索某一事件或事物时提供,并会给出此人与用户搜索词语相关的

原因。

动物种类:当用户搜索某个动物时,Bing将会提供更详细的相关品种让用户能够进一步进行查询。比如

搜“老虎”,Bing将会在Snapshot pane展现孟加拉虎、东北虎等种类,用户可以进一步进行搜索

116

[117]

Satori 117

[118]

Graph DB

The fundamental component of most KG

Graph Storage

CRUD

Create a graph

Read a graph

Update a graph

Delete a graph

118

[119]

Graph DB

Recommendation:

Neo4j

Virtuoso

119

[122]

Wikipedia 122

[123]

DBpedia

Extracted from Wikipedia

RDF / SPARQL

125 languages, totally >3 billion triples, all freely downloadable

For English version, 4.58M entities:

1,445,000 Persons

735,000 Places

123,000 Music Albums

87,000 Movies

241,000 Species

6,000 Diseases

123

[124]

YAGO

Wikipedia+WordNet+GeoNames

10M entities + 120M facts

Features

Manually evaluated the accuracy of

95%, every relation is annotated with its

confidence value

Combining the taxonomy of wiki and

WordNet

With temporal and spacial dimension

124

[125]

Freebase

Knowledge Graph behind Google

A Structured Wikipedia

Crowdsourcing

125

[126]

WordNet and BabelNet

126

[127]

WordNet and BabelNet 127

BabelNet = Wikipedia + WordNet

[12

8]

[129]

Thanks http://cse.seu.edu.cn/PersonalPage/x.zhang/teaching.html

129

Recommended