From queries to answers in the Web Roi Blanco, Sr. Research Scientist Yahoo Labs


Page 1: From Queries to Answers in the Web

From queries to answers in the Web

Roi Blanco, Sr. Research Scientist

Yahoo Labs

Page 2: From Queries to Answers in the Web

Web Search today

Page 3: From Queries to Answers in the Web

Web Search in 2001

Page 4: From Queries to Answers in the Web
Page 5: From Queries to Answers in the Web

Search now answers queries

Page 6: From Queries to Answers in the Web

Answers arrive even before finishing the query!

Page 7: From Queries to Answers in the Web

Mobile shift

              Desktop   Tablet   Mobile
Avg. words    2.73      2.88     3.05
Avg. chars    17.44     18.02    18.93

Song, Ma, Wang, Wang. Exploring and exploiting user search behavior on mobile and tablet devices to improve search relevance. WWW 2013

Mobile categories are less skewed (Image 42%, Adult 23.5%, Navigational 15%) vs. Desktop (Navigational 37%, Image 19.9%, Commerce 7.7%)

There’s also a difference between top-level domains:

Mobile               Desktop
youtube.com          facebook.com
wikipedia.org        yahoo.com
answers.yahoo.com    wikipedia.org
ehow.com             youtube.com
imdb.com             walmart.com

Page 8: From Queries to Answers in the Web

Q&A in search engines?


Fagni, Perego, Silvestri, Orlando. Caching and prefetching query results by exploiting historical usage data. TOIS 2006

Page 9: From Queries to Answers in the Web

The web search perspective

Web search today is really fast, without necessarily being intelligent

› A search engine without any understanding

Trends

› Convergence of search and online media

• End of the 10 blue links

› Personal, social search

• Search over my world

• Search using my profile

› New interfaces

• Contextual, interactive

› Search that anticipates

› Solve tasks not queries

Page 10: From Queries to Answers in the Web

Search is really fast, without necessarily being intelligent

Could Watson explain why the answer is Toronto?

Page 11: From Queries to Answers in the Web

We came to bury the 10 blue links


Meaningless query

Page 12: From Queries to Answers in the Web

We came to bury the 10 blue links

Meaningful query

Page 13: From Queries to Answers in the Web

Facebook is a search engine

Page 14: From Queries to Answers in the Web

Personalized search

Yahoo news feed is a personalized search engine

Page 15: From Queries to Answers in the Web

Search that anticipates


Google Now

Star Trek computer

• Jason Douglas: Structured Data at Google, SemTechBiz SF 2013

Page 16: From Queries to Answers in the Web

Interactive Voice Search

Apple’s Siri

› Question-Answering

• Variety of backend sources including Wolfram Alpha and various Yahoo! services

› Task completion

• E.g. schedule an event

Google Now

Facebook’s M

Page 17: From Queries to Answers in the Web


Facebook’s M

mobile assistant

Page 18: From Queries to Answers in the Web

Semantic Search

Page 19: From Queries to Answers in the Web

Web search by 2009


Large classes of queries are solved to perfection

Improvements in web search are harder and harder to come by

› Relevance models, hyperlink structure and interaction data

› Combination of features using machine learning

› Heavy investment in computational power

• real-time indexing, instant search, datacenters and edge services

Search ranking features

› Text matching (including anchor text)

› Page authority (Pagerank)

› User behavior signals

› Other features: context, history (still not very well understood)

Page 20: From Queries to Answers in the Web

Language issues

› Multiple interpretations

• jaguar

• paris hilton

› Secondary meaning

• george bush (and I mean the beer brewer in Arizona)

› Subjectivity

• reliable digital camera

• paris hilton sexy

› Imprecise or overly precise searches

• jim hendler

Complex needs

› Missing information

• brad pitt zombie

• florida man with 115 guns

• 35 year old computer scientist living in barcelona

› Category queries

• countries in africa

• barcelona nightlife

› Transactional or computational queries

• 120 dollars in euros

• digital camera under 300 dollars

• world temperature in 2020

Poorly solved information needs remain

Many of these queries would not be asked by users, who learned over time what search technology can and cannot do.

Page 21: From Queries to Answers in the Web

Semantic Search: a definition

Semantic search is a retrieval paradigm where

› User intent and resources are represented using semantic models

• Not just symbolic representations

› Semantic models are exploited in the matching and ranking of resources

Often a hybrid of document and data retrieval

› Documents with metadata

• Metadata may be embedded inside the document

• I’m looking for documents that mention countries in Africa.

› Data retrieval

• Structured data, but searchable text fields

• I’m looking for directors, who have directed movies where the synopsis mentions dinosaurs.

Wide range of semantic search systems

› Employ different semantic models, possibly at different steps of the search process and in order to support different tasks
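
The data retrieval example above (directors of movies whose synopsis mentions dinosaurs) mixes a structured lookup with a text match. A minimal Python sketch of that idea, assuming a hypothetical in-memory `MOVIES` table; the synopsis strings are invented for illustration and the function name is not from the talk:

```python
import re

# Hypothetical structured records with one searchable text field ("synopsis").
# The synopsis texts are made up for this example.
MOVIES = [
    {"title": "Jurassic Park", "director": "Steven Spielberg",
     "synopsis": "A theme park of cloned dinosaurs goes wrong."},
    {"title": "Ocean's Twelve", "director": "Steven Soderbergh",
     "synopsis": "A crew plans a series of heists in Europe."},
    {"title": "The Lost World: Jurassic Park", "director": "Steven Spielberg",
     "synopsis": "A team returns to an island of dinosaurs."},
]


def directors_with_synopsis_term(movies, term):
    """Return the directors of movies whose synopsis contains the given word."""
    term = term.lower()
    matching = set()
    for movie in movies:
        # Text side: tokenize the free-text field and test for the keyword.
        tokens = re.findall(r"\w+", movie["synopsis"].lower())
        if term in tokens:
            # Data side: follow the structured "director" field.
            matching.add(movie["director"])
    return sorted(matching)


print(directors_with_synopsis_term(MOVIES, "dinosaurs"))
```

The keyword test runs over free text while the answer is read from a structured field, which is the hybrid the slide describes.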

Page 22: From Queries to Answers in the Web

Semantic Search – a process view

Query Construction

• Keywords

• Forms

• NL

• Formal language

Query Processing

• IR-style matching & ranking

• DB-style precise matching

• KB-style matching & inferences

Result Presentation

• Query visualization

• Document and data presentation

• Summarization

Query Refinement

• Implicit feedback

• Explicit feedback

• Incentives

Document Representation

Knowledge Representation

Semantic Models

Resources

Documents

Page 23: From Queries to Answers in the Web

Yahoo’s Knowledge Graph

[Figure: fragment of the knowledge graph. Entities shown: Chicago Cubs, Chicago, Barack Obama, Carlos Zambrano, Brad Pitt, Angelina Jolie, Steven Soderbergh, George Clooney, Ocean’s Twelve, E/R, Fight Club, Dust Brothers. Edge labels shown: plays for, plays in, lives in, partner, directs, casts in, takes place in, music by, 10% off tickets for.]

Nicolas Torzec: Making knowledge reusable at Yahoo!: a Look at the Yahoo! Knowledge Base (SemTech 2013)
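
A graph like this can be held as subject-predicate-object triples. The sketch below is a hedged Python illustration, not Yahoo's implementation: the pairing of entities with edge labels is an assumption read off the figure's labels, and `match` is a hypothetical helper.

```python
# Triples (subject, predicate, object). Which entities each edge connects is
# assumed from the labels in the figure, not taken from the knowledge base.
TRIPLES = [
    ("Carlos Zambrano", "plays for", "Chicago Cubs"),
    ("Chicago Cubs", "plays in", "Chicago"),
    ("Barack Obama", "lives in", "Chicago"),
    ("Steven Soderbergh", "directs", "Ocean's Twelve"),
    ("Fight Club", "music by", "Dust Brothers"),
]


def match(triples, s=None, p=None, o=None):
    """Return the triples matching a pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


# Everything connected to Chicago, regardless of relation.
for subject, predicate, obj in match(TRIPLES, o="Chicago"):
    print(subject, predicate, obj)
```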

Page 24: From Queries to Answers in the Web

The role of Information Extraction in Semantic Search

Making sense of

› Content

• Web, News, Twitter, email, etc.

› User behavior

• Not just queries, also interaction

› NER, NEC, NEL, Time expressions, topic, event and relation extraction

Mapping to an abstract representation

› Linguistic models

• Taxonomies, thesauri, dictionaries of entity names

• Natural language structures extracted from text, e.g. using dependency parsing

• Inference along linguistic relations, e.g. broader/narrower terms, textual entailment

› Conceptual models

• Ontologies capture entities in the world and their relationships

• Words and phrases in text or records in a database are identified as representations of ontological elements

• Inference along ontological relations, e.g. logical entailment

Page 25: From Queries to Answers in the Web

Linguistic Representations of Text

Pablo Picasso was born in Málaga, Spain.

[Figure: processing layers over the example sentence. At the IR and text layers the sentence appears as unreadable symbols; further layers add Part-of-Speech tagging (tags shown: NNP, VBD, VBN, IN), dependency parsing (with a Root node) and word-sense disambiguation.]

Word-sense disambiguation example: born S#2: (v) give birth, deliver, bear, birth, have (cause to be born) "My wife had twins yesterday!"

Page 26: From Queries to Answers in the Web

Conceptual Representations of Text

Pablo Picasso was born in Málaga, Spain.

[Figure: processing layers over the example sentence. At the IR and text layers the sentence appears as unreadable symbols; NER labels the mentions PER, LOC and LOC, and mapping to an ontology (NED) yields the relations born-in and city-in.]

Page 27: From Queries to Answers in the Web

Document processing

Goal

› Provide a higher level representation of information in some conceptual space

› Conceptual space is different for Semantic Web and NLP based search engines

Limited document understanding in traditional search

› Page structure such as fields, templates

› Understanding of anchors, other HTML elements

› Limited NLP

In Semantic Search, more advanced text processing and/or reliance on explicit metadata

› Information sources are not only text but also databases and web services

Page 28: From Queries to Answers in the Web

Example: microformats and RDFa

Microformat (hCard)

<div class="vcard">
  <a class="email fn" href="mailto:[email protected]">Joe Friday</a>
  <div class="tel">+1-919-555-7878</div>
  <div class="title">Area Administrator, Assistant</div>
</div>

RDFa

<p typeof="contact:Info" about="http://example.org/staff/jo">
  <span property="contact:fn">Jo Smith</span>.
  <span property="contact:title">Web hacker</span> at
  <a rel="contact:org" href="http://example.org"> Example.org </a>.
  You can contact me <a rel="contact:email"
    href="mailto:[email protected]">
  via email </a>.
</p> ...
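
To show how a crawler might read the hCard markup, here is a minimal Python sketch using the standard library's `html.parser`. It is an illustration, not a conforming microformats parser: it only handles the four properties used on the slide, it assumes every opened tag is closed (void elements such as `<br>` would confuse the stack), and the e-mail address in the sample is a placeholder.

```python
from html.parser import HTMLParser

# Sample markup modelled on the slide; the mailto address is a placeholder.
HCARD = """
<div class="vcard">
  <a class="email fn" href="mailto:joe@example.org">Joe Friday</a>
  <div class="tel">+1-919-555-7878</div>
  <div class="title">Area Administrator, Assistant</div>
</div>
"""


class HCardParser(HTMLParser):
    """Collect a few hCard properties from class attributes."""

    PROPERTIES = {"fn", "tel", "title", "email"}

    def __init__(self):
        super().__init__()
        self.card = {}
        self._open = []  # one set of property names per open element

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = set((attrs.get("class") or "").split())
        props = classes & self.PROPERTIES
        self._open.append(props)
        # For a link, the e-mail value comes from the href, not the text.
        href = attrs.get("href") or ""
        if "email" in props and href.startswith("mailto:"):
            self.card["email"] = href[len("mailto:"):]

    def handle_endtag(self, tag):
        if self._open:
            self._open.pop()

    def handle_data(self, data):
        text = data.strip()
        if not text or not self._open:
            return
        # Text content is the value of every other property on the element.
        for prop in self._open[-1] - {"email"}:
            self.card[prop] = text


def parse_hcard(html):
    parser = HCardParser()
    parser.feed(html)
    return parser.card


print(parse_hcard(HCARD))
```

The point of the sketch is that the class names alone carry the semantics, so no language understanding is needed to recover the name, phone number and job title.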

Page 29: From Queries to Answers in the Web

schema.org

Agreement on a shared set of schemas for common types of web content

› Bing, Google, and Yahoo! as initial founders (June, 2011)

• Yandex joins schema.org in Nov, 2011

› Similar in intent to sitemaps.org

• Use a single format to communicate the same information to all three search engines

schema.org covers areas of interest to all search engines

› Business listings (local), creative works (video), recipes, reviews and more

› Microdata, RDFa, JSON-LD syntax

Collaborative effort

› Growing number of 3rd party contributions

› schema.org discussions at [email protected]
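
As an illustration of the JSON-LD syntax mentioned above, the following Python sketch builds schema.org markup for a business listing. The business name, phone number and locality are hypothetical; `LocalBusiness`, `PostalAddress`, `name`, `telephone`, `address` and `addressLocality` are schema.org terms.

```python
import json


def local_business_jsonld(name, telephone, locality):
    """Build a schema.org LocalBusiness description as a JSON-LD dictionary."""
    return {
        "@context": "https://schema.org",
        "@type": "LocalBusiness",
        "name": name,
        "telephone": telephone,
        "address": {
            "@type": "PostalAddress",
            "addressLocality": locality,
        },
    }


# Hypothetical values, for illustration only.
markup = json.dumps(
    local_business_jsonld("Example Cafe", "+1-919-555-7878", "Raleigh"),
    indent=2,
)

# JSON-LD is embedded in a page inside a script element of this type.
script_tag = '<script type="application/ld+json">\n' + markup + "\n</script>"
print(script_tag)
```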

Page 30: From Queries to Answers in the Web

Summary

If we want to…

› Answer queries, not just show links

› Personalize search

› Take context into account

› Anticipate user needs

… we need to understand users, content and the world at large!

Search engines have changed considerably

› Queries have changed

• Users seek more information

• Vertical search (travel, local, images, videos, news, etc.)

• Will move towards a more task-oriented scenario (mobile context shift)

Semantics help tail queries

› Head queries solved mostly by clickthrough data

Page 31: From Queries to Answers in the Web

Semantic Search at Yahoo


Page 32: From Queries to Answers in the Web
Page 33: From Queries to Answers in the Web
Page 34: From Queries to Answers in the Web
Page 35: From Queries to Answers in the Web

Search over graph data

Unstructured or hybrid search over RDF/graph data

› Supporting end-users

• Users who cannot express their need in SPARQL

› Dealing with large-scale data

• Giving up query expressivity for scale

› Dealing with heterogeneity

• Users who are unaware of the schema of the data

• No single schema to the data

– Example: 2.6m classes and 33k properties in Billion Triples 2009

Entity search

› Queries where the user is looking for a single entity named or described in the query

› e.g. kaz vaporizer, hospice of cincinnati, mst3000

Elbassuoni, Blanco. Keyword Search over RDF graphs. CIKM 2011

Blanco, Mika, Vigna. Effective and Efficient entity search in RDF data. ISWC 2011

Page 36: From Queries to Answers in the Web

Application: entity displays in web search

Entity-seeking queries make up 40-50% of the query volume

› Jeffrey Pound, Peter Mika, Hugo Zaragoza: Ad-hoc object retrieval in the web of data. WWW 2010

› Thomas Lin, Patrick Pantel, Michael Gamon, Anitha Kannan, Ariel Fuxman: Active objects: actions for entity-centric search. WWW 2012

› Show a summary of the most likely information-needs

› Including related entities for navigation

› Roi Blanco, Berkant Barla Cambazoglu, Peter Mika, Nicolas Torzec: Entity Recommendations in Web Search. ISWC 2013

Page 37: From Queries to Answers in the Web

Semantic understanding of queries


Entities play an important role

› [Pound et al, WWW 2010], [Lin et al WWW 2012]

› ~70% of queries contain a named entity (entity mention queries)

• brad pitt height

› ~50% of queries have an entity focus (entity seeking queries)

• brad pitt attacked by fans

› ~10% of queries are looking for a class of entities

• brad pitt movies

Entity mention query = <entity> {+ <intent>}

› Intent is typically an additional word or phrase to

• Disambiguate, most often by type e.g. brad pitt actor

• Specify action or aspect e.g. brad pitt net worth, toy story trailer
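
The pattern `<entity> {+ <intent>}` can be sketched as a dictionary lookup that finds the longest known alias in the query and treats the remaining words as the intent. This is a toy Python illustration; the `ALIASES` dictionary and the function name are assumptions, not part of the talk.

```python
# Hypothetical alias dictionary mapping surface forms to entity names.
ALIASES = {
    "brad pitt": "Brad Pitt",
    "toy story": "Toy Story",
    "moneyball": "Moneyball",
}


def split_entity_intent(query, aliases=ALIASES):
    """Split a query into (entity, intent) using the longest alias match.

    Returns (None, query) when no known alias occurs in the query, and
    (entity, None) when the query is just the entity mention.
    """
    tokens = query.lower().split()
    # Try longer spans first so "brad pitt" wins over any shorter alias.
    for length in range(len(tokens), 0, -1):
        for start in range(len(tokens) - length + 1):
            span = " ".join(tokens[start:start + length])
            if span in aliases:
                rest = tokens[:start] + tokens[start + length:]
                return aliases[span], " ".join(rest) or None
    return None, query


for q in ["brad pitt net worth", "toy story trailer", "brad pitt"]:
    print(q, "->", split_entity_intent(q))
```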

Page 38: From Queries to Answers in the Web

oakland as bradd pitt movie moneyball movies.yahoo.com oakland as wikipedia.org

captain america movies.yahoo.com moneyball trailer movies.yahoo.com

money moneyball movies.yahoo.com

moneyball movies.yahoo.com movies.yahoo.com en.wikipedia.org movies.yahoo.com peter brand

peter brand oakland nymag.com moneyball the movie www.imdb.com

moneyball trailer movies.yahoo.com moneyball trailer

brad pitt brad pitt moneyball brad pitt moneyball movie brad pitt moneyball brad pitt moneyball oscar www.imdb.com

relay for life calvert ocunty www.relayforlife.org trailer for moneyball movies.yahoo.com

moneyball.movie-trailer.com

moneyball en.wikipedia.org movies.yahoo.com map of africa www.africaguide.com

money ball movie www.imdb.com money ball movie trailer moneyball.movie-trailer.com

brad pitt new www.zimbio.com www.usaweekend.com www.ivillage.com www.ivillage.com brad pitt

news news.search.yahoo.com moneyball trailer moneyball trailer www.imdb.com www.imdb.com

Patterns in logs are hard to see

Sample of sessions from June, 2011 containing the term “moneyball”

› What are users trying to do?

Page 39: From Queries to Answers in the Web

oakland as bradd pitt movie moneyball trailer movies.yahoo.com oakland as wikipedia.org

Semantic annotations help to generalize…

[Annotations on the query terms: Sports team, Movie, Actor]

Page 40: From Queries to Answers in the Web

… and understand user needs


moneyball trailer

what the user wants to do with it

Movie

Object of the query

Page 41: From Queries to Answers in the Web

Semantic analysis of query logs


Multiple approaches

› Dictionary tagging

• Match entities in a fixed dictionary

• Scalable, high recall, not very precise

› Entity retrieval

• Retrieve from an index of the knowledge base

› Post-retrieval methods

• Annotate a document corpus with entities

• Retrieve documents and aggregate annotations

Applications

› Usage mining

• L. Hollink, P. Mika and R. Blanco. Web Usage Mining with Semantic Analysis. WWW 2013

› Related-entity recommendations

• R. Blanco, B. Cambazoglu, P. Mika, N. Torzec: Entity Recommendations in Web Search. ISWC 2013

Page 42: From Queries to Answers in the Web

Usage mining


Site owners would like to find usage patterns

› Reducing abandonment

› Competitive analysis

Problem: patterns are lost in the data

› 64% of queries are unique within a year

› Even the most frequent patterns have low support

Page 43: From Queries to Answers in the Web

Solving the sparseness problem through annotations


Frequent patterns of annotations are more general and less noisy
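
A small Python sketch of why annotation helps with sparseness: queries that are all distinct as strings collapse into a shared pattern once entity names are replaced by their types. The `ENTITY_TYPES` dictionary is a hypothetical stand-in for a real tagger, and the substring replacement is deliberately naive.

```python
from collections import Counter

# Hypothetical dictionary from entity names to types.
ENTITY_TYPES = {
    "moneyball": "Movie",
    "captain america": "Movie",
    "brad pitt": "Actor",
    "oakland as": "Sports team",
}


def annotate(query, types=ENTITY_TYPES):
    """Replace known entity names in a query with their type label."""
    text = query.lower()
    # Replace longer names first so multi-word names are not split.
    for name in sorted(types, key=len, reverse=True):
        text = text.replace(name, "<" + types[name] + ">")
    return text


queries = [
    "moneyball trailer",
    "captain america trailer",
    "brad pitt moneyball",
    "oakland as",
]

raw_counts = Counter(queries)
pattern_counts = Counter(annotate(q) for q in queries)

print(raw_counts.most_common(1))      # every raw query occurs once
print(pattern_counts.most_common(1))  # the annotated pattern has more support
```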

Page 44: From Queries to Answers in the Web

Two matching approaches

Match by keywords

› Closer to text retrieval

• Match individual keywords

• Score and aggregate

• https://github.com/yahoo/Glimmer/

Match by aliases

› Closer to entity linking

• Find potential mentions of entities (spots) in query

• Score candidates for each spot

[Figure: candidate entities for the spots "brad", "pitt" and "brad pitt"; the candidate types shown are (actor), (boxer), (city) and (lake).]

Page 45: From Queries to Answers in the Web

… back to query understanding


moneyball trailer

what the user wants to do with it

Movie

Object of the query

Page 46: From Queries to Answers in the Web

Fast Entity Linking in Queries

Use aliases to “entity pages” (Wikipedia, IMDB, local, etc.) as source of information for entity-query aliases

Chunk the query into the most likely segmentation

Be fast by avoiding entity-to-entity decisions when scoring

Add context externally using semantic relatedness of keywords and entities

Compression:

› Minimal perfect hashes + Golomb coding

› All Wiki + 1 year of query logs of aliases + 1 year of query sessions w2v model < 3GB

Blanco, Ottaviano, Meij. Fast and space-efficient entity linking for queries. WSDM 2015

Page 47: From Queries to Answers in the Web

Problem definition

Given

› Query $q$ consisting of an ordered list of tokens $t_i$

› Segment $s$ from a segmentation $\mathbf{s}$, one of all possible segmentations $S_q$

› Entity $e$ from a set of candidate entities $\mathbf{e}$, drawn from the complete set $E$

Find

› For all possible segmentations and candidate entities

› Select best entity for segment independently of other segments

Page 48: From Queries to Answers in the Web

Intuitions

Assume: also given annotated collections $c_i$ with segments of text linked to entities from $E$.

Keyphraseness

› How likely is a segment to be an entity mention?

› e.g. how common is “in” (unlinked) vs. “in” (linked) in the text

Commonness

› How likely is it that a linked segment refers to a particular entity?

› e.g. how often does “brad pitt” refer to Brad Pitt (actor) vs. Brad Pitt (boxer)

Page 49: From Queries to Answers in the Web

Ranking function

Probability of the segment generated

by a given collection

Commonness

Keyphraseness
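
The formula itself did not survive on this slide, so the following Python sketch is only one plausible reading of the three labels above, not the paper's exact ranking function: the score of an entity for a segment is taken as a sum over collections of (probability of the collection given the segment) × keyphraseness × commonness, all estimated from link counts. The counts and names are invented for illustration.

```python
from collections import namedtuple

# Hypothetical per-collection statistics for one segment (alias string):
#   total     - how often the string occurs in the collection
#   linked    - how often it occurs as a link (an entity mention)
#   by_entity - how often each entity is the link target
SegmentStats = namedtuple("SegmentStats", "total linked by_entity")

# Toy counts, made up for this example.
COLLECTIONS = {
    "wiki": {
        "brad pitt": SegmentStats(
            1000, 900, {"Brad Pitt (actor)": 880, "Brad Pitt (boxer)": 20}),
        "in": SegmentStats(100000, 10, {"Indiana": 10}),
    },
    "query_log": {
        "brad pitt": SegmentStats(
            500, 450, {"Brad Pitt (actor)": 445, "Brad Pitt (boxer)": 5}),
    },
}


def keyphraseness(stats):
    """Fraction of the segment's occurrences that are entity mentions."""
    return stats.linked / stats.total if stats.total else 0.0


def commonness(stats, entity):
    """Fraction of the segment's links that point to this entity."""
    return stats.by_entity.get(entity, 0) / stats.linked if stats.linked else 0.0


def score(entity, segment, collections=COLLECTIONS):
    """Sum over collections of P(collection | segment) * keyphraseness * commonness."""
    present = [col[segment] for col in collections.values() if segment in col]
    mass = sum(stats.total for stats in present)
    if mass == 0:
        return 0.0
    return sum(
        (stats.total / mass) * keyphraseness(stats) * commonness(stats, entity)
        for stats in present
    )


def best_entity(segment, collections=COLLECTIONS):
    """Pick the highest-scoring candidate entity for a segment, or None."""
    candidates = {
        entity
        for col in collections.values() if segment in col
        for entity in col[segment].by_entity
    }
    if not candidates:
        return None
    return max(candidates, key=lambda e: score(e, segment, collections))


print(best_entity("brad pitt"), score("Brad Pitt (actor)", "brad pitt"))
print(best_entity("in"), score("Indiana", "in"))
```

With these toy counts "brad pitt" scores high because it is almost always a link and almost always points to the actor, while "in" scores near zero because it is rarely a link at all.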

Page 50: From Queries to Answers in the Web

Context-aware extension

Estimated by word2vec representation

Probability of segment and query are independent of each other given the entity

Probability of segment and query are independent of each other
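
The context term compares an entity with the other words of the query in an embedding space. As a hedged illustration (the exact estimator is not recoverable from the slide), the Python sketch below averages cosine similarities between an entity vector and the vectors of the context words. The three-dimensional vectors are invented; a real system would load embeddings trained with word2vec.

```python
import math

# Toy vectors, made up for this example.
VECTORS = {
    "Brad Pitt (actor)": [0.9, 0.1, 0.0],
    "Brad Pitt (boxer)": [0.1, 0.9, 0.0],
    "movie": [1.0, 0.0, 0.1],
    "boxing": [0.0, 1.0, 0.1],
}


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def context_score(entity, context_words, vectors=VECTORS):
    """Average similarity between an entity and the known context words."""
    if entity not in vectors:
        return 0.0
    sims = [
        cosine(vectors[entity], vectors[word])
        for word in context_words
        if word in vectors
    ]
    return sum(sims) / len(sims) if sims else 0.0


# For the query "brad pitt movie" the context word favours the actor.
print(context_score("Brad Pitt (actor)", ["movie"]))
print(context_score("Brad Pitt (boxer)", ["movie"]))
```

Because the context score is computed separately from the alias statistics, it can be multiplied into or added to a per-segment score without any entity-to-entity comparison.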

Page 51: From Queries to Answers in the Web

Results: effectiveness


Significant improvement over external baselines and internal system

› Measured on public Webscope dataset Yahoo Search Query Log to Entities

Search over Bing, top Wikipedia result

State-of-the-art in literature

A trivial search engine over Wikipedia

Our method: Fast Entity Linker (FEL)

FEL + context

Page 52: From Queries to Answers in the Web

Results: efficiency

Two orders of magnitude faster than state-of-the-art

› Simplifying assumptions at scoring time

› Adding context independently

› Dynamic pruning

Small memory footprint

› Compression techniques, e.g. 10x reduction in word2vec storage

Page 53: From Queries to Answers in the Web

Wrap-up


Page 54: From Queries to Answers in the Web

Mobile search challenges and opportunities


Interaction

› Question-answering

› Support for interactive retrieval

› Spoken-language access

› Task completion

Contextualization

› Personalization

› Geo

› Context (work/home/travel)

• Try getaviate.com

Page 55: From Queries to Answers in the Web

Task completion


We would like to help our users in task completion

› But we have trained our users to talk in nouns

• Retrieval performance decreases by adding verbs to queries

› We need to understand what the available actions are

Modeling actions

› Understand what actions can be taken on a page

› Help users in mapping their query to potential actions

› Applications in web search, email etc.

Schema.org v1.2 including Actions published April 16, 2014

Page 56: From Queries to Answers in the Web

The end

Many thanks to the Semantic Search team in London

› Peter Mika

› Edgar Meij

› Hugues Bouchard

Joint work with many collaborators: Sebastiano Vigna, Laura Hollink, Giuseppe Ottaviano, Nicolas Torzec, among others.

[email protected]