Web Behavior Analysis

Your Last Words? (in 22nd century)

• To family
• To your best friend?

Web Behavior Analysis

• Why important?
• Why scary?

Part I: Why Important?

Q. In the past six months have you used a search engine to help inform your decisions for the following tasks?

66% of people are using search more frequently to make decisions

• We rely more and more on search for our real-life decisions
– Opportunities for business
– Concerns for privacy

Length of Sessions by Type

What should be done?

• Focus on new territory

Taxonomy of Web queries

• Navigational (we are good at this)
– To reach a particular site
– E.g., searching for the top page of a company
• Informational
– To acquire pages that provide knowledge for the user’s information need
– Conventional ad hoc retrieval
• Transactional
– To perform a Web-mediated activity
– E.g., online shopping

Navigational Queries / Pseudo-Navigational Queries

Example: Good and Bad

• Car GPS around $300
• Four-day trip to Bhutan from Delhi to visit important Buddhist places

Example of “Hard Queries”: Informational/Transactional

Game Consoles

Party Site

What we want?

Current research directions

• How to classify queries?
• Then what?
– Search engines trying to reduce clicks for “hard queries”
– Extracting info from forums

Importance of query classification: “obama”

• Informational: people may search to know more about Barack Obama
• Navigational: the user wants to visit his official website
• Transactional: perhaps the user's goal is to donate money online to support Mr. Obama’s campaign

Yahoo numbers

• ~25 informational: content text?
• ~40 navigational: anchor text?
• ~35 transactional: site template?

Can you tell if query is “navigational” or not?

Lee et al. [WWW05]: Overview

• Analyzing how query term is used in anchor texts

[Figure: anchor texts matching the query and the pages they link to]

• Q = “WWW2008”: anchors link to the top page of WWW2008. Destinations are identical → Navigational
• Q = “search”: anchors link to a search engine, a description in Wikipedia, etc. Destinations are diverse → Informational

Anchor-link distribution (ALD)

• $P(d \mid t)$: probability that the page linked by anchor text t is d
• t = WWW2008: destination is the top page of WWW2008; ALD is skewed → Navigational
• t = search: destinations include Google, Yahoo!, Wikipedia; ALD is uniform → Informational

Lee et al.: Problem

• Targets only anchor texts that are exactly the same as the query
– If the same anchor text as the query does not exist on the Web, ALD cannot be computed
• Problematic queries
– Long phrases, e.g., “information retrieval system research”
– Multiple keywords, e.g., “trec, nist, test collection”

Multi-query solution

• Query Q = “trec, test collection”
• Terms T = {trec, test, collection}
• Destinations D = {d1, d2, …}
• Compute the ALD $P(d \mid t)$ on a term-by-term basis (t = trec, t = test, t = collection) and integrate them

Computation of classification score

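• Entropy of D

$$H(D \mid T) = -\sum_{t \in T} P(t) \sum_{d \in D} P(d \mid t) \log P(d \mid t)$$

– Inner sum over d (with the leading minus sign): entropy of a single term t
– Outer sum over t, weighted by P(t): weighted average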
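A minimal sketch of the score above, assuming anchor data is available as (anchor term, destination) pairs. The slides do not say how P(t) is chosen or which log base is used, so uniform term weights and base-2 logs are assumptions here; the function names and the toy destinations d1, d2, d3 are hypothetical.

```python
import math
from collections import Counter, defaultdict

def anchor_link_distribution(anchor_links):
    """Estimate P(d | t) from (anchor term t, destination d) pairs."""
    counts = defaultdict(Counter)
    for term, dest in anchor_links:
        counts[term][dest] += 1
    return {
        term: {dest: n / sum(c.values()) for dest, n in c.items()}
        for term, c in counts.items()
    }

def term_entropy(dist):
    """Entropy (in bits) of one term's destination distribution."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def classification_score(query_terms, ald, term_weights=None):
    """H(D | T): weighted average of per-term entropies.

    Terms with no anchor text are skipped; returns None if none remain.
    term_weights stands in for P(t); uniform weights are an assumption.
    """
    terms = [t for t in query_terms if t in ald]
    if not terms:
        return None
    if term_weights is None:
        term_weights = {t: 1.0 for t in terms}
    total = sum(term_weights[t] for t in terms)
    return sum(term_weights[t] / total * term_entropy(ald[t]) for t in terms)

# Toy anchor data (hypothetical): "www2008" always links to d1,
# while "search" links to three different destinations equally often.
TOY_LINKS = [("www2008", "d1")] * 3 + [
    ("search", "d1"), ("search", "d2"), ("search", "d3"),
]
TOY_ALD = anchor_link_distribution(TOY_LINKS)
```

A low score (skewed ALD) suggests a navigational query and a high score (uniform ALD) an informational one; where to put the threshold is left open here.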

Now what?

• For “WWII”
– Google: http://www.google.com/search?q=WWII&hl=en&tbo=1&output=search&tbs=ww:1
– Microsoft: http://www.bing.com/reference/semhtml/World_War_II?fwd=1&qpvt=wwii&src=abop&q=wwii
– Wolfram: http://www.wolframalpha.com/input/?i=wwII
• Can you tell informational vs. transactional?

Challenges/Opportunities

• Slightly subtle/interleaved
• But huge advertisement revenue (yet to be explored)!
• Classic query log + clicks on the surface web are not enough
• Any ideas?

More signals?

• Eye movement?
• Brain signal?

More corpus? (social corpus for polls? expert advice?)

More signal

CS: Client Simple

• First representation:
– Trajectory length
– Horizontal range
– Vertical range
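The three Client Simple features can be sketched as follows, assuming the mouse trajectory is recorded as an ordered list of (x, y) cursor positions. The function name and the returned keys are hypothetical; the slides only name the features.

```python
import math

def client_simple_features(points):
    """Compute the three CS features from ordered (x, y) cursor positions."""
    if not points:
        return {"trajectory_length": 0.0,
                "horizontal_range": 0.0,
                "vertical_range": 0.0}
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    # Sum of straight-line distances between consecutive positions.
    length = sum(
        math.hypot(bx - ax, by - ay)
        for (ax, ay), (bx, by) in zip(points, points[1:])
    )
    return {
        "trajectory_length": length,
        "horizontal_range": float(max(xs) - min(xs)),
        "vertical_range": float(max(ys) - min(ys)),
    }
```

For example, the path (0, 0) → (3, 4) → (3, 10) has trajectory length 11, horizontal range 3, and vertical range 10.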

CF: Client Full

• Second representation:
– 5 segments: initial, early, middle, late, and end
– Each segment: speed, acceleration, rotation, slope, etc.

Navigational query: “facebook”

Informational query: “spanish wine”

Transactional query: “integrator”

More corpus

• cQA successful, as “additional corpus”, not as “additional means”

• Challenges?

cQA (Yahoo Answers)

How Yahoo Answers works

Good questions draw good answers

Good Q/A? -- Clicks

Good Q/As? -- Community

Why scary?

Useful beyond imagination

• Spell checker: SIGMOD → Did you mean “sigmoid”?
• Entity relation: SIGMOD ~ SIGIR
• Translation: SIGMOD, 씨그모드 (“SIGMOD” written in Korean) → sigmod.com
• Query suggestion: 영일대 → 호텔 영일대 (Yeongildae → Hotel Yeongildae)
• Rank learning: a top-10 entry is visited all the time; what should we do?
• Cause of migraine?

Companies need YOUR HELP

• AOL released logs
• Guess what happened?

More scientific observations (Yahoo Research)

• X = {query1, query2, query3}
• Y = age, gender, area
• X → Y (how likely?)
• Validate with ground-truth info (Yahoo account)

See if you can do it?

• You observe yourself:

http://aolpsycho.com/user/5826-kallemeyn

Gender

• Female: fanfiction, bridal, makeup, women’s, knitting, hair, ecards, glitter, yoga, and diet

• Male: nfl, poker, espn, ufc, railroad, prostate, football, golf, male, wrestling, compusa, as well as a variety of adult terms

Accuracy: 80+%

Age

• YOUNG: myspace, pregnancy, wikipedia, lyrics, quotes, apartments, torrent, baby, wedding, mall, soundtrack;

• OLD: aarp, telephone, lottery, amazon.com, retirement, funeral, senior, mapquest, medicare, newspapers, repair.

Place

• A user’s zip code (US postal code) or other identifier of location may be detectable from place names used in

• Check out YahooGEO Apis

Name?

• 50+% issued their name
• (but other names too)

Ref: "Vanity Fair: Privacy in Querylog Bundles"

User Solutions?

• TrackMeNot (TMN): an extension to the Firefox web browser that initiates randomized search queries in the background to a number of commercial search engines
• Tor: change IP/cookie (prevents aggregation)
– Losing services, e.g., personalization

Company Solution

• K-anonymity (bundling)
– Reported to be unsafe for vanity search + geo-query and for long-tail keywords

[so far, it is considered to be TOO RISKY]
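The slides do not specify the bundling scheme. As a rough sketch, one common reading of k-anonymity for query logs is assumed here: a query is released only if at least k distinct users issued it. The function name and the (user_id, query) log format are hypothetical.

```python
from collections import defaultdict

def k_anonymous_queries(log, k):
    """Return the queries issued by at least k distinct users.

    log: iterable of (user_id, query) pairs.
    Queries are lowercased and whitespace-normalized before counting.
    """
    users_per_query = defaultdict(set)
    for user_id, query in log:
        normalized = " ".join(query.lower().split())
        users_per_query[normalized].add(user_id)
    return {q for q, users in users_per_query.items() if len(users) >= k}
```

Under this reading, a rare query repeated many times by one user is still withheld, because distinct users are counted rather than occurrences. It says nothing about combinations of released queries, which is where the vanity-search and geo-query risk noted above comes in.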

Summary

• You are leaving trails in the cyber world, which align more and more with real-life trails
• Companies are interested in predicting as much as possible about your next behavior
• More signals? More corpus?
• Can you hide enough to protect privacy while revealing enough to enable such prediction? (privacy dilemma)

• But it is ok even if we can’t know (product state-of-the-art)

Search UI? Visualization?

What are query aspects?

Challenge

• Intentions are hidden
– Omission of key information makes intent in queries ambiguous
– E.g., omitting “reviews” when searching for reviews of “Canon EOS 40D SLR”
– E.g., omitting “location/city” when searching for “jobs”
• Queries are often too broad

Goal

• Mine broad latent aspects from search logs
– Formulate the problem based on a real-world model of user interaction with the search engine (session = 10 mins)
– Bring interesting aspects to the attention of editors, who can then determine saliency and usefulness

User interaction model

• User enters original query “Canon EOS 40D”
• Without query aspects: search engine (SE) returns general results; user reformulates query by adding qualifier “reviews”
• With query aspects: SE returns general results + query aspects; user reformulates query by selecting “reviews” aspect
• SE returns reviews of the camera
• User’s query is satisfied, e.g., she clicks on a result

Learning of query aspects from reformulations
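A minimal sketch of one way to mine candidate aspects from reformulations, not the method from the slides. It assumes a log of (user_id, timestamp in seconds, query) records, reads "session = 10 mins" as "two consecutive queries by the same user at most 10 minutes apart", and only detects reformulations that append words to the previous query. All names are hypothetical.

```python
from collections import Counter

SESSION_SECONDS = 10 * 60  # assumed reading of "session = 10 mins"

def mine_aspects(log):
    """Count (original query, added qualifier) pairs within a session.

    log: iterable of (user_id, timestamp_seconds, query), in any order.
    """
    aspects = Counter()
    last = {}  # user_id -> (timestamp, normalized query) of previous query
    for user, ts, query in sorted(log, key=lambda r: (r[0], r[1])):
        q = " ".join(query.lower().split())
        prev = last.get(user)
        if prev is not None:
            prev_ts, prev_q = prev
            # Reformulation: same session, previous query plus extra words.
            if ts - prev_ts <= SESSION_SECONDS and q.startswith(prev_q + " "):
                aspects[(prev_q, q[len(prev_q) + 1:])] += 1
        last[user] = (ts, q)
    return aspects
```

Qualifiers that many users add to the same query (such as “reviews” for “canon eos 40d”) become candidate aspects for editors to review.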

Results: Examples of aspects found

New directions might be

• Taking target web page clicked into account while constructing aspects

• Or visualization techniques helping to visually/perceptually/cognitively mine such “aspects”
– Visualization/refinement iterations to narrow down

Tomorrow 4:15pm (B2 102)

Title: Using Information Visualization to Understand Data

Abstract: Information Visualization is the art and science of representing abstract information in a visual form that enables users to understand data through their perceptual and cognitive capabilities.

Dr. Bongshin Lee (Microsoft Research)