(A taste of) Data Management Over the Web
Web R&D
• The web has revolutionized our world
– Relevant research areas include databases, networks, security…
– Data structures and architecture, complexity, image processing, security, natural language processing, user interface design…
• Lots of research in each of these directions
– Specialized conferences for web research
– Lots of companies
• This course will focus on Web Data
Web Data
• The web has revolutionized our world
• Data is everywhere
– Web pages, images, movies, social data, likes and dislikes…
• This constitutes great potential
• But also a lot of challenges
– Web data is huge, unstructured, dirty…
• Just the ingredients of a fun research topic!
Ingredients
• Representation & Storage
– Standards (HTML, HTTP), compact representations, security…
• Search and Retrieval
– Crawling, inferring information from text…
• Ranking
– What's important and what's not
– Google PageRank, Top-K algorithms, recommendations…
Challenges
• Huge
– Over 14 billion pages indexed by Google
• Unstructured
– But we do have some structure, such as HTML links, friendships in social networks…
• Dirty
– A lot of the data is incorrect, inconsistent, contradictory, or just irrelevant…
Course Goal
• Introducing a selection of fun topics in web data management
• Allowing you to understand some state-of-the-art notions, algorithms, and techniques
• As well as the main challenges and how we approach them
Course outline
• Ranking: HITS and PageRank
• Data representation: XML, HTML
• Crawling
• Information Retrieval and Extraction, with a Wikipedia example
• Aggregating ranks and Top-K algorithms
• Recommendations: Collaborative Filtering for recommending movies in Netflix
• Other topics (time permitting): Deep Web, advertisements…
• The course is partly based on Web Data Management and Distribution by Serge Abiteboul, Ioana Manolescu, Philippe Rigaux, Marie-Christine Rousset, and Pierre Senellart, and on a course by Pierre Senellart (and others) at Télécom ParisTech
Course requirement
• A small final project
• Will involve understanding of 2 or 3 of the subjects studied and some implementation
• Will be assigned next Monday
Ranking
Why Ranking?
• Huge number of pages
• Huge even if we filter according to relevance
– Keep only pages that include the keywords
• A lot of the pages are not informative
– And anyway, it is impossible for users to go through 10K results
How to rank?
• Observation: links are very informative!
• Instead of a collection of web pages, we have a web graph!
• This is important for discovering new sites (see crawling), but also for estimating the importance of a site
– CNN.com has more links to it than my homepage…
Authority and Hubness
• Authority: a site is authoritative if it receives many citations. Citations from important sites weigh more than citations from less important sites.
A(v) = the authority of v
• Hubness measures the importance of a site as a pointer: a good hub is a site that links to many authoritative sites.
H(v) = the hubness of v
HITS (Kleinberg ’99)
• Recursive dependency:
a(v) = Σ h(u), summed over all pages u that link to v
h(v) = Σ a(u), summed over all pages u that v links to
• Normalize according to the sum of authority / hubness values
• We can show that a(v) and h(v) converge
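The two mutually recursive updates above can be sketched in a few lines of Python. This is a minimal illustration, not a production implementation: the example graph, the fixed iteration count, and normalizing by the sum (rather than the L2 norm used in some presentations) are assumptions made for clarity.

```python
# Sketch of HITS iteration: alternately update authority and hubness
# scores, normalizing each round. Graph is a dict: node -> outlinks.

def hits(graph, iterations=50):
    nodes = set(graph) | {v for targets in graph.values() for v in targets}
    auth = {v: 1.0 for v in nodes}
    hub = {v: 1.0 for v in nodes}
    for _ in range(iterations):
        # a(v) = sum of h(u) over all pages u that link to v
        auth = {v: sum(hub[u] for u in nodes if v in graph.get(u, []))
                for v in nodes}
        # h(v) = sum of a(u) over all pages u that v links to
        hub = {v: sum(auth[u] for u in graph.get(v, [])) for v in nodes}
        # normalize so each score vector sums to 1
        a_sum, h_sum = sum(auth.values()), sum(hub.values())
        auth = {v: a / a_sum for v, a in auth.items()}
        hub = {v: h / h_sum for v, h in hub.items()}
    return auth, hub
```

On a toy graph where two pages both link to a third, the third page ends up with the highest authority and the two pointers become the hubs, matching the intuition on this slide.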
Random Surfer Model
• Consider a "random surfer"
• At each step, the surfer picks one of the current page's links uniformly at random and clicks on it
P(W) = P(W1)·(1/O(W1)) + … + P(Wn)·(1/O(Wn))
where W1…Wn are the pages linking to W, and O(Wi) is the number of out-edges of Wi
Recursive definition
• PageRank reflects the probability of being at a web page (PR(W) = P(W))
• Then: PR(W) = PR(W1)·(1/O(W1)) + … + PR(Wn)·(1/O(Wn))
• How to solve?
Eigenvector!
• PR (a row vector) is the left eigenvector of the stochastic transition matrix
– I.e., the adjacency matrix normalized so that every row sums to 1
• The Perron–Frobenius theorem ensures that such a vector exists
• It is unique if the matrix is irreducible
– This can be guaranteed by small perturbations
Problems
• A random surfer may get stuck in one component of the graph
• May get stuck in loops
• “Rank Sink” problem
– Many web pages have no outlinks
Damping Factor
• Add some probability d of "jumping" to a random page
• Now P(W) = (1−d)·[P(W1)·(1/O(W1)) + … + P(Wn)·(1/O(Wn))] + d·(1/N)
where N is the number of pages in the index
How to compute PR?
• Analytical methods
– Can we solve the equations?
– In principle yes, but the matrix is huge!
– Not a realistic solution at web scale
• Approximations
A random surfer algorithm
• Start from an arbitrary page
• Toss a coin to decide whether to follow a link or to jump to a random new page
• Then toss another coin to decide which link to follow / which page to go to
• Keep a record of the frequency of the web pages visited
• The frequency for each page converges to its PageRank
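The random-surfer algorithm above can be simulated directly as a Monte Carlo estimate. The graph, jump probability d = 0.15, step count, and random seed below are illustrative choices, not values from the slides.

```python
# Monte Carlo PageRank: simulate the random surfer and count visits.
# Visit frequencies approximate the PageRank of each page.
import random
from collections import Counter

def random_surfer(graph, d=0.15, steps=100_000, seed=42):
    rng = random.Random(seed)
    pages = list(graph)
    visits = Counter()
    page = rng.choice(pages)          # start from an arbitrary page
    for _ in range(steps):
        visits[page] += 1
        out = graph[page]
        # first coin: with probability d (or if stuck at a page with
        # no outlinks), jump to a random page
        if rng.random() < d or not out:
            page = rng.choice(pages)
        else:
            # second coin: pick one of the outgoing links uniformly
            page = rng.choice(out)
    return {p: visits[p] / steps for p in pages}
```

On a small graph where every other page links to "b", the estimated frequency of "b" comes out highest, as the slide's convergence claim predicts.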
Power method
• Start with some arbitrary rank row vector R0
• Compute Ri = Ri−1 · A
• If we happen to reach the eigenvector, we stay there
• Theorem: The process converges to the eigenvector!
• Convergence is in practice pretty fast (~100 iterations)
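A pure-Python sketch of the power method, including the damping term from the earlier slide, might look as follows. The graph, d = 0.15, and the iteration count are assumptions for illustration; a web-scale implementation would use sparse matrix operations instead of dictionaries.

```python
# Power-method sketch: start from a uniform rank vector and repeatedly
# apply the damped transition step until the vector stabilizes.

def pagerank_power(graph, d=0.15, iterations=100):
    """graph: dict node -> list of outlinks. Returns approximate PageRank."""
    pages = list(graph)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}        # arbitrary start vector R0
    for _ in range(iterations):
        new_rank = {p: d / n for p in pages}  # the random-jump term d·(1/N)
        for p in pages:
            out = graph[p] or pages           # treat sinks as linking everywhere
            share = (1 - d) * rank[p] / len(out)
            for q in out:                     # distribute rank along out-edges
                new_rank[q] += share
        rank = new_rank
    return rank
```

Each iteration preserves the total mass of 1, so the result stays a probability distribution, and 100 iterations is typically plenty on small graphs.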
Other issues
• Accelerating Computation
• Distributed PageRank
• Mixed Model (Incorporating "static" importance)
• Personalized PageRank
XML
HTML (HyperText Markup Language)
• Used for presentation
• Standardized by W3C (1999)
• Describes the structure and content of a (web) document
• HTML is an open format– Can be processed by a variety of tools
HTTP
• Application protocol
• Client request:
GET /MarkUp/ HTTP/1.1
Host: www.google.com
• Server response:
HTTP/1.1 200 OK
• Two main HTTP methods: GET and POST
GET
URL: http://www.google.com/search?q=BGU
Corresponding HTTP GET request:
GET /search?q=BGU HTTP/1.1
Host: www.google.com
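The request above is just plain text sent over the connection. As a small sketch (the helper function name is mine, not part of any standard API), the slide's GET request can be assembled like this:

```python
# Building the raw HTTP/1.1 GET request text from the slide.
# A real client would send these bytes over a TCP socket to port 80.

def build_get_request(host: str, path: str) -> str:
    return (
        f"GET {path} HTTP/1.1\r\n"
        f"Host: {host}\r\n"
        "\r\n"  # a blank line ends the header section
    )

request = build_get_request("www.google.com", "/search?q=BGU")
print(request)
```

Note that HTTP/1.1 requires the Host header, which is how one server can host many domains on a single IP address.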
POST
• Used for submitting forms
POST /php/test.php HTTP/1.1
Host: www.bgu.ac.il
Content-Type: application/x-www-form-urlencoded
Content-Length: 100
…
Status codes
• An HTTP response always starts with a status code followed by a human-readable message (e.g., 200 OK)
• The first digit indicates the class of the response:
1 Information
2 Success
3 Redirection
4 Client-side error
5 Server-side error
Authentication
• HTTPS is a variant of HTTP that includes encryption, cryptographic authentication, session tracking, etc.
• It can be used instead of plain HTTP to transmit sensitive data
GET ... HTTP/1.1
Authorization: Basic dG90bzp0aXRp
Cookies
• Key/value pairs, that a server asks a client to store and retransmit with each HTTP request (for a given domain name).
• Can be used to keep information on users between visits
• Often what is stored is a session ID
– Connected, on the server side, to all session information
Crawling
Basics of Crawling
• Crawlers, (Web) spiders, (Web) robots: autonomous agents that retrieve pages from the Web
• Basic crawling algorithm:
1. Start from a given URL or set of URLs
2. Retrieve and process the corresponding page
3. Discover new URLs (next slide)
4. Repeat on each found URL
Problem: The web is huge!
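The four-step loop above can be sketched as a breadth-first traversal over a frontier of URLs. In this sketch, `fetch` and `extract_urls` are hypothetical placeholders passed in by the caller: a real crawler would issue HTTP requests and parse HTML there, and the page limit addresses the "web is huge" problem by bounding the crawl.

```python
# Sketch of the basic crawling loop with a breadth-first frontier.
from collections import deque

def crawl(seed_urls, fetch, extract_urls, limit=1000):
    frontier = deque(seed_urls)          # step 1: start from given URLs
    seen = set(seed_urls)
    while frontier and len(seen) <= limit:
        url = frontier.popleft()
        page = fetch(url)                # step 2: retrieve and process
        for link in extract_urls(page):  # step 3: discover new URLs
            if link not in seen:         # never enqueue a URL twice
                seen.add(link)
                frontier.append(link)    # step 4: repeat on each found URL
    return seen
```

Swapping the deque for a stack would turn this into the depth-first variant mentioned on the next slides.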
Discovering new URLs
• Browse the "internet graph" (following e.g. hyperlinks)
• Referrer URLs
• Site maps (sitemap.org)
The internet graph
• At least 14.06 billion nodes = pages
• At least 140 billion edges = links
Graph-browsing algorithms
• Depth-first
• Breadth-first
• Combinations..
Duplicates
• Identifying duplicates or near-duplicates on the Web to prevent multiple indexing
• Trivial duplicates: same resource at the same canonized URL:
http://example.com:80/toto
http://example.com/titi/../toto
(both canonicalize to http://example.com/toto)
• Exact duplicates: identification by hashing
• Near-duplicates (timestamps, tip of the day, etc.) are more complex!
Near-duplicate detection
• Edit distance
– A good measure of similarity
– But does not scale to a large collection of documents (unreasonable to compute the edit distance for every pair!)
• Shingles: two documents are similar if they mostly share the same succession of k-grams
Crawling ethics
• robots.txt at the root of a Web server:
User-agent: *
Allow: /searchhistory/
Disallow: /search
• Per-page exclusion (de facto standard):
<meta name="ROBOTS" content="NOINDEX,NOFOLLOW">
• Per-link exclusion (de facto standard):
<a href="toto.html" rel="nofollow">Toto</a>
• Avoid denial of service (DoS): wait 100 ms to 1 s between two repeated requests to the same Web server
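Python's standard library includes a robots.txt parser, so a polite crawler can check the rules above before fetching. This sketch parses the slide's example rules locally (no network access); note that Python applies rules in file order, which here gives the same answers as the longest-match convention used by major search engines.

```python
# Checking the slide's robots.txt rules with the standard library.
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Allow: /searchhistory/",
    "Disallow: /search",
])

print(rp.can_fetch("*", "/search"))          # → False (disallowed)
print(rp.can_fetch("*", "/searchhistory/"))  # → True (explicitly allowed)
```

A crawler would normally point the parser at `http://<host>/robots.txt` via `set_url()` and `read()`, then consult `can_fetch()` for every URL in its frontier.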