(A taste of) Data Management Over the Web
Web R&D
• The web has revolutionized our world
– Relevant research areas include databases, networks, security…
– Data structures and architecture, complexity, image processing, security, natural language processing, user interface design…
• Lots of research in each of these directions
– Specialized conferences for web research
– Lots of companies
• This course will focus on Web Data
Web Data
• The web has revolutionized our world
• Data is everywhere
– Web pages, images, movies, social data, likes and dislikes…
• This constitutes great potential
• But also a lot of challenges
– Web data is huge, unstructured, dirty…
• Just the ingredients of a fun research topic!
Ingredients
• Representation & Storage
– Standards (HTML, HTTP), compact representations, security…
• Search and Retrieval
– Crawling, inferring information from text…
• Ranking
– What's important and what's not
– Google PageRank, Top-K algorithms, recommendations…
Challenges
• Huge
– Over 14 billion pages indexed by Google
• Unstructured
– But we do have some structure, such as HTML links, friendships in social networks…
• Dirty
– A lot of the data is incorrect, inconsistent, contradictory, or just irrelevant…
Course Goal
• Introducing a selection of fun topics in web data management
• Allowing you to understand some state-of-the-art notions, algorithms, and techniques
• As well as the main challenges and how we approach them
Course outline
• Ranking: HITS and PageRank
• Data representation: XML, HTML
• Crawling
• Information Retrieval and Extraction, with a Wikipedia example
• Aggregating ranks and Top-K algorithms
• Recommendations: Collaborative Filtering for recommending movies in Netflix
• Other topics (time permitting): Deep Web, advertisements…
• The course is partly based on Web Data Management and Distribution by Serge Abiteboul, Ioana Manolescu, Philippe Rigaux, Marie-Christine Rousset, and Pierre Senellart, and on a course by Pierre Senellart (and others) at Télécom ParisTech
Course requirement
• A small final project
• Will involve understanding of 2 or 3 of the subjects studied and some implementation
• Will be assigned next Monday
Ranking
Why Ranking?
• Huge number of pages
• Huge even if we filter according to relevance
– Keep only pages that include the keywords
• A lot of the pages are not informative
– And anyway, it is impossible for users to go through 10K results
How to rank?
• Observation: links are very informative!
• Instead of a collection of web pages, we have a web graph!
• This is important for discovering new sites (see crawling), but also for estimating the importance of a site
– CNN.com has more links to it than my homepage…
Authority and Hubness
• Authority: a site is authoritative if it receives many citations. Citations from important sites weigh more than citations from less important sites.
A(v) = the authority of v
• Hubness measures the importance of a site as a pointer: a good hub is a site that links to many authoritative sites.
H(v) = the hubness of v
HITS (Kleinberg ’99)
• Recursive dependency:
a(v) = Σ h(u), summed over all pages u that link to v
h(v) = Σ a(u), summed over all pages u that v links to
• Normalize according to the sum of authority / hubness values
• We can show that a(v) and h(v) converge
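The two mutually recursive updates above can be sketched in a few lines of Python. This is a minimal illustration, not a production implementation: the example graph, the fixed iteration count, and normalizing by the sum (rather than the L2 norm used in some presentations) are assumptions made for clarity.

```python
# Sketch of HITS iteration: alternately update authority and hubness
# scores, normalizing each round. Graph is a dict: node -> outlinks.

def hits(graph, iterations=50):
    nodes = set(graph) | {v for targets in graph.values() for v in targets}
    auth = {v: 1.0 for v in nodes}
    hub = {v: 1.0 for v in nodes}
    for _ in range(iterations):
        # a(v) = sum of h(u) over all pages u that link to v
        auth = {v: sum(hub[u] for u in nodes if v in graph.get(u, []))
                for v in nodes}
        # h(v) = sum of a(u) over all pages u that v links to
        hub = {v: sum(auth[u] for u in graph.get(v, [])) for v in nodes}
        # normalize so each score vector sums to 1
        a_sum, h_sum = sum(auth.values()), sum(hub.values())
        auth = {v: a / a_sum for v, a in auth.items()}
        hub = {v: h / h_sum for v, h in hub.items()}
    return auth, hub
```

On a toy graph where two pages both link to a third, the third page ends up with the highest authority and the two pointers become the hubs, matching the intuition on this slide.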
Random Surfer Model
• Consider a "random surfer"
• At each step, the surfer picks one of the current page's links uniformly at random and clicks on it
P(W) = P(W1)·(1/O(W1)) + … + P(Wn)·(1/O(Wn))
where W1…Wn are the pages linking to W, and O(Wi) is the number of out-edges of Wi
Recursive definition
• PageRank reflects the probability of being at a web page (PR(W) = P(W))
• Then: PR(W) = PR(W1)·(1/O(W1)) + … + PR(Wn)·(1/O(Wn))
• How to solve?
Eigenvector!
• PR (a row vector) is the left eigenvector of the stochastic transition matrix
– I.e., the adjacency matrix normalized so that every row sums to 1
• The Perron–Frobenius theorem ensures that such a vector exists
• It is unique if the matrix is irreducible
– This can be guaranteed by small perturbations
Problems
• A random surfer may get stuck in one component of the graph
• May get stuck in loops
• “Rank Sink” problem
– Many web pages have no outlinks
Damping Factor
• Add some probability d of "jumping" to a random page
• Now P(W) = (1−d)·[P(W1)·(1/O(W1)) + … + P(Wn)·(1/O(Wn))] + d·(1/N)
where N is the number of pages in the index
How to compute PR?
• Analytical methods
– Can we solve the equations?
– In principle yes, but the matrix is huge!
– Not a realistic solution at web scale
• Approximations
A random surfer algorithm
• Start from an arbitrary page
• Toss a coin to decide whether to follow a link or to jump to a random new page
• Then toss another coin to decide which link to follow / which page to go to
• Keep a record of the frequency of the web pages visited
• The frequency for each page converges to its PageRank
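The random-surfer algorithm above can be simulated directly as a Monte Carlo estimate. The graph, jump probability d = 0.15, step count, and random seed below are illustrative choices, not values from the slides.

```python
# Monte Carlo PageRank: simulate the random surfer and count visits.
# Visit frequencies approximate the PageRank of each page.
import random
from collections import Counter

def random_surfer(graph, d=0.15, steps=100_000, seed=42):
    rng = random.Random(seed)
    pages = list(graph)
    visits = Counter()
    page = rng.choice(pages)          # start from an arbitrary page
    for _ in range(steps):
        visits[page] += 1
        out = graph[page]
        # first coin: with probability d (or if stuck at a page with
        # no outlinks), jump to a random page
        if rng.random() < d or not out:
            page = rng.choice(pages)
        else:
            # second coin: pick one of the outgoing links uniformly
            page = rng.choice(out)
    return {p: visits[p] / steps for p in pages}
```

On a small graph where every other page links to "b", the estimated frequency of "b" comes out highest, as the slide's convergence claim predicts.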
Power method
• Start with some arbitrary rank row vector R0
• Compute Ri = Ri−1 · A
• If we happen to reach the eigenvector, we stay there
• Theorem: The process converges to the eigenvector!
• Convergence is in practice pretty fast (~100 iterations)
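A pure-Python sketch of the power method, including the damping term from the earlier slide, might look as follows. The graph, d = 0.15, and the iteration count are assumptions for illustration; a web-scale implementation would use sparse matrix operations instead of dictionaries.

```python
# Power-method sketch: start from a uniform rank vector and repeatedly
# apply the damped transition step until the vector stabilizes.

def pagerank_power(graph, d=0.15, iterations=100):
    """graph: dict node -> list of outlinks. Returns approximate PageRank."""
    pages = list(graph)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}        # arbitrary start vector R0
    for _ in range(iterations):
        new_rank = {p: d / n for p in pages}  # the random-jump term d·(1/N)
        for p in pages:
            out = graph[p] or pages           # treat sinks as linking everywhere
            share = (1 - d) * rank[p] / len(out)
            for q in out:                     # distribute rank along out-edges
                new_rank[q] += share
        rank = new_rank
    return rank
```

Each iteration preserves the total mass of 1, so the result stays a probability distribution, and 100 iterations is typically plenty on small graphs.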
Other issues
• Accelerating Computation
• Distributed PageRank
• Mixed Model (Incorporating "static" importance)
• Personalized PageRank
XML
HTML (HyperText Markup Language)
• Used for presentation
• Standardized by W3C (1999)
• Describes the structure and content of a (web) document
• HTML is an open format– Can be processed by a variety of tools
HTTP
• Application protocol
• Client request:
GET /MarkUp/ HTTP/1.1
Host: www.google.com
• Server response:
HTTP/1.1 200 OK
• Two main HTTP methods: GET and POST
GET
URL: http://www.google.com/search?q=BGU
Corresponding HTTP GET request:
GET /search?q=BGU HTTP/1.1
Host: www.google.com
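The request above is just plain text sent over the connection. As a small sketch (the helper function name is mine, not part of any standard API), the slide's GET request can be assembled like this:

```python
# Building the raw HTTP/1.1 GET request text from the slide.
# A real client would send these bytes over a TCP socket to port 80.

def build_get_request(host: str, path: str) -> str:
    return (
        f"GET {path} HTTP/1.1\r\n"
        f"Host: {host}\r\n"
        "\r\n"  # a blank line ends the header section
    )

request = build_get_request("www.google.com", "/search?q=BGU")
print(request)
```

Note that HTTP/1.1 requires the Host header, which is how one server can host many domains on a single IP address.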
POST
• Used for submitting forms
POST /php/test.php HTTP/1.1
Host: www.bgu.ac.il
Content-Type: application/x-www-form-urlencoded
Content-Length: 100
…
Status codes
• An HTTP response always starts with a status code followed by a human-readable message (e.g., 200 OK)
• The first digit indicates the class of the response:
1 Information
2 Success
3 Redirection
4 Client-side error
5 Server-side error
Authentication
• HTTPS is a variant of HTTP that includes encryption, cryptographic authentication, session tracking, etc.
• It can be used instead of plain HTTP to transmit sensitive data
GET ... HTTP/1.1
Authorization: Basic dG90bzp0aXRp
Cookies
• Key/value pairs, that a server asks a client to store and retransmit with each HTTP request (for a given domain name).
• Can be used to keep information on users between visits
• Often what is stored is a session ID
– Connected, on the server side, to all session information
Crawling
Basics of Crawling
• Crawlers, (Web) spiders, (Web) robots: autonomous agents that retrieve pages from the Web
• Basic crawling algorithm:
1. Start from a given URL or set of URLs
2. Retrieve and process the corresponding page
3. Discover new URLs (next slide)
4. Repeat on each found URL
Problem: The web is huge!
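The four-step loop above can be sketched as a breadth-first traversal over a frontier of URLs. In this sketch, `fetch` and `extract_urls` are hypothetical placeholders passed in by the caller: a real crawler would issue HTTP requests and parse HTML there, and the page limit addresses the "web is huge" problem by bounding the crawl.

```python
# Sketch of the basic crawling loop with a breadth-first frontier.
from collections import deque

def crawl(seed_urls, fetch, extract_urls, limit=1000):
    frontier = deque(seed_urls)          # step 1: start from given URLs
    seen = set(seed_urls)
    while frontier and len(seen) <= limit:
        url = frontier.popleft()
        page = fetch(url)                # step 2: retrieve and process
        for link in extract_urls(page):  # step 3: discover new URLs
            if link not in seen:         # never enqueue a URL twice
                seen.add(link)
                frontier.append(link)    # step 4: repeat on each found URL
    return seen
```

Swapping the deque for a stack would turn this into the depth-first variant mentioned on the next slides.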
Discovering new URLs
• Browse the "internet graph" (following e.g. hyperlinks)
• Referrer URLs
• Site maps (sitemap.org)
The internet graph
• At least 14.06 billion nodes = pages
• At least 140 billion edges = links
Graph-browsing algorithms
• Depth-first
• Breadth-first
• Combinations..
Duplicates
• Identifying duplicates or near-duplicates on the Web to prevent multiple indexing
• Trivial duplicates: same resource at the same canonized URL:
http://example.com:80/toto
http://example.com/titi/../toto
(both canonicalize to http://example.com/toto)
• Exact duplicates: identification by hashing
• Near-duplicates (timestamps, tip of the day, etc.) are more complex!
Near-duplicate detection
• Edit distance
– A good measure of similarity
– But does not scale to a large collection of documents (unreasonable to compute the edit distance for every pair!)
• Shingles: two documents are similar if they mostly share the same succession of k-grams
Crawling ethics
• robots.txt at the root of a Web server:
User-agent: *
Allow: /searchhistory/
Disallow: /search
• Per-page exclusion (de facto standard):
<meta name="ROBOTS" content="NOINDEX,NOFOLLOW">
• Per-link exclusion (de facto standard):
<a href="toto.html" rel="nofollow">Toto</a>
• Avoid denial of service (DoS): wait 100 ms to 1 s between two repeated requests to the same Web server
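Python's standard library includes a robots.txt parser, so a polite crawler can check the rules above before fetching. This sketch parses the slide's example rules locally (no network access); note that Python applies rules in file order, which here gives the same answers as the longest-match convention used by major search engines.

```python
# Checking the slide's robots.txt rules with the standard library.
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Allow: /searchhistory/",
    "Disallow: /search",
])

print(rp.can_fetch("*", "/search"))          # → False (disallowed)
print(rp.can_fetch("*", "/searchhistory/"))  # → True (explicitly allowed)
```

A crawler would normally point the parser at `http://<host>/robots.txt` via `set_url()` and `read()`, then consult `can_fetch()` for every URL in its frontier.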