The Web and Searching for Information
Multimedia Information Systems VO/KU (707.020)
Christoph Trattner
Know-Center
November 23, 2015
Christoph Trattner (Know-Center) The Web and Searching for Information November 23, 2015 1 / 65
Outline
1 Internet and the Web
2 Web as a Graph
3 Navigation Behavior
4 Search
5 Data analysis for navigation
6 Social Web
Internet and the Web
The Web
What are the reasons for the success of the Web?
Network - the Internet
Addressability across the network
Simplicity
Cross platform, extensible, based on standards
Architecture that scales
Internet Growth
Figure: Internet Growth (http://wstweb1.ecs.soton.ac.uk/web-observatory/about/tracking-explosive-growth/)
Internet Growth
Not only computers but also other devices, e.g. phones, are connected to the Internet
Billions of devices connected today, 80 billion expected by 2020 (IDATE)
1990: about 100,000 GB transferred over the Internet per year (≈ 0.01 PB/month)
Global internet traffic (Cisco estimates)
1990: 0.001 PB/month
2000: 84 PB/month
2010: more than 20000 PB/month
forecast for 2018: 132000 PB/month
Internet Growth
Figure: Internet Map http://www.opte.org/
The Web
The fastest growth of any technology in human history
Time to reach 50 million people
Telephone: 75 years
Radio: 35 years
TV: 13 years
The Web: 4 years
The Web
Figure: Jakob Nielsen, 100 Million Web Sites, http://www.useit.com/alertbox/web-growth.html
Explosive growth (1991-1997): 850%/year
Rapid growth (1998-2001): 150%/year
Maturing growth (2002-2006): 25%/year
The Web
The size of the Web
Visible Web and The Deep Web (behind passwords)
Estimates: Deep Web several orders of magnitude larger
The size of the Web ≈ 1000 billion pages (2008) http://googleblog.blogspot.com/2008/07/we-knew-web-was-big.html
Indexed pages ≈ 50 billion http://www.worldwidewebsize.com/
The Web
Figure: http://techcrunch.com/2009/05/08/is-the-growth-of-the-web-slowing-down-or-just-taking-a-breather/
The Web
The Incredible Growth Of The Web (1984-2013) infographic: http://www.mediabistro.com/alltwitter/web-growth-history_b48671
Internet users: from 1000 in 1984 to 3 billion in 2014
Web sites: from 130 in 1994 to over one billion in 2014
The Web
Queries per day (Google): 10000 in 1998 to 4.4 billion in 2012 (http://www.internetlivestats.com/google-search-statistics/)
Social media users: Facebook - 1 billion, Twitter and LinkedIn - 200 million
Mobile: 1.3 billion smartphones in 2012, over half used for browsing
See more Internet and Web stats: http://www.internetlivestats.com/
The Web
Figure: The Web is dead, http://www.wired.com/magazine/2010/08/ff_webrip/all/1
Web grows but its share is sinking
Mobile apps get things done on the Internet without using the Web
The Web
The Web Ain’t Dead Yet (And It’s Getting Easier to Create) http://www.wired.com/epicenter/2011/08/web-aint-dead-easier-to-make/
Apps and big platforms: easy to use but hard to program
Figure: HTML5
Web as a Graph
Information retrieval on the Web
How do we access and retrieve data on the Web?
Type a URL
Browse/Navigate
Search
To understand these we need to analyze the Web as a natural phenomenon, as an object of scientific inquiry
Navigating the Web
Graph Structure in the Web, Broder et al., 2000
What is the structure of the Web?
Which pages can be accessed by navigation?
How fast can you reach an arbitrary Web page by navigation?
Analysis of a Web crawl: ≈ 200 million pages, 1.5 billion links
Goal: understand Web structure on a macroscopic scale
Graph Structure in the Web
Figure: Bow-tie model of the Web graph
Graph Structure in the Web
SCC: the heart of the Web
IN: new pages, not discovered yet
OUT: corporate websites
TENDRILS: disconnected from SCC
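The bow-tie components above can be computed on a small graph with two breadth-first searches. The following is a minimal sketch, not the procedure used by Broder et al.: it assumes that a page known to lie in the core (`core_seed`) is given, and the toy graph and all names are made up for illustration. Everything that is neither reachable from the core nor able to reach it is lumped into `OTHER` (tendrils and disconnected pages).

```python
from collections import deque

def reachable(graph, start):
    """Return all nodes reachable from start by following edges in graph."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

def reverse(graph):
    """Return the graph with every edge direction flipped."""
    rev = {}
    for src, targets in graph.items():
        rev.setdefault(src, [])
        for dst in targets:
            rev.setdefault(dst, []).append(src)
    return rev

def bow_tie(graph, core_seed):
    """Split nodes into SCC, IN, OUT and OTHER relative to core_seed (assumed to be in the core)."""
    forward = reachable(graph, core_seed)             # pages the core can reach
    backward = reachable(reverse(graph), core_seed)   # pages that can reach the core
    scc = forward & backward
    nodes = set(graph) | {t for ts in graph.values() for t in ts}
    return {
        "SCC": scc,
        "IN": backward - scc,
        "OUT": forward - scc,
        "OTHER": nodes - forward - backward,
    }

# Hypothetical toy Web: page names are illustrative only.
toy_web = {
    "new_blog": ["portal"],
    "portal": ["news"],
    "news": ["portal", "company"],
    "company": [],
    "orphan": ["company"],
}
parts = bow_tie(toy_web, "portal")
```

Here `new_blog` links into the core but is not linked back (IN), while `company` is linked from the core but does not link back (OUT).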
Graph Structure in the Web
The diameter of SCC is 28
The diameter of the graph is over 500
Two randomly chosen pages are connected with a path in only 24% of the cases
Average directed path length around 16
Average undirected path length around 6
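Statistics of this kind can be reproduced on a toy graph by running a breadth-first search from every node. This is a sketch under simplifying assumptions: the graph is tiny, "diameter" means the longest finite shortest path, and the function names and the example graph are hypothetical. On a real Web crawl, paths would be sampled rather than enumerated.

```python
from collections import deque

def bfs_distances(graph, start):
    """Shortest directed path length from start to every reachable node."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist

def path_statistics(graph):
    """Fraction of connected ordered pairs, average path length and diameter."""
    nodes = sorted(set(graph) | {t for ts in graph.values() for t in ts})
    lengths = []
    for src in nodes:
        dist = bfs_distances(graph, src)
        lengths.extend(d for dst, d in dist.items() if dst != src)
    ordered_pairs = len(nodes) * (len(nodes) - 1)
    return {
        "connected_fraction": len(lengths) / ordered_pairs,
        "average_path_length": sum(lengths) / len(lengths),
        "diameter": max(lengths),
    }

# Three pages in a cycle plus one page that only links in.
toy_graph = {"a": ["b"], "b": ["c"], "c": ["a"], "d": ["a"]}
stats = path_statistics(toy_graph)
```

In the toy graph nobody links to `d`, so only 9 of the 12 ordered pairs are connected, which mirrors on a small scale why many page pairs on the Web have no directed path.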
Graph Structure in the Web
Figure: In-degree distribution on the Web
Graph Structure in the Web
In-degree: Power law with exponent 2.1
Graph Evolution: Densification and Shrinking Diameters by J. Leskovec et al., 2007.
Study of various real world graphs
Densification: edges grow superlinearly in the number of nodes with time
Average distance between nodes often shrinks
Shrinking diameter as graph grows
The current Web graph has a similar structure
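The two regularities above can be written compactly. The symbols are assumptions introduced here for illustration: k is the in-degree of a page, and e(t) and n(t) are the numbers of edges and nodes at time t. The exponent 2.1 is the value stated above, and "superlinear" corresponds to a > 1.

```latex
% In-degree distribution: probability that a page has in-degree k
P(k_{\mathrm{in}} = k) \propto k^{-2.1}

% Densification: edges grow superlinearly in the number of nodes
e(t) \propto n(t)^{a}, \qquad a > 1
```

With a = 1 the average degree would stay constant over time; a > 1 means that the average degree increases as the graph grows.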
Navigation Behavior
Navigational behavior on the Web
Study by Huberman et al. in 1998: Strong Regularities in World Wide Web Surfing
Model gives a probability distribution for the number of pages (depth) a user will visit in a site
Observing the number of links users follow on a website
Theoretical model confirmed with the log analysis of several large websites
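For reference, the distribution reported in that study is, to our understanding, a two-parameter inverse Gaussian. The symbols are assumptions for this sketch: L is the number of links followed, μ is the mean and λ a scale parameter fitted to the logs.

```latex
P(L) = \sqrt{\frac{\lambda}{2 \pi L^{3}}}
       \exp\!\left( - \frac{\lambda (L - \mu)^{2}}{2 \mu^{2} L} \right)
```

The distribution is strongly skewed: most users follow only a few links, while a small number of sessions go very deep.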
Navigational behavior on the Web
Figure: Number of links followed (clicks) vs. number of users (frequency)
Navigational behavior on the Web
Study by Gleich et al. in 2010
Tracking the Random Surfer: Empirically Measured Teleportation Parameters in PageRank
Teleportation parameter α is the probability that a user will not follow a link but will jump to another page, e.g. by typing a URL in the address bar
Google's original estimate set α = 0.15
Study measured α empirically
Navigational behavior on the Web
Data: browser logs collected by the Microsoft toolbar
The entire Web: α ≈ 0.35
HelloMovies (structured hierarchical navigation): α ≈ 0.35
Wikipedia: α ≈ 0.6
Findings: Users still navigate
Wikipedia vs. HelloMovies: more link structure → more navigation
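To make the definition of α concrete, the sketch below counts how often a user moves to a page that is not linked from the previous page. This is a deliberately naive illustration, not the estimation method of Gleich et al.; the session format, the link table and the function name are assumptions.

```python
def estimate_alpha(session, links):
    """Fraction of transitions in a session that do not follow a link on the previous page."""
    transitions = list(zip(session, session[1:]))
    teleports = sum(
        1 for previous, current in transitions
        if current not in links.get(previous, ())
    )
    return teleports / len(transitions)

# Hypothetical site structure and one hypothetical browsing session.
links = {"home": ["a", "c"], "a": ["b"], "b": []}
session = ["home", "a", "b", "home", "c"]

# Only the step b -> home is not a link, so 1 of 4 transitions is a teleport.
alpha = estimate_alpha(session, links)
```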
Navigation: summary of problems
Web graph not completely connected
Central navigational structures are not possible
Users do not follow too many links
But, still users navigate!
Some further studies showed the importance of combining search and navigation: first search, then navigate, then refine, etc.
Search
Search engines - categories
Index search engines with spiders/robots
Catalog search engines (Web directories)
Combinations of index and catalog search engines
Meta search engines
Recommendation systems
Search engine architecture
Figure: Generic search engine architecture
Index search engines
Robots collect data by following links
Complex web pages: HTML, CSS, JavaScript, text as graphics, Flash, frames, ... → problems for robots
Gathered data stored in database (page repository)
Indexing module analyses pages and writes them into an index
Query module searches in the index using keywords
Ranking module sorts results according to estimated relevance
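The indexing, query and ranking modules above can be sketched in a few lines. This is a toy illustration under stated assumptions: the "page repository" is a plain dictionary of already collected texts, ranking is by raw word count only, and all names and URLs are hypothetical.

```python
from collections import Counter, defaultdict

def build_index(pages):
    """Indexing module: map each word to the pages containing it, with counts."""
    index = defaultdict(Counter)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word][url] += 1
    return index

def search(index, query):
    """Query and ranking modules: pages containing all query words, best first."""
    words = query.lower().split()
    if not words:
        return []
    candidates = set(index.get(words[0], {}))
    for word in words[1:]:
        candidates &= set(index.get(word, {}))
    scored = [(sum(index[w][url] for w in words), url) for url in candidates]
    # Sort by descending score; ties are broken alphabetically by URL.
    return [url for score, url in sorted(scored, key=lambda item: (-item[0], item[1]))]

# Hypothetical page repository as gathered by a robot.
pages = {
    "shop.example/fruit": "fresh fruit and more fruit online",
    "shop.example/veg": "fresh vegetables online",
    "blog.example": "a fruit salad recipe",
}
index = build_index(pages)
fruit_results = search(index, "fruit")
fresh_results = search(index, "fresh online")
```

The index is built once, so a query only touches the entries of its own keywords instead of scanning every page.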
Ranking of search results
No problems in finding information
Problems in ranking (millions of) results
Ranking strategies
Word counts (how many times does a search word appear?)
Proximity (how close are the search words to each other?)
Position of words in a document (title, meta tags, ...)
Title and meta information, e.g.:
<meta name="keywords" content="fruits, vegetables">
<meta name="description" content="online fruit-shop">
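The three ranking strategies can be combined into a single score. The sketch below is only one possible combination: the weight for title hits (`TITLE_WEIGHT`) and the use of 1/gap as proximity bonus are arbitrary assumptions, not how any real search engine weighs these signals.

```python
TITLE_WEIGHT = 3.0  # assumed bonus for a query word appearing in the title

def min_gap(words, first, second):
    """Smallest distance in word positions between first and second, or None."""
    pos_a = [i for i, w in enumerate(words) if w == first]
    pos_b = [i for i, w in enumerate(words) if w == second]
    gaps = [abs(a - b) for a in pos_a for b in pos_b]
    return min(gaps) if gaps else None

def score(title, body, query):
    """Combine word counts, position (title) and proximity into one number."""
    terms = query.lower().split()
    body_words = body.lower().split()
    title_words = title.lower().split()
    word_count = sum(body_words.count(t) for t in terms)
    title_hits = sum(1 for t in terms if t in title_words)
    proximity = 0.0
    if len(terms) >= 2:
        gap = min_gap(body_words, terms[0], terms[1])
        if gap:  # skips None (a term is missing) and 0 (identical terms)
            proximity = 1.0 / gap
    return word_count + TITLE_WEIGHT * title_hits + proximity

shop_score = score("Online fruit shop", "buy fresh fruit in our online fruit shop", "fruit shop")
blog_score = score("Recipes", "fruit salad from the corner shop", "fruit shop")
```

The shop page wins because the query words appear more often, sit next to each other and also occur in the title.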
Google ranking
Currently Google has the best results
Ranking of Google based on two components
Hits (in content)
PageRank (most important)
How does it work?
Find documents with hits, calculate weights
Apply PageRank
Google ranking
Plain hits - full-text hits (words are somewhere in the text)
URL
Title (second most important)
Anchor text (most important)
Meta Tags
Font sizes of the text - relative to the document
Google ranking
Why is the idea of using anchor text cool?
If destination document is an image you can still find it!
Google ranking
PageRank
Google robot investigates links on the Web
Calculate link statistics for the Web
Pages that have more links pointing to them get higher PageRank
Higher PageRank - more relevant
Pointing pages also have a PageRank
PageRank contributes to PageRank - recursive definition
Google Ranking
PageRank (formula)
$$PR(A) = (1 - d) + d \left( \frac{PR(T_1)}{C(T_1)} + \dots + \frac{PR(T_n)}{C(T_n)} \right)$$
$d$ - constant, usually 0.85
$PR(A)$ - PageRank of Page A
$T_1 \dots T_n$ - all pages pointing to Page A
$PR(T_1) \dots PR(T_n)$ - PageRanks of pages pointing to Page A
$C(T_x)$ - number of outgoing links from Page $T_x$
Google Ranking
Formula is iterative
To calculate PR(A) you need to know PR(T1) ... PR(Tn) - but you do not know them yet
Start with 0 for all PRs and iterate until the values no longer change
The formula converges ;)
For small networks 20-40 iteration steps are needed
For big networks: hundreds of iterations (and each iteration is extremely costly for the Web graph)
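The iteration described above can be sketched as follows. This is a minimal illustration of the formula PR(A) = (1 - d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)) with all values starting at 0; the stopping tolerance, the iteration limit and the three-page example graph are assumptions, and nothing here reflects how Google computes PageRank at Web scale.

```python
def pagerank(links, d=0.85, tolerance=1e-6, max_iterations=1000):
    """Iterate the PageRank formula until no value changes by more than tolerance."""
    pages = sorted(set(links) | {t for ts in links.values() for t in ts})
    out_count = {p: len(links.get(p, ())) for p in pages}   # C(T)
    incoming = {p: [] for p in pages}                        # pages T pointing to p
    for src, targets in links.items():
        for dst in targets:
            incoming[dst].append(src)

    pr = {p: 0.0 for p in pages}  # start with 0 for all PRs
    iterations = 0
    for iterations in range(1, max_iterations + 1):
        new_pr = {
            p: (1 - d) + d * sum(pr[t] / out_count[t] for t in incoming[p])
            for p in pages
        }
        change = max(abs(new_pr[p] - pr[p]) for p in pages)
        pr = new_pr
        if change < tolerance:
            break
    return pr, iterations

# Hypothetical three-page Web: A links to B and C, B links to C, C links to A.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks, steps = pagerank(links)
average = sum(ranks.values()) / len(ranks)
```

In this example every page has at least one outgoing link, so the average PageRank comes out at approximately 1, and C ranks highest because both other pages point to it.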
Google ranking
Each page gives its PageRank to pages that it points to
No discrimination: each page shares its PageRank equally: $PR(T_n) / C(T_n)$
PageRank forms a probability distribution of pages being accessed
The normalized sum of all PRs (in closed topology) is equal to 1
$$\frac{1}{n} \sum_{i=1}^{n} PR(T_i) = 1$$
Google ranking
PageRank Example 1
Calculate: http://kmi.tugraz.at/staff/vsabol/courses/mmis1/examples/google/google_01.php
Source code: http://kmi.tugraz.at/staff/vsabol/courses/mmis1/examples/google/google_01.phps
Google ranking
PageRank Example 2
Calculate: http://kmi.tugraz.at/staff/vsabol/courses/mmis1/examples/google/google_02.php
Source code: http://kmi.tugraz.at/staff/vsabol/courses/mmis1/examples/google/google_02.phps
Google ranking
PageRank Example 3
Calculate: http://kmi.tugraz.at/staff/vsabol/courses/mmis1/examples/google/google_03.php
Source code: http://kmi.tugraz.at/staff/vsabol/courses/mmis1/examples/google/google_03.phps
Google ranking
Average < 1
Page C saves its PR
If no page saves its PR: Average = 1
If the number of pages is very high: Average ≈ 1
Google ranking
PageRank calculates the probability that you access a page if you browse “randomly”
Each page gets at least something:
Obviously: PR(C) > PR(B) > PR(A)
Google ranking
PageRank Example 4
Calculate: http://kmi.tugraz.at/staff/vsabol/courses/mmis1/examples/google/google_04.php
Source code: http://kmi.tugraz.at/staff/vsabol/courses/mmis1/examples/google/google_04.phps
Google ranking
PageRank Example 5
Calculate: http://kmi.tugraz.at/staff/vsabol/courses/mmis1/examples/google/google_05.php
Source code: http://kmi.tugraz.at/staff/vsabol/courses/mmis1/examples/google/google_05.phps
Google ranking
PageRank Example 6
Calculate: http://kmi.tugraz.at/staff/vsabol/courses/mmis1/examples/google/google_06.php
Source code: http://kmi.tugraz.at/staff/vsabol/courses/mmis1/examples/google/google_06.phps
Receiving PR from external pages is good for the PR of a site
Google ranking
Analysis of results
Hierarchy increases PR of the page on the top (homepage)
If you link out, you give away some of your PR
The hope is that you will get back what you give away
If links point to your homepage you will get a lot of PR
Especially if a page with high PR points in!
Google ranking
What does PageRank actually measure?
Popularity!
People create links to a page because they know about the page!
Well-known page gets a lot of links - high PR
It relies on the very nature of the Web and its community
Reasons for the success!
Google ranking
PageRank Example 7 (Google Bombing)
Calculate: http://kmi.tugraz.at/staff/vsabol/courses/mmis1/examples/google/google_07.php
Source code: http://kmi.tugraz.at/staff/vsabol/courses/mmis1/examples/google/google_07.phps
1000 spam pages, no wasted PageRank
Google ranking
PageRank Example 8 (Google Bombing)
Calculate: http://kmi.tugraz.at/staff/vsabol/courses/mmis1/examples/google/google_08.php
Source code: http://kmi.tugraz.at/staff/vsabol/courses/mmis1/examples/google/google_08.phps
external page with a huge PageRank
Google ranking
Analysis of results (Google Bombing)
Quality of the page is most important
People will point to your page!
Google can remove you from the index because of bombing!
Google ranking
When can bombing be “successful”?
Take some unusual anchor text
Make many links with that text to a known page
Submit a query with that text
Famous bomb: http://www.google.com/search?q=miserable+failure
Mostly jokes; bombing cannot earn you money
Useful for political activism and raising awareness
Google ranking
Original PageRank paper: http://www-db.stanford.edu/~backrub/google.html
Mathematical analysis of PageRank
Langville and Meyer: Deeper inside PageRank, 2004
PR as a variant of eigenvector centrality of the Web graph
Google ranking
Centrality measures: identifying most important nodes in a graph
Eigenvector centrality
A node is important if it is connected to other important nodes
Issue 1: in a directed network a node with no incoming links has an eigenvector centrality of 0
Correct by giving each page a small amount of centrality (α)
Google ranking
Issue 2: a high-centrality node with many links passes huge amounts of centrality to its targets
Correct this by splitting the centrality equally among all linked nodes
PageRank made exactly those two corrections
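The two corrections can be traced step by step from eigenvector centrality to the PageRank formula. The notation is an assumption made for this sketch: x_i is the centrality of node i, A_{ji} = 1 if node j links to node i (0 otherwise), C(j) is the number of outgoing links of node j, and the small free amount of centrality is written as (1 - d) so that it matches the PageRank formula PR(A) = (1 - d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)).

```latex
% Plain eigenvector centrality: a node inherits centrality from the nodes pointing to it
x_i = \frac{1}{\lambda} \sum_{j} A_{ji} \, x_j

% Correction 1: every node receives a small free amount of centrality
x_i = (1 - d) + d \sum_{j} A_{ji} \, x_j

% Correction 2: a node splits its centrality equally among its C(j) outgoing links
x_i = (1 - d) + d \sum_{j} A_{ji} \, \frac{x_j}{C(j)}
```

Reading x_i as PR(A) and the nodes j with A_{ji} = 1 as the pages T1 ... Tn recovers the PageRank formula term by term.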
Catalog Search Engines
Listing of Web sites organised into a hierarchical structure
Editorial office checks links/pages
Smaller amount of pages
Easier to find things (for beginners)
Yahoo directory (http://dir.yahoo.com),
DMOZ open directory project (http://www.dmoz.org/)
Meta Search Engines
Search several search engines simultaneously and collect the results
No specific syntax to learn
Cannot use special features of search engines
Issues with ranking (often round robin)
Add additional capabilities (e.g. result clustering)
Mamma (mother of all search engines) http://www.mamma.com
Clusty (previously Vivisimo) http://www.clusty.com
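The round-robin ranking mentioned above can be sketched as follows: the meta search engine takes the first result of each engine, then the second of each, and so on, skipping duplicates. The engine result lists and the function name are hypothetical; real meta search engines would query remote services instead of using fixed lists.

```python
def round_robin_merge(result_lists):
    """Interleave ranked result lists rank by rank, dropping duplicate URLs."""
    merged = []
    seen = set()
    longest = max((len(results) for results in result_lists), default=0)
    for rank in range(longest):
        for results in result_lists:
            if rank < len(results) and results[rank] not in seen:
                seen.add(results[rank])
                merged.append(results[rank])
    return merged

# Hypothetical ranked results returned by two underlying search engines.
engine_a = ["a.example", "b.example", "c.example"]
engine_b = ["b.example", "d.example"]
merged = round_robin_merge([engine_a, engine_b])
```

Round robin needs no knowledge of how each engine scores its results, which is also its weakness: a poor result ranked first by one engine ends up ahead of good second-place results from the others.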
Data analysis for navigation
Recommendation systems
Data and link analysis to support navigation
Automatic creation of recommendations
Collaborative filtering: recommendations based on past behaviour of users
Content-based filtering: recommendations based on similar item properties
Automatic creation of hierarchies (for navigation)
Automatic creation of overviews, etc.
Recommendation systems
Bookstore: client likes a book; users who liked that book were interested in these books as well
Shops
Books, videos, CDs, ...
e.g. Amazon.com
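The bookstore example above is a simple form of collaborative filtering and can be sketched by counting co-occurrences. This is an illustration only: the users, the book lists and the function name are made up, and it does not describe how Amazon.com or any other shop computes its recommendations.

```python
from collections import Counter

def also_liked(likes, item, top_n=3):
    """Items most often liked by users who also liked the given item."""
    counts = Counter()
    for user_items in likes.values():
        if item in user_items:
            counts.update(other for other in user_items if other != item)
    # Sort by descending count; ties are broken alphabetically.
    ranked = sorted(counts.items(), key=lambda pair: (-pair[1], pair[0]))
    return [other for other, _ in ranked[:top_n]]

# Hypothetical past behaviour of four users.
likes = {
    "ann": {"Dune", "Neuromancer", "Foundation"},
    "bob": {"Dune", "Foundation"},
    "cy": {"Dune", "Hyperion"},
    "dee": {"Emma"},
}
recommendations = also_liked(likes, "Dune")
```

Only the behaviour of users is used here; the content of the books plays no role, which is what distinguishes this approach from content-based filtering.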
The End
Any questions?
The End
30.11.2015: Visualization in the Web
07.12.2015: Introduction to Social Web
14.12.2015: Recent Trends in Social Media and the second partial exam