1
Search engines and agents in the web of the future
TREFpunkt 2006
Anders Arpteg, Ph.D. in Computer Science
2
Google Architecture
3
Pre-Google Ranking
WebCrawler, AltaVista, Evreka
TF/IDF
  Term Frequency, Inverse Document Frequency
Problems
  Returns many irrelevant pages
  Easy to cheat and manipulate rankings
Other techniques used
  Lexical analysis
  Stop words
  Stemming
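A minimal sketch of TF/IDF scoring may make the "easy to cheat" problem concrete. It assumes one common variant (term count divided by document length, times the log of the inverse document frequency); the slides do not say which variant the early engines used, and the three documents are made up for illustration.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document.

    tf  = term count / document length
    idf = log(N / number of documents containing the term)
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

# Hypothetical mini-corpus; the first page is keyword-stuffed.
docs = [
    "cheap pizza cheap pizza cheap pizza",
    "order pizza online",
    "search engines rank pages",
]
scores = tf_idf(docs)
```

In this sketch the stuffed page gets a higher weight for "pizza" than the page that actually offers online ordering, simply because the term is repeated more often. That is the kind of manipulation the slide refers to.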
4
The PageRank Algorithm
Links instead of terms
Many IMPORTANT inbound links

$$PR(A) = (1 - d) + d \sum_{i} \frac{PR(t_i)}{C(t_i)}$$

PR(x) = PageRank value for page x
d = damping factor
C(x) = number of outbound links from page x
t_i = the pages that link to A
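The formula can be sketched as a simple iteration. This is an illustrative sketch, not Google's implementation: the function name `pagerank`, the start value of 1 for every page, the default d = 0.85 and the two-page graph are assumptions made for the example.

```python
def pagerank(links, d=0.85, iterations=100):
    """Iterate PR(A) = (1 - d) + d * sum(PR(t) / C(t)) over pages t linking to A.

    links maps each page to the list of pages it links to.
    """
    pr = {page: 1.0 for page in links}  # assumed start value: 1 for every page
    for _ in range(iterations):
        new_pr = {}
        for page in links:
            # Each linking page passes on an equal share of its own PageRank.
            share = sum(pr[src] / len(out)
                        for src, out in links.items() if page in out)
            new_pr[page] = (1 - d) + d * share
        pr = new_pr  # synchronous update: all pages use the previous round's values
    return pr

# Hypothetical two-page web: X links to Y, Y links nowhere.
ranks = pagerank({"X": ["Y"], "Y": []})
```

Here X has no inbound links and settles at 1 - d, while Y receives X's whole share on top of that.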
5
PageRank Example 1
A page's PageRank = 0.15 + 0.85 * (a "share" of the PageRank of every page that links to it)

Iteration   A      B      C
0           1      1      1
1           0.15   0.15   0.15
6
PageRank Example 2
A page's PageRank = 0.15 + 0.85 * (a "share" of the PageRank of every page that links to it)

Iteration   A      B        C
0           1      1        1
1           0.15   1        0.15
2           0.15   0.2775   0.15
7
PageRank Example 3
A page's PageRank = 0.15 + 0.85 * (a "share" of the PageRank of every page that links to it)

Iteration   A      B      C
0           1      1      1
1           1      1      1
2           1      1      1
8
PageRank Example 4
A page's PageRank = 0.15 + 0.85 * (a "share" of the PageRank of every page that links to it)

Iteration   A      B       C
0           1      1       1
1           1.85   0.575   0.575
2           1.13   0.94    0.94
99          1.46   0.77    0.77
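The converged row of Example 4 can be checked by hand. The link diagram is not preserved in this transcript, so the following assumes the structure implied by iteration 1: A links to B and C, and B and C each link only to A. At the fixed point:

```latex
\begin{aligned}
PR(A) &= 0.15 + 0.85\,\bigl(PR(B) + PR(C)\bigr), \qquad
PR(B) = PR(C) = 0.15 + 0.85\,\frac{PR(A)}{2} \\
PR(A) &= 0.15 + 1.7\,\bigl(0.15 + 0.425\,PR(A)\bigr) = 0.405 + 0.7225\,PR(A) \\
PR(A) &= \frac{0.405}{0.2775} \approx 1.46, \qquad
PR(B) = PR(C) \approx 0.15 + 0.425 \cdot 1.46 \approx 0.77
\end{aligned}
```

Under that assumption the three values sum to 3, the same total the pages started with.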
9
Other ranking factors
PageRank is not as important any more
Targeted keyword techniques
  Choose keywords carefully
  Font-size identification
  Keywords in title, headings, …
  Keywords in URL, preferably in the domain name
  Keywords in link text
Relevant inbound links
  Links from sites with related content
  Links from sites with high PageRank
Patience: time works in your favor
  Sandbox effect
  Trusted and old domains
Clean code, valid HTML
  Beware of JavaScript links
  Beware of frames
Use Google Sitemaps, but beware of link farming
10
Google Summary
Links represent popularity, and we want popular sites ranked highly
  PageRank is harder to cheat than TF/IDF
Revolutionary architecture
  High coverage
  High performance
PageRank is not the only factor
  Keyword targeting
  Clean design, valid code
General rule
  Google tries to simulate human behavior
  Keywords that are highlighted for humans are highly valued by Google
  Sites with good structure for humans have good structure for Google
11
The Semantic Web
Definition of the Semantic Web
  "The Semantic Web is an extension of the current web in which information is given well-defined meaning, better enabling computers and people to work in cooperation."
  Tim Berners-Lee, James Hendler, Ora Lassila, "The Semantic Web", Scientific American, May 2001
Why the Semantic Web topic?
  Connection to my research work
  How will the Semantic Web influence you?
12
History of the Internet
1969  ARPANET (Internet)
1971  Email
1974  TCP introduced
1979  USENET
1984  DNS introduced
1989  First Web proposal
1991  WWW introduced
1994  Order pizza online
1994  WebCrawler
1995  Sun launches Java
1998  Google
1998  XML defined
1999  RDF defined
2004  Yahoo Search, MSN Search
2004  OWL defined

"We set up a telephone connection between us and the guys at SRI...," Kleinrock ... said in an interview. "We typed the L and we asked on the phone, 'Do you see the L?' 'Yes, we see the L,' came the response.
"We typed the O, and we asked, 'Do you see the O?' 'Yes, we see the O.' Then we typed the G, and the system crashed" ... Yet a revolution had begun ...
13
Growth of the Web
14
Problems with the current Web
15
Semantic Web Principles
1. Everything can be identified by URIs
2. Resources and links can have types
3. Partial information is tolerated
4. There is no need for absolute truth
5. Evolution is supported
6. Minimalistic design
Make simple things simple, and complex things possible!
16
Semantic Web Languages
XML
  Defines the data language
  How to encode “words” into a string
RDF
  Defines resources and links
  How “things” are related to each other
OWL
  Defines ontologies
  What things “mean” and their constraints
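The RDF idea of typed links between resources can be sketched with plain tuples. This is only an illustration of the data model, not an RDF library: the `example.org` URIs, the `soldBy` property and the helper `objects` are hypothetical, while the `rdf:type` and `rdfs:label` URIs are the standard ones.

```python
# Each RDF statement is a (subject, predicate, object) triple.
# All example.org URIs below are hypothetical placeholders.
EX = "http://example.org/company#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"

triples = [
    (EX + "item42", RDF_TYPE, EX + "Item"),
    (EX + "item42", RDFS_LABEL, "Pizza box"),
    (EX + "item42", EX + "soldBy", EX + "acme"),
    (EX + "acme", RDF_TYPE, EX + "Company"),
]

def objects(subject, predicate):
    """Return every object linked to `subject` by the typed link `predicate`."""
    return [o for s, p, o in triples if s == subject and p == predicate]

seller = objects(EX + "item42", EX + "soldBy")
```

Because both resources and links are identified by URIs, a program can follow `soldBy` from the item to the company without parsing any free text. An ontology language such as OWL would add constraints on top, for example that only a Company may appear as the object of `soldBy`.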
17
Semantic Web Layers
18
Semantic Web Example
<?xml version="1.0" encoding="ISO-8859-1"?>
<rdf:RDF xmlns:daml="http://www.daml.org/2001/03/daml+oil#"
         xmlns:dc="http://purl.org/dc/elements/1.1/"
         xmlns:oiled="http://img.cs.man.ac.uk/oil/oiled#"
         xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
         xmlns:xsd="http://www.w3.org/2000/10/XMLSchema#">
  <daml:Ontology rdf:about="">
    <dc:title>Distribution Company</dc:title>
    <dc:date></dc:date>
    <dc:creator>Anders Arpteg</dc:creator>
    <dc:description></dc:description>
    <dc:subject></dc:subject>
    <daml:versionInfo></daml:versionInfo>
  </daml:Ontology>
  <daml:Class rdf:about="http://www.bugsoft.nu/aa/logics2003/company.daml#item">
    <rdfs:label>item</rdfs:label>
    <rdfs:comment><![CDATA[]]></rdfs:comment>
    <oiled:creationDate><![CDATA[2003-12-17T10:06:20Z]]></oiled:creationDate>
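A short sketch of how a program might read such a document, using Python's standard `xml.etree.ElementTree`. The excerpt on the slide is cut off, so the copy below is shortened and the closing `</daml:Class>` and `</rdf:RDF>` tags are added as an assumption to make it parse.

```python
import xml.etree.ElementTree as ET

# Prefix-to-namespace map used in the path expressions below.
NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "daml": "http://www.daml.org/2001/03/daml+oil#",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Shortened copy of the slide's ontology; the two closing tags at the end
# are an assumption, since the original excerpt is truncated.
DOC = """
<rdf:RDF xmlns:daml="http://www.daml.org/2001/03/daml+oil#"
         xmlns:dc="http://purl.org/dc/elements/1.1/"
         xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#">
  <daml:Ontology rdf:about="">
    <dc:title>Distribution Company</dc:title>
    <dc:creator>Anders Arpteg</dc:creator>
  </daml:Ontology>
  <daml:Class rdf:about="http://www.bugsoft.nu/aa/logics2003/company.daml#item">
    <rdfs:label>item</rdfs:label>
  </daml:Class>
</rdf:RDF>
"""

root = ET.fromstring(DOC.strip())
title = root.find("daml:Ontology/dc:title", NS).text
# Map each class URI (the rdf:about attribute) to its human-readable label.
classes = {
    cls.get("{%s}about" % NS["rdf"]): cls.find("rdfs:label", NS).text
    for cls in root.findall("daml:Class", NS)
}
```

The class is identified by its URI rather than by its label, so two ontologies can both use the label "item" without being confused with each other.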
19
Summary
Problems with the current Web
  Huge amount of information, needs knowledge management (KM)
  Machines cannot understand the information
Semantic Web technologies
  Standardized languages
  Minimalistic approach
Good or bad?
  Nothing really new, we can already do that
  Amazing, think of all the new possibilities