Search Engines and Agents in the Web of the Future (Sökmotorer och agenter i framtidens webb), TREFpunkt 2006. Anders Arpteg, Ph.D. in Computer Science

Page 1

Search Engines and Agents in the Web of the Future (Sökmotorer och agenter i framtidens webb)

TREFpunkt 2006

Anders Arpteg, Ph.D. in Computer Science

Page 2

Google Architecture

Page 3

Pre-Google Ranking

  • WebCrawler, AltaVista, Evreka
  • TF/IDF: Term Frequency, Inverse Document Frequency
  • Problems
      - Returns many irrelevant pages
      - Easy to cheat, to manipulate rankings
  • Other techniques used
      - Lexical analysis
      - Stop words
      - Stemming
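The TF/IDF idea on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy corpus, the stop-word list, and the function names are hypothetical, the score is the common variant tf * log(N / df), and no stemming is applied (so "engine" and "engines" count as different terms).

```python
import math
from collections import Counter

# Hypothetical toy corpus and stop-word list, chosen only for illustration.
STOP_WORDS = {"the", "a", "of", "and", "is", "in"}
DOCS = {
    "d1": "the history of the web and the internet",
    "d2": "search engines rank pages in the web",
    "d3": "a search engine is a web agent",
}


def tokenize(text):
    """Lexical analysis: lowercase, split on whitespace, drop stop words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]


def tf_idf(term, doc_id, docs):
    """Term frequency in one document times log(N / document frequency)."""
    tokens = tokenize(docs[doc_id])
    if not tokens:
        return 0.0
    tf = Counter(tokens)[term] / len(tokens)
    df = sum(1 for text in docs.values() if term in tokenize(text))
    if df == 0:
        return 0.0
    return tf * math.log(len(docs) / df)


def rank(term, docs):
    """Order document ids by descending TF/IDF score for a single term."""
    return sorted(docs, key=lambda d: tf_idf(term, d, docs), reverse=True)


# A page that just repeats the keyword wins the ranking, which is the
# "easy to cheat" problem named on the slide.
stuffed = dict(DOCS, spam="search search search search")
print(rank("search", DOCS))
print(rank("search", stuffed)[0])
```

Because the score depends only on the words a page contains, the page author fully controls it, which is why keyword stuffing works against this kind of ranking.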

Page 4

The PageRank Algorithm

  • Links instead of terms
  • Many IMPORTANT inbound links

$$PR(A) = (1 - d) + d \sum_{i} \frac{PR(t_i)}{C(t_i)}$$

where
  PR(x) = PageRank value for page x
  d = damping factor
  C(x) = number of outbound links from page x
  t_i = the pages that link to A
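A minimal Python sketch of how this formula can be iterated. The function name, the dictionary representation of the link graph, and the choice to update all pages at once from the previous iteration's values are assumptions made for illustration, as is starting every page at 1.

```python
def pagerank(links, d=0.85, iterations=100):
    """Iterate PR(A) = (1 - d) + d * sum(PR(t) / C(t)) over pages t linking to A.

    links maps each page to the list of pages it links to.
    Every page starts at 1; each round uses only the previous round's values.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    pr = {p: 1.0 for p in pages}
    for _ in range(iterations):
        new = {}
        for p in pages:
            # Each page q that links to p passes on an equal share of its value.
            share = sum(
                pr[q] / len(links[q]) for q in pages if p in links.get(q, ())
            )
            new[p] = (1 - d) + d * share
        pr = new
    return pr


# Three pages with a single link A -> B (an assumed graph).
result = pagerank({"A": ["B"], "B": [], "C": []}, iterations=2)
print(round(result["A"], 4), round(result["B"], 4), round(result["C"], 4))
```

With d = 0.85 the constant term is 0.15, so a page with no inbound links settles at 0.15 after the first round.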

Page 5

PageRank Example 1

A page's PageRank = 0.15 + 0.85 * (a "share" of the PageRank of every page that links to it)

Iteration   A      B      C
0           1      1      1
1           0.15   0.15   0.15

Page 6

PageRank Example 2

A page's PageRank = 0.15 + 0.85 * (a "share" of the PageRank of every page that links to it)

Iteration   A      B        C
0           1      1        1
1           0.15   1        0.15
2           0.15   0.2775   0.15

Page 7

PageRank Example 3

A page's PageRank = 0.15 + 0.85 * (a "share" of the PageRank of every page that links to it)

Iteration   A   B   C
0           1   1   1
1           1   1   1
2           1   1   1

Page 8

PageRank Example 4

A page's PageRank = 0.15 + 0.85 * (a "share" of the PageRank of every page that links to it)

Iteration   A      B       C
0           1      1       1
1           1.85   0.575   0.575
2           2.82   0.93    0.93
99          1.46   0.77    0.77
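The link graph for this example is not preserved in the text. The first and last rows are consistent with A linking to both B and C, and B and C each linking back to A. Assuming that graph, the converged values in the last row follow from solving the update rule as a fixed point:

```latex
\begin{align*}
PR(A) &= 0.15 + 0.85\,\bigl(PR(B) + PR(C)\bigr) \\
PR(B) &= PR(C) = 0.15 + 0.85 \cdot \frac{PR(A)}{2} = 0.15 + 0.425\,PR(A) \\
PR(A) &= 0.15 + 1.7\,\bigl(0.15 + 0.425\,PR(A)\bigr) = 0.405 + 0.7225\,PR(A) \\
PR(A) &= \frac{0.405}{0.2775} \approx 1.46 \\
PR(B) &= PR(C) = 0.15 + 0.425 \cdot 1.4595 \approx 0.77
\end{align*}
```

Under the same assumption, iteration 1 gives PR(A) = 0.15 + 0.85 * 2 = 1.85 and PR(B) = PR(C) = 0.15 + 0.85 * 0.5 = 0.575, matching the table. Updating all pages at once from iteration 1 would give PR(A) = 0.15 + 0.85 * 1.15 ≈ 1.13 at iteration 2, which differs from the 2.82 shown; that entry cannot be checked without the original graph.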

Page 9

Other ranking factors

  • PageRank is not as important any more
  • Targeted keyword techniques
      - Choose keywords carefully
      - Font-size identification
      - Keywords in title, headings, …
      - Keywords in URL, preferably in domain name
      - Keywords in link text
  • Relevant inbound links
      - Links from sites with related content
      - Links from sites with high PageRank
  • Patience: time works in your favor
      - Sandbox effect
      - Trusted and old domains
  • Clean code, valid HTML
      - Beware JavaScript links
      - Beware frames
  • Use Google Sitemaps, but beware link farming

Page 10

Google Summary

  • Links represent popularity, and we want popular sites highly ranked
      - Difficult to cheat PageRank compared to TF/IDF
  • Revolutionary architecture
      - High coverage
      - High performance
  • PageRank is not the only factor
      - Keyword targeting
      - Clean design, valid code
  • General rule: Google tries to simulate human behavior
      - Keywords that are highlighted for humans are highly valued by Google.
      - Sites with good structure for humans have good structure for Google.

Page 11

The Semantic Web

  • Definition of the Semantic Web
      "The Semantic Web is an extension of the current web in which information is given well-defined meaning, better enabling computers and people to work in cooperation."
      Tim Berners-Lee, James Hendler, Ora Lassila, The Semantic Web, Scientific American, May 2001.
  • Why the Semantic Web topic?
      - Connection to my research work
      - How will the Semantic Web influence you?

Page 12

History of the Internet

  1969  ARPANET (Internet)
  1971  Email
  1974  TCP introduced
  1979  USENET
  1984  DNS introduced
  1989  First Web proposal
  1991  WWW introduced
  1994  Order pizza online
  1994  WebCrawler
  1995  Sun launches Java
  1998  Google
  1998  XML defined
  1999  RDF defined
  2004  Yahoo Search, MSN Search
  2004  OWL defined

"We set up a telephone connection between us and the guys at SRI...," Kleinrock ... said in an interview. "We typed the L and we asked on the phone, 'Do you see the L?' 'Yes, we see the L,' came the response.

"We typed the O, and we asked, 'Do you see the O?' 'Yes, we see the O.' Then we typed the G, and the system crashed"... Yet a revolution had begun...

Page 13

Growth of the Web

Page 14

Problems with the current Web

Page 15

Semantic Web Principles

  1. Everything can be identified by URIs
  2. Resources and links can have types
  3. Partial information is tolerated
  4. There is no need for absolute truth
  5. Evolution is supported
  6. Minimalistic design

Make simple things simple, and complex things possible!

Page 16

Semantic Web Languages

  • XML
      - Defines the data language
      - How to encode “words” into a string
  • RDF
      - Defines resources and links
      - How “things” are related to each other
  • OWL
      - Defines ontology
      - What things “mean” and their constraints
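To make the RDF layer concrete: an RDF statement is a subject-predicate-object triple, with resources and link types named by URIs. The sketch below models that in plain Python. The `example.org` URIs, the property names, and the `match` helper are hypothetical and only illustrate the data model; they are not part of any RDF library.

```python
# Hypothetical URIs and property names, used only for illustration.
EX = "http://example.org/"
TRIPLES = [
    (EX + "item42", EX + "type", EX + "Item"),
    (EX + "item42", EX + "soldBy", EX + "acme"),
    (EX + "acme", EX + "type", EX + "Company"),
]


def match(triples, s=None, p=None, o=None):
    """Return the triples that fit the pattern; None acts as a wildcard."""
    return [
        t
        for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


# Which resources have a declared type, and who sells item42?
typed = [s for s, _, _ in match(TRIPLES, p=EX + "type")]
seller = match(TRIPLES, s=EX + "item42", p=EX + "soldBy")[0][2]
print(typed)
print(seller)
```

Because the link itself carries a URI, a program can tell a "soldBy" link from a "type" link, which ordinary HTML hyperlinks do not allow.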

Page 17

Semantic Web Layers

Page 18

Semantic Web Example

<?xml version="1.0" encoding="ISO-8859-1"?>
<rdf:RDF
    xmlns:daml="http://www.daml.org/2001/03/daml+oil#"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns:oiled="http://img.cs.man.ac.uk/oil/oiled#"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
    xmlns:xsd="http://www.w3.org/2000/10/XMLSchema#">
  <daml:Ontology rdf:about="">
    <dc:title>Distribution Company</dc:title>
    <dc:date></dc:date>
    <dc:creator>Anders Arpteg</dc:creator>
    <dc:description></dc:description>
    <dc:subject></dc:subject>
    <daml:versionInfo></daml:versionInfo>
  </daml:Ontology>
  <daml:Class rdf:about="http://www.bugsoft.nu/aa/logics2003/company.daml#item">
    <rdfs:label>item</rdfs:label>
    <rdfs:comment><![CDATA[]]></rdfs:comment>
    <oiled:creationDate><![CDATA[2003-12-17T10:06:20Z]]></oiled:creationDate>
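A short Python sketch of how a program could read such a document, using the standard library's `xml.etree.ElementTree`. The slide's example is cut off before its closing tags, so the excerpt below is a shortened copy with the closing tags added as an assumption so that it parses; the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

# Shortened excerpt of the example above. The closing tags are an assumption
# added here, since the original text ends before them.
DOC = """<rdf:RDF
    xmlns:daml="http://www.daml.org/2001/03/daml+oil#"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#">
  <daml:Ontology rdf:about="">
    <dc:title>Distribution Company</dc:title>
    <dc:creator>Anders Arpteg</dc:creator>
  </daml:Ontology>
  <daml:Class rdf:about="http://www.bugsoft.nu/aa/logics2003/company.daml#item">
    <rdfs:label>item</rdfs:label>
  </daml:Class>
</rdf:RDF>"""

# Prefix-to-namespace map used when searching for qualified element names.
NS = {
    "daml": "http://www.daml.org/2001/03/daml+oil#",
    "dc": "http://purl.org/dc/elements/1.1/",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
}

root = ET.fromstring(DOC)
title = root.find("daml:Ontology/dc:title", NS).text
creator = root.find("daml:Ontology/dc:creator", NS).text

# Map each class URI (the rdf:about attribute) to its human-readable label.
classes = {
    cls.get("{%s}about" % NS["rdf"]): cls.find("rdfs:label", NS).text
    for cls in root.findall("daml:Class", NS)
}

print(title, creator)
print(classes)
```

This treats the file as plain XML; a dedicated RDF parser would instead expose the same content as subject-predicate-object statements.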

Page 19

Summary

  • Problems with the current Web
      - Huge amount of information, needs knowledge management (KM)
      - Machines cannot understand the information
  • Semantic Web technologies
      - Standardized languages
      - Minimalistic approach
  • Good or bad?
      - Nothing really new, we can already do that
      - Amazing, think of all the new possibilities