
Memex: A Browsing Assistant for Collaborative Archiving and Mining of Surf Trails

Soumen Chakrabarti, Sandeep Srivastava, Mallela Subramanyam, Mitul Tiwari

Indian Institute of Technology Bombay

IITB 2000

Sources of Web information

Sources already exploited
• Text on pages (keyword search)
• Links between pages (popularity rating)
• Topic taxonomies (query expansion)

Sources not exploited enough yet
• Public surfing history
• Public bookmarks

Collaboration is central to hypertext

Lack of trust limits collaboration on the Web

Our goals

Infrastructure to support spontaneous formation of topic-based collaborative Web communities
• Browsing assistant client
• Community server

Mining algorithms for personal and community level topic management and collaborative resource discovery

Extensible API for plugging in additional hypertext analysis tools

1: Create a Memex account (password sent by email)

2: Install the Memex applet signing certificate and visit the applet page

3: Allow the Memex client to attach to your Web browser

4: Log on to the Memex server

Memex client applet attaches to browser

Privacy choice

Function tabs

Preparing to import initial bookmarks

Bookmarks imported

For Memex to suggest an initial topic organization, select all bookmarks…

…and send them to the clustering tab

Switch to the clustering tab

URLs to be clustered appear here

Submit the URLs to the server-side Memex clustering demon

Check later if the server has completed the clustering task

Two top-level clusters about software and music
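The slides do not say which clustering algorithm the server-side demon runs. As a rough illustration only, the sketch below groups bookmarks by bag-of-words cosine similarity and merges the two most similar groups until `k` remain. The function names, the example URLs and the page texts are all assumptions made for this sketch, not part of Memex.

```python
import math
import re
from collections import Counter


def tokens(text):
    """Lowercase word tokens of a page's text."""
    return re.findall(r"[a-z]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
        math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def cluster_bookmarks(bookmarks, k):
    """Merge the most similar pair of clusters until k clusters remain.

    bookmarks maps URL -> page text (hypothetical input format).
    """
    clusters = [([url], Counter(tokens(text))) for url, text in bookmarks.items()]
    while len(clusters) > max(k, 1):
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                sim = cosine(clusters[i][1], clusters[j][1])
                if best is None or sim > best[0]:
                    best = (sim, i, j)
        _, i, j = best
        # A merged cluster is represented by the sum of its term counts.
        merged = (clusters[i][0] + clusters[j][0], clusters[i][1] + clusters[j][1])
        clusters = [c for n, c in enumerate(clusters) if n not in (i, j)] + [merged]
    return [sorted(urls) for urls, _ in clusters]


# Made-up bookmarks for illustration.
bookmarks = {
    "http://example.org/gcc": "compiler software source code download",
    "http://example.org/linux": "kernel software source code",
    "http://example.org/jazz": "jazz music albums artists",
    "http://example.org/opera": "opera music artists recordings",
}
top_level = cluster_bookmarks(bookmarks, 2)
```

With these made-up inputs the two software pages end up in one cluster and the two music pages in the other, which mirrors the kind of top-level split shown on the slide.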

Expanding the software cluster to study it in more detail

User can freely reorganize URL placement using cut-and-paste

Moving an entire folder from the cluster tab…

…to the folder tab together with example URLs

Folder names can be edited as per taste; this also gives Memex additional clues about the folder’s contents

New folders can be created to hold clusters found in the cluster tab

A topic hierarchy which is too detailed for the user can be flattened

Groups of closely related URLs can be moved back to folders in the folder tab

Memex helps the user derive a starting topic hierarchy from unstructured bookmarks

The user then continues browsing in multiple sessions. Relevant pages found by other members of the community and made public are available for collaborative surfing

If permission is granted, the Memex applet monitors the trail that the surfer follows and uploads it to the server for further analysis and mining

Such surf trails together with page contents are valuable inputs to the Memex server-side hypertext mining and resource discovery demons

In the background, the Memex classifier finds the most suitable folders to assign to each history item. History is never deleted (disk is cheap). When the user refreshes the view, surf history from others and herself is found categorized into the user’s familiar topic tree.

‘?’ indicates that Memex is not sure about the folder assignment. Users can easily correct mistakes, and this forms additional valuable training data.
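The slides do not describe the classifier itself. One way to picture the "?" marker and the correction loop is a small multinomial naive Bayes model that flags an assignment as unsure when the top two folders score too closely. This is a sketch under that assumption; `FolderClassifier`, `min_margin` and the training texts are hypothetical.

```python
import math
from collections import Counter, defaultdict


class FolderClassifier:
    """Toy naive Bayes folder classifier (an assumption, not Memex's model)."""

    def __init__(self):
        self.term_counts = defaultdict(Counter)  # folder -> term counts
        self.doc_counts = Counter()              # folder -> training pages seen
        self.vocab = set()

    def train(self, folder, text):
        """Add one labeled page; user corrections can be fed back here."""
        words = text.lower().split()
        self.term_counts[folder].update(words)
        self.doc_counts[folder] += 1
        self.vocab.update(words)

    def classify(self, text, min_margin=1.0):
        """Return (best_folder, unsure); unsure corresponds to the '?' marker."""
        words = text.lower().split()
        total_docs = sum(self.doc_counts.values())
        scores = {}
        for folder, counts in self.term_counts.items():
            denom = sum(counts.values()) + len(self.vocab)  # Laplace smoothing
            score = math.log(self.doc_counts[folder] / total_docs)
            for w in words:
                score += math.log((counts[w] + 1) / denom)
            scores[folder] = score
        ranked = sorted(scores, key=scores.get, reverse=True)
        best = ranked[0]
        unsure = len(ranked) > 1 and scores[best] - scores[ranked[1]] < min_margin
        return best, unsure


clf = FolderClassifier()
clf.train("Software", "linux kernel source code")
clf.train("Music", "jazz albums artists music")

confident = clf.classify("linux source download")
doubtful = clf.classify("download page")

# The user corrects the doubtful page; the correction is new training data.
clf.train("Music", "download page")
after_correction = clf.classify("download page")
```

The margin test is only one possible definition of "not sure"; a calibrated probability threshold would serve the same purpose.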

Automatic collaborative classification also lets users return to a topic-restricted surfing context quickly, and replay the last few surfing actions within that topic of interest.
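Once every history item carries a folder label, replaying a topic-restricted context amounts to filtering the history by folder and taking the most recent entries. The sketch below assumes a history of `(timestamp, url, folder)` tuples; that record layout and the name `replay_topic` are illustrative, not taken from Memex.

```python
def replay_topic(history, folder, last_n=3):
    """Return the last_n URLs visited within one folder, oldest first."""
    if last_n <= 0:
        return []
    in_topic = [(when, url) for when, url, topic in history if topic == folder]
    in_topic.sort()  # order by timestamp
    return [url for _, url in in_topic[-last_n:]]


# Made-up history records: (timestamp, url, folder).
history = [
    (1, "http://example.org/gcc", "Software"),
    (2, "http://example.org/jazz", "Music"),
    (3, "http://example.org/linux", "Software"),
    (4, "http://example.org/opera", "Music"),
    (5, "http://example.org/make", "Software"),
]
recent_software = replay_topic(history, "Software", last_n=2)
```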

Personalized topic-based history management is far superior to the one-dimensional history list provided by popular browsers

Users can switch topics with a single click, and browsing is not limited by the linear “back and forward” paradigm supported by browsers.

A flexible interactive search lets the user locate any page ever visited from anywhere using this account, combining content with popularity, site selections and timeliness
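How Memex weighs content, popularity, site selection and timeliness is not stated on the slide. The sketch below shows one plausible combination: a site filter, a keyword match fraction, a log-damped visitor count and an exponential recency decay. The `Visit` record, the scoring formula and the `half_life_days` parameter are all assumptions for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class Visit:
    url: str
    text: str
    visitors: int     # community members who visited the page (assumed field)
    age_days: float   # days since the most recent visit (assumed field)


def search(visits, query, site=None, half_life_days=30.0):
    """Rank visited pages by content match, popularity and recency."""
    terms = query.lower().split()
    if not terms:
        return []
    results = []
    for v in visits:
        if site and site not in v.url:      # site selection as a substring filter
            continue
        words = set(v.text.lower().split())
        content = sum(t in words for t in terms) / len(terms)
        if content == 0:
            continue
        popularity = math.log1p(v.visitors)
        timeliness = 0.5 ** (v.age_days / half_life_days)
        score = content * (1 + popularity) * (1 + timeliness)
        results.append((score, v.url))
    return [url for _, url in sorted(results, reverse=True)]


# Made-up visit records.
visits = [
    Visit("http://a.example/x", "java applet security", visitors=1, age_days=1),
    Visit("http://b.example/y", "java applet", visitors=10, age_days=1),
    Visit("http://a.example/z", "jazz music", visitors=50, age_days=1),
]
everywhere = search(visits, "java applet")
one_site = search(visits, "java applet", site="a.example")
```

Here the more widely visited page ranks first when both pages match the query equally, and the site filter narrows the result to a single host.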

Close integration of the Memex client with the browser is non-trivial to implement but adds greatly to comfort and ease of use

Memex system diagram

[Figure: Memex system diagram, with these labeled parts]
• Browser and Memex server: Visit, Download (client JAR), Attach (running client applet)
• Memex client-server protocol and workload sharing negotiations
• Event-handler servlets: Search, Folder, Context, Archive
• Relational metadata, Text index, Topic models
• Mining demons: Taxonomy synthesis, Resource discovery, Recommendation, Classification, Clustering

Document workflow

[Figure: document workflow, with these labeled parts]
• Browser and Memex client: page visit and bookmarking events logged
• NODE table
• Per-document version queue: push new version; pop and discard old version
• Demon Registry: Crawler, Search indexer, Classifier service, Clustering service, Garbage collector
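The diagram names a per-document version queue, a demon registry and a garbage collector, but not the exact protocol between them. One possible reading is that each registered demon reads versions in order and an old version is discarded once every demon has read it. The sketch below implements that reading; the class, its methods and the demon names passed in are hypothetical.

```python
from collections import deque


class VersionQueue:
    """Per-document queue of versions shared by registered demons (a sketch)."""

    def __init__(self, demons):
        self.versions = deque()               # oldest version on the left
        self.base = 0                         # sequence number of versions[0]
        self.cursor = {d: 0 for d in demons}  # next sequence number per demon

    def push(self, version):
        """Push a new version of the document."""
        self.versions.append(version)

    def next_for(self, demon):
        """Return the next version this demon has not yet read, or None."""
        seq = self.cursor[demon]
        if seq - self.base >= len(self.versions):
            return None
        self.cursor[demon] = seq + 1
        return self.versions[seq - self.base]

    def collect_garbage(self):
        """Pop and discard versions that every registered demon has read."""
        discarded = []
        while self.versions and \
                min(self.cursor.values(), default=self.base) > self.base:
            discarded.append(self.versions.popleft())
            self.base += 1
        return discarded


queue = VersionQueue(["indexer", "classifier"])
queue.push("v1")
queue.push("v2")
first_read = queue.next_for("indexer")        # the indexer reads v1
kept = queue.collect_garbage()                # classifier still needs v1
queue.next_for("classifier")                  # the classifier reads v1
dropped = queue.collect_garbage()             # v1 is no longer needed
```

Keeping one cursor per demon lets slow and fast demons proceed independently while bounding how many old versions must be retained.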

Autonomous topic organization

Bookmarks often collected into topics

Surfers use personal topic organization

One-size-fits-all taxonomy inadequate
• Many topics over-developed for most of us
• http://dmoz.org/Sports/Hockey/Underwater_Hockey/
• But deeper interests often underdeveloped
• Structure reorganization also desirable

Best taxonomy depends on community behavior as well as page content

Autonomy and collaboration

Personalization picking Yahoo nodes

Complex relations between topics

Need “simplest common ground”
• Coalesce similar topics where possible…
• …without sacrificing individual taste

[Figure: topic trees for Yahoo, User1, User2 and User3 over the nodes Sports, Hiking, Cycling, Biz, Shops and Bike shops, illustrating subsumption and tree ‘inversion’]

Taxonomy synthesis example

Generating themes makes map simpler

But distorts contents of original folders

Joint optimization gives best themes

[Figure: the folders Entertainment, Studios, Broadcasting and Media hold kpfa.org, bbc.co.uk, kron.com, channel4.com, kcbs.com, foxmovies.com, miramax.com and lucasfilms.com; the labels Share document, Share folder and Share terms lead to the themes ‘Radio’, ‘Television’ and ‘Movies’]

Summary and project status

Collaborative resource discovery and topic management system

Testbed for hypertext mining research

Signed Java2 client
• Netscape 4.5+ available
• IE5+ planned

Server for Unix and Windows
• IBM UDB, Berkeley DB, servlets
• Non-trivial to install and manage
• Simple-to-use RPMs being planned

http://www.cse.iitb.ernet.in/~soumen
