22
Locality in Search Engine Queries and Its Implications for Caching Yinglian Xie David O’Hallaron May 2001 CMU-CS-01-128 School of Computer Science Carnegie Mellon University Pittsburgh, PA 15213 Abstract Caching is a popular technique for reducing both server load and user response time in dis- tributed systems. In this paper, we are interested in the question of whether caching might be effective for search engines as well. We study two real search engine traces by examining query locality and its implications for caching. Our trace analysis results show that: (1) Queries have significant locality, with query frequency following a Zipf distribution. Very popular queries are shared among different users and can be cached at servers or proxies, while 16% to 22% of the queries are from the same users and should be cached at the user side. Multiple-word queries are shared less and should be cached mainly at the user side. (2) If caching is to be done at the user side, short-term caching for hours will be enough to cover query temporal locality, while server/proxy caching should be based on longer periods such as days. (3) Most users have small lexicons when submitting queries. Frequent users who submit many search requests tend to reuse a small subset of words to form queries. Thus, with proxy or user side caching, prefetching based on user lexicon looks promising. Effort sponsored in part by the Advanced Research Projects Agency and Rome Laboratory, Air Force Materiel Command, USAF, under agreement number F30602-96-1-0287, in part by the National Science Foundation under Grant CMS-9980063, and in part by a grant from the Intel Corporation.

Locality in Search Engine Queries and Its Implications for Caching · 2011-05-14 · findings about user lexicon analysis and its implications. Finally, we review analysis results

  • Upload
    others

  • View
    0

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Locality in Search Engine Queries and Its Implications for Caching · 2011-05-14 · findings about user lexicon analysis and its implications. Finally, we review analysis results

Locality in SearchEngineQueriesandItsImplicationsfor Caching

YinglianXie David O’Hallaron

May 2001CMU-CS-01-128

Schoolof ComputerScienceCarnegieMellon University

Pittsburgh,PA 15213

Abstract

Cachingis a populartechniquefor reducingboth server load anduserresponsetime in dis-tributedsystems.In this paper, we areinterestedin thequestionof whethercachingmight beeffectivefor searchenginesaswell. Westudytwo realsearchenginetracesby examiningquerylocality andits implicationsfor caching.Our traceanalysisresultsshow that: (1) Querieshavesignificantlocality, with queryfrequency following a Zipf distribution. Very popularqueriesaresharedamongdifferentusersandcanbe cachedat serversor proxies,while 16%to 22%of the queriesarefrom the sameusersandshouldbe cachedat the userside. Multiple-wordqueriesaresharedlessandshouldbecachedmainlyat theuserside.(2) If cachingis to bedoneat theuserside,short-termcachingfor hourswill beenoughto cover querytemporallocality,while server/proxycachingshouldbe basedon longerperiodssuchasdays. (3) Most usershavesmalllexiconswhensubmittingqueries.Frequentuserswhosubmitmany searchrequeststendto reusea small subsetof wordsto form queries.Thus,with proxy or usersidecaching,prefetchingbasedonuserlexicon lookspromising.

Effort sponsoredin partby theAdvancedResearchProjectsAgency andRomeLaboratory, Air ForceMaterielCommand,USAF, underagreementnumberF30602-96-1-0287,in partby theNationalScienceFoundationunderGrantCMS-9980063,andin partby a grantfrom theIntel Corporation.

Page 2: Locality in Search Engine Queries and Its Implications for Caching · 2011-05-14 · findings about user lexicon analysis and its implications. Finally, we review analysis results

Report Documentation Page Form ApprovedOMB No. 0704-0188

Public reporting burden for the collection of information is estimated to average 1 hour per response, including the time for reviewing instructions, searching existing data sources, gathering andmaintaining the data needed, and completing and reviewing the collection of information. Send comments regarding this burden estimate or any other aspect of this collection of information,including suggestions for reducing this burden, to Washington Headquarters Services, Directorate for Information Operations and Reports, 1215 Jefferson Davis Highway, Suite 1204, ArlingtonVA 22202-4302. Respondents should be aware that notwithstanding any other provision of law, no person shall be subject to a penalty for failing to comply with a collection of information if itdoes not display a currently valid OMB control number.

1. REPORT DATE MAY 2001 2. REPORT TYPE

3. DATES COVERED 00-00-2001 to 00-00-2001

4. TITLE AND SUBTITLE Locality in Search Engine Queries and Its Implications for Caching

5a. CONTRACT NUMBER

5b. GRANT NUMBER

5c. PROGRAM ELEMENT NUMBER

6. AUTHOR(S) 5d. PROJECT NUMBER

5e. TASK NUMBER

5f. WORK UNIT NUMBER

7. PERFORMING ORGANIZATION NAME(S) AND ADDRESS(ES) Carnegie Mellon University,School of Computer Science,Pittsburgh,PA,15213

8. PERFORMING ORGANIZATIONREPORT NUMBER

9. SPONSORING/MONITORING AGENCY NAME(S) AND ADDRESS(ES) 10. SPONSOR/MONITOR’S ACRONYM(S)

11. SPONSOR/MONITOR’S REPORT NUMBER(S)

12. DISTRIBUTION/AVAILABILITY STATEMENT Approved for public release; distribution unlimited

13. SUPPLEMENTARY NOTES

14. ABSTRACT

15. SUBJECT TERMS

16. SECURITY CLASSIFICATION OF: 17. LIMITATION OF ABSTRACT

18. NUMBEROF PAGES

21

19a. NAME OFRESPONSIBLE PERSON

a. REPORT unclassified

b. ABSTRACT unclassified

c. THIS PAGE unclassified

Standard Form 298 (Rev. 8-98) Prescribed by ANSI Std Z39-18

Page 3: Locality in Search Engine Queries and Its Implications for Caching · 2011-05-14 · findings about user lexicon analysis and its implications. Finally, we review analysis results

Keywords: searchengines,querylocality, caching,traceanalysis,userlexicon,prefetching

Page 4: Locality in Search Engine Queries and Its Implications for Caching · 2011-05-14 · findings about user lexicon analysis and its implications. Finally, we review analysis results

1 Introduction

Cachingis an important techniqueto reduceserver workloadand userresponsetime. Forexample,clientscansendrequeststo proxies,which thenrespondusinglocally cacheddata.By cachingfrequentlyaccessedobjectsin the proxy cache,the transmissiondelaysof theseobjectsareminimizedbecausethey areserved from nearbycachesinsteadof remoteservers.In addition,by absorbinga portionof theworkload,proxy cachescanincreasethecapacityofbothserversandnetworks,therebyenablingthemto serviceapotentiallylargerclientele.

We areinterestedin thequestionof whethercachingmight beeffective for searchenginesaswell. Becauseservinga searchrequestalsorequiresa significantamountof computationaswell asI/O andnetworkbandwidth,cachingsearchresultscouldimproveperformancein threeways. First, repeatedqueryresultsarefetchedwithout redundantprocessingto minimize theaccesslatency. Second,becauseof thereductionin server workload,scarcecomputingcyclesin theserver aresaved, allowing thesecyclesto be appliedto moreadvancedalgorithmsandpotentiallybetterresults.Finally, by disseminatinguserrequestsamongproxy caches,we candistributepartof thecomputationaltasksandcustomizesearchresultsbasedonusercontextualinformation.

AlthoughWebcachinghasbeenwidely studied,few researchershave tackledtheproblemof cachingsearchengineresults. While it is alreadyknown that searchenginequerieshavesignificantlocality, several importantquestionsarestill open:

� Whereshouldwe cachesearchengineresults? Shouldwe cachethemat the server’smachine,at theuser’s machine,or in intermediateproxies?To determinewhich typeofcachingwouldresultin thebesthit rates,weneedto look atthedegreeof querypopularityateachlevel andthe“shareness”of thequeries.1

� How long shouldwe keepa queryin cachebeforeit becomesstale?That is, do querieshave strongtemporallocality?

� Whatotherbenefitsmightaccruefrom caching?Sincebothproxyandclientsidecachingaremoredistributedwaysof servingsearchrequests,canwe prefetchor rerankqueryresultsbasedon individualuserrequirements?

In this paper, we studytwo realsearchenginetracesandinvestigatetheir implicationsforcachingsearchengineresultswith respectto the above questions.Our analysisyielded thefollowing key results:

� Querieshave significantlocality. About 30%to 40%of queriesarerepeatedqueriesthathave beensubmittedbefore.Queryrepetitionfrequency followsa Zipf distribution. Thepopularquerieswith highrepetitionfrequenciesaresharedamongdifferentusersandcan

1The sharenessof a queryA is computedas the numberof distinct userswho submittedA over sometimeperiod.Thus,sharenessis ameasureof a query’spopularityacrossmultipleusers.

1

Page 5: Locality in Search Engine Queries and Its Implications for Caching · 2011-05-14 · findings about user lexicon analysis and its implications. Finally, we review analysis results

becachedat serversor proxies.Queriesarealsofrequentlyrepeatedby thesameusers.About16%to 22%of all queriesarerepeatedqueriesfrom thesameusers,whichshouldbecachedat theuserside. Multiple-word querieshave lesssharenessandthuscanalsobecachedmainlyat theuserside.

� Themajorityof therepeatedqueriesarereferencedagainwithin shorttime intervals.Butthereremainsa significantportion of queriesthat arerepeatedwithin relatively longertime intervals,which arelargely sharedby differentusers.So if cachingis to be doneat the userside,short-termcachingfor hourswill be enoughto cover query temporallocality, while server/proxycachingshouldbe basedon longerperiods,on theorderofdays.

� Most usershave small lexicons whensubmittingqueries. Frequentuserswho submitmany searchrequeststendto reusea smallsubsetof wordsto form queries.Thus,withproxy or usersidecaching,prefetchingbasedon theuser’s lexicon is promising.Proxyor usersidecachingalsoprovideuswith opportunitiesto improvequeryresultsbasedonindividualuserpreferences,which is animportantfutureresearchdirection.

In therestof thepaper, we first discussrelatedworksin Section1.1. We thendescribethetraceswe analyzedandsummarizethegeneralstatisticsof thedatain Section2. In Section3,we focuson repeatedqueriesanddiscussquerylocality in both traces.Section4 presentsourfindingsaboutuserlexiconanalysisandits implications.Finally, wereview analysisresultsanddiscusspossiblefutureresearchdirections.

1.1 Related works

Dueto theexponentialgrowth of theWeb,therehasbeenmuchresearchon theimpactof Webcachingandhow to maximizeits performancebenefits.Most Web browserssupportcachingdocumentsin theclient’s memoryor local disk to reducethe responsetime of theclient. De-ployingproxiesbetweenclientsandserversyieldsanumberof performancebenefits.It reducesserver load,networkbandwidthusageaswell asuseraccesslatency [5, 8, 11,12]. Prefetchingdocumentsto proxiesor clientshasalsobeenstudiedfor furtherperformanceimprovementbyutilizing useraccesspatterns[4, 7].

Therearealsostudiesof searchenginetraces.Jasenetal analyzedtheExcitesearchenginetraceto determinehow userssearchtheWebandwhatthey searchfor [6]. Silversteinetal ana-lyzedtheAltavistasearchenginetrace[13], studyingtheinteractionof termswithin queriesandpresentingresultsof a correlationanalysisof the log entries.Althoughthesestudieshave notfocusedon cachingsearchengineresults,all of themsuggestquerieshave significantlocality,whichparticularlymotivatesourwork.

Queryresultcachinghasalsobeeninvestigatedasa wayto reducethecostof queryexecu-tion in distributeddatabasesystemsby cachingtheresultsof ’similar’ queries[3, 14]. Recently,Markatoshasstudiedthequerylocality basedon theExcitetraceandshown that

�����������

2

Page 6: Locality in Search Engine Queries and Its Implications for Caching · 2011-05-14 · findings about user lexicon analysis and its implications. Finally, we review analysis results

of thequeriesarerepeatedones[9, 10]. He suggestsa server-sidequeryresultcacheandhasmainly focusedon leveragingdifferentcachereplacementalgorithms.Our work builds on thisby systematicallystudyingquerylocality andderiving theimplicationsfor cachingsearchen-gineresults.

2 The search engine query traces

The two traceswe analyzedarefrom the Vivisimo searchengine [2] andthe Excite searchengine [1] respectively. In this section,we briefly takea look at the two searchenginesandreview their tracedata.

2.1 The Vivisimo and the Excite search engines

Vivisimo is a clusteringmeta-searchenginethat organizesthe combinedoutputsof multiplesearchengines.Uponreceptionof eachuserquery, Vivisimo combinesthe resultsfrom othersearchenginesandorganizesthesedocumentsinto meaningfulgroups.Thegroupingsaregen-erateddynamicallybasedon extractsfrom thedocuments,suchastitles, URLs, andshortde-scriptions.By default,Vivisimo refersto oneor multiple majorsearchengines,including(ca.Feb. 2001)Yahoo,Altavista,Lycos,Excite,andreturns200combinedresultsusinglogic op-eration’ALL ’. Vivisimo alsosupportsadvancedsearchoptionswhereuserscanspecifywhichsearchenginesto query, thenumberof resultsto bereturned,andwhich logic operationto beperformedon thequery, includingANY, PHRASEandBOOLEAN.

Exciteis abasicsearchenginethatautomaticallyproducessearchresultsby listing relevantweb sitesandinformationuponreceptionof eachuserquery. Capitalizationof the query isdisregardedand the default logic operationto be performedis ’ALL ’. It alsosupportsotherlogic operationslike ’AND’, ’OR’, ’AND NOT’. More advancedsearchingfeaturesof Exciteincludewidecardmatching,’PHRASE’ searchingandrelevancefeedbacks.

2.2 The query trace descriptions

TheVivisimoquerytracewascollectedfrom January14,2001to February17,2001,soonaftertheVivisimo launchin earlyJanuary, 2001. Thetracecapturesthebehavior of earlyadopterswho maynot berepresentative of a steadystateusergroup.TheExcitetracewascollectedonDecember20,1999.In bothtraces,eachentrycontainsthefollowing fieldsof interest:

� an anonymous ID identifying theuserIP address.For privacy reasons,we do not haveactualuserIP addresses.EachIP addressin the original traceis replacedby a uniqueanonymousID.

� a timestamp specifyingwhentheuserrequestis received.Thetimestampis recordedasthewall clock timewith a 1 secondresolution.

3

Page 7: Locality in Search Engine Queries and Its Implications for Caching · 2011-05-14 · findings about user lexicon analysis and its implications. Finally, we review analysis results

Trace Vivisimotrace ExcitetraceStart-time 14/Jan/2001:04:0220/Dec/1999:09:00Stop-time 17/Feb/2001:00:00 20/Dec/1999:16:59Numberof bytes 657,623,865 118,318,788Numberof HTTPrequests 2,588,827 notknownNumber of user queries (includingnext pagerequests)

205,342 2,477,283

Number of user queries (excludingnext pagerequests)

110,881 1,920,997

Numberof distinctuserqueries 75,343 1,099,682Numberof multiplewordqueries 77,181 1,429,618Numberof queriesusingdefaultlogicoperation(ALL)

107,880 1,792,174

Numberof users 20,220 520,883Averagequeriessubmittedperuser 5.48 3.69Averagenumberof termsin aquery 2.22 2.63

Table1: Tracestatisticalsummary. Thenumberof HTTPrequestscannotbeinferredfrom theExcitetracesincethetracedid not containinformationaboutHTTPrequestsfrom users.

� a query string submittedby theuser. If any advancedqueryoperationsareselected,theywill alsobespecifiedin thisstring.

� a number indicatingwhethertherequestis for next pageresultsor a new userquery.

2.3 Statistical summaries of the traces

After extracting a query string from eachtraceentry, we transformthe string to a uniformformatfor easyprocessing.Weremove stopwordsfrom thequerybecausemostsearchenginesdiscardthem anyway. We convert all query termsto lower caseand the query is thus caseinsensitive, which is alsotypical for searchengines.However, the removal of the stopwordsandthe upper-to-lower caseconversionactuallyhave little impacton our analysisresults. Itaffectsour statisticsby about1% andtheeffect couldbeignored. In therestof thepaper, weuse’query’ to denoteall thewordsasa wholeenteredby theuserin a querysubmission,and’words’ or ’ terms’to denotetheindividualwordscontainedin auserquery. Becausewecannotdistinguishuserswhousedmultiple IP addressesor userswhosharedIP addressesin thetrace,we uniformly use’user’ to denotetheIP addresswherethequerycamefrom.

Table1 summarizesthe statisticsaboutthe traces.TheExcite tracelastsfor 8 hoursin asingledayandtheVivisimo tracewascollectedmorerecentlyover a periodof 35 days.Thus,the two tracesprovide uswith both long-termandshort-termviews to userqueriessincethey

4

Page 8: Locality in Search Engine Queries and Its Implications for Caching · 2011-05-14 · findings about user lexicon analysis and its implications. Finally, we review analysis results

standfor differenttime scales.Several factsareobvious from this summaryfor both traces.First,usersdonot issuemany next-pagerequests.Lessthantwo pagesonaverageareexaminedfor eachquery. Second,usersdo repeatqueriesa lot. Over32% of thequeriesin theVivisimotracearerepeatedonesthathave beensubmittedbeforeby eitherthe sameuseror a differentuser, while morethan42% of thequeriesarerepeatedqueriesin theExcite trace. Third, themajority of usersdo not useadvancedqueryoptions: 97% of the queriesfrom the Vivisimotraceand93% of thequeriesfrom theExcite traceusethe defaultlogic operationofferedbythecorrespondingsearchengines.Fourth,userson averagedonot submita lot of queries.Theaveragenumbersof queriessubmittedby a userare5.48 and3.69 respectively. Finally, about70% of the queriesconsistof morethanoneword, althoughthe averagequerylengthis lessthan threeterms,which is short. Figures1 shows the query lengthdistributionsof the twotraces.We canobserve that mostof the queriesarelessthanfive termslong. Overall, theseresultsareconsistentwith thosereportedin [6] and [13] andthusarenotsurprising.

1 2 3 4 5 6 7 8 9 10 >100

5

10

15

20

25

30

35

40

Number of words in a query

(%)

Per

cent

age

of th

e qu

erie

s

1 2 3 4 5 6 7 8 9 10 >100

5

10

15

20

25

30

35

Number of words in a query

(%)

Per

cent

age

of th

e qu

erie

s

(a)theVivisimotrace (b)theExcitetrace

Figure1: Userquerydistributionaccordingto thenumberof wordsin eachquery

3 Query locality and its implications

As mentionedin Section2.3,32%to 42%of thequeriesin thetracearerepeatedqueries,whichsuggestscachingasa way to reduceserver workloadandnetworktraffic. In this section,wefocuson thestudyof repeatedqueries,anddiscusshow thelocality in thesequeriesmotivatesdifferentkindsof caching.

3.1 Query repetition distribution

Amongthe35,538 queriesthatarerepeatedonesin theVivisimotrace,thereare16,162 distinctqueries. This meanson average,eachrepeatedquerywas submitted3.20 times. Similarly,

5

Page 9: Locality in Search Engine Queries and Its Implications for Caching · 2011-05-14 · findings about user lexicon analysis and its implications. Finally, we review analysis results

0 1 2 3 4 50

0.5

1

1.5

2

2.5

Query ID(10 based log)

Num

ber

of ti

mes

rep

eate

d(10

bas

ed lo

g)

0 1 2 3 4 5 60

0.5

1

1.5

2

2.5

3

3.5

4

Query ID(10 based log)

Num

ber

of ti

mes

rep

eate

d(10

bas

ed lo

g)

(a)TheVivisimotrace (b)TheExcitetrace

Figure 2: Distribution of the query repetitionfrequency (logarithmic scaleson both axes).QueryIDs aresortedby thenumberof timesbeingrepeated.

eachrepeatedqueryin theExcitetracewassubmitted4.49 times,with 235,607 distinctqueriesamong821,315 repeatedones.Interestingly, wefoundthatthequeryrepetitionfrequenciescanbe characterizedby Zipf distributions. Figure2 plots the distributionsof the repeatedqueryfrequenciesfor thetraces.Eachdatapoint representsonerepeatedquery, with X axisshowingqueriessortedby therepetitionfrequency: thefirst queryis themostpopularone,thesecondqueryis thesecond-mostrepeatedquery, andsoonuntil wereachthequerynumber16,162and235,607which wereonly repeateda singletime.

We arealsointerestedin whethertherepeatedquerieswerefrom thesameusersor sharedby differentusers.Out of the 32.05% of the repeatedqueriesin the Vivisimo trace,70.58%arefrom thesameusers.In theExcitetrace,we have42.75% of therepeatedqueries,of which37.35% are repeatedby the sameusers. Therefore,about22% and16% of all queriesarerepeatedqueriesfrom thesameusersin theVivisimotraceandtheExcitetracerespectively.

We thencountedthenumberof distinctqueriesthatwererepeatedby morethanoneuser.Thereare5,675 suchqueriesfrom theVivisimotraceand135,020 suchqueriesfrom theExcitetrace.Thedistributionsof thesequeries,however, arenon-uniformaccordingto thefrequencyof the repetitions.Table2 shows thatqueriesrepeatedat least10 timesaremorelikely to besharedby multipleusersthanqueriesrepeatedlessthan10times.In theVivisimotrace,amongthe 404 queriesthat wererepeatedat least10 times,78.96% weresharedby differentusers.For the queriesthat wererepeatedlessthan10 times,only 33.99% wereshared.The sametrendholdsfor theExcitetrace.Amongthe12,071 queriesrepeatedat least10 times,asmanyas99.08% areshared,while only 55.05% aresharedover the queriesrepeatedlessthan10times.GiventheZipf distributionfollowedby thequeryrepetitionfrequency, therearea smallnumberof queriesthatwererepeatedveryfrequently, whichwerebothsharedby differentusersandrepeatedby thesameusers.Therealsoexist a largenumberof queriesthatwererepeatedonly a few times,which weremostlyfrom thesameusers.

Theseresultssuggestthat we shouldcachequeryresultsin differentways. For the small

6

Page 10: Locality in Search Engine Queries and Its Implications for Caching · 2011-05-14 · findings about user lexicon analysis and its implications. Finally, we review analysis results

Typeof queries Queriesrepeated� 10 times Queriesrepeated 10 timesshared notshared shared notshared

Numberof appearances 319 85 5356 10402Percentage 78.96% 21.04% 33.99% 66.01%

Table2 (a)Sharenessof thequeriesfrom theVivisimotrace

Typeof queries Queriesrepeated� 10 times Queriesrepeated 10 timesshared notshared shared notshared

Numberof appearances 11,960 111 123,060 100,476Percentage 99.08% 0.92% 55.05% 45.95%

Table2 (b) Sharenessof thequeriesfrom theExcitetrace

Table2: Sharenessof thequeriesbasedon their repetitionfrequencies.A querywassharedifit wassubmittedby morethanoneuserin theexactsameform. Otherwise,a querywasonlysubmittedby a singleuserandwasnot shared

numberof queriesthat werevery popular, we could cachethemat searchengineservers toexploit the high degreeof sharenessamongdifferentusers.We couldalsocachethemat theuserside, which would reducethe server processingoverheadsdue to their high repetitionfrequencies.For the othertype of queriesthat areonly repeatedby the sameusers,cachingthemat serversis not very effective, consideringthe limited server resourcesandthe diverserequirementsfrom thelargenumberof users.Instead,we couldconsidercachingthesequeriesat theusersideaccordingto theuniquerequirementof eachindividualuser.

3.2 Query locality based on individual user

The querysharenessdistribution indicatesthat querylocality exists with respectto the sameusersaswell asamongdifferentusers.More specifically, mostof the queriesat the long tailwererepeatedby thesameusers.Therefore,we furtherexplorequerylocality basedon eachindividualuser.

Thereare20,220 usersrecordedin theVivisimo trace,of whom6,628 repeatedqueriesatleastonce.Eachuseron averagerepeated5.36 queries.In theExcite trace,thereare520,883userswho submittedqueries. Among them,136,626 usersrepeatedqueries,with eachuserrepeating6.01 querieson average.Figure3 plots thepercentageof the repeatedqueriesoverthetotalnumberof queriessubmittedby eachuser. For example,if ausersubmitted10queriesin total,andoutof the10queries,5 wererepeatedonesthathadbeensubmittedbefore,thentheuserhas50%of therepeatedqueries.Frombothtraces,we observe that,80% to 90% of theuserswho repeatedquerieshadat least20% of therepeatedqueries,andaroundhalf of theseusershadat least50% of therepeatedqueries.

From theseresults,we canseethatnot only a lot of usersrepeatedqueries,but eachuser

7

Page 11: Locality in Search Engine Queries and Its Implications for Caching · 2011-05-14 · findings about user lexicon analysis and its implications. Finally, we review analysis results

0 1000 2000 3000 4000 5000 6000 70000

20

40

60

80

100

User ID

(%)P

erce

ntag

e of

the

repe

ated

que

ries

0 2 4 6 8 10 12 14

x 104

0

0.2

0.4

0.6

0.8

1

User ID

(%)P

erce

ntag

e of

the

repe

ated

que

ries

(a)TheVivisimotrace (b)TheExcitetrace

Figure3: Thepercentageof therepeatedqueriesover thenumberof queriessubmittedby eachuser(Only thoseuserswhorepeatedqueriesareplottedhere).

alsorepeatedqueriesa lot. Consideringthat 16%to 22%of all querieswererepeatedby thesameusersandthatsearchengineserverscancacheonly limited data,cachingthesequeriesandqueryresultsbasedonindividualuserrequirementsin amoredistributedwayis thusimportant.It reducesuserquerysubmissionoverheadsandaccesslatenciesas well asserver load. Bycachingqueriesat theuserside,wealsohave theopportunityto improvequeryresultsbasedonindividualusercontext, which cannotbeachievedby cachingqueriesata centralizedserver.

3.3 Temporal query locality

In this section,we quantifythenotionof temporalquerylocality, that is, thetendency of usersto repeatquerieswithin a shorttime interval. Figure6 shows thenumberof queriesrepeatedwithin differenttime intervals. Overall, querieswererepeatedwithin shortperiodsof time. IntheVivisimo trace,about65% of thequerieswererepeatedwithin an hour. SincetheExcitetracelastsfor only 8 hours,asmany as83% of thequerieswererepeatedwithin an hour.

In the Vivisimo trace,thesequeriesweremostly from the sameusers.More specifically,45.5% of thequerieswererepeatedby thesameuserswithin 5 minutes, which is very short.Therearealsomany queriesthatwererepeatedover relatively longertime intervals; they arelargely sharedby differentusers.Out of the21.98% of thequeriesthatwererepeatedover aday, only 5.09% camefrom thesameusers.TheExcitetracegenerallyhasasmallerpercentageof thequeriesrepeatedby thesameusers.But westill observe thesamepattern:theshorterthetime interval, themorelikely a querywill berepeatedby thesameuser.

Thesestatisticssuggestthatif cachingqueryresultsis to bedoneat serversor proxies,thenwe shouldconsiderlongerperiodof cachingin order to exploit maximumsharenessamongdifferentusers.Usually, cachingqueryresultsfor a long time is morelikely to resultin staledata.Sincethis is to beperformedby theserver, thestaledatacanberemovedin timewhenevertheserverupdatestheresults.If cachingis to beperformedat theuserside,thenshortperiodof

8

Page 12: Locality in Search Engine Queries and Its Implications for Caching · 2011-05-14 · findings about user lexicon analysis and its implications. Finally, we review analysis results

< 1 min 1−5 mins 5−10 mins 10−30mins 30−60mins 1h − 1day > 1 day 0

5

10

15

20

25

time intervals

(%)P

erce

ntag

e of

rep

eate

d qu

erie

s number of queries repeated overall number of queries repeated by the same users

< 1 min 1−5 mins 5−10 mins 10−30mins 30−60mins 1h − 8h 0

5

10

15

20

25

30

time intervals

(%)P

erce

ntag

e of

rep

eate

d qu

erie

s number of queries repeated overall number of queries repeated by the same users

(a)TheVivisimotrace (b)TheExcitetrace

Figure4: Repeatedquerydistributionwithin differenttime intervals

cachingwouldbeenoughto cover mostof thetemporalquerylocality, which is alsolesslikelyto resultin staledata.

3.4 Multiple-word query locality

Multiple-wordqueriesareimportantbecausethey accountfor about70% of thequeries.Multiple-wordqueriesalsohave significantlocality, whichwill bediscussedin this section.

Table3 summarizesthestatisticsaboutmultiple-wordqueries.TheExcitetracehasalargerportionof multiple-wordqueriesthantheVivisimotrace,but bothtraceshave asmany as62%of the multiple-wordqueriesover the repeatedones. We observe that multiple-wordqueriesarerepeatedlessfrequentlycomparedwith single-wordqueries. In the Vivisimo trace,eachrepeatedmultiple-wordquerywassubmitted3.00 timeson average,comparedwith 3.63 timesfor single-wordcases.In the Excite trace,eachrepeatedmultiple-wordquerywassubmitted3.77 times,comparedwith 7.12 timesfor single-wordcases.

Wealsoobserve thatmultiple-wordquerylocality mostlyexistsamongthesameusers,thatis, multiple-wordquerieshave lessdegreeof shareness.Table4 showsthecomparisonbetweenmultiple-wordqueriesandsingle-wordquerieswith respectto theirdegreesof shareness.Fromboth traces,we can seethat multiple-word queriesare lesslikely to be sharedby differentusers.Eachsharedmultiple-wordqueryalsotendsto besharedby fewer users.This is easilyexplainedbecausethechancesfor differentusersto submitthesamemultiple-wordqueriesaremuchsmallerthanthosefor single-wordqueries.

Fromtheabove results,wecanseethatmultiple-wordquerylocality is significant.Cachingmultiple-wordqueriesis morepromisingbecauseit takesmoretime to computemultiple-wordqueryresults.Sincemultiple-wordquerieshave lessdegreeof shareness,we couldcachethemmainlyat theuserside.

9

Page 13: Locality in Search Engine Queries and Its Implications for Caching · 2011-05-14 · findings about user lexicon analysis and its implications. Finally, we review analysis results

Numberof appearance PercentageVivisimo Excite Vivisimo Excite

Multiple wordqueries 77,181 1,429,618 69.61% 74.42%(over thenumberof thequeries)

Uniquemultiplewordqueries 55,193 924,546 73.26% 84.07%(over thenumberof theuniquequeries)

Multiple wordqueriesthatwere 21,995 505,249 61.89% 61.52%repeated (over thenumberof therepeatedqueries)Unique multiple word queriesthat

11,020 182,335 68.18% 77.39%

wererepeated (over the number of unique repeatedqueries)

Unique multiple word queriesthat

3,181 101,098 5.76% 11%

were submittedby more thanoneuser

(over thenumberof uniquemultiple wordqueries)

Table3: Multiple-wordquerysummary

Typeof queries Sharedmultiple-wordqueries Sharedsingle-wordqueriesVivisimo Excite Vivisimo Excite

Percentage 5.76% 10.93% 12.28% 19.37%Numberof users 2.53 3.95 3.24 7.38

Table4: Thecomparisonbetweenmultiple-wordqueriesandsingle-wordquerieswith respectto thedegreesof thesharenessin bothtraces.’Percentage’meanstheportionof sharedmultiple-word or single-wordqueriesover the total numberof multiple-wordor single-wordqueries.’Numberof users’meanstheaveragenumberof userssharingeachsuchquery.

10

Page 14: Locality in Search Engine Queries and Its Implications for Caching · 2011-05-14 · findings about user lexicon analysis and its implications. Finally, we review analysis results

3.5 Users with shared IP addresses

Someusersusedial up servicessuchasAOL to accessthe Internet. For thoseusers,their IPaddressesaredynamicallyallocatedby DHCPservers.Unfortunately, thereis nocommonwayto identify thesekindsof users.This impactsouranalysisin two ways.First,becausedifferentuserscansharethesameIP addressat differenttimes,their queriesseemlike they comefromthe sameuser, leadingto an overestimateof the querylocality from the sameusers.Second,becausethesameuserscanusedifferentIP addressesat differenttimes,it is alsopossibleforus to underestimatethe querylocality from the sameusers.But sinceAOL clientswill havekeyword’AOL’ in theiruser-agentfields,whichwasrecordedin theVivisimotrace,weareableto identify AOL userswhomighthavesharedIP addressesin theVivisimotrace.Wefoundthatamongthe110,881queriesreceived,thereareonly 2,949 queriessubmittedby 749 AOL clients.And only threeof themarefrequentuserswhosubmittedmorethan70 queries.Therefore,webelieveour resultsabouttheVivisimotracearenotbiasedby theseusers.

4 User lexicon analysis and its implications

In this section,we analyzetheuserquerylexicons.We alsoproposepossiblewaysto prefetchqueryresultsfor eachindividualuser, by recognizingtheirmostfrequentlyusedterms.

4.1 Distribution of the user lexicon size

We noticedthattheword frequency in thetracecannotbecharacterizedby a Zipf distribution,which wasalsonoticedin [6]. Thegraphfalls steeplyat thebeginningandhasanunusuallylong tail.

Insteadof looking at the overall lexiconsusedin the trace,we groupall the wordsusedby eachuserindividually, andexaminetheuserlexicon sizedistribution. Amongthe249,541wordsin the queriesfrom the Vivisimo trace,thereare51,895 distinct words. In the Excitetrace,thereare350,879 distinctwordsamongthe5,095,189 wordsfrom thequeries.Wenoticethatsomeusersin theExcite tracesubmitteda largenumberof queries,for example,oneusersubmitted130,220queriesduring the 8 hours. Theseusersarevery likely to be meta-searchenginesinsteadof normalusers.Thuswe ignorethe 60 userswho submittedmorethan100queriesin theExcite traceandexaminethe remaininguserson their lexicon sizes.Comparedwith theoverall lexiconsize,theuserlexiconsizesaremuchsmaller. Thelargestuserlexiconsin the two traceshave only 885 wordsand202 wordsrespectively. We alsofind that theuserlexicon size doesnot follow a Zipf distribution. Figure5 plots the distributionsof the userlexiconsize.Thegraphshaveheavy tails,meaningthemajorityof theusershavesmalllexicons.

We alsolookedat the relationshipbetweenthe numberof queriessubmittedby eachuserand the correspondinglexicon size. For both traces,the more queriessubmittedby a user,the larger the user’s lexicon. Figure7 shows the relationshipbetweenthe numberof queriessubmittedby a userandthecorrespondinguserlexicon sizebasedon theVivisimo trace.The

11

Page 15: Locality in Search Engine Queries and Its Implications for Caching · 2011-05-14 · findings about user lexicon analysis and its implications. Finally, we review analysis results

0 1 2 3 4 50

0.5

1

1.5

2

2.5

3

3.5

User ID(sorted by lexicon size)(10 based log)

Lexi

con

Siz

e(10

bas

ed lo

g)

0 1 2 3 4 5 60

0.5

1

1.5

2

2.5

User ID(sorted by lexicon size)(10 based log)

Lexi

con

Siz

e(10

bas

ed lo

g)

(a)TheVivisimotrace (b)TheExcitetrace

Figure5: Distribution of the userlexicon sizeswith logarithmicscaleson both axes. X axisdenotesthe userIDs sortedby the lexicon sizes. Whenplotting the Excite trace,we removethoseuserswhosubmittedmorethan100queriesover the8-hourperiod.Thoseusersareverylikely to bemeta-searchenginesor robotsinsteadof normalusers.

Figure6: Repeatedquerydistributionwithin differenttime intervals

0 500 1000 1500 20000

50

100

150

200

250

300

350

400

User ID (sorted by lexicon size)

User lexicon size Number of queries submitted by the user

The user with a small lexicon size but submitted many queries

Figure7: Userlexiconsizesandthenumberof thequeriesthey submittedin theVivisimotrace.Solid line plots the sorteduserlexicon sizes. Dashedline plots the numbersof the queriessubmittedby thecorrespondingusers.For scalereasons,this figureis a zoomedin part from acompletefigure,but thepatternshown hereis generalfor thecompletefigureaswell.

userlexicon sizesarein proportionto thenumberof queriessubmittedby theusers.But therearealsoa few exceptions,wheretheuserssubmitteda lot of queriesout of small lexicons.TheExcitetraceshows thesamepatternandis thusnotplottedhere.

12

Page 16: Locality in Search Engine Queries and Its Implications for Caching · 2011-05-14 · findings about user lexicon analysis and its implications. Finally, we review analysis results

Numberof fre-users 157Numberof queriessubmittedby fre-users 25,722Numberof fre-userswith fre-lexiconsize � 1 153Numberof fre-userswith fre-lexiconsize � 20 128

Table5: Fre-usersstatisticsin the Vivisimo trace. Fre-usersaredefinedas thoseuserswhosubmittedat least70 queriesover the 35 days. Thoseusersvisited Vivisimo frequentlyandsubmittedat least2 querieseachdayon average.Fre-lexicon is definedasthesetof termsthatwereusedby thecorrespondinguserfor at leastfive times.

4.2 Analysis of frequent users and their lexicons

Thoughsomeusershave large lexicons, they do not useall of the wordsuniformly. We areinterestedin how many wordsaremost frequentlyusedin the queriesfor eachuser. For theusersin theVivisimotracewhosubmittedonly a few queriesover the35days,calculatingtheirfrequentlyusedwordsis not meaningful.Similarly, we do not look at theusersin theExcitetraceeitherbecausethe tracelastsfor too shorta period. Sowe only focuson thosefrequentusersin theVivisimo trace. We call a usera fre-userif theusersubmittedat least70 queries.Thefre-usersthussubmittedat least2 querieseachdayonaverage.Wealsodefinea fre-lexiconfor eachfre-user. Thefre-lexiconconsistsof thewordsthatwereusedat leastfive timesby thecorrespondingusers

Table5 shows thestatisticalsummaryaboutthe fre-usersin theVivisimo trace.Thereare157 fre-users,amongwhom, 153 fre-usershave non-emptyfre-lexicons. Although fre-usersaccountfor lessthan1% of theusers,they submitted25,722 queriesin total,whichaccountfor23.20%, asignificantportionover thetotalnumberof queries.Sincetheseusersarenon-trivial,cachingandimproving queryresultsbasedon their individual requirementlookspromising.

We alsoobserve from the tablethatmostof the fre-usershadsmall fre-lexicons. Thusweareinterestedin how many queriesweregeneratedpurelyby thewordsfrom fre-lexicons. Iftheuserstendto re-useasmallnumberof wordsveryoftento form queries,thenwecanpredictqueriesandprefetchqueryresultsby simply enumeratingall the word combinations.Figure8 (a) plots the numberof queriesgeneratedfrom fre-lexicons. Therearea small numberoffre-userswith relatively larger fre-lexicons, from which they submitteda lot of queries. Soprefetchingbasedon fre-lexicons for theseuserswill help reducethe numberof queriestobe submitteddramatically. But we alsoneeda large cachesizeto storeall the possiblewordcombinationsdueto therelatively largefre-lexicon size. Therearealsoquitea few userswhogenerateda lot of queriesfrom smallfre-lexicons.For example,theuserspecifiedin thefiguregenerated106 queriesfrom an 8 word fre-lexicon. For theseusers,prefetchingaccordingtofre-lexiconswouldbemosteffective.

Sincethemajority of querieshave fewer thanfive terms,it is interestingto seehow manyquerieswith fewerthanfivetermsweregeneratedfrom fre-lexicons.Figure8 (b) showsboththe

13

Page 17: Locality in Search Engine Queries and Its Implications for Caching · 2011-05-14 · findings about user lexicon analysis and its implications. Finally, we review analysis results

0 20 40 60 80 100 120 140 1600

50

100

150

200

250

300

User ID(sorted by fre−lexicon size)

Fre−lexicon size Number of queries generated from fre−lexicon

The user who generated a lot ofqueries out of a small fre−lexicon

0 50 100 1500

10

20

30

40

50

60

70

80

90

100

Fre−User ID

(%)P

erce

ntag

e of

the

quer

ies

Queries generated from fre−lexicon Quereies generated from fre−lexicon with <= 4 terms

(a)QueriesandFre-lexicon (b)Percentageof thequeriesgeneratedfrom fre-lexicons

Figure 8: (a) The numberof queriesgeneratedfrom fre-lexicon and the correspondingfre-lexicon size. Fre-userIDs aresortedby the lexicon sizes. (b) The percentageof the queriesgeneratedby thefre-lexiconfor eachfre-user. Thefre-userIDs aresortedby thepercentages.

percentageof thequeriesfrom fre-lexiconsandthepercentageof thequeriesfrom fre-lexiconswith lessthanfive terms.Theadditionof theextraconstraintsimposedon thenumberof termsdoesnot affect the resultsfor mostof the fre-users.Therefore,we could just enumeratetheword combinationsusingno morethanfour terms,which would greatlyreducethenumberofqueriesto beprefetched.

5 Discussions on Research directions

Although the Vivisimo traceand the Excite tracewere collectedindependentlyat differenttimes,over differenttemporalperiods,andwith differentuserpopulations,their statisticalre-sultsaresimilar. In this section,we review theseresultsanddiscusstheir implicationson fu-tureresearchdirections.We focuson threeaspects:cachingsearchengineresults,prefetchingsearchengineresults,andimproving queryresultrankings.

5.1 Caching search engine results

With������������

of repeatedqueries,cachingsearchengineresultsis a non-trivial problem.Cacheplacementis oneof thekey aspectsof theeffectivenessof caching.Queryresultscanbecachedon theservers,theproxies,andtheclients. For optimalperformance,we shouldmakedecisionsbasedon thefollowing aspects:

1. Scalability. A cacheschemeshouldscalewell with theincreasingsizeandthedensityoftheInternet.

14

Page 18: Locality in Search Engine Queries and Its Implications for Caching · 2011-05-14 · findings about user lexicon analysis and its implications. Finally, we review analysis results

Level of caching Servercaching Proxycaching UsersidecachingScalability worst medium bestPerformanceimprovement worst medium bestHit rateandshareness best medium worstOverhead small large smallOpportunityfor otherbenefits least medium most

Table6: Comparisonsbetweendifferentcacheplacementschemes.

2. Performanceimprovement.This includesthe amountof reductionin server workload,useraccesslatency, andnetworktraffic.

3. Hit rateandshareness.Thehigherthehit rate,themoreefficient thecaching.We wouldalsolike cachedqueryresultsto besharedamongdifferentusersof commoninterests.

4. Overhead.Theoverheadsincludethesystemresourcesdevotedfor cachingandtheextranetworktraffic induced.

5. Opportunityfor otherbenefits.By disseminatinguserrequestsamongcaches,whatotherbenefitscanweachieve with theexistenceof caching?

Table6 comparesthe prosandconsof differentcacheplacementschemesin the generalcase.(1) Server cachinghasonly limited systemresourceavailable,so it doesnot scalewellwith theincreasingsizeof theInternet.In addition,thoughwe cansave theredundantcompu-tation,we cannotreducethenumberof requestsreceivedby serversandthereductionin useraccesslatency is alsovery limited. However, server cachinghassmall overhead.Moreover,it allows maximumquerysharenessamongdifferentusersandthe hit ratewould be high bycachingpopularqueries. (2) Proxy cachingis effective to reduceboth server workloadandnetworktraffic. In thecaseof cachingqueryresults,this assumesthattheusersnearbya proxywouldsharequeriesto resultin largehit rate.However, oneof themaindisadvantagesof proxycachingis thesignificantoverheadof placingdedicatedproxiesamongthe Internet. (3) Usersidecachinghasthebestperformanceimprovementin caseof acachehit becausetheredundantuserrequestswill notbesentoutat all. Comparedwith thelimited systemresourcesatservers,thesumof eachindividual userresourceis almostinfinite. Sousersidecachingalsoachievesthe bestscalability. Becausethe overheadof cachingcanbe amortizedto a large numberofusers,theoverheadat eachusersideis small. More importantly, with usersidecaching,it isnow possibleto prefetchor improve queryresultsbasedon individualuserrequirement.How-ever, no sharenesswill beexploitedwith usersidecaching.Thusif queriesaremostlysharedinsteadof beingrepeatedby thesameusers,wewouldhave low hit ratein suchcase.

Theabovediscussionindicatesthatthedegreeof sharenessis importantto decidewhereweshouldcachequeryresults. In oneextremecasewhereusersnever repeattheir own queries,we reachthe maximumdegreeof shareness.In suchcase,we shouldcachequeryresultsat

15

Page 19: Locality in Search Engine Queries and Its Implications for Caching · 2011-05-14 · findings about user lexicon analysis and its implications. Finally, we review analysis results

serversor proxiesbecauseusersidecachingwould resultin zerohit rate. In anotherextremecasewhereusersneversharequeries,cachingqueryresultsat usersidewouldbemosthelpful.

Our traceanalysisresultsshow that,amongall queries,repeatedqueriesaccountfor 32%to 42%,while repeatedqueriesfrom thesameusersaccountfor 16%to 22%. Thesignificantportion of the queriesrepeatedby the sameusersand the non-trivial differencebetweentheabovetwo percentagessuggestbothserver/proxycachingandusersidecaching.For thequeriesrepeatedonly by thesameusers,wecancachethematusersidefor efficiency, while therestoftherepeatedqueriescanbecachedateitherserversor proxies.Sincewedonotknow theactualIP addressesof the usersin both traces,we cannotfurther distinguishthe queriesthat canbesharedby nearbyusersandthuscanbecachedatproxies.But it is clearfrom thetracesthatthepercentageof suchquerieswouldbebetween16%and42%.Therefore,weleaveproxycachingasa futurework for searchenginesthemselvesto determinewhetherandwhereto placesuchproxies.

Sincemultiple-wordquerieshavelessdegreeof shareness,wealsosuggestcachingmultiple-word queriesmainly at theuserside. Therefore,by cachingonly popularsingle-wordqueriesatservers/proxies,wecanachieve largerhit ratesandgreatlyreducetherequiredsystemspace.

Anotherimportantquestionaboutcachingis how longweshouldcachequeryresults.Tem-poralquerylocality indicatesthatmostof thequeriesarerepeatedwithin shorttime intervals.Soin general,cachingqueryresultsfor shortperiodsshouldwork well. Moredetailedanalysisin Section3.3shows thattheshortertimeinterval, themorelikely for aqueryto berepeatedbythesameusers.Thusfor usersidecaching,cachingqueryresultsfor hourswill beenoughtocover thequerylocality existing on thesameusers.This alsohelpsto remove or updatestalequeryresultsin time. Therealsoexist a non-trivial portionof queriesrepeatedover relativelylongertime intervalsandthesequeriesaremainly sharedamongdifferentusers.So if cachingis to be doneat serversor proxies,we canconsiderlonger-term cachingsuchasa coupleofdays.

5.2 Prefetching search engine results

Ouruserlexiconanalysisin Section4 suggeststhatprefetchingqueryresultsbasedonuserfre-lexiconsis promising.Prefetchinghasalwaysbeenanimportantmethodto reduceuseraccesslatency. In thecaseof prefetchingsearchengineresults,it canbeperformedat bothusersideor proxies.

Cachingqueryresultsat usersideprovidesusefulinformationaboutuserinterest.For thisreason,usersideprefetchingis naturalandworthlookingat. Fromtheuserlexiconanalysis,weobserve thatthemajorityof theuserlexiconsizesaresmall.Frequentuserswhosubmittedalotof queriesusuallyuseasmallsubsetof wordsmoreoftenthanotherwords.Soastraightforwardway of prefetchingis to enumerateall the word combinationsfrom fre-lexiconsandprefetchthe correspondingqueryresults. Sincequeriesareusuallyshort,we canskip querieslongerthan four termsto reducethe prefetchingoverheadwhile achieving approximatelythe sameperformanceimprovement.For example,a 10 word fre-lexicon needsto prefetch385queries

16

Page 20: Locality in Search Engine Queries and Its Implications for Caching · 2011-05-14 · findings about user lexicon analysis and its implications. Finally, we review analysis results

usingtheabove naive algorithm,which will costlessthan20 minutesnetworkdownloadtimeusing56Kbpsmodemandabout8M disk for storage.Theseoverheadsaretrivial consideringthatanormalPCtodayis idle about70%of thetimeonaverageandthenormaldisksizeis tensof Gigabytes.

Prefetchingcanalso be performedat proxies. In suchcases,server’s global knowledgeaboutuserquerypatternscanbeutilizedto decidewhatto prefetchandwhento prefetch.Sinceproxiesallow querysharenessamongdifferentusers,exploring how to achieve maximumhitratein suchcasewouldalsobeaninterestingproblemfor futureresearch.

5.3 Improving query result rankings

Althoughtoday’ssearchenginesoftenreturntensof thousandsof resultsfor a userquery, onlyafew resultswill beactuallyreviewedby theusers.Section2.3showsthateachuseronaveragereviews lessthantwo pagesof results.Thusimproving queryresultrankingsbasedon individ-ual userrequirementis moreimportantthanever. Thepersonalnatureof ’relevance’requiresincorporatingusercontext to find desiredinformation.Becausecentralizedsearchenginespro-vide servicesto millions of users,it is impracticalto customizeresultsfor eachuser. Somespecializedsearchenginesdo offer searchresultswhich aredifferentthanstandardfor somespecializeduserrequirements.But noneof themallowsusersto definetheirown requirementsat will. With userside/proxycaching,it is now possibleto re-rankthe returnedsearchengineresultsbasedon theuniqueinterestof individual user. For example,a naive algorithmwouldbeto increasetheranksof theWebpagesvisitedby theuseramongthenext queryresults.Wecanalsoexploreotheralgorithmsandintegratethemwith cachingto customizesearchengineresultsfor individualuser.

6 Conclusions

Cachingis animportanttechniquefor reducingserverworkloadanduseraccesslatency. In thispaper, we investigatedtheissueof whethercachingmightwork in thecaseof searchengines,asit doesin many otherareas.We studiedtwo realsearchenginetracesandinvestigatedthefol-lowing threequestions:(1) Whereshouldwe cachesearchengineresults?At servers,proxies,or userside?(2) How longshouldwecachesearchresults?Or doquerieshavestrongtemporalquerylocality? (3) Whataretheotherbenefitsof cachingsearchengineresults?

Our analysisof both the Vivisimo searchenginetraceandthe Excite searchenginetraceindicatethat: (1) Querieshave significantlocality. Queryrepetitionfrequency follows a Zipfdistribution. Thepopularquerieswith high repetitionfrequenciesaresharedamongdifferentusersandcanbecachedat serversor proxies.Therearealsoabout16%to 22%of thequeriesrepeatedby thesameusers,which shouldbecachedat userside. Multiple-word querieshavelessdegreeof sharenessand shouldbe cachedmainly at userside. (2) The majority of therepeatedqueriesarereferencedagainwithin short time intervals. Thereis alsoa significant

17

Page 21: Locality in Search Engine Queries and Its Implications for Caching · 2011-05-14 · findings about user lexicon analysis and its implications. Finally, we review analysis results

portion of queriesthat arerepeatedwithin relatively longer time intervals, which are largelysharedby differentusers.Soif cachingis to bedoneat userside,short-termcachingfor hourswill be enoughto cover query temporallocality, while server/proxycachingshouldbaseonlongerperiodssuchasdays.(3) Mostof theusershavesmalllexiconswhensubmittingqueries.Frequentuserswho submitteda lot of searchrequeststendto re-usea small subsetof wordsto form queries. Thuswith proxy or usersidecaching,prefetchingbasedon userlexicon ispromising. Proxy or usersidecachingalsoprovide us with opportunitiesto improve queryresultsbasedon individualuserrequirement,which is animportantfutureresearchdirection.

7 Acknowledgements

Wegratefullyacknowledgetheassistanceof RaulValdes-PerezandtheVivisimosearchengineteamfor providing us with a trace. Thanksalso to the Excite teamfor making their traceavailable. Without the generoussharingof the tracedataby Vivisimo andExcite, this workwouldnot bepossible.We alsothankJamieCallanfor his helpful commentsandsuggestions.

References[1] Excite. http://www.excite.com.

[2] Vivisimo. http://www.vivisimo.com.

[3] S. Adali, K.S. Candan,Y. Papakonstantinou,andV.S. Subrahmanian.QueryCachingandOpti-mizationin DistributedMediatorSystems.In H. V. JagadishandInderpalSinghMumick, editors,Proc. of the 1996 ACM SIGMODConf. on Managementof Data,, pages137–148.ACM Press,1996.

[4] P. Cao,L. Fan, andQ. Jacobson.Web PrefetchingBetweenLow-BandwidthClientsandProx-ies:PotentialandPerformance.In Proceedingsof the ACM SIGMETRICSConference,Atlanta,GE, May 1999.

[5] A. Feldmann,R. Caceres,F. Douglis,G. Glass,andM. Rabinovich. Performanceof WebProxyCachingin HeterogenousEnvironments.In Proceedingsof theIEEE infocomm’99,New York, NY,March1999.

[6] B.J.Jansen,A. Spink,J.Bateman,andT. Saracevic. Reallife informationretrieval: astudyof userquerieson theWeb. In SIGIRForum,Vol. 32.No.1, pages5–17,1998.

[7] T. Kroeger, D. Long,andJ.Mogul. ExploringtheBoundsof WebLatency Reductionfrom CachingandPrefetching.In Proceedingsof theUSENIXSymposiumon InternetTechnologiesandSystems,December1997.

[8] C. Maltzahn,K. Richardson,andD. Grunwald.PerformanceIssuesof EnterpriseLevel WebProx-ies. In Proceedingsof theSIGMETRICSConferenceon MeasurementandModelingof ComputerSystems, June1997.

18

Page 22: Locality in Search Engine Queries and Its Implications for Caching · 2011-05-14 · findings about user lexicon analysis and its implications. Finally, we review analysis results

[9] EvangelosP. Markatos. On CachingSearchEngineResults. TechnicalReport241, InstituteofComputerScience,Foundationfor Research& Technology- Hellas(FORTH), 1999.

[10] EvangelosP. Markatos. On CachingSearchEngineQuery Results. In Proceedingsof the 5thInternationalWebCaching andContentDeliveryWorkshop, May 2000.

[11] M.S.Raunak,P. Shenoy, P. Goyal,andK. Ramamritham.Implicationsof ProxyCachingfor Provi-sioningNetworksandServers.In Proceedingsof theACMSIGMETRICSConference,SantaClara,CA, June2000.

[12] A. Rousskov andV. Soloviev. On Performanceof CachingProxies. In Proceedingsof the ACMSIGMETRICSConference,Madison,WI, June1998.

[13] C. Silverstein,M. Henzinger, H. Marais,andM. Moricz. Analysisof aVeryLargeAltaVistaQuerylog. TechnicalReport1998-014,Digital SystemResearchCenter, October1998.

[14] M. Taylor, K. Stoffel, J.Saltz,andJ.Hendler. UsingDistributedQueryResultCachingto EvaluateQueriesfor ParallelDataMining Algorithms. In Proc.of theInt. Conf. onParallel andDistributedTechniquesandApplications, 1998.

19