Transcript
Page 1: Measuring the impact of Google Analytics

Measuring the impact:

Stephen Merity / smerity.com @smerity

Page 2: Measuring the impact of Google Analytics

Smerity @ Common Crawl

Continuing the crawl

Documenting best practices

Guides for newcomers to Common Crawl + big data

Reference for seasoned veterans

Spending many hours blessing and/or cursing Hadoop

Before: University of Sydney '11, Harvard '14

Google Sydney, Freelancer.com, Grok Learning

Page 3: Measuring the impact of Google Analytics

[email protected]

I was hoping on creating a tool that will automatically extract some of the most common memes ("But does it run Linux?" and "In Soviet Russia..." style jokes etc) and I needed a corpus -

. I do intensely apologise.

I wrote a primitive (threaded :S) web crawler and started it before I considered robots.txt

-- Past Smerity (16/12/2007)

Page 4: Measuring the impact of Google Analytics

Where did all the HTTP referrers go?

Page 5: Measuring the impact of Google Analytics

Referrers: leaking browsing history

If you click from

http://www.reddit.com/r/sanfrancisco

to

http://www.sfbike.org/news/protected-bikeways-planned-for-the-embarcadero/

then SF Bike knows you came from Reddit
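
To make the leak concrete, here is a minimal sketch of how the destination server could read the referring site from the request. The `referring_host` helper and the sample header dictionary are illustrative assumptions, not code from the talk; the only real detail relied on is that HTTP spells the header "Referer".

```python
from urllib.parse import urlsplit


def referring_host(headers):
    """Return the host named in the Referer header, or None if it is absent."""
    # HTTP header names are case-insensitive; the header is spelled "Referer".
    for name, value in headers.items():
        if name.lower() == "referer":
            return urlsplit(value).hostname
    return None


# Hypothetical headers for the click described on this slide.
request_headers = {
    "Host": "www.sfbike.org",
    "Referer": "http://www.reddit.com/r/sanfrancisco",
}

print(referring_host(request_headers))
```

Any third-party script loaded by the destination page can learn the same information, which is why the presence of GA on a page matters here.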

Page 6: Measuring the impact of Google Analytics

1) How many websites is Google Analytics (GA) on?

2) How much of a user's browsing history does GA capture?

Page 7: Measuring the impact of Google Analytics

Top 10k domains: 65.7%

Top 100k domains: 64.2%

Top million domains: 50.8%

It keeps dropping off, but by how much..?

Page 8: Measuring the impact of Google Analytics

Estimate of captured browsing history...

?

Page 9: Measuring the impact of Google Analytics

Referrers allow easy web tracking when done at Google's scale!

No information: !GA → !GA

Full information: !GA → GA

GA → !GA → GA

GA → !GA → GA → !GA → GA → !GA → GA → !GA → GA
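
The cases on this slide can be checked mechanically. This is a minimal sketch, assuming a browsing path is given as a list of (site, has_ga) pairs and that a click is visible to Google whenever either end of it runs GA. The `leaked_hops` name and the sample paths are illustrative, not from the talk.

```python
def leaked_hops(path):
    """For each click in the path, report whether either endpoint runs GA.

    path is a list of (site, has_ga) tuples in click order.
    """
    return [a_ga or b_ga for (_, a_ga), (_, b_ga) in zip(path, path[1:])]


# Alternating path: GA -> !GA -> GA -> !GA -> GA (site names are placeholders).
alternating = [("a", True), ("b", False), ("c", True), ("d", False), ("e", True)]

# Two sites without GA in a row: !GA -> !GA.
no_ga = [("x", False), ("y", False)]

print(leaked_hops(alternating))
print(leaked_hops(no_ga))
```

Under these assumptions every click on the alternating path is visible even though only every second site runs GA, while the click between two sites without GA is not.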

Page 10: Measuring the impact of Google Analytics

Key insight: leaked browsing history

Google only needs one in every two links to have GA in order to have your full browsing path*

*possibly less if link graph + click timing + machine learning used

Page 11: Measuring the impact of Google Analytics

Estimating leaked browser history

for each link = {page A} → {page B}:
    total_links += 1
    if {page A} or {page B} has GA:
        total_leaked += 1

Estimate of leaked browser history is simply: total_leaked / total_links
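
The pseudocode on this slide translates almost line for line into Python. This is a sketch on toy data: the `estimate_leak` name, the `.example` hosts, and the choice to represent GA presence as a set are assumptions made for illustration.

```python
def estimate_leak(links, has_ga):
    """Fraction of links where the source or the destination runs GA.

    links is an iterable of (page_a, page_b) pairs; has_ga is a set of
    pages (or hosts) known to include Google Analytics.
    """
    total_links = 0
    total_leaked = 0
    for page_a, page_b in links:
        total_links += 1
        if page_a in has_ga or page_b in has_ga:
            total_leaked += 1
    # Guard against an empty link list rather than dividing by zero.
    return total_leaked / total_links if total_links else 0.0


# Toy data, invented for this example.
ga_sites = {"a.example", "c.example"}
toy_links = [
    ("a.example", "b.example"),  # leaked: source has GA
    ("b.example", "c.example"),  # leaked: destination has GA
    ("b.example", "d.example"),  # not leaked
    ("d.example", "b.example"),  # not leaked
]

print(estimate_leak(toy_links, ga_sites))
```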

Page 12: Measuring the impact of Google Analytics

Joint project with Chad Hornbaker* at Harvard IACS

*Best full name ever: Captain Charles Lafforest Hornbaker II

Page 13: Measuring the impact of Google Analytics

The task

Google Analytics count: ".google-analytics.com/ga.js"

www.winradio.net.au NoGA 1
www.winrar.com.cn GA 6
www.winratzart.com GA 1
www.winrenner.ch GA 244

Generate link graph

domainA.com -> domainB.com <total times>

cnet-cnec-driver.softutopia.com -> www.softutopia.com 24

Merge link graph & GA count
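
The merge step can be sketched in a few lines once both outputs are plain text. The sketch assumes whitespace-separated fields in the two record shapes shown on this slide, treats hosts missing from the GA output as not running GA, and weights each link by its count; none of these choices is confirmed by the talk. The function names are hypothetical, and two of the three sample links are invented for the example.

```python
def parse_ga_line(line):
    """Parse 'host GA|NoGA count' into (host, runs_ga)."""
    host, flag, _count = line.split()
    return host, flag == "GA"


def parse_link_line(line):
    """Parse 'hostA -> hostB count' into (hostA, hostB, count)."""
    src, _arrow, dst, count = line.split()
    return src, dst, int(count)


def weighted_leak(ga_lines, link_lines):
    """Share of link occurrences where either endpoint runs GA."""
    has_ga = {host for host, runs_ga in map(parse_ga_line, ga_lines) if runs_ga}
    total = leaked = 0
    for src, dst, count in map(parse_link_line, link_lines):
        total += count
        if src in has_ga or dst in has_ga:
            leaked += count
    return leaked / total if total else 0.0


ga_lines = [
    "www.winradio.net.au NoGA 1",
    "www.winrar.com.cn GA 6",
    "www.winratzart.com GA 1",
    "www.winrenner.ch GA 244",
]
link_lines = [
    "cnet-cnec-driver.softutopia.com -> www.softutopia.com 24",  # from the slide
    "www.winrar.com.cn -> www.winradio.net.au 6",  # invented
    "www.winradio.net.au -> www.example.org 10",  # invented
]

print(weighted_leak(ga_lines, link_lines))
```

At full scale the same join would run as a distributed job rather than in memory, but the per-record logic stays this small.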

Page 14: Measuring the impact of Google Analytics

Exciting age of open data

Open data +

Open tools +

Cloud computing

Page 15: Measuring the impact of Google Analytics

WARC: raw web data

WAT: metadata (links, title, ...) for each page

WET: extracted text

Page 16: Measuring the impact of Google Analytics

WARC = GA usage (raw web data)

WAT = hyperlink graph (metadata: links, title, ... for each page)

Page 17: Measuring the impact of Google Analytics

Estimating the task's size

Page level ( ): http://en.wikipedia.org/ 3.5 billion nodes, 128 billion edges, 331 GB compressed

Subdomain level ( ): 101 million nodes, 2 billion edges, 9.2 GB compressed

Decided on using subdomains instead of page level

http:// /
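
Moving from page level to subdomain level amounts to keeping only the host part of each URL and summing the links between hosts. This is a minimal sketch using the standard library; the `collapse_to_hosts` name is hypothetical, the second sample link is invented, and whether links within a single host should be dropped is a choice the slide does not state (they are kept here).

```python
from collections import Counter
from urllib.parse import urlsplit


def host_of(url):
    """Return the host (subdomain included) of a URL, or None if it has none."""
    return urlsplit(url).hostname


def collapse_to_hosts(page_links):
    """Turn page-level (src_url, dst_url) links into host-level edge counts."""
    edges = Counter()
    for src_url, dst_url in page_links:
        src, dst = host_of(src_url), host_of(dst_url)
        if src and dst:  # skip links that lack a usable host
            edges[(src, dst)] += 1
    return edges


page_links = [
    ("http://www.reddit.com/r/sanfrancisco",
     "http://www.sfbike.org/news/protected-bikeways-planned-for-the-embarcadero/"),
    ("http://www.reddit.com/r/bicycling", "http://www.sfbike.org/"),  # invented
]

for (src, dst), count in collapse_to_hosts(page_links).items():
    print(f"{src} -> {dst} {count}")
```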

Page 18: Measuring the impact of Google Analytics

Engineering for scale

✓ Use the framework that matches best

✓ Debug locally

✓ Standard Hadoop optimizations (combiner, compression, re-use JVMs...)

✓ Many small jobs ≫ one big job

✓ Ganglia for metrics & monitoring

Page 19: Measuring the impact of Google Analytics

Hadoop :'(

Page 20: Measuring the impact of Google Analytics

Hadoop :'(

Page 21: Measuring the impact of Google Analytics

Monitoring & metrics with Ganglia

Page 22: Measuring the impact of Google Analytics

Engineering for cost

✓ Avoid Hadoop if it's simple enough

✓ Use spot instances everywhere*

✖ Use EMR if highly cost sensitive

(Elastic MapReduce = hosted Hadoop)

*Everywhere but the master node!

Page 23: Measuring the impact of Google Analytics

Juggling spot instances

c1.xlarge goes from $0.58 p/h to $0.064 p/h

Page 24: Measuring the impact of Google Analytics

EMR: The good, the bad, the ugly

significantly easier, one click setup

price is insane when using spot instances (spot = $0.075 with EMR = $0.12)

Guess how many log files for a 100 node cluster?

Page 25: Measuring the impact of Google Analytics

584,764+ log files.

Ouch.

Page 26: Measuring the impact of Google Analytics

Cost projection

Best optimized small Hadoop job: 1/177th the dataset in 23 minutes (12 c1.xlarge machines + Hadoop master)

Estimated full dataset job:
~210 TB for web data + ~90 TB for link data
~$60 in EC2 costs (177 hours of spot instances)
~$100 in EMR costs (avoid EMR for cost!)
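
One back-of-envelope reading that lands near these dollar figures scales the 23-minute test job up by 177. It assumes all 13 machines are priced at the $0.064 spot rate quoted earlier and that the $0.12 EMR figure is a per-machine-hour fee. Both are assumptions, and this reading does not reproduce the "177 hours" figure on the slide.

```python
# Figures quoted in the slides.
WORKERS = 12
MASTERS = 1
MINUTES_PER_SLICE = 23   # time to process 1/177th of the dataset
SLICES = 177
SPOT_PRICE = 0.064       # $ per c1.xlarge hour at the spot rate
EMR_PRICE = 0.12         # $ per machine hour, assumed to be the EMR fee

# Assumption: the master is priced like a worker and billing is per minute.
machines = WORKERS + MASTERS
machine_hours = machines * MINUTES_PER_SLICE * SLICES / 60

ec2_cost = machine_hours * SPOT_PRICE
emr_cost = machine_hours * EMR_PRICE

print(f"machine hours: {machine_hours:.0f}")
print(f"EC2 at spot:   ${ec2_cost:.0f}")
print(f"EMR fee:       ${emr_cost:.0f}")
```

Under these assumptions the EMR fee alone exceeds the cost of the instances, which is consistent with the advice to avoid EMR when cost matters.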

Page 27: Measuring the impact of Google Analytics

Final results

29.96% of 48 million domains have GA (top million domains was 50.8%)

That means that one in every two hyperlinks will leak information to Google

Page 28: Measuring the impact of Google Analytics

The wider impact

Page 29: Measuring the impact of Google Analytics

Want Big Open Data?

Web Data

Covers everything at scale!

Languages... Topics... Demographics...

Page 30: Measuring the impact of Google Analytics

Processing the web is feasible

Downloading it is a pain! Common Crawl does that for you

Processing it is scary! Big data frameworks exist and are (relatively) painless

These experiments are too expensive! Cloud computing means experiments can be just a few dollars

Page 31: Measuring the impact of Google Analytics

Get started now..!

Want raw web data? CommonCrawl.org

Want hyperlink graph / web tables / RDFa? WebDataCommons.org

Want example code to get you started? https://github.com/Smerity/cc-warc-examples

Page 32: Measuring the impact of Google Analytics

Measuring the impact:

Full write-up: http://smerity.com/cs205_ga/

Stephen Merity / smerity.com @smerity

