CS345a: Data Mining Jure Leskovec and Anand Rajaraman Stanford University

  • Friday 5:30 at Gates B12 (5:30-7:30pm). You will learn and get hands-on experience on:

    Log in to Amazon EC2 and request a cluster

    Run Hadoop MapReduce jobs

    Use Aster nCluster software

    Amazon gave us $12k of computing time. Each student has about $200 worth of computing time.

    1/21/2010 Jure Leskovec & Anand Rajaraman, Stanford CS345a: Data Mining

  • Ideally teams of 2 students (1 or 3 is also ok). Project:

    Discovers interesting relationships within a significant amount of data

    Have some original idea that extends/builds on what we learned in class

    Extend/improve/speed up some existing algorithm

    Define a new problem and solve it

  • Answer the following questions:

    What is the problem you are solving?

    What data will you use (where will you get it)?

    How will you do it? Which algorithms/techniques do you plan to use? Be as specific as you can!

    How will you evaluate and measure success?

    What do you expect to submit at the end of the quarter?

  • Due at midnight, Feb 1, 2010

    Email the PDF to [email protected]

    TAs will assign group numbers

    Name your file: _proposal.pdf

  • Wikipedia

    IM buddy graph

    Yahoo Altavista web graph

    Stanford WebBase

    Twitter data

    Blogs and news data

    Netflix

    Restaurant reviews

    Yahoo Music ratings

  • Complete edit history of Wikipedia until January 2008

    For every single edit the complete snapshot of the article is saved

    Each page has a talk page:

  • Talk page:

    Editors discuss things like:

  • Every registered user has a page:

  • Every user's page has a talk page:

    Users discuss things:

  • Anarchism 12 18201 2002-02-25T15:00:22Z

    Conversion script

    Automated conversion

    ''Anarchism'' is the political theory that advocates the abolition of all forms of government. ...

    19746 2002-02-25T15:43:11Z

    140.232.153.45

    *''Anarchism'' is the political theory that advocates the abolition of all forms of government. ...

  • Complete edit and talk history of Wikipedia:

    How do articles evolve? Use a string-edit-distance-like approach to measure differences between versions of the article. Model the evolution of the content.

    Which users make what types of edits? Big vs. small changes, reorganization? Suggest which user should edit the page?

    How do users talk and then edit the same pages? Do users first talk and then edit? Is it the other way around? Suggest to users which pages to edit.
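
One way to make the "edit-distance-like" comparison of article versions concrete is a word-level diff. The sketch below is only an illustration, not the course's method: it uses Python's standard difflib, and the function name `revision_change` and the two returned fields are hypothetical choices.

```python
import difflib

def revision_change(old_text, new_text):
    """Compare two versions of an article at the word level."""
    old_words, new_words = old_text.split(), new_text.split()
    matcher = difflib.SequenceMatcher(a=old_words, b=new_words, autojunk=False)
    # Count the words touched by every non-matching diff block.
    changed = sum(
        max(i2 - i1, j2 - j1)
        for op, i1, i2, j1, j2 in matcher.get_opcodes()
        if op != "equal"
    )
    return {"similarity": matcher.ratio(), "changed_words": changed}

# Two revisions that differ in a single word (toy input).
small_edit = revision_change("a b c d", "a b x d")
```

Thresholding `changed_words` would be one simple way to separate big from small edits; block moves (reorganization) would need a different measure.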

  • Altavista web graph from 2002:

    Nodes are web pages

    Directed edges are hyperlinks

    1.4 billion public web pages

    Several billion edges

    For each node we also know the page URL

  • SPAM: Use the web graph structure to more efficiently extract spam web pages

    Link farms

    Spider traps

    Personalized and topic-sensitive PageRank
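
As a rough illustration of personalized (topic-sensitive) PageRank, the sketch below runs power iteration on a tiny in-memory graph, with random jumps landing only on a chosen teleport set. The function name, the damping value `beta=0.85`, the toy graph, and the requirement that every link target also appears as a key are all assumptions made for this example; a graph with billions of edges would need a distributed implementation.

```python
def personalized_pagerank(out_links, teleport, beta=0.85, iters=50):
    """Power iteration where random jumps land only on pages in `teleport`.

    out_links maps each page to the list of pages it links to; every
    link target must also be a key of out_links (assumption of this sketch).
    """
    nodes = sorted(out_links)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    jump = {n: (1.0 / len(teleport) if n in teleport else 0.0) for n in nodes}
    for _ in range(iters):
        nxt = {n: 0.0 for n in nodes}
        for src in nodes:
            targets = out_links[src]
            if targets:
                share = beta * rank[src] / len(targets)
                for dst in targets:
                    nxt[dst] += share
        # Rank mass lost to teleportation and dead ends goes to the teleport set.
        leaked = 1.0 - sum(nxt.values())
        for n in nodes:
            nxt[n] += leaked * jump[n]
        rank = nxt
    return rank

toy_graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
scores = personalized_pagerank(toy_graph, teleport={"a"})
```

Choosing the teleport set as a list of trusted or on-topic pages biases the ranking toward pages reachable from them.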

  • Website structure identification:

    From the web graph extract websites

    What are common navigational structures of websites? Cluster website graphs. Identify common subgraphs and patterns.

    What roles do pages/links play in the graph: content pages, navigational pages, index pages

    Build a summary/map of the website
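
A first step toward extracting websites from the web graph could be to group hyperlinks by host name. The sketch below assumes the graph is available as (source URL, target URL) pairs and treats "same host" as "same website", which is a simplification; the function name and the example URLs are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlsplit

def split_into_sites(edges):
    """Group intra-site hyperlinks by host name.

    edges is an iterable of (source_url, target_url) pairs.
    Returns {host: [(source_url, target_url), ...]} with cross-site links dropped.
    """
    sites = defaultdict(list)
    for src, dst in edges:
        src_host = urlsplit(src).hostname
        if src_host and src_host == urlsplit(dst).hostname:
            sites[src_host].append((src, dst))
    return dict(sites)

links = [
    ("http://a.example/index.html", "http://a.example/about.html"),
    ("http://a.example/index.html", "http://b.example/"),
    ("http://b.example/", "http://b.example/news.html"),
]
site_graphs = split_into_sites(links)
```

Each per-site edge list can then be clustered or mined for common subgraphs independently of the others.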

  • A collection of focused snapshots of the Web. Data starts in 2004 and continues till today.

    General crawls start from ~1000 seed web pages. Crawl up to ~150,000 pages per site.

    Specialized crawls: universities, US Government, Hurricane Katrina (2005) daily crawls, monthly newspaper crawls

  • Smaller than Altavista but you also have the page content

    Can do topic analysis

    Topic-sensitive PageRank

    Study the evolution of websites and web pages

  • 50 million tweets per month starting June 2009 (6 months)

    Format:

    T 2009-06-07 02:07:42

    U http://twitter.com/redsoxtweets

    W #redsox Extra Bases: Sox win, 8-1: The Rangers spoiled Jon Lester's perfecto and his shutout.. http://tinyurl.com/pyhgwy

    Two important things: URLs, hashtags
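
The T/U/W record format can be read with a few lines of code. The sketch below is a guess at a reasonable parser, not a reference implementation: it assumes each tweet is written as three tagged lines with the W line last, and the regular expressions for hashtags and URLs are simplifications.

```python
import re

def parse_tweets(lines):
    """Yield one dict per tweet from lines tagged T (time), U (user), W (text)."""
    record = {}
    for line in lines:
        parts = line.strip().split(None, 1)
        if len(parts) != 2 or parts[0] not in ("T", "U", "W"):
            continue
        tag, value = parts
        record[tag] = value
        if tag == "W":  # assumption: the W line closes a record
            yield {
                "time": record.get("T"),
                "user": record.get("U"),
                "text": value,
                "hashtags": re.findall(r"#\w+", value),
                "urls": re.findall(r"https?://\S+", value),
            }
            record = {}

sample = [
    "T 2009-06-07 02:07:42",
    "U http://twitter.com/redsoxtweets",
    "W #redsox Extra Bases: Sox win, 8-1: The Rangers spoiled Jon Lester's "
    "perfecto and his shutout.. http://tinyurl.com/pyhgwy",
]
tweets = list(parse_tweets(sample))
```

Counting the extracted hashtags and URLs per day would give the raw time series needed for trend or lifecycle analysis.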

  • Trending topics: rising, falling

    Inferring links of the who-follows-whom network

    What are the lifecycles of URLs and hashtags?

    Finding early/influential users?

    Clustering tweets by topic or category

    Sentiment analysis: are people positive/negative about something (a product?)

  • More than 1 million news media and blog articles per day since August 2008

    Extract phrases (quotes) and links

    http://memetracker.org

    Format:

    P http://cnnpoliticalticker.wordpress.com/2008/08/31/mccain-defends-palins-experience-level

    T 2008-09-01 00:00:13

    Q dangerously unprepared to be president

    Q even more dangerously unprepared

    Q understands the challenges that we face

    Q worked and succeeded

    Q still to this day refuses to acknowledge that the surge has succeeded

    L http://www.cnn.com
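
The P/T/Q/L records can be grouped into one entry per article. The sketch below assumes that a P line starts a new article and that the T, Q, and L lines that follow belong to it; the function name and the output field names are hypothetical.

```python
def parse_memetracker(lines):
    """Group tagged lines into documents: P = page URL, T = time, Q = quote, L = link."""
    docs, doc = [], None
    for line in lines:
        parts = line.strip().split(None, 1)
        if len(parts) != 2:
            continue
        tag, value = parts
        if tag == "P":  # assumption: a P line opens a new document
            doc = {"url": value, "time": None, "quotes": [], "links": []}
            docs.append(doc)
        elif doc is None:
            continue
        elif tag == "T":
            doc["time"] = value
        elif tag == "Q":
            doc["quotes"].append(value)
        elif tag == "L":
            doc["links"].append(value)
    return docs

sample = [
    "P http://cnnpoliticalticker.wordpress.com/2008/08/31/mccain-defends-palins-experience-level",
    "T 2008-09-01 00:00:13",
    "Q dangerously unprepared to be president",
    "Q even more dangerously unprepared",
    "Q understands the challenges that we face",
    "Q worked and succeeded",
    "Q still to this day refuses to acknowledge that the surge has succeeded",
    "L http://www.cnn.com",
]
docs = parse_memetracker(sample)
```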

  • Find all variants (mutations) of the same phrase: cluster phrases based on edit distance and time:

    lipstick on a pig

    you can put lipstick on a pig

    you can put lipstick on a pig but it's still a pig

    i think they put some lipstick on a pig but it's still a pig

    putting lipstick on a pig

    Temporal variations of the phrase volume
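
Clustering phrase variants by edit distance might look like the sketch below. It is one possible approach under stated assumptions: word-level Levenshtein distance, normalized by the longer phrase, with single-link merging at a hypothetical threshold of 0.5. It ignores the time dimension, assumes non-empty phrases, and compares all pairs, so it would not scale to the full dataset as written.

```python
def word_edit_distance(a, b):
    """Levenshtein distance between two word lists (each operation costs 1)."""
    prev = list(range(len(b) + 1))
    for i, wa in enumerate(a, 1):
        cur = [i]
        for j, wb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete wa
                           cur[j - 1] + 1,              # insert wb
                           prev[j - 1] + (wa != wb)))   # substitute or match
        prev = cur
    return prev[-1]

def cluster_phrases(phrases, threshold=0.5):
    """Single-link clustering of phrases by normalized word edit distance."""
    parent = list(range(len(phrases)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    words = [p.split() for p in phrases]
    for i in range(len(phrases)):
        for j in range(i + 1, len(phrases)):
            dist = word_edit_distance(words[i], words[j])
            if dist / max(len(words[i]), len(words[j])) <= threshold:
                parent[find(i)] = find(j)
    clusters = {}
    for i, phrase in enumerate(phrases):
        clusters.setdefault(find(i), []).append(phrase)
    return list(clusters.values())

phrases = [
    "lipstick on a pig",
    "you can put lipstick on a pig",
    "you can put lipstick on a pig but it's still a pig",
    "i think they put some lipstick on a pig but it's still a pig",
    "putting lipstick on a pig",
    "worked and succeeded",
]
clusters = cluster_phrases(phrases)
```

Single-link merging lets a short phrase and a much longer variant end up together through intermediate variants, even when their direct distance is large.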

  • Predict the popularity of a phrase over time

    How does information mutate/change over time?

    Which media sites are the most influential? Build a predictive model of site influence

    Which nodes are early mentioners, latecomers, summarizers?

    Sentiment analysis: are people positive/negative about something (news, a product)

    Create a model of political bias (liberal vs. conservative)

    What is genuine news, what are genuine phrases and what is spam?

  • We also have the Wikipedia webserver logs, i.e., page visit statistics

    How do Wiki page visit statistics correlate with external events, natural disasters?

    Use Twitter or MemeTracker data to detect those

    Compare occurrence of phrases and visits to Wikipedia pages
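
One simple way to compare phrase occurrences with page visits is a lagged Pearson correlation between two daily count series. The sketch below is only an illustration: the function names are hypothetical and the numbers are made up for the example, not taken from the datasets.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation of two equal-length, non-constant series."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)

def best_lag(phrase_counts, page_visits, max_lag=3):
    """Return (lag, r): the delay in days at which visits best track phrase counts."""
    results = []
    for lag in range(max_lag + 1):
        xs = phrase_counts[:len(phrase_counts) - lag]
        ys = page_visits[lag:]
        results.append((lag, pearson(xs, ys)))
    return max(results, key=lambda pair: pair[1])

# Made-up daily counts: visits follow the phrase volume one day later.
phrase_counts = [1, 2, 8, 3, 1, 1, 1]
page_visits = [10, 10, 20, 80, 30, 10, 10]
lag, r = best_lag(phrase_counts, page_visits)
```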

  • A large IM buddy graph from March 2005

    230 million nodes

    7,340 million undirected edges

    Limitations: Only have the buddy graph with random node ids. No communication or edge strength.

  • Find communities, clusters in such a big graph

    Count frequent subgraphs

    Design algorithms to characterize the structure of the network as a whole
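
The smallest interesting subgraph to count is the triangle. The sketch below counts triangles in an undirected edge list held in memory; it is meant only to show the idea on a toy graph (node ids assumed comparable, e.g. integers), since a graph with billions of edges would need a MapReduce-style formulation.

```python
from collections import defaultdict

def count_triangles(edges):
    """Count triangles in an undirected graph given as (u, v) pairs."""
    higher = defaultdict(set)  # node -> neighbors with a larger id
    for u, v in edges:
        if u != v:
            lo, hi = (u, v) if u < v else (v, u)
            higher[lo].add(hi)
    triangles = 0
    for u, nbrs in higher.items():
        for v in nbrs:
            # Any w adjacent to both u and v with u < v < w closes a triangle.
            triangles += len(nbrs & higher.get(v, set()))
    return triangles

toy_edges = [(1, 2), (2, 3), (1, 3), (3, 4), (4, 5), (3, 5), (5, 6)]
n_triangles = count_triangles(toy_edges)
```

Orienting every edge from the smaller to the larger id ensures each triangle is counted exactly once.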

  • Movie ratings: Netflix prize dataset: http://www.netflixprize.com/

    Yahoo Music ratings: Yahoo Music user ratings of songs with artist, album and genre information

    717 million ratings, 136,000 songs, 1.8 million users

    Restaurant reviews

  • Collaborative filtering: Predict what ratings a user will give to particular songs/movies, i.e., which songs will he/she like?

    Supplement the data with additional data sources: movies (IMDB), playlists from the web, lyrics (text of the song)

    Include taste, temporal component, diversity into the model
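
A common starting point for rating prediction is a baseline that adds a per-item and a per-user offset to the global mean. The sketch below is such a baseline under simple assumptions (unregularized averages, ratings held in memory as (user, item, rating) triples); the function name and the toy ratings are hypothetical, and taste, temporal, or diversity components would have to be layered on top.

```python
from collections import defaultdict

def fit_baseline(ratings):
    """Fit global mean + item bias + user bias; return a predict(user, item) function."""
    mu = sum(r for _, _, r in ratings) / len(ratings)
    item_res = defaultdict(list)
    for _, item, r in ratings:
        item_res[item].append(r - mu)
    item_bias = {i: sum(v) / len(v) for i, v in item_res.items()}
    user_res = defaultdict(list)
    for user, item, r in ratings:
        user_res[user].append(r - mu - item_bias[item])
    user_bias = {u: sum(v) / len(v) for u, v in user_res.items()}

    def predict(user, item):
        # Unknown users or items fall back to a bias of zero.
        return mu + user_bias.get(user, 0.0) + item_bias.get(item, 0.0)

    return predict

toy_ratings = [("ann", "s1", 5), ("ann", "s2", 3), ("bob", "s1", 4), ("bob", "s2", 2)]
predict = fit_baseline(toy_ratings)
```

Any richer model can be evaluated by how much it improves on this baseline's prediction error.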

  • Stanford Search Queries

    New York Times articles since 1987

    Articles are manually annotated by subject categories and keywords

    Entity or relation extraction

    Extract keywords, predict article category

    Don't feel limited by these

    You can collect the dataset yourself

    And define the project/question yourself