CS345a: Data Mining
Jure Leskovec and Anand Rajaraman, Stanford University
Friday 5:30-7:30pm at Gates B12
You will learn and get hands-on experience on:
- Logging in to Amazon EC2 and requesting a cluster
- Running Hadoop MapReduce jobs
- Using Aster nCluster software
Amazon gave us $12k of computing time
- Each student has about $200 worth of computing time
1/21/2010 Jure Leskovec & Anand Rajaraman, Stanford CS345a: Data Mining
Ideally teams of 2 students (1 or 3 is also OK)
Project:
- Discovers interesting relationships within a significant amount of data
- Have some original idea that extends/builds on what we learned in class
  - Extend/improve/speed up some existing algorithm
  - Define a new problem and solve it
Answer the following questions:
- What is the problem you are solving?
- What data will you use (where will you get it)?
- How will you do it?
  - Which algorithms/techniques do you plan to use?
  - Be as specific as you can!
- How will you evaluate and measure success?
- What do you expect to submit at the end of the quarter?
Due at midnight on Feb 1, 2010
- Email the PDF to [email protected]
- TAs will assign group numbers
- Name your file: _proposal.pdf
- Wikipedia
- IM buddy graph
- Yahoo Altavista web graph
- Stanford WebBase
- Twitter data
- Blogs and news data
- Netflix
- Restaurant reviews
- Yahoo Music ratings
Complete edit history of Wikipedia until January 2008
- For every single edit the complete snapshot of the article is saved
Each page has a talk page:
Talk page:
- Editors discuss things like:
Every registered user has a page:
Every user's page has a talk page:
- Users discuss things:
Anarchism
12 18201 2002-02-25T15:00:22Z
Conversion script
Automated conversion
''Anarchism'' is the political theory that advocates the abolition of all forms of government. ...
19746 2002-02-25T15:43:11Z
140.232.153.45
*''Anarchism'' is the political theory that advocates the abolition of all forms of government. ...
Complete edit and talk history of Wikipedia:
- How do articles evolve?
  - Use a string-edit-distance-like approach to measure differences between versions of the article
  - Model the evolution of the content
- Which users make what types of edits?
  - Big vs. small changes, reorganization?
  - Suggest which user should edit the page?
- How do users talk and then edit the same pages?
  - Do users first talk and then edit?
  - Is it the other way around?
  - Suggest to users which pages to edit
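One way to make the edit-distance-like comparison of article versions concrete is a word-level diff. The sketch below is only an illustration: it uses Python's standard `difflib.SequenceMatcher` rather than a true Levenshtein distance, and the function name `revision_diff` and the word-level tokenization are assumptions, not part of the course material.

```python
from difflib import SequenceMatcher


def revision_diff(old_text, new_text):
    """Summarize how much a revision changed, counted in words.

    Assumption: splitting on whitespace is a good enough tokenization.
    """
    old, new = old_text.split(), new_text.split()
    counts = {"insert": 0, "delete": 0, "replace": 0}
    matcher = SequenceMatcher(None, old, new, autojunk=False)
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "insert":
            counts["insert"] += j2 - j1
        elif op == "delete":
            counts["delete"] += i2 - i1
        elif op == "replace":
            counts["replace"] += max(i2 - i1, j2 - j1)
    # ratio() is 1.0 for identical sequences, so this is 0.0 when nothing changed
    counts["changed_fraction"] = 1.0 - matcher.ratio()
    return counts
```

Comparing consecutive snapshots of a page this way gives a rough signal for separating big changes from small ones.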
Altavista web graph from 2002:
- Nodes are web pages
- Directed edges are hyperlinks
- 1.4 billion public web pages
- Several billion edges
For each node we also know the page URL
Spam:
- Use the web graph structure to more efficiently extract spam web pages
  - Link farms
  - Spider traps
Personalized and topic-sensitive PageRank
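As a reminder of what topic-sensitive PageRank computes, here is a minimal power-iteration sketch. It assumes the graph fits in memory as a dictionary of out-links, which is not true of the full Altavista graph; the function name, the `beta` default, and the choice to send dead-end mass back to the topic set are illustrative assumptions.

```python
def topic_sensitive_pagerank(out_links, topic_pages, beta=0.85, iterations=50):
    """Power iteration where random jumps land only on pages in topic_pages.

    out_links: dict mapping a page to the list of pages it links to.
    topic_pages: non-empty set of pages that appear in the graph.
    """
    nodes = sorted(set(out_links) | {v for targets in out_links.values() for v in targets})
    teleport = {n: (1.0 / len(topic_pages) if n in topic_pages else 0.0) for n in nodes}
    rank = dict(teleport)
    for _ in range(iterations):
        nxt = {n: 0.0 for n in nodes}
        for u in nodes:
            targets = out_links.get(u, [])
            if targets:
                share = beta * rank[u] / len(targets)
                for v in targets:
                    nxt[v] += share
        # Mass lost to teleportation and to dead ends goes back to the topic set
        leaked = 1.0 - sum(nxt.values())
        for n in nodes:
            nxt[n] += leaked * teleport[n]
        rank = nxt
    return rank
```

Setting `topic_pages` to a single user's bookmarks gives the personalized variant; setting it to all pages recovers ordinary PageRank.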
Website structure identification:
- From the web graph extract websites
- What are common navigational structures of websites?
  - Cluster website graphs
  - Identify common subgraphs and patterns
- What roles do pages/links play in the graph:
  - Content pages
  - Navigational pages
  - Index pages
- Build a summary/map of the website
A collection of focused snapshots of the Web
- Data starts in 2004 and continues till today
General crawls
- Start from ~1000 seed web pages
- Crawl up to ~150,000 pages per site
Specialized crawls:
- Universities
- US Government
- Hurricane Katrina (2005) daily crawls
- Monthly newspaper crawls
Smaller than Altavista, but you also have the page content
- Can do topic analysis
- Topic-sensitive PageRank
Study the evolution of websites and web pages
50 million tweets per month starting June 2009 (6 months)
Format:
    T 2009-06-07 02:07:42
    U http://twitter.com/redsoxtweets
    W #redsox Extra Bases: Sox win, 8-1: The Rangers spoiled Jon Lester's perfecto and his shutout.. http://tinyurl.com/pyhgwy
Two important things:
- URLs
- Hashtags
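A small reader for the T/U/W records makes it easier to get at the URLs and hashtags. This is a sketch under assumptions: that each field sits on its own line, that the tag is separated from the value by whitespace, and that every record starts with a `T` line. The regular expressions are deliberately simple and the function names are hypothetical.

```python
import re

HASHTAG = re.compile(r"#\w+")
URL = re.compile(r"https?://\S+")


def _finish(record):
    """Turn the raw T/U/W fields of one tweet into a dict."""
    text = record.get("W", "")
    return {
        "time": record.get("T"),
        "user": record.get("U"),
        "text": text,
        "hashtags": HASHTAG.findall(text),
        "urls": URL.findall(text),
    }


def parse_tweets(lines):
    """Yield one dict per tweet from T/U/W-tagged lines."""
    record = {}
    for line in lines:
        parts = line.strip().split(None, 1)
        if len(parts) != 2 or parts[0] not in ("T", "U", "W"):
            continue  # skip blank or unrecognized lines
        tag, value = parts
        if tag == "T" and record:
            # A new timestamp line starts the next tweet
            yield _finish(record)
            record = {}
        record[tag] = value
    if record:
        yield _finish(record)
```

Since `parse_tweets` is a generator, it can be fed an open file handle and will stream through a month of tweets without loading them all into memory.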
- Trending topics: rising, falling
- Inferring links of the who-follows-whom network
- What are the life cycles of URLs and hashtags?
- Finding early/influential users?
- Clustering tweets by topic or category
- Sentiment analysis: are people positive/negative about something (a product?)
More than 1 million news media and blog articles per day since August 2008
- Extract phrases (quotes) and links
- http://memetracker.org
Format:
    P http://cnnpoliticalticker.wordpress.com/2008/08/31/mccain-defends-palins-experience-level
    T 2008-09-01 00:00:13
    Q dangerously unprepared to be president
    Q even more dangerously unprepared
    Q understands the challenges that we face
    Q worked and succeeded
    Q still to this day refuses to acknowledge that the surge has succeeded
    L http://www.cnn.com
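The P/T/Q/L records can be read into per-document dictionaries and then aggregated, for example into daily counts per phrase. The sketch below assumes one tagged field per line, whitespace between tag and value, and that each document begins with a `P` line; `parse_documents` and `daily_phrase_volume` are hypothetical helper names.

```python
from collections import Counter, defaultdict


def parse_documents(lines):
    """Yield one dict per document from P/T/Q/L-tagged lines."""
    doc = None
    for line in lines:
        parts = line.strip().split(None, 1)
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        tag, value = parts
        if tag == "P":
            if doc:
                yield doc
            doc = {"url": value, "time": None, "quotes": [], "links": []}
        elif doc is None:
            continue  # ignore fields that appear before the first P line
        elif tag == "T":
            doc["time"] = value
        elif tag == "Q":
            doc["quotes"].append(value)
        elif tag == "L":
            doc["links"].append(value)
    if doc:
        yield doc


def daily_phrase_volume(docs):
    """Map each phrase to a Counter of {YYYY-MM-DD: number of mentions}."""
    volume = defaultdict(Counter)
    for doc in docs:
        day = (doc["time"] or "")[:10]
        for quote in doc["quotes"]:
            volume[quote][day] += 1
    return volume
```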
Find all variants (mutations) of the same phrase
- Cluster phrases based on edit distance and time:
    lipstick on a pig
    you can put lipstick on a pig
    you can put lipstick on a pig but it's still a pig
    i think they put some lipstick on a pig but it's still a pig
    putting lipstick on a pig
Temporal variations of the phrase volume
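To illustrate clustering phrase variants by edit distance, the sketch below links two phrases whenever their word-level Levenshtein distance, divided by the length of the longer phrase, is at most a threshold, and then takes connected components. It is a simplification under stated assumptions: it ignores the time dimension, compares all pairs (so it only suits small phrase sets), and the 0.5 threshold is an arbitrary illustrative choice.

```python
def edit_distance(a, b):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(b) + 1))
    for i, tok_a in enumerate(a, start=1):
        curr = [i]
        for j, tok_b in enumerate(b, start=1):
            cost = 0 if tok_a == tok_b else 1
            curr.append(min(prev[j] + 1,          # delete tok_a
                            curr[j - 1] + 1,      # insert tok_b
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def cluster_phrases(phrases, threshold=0.5):
    """Single-link clustering: phrases joined by a chain of close pairs share a cluster."""
    parent = list(range(len(phrases)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    tokens = [p.split() for p in phrases]
    for i in range(len(phrases)):
        for j in range(i + 1, len(phrases)):
            longest = max(len(tokens[i]), len(tokens[j]), 1)
            if edit_distance(tokens[i], tokens[j]) / longest <= threshold:
                parent[find(j)] = find(i)

    clusters = {}
    for i, phrase in enumerate(phrases):
        clusters.setdefault(find(i), []).append(phrase)
    return list(clusters.values())
```

With this threshold the five "lipstick on a pig" variants end up in one cluster because each is close to at least one other variant, even though the shortest and longest are far apart.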
- Predict the popularity of a phrase over time
- How does information mutate/change over time?
- Which media sites are the most influential? Build a predictive model of site influence
- Which nodes are early mentioners, latecomers, summarizers?
- Sentiment analysis: are people positive/negative about something (news, a product)
- Create a model of political bias (liberal vs. conservative)
- What is genuine news, what are genuine phrases, and what is spam?
We also have the Wikipedia web server logs, i.e., page visit statistics
How do Wiki page visit statistics correlate with external events, natural disasters?
- Use Twitter or MemeTracker data to detect those
- Compare occurrence of phrases and visits to Wikipedia pages
A large IM buddy graph from March 2005
- 230 million nodes
- 7,340 million undirected edges
Limitations:
- Only have the buddy graph with random node ids
- No communication or edge strength
- Find communities, clusters in such a big graph
- Count frequent subgraphs
- Design algorithms to characterize the structure of the network as a whole
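Triangles are the simplest subgraph worth counting, and the count already says something about how clustered the network is. The sketch below is an in-memory illustration only; at 7,340 million edges the same idea would have to be distributed, for example as a MapReduce job. The function name and the edge-list input format are assumptions.

```python
from collections import defaultdict


def count_triangles(edges):
    """Count triangles in an undirected graph given as (u, v) pairs of comparable node ids."""
    neighbors = defaultdict(set)
    for u, v in edges:
        if u != v:  # ignore self-loops
            neighbors[u].add(v)
            neighbors[v].add(u)
    triangles = 0
    for u in neighbors:
        for v in neighbors[u]:
            if u < v:
                # Count each triangle once, in the order u < v < w
                triangles += sum(1 for w in neighbors[u] & neighbors[v] if w > v)
    return triangles
```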
Movie ratings:
- Netflix prize dataset: http://www.netflixprize.com/
Yahoo Music ratings:
- Yahoo Music user ratings of songs with artist, album and genre information
- 717 million ratings
- 136,000 songs
- 1.8 million users
Restaurant reviews
Collaborative filtering:
- Predict what ratings a user will give to particular songs/movies, i.e., which songs will he/she like?
- Supplement the data with additional data sources:
  - Movies: IMDB
  - Playlists from the web
  - Lyrics (text of the song)
- Include taste, temporal component, diversity into the model
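Before building a full collaborative-filtering model, it helps to have a baseline to beat. The sketch below predicts a rating as the global mean plus an item offset and a user offset. It is only a starting point, not a neighborhood or latent-factor method, and the function names and the (user, item, rating) tuple format are assumptions.

```python
from collections import defaultdict


def fit_baseline(ratings):
    """Fit global mean, item offsets, then user offsets from (user, item, rating) tuples."""
    mu = sum(r for _, _, r in ratings) / len(ratings)
    item_dev = defaultdict(list)
    for _, item, r in ratings:
        item_dev[item].append(r - mu)
    item_bias = {i: sum(d) / len(d) for i, d in item_dev.items()}
    user_dev = defaultdict(list)
    for user, item, r in ratings:
        # What is left of the rating after the mean and the item offset
        user_dev[user].append(r - mu - item_bias[item])
    user_bias = {u: sum(d) / len(d) for u, d in user_dev.items()}
    return mu, user_bias, item_bias


def predict(model, user, item):
    """Predicted rating; unseen users or items fall back to a zero offset."""
    mu, user_bias, item_bias = model
    return mu + user_bias.get(user, 0.0) + item_bias.get(item, 0.0)
```

Any extension that adds taste, temporal, or diversity components can be measured by how much it improves on this baseline's error on held-out ratings.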
Stanford search queries
New York Times articles since 1987
- Articles are manually annotated by subject categories and keywords
- Entity or relation extraction
- Extract keywords, predict article category
Don't feel limited by these
- You can collect the dataset yourself
- And define the project/question yourself