Upload
others
View
1
Download
0
Embed Size (px)
Citation preview
CS224W: Social and Information Network AnalysisJure Leskovec Stanford UniversityJure Leskovec, Stanford University
http://cs224w.stanford.edu
Teams of 2‐3 students (1 is also ok) Teams of 2 3 students (1 is also ok) Project: Experimental evaluation of algorithms and modelsExperimental evaluation of algorithms and models on an interesting dataset A theoretical project that considers a model, an algorithm or a network property and derives a rigorous result about it An in depth critical survey of one of the course An in‐depth critical survey of one of the course topics relating models, experimental results and underlying social theories and offering a novel perspective on the area
10/11/2010 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 2
Answer the following questions: Answer the following questions: What is the problem you are solving? Wh t d t ill (h ill t it)? What data will you use (how will you get it)? How will you do the project?
Whi h l ith /t h i / d l l t Which algorithms/techniques/models you plan to use/develop? Be as specific as you can!p y
Who will you evaluate, measure success? What do you expect to submit/accomplish byWhat do you expect to submit/accomplish by the end of the quarter?
10/11/2010 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 3
The project should contain at least some amount of p jmathematical analysis, and some experimentation on real or synthetic data
h l f h j ill i ll b 10 The result of the project will typically be a 10 page paper, describing the approach, the results, and the related work.
Due on midnight OCT 18 2010
Upload PDF to http://coursework.stanford.eduUpload PDF to http://coursework.stanford.edu
TAs will assign group numbers – we will send a link to a GoogleDocg
Name your file: <group#>_proposal.pdf10/11/2010 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 4
Wikipedia Wikipedia IM buddy graph Yahoo Altavista web graph Yahoo Altavista web graph Stanford WebBase Twitter Data Twitter Data Blogs and news data Yahoo Music Ratings Yahoo Music Ratings
10/11/2010 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 5
Richly labeled network containing extracted Richly labeled network containing extracted data from Wikipedia (based on infoboxes): Richly labeled networkRichly labeled network multiple types of nodes and edges About 2.6 million concepts described by 247 million triples, including abstracts in 14 different languages http://dbpedia org http://dbpedia.org
Other OpenLinkedData datasets available at http://esw.w3.org/DataSetRDFDumps
10/11/2010 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 6
Networks of positive and negative edges Networks of positive and negative edges Data includes: Trust/distrust edges Trust/distrust edges Also Epinions product reviews and review ratings
SNAP: http://snap stanford edu/data/#signnetsSNAP: http://snap.stanford.edu/data/#signnets Trustlet: http://www.trustlet.org/wiki
10/11/2010 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 7
Prosper marketplace – Peer to peer lending: Prosper marketplace – Peer‐to‐peer lending: Lenders ask for loans P l th bid ( i i t t t ) l t People then bid (price, interest rate) on loans to fund them Rich social structure around the website Rich social structure around the website
Data at http://www prosper com/tools/DataExport aspxData at http://www.prosper.com/tools/DataExport.aspx
10/11/2010 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 8
Turiya is a start up that collects game data from gameTuriya is a start up that collects game data from game publishers and processes these to produce business intelligence of value to it’s clients
Data collected includes: Data collected includes: Players and their attributes Logs of game eventsg g Information about virtual items Information about transactions in real money or creditsA l i l d Analyses include: Player segmentation Virtual goods recommendations If i t t d Virtual goods recommendations Lifetime value estimation of players
If you are interested – send us an email!
10/11/2010 9Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu
What to Wear is a Social Game played on FacebookC t t t t tfit d b it th t d il Contestants create outfits and submit these to a daily competition, which has a theme like e.g. “an outfit for attending your ex’s‐wedding”
Contestants can also vote and comment on other people’sContestants can also vote and comment on other people s submissions
You get credit for both participating and judging Items for outfits are either bought from the store or reused fromItems for outfits are either bought from the store or reused from the contestant’s closet
~30,000 players/month Data about this game includes: Player data Data about previous competitions Fashion items data
If i t t d Data about outfits Many other data (~400 relations in all)
If you are interested – send us an email!
10/11/2010 10Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu
Amazon product review data: Amazon product review data:For each product: P d t i f l k Product info: name, salesrank Product categorizationAll i All reviews user, rating, how helpful was the review
P l h b ht X l b ht Y t k! People who bought X also bought Y – network!
If i t t d
10/11/2010 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 11
If you are interested – send us an email!
Collaboration network of computer scientists Collaboration network of computer scientists Each CS publication is included: Author names Author names Title YearYear Conference, journal name
Get the data at: http://dblp.uni‐trier.de/xml/ http://kdl.cs.umass.edu/data/dblp/dblp‐info.htmlp // / / p/ p
10/11/2010 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 12
Patents (http://www.nber.org/patents/)( p // g/p /) Citations between patents For each patent we also know: Time Time Patent categorization Patent inventor data, …
Arxiv High‐energy Physics:g e e gy ys cs Citation network between papers For each paper we also know Author names Author names Title and abstract of the paper Year of publication JournalJournal
Data at: http://snap.stanford.edu/data/#citnets
10/11/2010 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 13
~50 million tweets per month starting50 million tweets per month starting in June 2009 (6 months)
Format:2009 06 07 02 07 42T 2009-06-07 02:07:42
U http://twitter.com/redsoxtweetsW #redsox Extra Bases: Sox win, 8-1: The Rangers spoiled Jon Lester's perfecto and his shutout.. http://tinyurl.com/pyhgwyhttp://tinyurl.com/pyhgwy
Two important things: URLs H h t
If you are interested send us an email! Hash‐tags
Twitter social graph and some profiles: http://an kaist ac kr/traces/WWW2010 html
– send us an email!
http://an.kaist.ac.kr/traces/WWW2010.html
10/11/2010 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 14
Inferring links of the who‐follows‐whom Inferring links of the who follows whom network
h h l f l f d h h ? What is the lifecycle of URLs and hash‐tags? How do hash‐tags get adopted?
M l i l i h h hi h i ? Multiple competing hash‐tags, which one wins?
Finding early/influential users?
Community discovery
Where/how will the information propagate?10/11/2010 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 15
More than 1 million newsmedia and blog More than 1 million newsmedia and blog articles per day since August 2008
Extracted phrases (quotes) and links Extracted phrases (quotes) and links http://memetracker.org Format: Format:P http://cnnpoliticalticker.wordpress.com/2008/08/31/mccain-defends-
palins-experience-levelT 2008-09-01 00:00:13Q dangerously unprepared to be presidentQ dangerously unprepared to be presidentQ even more dangerously unpreparedQ understands the challenges that we faceQ worked and succeededL http://www.cnn.com
10/11/2010 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 16
How does information mutate/change over time?How does information mutate/change over time?
Which media sites are the most influential? Build a predictive model of site influencepredictive model of site influence
Role discovery: Which nodes are early adopters, late comers, summarizers?late comers, summarizers?
Create a model of political bias (liberal vs. conservative)conservative)
What is genuine news, what are genuine phrases and what is spam?and what is spam?
10/11/2010 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 17
About the Dataset:About the Dataset: 6.5 million legal opinions from the United States Judiciary from 1900 to the present;d t li k d (l t f t li ) documents are linked (later cases refer to earlier ones)
the documents are both stored in raw form on Amazon S3 and also have been pre‐processed for analysis by Hadoop
Project ideas: label cases as pro‐plaintiff or pro‐defendant run PageRank Hub Authorities or other graph algorithms run PageRank, Hub‐Authorities, or other graph algorithms on the documents (they are hyperlinked)
identify legally important concepts
10/11/2010 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 18
If you are interested – send us an email!
Complete edit history of Wikipedia until Complete edit history of Wikipedia until January 2008
For every single edit the complete For every single edit the complete snapshot of the article is saved
Each page has a talk page: Each page has a talk page:
10/11/2010 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 19
Talk page: Talk page:
Editors discuss things like: Editors discuss things like:
10/11/2010 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 20
Every registered use has a personal page: Every registered use has a personal page:
10/11/2010 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 21
Every user’s page has a talk page: Every user s page has a talk page:
Users discuss things: Users discuss things:
10/11/2010 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 22
We have nicely parsed Wikipedia dataWe have nicely parsed Wikipedia data Each edit: REVISION 4781981 72390319 Steven_Strogatz 2006-08-28T14:11:16Z SmackBot
433328 CATEGORY American mathematiciansCATEGORY American_mathematicians MAIN Boston_University MIT Harvard_University Cornell_University OTHER De:Steven_Strogatz Es:Steven_Strogatz EXTERNAL http://www.edge.org/3rd_culture/bios/strogatz.html TEMPLATE Cite_book Cite_book Cite_journal COMMENT ISBN formatting &/or general fixes using [[WP:AWB|AWB]] COMMENT ISBN formatting &/or general fixes using [[WP:AWB|AWB]] MINOR 1 TEXTDATA 229
Can identify networks:Wh t lk t h
If you are interested Who talks to whom Who edits what
Also: Wikipedia has elections for admins, articles
– send us an email!
p ,get reverted, disputes resolved, …
10/11/2010 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 23
We also have the Wikipedai webserver logs, i.e.,We also have the Wikipedai webserver logs, i.e., page visit statistics http://dammit.lt/wikistats/h //l /bl / http://lmonson.com/blog/ http://developer.amazonwebservices.com/connect/entry.jspa?externalID=2596
How does Wiki page visit statistics correlate with external events, natural disasters?, Use Twitter or MemeTracker data to detect those Compare occurrence of phrases and visits to Wikipedia pagespages
10/11/2010 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 24
Altavista web graph from 2002: Altavista web graph from 2002: Nodes are webpages Di t d d h li k Directed edges are hyperlinks 1.4 billion public webpagesS l billi d Several billion edges For each node we also know the page URL
10/11/2010 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 25
SPAM: SPAM: Use the web‐graph structure to more efficiently extractto more efficiently extract spam webpages Link farmsLink farms Spider traps
Personalized and topic‐Personalized and topicsensitive PageRank
10/11/2010 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 26
Website structure identification: Website structure identification: From the webgraph extract “websites” What are common navigational structures ofWhat are common navigational structures of websites? Cluster website graphs Identify common subgraphs and patterns What are roles pages/links play in the graph: C t t Content pages Navigational pages Index pagesp g Build a summary/map of the website
10/11/2010 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 27
A collection of focused snapshots of the Web A collection of focused snapshots of the Web Data starts in 2004 and continues till today General crawls General crawls start from ~1000 seed webpages Crawl up to ~150 000 pager per siteCrawl up to 150,000 pager per site
Specialized crawls: UniversitiesUniversities US Government Hurricane Katrina (2005) – daily crawls Monthly newspaper crawls
10/11/2010 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 28
Smaller than Altavista but you Smaller than Altavista but you also have the page content
Study the evolution of the webgraph How does website structure change and evolve gover time How do webpages (webpage structure) change p g ( p g ) gover time
10/11/2010 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 29
A large IM buddy graph from March 2005 A large IM buddy graph from March 2005 230 million nodes 7 340 million undirected edges 7,340 million undirected edges
Limitations: Only have the buddy graph with random node ids No communication or edge strengthNo communication or edge strength
10/11/2010 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 30
Find communities clusters in such a big graph Find communities, clusters in such a big graph
Count frequent subgraphsq g p
Design algorithms to characterize the fstructure of the network as a whole
10/11/2010 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 31
Stanford Search Queries Stanford Search Queries New York Times articles since 1987 Article are manually annotated by subjectArticle are manually annotated by subject categories and keywords Entity or relation extraction Extract keywords, predict article category
’ f l l d b h Don’t feel limited by these You can collect the dataset yourself And define the project/question yourself And define the project/question yourself
10/11/2010 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 32