CS224W: Social and Information Network Analysis Jure Leskovec Stanford University Jure Leskovec, Stanford University http://cs224w.stanford.edu

Stanford University ://snap.stanford.edu/.../slides/06-projects.pdf · 2020-01-09 · Project: Experimental evaluation of algorithms and models on an interesting dataset A theoretical

Page 1:

CS224W: Social and Information Network Analysis
Jure Leskovec, Stanford University
http://cs224w.stanford.edu

Page 2:

• Teams of 2-3 students (1 is also ok)
• Project:
  • Experimental evaluation of algorithms and models on an interesting dataset
  • A theoretical project that considers a model, an algorithm or a network property and derives a rigorous result about it
  • An in-depth critical survey of one of the course topics relating models, experimental results and underlying social theories and offering a novel perspective on the area

10/11/2010 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 2

Page 3:

Answer the following questions:
• What is the problem you are solving?
• What data will you use (how will you get it)?
• How will you do the project?
  • Which algorithms/techniques/models do you plan to use/develop? Be as specific as you can!
• How will you evaluate, measure success?
• What do you expect to submit/accomplish by the end of the quarter?

Page 4:

• The project should contain at least some amount of mathematical analysis, and some experimentation on real or synthetic data
• The result of the project will typically be a 10 page paper, describing the approach, the results, and the related work.
• Due at midnight on OCT 18 2010
• Upload PDF to http://coursework.stanford.edu
• TAs will assign group numbers – we will send a link to a GoogleDoc
• Name your file: <group#>_proposal.pdf

Page 5:

• Wikipedia
• IM buddy graph
• Yahoo Altavista web graph
• Stanford WebBase
• Twitter Data
• Blogs and news data
• Yahoo Music Ratings

Page 6:

• Richly labeled network containing extracted data from Wikipedia (based on infoboxes):
  • Richly labeled network: multiple types of nodes and edges
  • About 2.6 million concepts described by 247 million triples, including abstracts in 14 different languages
  • http://dbpedia.org
• Other OpenLinkedData datasets available at http://esw.w3.org/DataSetRDFDumps

Page 7:

• Networks of positive and negative edges
• Data includes:
  • Trust/distrust edges
  • Also Epinions product reviews and review ratings
• SNAP: http://snap.stanford.edu/data/#signnets
• Trustlet: http://www.trustlet.org/wiki

Page 8:

• Prosper marketplace – Peer-to-peer lending:
  • Borrowers ask for loans
  • People then bid (price, interest rate) on loans to fund them
  • Rich social structure around the website
• Data at http://www.prosper.com/tools/DataExport.aspx

Page 9:

• Turiya is a start-up that collects game data from game publishers and processes these to produce business intelligence of value to its clients
• Data collected includes:
  • Players and their attributes
  • Logs of game events
  • Information about virtual items
  • Information about transactions in real money or credits
• Analyses include:
  • Player segmentation
  • Virtual goods recommendations
  • Lifetime value estimation of players
• If you are interested – send us an email!

Page 10:

• What to Wear is a Social Game played on Facebook
• Contestants create outfits and submit these to a daily competition, which has a theme, e.g. "an outfit for attending your ex's wedding"
• Contestants can also vote and comment on other people's submissions
• You get credit for both participating and judging
• Items for outfits are either bought from the store or reused from the contestant's closet
• ~30,000 players/month
• Data about this game includes:
  • Player data
  • Data about previous competitions
  • Fashion items data
  • Data about outfits
  • Many other data (~400 relations in all)
• If you are interested – send us an email!

Page 11:

• Amazon product review data:
  • For each product:
    • Product info: name, salesrank
    • Product categorization
    • All reviews: user, rating, how helpful was the review
• People who bought X also bought Y – network!
• If you are interested – send us an email!
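
The "people who bought X also bought Y" relation can be read as a directed graph over products. The following is a minimal Python sketch under stated assumptions: the record layout and the product ids are hypothetical and are not the actual format of the Amazon data.

```python
from collections import defaultdict

# Hypothetical records: product id -> products listed as "also bought".
also_bought = {
    "A": ["B", "C"],
    "B": ["A"],
    "C": ["A", "B"],
    "D": [],
}

def build_copurchase_graph(records):
    """Return (out_edges, in_degree) for the directed co-purchase graph."""
    out_edges = defaultdict(set)
    in_degree = defaultdict(int)
    for product, others in records.items():
        in_degree[product] += 0  # make sure products with no in-links appear
        for other in others:
            # Skip self-loops and repeated listings of the same product.
            if other != product and other not in out_edges[product]:
                out_edges[product].add(other)
                in_degree[other] += 1
    return dict(out_edges), dict(in_degree)

out_edges, in_degree = build_copurchase_graph(also_bought)
```

In-degree here is one simple candidate signal (how often a product is recommended from other products); combining it with salesrank or review ratings would be a separate modelling choice.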

Page 12:

• Collaboration network of computer scientists
• Each CS publication is included:
  • Author names
  • Title
  • Year
  • Conference, journal name
• Get the data at:
  • http://dblp.uni-trier.de/xml/
  • http://kdl.cs.umass.edu/data/dblp/dblp-info.html
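
One way to turn publication records into a collaboration network is to connect every pair of authors who share a paper. The sketch below assumes the records have already been extracted into Python dictionaries; the field names and the sample records are hypothetical, and parsing the DBLP XML itself is not shown.

```python
from collections import Counter
from itertools import combinations

# Hypothetical publication records (author names, title, year).
publications = [
    {"authors": ["Ann", "Bob", "Cy"], "title": "Paper 1", "year": 2008},
    {"authors": ["Ann", "Bob"], "title": "Paper 2", "year": 2009},
    {"authors": ["Dee"], "title": "Paper 3", "year": 2009},
]

def coauthor_edges(pubs):
    """Count joint papers for every unordered pair of co-authors."""
    weights = Counter()
    for pub in pubs:
        # Sort so each pair is always stored in the same order.
        authors = sorted(set(pub["authors"]))
        for a, b in combinations(authors, 2):
            weights[(a, b)] += 1
    return weights

edges = coauthor_edges(publications)
```

The edge weight is the number of joint papers; single-author papers contribute no edges.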

Page 13:

• Patents (http://www.nber.org/patents/)
  • Citations between patents
  • For each patent we also know:
    • Time
    • Patent categorization
    • Patent inventor data, …
• Arxiv High-energy Physics:
  • Citation network between papers
  • For each paper we also know:
    • Author names
    • Title and abstract of the paper
    • Year of publication
    • Journal
• Data at: http://snap.stanford.edu/data/#citnets

Page 14:

• ~50 million tweets per month starting in June 2009 (6 months)
• Format:
    T 2009-06-07 02:07:42
    U http://twitter.com/redsoxtweets
    W #redsox Extra Bases: Sox win, 8-1: The Rangers spoiled Jon Lester's perfecto and his shutout.. http://tinyurl.com/pyhgwy
• Two important things:
  • URLs
  • Hash-tags
• Twitter social graph and some profiles: http://an.kaist.ac.kr/traces/WWW2010.html
• If you are interested – send us an email!
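
The T/U/W record format can be parsed line by line, pulling out the URLs and hash-tags from the tweet text. This is a minimal Python sketch, assuming that the tag and its value are separated by whitespace and that each tweet ends with its W line; the regular expressions are simple approximations rather than a full URL or hash-tag grammar.

```python
import re

SAMPLE = """T 2009-06-07 02:07:42
U http://twitter.com/redsoxtweets
W #redsox Extra Bases: Sox win, 8-1: The Rangers spoiled Jon Lester's perfecto and his shutout.. http://tinyurl.com/pyhgwy
"""

URL_RE = re.compile(r"https?://\S+")   # approximate URL pattern
HASHTAG_RE = re.compile(r"#\w+")       # approximate hash-tag pattern

def parse_tweets(text):
    """Yield one dict per tweet from T/U/W-prefixed lines."""
    record = {}
    for line in text.splitlines():
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        tag, value = parts[0], parts[1].strip()
        if tag == "T":
            record = {"time": value}
        elif tag == "U":
            record["user"] = value
        elif tag == "W":
            record["text"] = value
            record["urls"] = URL_RE.findall(value)
            record["hashtags"] = HASHTAG_RE.findall(value)
            yield record
            record = {}

tweets = list(parse_tweets(SAMPLE))
```

Because the function is a generator, it can be fed a file's contents in chunks or adapted to iterate over an open file, which matters at roughly 50 million tweets per month.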

Page 15:

• Inferring links of the who-follows-whom network
• What is the lifecycle of URLs and hash-tags?
• How do hash-tags get adopted?
• Multiple competing hash-tags, which one wins?
• Finding early/influential users?
• Community discovery
• Where/how will the information propagate?

Page 16:

• More than 1 million newsmedia and blog articles per day since August 2008
• Extracted phrases (quotes) and links
• http://memetracker.org
• Format:
    P http://cnnpoliticalticker.wordpress.com/2008/08/31/mccain-defends-palins-experience-level
    T 2008-09-01 00:00:13
    Q dangerously unprepared to be president
    Q even more dangerously unprepared
    Q understands the challenges that we face
    Q worked and succeeded
    L http://www.cnn.com
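
The P/T/Q/L format groups naturally into one record per article. The Python sketch below reads the tags as P = page URL, T = timestamp, Q = extracted phrase (quote) and L = outgoing link; that reading is an assumption based on the example and on the "extracted phrases (quotes) and links" description, and the whitespace separator is assumed as well.

```python
SAMPLE = """P http://cnnpoliticalticker.wordpress.com/2008/08/31/mccain-defends-palins-experience-level
T 2008-09-01 00:00:13
Q dangerously unprepared to be president
Q even more dangerously unprepared
Q understands the challenges that we face
Q worked and succeeded
L http://www.cnn.com
"""

def parse_documents(text):
    """Group P/T/Q/L lines into one dict per document."""
    docs = []
    current = None
    for line in text.splitlines():
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        tag, value = parts
        if tag == "P":
            # A P line starts a new document.
            current = {"url": value, "time": None, "phrases": [], "links": []}
            docs.append(current)
        elif current is None:
            continue  # ignore lines that appear before the first P line
        elif tag == "T":
            current["time"] = value
        elif tag == "Q":
            current["phrases"].append(value)
        elif tag == "L":
            current["links"].append(value)
    return docs

docs = parse_documents(SAMPLE)
```

The `links` lists give a hyperlink graph between sites, and the `phrases` lists give a document-phrase bipartite graph, which are the two structures most of the project ideas would start from.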

Page 17:

• How does information mutate/change over time?
• Which media sites are the most influential? Build a predictive model of site influence
• Role discovery: Which nodes are early adopters, late comers, summarizers?
• Create a model of political bias (liberal vs. conservative)
• What is genuine news, what are genuine phrases and what is spam?

Page 18:

• About the Dataset:
  • 6.5 million legal opinions from the United States Judiciary from 1900 to the present;
  • documents are linked (later cases refer to earlier ones)
  • the documents are both stored in raw form on Amazon S3 and also have been pre-processed for analysis by Hadoop
• Project ideas:
  • label cases as pro-plaintiff or pro-defendant
  • run PageRank, Hub-Authorities, or other graph algorithms on the documents (they are hyperlinked)
  • identify legally important concepts
• If you are interested – send us an email!
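
As an illustration of the Hub-Authorities idea on a citation graph, here is a small in-memory Python sketch. The case ids and edges are hypothetical, and a real run over 6.5 million opinions would need a distributed implementation (for example on Hadoop) rather than this loop.

```python
import math

# Hypothetical citation edges: (citing case, cited case).
citations = [("c3", "c1"), ("c3", "c2"), ("c4", "c1"), ("c4", "c3")]

def hits(edges, iterations=50):
    """Hub and authority scores by repeated mutual reinforcement."""
    nodes = sorted({n for edge in edges for n in edge})
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority: sum of hub scores of the cases that cite it.
        auth = {n: 0.0 for n in nodes}
        for src, dst in edges:
            auth[dst] += hub[src]
        # Hub: sum of authority scores of the cases it cites.
        hub = {n: 0.0 for n in nodes}
        for src, dst in edges:
            hub[src] += auth[dst]
        # Normalize both vectors to unit Euclidean length.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for n in scores:
                scores[n] /= norm
    return hub, auth

hub, auth = hits(citations)
```

In this toy graph the most-cited case gets the highest authority score, and cases that cite many authoritative cases get high hub scores.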

Page 19:

• Complete edit history of Wikipedia until January 2008
• For every single edit the complete snapshot of the article is saved
• Each page has a talk page:

Page 20:

• Talk page:
• Editors discuss things like:

Page 21:

• Every registered user has a personal page:

Page 22:

• Every user's page has a talk page:
• Users discuss things:

Page 23:

• We have nicely parsed Wikipedia data
• Each edit:
    REVISION 4781981 72390319 Steven_Strogatz 2006-08-28T14:11:16Z SmackBot 433328
    CATEGORY American_mathematicians
    MAIN Boston_University MIT Harvard_University Cornell_University
    OTHER De:Steven_Strogatz Es:Steven_Strogatz
    EXTERNAL http://www.edge.org/3rd_culture/bios/strogatz.html
    TEMPLATE Cite_book Cite_book Cite_journal
    COMMENT ISBN formatting &/or general fixes using [[WP:AWB|AWB]]
    MINOR 1
    TEXTDATA 229
• Can identify networks:
  • Who talks to whom
  • Who edits what
• Also: Wikipedia has elections for admins, articles get reverted, disputes resolved, …
• If you are interested – send us an email!
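
A "who edits what" network can be built directly from the parsed edit records. In the Python sketch below, the positions of the article title, timestamp and editor name within the REVISION line are assumptions inferred from the single example above, not a documented schema; the other tagged lines are stored as plain lists of tokens.

```python
from collections import defaultdict

SAMPLE = """REVISION 4781981 72390319 Steven_Strogatz 2006-08-28T14:11:16Z SmackBot 433328
CATEGORY American_mathematicians
MAIN Boston_University MIT Harvard_University Cornell_University
MINOR 1
TEXTDATA 229
"""

def parse_edits(text):
    """Yield one dict per edit; a REVISION line starts a new edit."""
    edit = None
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        tag, values = fields[0], fields[1:]
        if tag == "REVISION":
            if edit is not None:
                yield edit
            # Field positions are assumed from the example line.
            edit = {"article": values[2], "time": values[3], "editor": values[4]}
        elif edit is not None:
            edit[tag.lower()] = values
    if edit is not None:
        yield edit

edits = list(parse_edits(SAMPLE))

# "Who edits what": editor -> set of articles they have edited.
edits_by_editor = defaultdict(set)
for edit in edits:
    edits_by_editor[edit["editor"]].add(edit["article"])
```

Projecting this bipartite editor-article structure onto editors (two editors are linked if they edited the same article) is one possible way to get an editor-editor network.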

Page 24:

• We also have the Wikipedia webserver logs, i.e., page visit statistics
  • http://dammit.lt/wikistats/
  • http://lmonson.com/blog/
  • http://developer.amazonwebservices.com/connect/entry.jspa?externalID=2596
• How do Wiki page visit statistics correlate with external events, natural disasters?
  • Use Twitter or MemeTracker data to detect those
  • Compare occurrence of phrases and visits to Wikipedia pages
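
Comparing phrase occurrences with page visits could start with a plain correlation between two daily time series. The Python sketch below computes the Pearson correlation coefficient; the daily counts are made-up numbers for illustration, and Pearson correlation is only one possible measure (it ignores time lags between the two series).

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("a constant series has no defined correlation")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical daily counts for one topic (made-up numbers).
page_visits = [100, 120, 900, 400, 150]
phrase_mentions = [3, 5, 60, 25, 6]
r = pearson(page_visits, phrase_mentions)
```

A value of `r` close to 1 would suggest that visits and mentions spike on the same days; shifting one series by a day or two before correlating is a simple way to probe which signal leads.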

Page 25:

• Altavista web graph from 2002:
  • Nodes are webpages
  • Directed edges are hyperlinks
  • 1.4 billion public webpages
  • Several billion edges
  • For each node we also know the page URL

Page 26:

• SPAM:
  • Use the web-graph structure to more efficiently extract spam webpages
  • Link farms
  • Spider traps
• Personalized and topic-sensitive PageRank
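
Personalized (or topic-sensitive) PageRank differs from plain PageRank only in where the random jumps land: on a chosen set of pages instead of uniformly on all pages. A minimal Python sketch follows, assuming a damping factor of 0.85, a non-empty teleport set made of pages that occur in the graph, and a hypothetical four-page graph.

```python
def personalized_pagerank(out_links, teleport_set, beta=0.85, iterations=100):
    """Power iteration where random jumps land only on teleport_set."""
    nodes = sorted(set(out_links) | {v for vs in out_links.values() for v in vs})
    teleport = {n: (1.0 / len(teleport_set) if n in teleport_set else 0.0)
                for n in nodes}
    rank = dict(teleport)
    for _ in range(iterations):
        new_rank = {n: 0.0 for n in nodes}
        for node in nodes:
            targets = out_links.get(node, [])
            if targets:
                # Follow a link with probability beta.
                share = beta * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
        # Mass lost to teleportation and dead ends returns to the teleport set.
        leaked = 1.0 - sum(new_rank.values())
        for node in nodes:
            new_rank[node] += leaked * teleport[node]
        rank = new_rank
    return rank

# Hypothetical tiny web graph: page -> pages it links to.
web = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = personalized_pagerank(web, teleport_set={"a"})
```

Setting the teleport set to a list of trusted pages gives a trust-propagation style score, which is one way such a ranking could be used against link farms.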

Page 27:

• Website structure identification:
  • From the webgraph extract "websites"
  • What are common navigational structures of websites?
    • Cluster website graphs
    • Identify common subgraphs and patterns
  • What are roles pages/links play in the graph:
    • Content pages
    • Navigational pages
    • Index pages
  • Build a summary/map of the website

Page 28:

• A collection of focused snapshots of the Web
  • Data starts in 2004 and continues till today
• General crawls
  • Start from ~1000 seed webpages
  • Crawl up to ~150,000 pages per site
• Specialized crawls:
  • Universities
  • US Government
  • Hurricane Katrina (2005) – daily crawls
  • Monthly newspaper crawls

Page 29:

• Smaller than Altavista but you also have the page content
• Study the evolution of the webgraph
• How does website structure change and evolve over time
• How do webpages (webpage structure) change over time

Page 30:

• A large IM buddy graph from March 2005
  • 230 million nodes
  • 7,340 million undirected edges
• Limitations:
  • Only have the buddy graph with random node ids
  • No communication or edge strength

Page 31:

• Find communities, clusters in such a big graph
• Count frequent subgraphs
• Design algorithms to characterize the structure of the network as a whole
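
The simplest frequent subgraph to count is the triangle. The Python sketch below counts triangles in an undirected graph held in memory; the edge list is hypothetical, and a graph with billions of edges would need an external-memory or distributed variant of the same idea.

```python
from collections import defaultdict

def count_triangles(edges):
    """Count triangles in an undirected graph given as (u, v) pairs."""
    neighbors = defaultdict(set)
    for u, v in edges:
        if u != v:  # ignore self-loops
            neighbors[u].add(v)
            neighbors[v].add(u)
    triangles = 0
    for u in neighbors:
        for v in neighbors[u]:
            if u < v:
                # A common neighbour w > v closes the triangle u < v < w,
                # so each triangle is counted exactly once.
                triangles += sum(1 for w in neighbors[u] & neighbors[v] if w > v)
    return triangles

# Hypothetical buddy list with random integer node ids.
buddies = [(1, 2), (2, 3), (1, 3), (3, 4), (4, 5), (3, 5), (5, 6)]
total = count_triangles(buddies)
```

Comparing the triangle count with the count expected in a random graph of the same degree sequence is one common way to judge whether the pattern is unusually frequent.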

Page 32:

• Stanford Search Queries
• New York Times articles since 1987
  • Articles are manually annotated by subject categories and keywords
  • Entity or relation extraction
  • Extract keywords, predict article category
• Don't feel limited by these
  • You can collect the dataset yourself
  • And define the project/question yourself