Media Cloud

DESCRIPTION

Talk at IBM's Transparent Text Forum about the Berkman research project MediaCloud

Page 1: Media Cloud
Page 4: Media Cloud

#pman

32,107 posts

1,979 authors

high RT #s

influencers

Page 5: Media Cloud

flickr photo by mhartford, cc

Page 6: Media Cloud

Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.

New York Times (1857-Current file); Dec 30, 2000; ProQuest Historical Newspapers The New York Times (1851 - 2006)pg. A1

Page 7: Media Cloud
Page 8: Media Cloud

Nutritional Information: New York Times

Ingredients: International coverage 42% (includes 8% Iraq, 5% Afghanistan, minimum weekly 5% China, and no less than 2% Africa), Washington coverage 28% (includes 7% Obama, 6% Congress, and trace amounts of Limbaugh), New York State/Albany coverage 14%, New York City coverage 10%, and less than 6% domestic US coverage.

Warning: contains less than 40% of sports coverage of the leading competitor, the New York Post, and 50% of business coverage of the Wall Street Journal. May contain less than your recommended daily allowance of Latin America News.

Page 9: Media Cloud

You have not read any international news today.

Page 10: Media Cloud

newspapers

radio (npr, talk)

television (network, cable)

blogs (int’l, domestic, left, right)

twitter

facebook

email forwards

magazines

websites

?

Page 11: Media Cloud

hand-coding

link analysis

automated content analysis

Page 12: Media Cloud

PEJ News coverage index

Page 13: Media Cloud

mediatenor.com

Page 14: Media Cloud

Hand-coding: traditional media monitoring

Upsides:
- High accuracy
- Flexibility
- Low startup cost

Downsides:
- Small data sets, problems extrapolating
- Time consuming, no real-time data
- Intercoder reliability, difficulty of coder setup

Page 15: Media Cloud

Figure 1: Community structure of political blogs (expanded set), shown using a GEM layout [11] in the GUESS [3] visualization and analysis tool. The colors reflect political orientation, red for conservative, and blue for liberal. Orange links go from liberal to conservative, and purple ones from conservative to liberal. The size of each blog reflects the number of other blogs that link to it.

longer existed, or had moved to a different location. When looking at the front page of a blog we did not make a distinction between blog references made in blogrolls (blogroll links) from those made in posts (post citations). This had the disadvantage of not differentiating between blogs that were actively mentioned in a post on that day, from blogroll links that remain static over many weeks [10]. Since posts usually contain sparse references to other blogs, and blogrolls usually contain dozens of blogs, we assumed that the network obtained by crawling the front page of each blog would strongly reflect blogroll links. 479 blogs had blogrolls through blogrolling.com, while many others simply maintained a list of links to their favorite blogs. We did not include blogrolls placed on a secondary page.

We constructed a citation network by identifying whether a URL present on the page of one blog references another political blog. We called a link found anywhere on a blog’s page, a “page link” to distinguish it from a “post citation”, a link to another blog that occurs strictly within a post. Figure 1 shows the unmistakable division between the liberal and conservative political (blogo)spheres. In fact, 91% of the links originating within either the conservative or liberal communities stay within that community. An effect that may not be as apparent from the visualization is that even though we started with a balanced set of blogs, conservative blogs show a greater tendency to link. 84% of conservative blogs link to at least one other blog, and 82% receive a link. In contrast, 74% of liberal blogs link to another blog, while only 67% are linked to by another blog. So overall, we see a slightly higher tendency for conservative blogs to link. Liberal blogs linked to 13.6 blogs on average, while conservative blogs linked to an average of 15.1, and this difference is almost entirely due to the higher proportion of liberal blogs with no links at all.

Although liberal blogs may not link as generously on average, the most popular liberal blogs, Daily Kos and Eschaton (atrios.blogspot.com), had 338 and 264 links from our single-day snapshot

Adamic and Glance
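
The excerpt reports that 91% of links stay within the community they originate in. A minimal sketch of how such a share could be computed from a list of directed links, assuming each blog already carries a hand-coded community label; the function name, the toy blogs, and the links are hypothetical and are not data from the study:

```python
from collections import Counter


def intra_community_share(links, community):
    """Return the fraction of links whose source and target share a label.

    links: iterable of (source, target) blog pairs
    community: dict mapping blog name -> community label
    Links that touch a blog without a label are ignored.
    """
    counts = Counter()
    for source, target in links:
        if source in community and target in community:
            counts[community[source] == community[target]] += 1
    total = counts[True] + counts[False]
    return counts[True] / total if total else 0.0


# Hypothetical toy data, not the blogs analyzed in the excerpt.
community = {"a": "liberal", "b": "liberal", "c": "conservative", "d": "conservative"}
links = [("a", "b"), ("b", "a"), ("c", "d"), ("d", "c"), ("a", "c")]

share = intra_community_share(links, community)
print(f"{share:.0%} of links stay within their community")
```

Note that the labels are an input here: as the deck's list of downsides for link analysis points out, hand-coding is still needed to make sense of the clusters.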

Page 16: Media Cloud
Page 17: Media Cloud

Link analysis: Leveraging web architectures

Upsides:
- Highly automatable
- Large data sets
- Leverage existing tools for network research

Downsides:
- Only consider link structure, not content
- Danger of conflating linking with social structure
- Need for hand-coding to make sense of clusters
- Good for blogs, bad for MSM

Page 18: Media Cloud

newsmap.jp

Page 21: Media Cloud

Content analysis: Just becoming possible

Upsides:
- Can work with unstructured text, blogs and MSM
- Large data sets, highly automatable
- Easy linkage with visualization platforms

Downsides:
- Inaccuracy
- Language constraints
- Major programming investments

Page 23: Media Cloud

What We Have Done

1. get news stories

2. extract story text

3. create term list

4. allow rich queries
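
The four steps on this slide (get news stories, extract story text, create term list, allow rich queries) can be illustrated with a toy end-to-end sketch. This is not Media Cloud's actual code: the two hard-coded stories stand in for downloaded feed items, and every name (`TextExtractor`, `extract_text`, `term_list`, `term_counts_by_source`) is hypothetical.

```python
import re
from collections import Counter
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the text nodes of an HTML fragment, dropping the tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def extract_text(html):
    """Step 2: reduce a story's HTML to whitespace-normalized plain text."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.parts).split())


def term_list(text):
    """Step 3: count lowercase word occurrences in the story text."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


# Step 1 stand-in: hypothetical stories instead of real feed downloads.
stories = [
    {"source": "paper_a", "html": "<p>Congress debates the <b>stimulus</b> bill.</p>"},
    {"source": "blog_b", "html": "<p>The bailout, the bailout, and the stimulus.</p>"},
]

index = [(s["source"], term_list(extract_text(s["html"]))) for s in stories]


def term_counts_by_source(term):
    """Step 4: a very simple query, how often each source uses a term."""
    totals = Counter()
    for source, terms in index:
        totals[source] += terms[term]
    return dict(totals)


print(term_counts_by_source("stimulus"))
print(term_counts_by_source("bailout"))
```

A real system would store the term lists in a database rather than an in-memory list; the sketch only shows how each step feeds the next.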

Page 24: Media Cloud

Terming

Lexicon-based simple matching

More complex term extraction

Archer Daniels Midland Company
Archer, Bill
Archer, Dennis W.
Archer, Jeffrey
Archery
Archibold, Randal C.
Architecture
Architecture and Design
Archives and Records
Archon Corp.
Archstone-Smith Trust
ArcSight Inc.
Arctic Cat Inc.
Arctic Monkeys
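
Lexicon-based simple matching can be sketched as a whole-phrase, case-insensitive search for each lexicon entry in the story text. The snippet below is an illustration under that assumption, not the project's implementation; the lexicon is a small subset of the entries on the slide, and `match_lexicon` and the sample sentence are hypothetical.

```python
import re

# A few entries taken from the slide's lexicon excerpt.
LEXICON = [
    "Archer Daniels Midland Company",
    "Archery",
    "Architecture",
    "Arctic Cat Inc.",
    "Arctic Monkeys",
]


def match_lexicon(text, lexicon=LEXICON):
    """Return {entry: hit count} for every lexicon entry found in text.

    Lookarounds are used instead of \\b so that entries ending in
    punctuation (for example "Inc.") still match as whole phrases.
    """
    found = {}
    for entry in lexicon:
        pattern = r"(?<!\w)" + re.escape(entry) + r"(?!\w)"
        hits = len(re.findall(pattern, text, flags=re.IGNORECASE))
        if hits:
            found[entry] = hits
    return found


# Hypothetical sample sentence.
text = ("Arctic Monkeys played after the archery final; "
        "critics praised the architecture of the venue.")
print(match_lexicon(text))
```

Simple matching like this cannot tell the band from a literal phrase or resolve name variants, which is presumably where the slide's "more complex term extraction" comes in.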

Page 25: Media Cloud

- 9.25 million stories
- 900 GB of database + downloaded content
- 162 million story / tag associations
- 1,500 sources
- 10,000 feeds
- roughly 20,000 stories per day
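
A few ratios follow directly from these figures. The short calculation below only divides the numbers on the slide; the variable names are illustrative, and the last ratio assumes a constant collection rate, which the slide does not claim.

```python
# Figures as stated on the slide.
stories = 9_250_000
story_tag_associations = 162_000_000
sources = 1_500
feeds = 10_000
stories_per_day = 20_000  # "roughly", per the slide

tags_per_story = story_tag_associations / stories
feeds_per_source = feeds / sources
stories_per_feed_per_day = stories_per_day / feeds
# Assumption: a constant rate, which is only a rough equivalence.
days_at_current_rate = stories / stories_per_day

print(f"tags per story:           {tags_per_story:.1f}")
print(f"feeds per source:         {feeds_per_source:.1f}")
print(f"stories per feed per day: {stories_per_feed_per_day:.1f}")
print(f"days at current rate:     {days_at_current_rate:.1f}")
```

So each story carries about 17.5 tags on average, and the archive corresponds to roughly 460 days of collection at the current daily rate.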

Page 26: Media Cloud

Topic focus

Page 27: Media Cloud

Pivoting on “republican”

Page 28: Media Cloud

Global Attention and Power Laws

Page 29: Media Cloud

You say “stimulus” and I say “bailout”

Page 30: Media Cloud
Page 31: Media Cloud
Page 32: Media Cloud
Page 33: Media Cloud
Page 34: Media Cloud

what’s been hard:

topic clustering

replicating across languages

legal concerns

dark matter