© Copyright IBM Corporation 2008

Mining Political Blog Networks

Wojciech Gryc, Yan Liu, Prem Melville, Claudia Perlich, Richard D. Lawrence

Predictive Modeling Group, Mathematical Sciences Department

IBM Research

June 13, 2008

Slide 2

IBM Research – Mathematical Sciences Department

Overview

Blogs and other forms of social media provide us with a snapshot of people’s daily lives, opinions, and ideas – can we use this to learn more about trends within society?

Presentation Outline
- Overview of the political blogosphere
- Our work: a long-term plan
- Finding communities
- Text mining and information retrieval
- Combining text mining with graph mining
- Future work

Slide 3

Web 2.0

The web as participatory, customizable, and community-oriented

Numerous opportunities for corporations: marketing, customer loyalty, and research

Slide 4

Web 2.0 and Politics

Web 2.0 is also revolutionizing politics

Slide 5

Web 2.0, Blogs, and Social Networks

Web 2.0 is ultimately a social environment
- eBay as an auction system with its own social ecosystem
- Wikipedia as a collaborative environment
- Blogs, forums, and e-mail as social networks

There are over 77 million blogs, with about 100,000 added every day
- Blog: an online journal in which an individual shares a running log of events and personal insights with online audiences in reverse chronological order.

Blogs provide two key pieces of information:
- Textual data: “No, I am NOT Voting for McCain ... There has been some murmuring of Hillary Clinton supporters voting for McCain in pure protest of Barack Obama’s candidacy.”
- Information on relationships between bloggers: the blog cross-reference graph

[Figure: Blog Cross-Reference Graph, with nodes labeled A, B, C, and D and the example blogs HuffingtonPost, Daily Kos, Boing Boing, and PoliticalWire.]

Slide 6

Literature Review: Political blog networks

Most political blog studies depend on hand-labeled information

Many (though not all) studies are limited to mainstream blogs – the most popular blogs in the blogosphere

Prior Work

Adamic & Glance, 2005. Analysis of the political blogosphere, divided between liberal and conservative bloggers.

Ackland, 2005. Follow-up to Adamic & Glance, 2005.

Wallsten, 2005. Political blogosphere as an echo chamber, and prominence of conservative bloggers.

Hand-labeled popular Liberal and Conservative blogs.

Analysis of linking patterns.

Analysis of less popular blogs (background discussions).

Slide 7

Literature Review: Machine learning and political blogs

Studies applying machine learning to political blogs often focus on text classification

Prior Work

Tremayne, 2006. Preferential attachment and link prediction in the war-focused blogosphere

Turney, 2002. Cultural discussions are much more difficult to label than technical ones

Mullen & Malouf, 2006. Sentiment labeling of political discussion boards, accuracies around 60%

Durant & Smith, 2006. Achieve accuracies around 90% in labeling political blogs as left, right, or moderate

Analysis of linking patterns (topic-specific).

Post classification based on sentiment labels.

Notice differences in accuracies.

Slide 8

BANTER (Blog Analysis of Network Topology and Evolving Responses)

[Figure labels: 77M Blogs; Political Blogs; Presidential Primary Blogs]

1. How do we identify the relevant sub-universe of blogs?
   We submit a set of relevant keywords to Technorati, include out-linked blogs, and then refine this sub-universe via active learning.

2. How do we determine “authorities” in this sub-universe?
   We use PageRank-like algorithms against the cross-reference structure, combined with SNA concepts (e.g. Information Flow).

3. How do we detect emerging topics and themes in this sub-universe?
   One approach is to predict link (cross-reference) formation using network evolution and content (keywords) at the nodes (blogs).

4. How do we detect sentiment and topics associated with specific posts?
   One approach is to learn a model using background knowledge and a small set of labeled examples.

[Figure: “OpenID Buzz in January”. x-axis: days (5 to 20); y-axis: Number of Occurrence (0 to 200).]

Slide 9

Task 1

First, how do we actually find the relevant blogs and communities?

Slide 10

Task 1: How do we find a relevant sub-community of blogs (e.g. Lotus-related blogs)?

Develop text-based classification approach to rank blogs in terms of their relevance to a specific domain (e.g. Lotus software)

[Figure: workflow diagram with five numbered steps. Keywords (Politics, Democrats, Elections, Republicans, Policy, Voting) feed a keyword search on Technorati Blog Search, giving a subset of blogs. Out-linked blogs are included to form an extended blog subset. Positive and negative examples generate labels for a classifier, and a model is built. The classifier labels blogs as Relevant or Irrelevant. The top-ranked blogs are used as the new subset.]

Repeat this process as many times as necessary to collect a larger universe of blogs.
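The loop described on this slide can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function name `expand_blog_universe`, the `out_links` dictionary, and the `score_relevance` callable (standing in for the trained classifier) are all hypothetical, and the toy data is made up.

```python
from typing import Callable, Dict, Set


def expand_blog_universe(
    seed_blogs: Set[str],
    out_links: Dict[str, Set[str]],
    score_relevance: Callable[[str], float],
    top_k: int,
    rounds: int,
) -> Set[str]:
    """Grow a blog subset: add out-linked blogs, rank them, keep the top-ranked."""
    current = set(seed_blogs)
    for _ in range(rounds):
        # Extend the subset with every blog linked from the current subset.
        extended = set(current)
        for blog in current:
            extended |= out_links.get(blog, set())
        # Rank by the (stand-in) classifier score; ties are broken by name.
        ranked = sorted(extended, key=lambda b: (-score_relevance(b), b))
        # The top-ranked blogs become the subset for the next round.
        current = set(ranked[:top_k])
    return current


# Toy data (hypothetical): a tiny link structure and fixed relevance scores.
toy_links = {
    "seed.example": {"politics-a.example", "cooking.example"},
    "politics-a.example": {"politics-b.example"},
}
toy_scores = {
    "seed.example": 0.9,
    "politics-a.example": 0.8,
    "politics-b.example": 0.7,
    "cooking.example": 0.1,
}
result = expand_blog_universe(
    {"seed.example"}, toy_links, lambda b: toy_scores.get(b, 0.0), top_k=3, rounds=2
)
```

In this toy run the off-topic blog is picked up in the first round through an out-link and dropped in the second once a more relevant blog becomes reachable, which is the refinement behaviour the slide describes.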

Slide 11

Task 1: Technorati results

31 political tags were submitted to Technorati, based on terms surrounding the Presidential Primaries and policy areas currently making headlines (e.g. “Iraq”, “economy”)

The following table shows the number of blogs tagged with a specific term

| Tag | Blog Count | Tag | Blog Count |
|---|---|---|---|
| politics | 56460 | primary | 2420 |
| political | 13990 | obama | 2320 |
| government | 8090 | democrats | 2140 |
| washington | 7580 | current affairs | 2040 |
| president | 6579 | elections | 2010 |
| bush | 6370 | taxes | 1740 |
| iraq | 5100 | republicans | 1320 |
| current events | 5060 | afghanistan | 1070 |
| economics | 4440 | mccain | 937 |
| vote | 3760 | governance | 737 |
| economy | 3740 | public policy | 457 |
| republican | 3590 | recession | 159 |
| election | 3210 | primaries | 103 |
| iran | 3110 | nuclear proliferation | 21 |
| clinton | 2769 | budget deficit | 6 |
| democrat | 2720 | | |

Slide 12

Task 1: Current data sets

Initial set of 100 “influential” political blogs
- Crawled from January 10, 2008 until the present
- Includes influential sites taken from previous political blog papers and listings such as Technorati’s Top 100 blogs
- Includes blogs like Huffington Post, Wonkette, Daily Kos, etc.

Larger set of 11,788 blogs (317,566 posts) being crawled since April 22, 2008
- This includes the smaller data set above
- Built through the Technorati tag system

Slide 13

Task 2: How do we determine “authorities” in this sub-universe?

Influence
- Standard site ranking algorithms (e.g. PageRank or Flow Betweenness) look at the status of each blogger within the social network
- PageRank looks at linking patterns, giving more weight (i.e. importance) to links originating from important websites
- Flow Betweenness looks at whether specific nodes in a network act as key distribution points for information

[Figure: PageRank and Flow Betweenness panels with a Level of Importance legend: Most, Very, Somewhat, None.]
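As a rough illustration of the PageRank idea described above, here is a minimal power-iteration sketch. It is a textbook formulation, not the ranking code used for this work; the function name, the damping factor of 0.85, the iteration count, and the four-node toy graph are all assumptions made for the example.

```python
from typing import Dict, List


def pagerank(
    links: Dict[str, List[str]], damping: float = 0.85, iterations: int = 50
) -> Dict[str, float]:
    """Plain power-iteration PageRank on a directed link graph."""
    nodes = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every node receives the same small "teleport" share.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = links.get(node, [])
            if targets:
                # A node passes its rank on in equal parts to the blogs it cites.
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A node with no out-links spreads its rank over all nodes.
                share = damping * rank[node] / n
                for target in nodes:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Toy graph (hypothetical): A, C, and D all cite B; B cites only A.
toy_graph = {"A": ["B"], "C": ["B"], "D": ["B"], "B": ["A"]}
scores = pagerank(toy_graph)
```

In this toy graph B scores highest because three blogs cite it, and A outranks C and D because its single in-link comes from the highly ranked B, which matches the "links from important websites count more" intuition on the slide.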

Slide 14

Task 3: How do we detect emerging topics and themes in this sub-universe?

One way to find emerging topics is comparing background discussions to most recent posts

We can also use text-based information to see if our authorities are actually leading discussions and breaking news

[Figure: Mentions of "Wikileaks" in a One Week Period. x-axis: Date (3-Jan-08 to 13-Mar-08); y-axis: # of Bloggers (0 to 70). Annotations: general background discussions; Wikileaks DNS entry removed by US judge; new background level following press coverage.]
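One simple way to compare background discussion with the most recent posts is to flag days whose mention count is far above a trailing window. The sketch below assumes that approach; the function name, the 7-day window, the threshold of 3 standard deviations, and the daily counts are all hypothetical and are not taken from the data behind the chart.

```python
from statistics import mean, stdev
from typing import List


def find_bursts(
    daily_counts: List[int], window: int = 7, threshold: float = 3.0
) -> List[int]:
    """Return indices of days whose count far exceeds the trailing background."""
    bursts = []
    for day in range(window, len(daily_counts)):
        background = daily_counts[day - window:day]
        mu = mean(background)
        sigma = stdev(background)
        # Use a floor of 1.0 so a nearly flat background does not make
        # every small fluctuation look like a burst.
        if daily_counts[day] > mu + threshold * max(sigma, 1.0):
            bursts.append(day)
    return bursts


# Toy series (hypothetical): low background chatter, then a sudden spike.
toy_counts = [2, 3, 2, 4, 3, 2, 3, 3, 60, 45, 12, 10]
burst_days = find_bursts(toy_counts)
```

Only the first day of the spike is flagged here, because the spike itself inflates the background for the following days. A production detector would likely need a more robust baseline, such as a median.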

Slide 15

Task 3: How do we detect emerging topics and themes in this sub-universe?

Assortativity and homophily play a key role in our analysis

Our approach to homophily is based on measuring within- and between-group edges

Homophily is only observed in certain contexts
- Blogs focusing on similar topics are more likely to link to each other
- Blogs are not homophilous when it comes to political sentiment
- Node level versus network level

[Figure: example networks labeled Homophilous and Heterophilous.]
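To make the within-group versus between-group idea concrete, the sketch below counts both kinds of edges and computes a discrete assortativity coefficient in the style of Newman's mixing-matrix formula. Whether this is the exact measure used in the talk is an assumption; the function names, the group labels, and the toy graph are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple

Edge = Tuple[str, str]


def edge_mixing(edges: List[Edge], labels: Dict[str, str]) -> Tuple[int, int]:
    """Count within-group and between-group edges."""
    within = sum(1 for u, v in edges if labels[u] == labels[v])
    return within, len(edges) - within


def assortativity(edges: List[Edge], labels: Dict[str, str]) -> float:
    """Discrete assortativity: (trace - expected) / (1 - expected)."""
    # Count each undirected edge in both directions so the mixing matrix
    # is symmetric.
    pair_counts: Counter = Counter()
    for u, v in edges:
        pair_counts[(labels[u], labels[v])] += 1
        pair_counts[(labels[v], labels[u])] += 1
    total = sum(pair_counts.values())
    groups = set(labels.values())
    # Fraction of edge ends that connect nodes of the same group.
    trace = sum(pair_counts[(g, g)] for g in groups) / total
    # Fraction of edge ends attached to each group.
    ends = {
        g: sum(c for (x, _), c in pair_counts.items() if x == g) / total
        for g in groups
    }
    expected = sum(share ** 2 for share in ends.values())
    # Undefined (division by zero) when every edge end is in one group.
    return (trace - expected) / (1.0 - expected)


# Toy graph (hypothetical): A and B mention a term, C and D do not.
toy_labels = {"A": "mentions", "B": "mentions", "C": "silent", "D": "silent"}
toy_edges = [("A", "B"), ("C", "D"), ("A", "C")]
counts = edge_mixing(toy_edges, toy_labels)
r = assortativity(toy_edges, toy_labels)
```

A positive coefficient indicates that links fall within groups more often than chance would predict (homophily), and a negative one indicates the opposite (heterophily).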

Slide 16

Task 3: Analyzing discussions and network structure

To use assortativity in analyzing “buzz”, label bloggers by their mentioning a specific term or set of terms

In this case, correlation between # of bloggers and assortativity: -0.799

[Figure: Tracking "wikileaks". x-axis: Date (3-Jan-08 to 13-Mar-08); left y-axis: Number of Bloggers (0 to 70); right y-axis: Assortativity Coefficient (-4.5E-05 to 5.0E-06). Series: Counts, Assortativity.]
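The correlation quoted on this slide is presumably a Pearson correlation between the two plotted series; that reading is an assumption. A minimal sketch of the computation, using made-up values rather than the talk's data:

```python
from math import sqrt
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length with at least 2 points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


# Toy series (hypothetical): blogger counts rise while assortativity falls.
toy_blogger_counts = [1, 2, 3, 4]
toy_assortativity = [-1e-6, -2e-6, -3e-6, -4e-6]
corr = pearson(toy_blogger_counts, toy_assortativity)
```

A strongly negative value, as reported on the slide, means that days with more bloggers mentioning the term tend to have lower assortativity.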

Slide 17

Task 4: Labeling political posts and blogs

Depending on the information we want to extract, there are numerous labels we may want to apply to posts or blogs

An extension of our data sets:
- 260 posts labeled as “positive” or “negative” in relation to Hillary Clinton and Barack Obama
- 360 posts labeled as “relevant” or “not relevant” in relation to the Democratic Primaries

Potential Labels
- Relevant or not relevant
- Subjective or objective
- Positive or negative

More on Our Data
- 260 posts on sentiment surrounding Obama, Clinton
- 360 posts on relevance to Democratic Primaries

[Figure: labeling flow. Posts, then “Is post relevant?”, then “Is post subjective?”, then Positive or Negative.]

Slide 18

Task 4: Labeling political posts and blogs

Key question: how can we improve our classifiers with such a limited set of labeled examples?

- Transfer learning: using other data sets
- Using background knowledge

Slide 19

Task 4: Precocious Naïve Bayes

Using Naïve Bayes classification, we can use a bag-of-words approach to build text-based classifiers

Problem: when training a classifier like this, we start with a “blank slate” – equal probabilities for all features

It may also be useful to include background knowledge in classification systems
- Lexicons containing sentiment-focused information
- Related data sets and labeled information

[Figure: nodes F_1, ..., F_n.]

$$p(C \mid F_1, \ldots, F_n) = \frac{1}{Z}\, p(C) \prod_{i=1}^{n} p(F_i \mid C)$$
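One plausible way to avoid the "blank slate" is to seed the per-class word counts with lexicon entries before any labeled posts are seen. The sketch below illustrates that idea only; it is not the Precocious Naïve Bayes method itself, and the class name, the `lexicon_weight` pseudo-count, and the toy lexicon and posts are assumptions made for the example.

```python
import math
from collections import Counter
from typing import Dict, List


class LexiconNaiveBayes:
    """Multinomial Naive Bayes whose word counts are seeded from a lexicon."""

    def __init__(self, lexicon: Dict[str, str], alpha: float = 1.0,
                 lexicon_weight: float = 5.0) -> None:
        self.lexicon = lexicon                # word -> class it signals
        self.alpha = alpha                    # smoothing for every word
        self.lexicon_weight = lexicon_weight  # extra pseudo-count for lexicon words

    def fit(self, docs: List[List[str]], labels: List[str]) -> "LexiconNaiveBayes":
        self.classes = sorted(set(labels) | set(self.lexicon.values()))
        self.vocab = set(self.lexicon) | {w for doc in docs for w in doc}
        self.counts = {c: Counter() for c in self.classes}
        # Background knowledge: lexicon words start with a head start.
        for word, cls in self.lexicon.items():
            self.counts[cls][word] += self.lexicon_weight
        # Labeled examples are then added on top of the pseudo-counts.
        for doc, cls in zip(docs, labels):
            self.counts[cls].update(doc)
        label_counts = Counter(labels)
        n, k = len(labels), len(self.classes)
        self.log_prior = {c: math.log((label_counts[c] + 1) / (n + k))
                          for c in self.classes}
        return self

    def log_posterior(self, doc: List[str], cls: str) -> float:
        # Unnormalized: log p(C) + sum_i log p(F_i | C); Z is ignored.
        total = sum(self.counts[cls].values()) + self.alpha * len(self.vocab)
        score = self.log_prior[cls]
        for word in doc:
            if word in self.vocab:  # words never seen anywhere are skipped
                score += math.log((self.counts[cls][word] + self.alpha) / total)
        return score

    def predict(self, doc: List[str]) -> str:
        return max(self.classes, key=lambda c: self.log_posterior(doc, c))


# Toy data (hypothetical): a two-word lexicon and two labeled posts.
toy_lexicon = {"great": "positive", "awful": "negative"}
model = LexiconNaiveBayes(toy_lexicon).fit(
    [["strong", "speech"], ["weak", "speech"]], ["positive", "negative"]
)
prediction = model.predict(["awful", "speech"])
```

With only two labeled posts, the word "awful" never appears in training, so the lexicon pseudo-counts alone decide the prediction here.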

Slide 20

Task 4: Precocious Naïve Bayes

Using a lexicon can improve the classification process
- It increases accuracy and reduces the number of training examples needed

Example application: classifying posts based on sentiment towards Obama or Clinton
- Similar studies achieve accuracies of about 60%

Slide 21

Task 4: Precocious Naïve Bayes

Using machine learning and information retrieval models can also help clarify linguistic patterns within political posts and blogs

For example, explore the term “truth”
- In general, “truth” has positive connotations
- Yet our models see it as negative

Evidence as to why “truth” is negative
- Often used in sarcastic or accusatory messages
- Associated with negative events

Another down-weighted term: “liberal”
- More evidence of a conservative blogosphere?

“Spinning the truth.”

“Transform a lie into a truth.”

“There is a lot of truth to Wright's sermons.”

Slide 22

Task 4: Transfer learning for sentiment prediction

Can we apply a model trained for movie reviews (or other product reviews) to predict sentiment in Lotus blogs?

We examined several labeled sentiment datasets from other domains to predict the sentiment of our Lotus-universe blogs
- Movielens: 1000 positive and 1000 negative movie reviews from www.movielens.org
- Epinions: 97 enterprise software reviews from epinions.com
- Amazon: 13040 software reviews from amazon.com

Testing (Accuracy)

| Training | Movie | Epinions (Enterprise software) | Amazon (Software) | Lotus Data 2-class (3-class) |
|---|---|---|---|---|
| Movie | 81.5 | 65.9 | 60.1 | 79.3 |
| Epinions | 33.7 | 67.5 | 34.9 | 50.3 |
| Amazon | 67.2 | 60.9 | 73.8 | 76.6 |

Slide 23

Task 4: Labeling political posts and blogs

Final question: can we learn more by combining graph-based information with our text-based models?

Slide 24

Task 4: Topic-Link Latent Dirichlet Allocation (LDA)

We wish to incorporate both content and network structure in modeling the political blogosphere
- Members of a “community” are more likely to link to each other

An application: predicting links between new posts in our data set
- Predicting linking patterns can help predict (or observe) major patterns in the blogosphere

Simple text-based models can even be somewhat effective
- The chart below shows the probability of a link existing between two posts based on those two posts’ content similarities
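Content similarity between two posts is commonly measured with cosine similarity over bag-of-words vectors. The sketch below assumes raw term counts and pre-tokenized posts; the function name and the toy posts are hypothetical, and the talk may have used a different weighting such as TF-IDF.

```python
import math
from collections import Counter
from typing import List


def cosine_similarity(post_a: List[str], post_b: List[str]) -> float:
    """Cosine similarity between two tokenized posts using raw term counts."""
    counts_a, counts_b = Counter(post_a), Counter(post_b)
    # Dot product over the terms the two posts share.
    dot = sum(counts_a[w] * counts_b[w] for w in counts_a.keys() & counts_b.keys())
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty post is treated as similar to nothing
    return dot / (norm_a * norm_b)


# Toy posts (hypothetical), already tokenized.
similarity = cosine_similarity(["obama", "vote"], ["obama", "tax"])
```

A text-only link predictor could then guess that a link exists whenever the similarity between two posts exceeds some chosen threshold.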

Slide 25

Task 4: Topic-Link Latent Dirichlet Allocation (LDA)

This method helps show clusters of terms (i.e. topical discussions or posts)

The table below shows the top five clusters using LDA and Topic-Link LDA

While one can’t formally say which set of clusters is better, the two sets provide different ways of looking at patterns in the data
- Clusters from the Topic-Link LDA approach are built using both textual similarity scores and the linking patterns of the bloggers

| Rank | LDA | Topic-Link LDA |
|---|---|---|
| 1 | pda, xml, backup, podcast, extra, faq, print | obam, clinton, mccain, romney, huckab, vote, hill |
| 2 | atlant, subscrib, masthead, internship, dish, feedback | minist, jawa, ali, omar, terror, june, octob |
| 3 | mccain, romnei, huckabe, mitt, vote, parti, republican | tax, farm, mitchel, veto, lobby, marin, fund |
| 4 | obama, vote, clinton, hillari, deleg, score, win | bin, laden, chin, osam, dubai, warm, global |
| 5 | captain, podhoretz, quarter, blogger, blog, paul, obama | copyright, pundit, disclaim, blog, prev, index, trackback |

Slide 26

Task 4: Link prediction using Topic-Link LDA

Using the models generated from Topic-Link LDA, it is possible to build a predictive model for which posts will link to which posts
- Using posts from February 1-14, 2008, we want to predict linking patterns between posts written during February 15-22, 2008

Using the following baseline models:
- Preferential attachment: blogs with many out-links always cite blogs with many in-links
- Cosine similarity: blog posts with high similarity scores (> 0.5) link to each other

Below are precision and recall scores for the models
- Accuracies were not used due to the sparsity of the networks

| Model Type | Precision | Recall | F1 |
|---|---|---|---|
| Preferential Attachment | 0.005 | 0.891 | 0.010 |
| Cosine Similarity | 0.020 | 0.750 | 0.039 |
| Topic-Link LDA | 0.079 | 0.803 | 0.144 |
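For reference, precision, recall, and F1 for link prediction can be computed from the sets of predicted and actual links as sketched below. The function names and the toy link sets are hypothetical. The last lines recompute F1 from the precision and recall values reported in the table above, as a consistency check using the standard formula F1 = 2PR / (P + R).

```python
from typing import Set, Tuple

Link = Tuple[str, str]


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def precision_recall_f1(
    predicted: Set[Link], actual: Set[Link]
) -> Tuple[float, float, float]:
    """Score a set of predicted links against the links that really formed."""
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall, f1_score(precision, recall)


# Toy links (hypothetical): four predicted, two actual, one in common.
toy_predicted = {("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")}
toy_actual = {("a", "b"), ("d", "a")}
toy_scores = precision_recall_f1(toy_predicted, toy_actual)

# Recompute F1 from the precision and recall values in the table.
table_f1 = {
    "Preferential Attachment": round(f1_score(0.005, 0.891), 3),
    "Cosine Similarity": round(f1_score(0.020, 0.750), 3),
    "Topic-Link LDA": round(f1_score(0.079, 0.803), 3),
}
```

Because true links are rare in a sparse network, a model that predicts almost no links would score a high accuracy, which is why precision and recall are the more informative measures here.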

Slide 27

Future Work and Potential Extensions

Graph transduction and assortativity of networks
- Multidimensional assortativity
- Graph transduction dependent on assortativity

Authority scores based on both network structure and text
- Who is driving the discussions, based on linguistic patterns and term frequencies?
- PageRank, betweenness, etc. are good network measures for driving discussions, but need to be validated

Early detection of “buzz” and discussions
- Building on models of “post cascades”
- Incorporating a new definition of authority using both textual and network-based data

Slide 28

Thank You

[email protected]