Content-Based Social Network Analysis of Online Communities Anatoliy Gruzd Caroline Haythornthwaite Graduate School of Library and Information Science University of Illinois at Urbana-Champaign Social Network/ing Symposium Toronto, 2007


Page 1: Content-Based Social Network Analysis of Online Communities

Content-Based Social Network Analysis of Online Communities

Anatoliy Gruzd Caroline Haythornthwaite

Graduate School of Library and Information Science, University of Illinois at Urbana-Champaign

Social Network/ing Symposium, Toronto, 2007

Page 2: Content-Based Social Network Analysis of Online Communities

The Problem

Online communities are creating a growing volume of texts contributed by a growing number of participants

100 million posters in Usenet (Marc Smith, quoted in CNET, 2003)

3.12 terabytes of data *daily* on Usenet (2007)

By 2010, >70% of digital content will be user-generated, and the majority of it will still be text-based (technology consultancy IDC)

Growth of Usenet (Wikipedia, Oct. 2007)

Page 3: Content-Based Social Network Analysis of Online Communities

Making Sense of Community Action

How can we help make sense of community action and interaction based solely on textual interchanges?

How can we make the social structures evident for participants, and managers or teachers?

How can we make advances from linear streams of text to visualized patterns of interaction?

Growth of blog activity, March 2003-2006

• 175,000 new blogs a day (2006)

Page 4: Content-Based Social Network Analysis of Online Communities

Mapping Online Communities

Mappings and internal examinations tend to be based on one aspect of ties
  Links between sites
  Reports of friendship or work relations
  FOAF declarations

With a concentration on quantity over content

Flickrverse, Gustavog, 2006

http://www.flickr.com/photo_zoom.gne?id=9708628&context=set-222111&size=l

Based on 50 connections between people.

Page 5: Content-Based Social Network Analysis of Online Communities

Mapping Online Communities (2)

Emerging mappings include attention to
  Poster activity
  Actor profiles as posters
  Content of sites, e.g., words in common on different sites (Gloor & Zhao, 2006)

Welser, Gleave, Fisher & Smith, 2007 in JOSS

Page 6: Content-Based Social Network Analysis of Online Communities

Extracting Network Information

Determine who is talking to whom
  Applying social network analysis techniques

Determine what they are talking about
  Applying natural language processing techniques

Merge these to produce network detection that better represents ongoing processes

Page 7: Content-Based Social Network Analysis of Online Communities

Our Goal

Use natural language processing (NLP) to
  enhance the current techniques of building social networks
  gain more information and insight about Nodes, Relations, and Ties

Current focus is on bulletin boards

Current example is an online learning environment

Procedures are being derived for use with groups with unknown membership

Page 8: Content-Based Social Network Analysis of Online Communities

Adding more with NLP

Revealing network information
  1. Node discovery
  2. Tie discovery
  3. Relation discovery
  4. Role & Group discovery

Network visibility rather than aggregate behavior

Important for revealing structures to
  Participants, to understand the ‘lay of the (cyber)land’
  Instructors (or managers), to oversee participation and intervene as necessary

Page 9: Content-Based Social Network Analysis of Online Communities

Adding relational information

Few (yet) derive relations from content, which can reveal
  Networks based on multiple relations
  Change in discourse over time
  Changes in associations among network members by relation and time

Few deal with the vagaries of CMC texts
  Bulletin boards, chat
  Incorrect spelling, partial sentences, inventive punctuation
  Deriving who is talking to whom from content analysis

Or local language conventions
  Acronyms, group naming conventions, group word use conventions, nicknames for people and processes

Page 10: Content-Based Social Network Analysis of Online Communities

Node and Ties

Focus today on node and tie discovery

Identifying who are the actors in the network
  Identify nodes, i.e., people
  Make the tie(s) between nodes

Two approaches
  Chain Network, based on chain of posting
  Name Network, based on names used in the text

Page 11: Content-Based Social Network Analysis of Online Communities

Chain Network: definition options

Post chain: A, B, C, D (A is the thread starter, C is the last previous poster, D is the sender). The values give the weight of D's tie to A, B, and C.

Option                                                                    A     B     C
1. Connect a sender to the last person in the post chain only
   (undirected)                                                           0     0     1
2. Connect a sender to the last and first (= thread starter) person
   in the chain, and assign equal weight values (e.g. 1) to both ties     1     0     1
3. Same as option 2, but the tie between a sender and the first
   person is half weight (e.g. 0.5)                                       .5    0     1
4. Connect a sender to all people in the reference chain with
   decreasing weights                                                     .25   .5    1
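The four weighting options above can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' implementation: the function name `chain_tie_weights` is hypothetical, and the halving rule used for option 4 is an assumption inferred from the example weights (.25, .5, 1).

```python
def chain_tie_weights(chain, option):
    """Weights of the sender's ties to earlier posters in a reference chain.

    `chain` lists posters in posting order: thread starter first, sender last.
    Returns {earlier_poster: weight}. Hypothetical helper for illustration.
    """
    *earlier, _sender = chain
    weights = {person: 0.0 for person in earlier}
    if not earlier:
        return weights  # a thread-starting post has no chain ties

    first, last = earlier[0], earlier[-1]
    if option == 1:
        # Tie to the last previous poster only.
        weights[last] = 1.0
    elif option == 2:
        # Equal-weight ties to the thread starter and the last poster.
        weights[first] = 1.0
        weights[last] = 1.0
    elif option == 3:
        # Half-weight tie to the thread starter, full weight to the last poster.
        weights[first] = 0.5
        weights[last] = 1.0
    elif option == 4:
        # Assumed rule: the weight halves with each step back in the chain.
        for distance, person in enumerate(reversed(earlier)):
            weights[person] = max(weights[person], 0.5 ** distance)
    else:
        raise ValueError("option must be 1, 2, 3, or 4")
    return weights


if __name__ == "__main__":
    for option in (1, 2, 3, 4):
        print(option, chain_tie_weights(["A", "B", "C", "D"], option))
```

Under these assumptions, the chain A, B, C, D reproduces the four rows of weights listed for the options.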

Page 12: Content-Based Social Network Analysis of Online Communities

Chain Networks: missed info.

Ex.1 Previous post is by Gabriel, Sam replies: ‘Nick, Ann, Gina, Gabriel: I apologize for not backing this up with a good source, but I know from reading about this topic that libraries…’

Ex.2 Previous posts by Gabriel, Sam, Gina, and Eva, then: ‘Gina, I owe you a cookie. This is exactly what I wanted to know. I was already planning on taking 302 next semester, and now I have something to look forward to!’

Ex.3 Post by Fred: ‘I wonder if that could be why other libraries around the world have resisted changing – it's too much work, and as Dan pointed out, too expensive.’

Page 13: Content-Based Social Network Analysis of Online Communities

Name networks

Making use of node and tie information that is in the text of the postings

Issues
  Disambiguating names/nicknames from text
  Disambiguating names of people from names of people being discussed (e.g., subject)
  Detection of aliases for a given person and disambiguation of two or more users with the same name

Page 14: Content-Based Social Network Analysis of Online Communities

Hand coding: categories

Network Participants
  <from> = person indicated in ‘from’ line of post heading (NB only info that is system generated)
  <addressee> = direct reference to other ('I agree with you Todd')
  <reference> = indirect reference to other ('Todd has a good point')
  <self-reference> = poster references themselves in some way (braindead library student, high school teacher, etc.)
  <signature> = name as given by the message author on their post

Named non-participants
  <subject>, <subject 2>, or <subject 3> = name is a subject of the discussion, either as one name (Dewey), 2 (Brewster Kahle) or 3 (Charles R. Darwin)
  <non-group reference> = reference to a person who is not in the group, nor the subject, e.g., a former professor

Error
  <error> = new name appears because of error (e.g., Lackie as a subject instead of Leckie; or part of a prevpost line does not conform to the usual format)

Previous Posts (if not removed from dataset)
  <previous-poster> = when the previous message is included, this indicates the poster (‘Janice wrote: ’) (system generated)
  <copy> = name appears because it is included with the previous message

Page 15: Content-Based Social Network Analysis of Online Communities

Examples of hand-coding

Just a note to clarify something in yesterday's lecture/chat session. I mentioned that Monday's NY Times had an article on <#1><subject> Brewster. I want to clarify that the article concerns the copyright extension law and the current Supreme Court case <#1><subject> Eldred v. <#1><subject> Ashcroft, set to begin today, I believe. <#1><subject 2> Brewster Kahle is currently touring the country in a bookmobile … For more info on this … you can refer to the Web site that <#1><reference> Jodie mentioned yesterday… <#1><signature>LA
  NB. Jodie may not even appear in the contributors to this thread

Several of our programs at UC <#7><subject> Davis have well-intentioned lower division research methods classes that introduce then never reinforce basic skills.
  Need to disambiguate “UC Davis” from someone called “Davis”

Research (to paraphrase my hero, <#8><subject> Shrek) is like onions. Not because it stinks, but because it is made up of layers.
  “Shrek” as a name will not appear in conventional name lists.

Page 16: Content-Based Social Network Analysis of Online Communities

Automated Node & Tie Discovery Method

1. Determine names in the dataset, and assign a probability value

2. Determine email address to name relationship

3. Assign tie weight to each discovered tie

Page 17: Content-Based Social Network Analysis of Online Communities

Automated Node Discovery

Named Entities Recognition

Discovery of personal names
  The 1990 US Census http://www.census.gov/genealogy/names
  Capitalization

Distinguishing between names of people in and outside the class
  Having a list of names doesn’t always work, e.g., if someone uses their middle name which is not on the name list, or they use a short name or nickname
  Method: associate names with email addresses in the class, relying on content-based (e.g. context words) and structure-based (e.g. word position) features of names

Issues
  Many names - same person
  Same name - many people

Page 18: Content-Based Social Network Analysis of Online Communities

Automated Node Discovery (2)

EXAMPLE
From: [email protected] (=Wilma)
Reference Chain: [email protected] (=Dustin) => [email protected] (=Sam)

Hi Dustin, Sam, Nick and all, I appreciate your posts from this and last week […]. I keep thinking of poor Charlie who only wanted information on “dogs” Sam has been talking about. […] Wilma.

Words to the Left   Name      Words to the Right   Position %   Score “TO”   Score “FROM”
* Hi                Dustin    Sam, Nick,           0            0.322        -0.004
* Dustin,           Sam       Nick and             1            0.321        -0.002
Dustin, Sam,        Nick      and all,             2            0.320        -0.001
of poor             Charlie   who only             50           0.05         0.04
on “dogs”           Sam       has been             65           0.285        0.07
*                   Wilma     *                    88           0.0012       0.116

* - end of the line
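The context features in the table (words to the left, words to the right, position in the message) can be extracted with a short routine. The sketch below is a simplification under stated assumptions: `name_candidates` is a hypothetical helper, the name list stands in for a census-style list, position is computed as a token index over the token count, and the line-boundary marker (*) is not modelled. The slides do not say how the “TO” and “FROM” scores are computed, so scoring is left out.

```python
def name_candidates(text, known_names, window=2):
    """Find capitalized tokens that appear in a name list, with their context.

    Returns rows of (words_left, name, words_right, position_percent).
    Illustrative only; the feature definitions are assumptions.
    """
    tokens = text.split()
    rows = []
    for i, token in enumerate(tokens):
        # Strip surrounding punctuation before the name-list lookup.
        word = token.strip('.,;:!?()"\'“”‘’')
        if word[:1].isupper() and word in known_names:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            position = round(100 * i / len(tokens))
            rows.append((left, word, right, position))
    return rows


if __name__ == "__main__":
    message = ("Hi Dustin, Sam, Nick and all, I keep thinking of poor Charlie "
               "who only wanted information on dogs Sam has been talking about. Wilma.")
    names = {"Dustin", "Sam", "Nick", "Charlie", "Wilma"}
    for row in name_candidates(message, names):
        print(row)
```

A classifier would then use such rows as input to decide whether each name is an addressee, a sender signature, or neither.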

Page 19: Content-Based Social Network Analysis of Online Communities

Automated Tie Discovery

Associate each sender in the class with all names mentioned in his/her emails. For example,
  Wilma ---> Dustin = [email protected]
  Wilma ---> Charlie
    no email for Charlie, so not a person in the conversation group (e.g., when Steve and I took Professor Sid’s course last year)
  Wilma ---> no mention of a name; info on tie is only in the Chain network; could be the start of a thread, a change of topic within a thread, or a general posting

Assign tie weight
  Pair counts
  Mutual information
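Both tie-weight measures can be illustrated on a list of (sender, mentioned name) pairs. The slides do not give the exact formulation, so this sketch assumes pointwise mutual information over sender-name co-occurrence; `tie_weights` and the sample mentions are hypothetical.

```python
import math
from collections import Counter


def tie_weights(mentions):
    """Compute pair counts and pointwise mutual information (PMI) per tie.

    `mentions` is a list of (sender, mentioned_name) pairs, one per mention.
    PMI = log2( P(sender, name) / (P(sender) * P(name)) ), an assumed
    formulation used here for illustration.
    """
    pair_counts = Counter(mentions)
    sender_counts = Counter(sender for sender, _ in mentions)
    name_counts = Counter(name for _, name in mentions)
    total = len(mentions)

    pmi = {}
    for (sender, name), count in pair_counts.items():
        p_pair = count / total
        p_sender = sender_counts[sender] / total
        p_name = name_counts[name] / total
        pmi[(sender, name)] = math.log2(p_pair / (p_sender * p_name))
    return pair_counts, pmi


if __name__ == "__main__":
    sample = [("Wilma", "Dustin"), ("Wilma", "Dustin"),
              ("Wilma", "Sam"), ("Sam", "Dustin")]
    counts, pmi = tie_weights(sample)
    for tie in counts:
        print(tie, counts[tie], round(pmi[tie], 3))
```

Pair counts favor frequent posters, while a mutual-information measure discounts ties that are frequent only because both people are mentioned often.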

Page 20: Content-Based Social Network Analysis of Online Communities

Chain vs. Name Networks

Get added information from the name network

Ex. BBoards #06,07,08
  Nodes: 37
  Messages: 346
  Chain network ties: 223
  Name network ties: 215 / 429
  Shared ties: 140
  QAP Pearson Correlation: 0.453 (p = .000)
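A comparison of this kind (shared ties plus a QAP Pearson correlation) can be sketched with the standard library alone. This is a generic QAP permutation test, not the tool used for the reported figures; the function names, the one-sided p-value, and the toy matrices are assumptions. Both matrices must have some variation among their off-diagonal cells.

```python
import math
import random


def shared_ties(ties_a, ties_b):
    """Number of ties present in both networks (ties given as node pairs)."""
    return len(set(ties_a) & set(ties_b))


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def qap_correlation(a, b, permutations=1000, seed=0):
    """Correlate off-diagonal cells of two adjacency matrices.

    The p-value is the share of random node relabelings of `b` whose
    correlation with `a` is at least as large as the observed one.
    """
    n = len(a)
    cells = [(i, j) for i in range(n) for j in range(n) if i != j]
    xs = [a[i][j] for i, j in cells]
    observed = pearson(xs, [b[i][j] for i, j in cells])

    rng = random.Random(seed)
    order = list(range(n))
    hits = 0
    for _ in range(permutations):
        rng.shuffle(order)  # relabel rows and columns of b together
        permuted = [b[order[i]][order[j]] for i, j in cells]
        if pearson(xs, permuted) >= observed:
            hits += 1
    return observed, hits / permutations


if __name__ == "__main__":
    chain = [[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]]
    name = [[0, 1, 1, 0], [1, 0, 1, 0], [0, 0, 0, 1], [0, 0, 1, 0]]
    print(qap_correlation(chain, name))
```

Permuting node labels rather than individual cells keeps each network's structure intact, which is why QAP is preferred over an ordinary significance test for dyadic data.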

Page 21: Content-Based Social Network Analysis of Online Communities

An ego network for Brent

Visualization powered by http://www.netvis.org

Name Network Chain Network

Page 22: Content-Based Social Network Analysis of Online Communities

An ego network for Tyler

Name Network Chain Network

kurt -> Kurt Cobain, a lead singer for the rock band Nirvana
dewey -> John Dewey, philosopher & educator
santa_monica -> Santa Monica Public Library
mark -> mark up language

Visualization powered by http://www.netvis.org

Page 23: Content-Based Social Network Analysis of Online Communities

Conclusion

Uses and benefits of content-based networks
  Discovery of social network behavior rather than posting behavior
  Discovery of social interactions between group members that happened outside the group (e.g. fishing trip)
  Discovery of relations between group members and people outside the group (e.g. a shared friend from another department)
  Expert/Co-discussant discovery
  Study of perceived social networks without directly collecting survey data from participants (?)

Page 24: Content-Based Social Network Analysis of Online Communities

References and Further Reading

Related papers

Haythornthwaite, C. & Gruzd, A. (2007). A noun phrase analysis tool for mining online community. In C. Steinfield, B.T. Pentland, M. Ackerman & N. Contractor (eds.). Communities and Technologies 2007 (pp. 67-86). London: Springer.

Welser, H.T., Gleave, E., Fisher, D. & Smith, M. (2007). Visualizing the signatures of social roles in online discussion groups. Journal of Social Structure, 8(2). http://www.cmu.edu/joss/content/articles/volume8/Welser/