Twitter Analytics: The Sample of the London Olympics Week Two – INFO 480 – Introduction to Data Science

Understanding What’s in Twitter Data

• Relationships are unique (Golder & Yardi, 2010)

– 22% are reciprocal (Kwak et al., 2010)

• Digging deeper than follower counts (Cha et al., 2010)

• Context Collapse (Marwick & boyd, 2011)

• Numerous syntactical features

– Retweets

– Reply-to

– Mentions

– Hashtags

• Device String
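
The reciprocity figure above can be made concrete with a small calculation. This is a minimal Python sketch for illustration (the course samples are in R); the function name `reciprocal_share` and the toy follow list are hypothetical. It assumes reciprocity means the share of connected user pairs who follow each other, and the slides do not state the exact method used by Kwak et al. (2010).

```python
def reciprocal_share(follows):
    """Share of connected user pairs that are linked in both directions.

    follows: iterable of (follower, followed) tuples (hypothetical input format).
    """
    # Deduplicate directed edges and ignore self-follows.
    edges = {(a, b) for a, b in follows if a != b}
    # Every unordered pair with at least one link between them.
    pairs = {frozenset(edge) for edge in edges}
    if not pairs:
        return 0.0
    # Pairs where the reverse edge is also present.
    mutual = {frozenset((a, b)) for a, b in edges if (b, a) in edges}
    return len(mutual) / len(pairs)
```

For example, with the follows a→b, b→a, a→c, and c→d, only the pair (a, b) is mutual, so one of three connected pairs is reciprocal.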

Retweets

• RT @[username] “tweet text”

• Intention and Purpose (boyd et al., 2010)

• Frequency (Mustafaraj & Metaxas, 2011)

• Message Valence (Gruzd et al., 2011)

• Syntactic Structure (Suh et al., 2010)

– Users that follow more users are retweeted more (counterintuitive)

• Crisis Informatics (Starbird et al., 2010; Starbird & Palen, 2012)
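
The RT @[username] convention can be detected with a simple pattern match. The following is a minimal Python sketch for illustration (the course samples are in R); the name `parse_retweet` and the example tweets are hypothetical. It assumes usernames consist of letters, digits, and underscores, and that the marker is written as uppercase "RT".

```python
import re

# "RT", whitespace, "@username", an optional colon, then the retweeted text.
RETWEET_RE = re.compile(r"\bRT\s+@(\w+):?\s*(.*)", re.DOTALL)

def parse_retweet(text):
    """Return (retweeted_user, retweeted_text), or None if no RT marker is found."""
    match = RETWEET_RE.search(text)
    if match is None:
        return None
    return match.group(1), match.group(2).strip()
```

Because the pattern searches the whole tweet, it also catches commentary placed before the marker, as in "Wow! RT @user ...". Variants such as "via @user" or "MT @user" are not handled in this sketch.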

Conversation (Reply-to & Mentions)

• Reply-to

– @[username] at first position in tweet text

• Mention

– @[username] at any position in tweet text

• Conversation marker (Honeycutt & Herring, 2009)

• 3-5 messages

• 3% of direct addresses did not use @

– Mascaro, Novak & Goggins, 2012

• Engaging over controversy (Yardi & boyd, 2010)

• Measure of relationship strength (Bigonha et al., 2010; Bakshy et al., 2011)
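
The positional rule that separates a reply-to from a mention can be sketched as follows. This is a minimal Python illustration (the course samples are in R); `classify_addressing` is a hypothetical helper. It assumes usernames are runs of letters, digits, and underscores, and it skips an @ that directly follows a word character so that email addresses are not counted.

```python
import re

# "@username" that is not glued to a preceding word character (e.g. an email).
MENTION_RE = re.compile(r"(?<!\w)@(\w+)")

def classify_addressing(text):
    """Return (reply_to, mentions) for one tweet.

    reply_to is the username at the first position of the text, or None.
    mentions lists every other @username in order of appearance.
    """
    stripped = text.lstrip()
    first = MENTION_RE.match(stripped)  # match() only succeeds at position 0
    reply_to = first.group(1) if first else None
    start = first.end() if first else 0
    mentions = MENTION_RE.findall(stripped, start)
    return reply_to, mentions
```

Direct addresses that omit the @ symbol cannot be found this way and would need manual coding or name matching.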

Hashtags

• #[alphanumeric text] no spaces

• Discourse marker (Huang et al., 2010)

– Real-time topical identification (Mathioudakis & Koudas, 2010)

• Breaks down conversational barriers (Heverin & Zack, 2011; Bruns & Burgess, 2011; Sreenivasan, Lee & Goh, 2011)

• Diffusion of discourse (Chang, 2010; Szomszor, Kostkova & St. Louis, 2011; Chew & Eysenbach, 2010)
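
Extracting hashtags and counting them is a first step toward topical identification. A minimal Python sketch for illustration (the course samples are in R) follows; `hashtag_counts` and the sample tweets are hypothetical. It assumes tags are case-insensitive and treats the underscore as part of a tag, which is slightly broader than "alphanumeric".

```python
import re
from collections import Counter

# "#" followed by letters, digits, or underscores; stops at spaces and punctuation.
HASHTAG_RE = re.compile(r"#(\w+)")

def hashtag_counts(tweets):
    """Count how often each hashtag (lowercased) appears across tweet texts."""
    counts = Counter()
    for text in tweets:
        counts.update(tag.lower() for tag in HASHTAG_RE.findall(text))
    return counts
```

Calling `hashtag_counts(tweets).most_common(10)` would then list the ten most frequent tags in a sample.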

Twitter Access Mechanisms

• Twitter API identifies device/application used for tweet

• Identifying communities of discourse (Black et al., 2012)

• Demographic identification (Wohn & Na, 2011)

• Human or Bot (Chu et al., 2010)
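
The device/application string usually needs cleaning before it can be tabulated. The sketch below, in Python for illustration (the course samples are in R), assumes the field arrives as an HTML anchor such as `<a href="..." rel="nofollow">Twitter for iPhone</a>`, with plain strings such as "web" passed through unchanged. The helper name `client_name` is hypothetical, and the format of the sample data should be checked before relying on it.

```python
import re

# Capture the visible text of an HTML anchor element.
ANCHOR_RE = re.compile(r"<a\b[^>]*>(.*?)</a>", re.IGNORECASE | re.DOTALL)

def client_name(source):
    """Return the application name from a device string, with any anchor markup removed."""
    match = ANCHOR_RE.search(source)
    return (match.group(1) if match else source).strip()
```

Counting the cleaned names per user gives a simple feature for questions such as whether an account posts mostly from automated clients.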

URL

• http:// until next white space

• Twitter uses the t.co shortener

• Need to decode URL multiple times

– Other URL shorteners

• This process is “costly” with large datasets
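
One way to contain the cost is to cache every resolved link so that each short URL is decoded only once. The following Python sketch is for illustration (the course samples are in R). `extract_urls` and `resolve` are hypothetical names, and `expand_once` is an assumed caller-supplied function that returns the next URL in a redirect chain, or None when there is none. The pattern also accepts https://, which goes slightly beyond the rule on the slide.

```python
import re

# "http://" or "https://" up to the next whitespace character.
URL_RE = re.compile(r"https?://\S+")

def extract_urls(text):
    """Return all URLs found in a tweet's text."""
    return URL_RE.findall(text)

def resolve(url, expand_once, cache, max_hops=5):
    """Follow shortener hops until the URL stops changing; cache the final result.

    expand_once: assumed callable, url -> next url or None (e.g. one HTTP redirect).
    cache: dict shared across calls so repeated short URLs cost nothing.
    """
    if url in cache:
        return cache[url]
    current = url
    for _ in range(max_hops):
        nxt = expand_once(current)
        if nxt is None or nxt == current:
            break
        current = nxt
    cache[url] = current
    return current
```

In a real pipeline `expand_once` would issue a network request, which is the expensive step; the `max_hops` limit guards against redirect loops. Trailing punctuation attached to a URL is not stripped in this sketch.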

Categorizing Twitter Users Politically

• Research has categorized users politically by syntactical feature usage and content

– Retweets (Conover et al., 2012)

– URLs and memes (Ratkiewicz et al., 2011)

– Hashtags and Mentions (Livne et al., 2011; Hanna et al., 2011)

• “Content Injection”/“poaching” (Livne et al., 2011; Conover et al., 2011)

• Conversational networks

– #Hashtag +/- (Jurgens et al., 2011)

– Biased Gatekeepers

Syntactical Features

The Assignment: Week Two

• This is first and foremost an analysis assignment, and an assignment focused on familiarizing yourself with what R can help you with. A full, working sample is provided on GitHub. If you download the Full Zip File, you will have access to the data under the “Week2” directory.

– Set your working directory to “Week2”

– Run “Complete.R”. Examine the comments and the resulting files to familiarize yourself with a description of the data

• Analysis Questions. Write up a short essay, with tables or graphs if needed, to describe how you would:

– Build a network, using the scripts from week1, from the mention connections in this sample data. How would you do the same for the Reply-To connections? What transformations are required? How would you filter the data? Use the actual data to ground your thinking. Feel free to actually write or modify the R code samples from the first two weeks to experiment. Some of you will be more comfortable doing this; some will be more comfortable addressing the question conceptually. This is OK.

– Submit any issues you encounter to GitHub under this repository

• I will open a discussion board under our Blackboard Shell regarding the three papers you were assigned to read last week. I expect you to answer the questions and respond to your classmates. Your participation does not need to be long, just thoughtful.
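
As a starting point for the network question, the transformation from tweets to a weighted edge list might look like the sketch below. It is written in Python for illustration and is not the course's R code; `mention_edges` and the (author, text) input format are hypothetical. It assumes usernames should be lowercased and self-mentions dropped, both of which are filtering choices you may want to revisit.

```python
import re
from collections import Counter

# "@username" not glued to a preceding word character (skips email addresses).
MENTION_RE = re.compile(r"(?<!\w)@(\w+)")

def mention_edges(tweets, reply_only=False):
    """Build weighted directed edges (author -> addressed user).

    tweets: iterable of (author, text) pairs (hypothetical input format).
    reply_only: if True, keep only an @username at the first position of the text.
    """
    edges = Counter()
    for author, text in tweets:
        if reply_only:
            first = MENTION_RE.match(text.lstrip())
            targets = [first.group(1)] if first else []
        else:
            targets = MENTION_RE.findall(text)
        source = author.lower()
        for name in targets:
            target = name.lower()
            if source != target:  # drop self-loops
                edges[(source, target)] += 1
    return edges
```

Each key is a (source, target) pair and each value is the number of times that connection occurred, which maps onto the edge list most network tools expect. Note that "RT @user" also counts as a mention here, so whether to exclude retweets first is one of the filtering decisions to address in your essay.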
