
ANALYZING THE SPATIAL PROPAGATION OF INFORMATION IN TWITTER

By

SRETEN CVETOJEVIĆ

A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT

OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA

2018


© 2018 Sreten Cvetojević


To my family, friends and colleagues


ACKNOWLEDGMENTS

I would like to express gratitude to my advisor Dr. Hochmair. His guidance

motivated me to overcome countless obstacles along my scientific journey.

Words can hardly express how grateful I am to my parents. Their sacrifices

and hard work helped me come this far and will forever inspire me to go beyond the

limits. Special thanks to my brother who always inspired me to work harder towards the

future and not to dwell on my previous accomplishments.

Thanks to my former and present lab mates Denis Zielstra, Francesco Tonini,

Majid Alivand, Levente Juhász, Adam Benjamin, Ahmed Ahmouda and my friends at

FLREC for their help, friendship and encouragement.

I would like to thank my committee members for their guidance, understanding

and help during the course of my Ph.D.


TABLE OF CONTENTS

ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
LIST OF ABBREVIATIONS
ABSTRACT

CHAPTER

1 INTRODUCTION
    Objectives
    Dissertation Outline

2 POSITIONAL ACCURACY OF TWITTER AND INSTAGRAM IMAGES IN URBAN ENVIRONMENTS
    Study Background
    Study Setup
        Data Collection
            Geo-tagging in Twitter and Instagram
            Obtaining the photographer’s position
        Data Analysis
    Analysis Results
        R1: Twitter Image Positional Accuracy
        R2: Distance Between Photographer And Object
        R3: Distance Between Instagram Locations And Object Position
    Discussion And Future Work

3 ANALYZING THE SPREAD OF TWEETS IN RESPONSE TO PARIS ATTACKS
    Study Background
    Related Work
    Study Setup
        Twitter Information Sharing Methods Analyzed In The Study
        Data Access
    Analysis Of Tweet Popularity
        The Role Of Tweet Type And Content On Tweet Popularity
        The Effect Of The Profession On Tweet Popularity
    Analysis Of Information Spread
        Exploring Information Spread On World Maps
            Retweets
            Hashtags
            Kernel-density maps
        Spatiotemporal Regression For Global Spread Analysis
            Model formulation
            Data preparation
            Model estimation
    Discussion

4 MODELING INTERURBAN MENTIONING RELATIONSHIPS IN THE U.S. TWITTER NETWORK USING GEO-HASHTAGS
    Study Background
    Related Work
    Study Setup
    Analyzing the Network Structure of Mentions
        Graph Generation
        The Distance Between Mentioning Cities
        Node Degree
        Network Centrality Measures
        Reciprocity And Connectance
        Sentiment Analysis
    Homophily and Heterophily
        Data Preparation
            City characteristics (nodal covariates)
            Dissimilarity and similarity matrices
        Network Regression
    Discussion And Conclusions

5 CONCLUSIONS

LIST OF REFERENCES

BIOGRAPHICAL SKETCH


LIST OF TABLES

Table

2-1 Number of identified photographer positions and object locations (in parentheses)

2-2 Descriptive statistics of distances between photo upload and photo position in different geographic regions

3-1 Breakdown of geometry types in the analyzed dataset of tweets (wide Paris area, 13 Nov-27 Nov)

3-2 Confusion matrix for tweet content classification

3-3 Popularity of tweets for different tweet formats and content categories

3-4 Analysis of deviance for retweets

3-5 The interaction between tweet format and content category on retweets (P-value adjustment method: Holm)

3-6 Retweet statistics for tweets posted by journalists and non-journalists

3-7 Negative binomial regression for panel data (Europe is the default continent)

4-1 Cities with highest weighted indegree and outdegree (strength)

4-2 Pearson correlation between weighted centrality measures

4-3 City ranking based on closeness centrality, together with Kleinberg hub and authority scores

4-4 City mentions state subgraph indicators

4-5 Mean number of employees in given occupation per 1000 employees in any occupation across all analyzed cities, and its standard deviation of the mean. Categories in boldface highlight specific occupational categories whereas those in regular font show broad occupation categories

4-6 Arithmetic signs of estimated coefficients from Multivariate QAP regression on four models


LIST OF FIGURES

Figure

2-1 Analyzed areas

2-2 Twitter and Instagram photo positions in Vienna

2-3 Twitter and Instagram photo positions in Belgrade

2-4 Boxplots of distances in different geographic regions

2-5 Offset between the photographer and identified object

2-6 Spatial distribution of Instagram locations

3-1 Bounding box (this map extent) around Paris, which was used to select original tweets with images, hashtags, and keywords whose spread was analyzed

3-2 Tweet with photos

3-3 Power law fitting the distribution of retweets, separated by tweet format and content category

3-4 Interaction between tweet type and content category on the number of retweets

3-5 Retweets of tweets with pictures related to the Paris attacks

3-6 Geographic distribution of hashtags

3-7 Temporal distribution of hashtags

3-8 Kernel density maps for the first 9 hours of #prayforparis hashtag usage (tweet density is shown in thousand tweets per square km)

3-9 Distance-based clustering of Twitter places around Barcelona

4-1 Setup of world regions used for Twitter data download

4-2 Country place tag in geo-tagged tweets JSON file

4-3 Locations of originating cities of tweets (green polygons) and density of mentioned cities (blueish Kernel density map)

4-4 Force directed layout for a sub-graph of cities that have more than 30 incoming mentions

4-5 Distribution of weighted and unweighted distances (in km) between U.S. cities

4-6 Power law fitting the distribution of the weighted indegree and weighted outdegree of the city mentions graph

4-7 A network of mentions between cities in Colorado (link width is proportionate to edge weights)

4-8 Word clouds of the words most used with some of the analyzed geo-hashtags

4-9 Mean sentiment value of tweets between pairs of cities plotted against distance (in 1000s of km) between pairs of cities


LIST OF ABBREVIATIONS

API Application Programming Interface

EXIF Exchangeable Image File Format

GIS Geographic Information System

HTML Hypertext Markup Language

JSON JavaScript Object Notation

KML Keyhole Markup Language

LDA Latent Dirichlet Allocation

NLTK Natural Language Toolkit

OSM OpenStreetMap

POI Point of Interest

QAP Quadratic Assignment Procedure

SMS Short Message Service

SQL Structured Query Language

VGI Volunteered Geographic Information

URL Uniform Resource Locator


Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

ANALYZING THE SPATIAL PROPAGATION OF INFORMATION THROUGH TWITTER

By

Sreten Cvetojević

May 2018

Chair: Hartwig H. Hochmair

Co-Chair: Bon A. Dewitt

Major: Forest Resources and Conservation

This study explores and models spatiotemporal information propagation through

Twitter. It analyzes in detail the role of different content types of a tweet, such as

images, hashtags, or keywords, on information propagation, determines the effect of

sociodemographic characteristics of individuals on tweet popularity, and explores the

role of city attributes on the mentioning frequency between cities in the Twitter network.

Since this research is primarily concerned with the spatial aspect of information

propagation, an understanding of the data quality of spatial information associated with

a tweet is of high relevance for any subsequent analysis. For such an assessment,

several aspects related to spatial data quality in tweets are explored, such as available

geo-tagging options, including their associated positional errors and spatial resolution,

the positional accuracy of Twitter photos, social networking (e.g. retweeting) behavior,

technical limitations for Twitter data download, and data noise and spam affecting the

accurate modeling of spatial information spread. For part of this data quality analysis,

Twitter data is compared to other crowd-sourced data, such as Instagram photos,

to highlight the specifics of Twitter data and Twitter user behavior.


To demonstrate various approaches to observing, mapping, and modeling the

dynamic information spread through Twitter, a terrorist attack in Paris was chosen as a

showcase. Various exploratory methods and spatiotemporal regression models were

used to describe and formalize how the news of this event spread around the world,

and to analyze how tweet content, tweet format and type, user profession, and

geographic characteristics of places influence the effectiveness and speed of information

spread. The identified factors allow adding spatial and spatiotemporal

components to current approaches of information propagation modeling. The analysis of

Twitter communication patterns was furthermore expanded to interurban mentioning

relationships through the exploration of tweet patterns between U.S. cities based on

geo-hashtags. This provides insight into the inherent structure of the Twitter social

network space, its hierarchies, and the spatial and non-spatial processes and factors

governing the mentioning relationships between cities.


CHAPTER 1 INTRODUCTION

This study uses spatiotemporal analysis to advance the understanding of user

behavior and spatial information propagation in Twitter. Twitter is a microblogging

service that allows users to post messages of up to 140 characters (280 since November 2017),

called tweets. Tweets can have images attached or contain links to videos

or other external sources. Twitter was founded in 2006 and initially designed for tweets

to be sent as SMS (Short Message Service) messages, which explains the original length limit

of 140 characters. Twitter has 330 million monthly active users, who send 500 million tweets

every day. Eighty percent of active users are on mobile phones or tablets, and over 67

million users live in the U.S. (Aslam, 2018). Twitter provides a large volume of data to

analyze human social behavior and movement patterns. However, several factors

make the quality and usability of Twitter data questionable. For example, Twitter users are not

representative of the whole population, since the service is used primarily by younger people.

Further, its use is concentrated in industrialized nations, leaving several blank spots on

the globe. Also, only 1-2% of tweets are geo-coded, rendering only a small portion of

tweets usable for geographical research (Mitchell et al., 2013). Twitter data is one

prominent example of crowd-sourced data that comes with a spatial component, which

is often referred to as Volunteered Geographic Information (VGI) (Goodchild, 2007).

Other examples of VGI are data from photo-sharing applications, such as Flickr, or from

crowd-sourced maps, such as OpenStreetMap. While VGI is free of charge, it does not have

official quality standards, making its fitness for use for certain applications often

questionable (MacEachren et al., 2011).


Objectives

The overall goal of this dissertation is to explore and develop models of spatial

information propagation in Twitter through geotagged tweets with their different facets

relating to content and format, and to analyze and determine factors affecting such

information propagation in the Twitter network space. This will enhance our current

understanding of the spatial propagation patterns in Twitter and how Twitter users react

to real world events. This overall goal is accomplished through the following objectives:

Description of the geotagging accuracy of tweets;

Analysis of the positional accuracy of Twitter images and its comparison to the accuracy of images from other social networks;

Identification of factors influencing the popularity of tweets, including tweet format, user profession, and thematic categories;

Exploration of the geographic spread of event-related tweets over time and the role of the language used;

Identification of factors contributing to the information spread around the world within a spatiotemporal regression model; and

Exploration of underlying geographic and socio-demographic factors influencing the formation of the network of mutual city mentions using the Quadratic Assignment Procedure.

Dissertation Outline

In the first case study, certain aspects of Twitter images are explored and

compared to Instagram images. Both Twitter and Instagram provide means for the user

to annotate images with geographic location information to some extent. Using a

selection of images that are shared through these two platforms from various urban

areas around the world, this study compares the photographer’s position, which is

manually estimated from the scene shown in the image, with the annotated location

information of the image and the position of the object being photographed. This


approach provides the first insight into the Twitter user’s spatial movement between the

locations where the picture is taken and uploaded to Twitter. Furthermore, the distance

between the photographer position and the photographed object location in Twitter and

Instagram can be used as a proxy for the visual prominence of photographed urban

objects. Lastly, the collected dataset allows us to assess the positional accuracy of

location labels in Instagram through comparison of the label position to the true position

of the referenced object. For each of the different analyses the study discusses potential

sources leading to positional errors of images in Twitter and Instagram and provides a

comprehensive set of illustrative examples from different cities.

In the second case study, different tweet formats, including Twitter images, are

explored with regard to their effect on worldwide information propagation through

Twitter after the attacks that occurred in Paris in November 2015. Exploration of the

images posted by the Twitter users showed that two themes were predominantly used,

namely, events or their aftermath, and artistic support to the victims. This study also

found that journalists extensively used Twitter to share images of the events and that

their tweets received more attention than those of non-journalists. Endogenous

information spread is explored by mapping retweets, which represent sharing of

information from within the Twitter network only. Exogenous information spread (which

includes event information that may have been obtained from sources outside Twitter) is

modelled by observing the time and location of tweets with event-related hashtags.

Geographic and temporal aspects and a hierarchical structure of the spread pattern are

modelled using spatiotemporal regression analysis.


In the third case study, counts of geo-tagged tweets that mention selected U.S.

cities in their hashtags, combined with various measures of network connectivity, node

centrality, and city characteristics are used to examine the prominence of individual

cities in the Twitter landscape, and to identify factors that explain strong mutual

communication ties between cities. In addition, the joint use of the city’s name in a

hashtag along with other thematic hashtags posted in tweets allows extracting user

sentiments about a city and assessing the effect of geographic distance on mutual sentiments

between cities. This analysis contributes to the modeling of the relationships and ties

between cities in the social network space. It also offers a detailed interpretation of the

Quadratic Assignment Procedure that was used for modeling these relationships.
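The idea behind the Quadratic Assignment Procedure can be illustrated with a minimal sketch. It is a simple bivariate QAP correlation test, not the multivariate QAP regression used later in this dissertation, and the function names `offdiag_pearson` and `qap_test` are hypothetical. The sketch assumes two square city-by-city matrices with non-constant off-diagonal values. City labels of one matrix are shuffled (rows and columns together) to build a reference distribution for the observed correlation, which respects the dependence among the entries belonging to one city.

```python
import random


def offdiag_pearson(a, b):
    """Pearson correlation between the off-diagonal cells of two square matrices."""
    n = len(a)
    xs = [a[i][j] for i in range(n) for j in range(n) if i != j]
    ys = [b[i][j] for i in range(n) for j in range(n) if i != j]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def qap_test(y, x, n_perm=999, seed=0):
    """Return (observed correlation, two-sided permutation p-value).

    In each iteration the rows and columns of y are permuted with the SAME
    node permutation, so the network structure of y is kept intact while
    its alignment with x is randomized.
    """
    rng = random.Random(seed)
    n = len(y)
    observed = offdiag_pearson(y, x)
    hits = 0
    for _ in range(n_perm):
        p = list(range(n))
        rng.shuffle(p)
        y_perm = [[y[p[i]][p[j]] for j in range(n)] for i in range(n)]
        if abs(offdiag_pearson(y_perm, x)) >= abs(observed):
            hits += 1
    # Add-one correction counts the observed arrangement as one permutation.
    return observed, (hits + 1) / (n_perm + 1)
```

Here `y` could hold mention counts between city pairs and `x` a pairwise distance or dissimilarity matrix; both are illustrative inputs rather than the data sets analyzed in this work.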


CHAPTER 2 POSITIONAL ACCURACY OF TWITTER AND INSTAGRAM IMAGES IN URBAN ENVIRONMENTS

Study Background

Driven by the rapid development in computer, sensor, and communication

technology, the past decade experienced a surge in new Web 2.0 and social media

applications that allow users to share spatial information over the World Wide Web and

mobile communication platforms. Two prominent examples of social networking/photo

sharing platforms are Twitter and Instagram. Twitter is an online microblogging service

that allows users to send and read short 140-character messages called tweets. The

nature of Twitter data has been analyzed from numerous angles, ranging from the

extraction of travel patterns (Hawelka et al., 2014), through estimating the influence of

socio-economic factors on Twitter activity (L. Li, Goodchild, & Xu, 2013), to the

localness of tweets and other geotagged social media (Johnson, Sengupta, Schöning, &

Hecht, 2016). Twitter is also a rich source of images since users can share links to

media from other websites (e.g. YouTube, Instagram) or attach pictures to their posts

which are hosted on Twitter. The spatial aspect of Twitter image sharing has, however,

not been discussed in the research literature so far. Some studies did take on various

other topics of Twitter image analysis, though. For example, Thelwall et al. (2015)

conducted a content analysis of 800 images tweeted from the UK and the USA, finding

that most of the images were photographs, that about 9% of the images mainly

displayed text, and that about 15% of images were screen grabs of phones. The same

study estimated that about two thirds of the images were taken immediately before

being tweeted. Yanai and Kawano (2014) developed a classifier for grouping streamed

Twitter photo data into 100 kinds of food. Classification results are visualized in a


prevailing food map showing popular foods in different parts of Japan. The paper

also analyzed how the popularity of different dishes, such as “ramen noodle”, “curry”

and “okonomiyaki”, varies by season and region. The study presented in this paper

complements earlier research efforts by assessing the positional accuracy of Twitter

images at the urban level. For this purpose, the photographer’s position will be estimated

from the scenery shown in the image through manual identification of the location by

human analysts. This is then compared to the coordinates of the associated geo-tagged

tweet and the photographed object itself. The method of manually estimating the

photographer’s position from image scenes for accuracy assessment of crowd-sourced

data has already been applied to data from other photo-sharing services, such as Flickr

and Panoramio (Zielstra & Hochmair, 2013). Automated methods to extract the

photographer’s position from image content have already been developed for regions

with high photo density where images sufficiently overlap, and for which a set of control

points with known coordinates is provided (Y. Li, Snavely, & Huttenlocher, 2010).

Instagram is a photo- and video-sharing platform which allows users to take

pictures and videos and to share them with their followers on the Instagram website, as

well as through a variety of social networking platforms such as Facebook, Twitter, and

Flickr. Users can also geo-tag their shared content. The content and spatial distribution

of Instagram images have been analyzed in several recent studies. For example,

Bakhshi, Shamma, and Gilbert (2014) found that Instagram photos with faces are 38%

more likely to receive likes and 32% more likely to receive comments than those

without. Hochman and Manovich (2013) compared the visual signatures of 13 different

global cities using 2.3 million Instagram photos from these cities and used spatio-temporal

visualizations of over 200,000 Instagram photos uploaded in Tel Aviv, Israel,

to demonstrate how they can offer social, cultural and political insights about people’s

activities in particular locations and time periods.

Although social media images provide valuable information about a place, the

research literature has so far barely touched upon the spatial accuracy aspect of

images shared through Twitter and Instagram. Therefore, this paper addresses the

following three related research objectives:

R1: Determine for Twitter images the distance between a photographer’s position

(derived from the image content) and the geo-tagged position from which the tweet has

been sent. This analysis provides information about a photographer’s movement that

occurs between taking a picture and sending the tweet with the picture.

R2: Determine for Twitter and Instagram images the distance between the

photographer’s position and the photographed object. The range of distances

associated with a photographed object gives insight into the visual prominence of the

object.

R3: Determine for Instagram images the distance between the photographed

object and the Instagram location associated with that photograph. This provides

information about the positional accuracy of location tags available in Instagram for

annotating images with positional information.
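All three objectives reduce to measuring the distance between two point locations given as latitude/longitude pairs. The text does not state which distance formula was used, so the following is only a minimal sketch that assumes a great-circle (haversine) distance on a spherical Earth with a mean radius of 6,371 km. The function name is illustrative.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius; spherical approximation (assumption)


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))
```

For R1 the two points would be the estimated photographer position and the tweet coordinates, for R2 the photographer position and the object marker, and for R3 the object marker and the Instagram location. For the short urban distances considered here, a projected coordinate system would serve equally well.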

Study Setup

Data Collection

This study is based on local knowledge of human analysts so that the

photographer’s position can be estimated from the content that is shown on Twitter and

Instagram images. The study was therefore conducted for geographic areas that

students participating in this study (as well as the authors) were familiar with. Since

urban environments with their multitude of unique objects, e.g. monuments, stadiums,

plazas, or churches, provide more visual clues to estimate a photographer’s position

than a rural landscape with fewer discernible objects, the study was primarily

conducted in urban areas. In addition to the photographer’s estimated position, research

objective R1 requires the geographic coordinates of the location from which the tweet

with an image was sent, and R3 requires the coordinates of the location tag which has

been associated with the image by an Instagram user.

Geo-tagging in Twitter and Instagram

The Twitter mobile application interface allows the user to opt for attaching exact

geographic coordinates as metadata along with the tweet. The geographic coordinates

are in this case obtained through the smartphone geolocation method, which can be

based on the built-in GPS receiver, nearby Wi-Fi networks or from the mobile network

through base station information. The accuracy of the latter method depends on the

mobile network infrastructure. As an alternative for geotagging tweets, the user can also

pick a place from a collection of nearby locations in the mobile application, where more

general geographic entities, such as country, province, or city appear on top of the list.

How general Twitter’s suggestions are depends on the geographic region. For example, for

photos from Belgrade, Serbia, the top-most suggested place tag was “Republic of

Serbia”, whereas for photos from Vienna, Austria, the suggested place tag was “Vienna,

Austria”. Since the spatial granularity of these places is too coarse for the research

tasks proposed in this study, only photos from tweets with geographic coordinates

(derived from the cell phone) were used.
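The selection of tweets with exact coordinates can be sketched as follows. The sketch assumes the classic tweet JSON layout, in which an exact geotag is stored as a GeoJSON point under `coordinates` and a coarse place tag under `place`. The function name and the two trimmed sample tweets are hypothetical and not taken from the study data.

```python
def exact_coordinates(tweet):
    """Return (lat, lon) for a tweet with an exact point geotag, else None."""
    coords = tweet.get("coordinates")
    if not coords or coords.get("type") != "Point":
        return None
    lon, lat = coords["coordinates"]  # GeoJSON order: longitude, latitude
    return lat, lon


# Hypothetical, heavily trimmed tweets for illustration only.
tweets = [
    {"id": 1, "coordinates": {"type": "Point", "coordinates": [16.3738, 48.2082]},
     "place": {"full_name": "Vienna, Austria"}},
    {"id": 2, "coordinates": None, "place": {"full_name": "Republic of Serbia"}},
]

# Keep only tweets whose geotag is an exact point, not a coarse place.
usable = [t["id"] for t in tweets if exact_coordinates(t) is not None]
print(usable)  # [1]
```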

A geo-tagged image on Instagram does not provide exact geographic

coordinates of the location from where the picture was taken, or from where it was sent

or uploaded, respectively. Instead, it provides the name of the location that has been

selected by the user from a pre-defined list of locations when uploading the image to

Instagram. If the photo to be uploaded to Instagram has geographic coordinates in its

Exif (Exchangeable image file format) image file metadata tags, the Instagram

application lists locations that are near the coordinates in the Exif metadata. Exif

tags contain coordinates if the smartphone geolocation was activated while the image

was taken. If the Exif tags do not contain geographic coordinates, the Instagram

application lists locations near the current upload location identified by the smartphone.

The link to an Instagram image can be tweeted from within the Instagram application as

well. If the image file that is to be shared via Instagram does not contain geographic

coordinates in its Exif metadata and the smartphone geolocation function is turned off,

the image cannot be geo-tagged. Until a recent change in the Instagram application,

users were allowed to add custom places based on the Exif metadata coordinates or

the smartphone position to the list of already available location names nearby.

Therefore, a single real-world place, such as a city, state, or mountain, can have

different Instagram place labels assigned to it, with the same or different coordinates. It

is also possible that the same real-world feature is associated with several identical

Instagram place labels, where these place labels vary in position. Adding custom place

labels in Instagram has been deactivated as of August 2015.
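The order in which the application chooses the reference position for its location suggestions, as described above, can be summarized in a small decision function. This is only a model of the behavior described in the text, not Instagram code. The function name and the return convention are hypothetical.

```python
def suggestion_anchor(exif_coords, device_coords):
    """Return (source, (lat, lon)) used to look up nearby location names,
    or None if the image cannot be geo-tagged.

    exif_coords:   (lat, lon) stored in the photo's Exif tags, or None
    device_coords: (lat, lon) of the phone at upload time, or None if
                   smartphone geolocation is turned off
    """
    if exif_coords is not None:
        return "exif", exif_coords      # suggestions near where the photo was taken
    if device_coords is not None:
        return "device", device_coords  # suggestions near the upload position
    return None                         # neither source available: no geo-tag
```

The second branch is the case in which a location near the upload position, not near the photographed scene, can end up attached to an image.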

Obtaining the photographer’s position

To obtain the position of the photographer at the time when the picture was taken,

we relied on the local knowledge of 47 graduate students who took on this task as part

of a GIS graduate course at the University of Florida for partial course credit. For data

preparation, each student was asked to provide us with the bounding boxes of two urban

areas they were familiar with, anywhere in the world. For these areas, three types of

photos were collected:

1. Photos attached to tweets (hosted by Twitter): Links to jpg files are provided in tweet JSON files that can be harvested from the Twitter streaming API.

2. Photos from Instagram shared in tweets (as a link to Instagram photos): A tweet contains the link to the Instagram Web site for that photo. The HTML code of that Web site was then parsed for the URL to the corresponding jpg file.

3. Instagram photos: Original photos posted on Instagram containing metadata such as user and location information, links to photos, or captions.
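Separating the first two photo sources from harvested tweets can be sketched as follows. The sketch assumes the classic tweet JSON layout, with Twitter-hosted photos listed under `entities.media` and outbound links under `entities.urls`. The function name is hypothetical, and the subsequent step of parsing the Instagram page for the jpg URL is not shown.

```python
from urllib.parse import urlparse

INSTAGRAM_HOSTS = ("instagram.com", "www.instagram.com")


def photo_sources(tweet):
    """Split a tweet's photo references into Twitter-hosted files and Instagram pages."""
    entities = tweet.get("entities", {})
    # Photos hosted by Twitter: direct links to image files.
    hosted = [m["media_url_https"] for m in entities.get("media", [])
              if m.get("type") == "photo" and m.get("media_url_https")]
    # Photos shared from Instagram: links to the Instagram page of the photo.
    instagram = []
    for u in entities.get("urls", []):
        url = u.get("expanded_url") or ""
        if urlparse(url).netloc.lower() in INSTAGRAM_HOSTS:
            instagram.append(url)
    return hosted, instagram
```

A tweet that returns a non-empty first list would belong to the first photo type, and one that returns a non-empty second list to the second.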

Each photo used in the analysis contained at least one type of location

information in its metadata. Photos obtained through tweets had geographic coordinates

of the place the tweet was uploaded from. Instagram photos contained a user-assigned

location tag. Instagram photos shared in tweets contained the Instagram

location that users had chosen to annotate them with. For the data analysis,

Instagram images that were either obtained from the Instagram API or sent as a link in a

tweet were analyzed as one dataset, since for both methods the only geo-tagged

information available for the image is the Instagram location assigned by the user.

For the data collection process, in order to obtain a sufficient number of suitable

photographs that students could analyze in their selected region, the specified polygon

area was increased if necessary. This was often necessary for photos attached to

tweets (source 1), which occur in about 7.5% of geotagged tweets with exact

geographic coordinates. A smaller percentage of geotagged tweets (2.4%) was found to

contain links to Instagram images (source 2). The highest photo density in a region was

generally obtained from the Instagram API with original Instagram photos (source 3).

Prior to handing out photos to students to identify the photographer’s position, we

manually removed photos that contained profanities and vulgar content. In a Web

application that was set up for this study, students could then browse through the

collected photos for their selected urban areas. The task of the assignment for students

was to indicate for each image (whenever this was possible) the estimated position of

the photographer based on the image content, through adding markers to a “Google My

Maps®” map, together with the photo ID. Students were asked to complete this step for

20 images from each data source. If this was not possible, they were asked to analyze

more images from any data source (whichever one worked) to reach a total of 60

images. The marker locations indicated by students were then extracted from the

shared “Google My Maps®” maps through a script and inserted into a PostgreSQL

database. The authors of this paper went through the same steps for selected areas in

Vienna, Salzburg, Budapest, Szeged, Ispra and Belgrade. For the next steps, the photos

from only 23 students (out of the original 47 students) were further processed and

analyzed to reduce the time-consuming process of data cleaning to a feasible amount.

That is, for quality assurance all of the photographer positions indicated by the 23

students were manually checked by the authors in a customized Web application that

showed the original photo content, the specified position in a map as a marker, and the

“Google Street View®” image for that position next to the map where available. The Web

application enabled us to either accept the photographer’s position indicated by the

student as is, to move the marker position, or to exclude a photo if it was obviously

placed at the wrong location and if we could not identify the correct photographer’s

position based on the satellite image view or “Google Street View®”. Based on these

data it was possible to measure the distance between the photographer’s position and

a) the geo-tagged position of the tweet containing the picture and b) the location

position associated with an Instagram photo. In addition to these efforts, the authors

placed markers at the location of photographed objects that could be well approximated

through a point location, such as a clearly discernible building. Objects that could not be

well approximated with a point on the map and where it was unclear which point the

photographer was focusing on (such as with bridges) were not considered for this task.
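The extraction of student markers from the shared maps is described only as "a script". One way such a script could look is sketched below, assuming that each map was exported as KML and that students entered the photo ID as the marker name. Both are assumptions, as are the function name and the sample ID. The insertion into the PostgreSQL database is omitted.

```python
import xml.etree.ElementTree as ET

KML_NS = {"kml": "http://www.opengis.net/kml/2.2"}


def read_markers(kml_text):
    """Return a list of (photo_id, lat, lon) for point placemarks in a KML string."""
    root = ET.fromstring(kml_text)
    markers = []
    for pm in root.iterfind(".//kml:Placemark", KML_NS):
        name = pm.findtext("kml:name", default="", namespaces=KML_NS).strip()
        coords = pm.findtext("kml:Point/kml:coordinates", default="",
                             namespaces=KML_NS).strip()
        if not name or not coords:
            continue  # skip lines, polygons, and unnamed markers
        lon, lat = (float(v) for v in coords.split(",")[:2])  # KML order: lon,lat[,alt]
        markers.append((name, lat, lon))
    return markers


# Hypothetical one-marker export for illustration only.
SAMPLE = """<kml xmlns="http://www.opengis.net/kml/2.2"><Document>
<Placemark><name>tw_0017</name>
<Point><coordinates>16.3738,48.2082,0</coordinates></Point></Placemark>
</Document></kml>"""

print(read_markers(SAMPLE))  # [('tw_0017', 48.2082, 16.3738)]
```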

Table 2-1 summarizes the number of photographer positions obtained per

country and source that were retained for further analysis. Values in parentheses

indicate the number of object locations that were identified by the authors. Depending

on the research objective under consideration, different data columns are used from

Table 2-1, as will be described in the section about data analysis. Figure 2-1 plots the

photo locations from Table 2-1, and Figure 2-2 and Figure 2-3 provide a zoomed view of

available data sources for parts of Vienna and Belgrade.

Data Analysis

The analysis consists of three parts according to the three research objectives.

To quantify the movement of Twitter users between taking a photograph and uploading

it to the Twitter site (R1), the distance between these two positions is measured. To

assess regional differences, each data point was assigned to a geographic area, i.e.

North America (including the Caribbean), Europe and other. The dataset consists of 273

individual features from Twitter images.
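The assignment of countries to regions is not listed explicitly. The mapping in the following sketch is therefore an assumption, chosen so that the Twitter counts per country from Table 2-1 add up to the group sizes reported in Table 2-2, which places Turkey in the "other" group. The region codes follow the abbreviations used in the results section.

```python
# Assumed country-to-region mapping; every country not listed falls into "OTH".
REGION_BY_COUNTRY = {
    "Canada": "AME", "Haiti": "AME", "Puerto Rico": "AME", "United States": "AME",
    "Austria": "EUR", "Germany": "EUR", "Hungary": "EUR", "Italy": "EUR",
    "Serbia": "EUR", "Slovakia": "EUR", "United Kingdom": "EUR",
}

# Twitter column of Table 2-1 (photographer positions per country).
TWITTER_COUNTS = {
    "Austria": 45, "Canada": 26, "Germany": 1, "Haiti": 1, "Hungary": 11,
    "India": 7, "Italy": 0, "Kenya": 3, "Libya": 3, "Puerto Rico": 1,
    "Serbia": 24, "Slovakia": 5, "Turkey": 10, "United Arab Emirates": 4,
    "United Kingdom": 6, "United States": 126,
}


def region_of(country):
    """Map a country name to AME, EUR, or OTH."""
    return REGION_BY_COUNTRY.get(country, "OTH")


def counts_by_region(counts):
    """Sum per-country counts into per-region totals."""
    totals = {}
    for country, n in counts.items():
        region = region_of(country)
        totals[region] = totals.get(region, 0) + n
    return totals


print(counts_by_region(TWITTER_COUNTS))  # {'EUR': 92, 'AME': 154, 'OTH': 27}
```

The three totals sum to the 273 Twitter features used for R1.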

To answer R2, which assesses the visual prominence of objects, a dataset

containing 325 Twitter and Instagram photos was used, for which both the position of

the photographer and the photographed objects could be identified. We hypothesize

that the type of object and the object’s surroundings affect the visual prominence of the

object. Therefore, each photograph was assigned to one of the following categories: a)

photos of a prominent building spatially separated from other buildings; b) photos taken from a

location that is separated by water from the photographed object, e.g. by a

fountain or river; and c) all other photos. The last group contained, for example, pictures

of local businesses in downtown areas or other points of interest, such as small

monuments or fountains.

For R3, which analyzes the Instagram location accuracy by measuring the

distance between the photographed object and the annotated Instagram location, the

dataset used contains 251 photos. This dataset is a subset of the dataset used to

answer R2, containing only photos originating from the Instagram platform.

Analysis Results

R1: Twitter Image Positional Accuracy

The Twitter dataset can be used to study the movement of a photographer

between taking a picture and uploading it to Twitter. The log-log plot reveals that more

than 60% of photos were uploaded within a 1 km radius of the original photo location.

On the other end of the range, 2% of total photos were uploaded more than 100 km

away from the place where they were taken.

Different user patterns could be observed for posting photos on Twitter.

Approximately 30% of the photos were posted within 50 m of the actual location. This

distance closely resembles the maximum error of smartphone positioning in urban

environments; therefore, these photos can be considered instant uploads. As

opposed to this, 10% of photos were posted from more than 10 km away from the

original location. This category contains for example vacation images or photos from

sports events held in different cities. Users in this category decided not to upload their

photos instantly. The spatial distribution of the intermediate distance category provides

some information about the locations from where social media users post their photos.

In some cases, when the offset is large, the upload position corresponds to possible

open Wi-Fi hotspots and hotels. This might be indicative of tourist Twitter activities, for

example, when tourists do not have a cell phone data plan abroad, and are therefore

unable to upload their photos instantly. Images are often uploaded from areas that

appear to be residential, but taken somewhere else, e.g. downtown areas.
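The three upload patterns discussed above can be expressed as a simple classification of the offset distance. The thresholds of 50 m and 10 km are taken from the text, whereas the category labels and function names are illustrative.

```python
from collections import Counter


def upload_category(offset_m):
    """Label an upload by the offset between photo and tweet position (meters)."""
    if offset_m <= 50:
        return "instant"        # within typical smartphone positioning error
    if offset_m > 10_000:
        return "distant"        # e.g. vacation photos posted from another city
    return "intermediate"


def category_shares(offsets_m):
    """Fraction of uploads that fall into each category."""
    counts = Counter(upload_category(d) for d in offsets_m)
    return {label: counts[label] / len(offsets_m)
            for label in ("instant", "intermediate", "distant")}
```

Applied to the 273 Twitter offsets, the "instant" and "distant" shares would correspond to the approximately 30% and 10% reported above.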

Since distances in the three compared global regions are not normally

distributed, even after using a log transformation, a non-parametric test was applied to

test the effect of geographic region on median distance offsets. Data points were

categorized into North America/Caribbean (AME), Europe (EUR) and other (OTH,

consisting of locations from Arabic countries, India and Kenya). Descriptive statistics of

distances for these categories can be found in Table 2-2, revealing that median

distances, which are less affected by outliers caused by tweets from other cities

than the mean distance, are highest for regions outside North America/Caribbean and

Europe. Results of the Mood’s median test show that the geographic region has a

significant effect on the distance between the photo and upload location (p = 0.02). This

can potentially be explained by differences in Wi-Fi and mobile data infrastructure,

which has generally better coverage in regions of stronger economic development,

requiring users in less developed countries to move further for internet connection and

sending a tweet.
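For readers unfamiliar with Mood's median test, the computation for three groups can be sketched as follows. Each observation is classified as above or not above the pooled median, and a chi-square test is applied to the resulting 2 x 3 table. The sketch is restricted to exactly three groups because the chi-square distribution with two degrees of freedom has the closed-form tail probability exp(-x/2). Counting ties with the median as "not above" is an assumption; the software and tie handling used in the study are not stated.

```python
import math
from statistics import median


def moods_median_test_3(groups):
    """Mood's median test for exactly three samples. Returns (chi2, p_value)."""
    if len(groups) != 3:
        raise ValueError("this sketch supports exactly three groups (df = 2)")
    pooled = [x for g in groups for x in g]
    grand_median = median(pooled)
    above = [sum(1 for x in g if x > grand_median) for g in groups]
    not_above = [len(g) - a for g, a in zip(groups, above)]
    if sum(above) == 0:
        raise ValueError("no value lies above the pooled median")
    n = len(pooled)
    chi2 = 0.0
    for row in (above, not_above):
        row_total = sum(row)
        for observed, g in zip(row, groups):
            expected = row_total * len(g) / n
            chi2 += (observed - expected) ** 2 / expected
    # Survival function of the chi-square distribution with df = 2.
    return chi2, math.exp(-chi2 / 2)
```

In practice a statistics package would normally be used instead. The input would be the three lists of offset distances for the regions AME, EUR, and OTH.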

Figure 2-4 shows boxplots of the log transformed data grouped by geographic

regions, supporting the pattern from Table 2-2.

R2: Distance Between Photographer And Object

The distance between the photographed object and the photographer can be

interpreted as the visual prominence of an object, with larger values indicating that the

object can be seen (and is interesting enough to be photographed) from further away.

Only photos that have a clear focus on an object were used. Therefore, landscapes, city

panoramas, portraits and other photos with scenery were excluded from the analysis.

Visual inspection of the distance data revealed that most photos were taken in close

proximity to the photographed object, because urban environments usually

prevent distant views due to the high building density. Figure 2-5 A shows a typical

image setup in a city, with many objects being photographed from short distances, such

as stairways (lower left inset). The figure also shows that photos of landmark buildings

tend to be taken from larger distances, owing to their visual prominence and

the setup of their surroundings, which often includes large plazas and parks. A similar

case occurs if a water body is located between the object and the photographer (Figure

2-5 B), preventing the user from moving closer, and often providing a scenic foreground

for the photograph. Boxplots of distances for these categories are shown in Figure 2-5

C. A one-way ANOVA test on the log transformed distances indicates a significant effect

of the object category on the photographer’s distance (F(2,322) = 87.47, p < 0.001).
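The F statistic of such a one-way ANOVA can be computed as sketched below. The base of the logarithm is not stated; base 10 is assumed here, and the choice does not affect F because a change of base only rescales all values by a constant. Function names are illustrative, and the p-value, which requires the F distribution, is left to a statistics package.

```python
import math


def log_transform(distances_m):
    """Base-10 logarithm of positive distances given in meters."""
    return [math.log10(d) for d in distances_m]


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA over lists of values."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    means = [sum(g) / len(g) for g in groups]
    grand_mean = sum(sum(g) for g in groups) / n
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of the observations around their own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    df_between, df_within = k - 1, n - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within
```

With three object categories and 325 photos, the degrees of freedom are 3 - 1 = 2 and 325 - 3 = 322, which matches the reported F(2,322).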

The overall distribution of distances between the object and photographer also

follows a power law function with an exponent value of 1.31 and R-Squared of 0.89

(Figure 2-5 D). Out of the total 325 photos analyzed, 47% of social media photos with

identified objects were taken within 25 m of the object. On the other end, only 16% of

photos were taken more than 100 m away from the object.
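How the power law was fitted is not described. A common approach, assumed in the following sketch, is an ordinary least-squares fit of a straight line to the binned frequencies in log-log space, where the negative slope is the exponent and R-squared is that of the linear fit. The function name and the model form y = c * x^(-alpha) are illustrative.

```python
import math


def fit_power_law(xs, ys):
    """Least-squares fit of y = c * x**(-alpha) in log-log space.

    xs: bin distances (> 0), ys: frequencies (> 0).
    Returns (alpha, c, r_squared).
    """
    lx = [math.log10(x) for x in xs]
    ly = [math.log10(y) for y in ys]
    n = len(lx)
    mean_x, mean_y = sum(lx) / n, sum(ly) / n
    sxx = sum((a - mean_x) ** 2 for a in lx)
    syy = sum((b - mean_y) ** 2 for b in ly)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(lx, ly))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = (sxy * sxy) / (sxx * syy)
    return -slope, 10 ** intercept, r_squared
```

Under this convention, the reported exponent of 1.31 would correspond to a slope of -1.31 in the log-log plot. Bins with zero frequency would have to be dropped before fitting because their logarithm is undefined.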

R3: Distance Between Instagram Locations And Object Position

Instagram location labels are diverse in nature and can denote, among others,

physical objects, such as a building, street, or monument, or administrative units, such

as a city. Users previously had the ability to create custom locations, which resulted in a

high density of Instagram locations in urban environments, as shown in an example for

Salzburg (Figure 2-6 A). After an update in August 2015, attaching photos to existing

locations is the only way to geocode Instagram photos. This update also prevents users

from creating new locations inside the Instagram apps. The offset between the identified

objects and the Instagram locations ranges in the analyzed dataset between 2 m and 24

km (median: 85 m, mean: 635 m). Of the locations, 52% were closer than 100 m to the

object and 14% were further away than 1 km.

Several reasons can explain a location offset error. Among the locations more

than 1 km away from the identified object, several locations were tagged with general

names, such as a town (e.g. Ispra - Lago Maggiore) or a geographic area (e.g. Dutch

Harbor). This is not necessarily a positional error of the Instagram location, but rather

the user’s inclination towards increased privacy (i.e., obscuring his or her exact

location), lack of local knowledge, the thinking that a general location name is the best

fit for describing the photo content, or the absence of an appropriate Instagram location

nearby. Another reason to explain large offsets that are not related to Instagram location

position errors is when a user mistakenly picks the wrong location label for the photo. If

the photo is not tagged with geographic coordinates in its Exif tags, users rely on the

Instagram location suggestions that are based on their current position. In such cases,

when a user moves away from the photo location, an image can be associated with a

place in the proximity of the upload location. Figure 2-6 B shows an extreme example of

a photo (distance between location and object: 3.3 km) that is neither associated with

the place where it was taken, nor the true location of the object that is shown on the

photo, but a third location, which is most probably close to the place of upload (the

northernmost point).

Furthermore, a number of large-distance location-object pairs in Instagram

revealed misplaced Instagram labels, where locations do not align with their true

positions. An example is provided in Figure 2-6 C, where Instagram locations are

marked as red dots. In these cases, it is possible that the first user who created the

location traveled far towards the southeast before creating a custom location. This

phenomenon implies that custom locations were geotagged based on the smartphone's

geolocation, i.e. the current position of the user. This is illustrated in an example for St.

George Island, Florida (Figure 2-6 D). The spread of Instagram locations around the

true object position, a lighthouse, implies that locations were most likely added by

Instagram users, with coordinates corresponding to their smartphone locations. The

example also shows that the same object can have multiple Instagram locations. One

problem with misplaced locations is that users can add photos to them without being

aware of the position error, since only the location names are shown in the apps, but not

their map location.

Discussion And Future Work

This study analyzed the positional accuracy of geotagged images shared over

Twitter and Instagram, using the estimated photographer positions from the image

content, as well as published coordinates and/or locations of tweets with images and

Instagram images. For Twitter, the analysis provided some explanations for observed

patterns of distance offsets between photo capture location and tweet position, including

Wi-Fi availability. The study primarily considered images taken within urban areas

since otherwise the scene could not be recognized by the analyst. Offset distances

between photo capture location and tweet position can be expected to be much larger if

distances to scenes outside the city limits, e.g., in other countries, were taken into

account as well. Extending this kind of analysis to the worldwide scale is part of the

plans for future work. The study showed that Twitter and Instagram images help to

identify the visual prominence of selected objects, which is affected by the type and

layout of the object. The analysis therefore relates to the visual aspect of landmark

attractiveness, which could be expanded to determining the semantic and structural

attraction of landmarks (Raubal & Winter, 2002) for these two data sources. The study

also provided various explanations for observed inaccuracies in Instagram location

labels, such as travel between the location where a picture was taken and the location

where it was uploaded. For future work we plan to explore the density and accuracy of

place labels in more depth for cities around the world, and to relate their spatial

characteristics to those of other place label collections, for example in

Foursquare/Swarm.

Table 2-1. Number of identified photographer positions and object locations (in parentheses)

Country                Twitter     Twitter/Instagram   Instagram    Total
Austria                45 (24)     50 (27)             54 (16)      149 (67)
Canada                 26 (6)      28 (1)              26           80 (7)
Germany                1           2                   18           21
Haiti                  1           0                   16 (4)       17 (4)
Hungary                11 (5)      40 (14)             68 (25)      119 (44)
India                  7 (2)       12 (3)              14 (2)       33 (7)
Italy                  0           2                   41 (6)       43 (6)
Kenya                  3           3                   3 (1)        9 (1)
Libya                  3 (1)       3 (1)               52 (11)      58 (13)
Puerto Rico            1           13 (3)              4 (1)        18 (4)
Serbia                 24 (12)     19 (8)              26 (13)      69 (33)
Slovakia               5 (1)       16                  18           39 (1)
Turkey                 10          12 (1)              31 (10)      53 (11)
United Arab Emirates   4 (1)       8 (2)               0            12 (3)
United Kingdom         6 (1)       18 (5)              16 (6)       40 (12)
United States          126 (21)    203 (32)            546 (59)     875 (112)
Total                  273 (74)    429 (97)            933 (154)    1635 (325)

Table 2-2. Descriptive statistics of distances between photo upload and photo position in different geographic regions

Region                            Mean [m]   Median [m]   SD [m]     N
North America and the Caribbean   7389.0     198.7        20606.4    154
Europe                            2837.0     627.7        13077.5    92
Other                             3668.0     1559.0       6983.1     27

Figure 2-1. Analyzed areas.

Figure 2-2. Twitter and Instagram photo positions in Vienna.

Figure 2-3. Twitter and Instagram photo positions in Belgrade.

Figure 2-4. Boxplots of distances in different geographic regions.

Figure 2-5. Offset between the photographer and identified object. A) in Vienna, B) in Budapest, C) boxplot of distances for different object categories, D) fitted power law function to the frequency distribution of distances for Twitter and Instagram photos.

Figure 2-6. Spatial distribution of Instagram locations. A) in Salzburg, B) incorrect selection of an Instagram location in Budapest, C) misplaced Instagram locations in Florida, D) multiple locations for the same object with similar labels in St. George Island, Florida.

CHAPTER 3

ANALYZING THE SPREAD OF TWEETS IN RESPONSE TO PARIS ATTACKS

Study Background

Over the past decade, the number of social media and crowd-sourced data-

sharing platforms has grown substantially and opened a new era of information

collection and analysis. Understanding the dynamics of social networks is crucial for

tracking of opinions (e.g. political trends), management of crises (e.g. environmental

natural hazards or diseases), optimization of business performance (e.g. marketing

campaigns), or the detection of popular topics (Guille et al., 2013). Twitter provides a

prominent platform to study communication patterns among people and the information

flow between them, although, unlike many other social media platforms, Twitter does

not enforce reciprocal sharing (Lotan et al., 2011). The (non-spatial) spread of

information through the Twitter network has been analyzed in numerous studies

(Ferguson et al., 2014; Lerman & Ghosh, 2010; Pei et al., 2014; Romero et al., 2011),

which complements another major thread of Twitter-related analysis, namely that of

human mobility patterns (Hawelka et al., 2014; Hochmair & Cvetojevic, 2014; Hübl et

al., 2017; Jurdak et al., 2015; Lenormand et al., 2014, 2015; Y. Li et al., 2017; Steiger et

al., 2011; Valle et al., 2017). Although several studies addressed the connection

between geographic and social space when analyzing community interaction in social

media platforms (Gründemann & Burghardt, 2016; Takhteyev et al., 2012) most

information diffusion models operate exclusively within the social space, focusing, for

instance, on information promotion (Achananuparp et al., 2012), or the effects of

repeated exposure to hashtags on hashtag adoption (Romero et al., 2011). To better

understand the information spread across the physical world, there is a need to

integrate spatial components into diffusion models.

As a step in this direction, we selected a series of six attacks (including suicide

bombings and mass shootings) that occurred in Paris on the night of November 13th,

2015, and analyzed the diffusion of tweets that contain information pertaining to this

event around the globe. Related tweets were divided based on format and content.

Included formats are tweets with images, tweets with hashtags and tweets with

keywords. Related images posted through tweets were visually inspected to identify

dominant content categories. This led to two distinct content categories, namely, tweets

related to the attacks and those expressing sympathy or support. Diffusion

characteristics were then analyzed for each of these two classes separately. This two-

class content distinction is in line with an earlier study (Seo, 2014), which analyzed

images posted during the November 2012 Gaza conflict. It found that Israeli images

primarily featured the analytical propaganda theme, which included images relating to

attacks and destruction, whereas the emotional propaganda theme, e.g., raising

sympathy towards their own people, was dominant in Hamas images. Our paper

identifies several factors that influence tweet popularity (measured by the number of

retweets), including content category (attacks vs. support related), tweet format

(keywords vs. hashtags vs. images), and Twitter user profession (journalist vs. non-

journalist). Using these categories, various exploratory spatial methods, such as kernel

density maps, are applied to assess the global spread of event-related information

through tweets. This is followed by a spatiotemporal negative binomial regression

model, which uses tweets with event-related hashtags to identify significant predictors of

information spread around the world. In summary, this study addresses the following

research objectives:

determine the effect of tweet content category, tweet format, and user profession on the popularity of tweets that are posted in connection with the Paris attacks;

explore the geographic spread of event-related tweets over time;

use tweets with hashtags that relate to the Paris attacks to identify factors contributing to the information spread around the world within a spatiotemporal regression model.

The remainder of the paper is structured as follows. Section 2 reviews previous

work on information diffusion through Twitter. This is followed by a description of the

study setup in section 3. Section 4 provides results of the tweet popularity analysis,

followed by results of exploratory analysis methods and a spatiotemporal regression

model for Twitter-related information diffusion. Section 5 discusses findings and the

utilized analysis methods, which is followed by conclusions and directions for future

work.

Related Work

The geospatial data component that comes from social media content and from

crowd-sourcing applications used for communication, navigation, or sharing travel

experiences, is primarily generated through passive, often unaware, contributions,

and is therefore sometimes referred to as Involuntary Geographic Information (iVGI)

(Fischer, 2012). Although georeferenced tweets fall into the same category, and the

sharing of one’s location is not the main purpose of tweets, Twitter position information

has been frequently used to assess the spatio-temporal dimension of emergency

situations, such as earthquakes, floods, forest fires, or terrorist attacks (De Longueville

& Smith, 2009; Hung et al., 2016; L. Li & Goodchild, 2010; MacEachren et al., 2011), to

predict the spread of diseases (Brennan et al., 2013; Signorini et al., 2011), and to

model human mobility patterns in the case of unexpected events (Shelton et al., 2014).

Twitter is used by over 300 million users every month and therefore provides a

significant data source for studying communication patterns and information flows

among people (Lotan et al., 2011; Pei et al., 2014). However, it suffers from user

sampling bias (Duggan et al., 2015), and geographical bias through its concentration on

certain countries (Hawelka et al., 2014). Furthermore, only about 1% of all tweets are

geo-tagged (Graham et al., 2014). This means that results of Twitter studies are not

necessarily representative of the general population or even of all Twitter users. To

compensate for the scarcity of geo-tagged tweets, various studies have explored

methods to geo-locate tweets (Cheng et al., 2010; Zahra et al., 2017) or Twitter users

(Jurgens, 2013; Kotzias et al., 2014) through other sources of information in the tweet

post or in the user profile, such as geographic references in the tweet text and the social

network structure. Though these geo-positioning methods are consistently improving,

they add a level of positional uncertainty to any subsequent analysis, and often require

manual checks for reliable results. Therefore, for the presented study only geo-tagged

tweets were used.

Modeling of information diffusion in the Twitter network was often approached

through the analysis of retweet patterns (Guille et al., 2013), where a retweet is an

action taken by a Twitter user to share someone else’s tweets without alteration

(Compston, 2014). For example, Cha, Haddadi, Benevenuto, & Gummadi (2010)

compared three measures of user influence on others, namely the number of followers,

the number of retweets, and the number of user mentions. Results showed that popular

users with a high number of followers do not necessarily have more retweets and

mentions, but that it is more influential to have an active audience that retweets or

mentions the user. Another study showed that tweets that contain interesting URLs (as

rated by others), and are posted by users with many followers were likely to be more

widely spread (Bakshy et al., 2011). Similarly, Pei et al. (2014) used several network

topology measures, including degree, PageRank, and k-core, to detect influential

spreaders of information in the online social media platforms Twitter, Facebook, and

LiveJournal. Based on a diffusion network model, Yang & Counts (2010) predicted the

speed, scale, and range of information diffusion on Twitter using a variety of user and

tweet related predictors, including a user’s activity level, the presence of a URL in a tweet,

or the stage of topic lifespan when a tweet was posted. Achananuparp et al. (2012)

introduced the notion of weak retweets in their information propagation model. This

concept describes a user posting a tweet that mentioned a relevant item, such as a URL

or hashtag, from an earlier tweet posted by another user.

Besides retweet patterns, hashtags have often been used to observe content

trends and to track topical information propagation. A Twitter hashtag is a string of

characters preceded by the hash (#) character, and is generated by users as a method

to categorize content and to highlight topics. A recent study extracted sentiments and

topics from tweets that contained the #prayforparis hashtag and that were sent four

days after the Paris attacks (Chong, 2016). The topics were extracted using latent

semantic analysis (LSA) (Deerwester et al., 1990; Evangelopoulos et al., 2015;

Landauer & Dumais, 1997) and included among others a tribute to the victims of the

Paris attack during the soccer game between England and France. Lotan et al. (2011)

analyzed Twitter information flows during the 2011 revolutions in Egypt and Tunisia for

mainstream media organizations, journalists, and bloggers using tweets with hashtags,

such as #sidibouzid or #jan25. The study concluded that Twitter accounts of

organizations have substantially higher retweet rates than accounts of individuals, but

that news on Twitter is being co-constructed by bloggers and activists alongside

journalists. Tsur & Rappoport (2012) showed that a post’s content (e.g. length of a

hashtag) and context (e.g. cognitive categories), as well as the topology of the social

graph (e.g. number of followers) and global temporal features (e.g. peak hours) are

important predictors of the popularity of hashtags over time. Another study found that

the spread of hashtags varies by topic and that, especially for political hashtags,

repeated exposure leads to frequent hashtag adoption by followers (Romero et al.,

2011). Chang (2010) proposed a Diffusion of Innovation Theory that examines a trend

of hashtag adoption during certain time periods after the user has been exposed to

hashtag information.

Regarding news topicality, Kwak et al. (2010) compared the occurrence of

headlines between Twitter and CNN and found that some events, such as accidents

and sporting events, broke out on Twitter first. A comparative analysis of the relative

importance of social media for news in six European countries, Japan, and the U.S.

revealed that television is still the most widely used and most important source of news

(Nielsen & Schrøder, 2014).

Several studies examined the ties between spatial and social network structure

on Twitter. For example, it was found that smaller Twitter networks are more socially

clustered and extend over a smaller physical distance than larger ones, suggesting that

network and physical distances are related (Stephens & Poorthuis, 2014). Similarly,

Takhteyev et al. (2012) showed that a substantial share of Twitter ties lies within the

same metropolitan region and that distance related variables, such as language,

country, and the number of flights, affect Twitter ties between regional clusters.

Overall, the literature review reveals that the spatial and geographic aspects of

current diffusion network models of social media platforms are largely neglected. To

narrow this research gap, the role of distance and spatial hierarchy will be explored in

the context of information propagation. For this purpose, geo-tweets with images,

hashtags, and keywords related to the Paris attacks will be used as the data source.

Study Setup

Twitter Information Sharing Methods Analyzed In The Study

Twitter is a microblogging service that allows its users to send posts called

tweets. The length of a tweet was limited to 140 characters until November 2017, when

the maximum length was doubled to 280 characters. Our study uses tweets from 2015,

and therefore analyzes posts that are up to 140 characters long. Tweets can be

enriched with different content including images, videos, and links to external web

pages. The geo-positioning capabilities of mobile devices through GPS, Wi-Fi, or cell

phone towers give Twitter users the opportunity to add location information to their

posts. Users post tweets on their timeline, and a follower is a user who can see another

user’s posts on their own timeline. Followers can either like or retweet another user’s

tweet. In the case of a retweet, a user forwards a tweet and shares it on his or her own

timeline. The retweet mechanism, therefore, allows users to extend the information

beyond the reach of the original tweet’s followers (Kwak et al., 2010). Retweets can be

seen on a user’s timeline together with their own tweets and the list of liked tweets can

be seen in a separate tab. Liking a tweet is a sign of appreciation for it.

Hashtags provide a platform for the discussion of a specific topic and are therefore used

to classify information and highlight topics, promoting folksonomy (Chong, 2016).

Hashtag strings can be clicked to trigger a global search of tweets related to a topic of

interest. Retweeting and assigning tweets to topics through certain hashtags are

common methods of spreading information through Twitter.

Data Access

Twitter provides free access to the public portion of their data through the Twitter

Streaming Application Programming Interface (API) and REST APIs. The dataset used

for this study covers 1,094,009 worldwide geotagged tweets that were posted within two

weeks from the day of the attacks (November 13, 2015). Hashtags related to these

events were used primarily within a span of a few days. Therefore, the two-week range

appeared to be adequate for the proposed analysis. Data was downloaded using the

Tweepy Python library from the Twitter Streaming API and stored in a PostgreSQL

database. Since the Streaming API returns tweets in the JavaScript Object Notation

(JSON) file format immediately after they were posted, the number of retweets equals

zero upon download. Therefore, in order to obtain the current number of retweets of a

tweet, the HTML code of tweets was accessed through a URL in the format:

http://twitter.com/statuses/tweet_id, and then parsed using the BeautifulSoup Python

library.

The JSON object for each tweet with an image contains a URL to an actual

image file on the Twitter server. Using a customized Web application we manually

selected a subset of images that were posted from tweets within a predefined polygon

around Paris (Figure 3-1) and that were related to the attacks or showed support. The

JSON object also contains a list of all hashtags that are used in a tweet. Furthermore,

tweets that contained attack or sympathy related keywords were extracted using a full-

text search in PostgreSQL, as described in more detail in section 0.

According to the official documentation (Moffitt, 2014), tweets can contain three

types of location information, which are (1) geotags (exact location or Twitter place), (2)

geographic location mentioned in the tweet, or (3) location in the user profile. For this

study, only geotagged tweets were used. The breakdown of types of location

geometries found in the used worldwide dataset of geotagged tweets is shown in Table

3-1. Given the small percentage of tweets with exact coordinates (9.58%) among geo-

tagged tweets, the spatial analysis of information spread based solely on tweets with

exact coordinates would have been seriously limited.

To identify tweets that are posted from Paris and hence serve as a seed source

for information diffusion, various spatial search methods were applied:

Tweets with exact coordinates within the Paris bounding box (Figure 3-1),

Tweets geocoded with a place type “admin” whose centroid falls within the Paris bounding box,

Tweets geocoded with a place type “city” and the value “Paris”.
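
The three spatial search rules above can be illustrated with a minimal Python sketch. It assumes that tweets are available as dictionaries in the Twitter JSON layout, with a `coordinates` point stored as longitude and latitude and a `place` object carrying `place_type`, `name`, and `bounding_box`. The bounding box values and the function names are hypothetical placeholders, not the polygon shown in Figure 3-1 or the code used in the study.

```python
# Hypothetical bounding box around Paris as (west, south, east, north) in degrees.
# These values are placeholders; the study's actual extent is shown in Figure 3-1.
PARIS_BBOX = (2.0, 48.6, 2.7, 49.1)


def in_bbox(lon, lat, bbox=PARIS_BBOX):
    """Check whether a longitude/latitude pair lies inside the bounding box."""
    west, south, east, north = bbox
    return west <= lon <= east and south <= lat <= north


def place_centroid(place):
    """Average the corner points of a place bounding box (first polygon ring)."""
    ring = place["bounding_box"]["coordinates"][0]
    lons = [pt[0] for pt in ring]
    lats = [pt[1] for pt in ring]
    return sum(lons) / len(lons), sum(lats) / len(lats)


def is_paris_seed(tweet):
    """Return True if any of the three spatial search rules matches the tweet."""
    point = tweet.get("coordinates")
    if point:
        # Rule 1: exact coordinates, stored as [longitude, latitude].
        lon, lat = point["coordinates"]
        if in_bbox(lon, lat):
            return True
    place = tweet.get("place") or {}
    place_type = place.get("place_type")
    if place_type == "admin" and "bounding_box" in place:
        # Rule 2: centroid of an "admin" place falls within the bounding box.
        return in_bbox(*place_centroid(place))
    # Rule 3: place of type "city" with the value "Paris".
    return place_type == "city" and place.get("name") == "Paris"
```

Treating the rules as alternatives (a tweet qualifies if any one of them applies) is an assumption of this sketch.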

The following tweet formats were analyzed:

Tweets with attack related photos or support pictures (Figure 3-2),

Tweets with hashtags related to attacks or support (in English and French),

Tweets with keywords related to attacks or support (in English and French).

Tweets of all three tweet types were subdivided into two content categories as follows:

Event-related:

a) photos from streets immediately after the attacks (Figure 3-2 A),

b) hashtags such as: #ParisAttacks, #Bataclan (Bataclan is a theater where one of the attacks took place),

c) keywords such as: “attack”, “scared”, “terror”, “armes”, “policiers”.

Support-related:

a) tweets containing artistic support images (Figure 3-2 B),

b) tweets containing hashtags expressing support and sympathy, such as #PrayForParis,

c) keywords and bigrams such as: “pray”, “stay strong”, “contre terrorisme”.
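
The assignment of tweets to the two content categories can be sketched in Python as follows. The term sets contain only the examples listed above, not the full lists used in the study, and the function name `categorize` is hypothetical. Matching whole lowercase words exactly, without stemming, is a simplifying assumption of the sketch.

```python
import re

# Example terms taken from the lists above; the study's full lists are longer.
EVENT_HASHTAGS = {"parisattacks", "bataclan"}
SUPPORT_HASHTAGS = {"prayforparis"}
EVENT_TERMS = {"attack", "scared", "terror", "armes", "policiers"}
SUPPORT_TERMS = {"pray", "stay strong", "contre terrorisme"}


def categorize(text):
    """Return the set of content categories matched by hashtags or keywords."""
    lowered = text.lower()
    hashtags = set(re.findall(r"#(\w+)", lowered))
    # Remove hashtags so that they are not counted again as keywords.
    body = re.sub(r"#\w+", " ", lowered)
    words = re.findall(r"[^\W\d_]+", body)
    # Single words plus bigrams (pairs of adjacent words).
    terms = set(words) | {" ".join(pair) for pair in zip(words, words[1:])}
    categories = set()
    if hashtags & EVENT_HASHTAGS or terms & EVENT_TERMS:
        categories.add("event")
    if hashtags & SUPPORT_HASHTAGS or terms & SUPPORT_TERMS:
        categories.add("support")
    return categories
```

A tweet can match both categories in this sketch, in which case it would need to be handled separately.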

Analysis Of Tweet Popularity

In the presented study, the average number of retweets was used to measure

the popularity of tweets. The role of tweet format (image, hashtag, keyword), content

category (event, support), and profession of the contributor (journalist, non-journalist) on

popularity is assessed, using different sets of tweets. First, tweets with images from

Paris were selected manually using a customized Web application that visualized the

approximately 9000 tweets with images from the Paris area that were posted between

9:00 p.m. (local time) on November 13 and 7:00 a.m. the day after. Second, tweets posted

from the broader Paris area with matching keywords and hashtags posted within two

weeks from the attacks were extracted after manual selection of keywords and

hashtags.

The Role Of Tweet Type And Content On Tweet Popularity

After the first author selected and classified the images based on content, two

more graduate students verified the content classification of the images. The students

were asked to classify images as attack related or as support related, or to suggest an

alternative category. For the verification procedure the same Web application as for the

initial classification was used. All three individuals (first author and two verification

subjects) identified the exact same set of images as support images, as that theme was

very distinctive. The two verification subjects identified two and three additional attack

related images respectively, compared to the first author. After a thorough examination

of tweet text and comments associated with the newly identified photos, it was,

however, found that the photos were screenshots of news and not genuine images.

Therefore, these tweets were not used for further analysis. No other categories were

suggested by the verification subjects.

Using the Python NLTK (Natural Language Toolkit), keywords in English and

French were extracted from tweets posted within the Paris area. Frequent occurrences

of single words (e.g., terror, police and attack) and bigrams (a combination of two

words, such as in “stay safe”) were identified by the first author. A total of 101 single

keywords and 21 bigrams related to the attacks as well as 18 keywords and 10 bigrams

expressing support were identified. In addition, the four most frequent event and two

most frequent support related hashtags, with three in English and three in French, were

identified. Alternative methods of computer-assisted keyword and hashtag extraction

from unstructured text are presented in the literature (King et al., 2017).

To check the correctness of the classification of keywords, bigrams, and

hashtags into the two content categories, tweets in English were manually classified by

three individuals (two Ph.D. students and one postdoctoral researcher) and tweets in

French by three volunteers (the first author’s relatives who live in Paris and speak

French fluently). Each reviewer from the English group was given a random sample of

100 tweets with English keywords and 100 tweets with English hashtags. Similarly,

each reviewer from the French group was given 100 tweets with French keywords and

100 tweets with French hashtags. The six reviewers were given options to classify

tweets as attack related, support related, as “other” (i.e., unrelated to either of the two

categories), or to suggest a new category. Table 3-2 shows the confusion matrix of the

manual classification conducted by reviewers (event, support, other) and the automated

classification (event, support) for hashtags and keywords.

The table shows that 70.8% of the tweets that were automatically (i.e., based on

hashtags) classified as event-related were confirmed in the process of manual

classification. For support related tweets, the match was even higher with 96.0%. For

keyword-based tweet extraction, the matching rates were somewhat lower, i.e. 72.9%

(events) and 68.5% (support), respectively. Most discrepancies came from one French-

speaking reviewer who identified politics as a subcategory in certain tweets. However,

upon further review we could not identify distinctive keywords or hashtags in these

tweets that would imply a political theme. Other discrepancies came from a few

automatically extracted tweets that used both support and attack related keywords and

hashtags together, such as: “#ParisAttacks I hope @username is safe”. In this example,

the hashtag was related to the event but the text was expressing support, and a

reviewer classified it as support related.

The number of retweets in each of the six combined classes of tweet content

categories and tweet formats follows closely a power law distribution with an r-squared

of 0.83 or higher (Figure 3-3), supporting earlier findings about the distribution of

retweets (Can et al., 2013). This means that only a small number of tweets received a

high number of retweets. For the estimation of the power-law exponent (α), a linear

regression with simple logarithmic binning was used (White et al., 2008). For the

frequency distributions in Figure 3-3, only tweets that fall into exactly one format and

content category were used to better understand the effect of content and tweet format.

This means that, for example, tweets containing an image and a hashtag (i.e., two tweet

formats) were excluded.
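
The estimation of the power-law exponent by linear regression on logarithmically binned counts can be sketched in Python as follows. The sketch assumes bins that double in width ([1, 2), [2, 4), [4, 8), ...), places each bin at its geometric midpoint, and skips tweets with zero retweets because their logarithm is undefined. For a density proportional to x^(-α), the number of observations in such bins scales with x^(1-α), so the sketch reports α as one minus the fitted slope. Function names are hypothetical, and the binning details of the study may differ.

```python
import math
from collections import Counter


def log_binned_counts(values):
    """Count values >= 1 in bins [2**k, 2**(k+1)); returns {k: count}."""
    bins = Counter()
    for v in values:
        if v >= 1:
            bins[int(math.log2(v))] += 1
    return bins


def fit_line(xs, ys):
    """Ordinary least squares fit; returns (slope, intercept, r_squared)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    r_squared = (sxy * sxy) / (sxx * syy) if syy > 0 else 1.0
    return slope, mean_y - slope * mean_x, r_squared


def estimate_alpha(retweet_counts):
    """Estimate the power-law exponent and r-squared from retweet counts."""
    bins = log_binned_counts(retweet_counts)
    if len(bins) < 2:
        raise ValueError("at least two non-empty bins are required")
    ks = sorted(bins)
    xs = [k + 0.5 for k in ks]               # log2 of the geometric bin midpoint
    ys = [math.log2(bins[k]) for k in ks]    # log2 of the bin count
    slope, _, r_squared = fit_line(xs, ys)
    return 1.0 - slope, r_squared
```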

Table 3-3 shows the mean numbers of retweets and their standard deviations for

the different tweet formats and content categories, using the same dataset. It can be

seen that mean retweet numbers increase from bottom to top (keyword – hashtag –

image) and are larger for the event than for support related tweets across all tweet

formats.

Since observations are count data, the effect of tweet format and content

category on the popularity of a tweet was assessed using a two-way analysis of

deviance from the phia R package (De Rosario Martínez, 2015), assuming a negative

binomial distribution of observations. Hence, the observed counts were fit to a negative

binomial model with factors content category and tweet format and their interaction, and

then an ANOVA was run. Results for retweet numbers reveal a significant interaction

between tweet content and format and demonstrate significant main effects for tweet

format and content category (Table 3-4).
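
For orientation, a two-factor negative binomial model with interaction can be written as shown below. This is a generic textbook formulation, not the exact specification fitted in the study: the symbols κ (content category effect), γ (tweet format effect), and θ (dispersion parameter) are introduced here for illustration only, and the variance form is that of the common NB2 parameterization.

```latex
Y_{cf} \sim \mathrm{NB}(\mu_{cf}, \theta), \qquad
\ln \mu_{cf} = \beta_0 + \kappa_c + \gamma_f + (\kappa\gamma)_{cf}, \qquad
\mathrm{Var}(Y_{cf}) = \mu_{cf} + \frac{\mu_{cf}^2}{\theta}
```

Here c indexes the content category (event, support) and f the tweet format (image, hashtag, keyword). A significant interaction term means that the effect of content category on the expected number of retweets differs between formats.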

Since there are only two content categories (event and support), the main effect

of the content variable indicates that event tweets trigger significantly more retweets

than support tweets. Since there are more than two formats, the effect of format on

retweet numbers will be more closely analyzed using interaction contrasts (Table 3-5).

Results in Table 3-5 show that both for event and support content, tweets with

pictures receive more retweets than those with hashtags and keywords (rows 2, 6, 14,

15). In addition, tweets with hashtags receive more retweets than tweets with keywords

(rows 1, 13). Event-related tweets receive more retweets than support related tweets,

which is true for tweets with pictures (row 12), hashtags (row 3), and keywords (row 8).

All these results match the visual pattern observable in Figure 3-4. The figure shows

that differences in mean retweet numbers between event and support related tweets

vary between hashtags, keywords, and pictures, suggesting that the effect of the

content category on the number of retweets depends on the tweet format.

The Effect Of The Profession On Tweet Popularity

Among the 169 users who shared attack related photos (based on the earlier

manual selection), 48 users were identified as journalists. Most of these 48 user

accounts belonged to individuals, and not to organizations. To classify a user, the profile

description, username, and links were parsed by the first author for information that

expressed an affiliation with any kind of news channel, such as television or online

newspaper.

To verify the user classification, three graduate students who were not involved in

this study were asked to conduct a manual identification of the same 169 users who

posted photos of the attacks. Out of the 48 journalists initially identified by the first

author, the graduate students confirmed 46 users to be journalists, and two additional

Twitter users were identified as journalists. One of the additional journalist users had a

link to an external website that, among others, contained the user’s profession. The

other one had a description in Arabic that included special characters, which were only

identified by one reviewer who speaks Arabic. Using this new set of 48 journalist users,

journalists were found to have a significantly higher median number of followers

(median = 1554) than other users (median = 377) using a Wilcoxon Signed Rank test (Z

= 4095, p < 0.0001). To detect whether a tweet of a journalist was more influential than

a tweet of a non-journalist, after removing the effect of the number of followers, the

number of retweets per follower of a user was determined (Table 3-6).

A Wilcoxon Signed Rank test (Z = 3677, p = 0.006) showed that the median

number of retweets per follower for journalists (median = 0.003) was significantly higher

than for non-journalists (median = 0.001). This supports earlier findings from the

literature, which state that journalists are able to generate Twitter response levels that

are comparable to those of media organizations, bloggers, bots, activists, and

politicians, and hence engage their audiences more than other types of Twitter users

(Lotan et al., 2011). A possible explanation is that journalists have faster access to

news information, which leads to faster subsequent information dissemination. Another

reason for higher retweet rates could be a higher trustworthiness of individual journalists

who built their credibility over time, especially those who are highly engaged in social

media activities (Jahng & Littau, 2016).

Analysis Of Information Spread

Information spread was first analyzed through exploratory data analysis, using

worldwide maps of retweets of tweets with images, and using worldwide maps of tweets

with event-related hashtags. A spatiotemporal regression analysis provides an

analytical framework for dispersion modeling of tweets with attack related hashtags.

Exploring Information Spread On World Maps

Retweets

For the first analysis, tweets with event and support related pictures posted

between 9:00 p.m. (local time) on November 13 and 7 a.m. the next morning were used

as seed tweets. The worldwide locations of retweets were identified by finding original

geotagged tweets (i.e., not retweets) of retweeting users within a six-hour window

around the retweet time. This approach was necessary since retweets, which were

obtained from the Twitter Search API, did not contain the geographic position of the

retweet, but only that of the original tweet. To obtain a location of the retweet, all

location types (exact coordinates, neighborhood, city, province, country) of tweets

around the retweet time (between three hours before and after) were used. Factors

limiting the success in identifying the location of retweets were the sparsity of

geotagged tweets of 1-2% and the limitation that only the first 20 retweets of a tweet

can be obtained from the Twitter API for tweets. Hence, only tweets with up to 20

retweets were used for this analysis. This approach reduced the sample to 259 tweets

with images of events or support. The method located 68 retweets out of 1451 total

retweets. Figure 3-5 visualizes the location of retweets that were located using the

method described above, separated by event and support related images.
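
The retweet location procedure can be sketched in Python as follows. The sketch assumes that, for each retweeting user, the original geotagged tweets are available as pairs of posting time and location. When several geotagged tweets fall within three hours before or after the retweet, it picks the one closest in time; this tie-breaking rule and the function name are assumptions made for illustration, since the selection rule is not specified above.

```python
from datetime import datetime, timedelta

# Three hours before and after the retweet, i.e. a six-hour window.
WINDOW = timedelta(hours=3)


def locate_retweet(retweet_time, user_geotweets, window=WINDOW):
    """Return the location of the user's geotagged tweet closest in time.

    user_geotweets: list of (posted_at, location) pairs for original
    geotagged tweets (not retweets) of the retweeting user.
    Returns None if no geotagged tweet falls inside the window.
    """
    candidates = [
        (abs(posted_at - retweet_time), location)
        for posted_at, location in user_geotweets
        if abs(posted_at - retweet_time) <= window
    ]
    if not candidates:
        return None
    return min(candidates, key=lambda item: item[0])[1]
```

Retweets for which the function returns None remain unlocated, which is consistent with only a small share of retweets receiving a position.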

Retweets were primarily found in Europe and the United States, which have a

higher Twitter penetration rate than countries on other continents. The higher density in

some European regions could be explained by their proximity to France, and therefore

higher safety concerns. While the map does not display all retweets of identified seed

tweets due to technical limitations described before, it provides a general overview of

the regions to which information about the attacks primarily spreads. Most retweets are

located in France (44), followed by the United States (13), Spain (5) and Germany (3).

Hashtags

Figure 3-6 visualizes the location of tweets with selected French (A, B) and

English (C, D) hashtags posted within the first two weeks of the attacks. These four

hashtags are a subset of the six hashtags used for tweet extraction described earlier.

The maps show that the language of the hashtags has a clear effect on the

geographic spread of tweets. Tweets with French hashtags spread mostly within France

and to some extent to the only francophone Canadian province (Quebec) and

predominantly French-speaking Caribbean islands. As opposed to this, tweets with

English hashtags, whether they relate to attacks (Figure 3-6 C) or support (Figure 3-6

D), spread into many more countries around the world. The fact that English is more

widely spoken around the world than French (see footnote 1) may explain why English

hashtags are more widely used than French ones, leading to these distinct information diffusion

patterns.

Figure 3-7 plots the worldwide proportion of tweets containing particular hashtags

about attacks (solid lines) and support (dashed lines) among all tweets containing any

hashtag for the first two weeks after the attacks. The shape of the line graphs suggests

that the interest in the topic dropped quickly after two days. The daily counts are

measured for Paris local time. Since the attacks happened in the late evening hours,

only a small proportion of tweets occurs on November 13th.
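
The daily proportions can be computed along the lines of the following Python sketch. It assumes that tweets are given as pairs of a timezone-aware UTC timestamp and a list of hashtags, and it uses a fixed UTC+1 offset for Paris, which holds in November but ignores daylight saving time. The function name is hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone

# Central European Time (UTC+1) applies to Paris in November; the fixed
# offset is an assumption that avoids handling daylight saving time.
PARIS_TZ = timezone(timedelta(hours=1))


def daily_hashtag_share(tweets, hashtag):
    """Share of tweets with `hashtag` among tweets with any hashtag, per day.

    tweets: iterable of (timezone-aware datetime, list of hashtags).
    Returns {local_date: share}, with days assigned in Paris local time.
    """
    target = hashtag.lower().lstrip("#")
    with_any = defaultdict(int)
    with_target = defaultdict(int)
    for posted_at, hashtags in tweets:
        if not hashtags:
            continue  # the denominator counts only tweets containing a hashtag
        day = posted_at.astimezone(PARIS_TZ).date()
        with_any[day] += 1
        if target in {h.lower().lstrip("#") for h in hashtags}:
            with_target[day] += 1
    return {day: with_target[day] / with_any[day] for day in with_any}
```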

Kernel-density maps

Kernel-density maps were used to visualize the spatial distribution of tweets with

selected hashtags over time. To illustrate the spread of an English hashtag, Figure 3-8

visualizes Kernel density maps on top of individual tweet locations with the

#prayforparis hashtag within the first 9 hours after the attacks, grouped by 3-hour

aggregations. Visual inspection suggests that during the first three hours tweets occur

primarily in and near France and in parts of the US East coast.

1 http://www.diplomatie.gouv.fr/en/french-foreign-policy/francophony/the-status-of-french-in-the-world/

In Figure 3-8 A it is between 9 p.m. and midnight local time in Europe, and the

highest concentration of related tweets is, as expected, in Western Europe due to the

proximity of Twitter users to the event. In that figure, it is late afternoon/early evening on

the US East coast, which is an active time for tweeting compared to morning or late

night hours (Andrienko et al., 2013). This can explain this early concentration of related

tweets in that region. As opposed to this, for selected regions in the Middle East or Asia,

the local time associated with the first map is closer to late night or early morning hours,

e.g. between 1 a.m. and 4 a.m. in Dubai, and between 5 and 8 a.m. in the Philippines.

This may explain the lower initial level of tweet responses to attacks in these areas.

Three hours later (Figure 3-8 B), the news spread further to populated areas around the

world with high Twitter penetration rates, such as Brazil, the western United States,

Central America, Indonesia, and the Philippines, but still only little to the Middle East

(with a local time between 4 a.m. and 7 a.m.). Another three hours later (Figure 3-8 C)

tweets spread further into adjacent regions of those highlighted in Figure 3-8 B, also

showing some response in the Middle East.

Spatiotemporal Regression For Global Spread Analysis

The purpose of the regression model was to find spatial and temporal regression

coefficients that reflect the spread of tweets containing any of the six hashtags in Figure

3-7 around the world. It was expected to reveal patterns similar to those observed in the

kernel density maps.

The data was constructed as a panel, with tweet counts for clusters of Twitter

places in three-hour time intervals prepared over a time period of two weeks. The use of

panel data allowed modeling a time-lagged neighbor effect, where the tweet count in

one area (e.g., Paris) affected the tweet count in “neighboring” areas, e.g., other

populous metropolitan areas, such as New York, Rio De Janeiro, or London. The

analysis was run as a negative binomial regression for the count outcomes, since the

count data was over-dispersed. Stata software was used for this purpose.
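
The panel structure with three-hour intervals and a first-order time lag can be sketched in Python as follows. The sketch assumes that each tweet has already been assigned to a cluster and that a common start time is given; cluster identifiers, the start time, and the function names are hypothetical. The regression itself was run in Stata and is not reproduced here.

```python
from collections import Counter
from datetime import datetime, timedelta

INTERVAL = timedelta(hours=3)


def build_panel(tweets, start, n_intervals):
    """Count tweets per cluster and three-hour interval.

    tweets: iterable of (cluster_id, posted_at) pairs.
    Returns {cluster_id: [count in interval 0, count in interval 1, ...]}.
    """
    counts = Counter()
    clusters = set()
    for cluster, posted_at in tweets:
        clusters.add(cluster)
        index = (posted_at - start) // INTERVAL
        if 0 <= index < n_intervals:
            counts[(cluster, index)] += 1
    return {
        cluster: [counts[(cluster, t)] for t in range(n_intervals)]
        for cluster in sorted(clusters)
    }


def lagged_rows(panel):
    """Yield (cluster, t, count at t, count at t-1) for every t >= 1."""
    for cluster, series in panel.items():
        for t in range(1, len(series)):
            yield cluster, t, series[t], series[t - 1]
```

For a two-week period, `n_intervals` would be 112 (14 days with eight intervals each).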

In the given context of information dispersion through Twitter, the city in each

country that had most tweets with any of the selected hashtags was designated as a

local connector (neighbor) to Paris, which was considered the source of

information. These major cities, which did not necessarily match political capitals of the

countries but were derived from a clustering process during a preparatory step, were

called tweeting capitals. Predictor variables were set up in a way that estimated

regression coefficients would model the spread of information from Paris to other

tweeting capitals, and the subsequent information spread to other smaller cities around

each tweeting capital. In recent studies, a similar framework with lagged variables was

used to disentangle the cause and effect of land use and transportation network growth

(Levinson, 2008), and to model the interaction in data growth between different crowd-

sourced datasets (Alivand & Hochmair, 2017).

Model formulation

A general model for panel data analysis, using a first order lag and a negative-

binomial distribution of the dependent variable, can be formulated in the presented

context as follows, similarly to (Levinson, 2008):

$$\ln(D_{i,t}) = D_{i,t-1}\,\varphi + W D_{i,t-1}\,\rho + X_{i,t-1}\,\beta + W X_{i,t-1}\,\chi + Z_i\,\zeta + T_{t-1}\,\psi$$

where

$D_{i,t}$ is the number of tweets with hashtags in cluster $i$ at time $t$,

W is a matrix of spatial interaction weights (the neighborhood matrix),

$D_{i,t-1}$ is the number of tweets with hashtags in cluster $i$ at time $t-1$ (the lagged

value of the dependent variable),

X is a vector of variables that change with both cluster and time,

Z is a vector of cluster-specific variables that do not change with time,

T is a vector of time-specific variables that do not change with the cluster, and

φ, ρ, β, χ, ζ and ψ are coefficients to be estimated through regression.

The weight matrix defines spatial relationships between clusters. It consists of

binary values that indicate likely directionality in the change of the dependent variable

over time. With the chosen matrix setup, all the cities in a country are modeled to be

neighbors to their tweeting capital, and all the tweeting capitals are neighbors to Paris.

Vector $X_{i,t-1}$ contains the number of all tweets posted in each cluster $i$ at time period $t-1$.

Vector $Z_i$ consists of i) a variable indicating the geodesic distance between the tweeting

capital of cluster i and Paris and ii) a variable describing the continent cluster i is located

in. The latter captures differences in time zones between clusters and thus the different

local times at which the attacks occurred. For modeling purposes, several time zones

were grouped together by continent, giving the following four continent groups: 1)

Europe, 2) the Americas, 3) the Middle East and 4) Asia and Australia. Vector $T_{t-1}$

represents the count of three-hour periods passed since the attacks, which is the same

for each cluster.
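The neighborhood structure described above can be sketched in a few lines of Python. The cluster names, counts, and helper functions below are hypothetical, and the row and column orientation of the matrix (row i marks the cluster that is modeled to influence cluster i) is an assumption made for illustration, since the text only states that the binary values indicate a likely directionality.

```python
def build_weights(clusters, capital_of, source="Paris"):
    """Binary matrix W with W[i][j] = 1 if cluster j is modeled to influence cluster i."""
    index = {name: k for k, name in enumerate(clusters)}
    n = len(clusters)
    W = [[0] * n for _ in range(n)]
    for name in clusters:
        if name == source:
            continue  # the source itself has no upstream neighbor
        capital = capital_of[name]
        # A tweeting capital is linked to the source; other cities to their capital.
        neighbor = source if capital == name else capital
        W[index[name]][index[neighbor]] = 1
    return W


def spatial_lag(W, lagged_counts):
    """Compute the product of W and the lagged counts D at t-1, one value per cluster."""
    return [sum(w * d for w, d in zip(row, lagged_counts)) for row in W]


# Made-up clusters and lagged hashtag counts.
clusters = ["Paris", "London", "Leeds", "New York", "Miami"]
capital_of = {
    "Paris": "Paris",
    "London": "London",
    "Leeds": "London",
    "New York": "New York",
    "Miami": "New York",
}
lagged = [900, 300, 20, 250, 40]

W = build_weights(clusters, capital_of)
print(spatial_lag(W, lagged))
```

In this toy setup each cluster has exactly one neighbor, so the spatially lagged term for a cluster is simply the previous-period count of the cluster it is attached to.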

Data preparation

For the analysis, only those tweets were used that were geocoded either with

exact coordinates, or with a place tag at the neighborhood or city level, and that had any

of the six included hashtags within two weeks from the attacks. Neighborhood and city

places are represented as rectangular bounding box polygons in Twitter. In a first step,

in order to avoid excessive zero counts at different time steps in analyzed places, the

number of places used in the regression analysis was reduced. This was achieved by

clustering all Twitter places at the neighborhood and city level into major cities through

distance-based clustering of Twitter place polygon centroids, using

the PostGIS function ST_ClusterWithin(). This function returns an array of geometry

collections. Each collection contains a set of geometries whose centroids are separated

by no more than a specified distance. In our setup, places whose centroids were less than

0.1 arc degrees apart were aggregated into a cluster. Figure 3-9

demonstrates the clustering process of 26 Twitter places (rectangles) in South Florida

into four major clusters (ellipses) and six smaller standalone clusters (rectangles of

different colors some distance away from ellipses). These smaller clusters were

retained for tracking the local spread of tweet information out of the tweeting capitals.

Clusters with fewer than one thousand tweets over the course of two weeks were

excluded from the analysis, as were clusters that had fewer than 40 tweets with

hashtags related to the Paris attacks.
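The clustering and filtering steps can be approximated outside the database. The sketch below is a plain Python stand-in, not the PostGIS call itself: it groups centroids into connected components under a distance threshold in degrees (a single-linkage reading of the clustering described above) and applies the two exclusion thresholds. Coordinates, function names, and the use of a planar distance in degrees are assumptions for illustration.

```python
from math import hypot


def cluster_within(centroids, max_dist=0.1):
    """Group points so that each point lies within `max_dist` of another point in its group."""
    parent = list(range(len(centroids)))

    def find(i):
        # Follow parent links to the group representative (with path halving).
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, (x1, y1) in enumerate(centroids):
        for j in range(i + 1, len(centroids)):
            x2, y2 = centroids[j]
            if hypot(x1 - x2, y1 - y2) <= max_dist:
                parent[find(i)] = find(j)  # merge the two groups

    groups = {}
    for i in range(len(centroids)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


def keep_cluster(total_tweets, hashtag_tweets, min_total=1000, min_hashtag=40):
    """Apply the two exclusion thresholds described in the text."""
    return total_tweets >= min_total and hashtag_tweets >= min_hashtag


# Made-up (longitude, latitude) centroids.
points = [(-80.20, 25.77), (-80.13, 25.79), (-80.14, 26.12), (-81.80, 26.14)]
print(cluster_within(points))
```

Because membership is transitive, two places further apart than the threshold can still end up in the same cluster when a chain of intermediate places connects them.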

Model estimation

Table 3-7 shows the results of the model estimation, which predicts the count of

tweets with selected hashtags in place clusters.

Results indicate that cities which are designated as a country’s tweeting capital

are associated with a higher number of tweets than other cities in that country. An

increasing distance between the tweeting capital and Paris, as well as the number of

three-hour periods since the hashtag inception, are negatively associated with the

number of tweets with hashtags. The latter indicates that the growth in the numbers of

the tweets with hashtags declines over time. The number of tweets in a given time

period was positively correlated with the number of tweets with hashtags observed in a

place cluster during that time period. As expected, the lagged count of tweets with

hashtags at t-1 (shown as L1) is a strong predictor of the count of tweets with hashtags

at t. The number of tweets with hashtags in clusters is also affected by the number of

tweets sent in the “tweeting capital” of the country in the previous time period, as

indicated by the Δhashtags variable. This shows that the local spread of information

within a country from the tweeting capital to other cities in the country explains part of

the tweeting activities in those cities, suggesting a hierarchical structure of information

diffusion. This matches the visual perception of diffusion patterns in Figure 3-8.

Locations in Asia and Australia received an increased number of tweets compared to

other continents, after controlling for distance from Paris, possibly due to high

population densities in certain Asian regions.
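Since the model is formulated on the log scale, exponentiating a coefficient gives the multiplicative change in the expected count for a one-unit increase in the predictor. The following Python sketch applies this standard transformation to three coefficients copied from Table 3-7; the helper name is hypothetical and the snippet is an interpretation aid rather than part of the original analysis.

```python
from math import exp

# Coefficients copied from Table 3-7.
coefficients = {
    "Tweeting capital": 0.585,
    "Three hour time periods": -0.095,
    "Continent (Asia)": 0.600,
}


def rate_ratio(coefficient):
    """Multiplicative change in the expected count per one-unit increase."""
    return exp(coefficient)


for name, beta in coefficients.items():
    print(f"{name}: {rate_ratio(beta):.3f}")
```

Read this way, a tweeting capital is associated with roughly 1.8 times the expected hashtag count of other cities, and each additional three-hour period with a decrease of about 9%, holding the other predictors constant.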

Discussion

This research presented a multi-faceted analysis of information spread through

tweets under consideration of tweet format and content category. Event-related tweets

triggered more retweets than those expressing support, possibly due to the higher

information content found in the first group of tweets. The rich visual information content

of images might also explain why tweets with images received higher retweet numbers

than tweets with event-related keywords or hashtags. The 140-character limit in tweets

at that time allowed only so much content to be posted, and a picture seemed to be

worth more than 140 characters. Tweets with hashtags were more popular than those

with keywords related to the attacks, which could be expected because hashtags make

tweets searchable both by followers and non-followers and are links to other tweets that

contain them.

The study showed that in emergency situations like the Paris attacks, Twitter is

widely used both by journalists and non-journalists. However, tweets with images

posted by journalists received significantly more attention per follower than tweets with

images sent by other users, suggesting that journalists, through their continued work

and frequent association with larger media companies, had already built their follower

network and trustworthiness.

Different exploration methods for the geographic diffusion of tweets were chosen

for tweets with images and tweets with hashtags. For tweets with images, global

retweeting patterns were analyzed. This task necessitated, however, a complex

approach to estimate the geographic position of retweets, and was constrained by API

limitations. These technical obstacles may explain why only a few earlier studies tackled

the question of spatial information diffusion on Twitter. If quoted tweets were to add the

user’s current geolocation (instead of the position of the original tweet), this would

render the retweet map (Figure 3-5) more complete. Since such tweets would provide

additional user information, e.g. position information, and hence modify the original

tweets, by definition they would resemble quoted tweets instead of retweets. For the

diffusion of tweets with hashtags, all hashtag occurrences could be mapped,

independent of how a tweeting user learned about that hashtag. This allows for a

complete estimation of spread patterns, although it conceals details about the path that

the information traveled along from a set of seed tweets. Mapping hashtag locations

showed that hashtags in French were predominantly used in francophone territories,

e.g., France and Quebec, whereas English hashtags had a more global coverage.

Kernel density maps of English hashtag occurrences showed radial spread patterns

across the world, namely travel from Paris primarily to other metropolitan areas around

the world, and from there to smaller surrounding places.

Twitter users do not represent the general population and it is important to

emphasize that all conclusions about social behavior found in related studies apply

primarily to Twitter users and not necessarily the general population (Lansley &

Longley, 2016; Mislove et al., 2011). In this study, conclusions were driven by an even

smaller group of Twitter users, namely those who post geo-tagged tweets, adding more

to population bias (Malik et al., 2015). For example, the information level about the Paris

attacks in regions with weak phone data coverage (Cvetojevic et al., 2016) and low

Twitter penetration rates (Hawelka et al., 2014) might be underestimated

if alternative news channels (e.g. TV, radio) exist that offset the lack of Twitter

data access (Nielsen & Schrøder, 2014). These potential limitations apply at least to the

geographic analysis of information spread (e.g., retweet maps, hashtag distribution

maps, kernel density maps) and the regression model for spread analysis. As opposed

to this, the comparison of retweet numbers as well as the temporal distribution of

hashtags are expected to more closely represent the communication structure among

all Twitter users, because no explicit spatial component was involved in the

corresponding analysis procedures. In addition, the fact that a large portion of the geo-

tagged tweets used in this study had place locations instead of exact coordinates limited

the spatial resolution of the conducted spatial analysis. This posed, however, no serious

problem for a global spread analysis as conducted in this paper.

Keyword-based filtering of tweets was limited to English and French languages.

With other languages, it would be difficult to identify content relating to the attacks, and

to find volunteers who help to check the correctness of the automatic classification of

tweets into the different content categories. Besides this, the scarcity of geotagged

tweets, combined with the small percentage of tweets posted in other languages (Hong

et al., 2011), limits the spatial spread analysis to only a few languages, such as English,

Japanese, Portuguese, Indonesian, Spanish, or French. Given that pictures relating to

the attacks were selected manually and that this is a time-consuming process, only

tweets with pictures posted between the attacks and the next morning were examined,

which still amounted to around 9000 tweets for the wider Paris area. A longer time frame would

also include the pictures of the aftermath of the attacks, such as crowds and lines at the

airport due to the elevated security measures. However, tweets with such pictures might

tend to have a local rather than a global coverage since only a limited group of the

affected users would be interested in that kind of information (e.g. travel agencies, local

residents).

Generally, hashtags have proven to be a viable means of tracking

geographic information flows in Twitter. However, focusing only on the occurrence of

hashtags obscures the paths along which information flows, since hashtag analysis does not

account for follower tracking.

Figure 3-1. Bounding box (this map extent) around Paris, which was used to select

original tweets with images, hashtags, and keywords whose spread was analyzed.

Figure 3-2. Tweets with photos: A) photos of the attacks, B) artistic images expressing support shared with tweets.

Figure 3-3. Power law fitting the distribution of retweets, separated by tweet format and

content category.

Figure 3-4. Interaction between tweet type and content category on the number of

retweets.

Figure 3-5. Retweets of tweets with pictures related to the Paris attacks.

Figure 3-6. Geographic distribution of hashtags: A) #AttentatsParis, B) #fusillade (en: shooting, gunshots), C) #ParisAttacks, D) #PrayForParis.

Figure 3-7. Temporal distribution of hashtags.

Figure 3-8. Kernel density maps for the first 9 hours of #prayforparis hashtag usage

(tweet density is shown in thousand tweets per square km). A) 0-3 hours, B) 3-6 hours, C) 6-9 hours.

Figure 3-9. Distance-based clustering of Twitter places around Barcelona.

Table 3-1. Breakdown of geometry types in the analyzed dataset of tweets (wide Paris area, 13 Nov-27 Nov)

Geometry type Tweets

Place type: city 85.30%

Exact coordinates 9.58%

Place type: admin 5.12%

Table 3-2. Confusion matrix for tweet content classification

Hashtags Keywords

Events Support Events Support

Events 70.8% 2.0% 72.9% 13.7%

Support 16.9% 96.0% 6.3% 68.5%

Other 12.3% 2.0% 20.8% 17.8%

Total 100.0% 100.0% 100.0% 100.0%

Table 3-3. Popularity of tweets for different tweet formats and content categories

Events Support

Tweet format Tweet count Retweets mean (SD of the mean) Tweet count Retweets mean (SD of the mean)

Image 183 96.3 (31.3) 188 41.2 (26.0)

Hashtag 10098 9.6 (1.2) 12014 2.8 (0.5)

Keyword 15164 4.9 (0.5) 3181 2.0 (0.3)

Table 3-4. Analysis of deviance for retweets

Retweets Degrees of freedom LR Chisq P(>Chisq) Significance

Content category 1 4295.4 < 0.001 ***

Tweet format 2 4891.0 < 0.001 ***

Content type: Tweet format 2 93.3 < 0.001 ***

Signif. codes: *** p < 0.001; ** p < 0.01; * p < 0.05

Table 3-5. The interaction between tweet format and content category on retweets (P-value adjustment method: Holm)

Row Content type: Tweet format Difference Chisq P(>Chisq) Significance

1 Events:hashtag-Events:keyword 1.970 1291.563 < 0.001 ***

2 Events:hashtag-Events:picture 0.100 471.954 < 0.001 ***

3 Events:hashtag-Support:hashtag 3.401 3684.638 < 0.001 ***

4 Events:hashtag-Support:keyword 4.752 2433.635 < 0.001 ***

5 Events:hashtag-Support:picture 0.233 192.060 < 0.001 ***

6 Events:keyword-Events:picture 0.051 794.777 < 0.001 ***

7 Events:keyword-Support:hashtag 1.727 868.737 < 0.001 ***

8 Events:keyword-Support:keyword 2.412 829.422 < 0.001 ***

9 Events:keyword-Support:picture 0.118 414.868 < 0.001 ***

10 Events:picture-Support:hashtag 34.126 1107.310 < 0.001 ***

11 Events:picture-Support:keyword 47.676 1260.757 < 0.001 ***

12 Events:picture-Support:picture 2.338 32.951 < 0.001 ***

13 Support:hashtag-Support:keyword 1.397 113.445 < 0.001 ***

14 Support:hashtag-Support:picture 0.069 651.304 < 0.001 ***

15 Support:keyword-Support:picture 0.049 781.996 < 0.001 ***

Signif. codes: *** p < 0.001; ** p < 0.01; * p < 0.05

Table 3-6. Retweet statistics for tweets posted by journalists and non-journalists

Journalist Number of users Followers per user (average/median) Retweets per follower (average/median)

False 121 3480.8/377 0.16/0.001

True 48 7559.9/1554 0.26/0.003

Table 3-7. Negative binomial regression for panel data (Europe is the default continent)

Variable Coefficient Std. Err. Z value P>|z| Significance

(Intercept) 0.508 0.037 13.81 <0.001

Tweeting capital 0.585 0.050 11.67 <0.001 ***

Three hour time periods -0.095 0.001 -76.10 <0.001 ***

Number of all tweets 0.001 0.000 8.23 <0.001 ***

Number of tweets with hashtags at t-1 (L1) >0.000 0.000 6.64 <0.001 ***

Distance from capital to Paris <0.000 0.000 -4.74 <0.001 ***

Δhashtags (for capital) 0.001 0.000 55.37 <0.001 ***

Continent (the Americas) 0.092 0.061 1.51 0.132

Continent (Asia) 0.600 0.095 6.31 <0.001 ***

Continent (the Middle East) -0.112 0.143 -0.78 0.433

Number of observations 20,800

Number of groups (3h time steps) 40

Observations per group 520

Adjusted McFadden pseudo ρ² 0.144

Signif. codes: *** p < 0.001; ** p < 0.01; * p < 0.05

CHAPTER 4

MODELING INTERURBAN MENTIONING RELATIONSHIPS IN THE U.S. TWITTER NETWORK USING GEO-HASHTAGS

Study Background

The study of competition and interactions between cities has a long history

(Kresl, 1995) and has aimed at determining a hierarchy of cities by their importance in

various domains, such as finance or trade. The eminent role of a city can be derived

from the concentration of facilities, such as hospitals, schools, or universities or,

alternatively, be determined within a network of cities. In the latter approach two cities

can be considered linked if they share the headquarters of large multinational

companies, trade goods, interchange services, such as finance, accounting, law,

advertising or management, or interchange people, which, at the global scale, led to

so-called world city networks (Derudder et al., 2013; Taylor, 2001; Zook & Brunn, 2005).

More recently, information flows and exchange in telecommunication and social

networks were used to describe the role of cities in travel and communication patterns

at different scales. Especially during the last decade or so, social networks grew

substantially, some with the number of monthly active users exceeding hundreds of

millions (Twitter) or even billions (Facebook). Twitter offers free access to the public

portion of their data, which was hence analyzed to better understand user interaction

and community building within the network (Goolsby, 2010; Myers et al., 2014; Weng et

al., 2013). About 1-2% of public tweets are geo-tagged (Graham et al., 2014) and have

therefore explicit geographic information attached. This information was used to study

the role of geographic distance and national boundaries on the formation of social ties

and communities, which showed that online social networks and the underlying

real-world geography are closely related (Stephens & Poorthuis, 2014; Takhteyev et al.,

2012). Using tweets from 58 cities around the world, (Lenormand et al., 2015) found

that, based on node degree and betweenness network measures, New York and

London play a central role on the global travel scale. (Hawelka et al., 2014) identified

mobility regions by partitioning a country-to-country network of Twitter user flows at

different hierarchical levels, and (Sobolevsky et al., 2013) partitioned human population

based on the network of communication activities using country-wide data sets of

telephone calls. The analysis of inter-urban movements in China from check-in data

(i.e., a piece of geo-tagged content posted by a user) showed that communities follow

approximately province boundaries (Liu et al., 2014).

Little is known about what causes strong social network ties at a larger,

aggregate level and across city boundaries. Such analysis could lead to a better

understanding of mutual cultural, sociodemographic, or economic commonalities

between distant regions and their effect on communication. Explanations of strong ties

between regions would need to reach beyond factors that are commonly used to explain

the strength of a tie between two people, such as the frequency of their interaction or

the intensity of their emotional attachment (Koput, 2010). With the need to find

approaches to strengthen ties within a network (e.g. effectiveness of teamwork in a

company) and to better utilize intra-organizational and extra-organizational capital,

social scientists have explored if and how overall properties of the social network

structure affect the strength of social ties within the network (Fernandez et al., 2000).

For example, a stronger tie between two people is hypothesized to lead to a higher

proportion of other people tied to both of them due to factors such as time capacity

(limited time we can devote to social interaction, leading to larger group events and

hence closer ties between all involved individuals) and homophily (Granovetter, 1973).

The latter concept means that we interact socially primarily with others who share

similar interests, for example, based on demographics or location, as opposed to

heterophily, which describes the increased social interaction between individuals of

dissimilar characters. The research presented in this paper extends communication

analysis from the level of individual users to the city level and hence explains the role of

cities in the Twitter network with regard to city interactions. The goal of this study is to

explore the interurban network structure of hashtag-based mentions in the Twitter

network using network structure metrics, to model the strength of mutual city mentions

based on city covariates, and to explain some of the underlying processes leading to

this inter-urban interaction.

Related Work

Social network theory provides explanations to many questions about social

phenomena, and the analysis of community network structure remains a prime area of

network research (Borgatti et al., 2009). Social science distinguishes

between different types of dyadic relations, including similarities (e.g. sharing a

location), social relations (e.g. kinship), interactions (e.g. who talked to whom), and

flows (e.g. that of resources). The strength of a tie between people can be modeled

along various dimensions, including the amount of time shared, emotional intensity,

intimacy, or social distance, such as education level (Gilbert & Karahalios, 2009), but it

is also influenced by network topology and informal social circles (Burt, 1995).

Social network graphs often comprise communities or cliques, which are natural

divisions of network groups into densely connected subgroups (Koput, 2010). Previous

research efforts have developed algorithmic approaches to optimize the detection of

communities in networks, where the quality of partitions is often measured by the

modularity (Newman & Girvan, 2003). For example, (Blondel et al., 2008) developed a

heuristic method to optimize modularity, which was tested on social networks, citation

networks, and web networks of different scale, with up to 1 billion links. Other studies

used Latent Dirichlet allocation (LDA) for detecting communities from individual

movement data, such as GPS tracking trajectories for automobiles or geo-tagged

tweets from visitors in Florida (Kempinska et al., 2017; Valle et al., 2017). Despite the

massive amount of crowd-sourced data from social media, it is important to note that,

due to the demographic and geographic sampling bias (Duggan et al., 2015; Hawelka et

al., 2014; Longley & Adnan, 2016) as well as the small percentage of geo-tagged tweets,

the results of Twitter behavioral studies are not necessarily representative of the

general population or even of all Twitter users.
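The modularity measure mentioned above as a quality criterion for network partitions can be made concrete with a small computation. The sketch below is a minimal Python illustration for an undirected, unweighted graph; the function name and the toy graph (two triangles joined by one edge) are hypothetical and not taken from the cited studies.

```python
def modularity(edges, community_of):
    """Modularity of a partition: within-community edge share minus its expected value."""
    m = len(edges)
    internal = {}  # number of edges with both endpoints in community c
    degree = {}    # sum of node degrees in community c
    for u, v in edges:
        cu, cv = community_of[u], community_of[v]
        degree[cu] = degree.get(cu, 0) + 1
        degree[cv] = degree.get(cv, 0) + 1
        if cu == cv:
            internal[cu] = internal.get(cu, 0) + 1
    return sum(internal.get(c, 0) / m - (d / (2 * m)) ** 2 for c, d in degree.items())


# Toy graph: two triangles connected by a single bridge edge.
edges = [("a", "b"), ("b", "c"), ("a", "c"),
         ("d", "e"), ("e", "f"), ("d", "f"),
         ("c", "d")]
split = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
merged = {node: 0 for node in split}

print(modularity(edges, split))
print(modularity(edges, merged))
```

Splitting the toy graph into its two triangles yields a clearly positive value, whereas placing all nodes in one community yields zero, which is why optimization heuristics search for partitions that maximize this score.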

Increasingly complex frameworks of human connectivity define interactions

between places (Thiemann et al., 2010), and the development of new communication

systems, such as the Internet or social media, has generated new forms of social

contacts. (Kato et al., 2012) analyzed in detail favorites, follows, and mentions on

Twitter from a network structural point of view and found that their indegrees and

outdegrees exhibit a scale-free property, which means that their degree distribution

follows approximately a power law. (Weng et al., 2010) analyzed follower behavior in

Twitter and found that the presence of reciprocity can be explained by homophily. This

means that a twitterer follows a friend because of being interested in some of the topics

posted by the friend, and that vice versa the friend follows back because he or she finds

that they share similar topics of interest. The authors therefore propose “TwitterRank”,

an extension of the PageRank algorithm, to measure the influence of users in Twitter

under the consideration of the similarity of topics that users are interested in. (Kwak et

al., 2010) found that reciprocal relationships on Twitter are driven by geographic and

popularity homophily, where users with fewer than 1000 followers tend to be co-located

with their reciprocal followers of similar popularity. (Snijders, 2011) provides a detailed

overview of statistical methods for social network analysis and lists transitivity,

reciprocation and homophily as main network dependencies. The paper mentions also

the Multiple Regression Quadratic Assignment Procedure (QAP) defined by

(Krackhardt, 1988) which can be used for the exploration of nodal covariates for

modeling the strength of social ties. (McPherson et al., 2001) analyze the influence of

homophily on the formation of ties in social networks and conclude that

sociodemographic, behavioral and intrapersonal similarities divide social space and

heavily influence the formation of connections. The similarity between social network

users was found to explain more than half of the behavioral contagion (Aral et al.,

2009). Previous studies examined the structure, topology, and strength of ties also in

other types of communication networks. For example, (Onnela et al., 2007) examined

the resilience of mobile phone networks to edge removal by analyzing communication

patterns of millions of mobile phone users. The study showed that the removal of weak

links would affect the network's overall integrity more than that of strong links since

weak links connect different communities, as opposed to strong ties.

Study Setup

The study area comprised the 50 U.S. states (i.e. the contiguous U.S., Hawaii,

and Alaska) as well as Puerto Rico. Public tweets were downloaded through the Twitter

Streaming Application Programming Interface (API) and REST API, where the Python

library Tweepy was used as a client. Tweets were downloaded in JavaScript Object

Notation (JSON) format and stored in a PostgreSQL database. In order to download

all geotagged tweets and to not exceed the maximum available download bandwidth,

the world was divided into seven download regions, for which tweets were collected

between September 20 and October 20, 2016. The total number of geotagged tweets

downloaded per region for that time period together with their download share is shown

in Figure 4-1.

(Moffitt, 2014) lists three types of location information contained in tweets:

1. geotag (exact location or Twitter place)

2. the geographic location mentioned in the tweet post (including hashtag)

3. location in the user profile.

For this study, only tweets that contain both the first and second type of location

were included so that the directionality of city mentioning could be derived. More

specifically, the first type of location was used to identify out of which city the posted

tweet mentioned another city, and the second type of location was used to identify

which city was mentioned in that tweet. A Twitter hashtag is a string of characters

preceded by the hash (#) character, and is generated by users to categorize content

and to highlight topics. Therefore, for the second type of location information,

geographic locations mentioned in tweets were included only if they were part of a

hashtag. Such mentioning would clearly indicate an intended topical connection to that

city, as opposed to a more casual mentioning of the city name in a tweet. Hashtags

have been used before to observe content trends and to track topical information

propagation (Chong, 2016; Lotan et al., 2011), but not to analyze mentioning patterns

between cities.
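
The selection rule described above can be sketched in Python as follows. This is a minimal illustration and not the code used in the study. The field names (`place`, `place_type`, `full_name`, `entities`, `hashtags`, `text`) follow the Twitter API v1.1 tweet object, whereas the lookup table `GEO_HASHTAGS`, the function `mention_pairs`, and the sample tweet are hypothetical. Excluding a city that mentions itself is likewise an assumption.

```python
# Hypothetical lookup from lower-cased hashtag text to a city label.
GEO_HASHTAGS = {
    "nyc": "New York, NY",
    "newyork": "New York, NY",
    "boston": "Boston, MA",
}


def mention_pairs(tweet):
    """Return (mentioning_city, mentioned_city) pairs for one tweet dict."""
    place = tweet.get("place") or {}
    if place.get("place_type") != "city":
        return []  # first location type: the tweet must be geotagged to a city
    origin = place.get("full_name")
    pairs = []
    for tag in tweet.get("entities", {}).get("hashtags", []):
        # second location type: a city name that appears as a hashtag
        mentioned = GEO_HASHTAGS.get(tag["text"].lower())
        # Assumption: a city mentioning itself is not an inter-city tie.
        if mentioned is not None and mentioned != origin:
            pairs.append((origin, mentioned))
    return pairs


sample_tweet = {
    "place": {"place_type": "city", "full_name": "Los Angeles, CA", "country_code": "US"},
    "entities": {"hashtags": [{"text": "NYC"}, {"text": "travel"}]},
}
print(mention_pairs(sample_tweet))
```

Each returned pair is directed from the city where the tweet was posted to the city named in the hashtag, which is the directionality required for the mention graph.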

All geotagged tweets contain in the JSON structure a place information tag that

shows the country from which the tweet was posted (see lines highlighted in boldface in

Figure 4-2).

In order to limit tweets to the study region only, i.e., the United States, regions

north-west 1, north-west 2 and north-west 3 in Figure 4-1 were queried. These three

regions contained a total of 98,508,449 geo-tagged tweets, 68,218,710 out of which

were from the United States. The final selection yielded 10,493,455 tweets with

hashtags. Next, hashtags were ordered by frequency and the 1500 most

frequent hashtags were manually analyzed for city names. An earlier automated attempt

to geocode tweets through comparison between hashtags and Twitter place names,

using the Levenshtein distance, led to unsatisfactory results (e.g., due to duplicate

names or a different spatial resolution of place regions between both compared

sources) and was therefore not pursued any further. During the manual matching

process, each hashtag was verified on Google Maps and Wikipedia to ensure that it

indeed represented a city name. City names in hashtags that occurred more than once

at different locations were excluded to avoid ambiguity. This process resulted in a total

of 309 city geo-hashtags.

The geography of mentioning cities was obtained through Twitter places from

tweets that used a place type “city”, as shown in the example in Figure 4-2, or exact

coordinates combined with place type “city”. Cities are a Twitter place type that falls

between the Twitter "admin" place type and the Twitter "neighborhood" type in terms of

spatial resolution, and can only be found in selected regions around the world

(Hochmair et al., 2018), including parts of the U.S., Europe, Canada, Brazil, India, or

Japan. Figure 4-3 compares the spatial layout of originating cities (Twitter places

visualized as green polygons) and that of mentioned cities (Kernel density heat map). It

shows that heavily populated metropolitan areas, such as New York, the San Francisco

Bay, Philadelphia, Washington D.C., or Dallas have the highest density of places

mentioned in hashtags. The Kernel density visualization was used to show the spatial

distribution of mentioned cities (per km²). It should be noted that actual locations of

mentioned cities are typically more dense in the center of the Kernel density peaks, but

they do exist on the fringes as well.

Next, since the same city could be mentioned in a hashtag but also be the

location of the geo-tagged tweet (i.e., the mentioning city), the final stage of the data

preparation included the assignment of cities from both data sources (hashtags, place

type) into a common geographic scheme, namely the U.S. Census Metropolitan and

Micropolitan Areas. To assign a city to a Census Metropolitan or Micropolitan Area the

centroid of a city bounding box of Twitter places (compare Figure 4-3) was used. This

was done automatically for the city place type in tweets, whereas the cities for the 313

geo-hashtags were first manually geocoded and then fit inside the nearest Census

Metropolitan or Micropolitan area. The union of mentioned and mentioning cities

resulted in a total of 432 cities across the U.S. A few more conditions were used to

ensure that bot or spam tweets were excluded from the data set. First, only tweets

from mobile devices were retained by filtering on the source of the

tweet. Then, to filter out bots, the “botometer” API was applied (Varol et al., 2017).

Additionally, several other users who used more than three hashtags per

tweet were removed. The last step was necessary since some users with politically motivated tweets

had a high number of geo-hashtags in every tweet and were thus biasing the outdegree

of cities.

Analyzing the Network Structure of Mentions

Social networks are often modeled as graphs. Therefore, measures of graph

structure are important to understand the role of different network components (e.g.

nodes, links) and actors in the network. A comprehensive review of measures relating to

the organization of a social network and the interaction between actors can be found in

the literature (Barthélemy, 2011; Boccaletti et al., 2006; Koput, 2010; Snijders, 2011).

This section reviews concepts of social network analysis which are used in the modeling

of inter-city hashtag mentions, including centrality, node degree, and reciprocity.

Graph Generation

As a basis for subsequent social network analysis a directed, weighted graph

was created. Cities were abstracted as nodes, and mentions of cities in tweet hashtags

as edges. The edge weight was the number of times a tweet in city A mentioned city B

in a hashtag. For graph analysis and visualization the R package igraph was used

(Csardi & Tamas, 2006). As an example, Figure 4-4 shows a sub-graph comprised of

33 cities that have an indegree higher than 30 in a layout proposed by (Adai et al.,

2004). Line width corresponds to the number of directed mentions. The closeness of the

nodes is proportionate to the weight of the links between them. Hence, the layout does

not resemble geographic proximities, but rather proximities in the social network space.

The entire resulting network had the following dyad census:

Mutual links (the number of pairs of cities with mutual mentions): 307

Asymmetric links (the number of pairs of cities with one-way mentions): 1,527

Null links (the number of pairs of cities with no mentions between them): 91,262
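
As an illustration of how such a dyad census can be derived from a directed, weighted edge list, the following Python sketch counts mutual, asymmetric and null pairs. The function name `dyad_census` and the toy graph are hypothetical; the study itself used the igraph R package. The final line checks that the three reported counts add up to the number of unordered pairs among 432 cities, i.e. 432 · 431 / 2 = 93,096.

```python
from itertools import combinations


def dyad_census(nodes, weights):
    """Count mutual, asymmetric and null dyads in a directed graph.

    weights maps (source, target) to the number of mentions (> 0).
    """
    mutual = asymmetric = null = 0
    for a, b in combinations(sorted(nodes), 2):
        forward = weights.get((a, b), 0) > 0
        backward = weights.get((b, a), 0) > 0
        if forward and backward:
            mutual += 1
        elif forward or backward:
            asymmetric += 1
        else:
            null += 1
    return mutual, asymmetric, null


# Toy graph: A and B mention each other, A mentions C, D is isolated.
toy_nodes = ["A", "B", "C", "D"]
toy_weights = {("A", "B"): 3, ("B", "A"): 1, ("A", "C"): 2}
print(dyad_census(toy_nodes, toy_weights))  # (1, 1, 4)

# Consistency check of the reported census against 432 cities.
print(307 + 1527 + 91262 == 432 * 431 // 2)  # True
```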

The Distance Between Mentioning Cities

To calculate mention distances between two cities (where city A mentions city B

in a tweet with a geo-hashtag) all distances between cities were counted as often as city

A mentioned city B. The mean distance of a mention was 1293 km and the median was

834 km (the latter corresponding to the approximate distance between San Diego and

San Francisco). These mention distances are notably smaller than the

unweighted distances between all possible city pairs in the city mention graph (with

mean = 1423 km, median = 1070 km). This suggests that mentions tend to take place in localized

and regional clusters.

Figure 4-5 shows the distribution of distances between all pairs of cities (blue

histogram) and the distribution of distances of mentions between cities. The pronounced

peak of the weighted distance distribution in the two smallest distance bins (yellow)

compared to the shape of the blue histogram suggests that mentions between cities are

more common at shorter distances than the corresponding geographic layout of cities

would suggest.
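
The difference between mention-weighted and unweighted distances can be illustrated with a short Python sketch. It is a simplified illustration under stated assumptions: the haversine formula on a sphere of radius 6371 km stands in for the geodesic distance, and the function names and toy numbers are hypothetical.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance in km on a sphere (approximates geodesic distance)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))


def weighted_mean(values, weights):
    """Mean in which each distance counts as often as its mention count."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def weighted_median(values, weights):
    """Smallest value at which the cumulative weight reaches half the total."""
    half = sum(weights) / 2
    cumulative = 0
    for value, weight in sorted(zip(values, weights)):
        cumulative += weight
        if cumulative >= half:
            return value


# Toy example: three city pairs with their distances (km) and mention counts.
distances = [100, 1000, 3000]
mentions = [8, 1, 1]
print(weighted_mean(distances, mentions))    # 480.0
print(weighted_median(distances, mentions))  # 100
print(sum(distances) / len(distances))       # unweighted mean, about 1366.7
```

In the toy example most mentions occur over the shortest distance, so the weighted mean and median fall well below the unweighted mean, which mirrors the pattern reported for the mention graph.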

Node Degree

In an undirected graph, the degree of a node is the number of links incident to

that node. In a directed graph, the indegree id(n) and outdegree od(n) of a node are the

numbers of incoming and outgoing edges, respectively, and the degree of a node deg(n) is

the sum of its id(n) and od(n) (Sporns, 2002). The concept of node degree has been

extended to weighted networks, where the weighted in- and outdegrees consider the

sum of weights of incoming or outgoing edges and hence measure the strength of

nodes in terms of the total weight of their connections (Barrat et al., 2004). A weighted

node degree is also referred to as node strength. Node strength is a commonly used

measure for the analysis of weighted networks (Barrat et al., 2004; Opsahl et al.,

2010). Therefore, weighted degree or strength of nodes will be used in the subsequent

analyses of this study.

Table 4-1 shows the weighted indegree and outdegree of cities in the U.S.-wide

mention graph. The outdegree denotes the number of times other cities are mentioned

in hashtags of tweets posted in that city whereas the indegree of a city denotes the

number of times tweets posted from other cities mention that city in a hashtag.
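
A minimal Python sketch of the weighted in- and outdegree (node strength) computation is given below. The function name `node_strength` is hypothetical, and the three edges are a toy example that reuses mention counts quoted later in this chapter; they are not the full edge list of the study.

```python
from collections import defaultdict


def node_strength(weights):
    """Return (weighted indegree, weighted outdegree) per node.

    weights maps (mentioning_city, mentioned_city) to a mention count.
    """
    in_strength = defaultdict(int)
    out_strength = defaultdict(int)
    for (source, target), count in weights.items():
        out_strength[source] += count  # mentions sent by the source city
        in_strength[target] += count   # mentions received by the target city
    return dict(in_strength), dict(out_strength)


toy_edges = {
    ("Los Angeles", "New York"): 63,
    ("Los Angeles", "Las Vegas"): 55,
    ("New York", "Los Angeles"): 41,
}
indegree, outdegree = node_strength(toy_edges)
print(indegree)   # {'New York': 63, 'Las Vegas': 55, 'Los Angeles': 41}
print(outdegree)  # {'Los Angeles': 118, 'New York': 41}
```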

Table 4-1 A) shows that New York gets most mentions from other cities (660),

making it the most prominent city in this regard, followed by Atlanta (352), Los Angeles

(349) and Boston (303). Table 4-1 B) shows that New York and Los Angeles mention

other cities the highest number of times, which could be attributed to the fact that they are

the largest and second largest cities in the U.S. by population. The steep decline in

weighted indegree and outdegree suggests a right-skewed distribution for both

variables.

The frequency of indegree and outdegree was fitted to a power law distribution

(Figure 4-6), where a linear regression with a simple logarithmic binning was used

(White et al., 2008). The R-squared was found to be 0.80 and 0.92 for incoming and

outgoing mentions, respectively. This case is typical for scale-free networks. It has been

shown that node strength follows a power-law distribution in scale-free networks (Tan &

Lei, 2013; Watts & Strogatz, 1998), which is also demonstrated for the network of city

mentions in this study. Also, numerous real-world networks have this topology (Wang &

Chen, 2003).
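
The fitting procedure can be sketched as follows: node strengths are counted in logarithmically spaced bins, counts are normalized by bin width, and a straight line is fitted to the log-log values by ordinary least squares. This is an illustrative Python sketch only; the bin base of 2, the use of the geometric bin centre, and the function name `log_binned_fit` are assumptions, and the synthetic input is constructed so that the slope is exactly -3.

```python
import math


def log_binned_fit(values, base=2.0):
    """Fit log10(density) = intercept + slope * log10(x) on logarithmic bins.

    values must be >= 1. Returns (slope, intercept, r_squared).
    """
    edges = [1.0]
    while edges[-1] <= max(values):
        edges.append(edges[-1] * base)
    xs, ys = [], []
    for lo, hi in zip(edges, edges[1:]):
        count = sum(1 for v in values if lo <= v < hi)
        if count == 0:
            continue  # empty bins cannot be log-transformed
        density = count / (hi - lo) / len(values)  # normalize by bin width
        xs.append(math.log10(math.sqrt(lo * hi)))  # geometric bin centre
        ys.append(math.log10(density))
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return slope, intercept, 1 - ss_res / ss_tot


# Synthetic strengths: many weakly connected nodes, few strongly connected ones.
toy_strengths = [1] * 64 + [2] * 16 + [4] * 4 + [8] * 1
slope, intercept, r_squared = log_binned_fit(toy_strengths)
print(round(slope, 3), round(r_squared, 3))
```

A slope that is clearly negative together with a high R-squared on the log-log scale is what would be read as consistent with a power-law distribution.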

Network Centrality Measures

Network centrality measures are commonly used to identify influential nodes in a

network. This is because a central actor is given the potential power to influence

information flows in such a way as to serve the actor’s interests (Freeman, 1977).

Different types of network centrality have been proposed using measures, such as

topology (neighborhood relationships), flows, or network distances (Barthélemy, 2011).

Some prominent examples include degree centrality (or strength centrality in the case of

weighted networks), Eigenvector centrality and its variant PageRank centrality,

Kleinberg hub and authority centrality scores, or betweenness and closeness centrality.

Since betweenness centrality is not a suitable approach for weighted networks

(Dekker, 2008), we will compare some other centrality measures for the analyzed

mention network. Node strength, measured as the sum of mentions for the in- and

outdegree for a city, denoting the weighted in- and outdegree, is the first presented

centrality measure (Table 4-1).

Other computed weighted centrality measures were degree centrality, closeness

centrality, Eigenvector centrality and PageRank centrality (McCulloh, 2010), using the

igraph R package. The measures are standardized by dividing them by the highest

possible value, that is 1/(N-1) where N is the number of vertices in the graph. In the

1990s, the concepts of hubs and authorities were used to analyze the information

organization in hyperlinked networks (Kleinberg, 1999). Authoritative Web pages are

those that contain relevant information for questions posted on a specific search topic.

In the context of spatial social media networks, authorities can be thought of as

geographic locations that are frequently mentioned in tweets. A hub in hyperlinked

networks is a page that points to many good authorities. Again, in the context of social

media networks a hub could denote a city that posts frequently about other important

cities. Hub and authority scores on the graph of mentions were computed using the

igraph R package. Table 4-2 shows the Pearson correlations between some of these

measures, which are all significant at the 5% level. The bivariate correlations between

weighted degree centrality and Eigenvector centrality are close to one, meaning that the

latter (and more complex) centrality measure gives similar score rankings for cities as

the weighted degree centrality, which is simpler to understand.

Since closeness centrality shows how close a node is to other nodes in a

network, information from a node with high closeness would diffuse through the network

the fastest (McCulloh, 2010). For the analyzed mention network, closeness centrality

gives a similar score for most cities. Nashville, TN has the highest closeness centrality

(0.449) and Oakland, CA has the lowest closeness centrality (0.390). Table 4-3 shows

the cities ranked by their Kleinberg hub and authority scores. The highest ranked

“authority” is New York City with an authority score of 1. Las Vegas, Atlanta, Los

Angeles and Washington, D.C. follow with authority scores 0.490, 0.448, 0.436 and

0.387, respectively, showing that there is a wide range of authority values among

analyzed cities.

Since New York users mention Los Angeles and Washington, D.C. only 41 times

each, but Twitter users from Los Angeles mention New York City 63 times, Las Vegas

55 times and Atlanta and Chicago 25 times each, Los Angeles has a high hub score.

We therefore conclude that Twitter users from Los Angeles tend to mention popular

cities more than users in other cities of the United States and that New York is the most

popular city in the country.
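
The idea behind hub and authority scores can be illustrated with a short power-iteration sketch in Python. It is a simplified stand-in for the igraph routines used in the study: the function name `hits`, the fixed number of iterations, and the toy graph are assumptions, and scores are scaled so that the largest value equals 1.

```python
def hits(weights, iterations=100):
    """Weighted hub and authority scores by power iteration (max scaled to 1).

    weights maps (mentioning_city, mentioned_city) to a mention count.
    """
    nodes = sorted({node for edge in weights for node in edge})
    hub = {node: 1.0 for node in nodes}
    authority = {node: 1.0 for node in nodes}
    for _ in range(iterations):
        # A city is a good authority if good hubs mention it often.
        new_auth = {node: 0.0 for node in nodes}
        for (source, target), count in weights.items():
            new_auth[target] += count * hub[source]
        top = max(new_auth.values())
        authority = {node: value / top for node, value in new_auth.items()}
        # A city is a good hub if it often mentions good authorities.
        new_hub = {node: 0.0 for node in nodes}
        for (source, target), count in weights.items():
            new_hub[source] += count * authority[target]
        top = max(new_hub.values())
        hub = {node: value / top for node, value in new_hub.items()}
    return hub, authority


# Toy graph: A mentions X three times and Y once, B mentions X once.
toy_mentions = {("A", "X"): 3, ("A", "Y"): 1, ("B", "X"): 1}
hub_scores, authority_scores = hits(toy_mentions)
print(max(authority_scores, key=authority_scores.get))  # X
print(max(hub_scores, key=hub_scores.get))              # A
```

In the toy graph, X is the top authority because it is mentioned by both hubs, and A is the top hub because it mentions the top authority most often.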

Reciprocity And Connectance

Reciprocity is the proportion of reciprocated links. For the entire graph of 432

cities, 28.3% of links are mutual (Snijders, 2011). Connectance is a global topological

measure that is computed as the number of existing links divided by the squared

number of nodes in the observed network (Dunne et al., 2002), hence it is the fraction of

all possible links that are realized in a network. Table 4-4 shows reciprocity and

connectance values for the U.S. states with more than five cities used in the analyzed

network graph.
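
Both measures can be written down in a few lines of Python. The sketch below is illustrative: the function names are hypothetical, and reciprocity is computed here as the share of directed links whose reverse link also exists, which is only one of several definitions in use. The last lines apply the connectance formula to the dyad census reported earlier, assuming that each of the 307 mutual pairs contributes two directed links among 432 cities.

```python
def reciprocity(weights):
    """Share of directed links whose reverse link also exists."""
    links = [edge for edge, count in weights.items() if count > 0]
    reciprocated = sum(1 for (a, b) in links if weights.get((b, a), 0) > 0)
    return reciprocated / len(links)


def connectance(n_nodes, n_links):
    """Number of existing directed links divided by the squared number of nodes."""
    return n_links / n_nodes ** 2


# Toy graph with three cities and three directed links.
toy_links = {("A", "B"): 3, ("B", "A"): 1, ("A", "C"): 2}
print(round(reciprocity(toy_links), 3))                # 0.667
print(round(connectance(3, len(toy_links)), 3))        # 0.333

# Directed links implied by the reported dyad census.
reported_links = 2 * 307 + 1527
print(round(connectance(432, reported_links), 3))      # 0.011
```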

Colorado (numbers in bold) has the highest reciprocity in mentions between

cities and the highest connectance. A schematic figure of mention patterns for Colorado

is shown in Figure 4-7. When considering the complete graph with all analyzed U.S.

cities, the connectance is much lower with a value of 0.011. Hence, as expected, cities

located within a state are better connected than cities across the entire country.

(Kwak et al., 2010) found that only 22.1% of Twitter users have a reciprocal

relationship in terms of follower behavior. As opposed to this, for the entire network of

U.S. cities, the percentage of cities that reciprocate mentions by at least one tweet is

higher (28.3%). The correlation between reciprocity and connectance at the state level

is 0.61 (p = 0.012).

Sentiment Analysis

Tweets convey textual information that can be quantified by its sentiment. In the

context of this work, it is of interest to see if the average sentiment score of tweets

associated with a city is related to communication tie variables. Text processing of

entire tweet posts (text and hashtags) was run for tweets that use city hashtags using

the “text2vec” R package (Selivanov, 2016), which implements the method in (Bryl,

2017). This approach uses a machine learning classifier that is trained on the

Sentiment140 corpus of 1.6 million tweets that was labeled using emoticons (Go et al., 2009). This

labeled dataset was divided into training and testing subsets in an 80:20 ratio. The

following text processing procedure was applied to the training subset. The vocabulary,

which is a list of all words found in the analyzed text documents (tweets

in this case), was cleaned of stop words. Furthermore, a Document-Term Matrix (DTM)

was created and a term frequency-inverse document frequency (TF-IDF) model was

applied to the DTM. Next, a generalized linear model (GLM) classifier was trained using the “glmnet”

R package (Friedman et al., 2010), with the TF-IDF transformed Document-Term Matrix as

the independent and the existing sentiment label as the dependent variable. Then, the trained

classifier was tested against the testing subset of tweets, and the training set showed an

area under the curve (AUC) of 0.875, which is generally considered good

(Vidya et al., 2015). Finally, the trained GLM classifier was used to classify the

sentiment of the tweets used in this study.
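
To make the TF-IDF step more concrete, the following Python sketch builds a small TF-IDF weighted document-term representation after stop-word removal. It uses the textbook weighting (relative term frequency multiplied by the logarithm of the inverse document frequency), which is an assumption and may differ in detail from the weighting applied by the R package used in the study. The stop-word list, function names and example texts are hypothetical, and the classifier itself is not reproduced here.

```python
import math
from collections import Counter

# Hypothetical, very short stop-word list for illustration only.
STOP_WORDS = {"the", "a", "is", "in"}


def tokenize(text):
    """Lower-case, split on whitespace and drop stop words."""
    return [token for token in text.lower().split() if token not in STOP_WORDS]


def tfidf(documents):
    """Return one {term: tf-idf weight} dict per document."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        rows.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return rows


toy_tweets = [
    "great game in boston",
    "boston traffic is terrible",
    "great weather",
]
rows = tfidf(toy_tweets)
print(rows[0])
```

In the first toy tweet, the term "game" receives a higher weight than "great" or "boston" because it occurs in only one document, which is the effect the TF-IDF transformation is intended to achieve before the classifier is trained.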

For the analysis, only tweets in English were used, based on the language

metadata setting of every tweet. Each tweet receives a probability value of having a

positive sentiment between zero and one. Based on this, the weighted average

sentiment score was calculated for each city’s incoming tweets, where only cities

with more than 30 incoming tweets were used for the analysis to reduce data noise. A

total of 30 cities remained after this step. For the interpretation of mean values, cities

with notably high or low values were reviewed in more detail by looking at the context of

tweets associated with a city.

The average incoming tweet sentiment ranged from 0.486 for Tulsa, OK, to

0.644 for Portland, OR. Many Twitter users that were tweeting about Tulsa did so in the

context of the Black Lives Matter movement. This topic was also frequently mentioned

in tweets about Charlotte, NC, which had a mean sentiment of 0.513. While geo-hashtags

and sentiment analysis detected actual events in these cities, for Roanoke,

VA, which received a low average sentiment score of 0.560, many tweets were about a

fictional event in the television series American Horror Story: Roanoke.

In contrast, Cleveland, OH, received tweets with a high

average sentiment value of 0.641. Hashtags often used together with #Cleveland were

#Windians, #RallyTogether, #Indians, which are related to the baseball team Cleveland

Indians. Therefore, high sentiment values can be indicative of sporting events. The

same was observed for Boston, MA, which earned a high average sentiment of 0.643

where #redsox, #RedSox (a baseball team from Boston) and #travel, #fall, #igboston

(all travel related) often occurred. Figure 4-8 shows the most frequently used words in

tweets with geo-hashtags of these cities, reflecting some of these topics. Furthermore,

Los Angeles (0.596), New York City (0.606), Chicago (0.605) and Atlanta (0.618)

received tweets with sentiment of similar magnitude, although tweets about sports events were not

predominant for these cities. Initial word clouds about New York showed that a Comic

Con was a commonly mentioned topic because the frequently used hashtag #nycc was

associated with that event. This means that #nycc was a dominant hashtag used with

New York geohashtags such as #nyc, #NYC, #NewYork, etc. To avoid other

words being masked by that event, this topic was removed from the word cloud in

Figure 4-8 F). Furthermore, in all word clouds spatial locations were removed as well,

e.g. words like Manhattan and Brooklyn, to obtain thematic topics instead. We can

conclude that sentiment analysis is able to detect events in smaller cities.

Next, the average sentiment score of a mentioned city was related to the

distance between the mentioning city and the mentioned city, where the distance variable

was transformed with the natural logarithm. Figure 4-9 A) and B) show the mean

sentiment of tweets for inter-city mentions for a total of 10 and 52 pairs of cities,

respectively, where each involved city pair had more than 30 and 10 tweets,

respectively. The thresholds of 30 and 10 were selected as a minimum sample size for

calculation of the mean sentiment. The plotted data points include New York, Los

Angeles, San Francisco, Detroit, Dallas, Washington D.C. and some of their

surrounding smaller cities for Figure 4-9 A). In Figure 4-9 B) only pairs of cities with

more than 25 mentions are annotated to avoid clutter.

The negative slope of the regression line for both subgraphs in Figure 4-9

indicates a general decrease of the mean sentiment score in tweets with the distance

between the mentioned and the mentioning city. Based on results from these cities, we

conclude that Twitter users tend to express more positive sentiment toward nearby cities.

Homophily and Heterophily

Further analysis was conducted to explore the processes underlying the inter-city

communication ties. More specifically, this section is concerned with individual

characteristics of cities (and city pairs) that drive homophily or heterophily. For this

purpose, one needs to examine the similarity or dissimilarity of individual characteristics

in city pairs that have higher mutual mentions compared to city pairs with fewer mutual

mentions. This will be achieved through regressing relational data on observed mention

data, using the Quadratic Assignment Procedure (QAP) regression (Krackhardt, 1987).

Data Preparation

City characteristics (nodal covariates)

Various city characteristics are expected to influence how frequently a city is

mentioned in hashtags and how strong mutual communication ties between city pairs

will be. The following attributes at the city level were used for the QAP regression model

to predict the tie strength (measured by mutual mentions) between cities:

Demographics. City population and number of housing units were aggregated from 2010 Census block data obtained from (Census, 2014).

Airports. The total number of passengers boarding in a city was derived from the commercial airports within the city area. Boarding numbers were obtained from (FAA, 2010).

Schools. The number of students enrolled in schools per city was compiled for post-secondary education facilities from the Homeland Infrastructure Foundation-Level data for the 2014-2015 school year. Types of schools include, among others, Doctoral/Research Universities, Masters Colleges and Universities, Baccalaureate Colleges, Associates Colleges, Theological seminaries, and Medical Schools.

Occupation employment data. Occupational Employment Statistics for 2016 were obtained from the Bureau of Labor Statistics (BLS) at the city level. More specifically, BLS uses revised metropolitan area divisions (see https://www.bls.gov/oes/current/msa_def.htm). For each city the corresponding division could be matched with a U.S. Census Metropolitan or Micropolitan Area, except for Boston, where three Metropolitan Divisions had to be joined into a Metropolitan area. BLS employment data is subdivided into 22 broad and 1371 specific occupational categories. This study uses all broad and a few specific occupation categories as they relate to tourism and real estate development. Table 4-5 lists the occupations that were used as city covariates, where rows 1-22 are broad BLS occupational categories, rows 23-25 in boldface are specifically chosen BLS occupational categories, and occupations marked with a * were not used since they were not reported for each city. For each category the number of employees per city was divided by the total number of employees across all categories in that city and then multiplied by 1000. For further processing cities from the occupation data table had to be manually matched to the U.S. Census Metropolitan and Micropolitan Areas as the naming conventions were different. Aside from the hypothesized occupation predictors, a number of other occupations included in the model were of exploratory nature.

Exploration of occupation variables provided some expected patterns. For

example, Las Vegas has a high number of employees per 1000 who work in hotels (4.1

compared to a mean employment of 2.1 across all cities). Ithaca, NY (Cornell

University) with 164.5, Gainesville, FL (University of Florida) with 129.2, Merced, CA

(University of California) with 128.2, and Champaign-Urbana, IL (University of Illinois)

with 125.2 all have a high number of employees per 1000 in the workforce working in the education

sector compared to the mean of 65.2 across all analyzed cities. Out of 432 cities

mentioned or mentioning in hashtags, 316 could be matched to regions in the BLS

tables. The unmatched cities were excluded from the QAP regression since they were

small and had only a few incoming or outgoing mentions.

Dissimilarity and similarity matrices

In the QAP regression an adjacency matrix for a social relation (in our case, the

number of mentions in tweet hashtags) is the dependent variable, whereas a set of

dyadic attribute similarity or dissimilarity matrices represents the independent variables.

Therefore, all individual level data of cities need to be transformed to dyadic measures

of similarity or dissimilarity that can be regressed on social relations. Individual

characteristics can be subdivided into attributes (numerical or categorical) and

affiliations. A set of rules for transforming individual level data to dyadic measures is

provided in (Koput, 2010). Individual level data in our dataset consist of single item

attributes. The match rule for categorical variables states that if two agents match in

terms of the category (e.g. being located in the same state) a 1 needs to be placed in a

cell for conversion to dyadic, otherwise 0. This approach results in a similarity matrix.

The state variable was the only measure with a similarity matrix in our dataset. The

absolute difference rule converts individual numerical data to dyadic by computing the

absolute difference between both agents in the corresponding cell, giving a dissimilarity

matrix.

Demographic and occupational variables were numeric. Hence, the dissimilarity

matrices were computed as an absolute difference between values for a pair of cities for

most variables. One specific case is the distance dissimilarity matrix which was

populated with the geodesic distances between pairs of cities in kilometers. To calculate

this distance the PostGIS function ST_Distance was used. All dyadic independent

variables (except for state) and the dependent variable were log transformed, similarly

to (Zahn, 1991).
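
The two conversion rules can be sketched in Python as follows. The function names and the three-city example are hypothetical, and setting the diagonal of the match matrix to zero is an assumption made here because a city is not compared with itself.

```python
def match_matrix(categories):
    """Match rule: 1 if two different cities share the category, otherwise 0."""
    n = len(categories)
    return [
        [1 if i != j and categories[i] == categories[j] else 0 for j in range(n)]
        for i in range(n)
    ]


def abs_diff_matrix(values):
    """Absolute difference rule: |value_i - value_j| for every pair of cities."""
    n = len(values)
    return [[abs(values[i] - values[j]) for j in range(n)] for i in range(n)]


# Toy data for three cities: state membership and population (in thousands).
states = ["FL", "FL", "NY"]
population = [100, 250, 900]

state_similarity = match_matrix(states)
population_dissimilarity = abs_diff_matrix(population)
print(state_similarity)          # [[0, 1, 0], [1, 0, 0], [0, 0, 0]]
print(population_dissimilarity)  # [[0, 150, 800], [150, 0, 650], [800, 650, 0]]
```

The first matrix is a similarity matrix (larger values mean more alike), whereas the second is a dissimilarity matrix (larger values mean less alike), which matters for the interpretation of coefficient signs in the regression.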

Network Regression

QAP correlations were calculated using the UCINET 6 software package

(Borgatti et al., 2002). The highest correlations between the matrix of city mentions and

matrices of absolute differences in city attributes were found for airports (0.1, p = 0.001),

jobs (0.128, p = 0.001), population (0.131, p = 0.001) and schools (0.120, p = 0.001).

A multiple regression QAP (MRQAP) model (Dekker et al.,

2003) was used to model how the number of mentions between cities depends on

demographic and occupation variables, as well as airport passenger numbers in cities,

the distance between cities and state boundaries. MRQAP is unbiased under

multicollinearity conditions. Therefore, potential QAP correlations between independent

variable matrices were not examined. The regression itself was done using the R

package SNA (Butts, 2016), closely following the method in (McFarland et al., 2010).

OLS regression could not be used to predict network ties since observations are

correlated due to using mentions from the same city or of the same city. The MRQAP

regression does not calculate the standard error to determine statistical significance, but

instead randomly shuffles rows and columns of the matrix representing the dependent

variable. For each model in this study, 2000 such permutations were used, as

suggested by (Cook, 2012). OLS regression coefficients are then calculated from the

permuted matrices.
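
The permutation logic can be illustrated with a deliberately simplified Python sketch. It covers only the bivariate case (a QAP correlation between one dependent and one independent matrix) with a plain permutation of the dependent matrix; MRQAP implementations fit a multiple regression and may use more elaborate permutation schemes. All function names and the five-node toy matrices are hypothetical.

```python
import math
import random


def offdiag(matrix):
    """Flatten a square matrix, skipping the diagonal."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(n) if i != j]


def pearson(x, y):
    """Pearson correlation of two equally long lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def qap_correlation(y, x, n_perm=2000, seed=0):
    """Observed correlation and two-sided permutation p-value."""
    rng = random.Random(seed)
    n = len(y)
    observed = pearson(offdiag(y), offdiag(x))
    extreme = 0
    for _ in range(n_perm):
        order = list(range(n))
        rng.shuffle(order)
        # Rows and columns are permuted together so that each node keeps
        # its own row and column; this preserves the dyadic dependence.
        y_perm = [[y[order[i]][order[j]] for j in range(n)] for i in range(n)]
        r = pearson(offdiag(y_perm), offdiag(x))
        if abs(r) >= abs(observed) - 1e-12:
            extreme += 1
    return observed, extreme / n_perm


# Toy data: five cities on a line; mentions decrease linearly with distance.
positions = [0, 1, 3, 7, 15]
distance = [[abs(a - b) for b in positions] for a in positions]
mentions = [[0 if i == j else 20 - distance[i][j] for j in range(5)] for i in range(5)]
observed_r, p_value = qap_correlation(mentions, distance)
print(round(observed_r, 3), p_value)
```

Because the significance level is derived from the share of permutations that yield a statistic at least as extreme as the observed one, no standard error is needed.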

Results of MRQAP regression provide coefficients (with their levels of

significance) describing the slope of a linear relationship between independent variables

and the dependent social relation (Koput, 2010). Four combinations of matrix type and

coefficient sign can occur in the regression results. If the independent variable is a

similarity matrix a positive coefficient indicates that greater similarity contributes to a

stronger tie. A negative coefficient for an independent variable that is coded as a

dissimilarity matrix would indicate that greater dissimilarity makes the tie less likely, or,

expressed differently, that greater similarity makes the tie more likely. Therefore, both

cases provide evidence for homophily. Heterophily is present for the remaining two

cases, i.e. where the independent variable is a similarity matrix and the coefficient is

negative, or where the independent variable is a dissimilarity matrix and the coefficient

is positive. Regression results will be interpreted with respect to these four cases.

Estimated results for three different regression models are reported in Table 4-6

where only the arithmetic sign of coefficients (but not their magnitude) and their level of

significance are shown. The three models include subgraphs of cities that have a

minimum indegree of 10, 30 and 50 respectively and therefore take a more prominent

role in the network compared to the remaining (excluded) cities. Subgraphs were

analyzed since the model for the entire graph explained only a very small percentage of

the variation in the dependent variable, namely 4%. Only variables that are significant in

at least one of the four models are reported in the table. For variables coded as a dissimilarity matrix, a *, **, or *** in

the left (negative) column of a model denotes homophily, whereas

asterisks in the right (positive) column reveal heterophily. For the only variable

that uses a similarity matrix (the state variable), asterisks in the right (positive) column instead

indicate homophily.

Results in Table 4-6 show that the R-squared increases with more stringent city

filters and is highest for model 3. This can be explained by a larger number of

mentions across the participating cities and hence less noise that comes from random

communication ties. Three variables (highlighted in boldface) are significant in all three

models. A few regression outcomes will be looked at in more detail. The positive and

significant population variable means that city pairs that differ strongly in population

are more likely to form connections. This could come from the fact that big cities get many

mentions from smaller neighbors.

The negative sign for distance means that cities that are farther apart will be less

likely to form connections. Hence, as expected, closer proximity between cities has a

positive effect on the formation of the mention ties. For the first three models, cities

located within a state have stronger ties than across state boundaries, showing

evidence of homophily for this variable and confirming findings from section 4.4 that

cities within states are better connected. This also demonstrates a close relation between

online social networks and the underlying ‘real world’ geography (Stephens & Poorthuis,

2014).

In addition, the school enrollment variable indicates heterophily, which could

suggest that, similar to the population variable, cities with smaller student populations are

those predominantly tweeting about cities with larger educational institutions.

The number of workers in building and grounds cleaning related occupations

expresses heterophily in all three models. The number of workers in this field per 1000 employees is very

high for tourist cities, such as Las Vegas (63.7) or Orlando (49.2), but lower for others,

such as Charlotte, Atlanta and Los Angeles (values of around 25), although those cities

receive many mentions. This variable is also relatively high for Ithaca, NY (~44) which is

home to Cornell University. It seems that this variable indicates tourist cities, although

it is only moderately correlated with hotels (Pearson R² = 0.538). This occupation

comprises mostly janitors (~50%) and landscaping workers (~20%). This could indicate

that cities that lead in some other respect (e.g. tourism or education), which also requires

a large workforce in building maintenance, tend to receive many mentions from cities that are less

prominent in this respect.

Discussion And Conclusions

This study analyzes the prominence of cities as well as the interaction between

cities using hashtag mentions in tweets. It therefore expands on more traditional measures

of city prominence (e.g. presence of prominent corporations) or city ties (e.g. commodity

flows). In addition, it explains processes leading to stronger or weaker ties between

cities using the Quadratic Assignment Procedure.

New York City is the most popular city in terms of the number of

mentions; many tweets that use prominent hashtags, such as #newyorkcity, #nyc

and #newyork, are related to travel, Central Park, photography and fashion. City

popularity can also have a dynamic component, that is, one based on events of limited duration.

Examples are hashtags #KeithLamontScott, #BlackLivesMatter and #CharlotteProtest

that are often used together with the city hashtag #Charlotte. The countrywide network

of city mentions exhibits a scale-free topology, which means that only a few cities are

connected to many others, whereas the majority of cities are connected to only a few

others. Similarly, incoming and outgoing mentions closely follow a power-law

distribution, giving the network a scale-free property with respect to city prominence. At

the state level, cities are best connected within Colorado, which also shows the

highest reciprocity of mentions among the analyzed U.S. states.
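As an illustration of how a power-law exponent can be estimated from a degree sequence, the sketch below uses the standard continuous maximum-likelihood estimator, alpha = 1 + n / sum(ln(x_i / x_min)). This is an assumption for illustration only; the fitting procedure actually used for Figure 4-6 is not specified in this section, and the name `power_law_alpha` is hypothetical.

```python
import math


def power_law_alpha(values, x_min):
    """Continuous maximum-likelihood estimate of a power-law exponent.

    Only values at or above x_min (the start of the power-law tail)
    enter the estimate.
    """
    tail = [v for v in values if v >= x_min]
    if not tail:
        raise ValueError("no values at or above x_min")
    log_sum = sum(math.log(v / x_min) for v in tail)
    if log_sum == 0:
        raise ValueError("all tail values equal x_min; exponent undefined")
    return 1.0 + len(tail) / log_sum
```

For integer-valued data such as mention counts, this continuous estimator is only an approximation, and the choice of `x_min` affects the result.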

Further network centrality measures were calculated and compared to identify the

most influential cities. Since weighted degree centrality is highly correlated with

eigenvector and PageRank centralities, weighted degree by itself accurately identifies the

most popular cities. New York has the highest node strength and thus receives the most

mentions from other cities. Kleinberg authority scores confirmed New York as the most

popular city in the country and revealed that Los Angeles tweets mostly about other

popular cities. In terms of information diffusion, several cities have similarly high

closeness values. Among them, Nashville, TN, has the highest value and would in this

respect be the best entry point for a fast spread of news across the network of city

mentions.
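Kleinberg hub and authority scores can be obtained by a simple power iteration on the weighted mention graph. The following is a minimal sketch on a toy graph, not the implementation used in this study; the function name `hits`, the edge dictionary format, and the scaling of the top score to 1 are assumptions made for illustration.

```python
def hits(edges, iterations=100):
    """Hub and authority scores for a weighted directed graph.

    edges maps (mentioning_city, mentioned_city) to the number of mentions.
    Scores are scaled so that the highest value in each set equals 1.
    """
    nodes = sorted({node for edge in edges for node in edge})
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # A city is a good authority if it is mentioned by good hubs.
        new_auth = {n: 0.0 for n in nodes}
        for (src, dst), weight in edges.items():
            new_auth[dst] += weight * hub[src]
        # A city is a good hub if it mentions good authorities.
        new_hub = {n: 0.0 for n in nodes}
        for (src, dst), weight in edges.items():
            new_hub[src] += weight * new_auth[dst]
        a_max = max(new_auth.values()) or 1.0
        h_max = max(new_hub.values()) or 1.0
        auth = {n: v / a_max for n, v in new_auth.items()}
        hub = {n: v / h_max for n, v in new_hub.items()}
    return hub, auth


# Hypothetical toy network: B and C mention A, C and A mention B.
TOY_EDGES = {("B", "A"): 3, ("C", "A"): 2, ("C", "B"): 1, ("A", "B"): 1}
toy_hub, toy_auth = hits(TOY_EDGES)
```

In this toy graph, A receives the most weighted mentions and comes out as the top authority, while B, which mentions A most often, is the top hub.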

Connectance and reciprocity measures suggest that state-level subgraphs of

mentions are, as can be expected, better connected than the graph for the entire

country, confirming earlier studies which suggest a close relationship between

geographic and network space in terms of communication clustering.
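A minimal sketch of the two subgraph indicators follows, assuming the common definitions: connectance as the share of possible directed ties (excluding self-loops) that exist, and reciprocity as the share of directed ties whose reverse tie also exists. The toy edge list is hypothetical; it was chosen so that a 7-node graph reproduces the values reported for Colorado in Table 4-4, but the actual Colorado edges are not given in the text.

```python
def connectance(n_nodes, edges):
    """Share of possible directed ties (self-loops excluded) that exist."""
    return len(set(edges)) / (n_nodes * (n_nodes - 1))


def reciprocity(edges):
    """Share of directed ties whose reverse tie also exists."""
    edge_set = set(edges)
    mutual = sum(1 for (a, b) in edge_set if (b, a) in edge_set)
    return mutual / len(edge_set)


# Hypothetical 7-node graph: three mutual pairs plus three one-way ties.
EXAMPLE_EDGES = [
    (0, 1), (1, 0),
    (2, 3), (3, 2),
    (4, 5), (5, 4),
    (0, 2), (1, 6), (3, 4),
]
example_connectance = connectance(7, EXAMPLE_EDGES)  # 9 / 42
example_reciprocity = reciprocity(EXAMPLE_EDGES)     # 6 / 9
```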

Sentiment analysis identified events in smaller cities. Geo-hashtags also allowed

analysis of the connection between sentiment and the distance between the involved

cities. The moderate negative correlation (R² = 0.35) can be interpreted as an attenuation

in the sentiment of tweets with increasing distance. Therefore, we conclude that

Twitter users tend to tweet more favorably about neighboring cities. A possible

extension of this analysis would be to include tweets from a longer time period.
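The relation between distance and sentiment can be quantified with a Pearson correlation coefficient, as sketched below. The distance and sentiment values are hypothetical and serve only to show the computation. For a simple linear fit, an R² of 0.35 with a negative slope corresponds to r of about -0.59.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical city pairs: distance in 1000s of km and mean sentiment.
DISTANCES = [0.1, 0.5, 1.0, 2.0, 3.5]
SENTIMENTS = [0.62, 0.60, 0.55, 0.57, 0.48]
r = pearson_r(DISTANCES, SENTIMENTS)
r_squared = r ** 2
```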

QAP regression shed some light on the factors that play a significant role in

communication ties between cities. For example, closer geographic proximity as well as

the location in the same state led to stronger communication ties, displaying examples

of homophily. For other variables, such as population, a larger difference in attribute

levels leads to stronger ties, showcasing heterophily, where larger cities may attract a

disproportionately higher number of mentions from small cities than is the case for

cities of approximately equal size.
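The dyadic matrices entering such a regression can be built from city attributes, as in the following sketch. It assumes that a numerical attribute is turned into a dissimilarity matrix through absolute differences and that a categorical attribute (such as state) is turned into a 0/1 similarity matrix; the exact construction used in this study is not restated here, and both function names are hypothetical.

```python
def dissimilarity_matrix(values):
    """Absolute difference of a numerical attribute for every city pair."""
    return [[abs(a - b) for b in values] for a in values]


def similarity_matrix(labels):
    """1 where two cities share a categorical label (e.g. state), else 0."""
    return [[1 if a == b else 0 for b in labels] for a in labels]


# Hypothetical attributes for three cities.
population_diff = dissimilarity_matrix([10, 40, 25])
same_state = similarity_matrix(["TX", "TX", "CO"])
```

Under this coding, a positive coefficient on `population_diff` would point to heterophily, whereas a positive coefficient on `same_state` would point to homophily.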

This regression methodology can be extended to account for additional city

covariates. It is important to emphasize that this analysis does not necessarily show the

true (i.e. long-term) popularity of analyzed cities or city ties, but may be biased by

short-term events or name ambiguity. For example, some of the manually

examined tweets containing the #Atlanta hashtag referred to a television show

called Atlanta (hence tweeting about that show and not necessarily about the city). Also,

some tweets containing #LA used it for the state of Louisiana and not for the city of Los

Angeles, for which it is typically used. To obtain more accurate results, future work therefore

calls for advanced filtering techniques for hashtags, e.g. using advanced text

processing, geo-ontologies, similarity measures, and thesauri. Another potential step

further in this analysis would be to extend the study to cities worldwide, more closely

resembling the idea of a world city (Freeman, 1977). The obvious challenge would be to

obtain attribute data for the production of dyadic relations, such as occupational

employment statistics, for countries across the world, which will hopefully be facilitated

through an increasing number of open data initiatives around the world.

Figure 4-1. Setup of world regions used for Twitter data download.

{
  "id": "42e46bc3663a4b5f",
  "url": "https://api.twitter.com/1.1/geo/id/42e46bc3663a4b5f.json",
  "name": "Fort Worth",
  "country": "United States",
  "full_name": "Fort Worth, TX",
  "attributes": {},
  "place_type": "city",
  "bounding_box": {
    "type": "Polygon",
    "coordinates": [
      [
        [-97.538285, 32.569477],
        [-97.538285, 32.990456],
        [-97.033542, 32.990456],
        [-97.033542, 32.569477]
      ]
    ]
  },
  "country_code": "US"
}

Figure 4-2. Country place tag in geo-tagged tweets JSON file.
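A place tag such as the one in Figure 4-2 can be parsed with a few lines of code. The sketch below copies a subset of the fields shown in the figure and derives the center of the bounding box; the helper name `bounding_box_center` is hypothetical, and using the box center as a city location is an assumption for illustration, not necessarily the procedure followed in this study.

```python
import json

# Subset of the place tag shown in Figure 4-2.
PLACE_JSON = """
{
  "name": "Fort Worth",
  "full_name": "Fort Worth, TX",
  "place_type": "city",
  "country_code": "US",
  "bounding_box": {
    "type": "Polygon",
    "coordinates": [[
      [-97.538285, 32.569477],
      [-97.538285, 32.990456],
      [-97.033542, 32.990456],
      [-97.033542, 32.569477]
    ]]
  }
}
"""


def bounding_box_center(place):
    """Return (longitude, latitude) of the center of a place's bounding box."""
    ring = place["bounding_box"]["coordinates"][0]
    lons = [point[0] for point in ring]
    lats = [point[1] for point in ring]
    return (min(lons) + max(lons)) / 2, (min(lats) + max(lats)) / 2


place = json.loads(PLACE_JSON)
center_lon, center_lat = bounding_box_center(place)
```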

Figure 4-3. Locations of originating cities of tweets (green polygons) and density of mentioned cities (blueish Kernel density map).

Figure 4-4. Force-directed layout for a sub-graph of cities that have more than 30 incoming mentions.

Figure 4-5. Distribution of weighted and unweighted distances (in km) between U.S. cities.

Figure 4-6. Power law fitting the distribution of the weighted indegree and weighted outdegree of the city mentions graph.

Figure 4-7. A network of mentions between cities in Colorado (link width is proportionate to edge weights).

Figure 4-8. Word clouds of the words most used with some of the analyzed geo-hashtags: A) Cleveland, OH, B) Roanoke, VA, C) Boston, MA, D) Tulsa, OK, E) Charlotte, NC, F) New York, NY.

Figure 4-9. Mean sentiment value of tweets between pairs of cities plotted against distance (in 1000s of km) between pairs of cities. A) links with more than 30 mentions, B) links with more than 10 mentions.

Table 4-1. Cities with highest weighted indegree and outdegree (strength): A) Indegree, B) Outdegree

A
City                 Indegree
New York, NY         660
Atlanta, GA          352
Los Angeles, CA      349
Boston, MA           303
Chicago, IL          279
Las Vegas, NV        270
Charlotte, NC        252
Washington, DC       251
San Francisco, CA    212
Detroit, MI          203
Miami, FL            192
Philadelphia, PA     150
Dallas, TX           133
Seattle, WA          124
Cleveland, OH        118
Nashville, TN        112
Houston, TX          87
Denver, CO           85
San Diego, CA        81
Portland, OR         74

B
City                 Outdegree
Los Angeles, CA      385
New York, NY         385
Cambridge, MA        206
Chicago, IL          174
Washington, DC       160
Oakland, CA          117
Warren, MI           116
San Francisco, CA    109
Atlanta, GA          106
Long Island, NY      102
Anaheim, CA          97
Houston, TX          97
Fort Worth, TX       86
San Diego, CA        82
Seattle, WA          81
Miami, FL            78
Fort Lauderdale, FL  76
Newark, NJ           73
Dallas, TX           72
Phoenix, AZ          70

Table 4-2. Pearson correlation between weighted centrality measures

Pearson correlation        Degree centrality  Eigenvector centrality  Kleinberg Authority Score
Degree centrality          1.000              0.950                   0.732
Eigenvector centrality     0.950              1.000                   0.716
Kleinberg Authority Score  0.732              0.716                   1.000

Table 4-3. City ranking based on closeness centrality, together with Kleinberg hub and authority scores: A) Authority scores, B) Hub scores.

A
City                 Authority score
New York, NY         1.000
Las Vegas, NV        0.490
Atlanta, GA          0.448
Los Angeles, CA      0.436
Washington, DC       0.387
Chicago, IL          0.360
San Francisco, CA    0.307
Miami, FL            0.248
Detroit, MI          0.234
Charlotte, NC        0.203
Philadelphia, PA     0.179
San Diego, CA        0.166
Dallas, TX           0.154
Houston, TX          0.124
Seattle, WA          0.122
Nashville, TN        0.120
Cleveland, OH        0.101
Denver, CO           0.097
Anaheim, CA          0.064
Tulsa, OK            0.063

B
City                 Hub score
Los Angeles, CA      1.000
New York, NY         0.668
Long Island, NY      0.561
Chicago, IL          0.401
Newark, NJ           0.395
Washington, DC       0.355
Miami, FL            0.264
Anaheim, CA          0.260
Oakland, CA          0.237
Atlanta, GA          0.230
Houston, TX          0.218
Philadelphia, PA     0.207
Warren, MI           0.195
San Francisco, CA    0.194
Seattle, WA          0.164
San Diego, CA        0.157
Fort Lauderdale, FL  0.150
Dallas, TX           0.147
Fort Worth, TX       0.147
Riverside, CA        0.127

Table 4-4. City mentions state subgraph indicators

State  Reciprocity  Connectance  Number of nodes
AL     0.000        0.044        10
CA     0.500        0.108        27
CO     0.667        0.214        7
FL     0.372        0.102        21
GA     0.182        0.122        10
IL     0.167        0.109        11
MD     0.500        0.190        7
MI     0.167        0.066        14
NC     0.211        0.122        13
NJ     0.333        0.083        9
NY     0.421        0.144        12
OR     0.000        0.133        6
PA     0.000        0.044        17
SC     0.250        0.190        7
TX     0.333        0.065        22
WA     0.545        0.122        10

Table 4-5. Mean number of employees in given occupation per 1000 employees in any occupation across all analyzed cities, and its standard error (standard deviation of the mean). Categories in boldface highlight specific occupational categories whereas those in regular font show broad occupation categories

Row  Occupations                                     Mean per 1000 employees  Standard error
1    Architecture and engineering                    16.5                     0.51
2    Arts, design, entertainment, sports, and media  10.5                     0.21
3    Building and grounds cleaning and maintenance   32.5                     0.39
4    Business and financial operations               41.9                     0.81
5    Community and social service                    15.5                     0.26
6    Computer and mathematical                       21.1                     0.75
7    Construction and extraction                     40.4                     0.69
8    Education, training, and library                65.2                     0.86
9    Farming, fishing, and forestry                  5.4                      0.94
10   Food preparation and serving related            98.6                     0.88
11   Healthcare practitioners and technical          64.8                     0.76
12   Healthcare support                              30.3                     0.43
*13  Installation, maintenance, and repair           41.5                     0.40
14   Legal                                           5.7                      0.15
15   Life, physical, and social science              7.7                      0.29
16   Management                                      44.9                     0.62
17   Office and administrative support               153.7                    0.78
18   Personal care and service                       32.8                     0.54
19   Production                                      71.1                     1.95
*20  Protective service                              23.3                     0.48
21   Sales and related                               106.3                    0.76
22   Transportation and material moving              66.1                     0.96
23   Hotels                                          2.1                      0.07
24   Retail                                          49.6                     0.47
25   Real estate                                     2.2                      0.07

Table 4-6. Arithmetic signs of estimated coefficients from Multivariate QAP regression on four models

Model #                       1            2            3
Subgraph selection criterion  Indegree>10  Indegree>30  Indegree>50
# of nodes                    57           33           21
Reciprocity                   .556         .710         .847
Adjusted R squared            0.192        0.290        0.426

Arithmetic sign of slope coefficient: each model has a - column and a + column. The significance markers are listed per variable in their order of appearance; their assignment to the - or + column is not shown here.

Numerical variables (dissimilarity matrix)
Airports                   *
Art and design             *
Building grounds cleaning  ***  ***  **
Distance                   **   *    **
Farming                    *
Healthcare practitioners   *
Healthcare support         *
Hotels                     **   ***  **
Personal services          *
Population                 **   **
Production                 ***  *
Retail                     **
Schools                    *

Categorical variables (similarity matrix)
State                      ***  **

Signif. codes: *** p < 0.001; ** p < 0.01; * p < 0.05

CHAPTER 5 CONCLUSIONS

The conducted case studies enhance the understanding of geospatial patterns of

information exchange and propagation through Twitter. Some of the factors that

influence the spread of information are also identified. The achieved results can be used

to improve and optimize a variety of real-world applications in the social media and

information science domain.

The exploration of Twitter and Instagram photos helped to better understand the

two VGI sources. Analyzed spatial offsets between object location and photo upload

location varied significantly across the continents, where the offset distance was

smallest in the United States, followed by Europe and other analyzed continents,

potentially indicating different availability levels of mobile Internet. Twitter places and

Instagram location tags were found to be available at a different spatial granularity.

Twitter places were generally found to be available at the neighborhood and city level.

The wide availability of user-contributed points of interest in Instagram facilitated the

analysis of their accuracy, revealing, for example, multiple labels and locations to

indicate the same point of interest.

A comprehensive analysis of Twitter images and other tweet content formats in

response to the Paris attacks answered several questions related to information

propagation. Tweets with images, when compared to tweets with hashtags or text

related to the attacks, received the highest attention from the Twitter community.

Journalists, whose tweets earned a higher popularity than those of non-journalists,

posted a large portion of the images. This indicates that journalists perhaps use Twitter

more actively than users for whom Twitter is not a professional work tool. An

expansion of this work could be the analysis of Twitter images during other types of

events. To circumvent the unavailability of location information of retweets due to

technical limitations of the Twitter API, a new method of locating retweets was applied.

That is, retweets were located based on geolocated tweets by the same users posted

within a short period of time before and after the retweet. This method resulted in a

denser map of retweets in Europe, which may reflect higher safety concerns and closer

proximity to the terrorist attack locations.
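The retweet-locating idea can be sketched as follows. The length of the time window and the rule of taking the geolocated tweet closest in time are assumptions made for illustration; the function name `locate_retweet`, the one-hour default, and the example coordinates and timestamps are all hypothetical.

```python
from datetime import datetime, timedelta


def locate_retweet(retweet_time, geolocated_tweets, window=timedelta(hours=1)):
    """Assign a location to a retweet from the same user's geolocated tweets.

    geolocated_tweets is a list of (timestamp, (lon, lat)) pairs posted by
    the retweeting user. The location of the tweet closest in time is
    returned if it falls within the window, otherwise None.
    """
    best = None
    for tweet_time, location in geolocated_tweets:
        gap = abs(tweet_time - retweet_time)
        if gap <= window and (best is None or gap < best[0]):
            best = (gap, location)
    return None if best is None else best[1]


# Hypothetical user timeline around a retweet posted at 12:00.
retweet_at = datetime(2015, 11, 14, 12, 0)
timeline = [
    (datetime(2015, 11, 14, 11, 50), (2.35, 48.85)),
    (datetime(2015, 11, 14, 12, 5), (2.29, 48.86)),
    (datetime(2015, 11, 14, 18, 0), (4.35, 50.85)),
]
assigned = locate_retweet(retweet_at, timeline)
```

A shorter window gives more reliable locations but leaves more retweets unlocated.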

The geographic distribution of hashtags showed that hashtags in French are not

immune to language barriers, since French hashtags were used mainly in

francophone territories, whereas hashtags in English were posted all over the world.

Analysis of the temporal distribution of the hashtags showed that the public interest in

the attacks peaked on the day after the events and diminished two to three days later.

The spatiotemporal regression model of the hashtag spread showed that the

number of hashtags in places around the country depends on the number of tweets in

the main city in the country (the “tweeting capital” of the country). This suggests a two-

level hierarchical structure of the spread within a country from the tweeting capital to

surrounding cities.

The third case study introduced new measures of popularity of U.S. cities and

modeled their interaction through geo-hashtags. The network of inter-city mentions was

found to be a scale-free network. Network centrality measures identified New York as

the most popular city and showed that Twitter users from Los Angeles mention popular cities

more than users from any other place in the U.S. The sentiment analysis identified

events in some smaller cities and showed a weak decline of the average

sentiment of tweets with the distance between cities. A further network regression model

identified significant factors in communication between cities. Namely, cities receive

most mentions from smaller neighboring cities and from cities in the same state. In

the case of state and distance, the underlying process is homophily, since connections are

more likely to occur between similar cities. For population, the underlying process is

heterophily, since connections are more likely to occur between cities that are different.

The network regression can be extended by including additional relevant city

descriptors and by analyzing tweets collected over a longer period of time.

In conclusion, Twitter is a valuable and rich source of geographic information.

Beneficiaries of a better understanding of information flow through Twitter include

governments, marketing research companies and emergency management

organizations, to name a few. The format of tweets, their theme (or content category)

and user profession were found to be important in relaying the information to the world

and the news landscape. All these factors affect the popularity of tweets, which is found

to play a significant role in raising geographic situational awareness in emergency

situations such as terrorist attacks. Inter-city mentions can be viewed as information

flows from the mentioned city to the mentioning city. Some of the analyzed cases

showed that events in the mentioned cities were sources of information. The model

proposed in the second study can be used by emergency management agencies for

geographic and temporal prediction of the intensity of Twitter users’ reactions. A

potential application of the model of inter-city mentions is the

dimensioning of airline traffic between cities.

LIST OF REFERENCES

Achananuparp, P., Lim, E.-P., Jiang, J., & Hoang, T.-A. (2012). Who is Retweeting the Tweeters? Modeling, Originating, and Promoting Behaviors in the Twitter Network. ACM Transactions on Management Information Systems, 3(3), 13:1–13:30. https://doi.org/10.1145/2361256.2361258

Adai, A. T., Date, S. V., Wieland, S., & Marcotte, E. M. (2004). LGL: Creating a map of protein function with an algorithm for visualizing very large biological networks. Journal of Molecular Biology, 340(1), 179–190. https://doi.org/10.1016/j.jmb.2004.04.047

Alivand, M., & Hochmair, H. H. (2017). Spatiotemporal analysis of photo contribution patterns to Panoramio and Flickr. Cartography and Geographic Information Science, 44(2), 170–184. https://doi.org/10.1080/15230406.2016.1211489

Andrienko, G., Andrienko, N., Bosch, H., Ertl, T., Fuchs, G., & Jankowski, P. (2013). Thematic Patterns in Georeferenced Tweets through Space-Time Visual Analytics. Computing in Science & Engineering, 15(13), 72–82. https://doi.org/10.1109/MCSE.2013.70

Aslam, S. (2018). Twitter by the Numbers: Stats, Demographics & Fun Facts. Retrieved from https://www.omnicoreagency.com/twitter-statistics/

Bakshy, E., Hofman, J., Mason, W., & Watts, D. (2011). Everyone’s an influencer: quantifying influence on twitter. In Proceedings of the fourth ACM international conference on Web search and data mining SE - WSDM ’11 (pp. 65–74). ACM. https://doi.org/10.1145/1935826.1935845

Barrat, A., Barthélemy, M., Pastor-Satorras, R., & Vespignani, A. (2004). The architecture of complex weighted networks. In Proceedings of the National Academy of Sciences of the United States of America (Vol. 101, pp. 3747–3752). https://doi.org/10.1073/pnas.0400087101

Barthélemy, M. (2011). Spatial networks. Physics Reports, 499(1–3), 1–101. https://doi.org/10.1016/j.physrep.2010.11.002

Blondel, V. D., Guillaume, J. L., Lambiotte, R., & Lefebvre, E. (2008). Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment, 2008(10), 10008–10020. https://doi.org/10.1088/1742-5468/2008/10/P10008

Boccaletti, S., Latora, V., Moreno, Y., Chavez, M., & Hwang, D.-U. (2006). Complex networks: Structure and dynamics. Physics Reports, 424(4–5), 175–308.

Borgatti, S. P., Everett, M. G., & Freeman, L. C. (2002). Ucinet 6 for Windows: Software for Social Network Analysis. Harvard, MA: Analytic Technologies.

Borgatti, S. P., Mehra, A., Brass, D. J., & Labianca, G. (2009). Network Analysis in the Social Sciences. Science, 323(5916), 892–895. https://doi.org/10.1126/science.1165821

Brennan, S., Sadilek, A., & Kautz, H. (2013). Towards understanding global spread of disease from everyday interpersonal interactions. In Proceedings of the Twenty-Third international joint conference on Artificial Intelligence (pp. 2783–2789). Beijing, China: AAAI Press.

Bryl, S. (2017). Machine Learning in R using doc2vec approach. Retrieved from https://analyzecore.com/2017/02/08/twitter-sentiment-analysis-doc2vec/

Burt, R. S. (1995). Structural Holes: The Social Structure of Competition. Harvard University Press.

Can, E. F., Oktay, H., & Manmatha, R. (2013). Predicting retweet count using visual cues. Proceedings of the 22nd ACM International Conference on Conference on Information & Knowledge Management, 1481–1484. https://doi.org/10.1145/2505515.2507824

Census. (2014). 2014 TIGER/Line® Shapefiles: Blocks (2010). Retrieved from https://www.census.gov/cgi-bin/geo/shapefiles2014/layers.cgi

Cha, M., Haddadi, H., Benevenuto, F., & Gummadi, K. P. (2010). Measuring User Influence in Twitter: The Million Follower Fallacy. International AAAI Conference on Weblogs and Social Media, 10–17. https://doi.org/10.1.1.167.192

Chang, H. C. (2010). A new perspective on Twitter hashtag use: Diffusion of innovation theory. Proceedings of the ASIST Annual Meeting, 47. https://doi.org/10.1002/meet.14504701295

Cheng, Z., Caverlee, J., & Lee, K. (2010). You are where you tweet: a content-based approach to geo-locating twitter users. In Proceedings of the 19th ACM international conference on Information and knowledge management (pp. 759–768). New York, NY, USA: ACM. https://doi.org/10.1145/1871437.1871535

Chong, M. (2016). Sentiment analysis and topic extraction of the twitter network of #prayforparis. Proceedings of the Association for Information Science and Technology, 53(1), 1–4. https://doi.org/10.1002/pra2.2016.14505301133

Compston, S. (2014). Identifying and Understanding Retweets & Quote Tweets. Retrieved January 10, 2017, from http://support.gnip.com/articles/identifying-and-understanding-retweets.html

Cook, J. M. (2012). Gender, voting and cosponsorship in the Maine State legislature. New England Journal of Political Science, IV(1), 1–30.

Csardi, G., & Tamas, N. (2006). The igraph software package for complex network research. InterJournal.

Cvetojevic, S., Juhász, L., & Hochmair, H. H. (2016). Positional Accuracy of Twitter and Instagram Images in Urban Environments. GI_Forum 2016, 1, 191–203. https://doi.org/10.1553/giscience2016_01_s191

De Longueville, B., & Smith, R. S. (2009). “OMG, from here, I can see the flames!”: a use case of mining Location Based Social Networks to acquire spatio-temporal data on forest fires. In Proceedings of the 2009 International Workshop on Location Based Social Networks (LBSN ’09) (pp. 73–80). Seattle, Washington, USA. https://doi.org/10.1145/1629890.1629907

De Rosario Martínez, H. (2015). Analysing interactions of fitted models. https://doi.org/10.1007/s13398-014-0173-7.2

Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., & Harshman, R. (1990). Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6), 391–407. https://doi.org/10.1002/(SICI)1097-4571(199009)41:6<391::AID-ASI1>3.0.CO;2-9

Dekker, D., Krackhardt, D., & Snijders, T. (2003). Multicollinearity robust QAP for multiple regression. NAACSOS Conference, Omni William Penn., 1–5.

Derudder, B., Taylor, P. J., Hoyler, M., Ni, P., Liu, X., Zhao, M., … Witlox, F. (2013). Measurement and Interpretation of Connectivity of Chinese Cities in World City Network, 2010. Chinese Geographical Science, 23(3), 261–273.

Duggan, M., Ellison, N. B., Lampe, C., Lenhart, A., & Madden, M. (2015). Demographics of Key Social Networking Platforms. Retrieved January 10, 2017, from http://www.pewinternet.org/2015/01/09/demographics-of-key-social-networking-platforms-2/

Dunne, J. A., Williams, R. J., & Martinez, N. D. (2002). Food-web structure and network theory: The role of connectance and size. PNAS, 99(20), 12917–12922. https://doi.org/10.1073/pnas.192407699

Evangelopoulos, N., Ashton, T., Winson-Geideman, K., & Roulac, S. (2015). Latent Semantic Analysis and Real Estate Research: Methods and Applications. Journal of Real Estate Literature, 23(2), 353–380. https://doi.org/10.5555/0927-7544.23.2.353

FAA. (2010). Passenger Boarding (Enplanement) and All-Cargo Data for U.S. Airports - Previous Years. Retrieved from https://www.faa.gov/airports/planning_capacity/passenger_allcargo_stats/passenger/previous_years/#2000

Ferguson, C., Inglis, S. C., Newton, P. J., Cripps, P. J. S., Macdonald, P. S., & Davidson, P. M. (2014). Social media: A tool to spread information: A case study analysis of Twitter conversation at the Cardiac Society of Australia & New Zealand 61st Annual Scientific Meeting 2013. Collegian, 21(2), 89–93. https://doi.org/10.1016/j.colegn.2014.03.002

Fernandez, R. M., Castilla, E. J., & Moore, P. (2000). Social capital at work: networks and employment at a phone center. American Journal of Sociology, 105(5), 1288–1356.

Fischer, F. (2012). VGI as big data. A new but delicate geographic data-source. GeoInformatics, 15(3), 46–47.

Freeman, L. C. (1977). A Set of Measures of Centrality Based on Betweenness. Sociometry, 40(1), 35–41.

Friedman, J., Hastie, T., & Tibshirani, R. (2010). Regularization Paths for Generalized Linear Models via Coordinate Descent. Journal of Statistical Software, 33(1), 1–22.

Gilbert, E., & Karahalios, K. (2009). Predicting Tie Strength With Social Media. In CHI 2009. Boston, Massachusetts, USA: ACM.

Go, A., Bhayani, R., & Huang, L. (n.d.). Twitter Sentiment Classification using Distant Supervision.

Goodchild, M. F. (2007). Citizens as Voluntary Sensors: Spatial Data Infrastructure in the World of Web 2.0. International Journal of Spatial Data Infrastructures Research, 2, 24–32.

Goolsby, R. (2010). Social Media as Crisis Platform: The Future of Community Maps/Crisis Maps. ACM Transactions on Intelligent Systems and Technology, 1(1), Article 7.

Graham, M., Hale, S. A., & Gaffney, D. (2014). Where in the World Are You? Geolocation and Language Identification in Twitter. Professional Geographer, 66(4), 568–578. https://doi.org/10.1080/00330124.2014.907699

Granovetter, M. S. (1973). The strength of weak ties. American Journal of Sociology, 78(6), 1360–1380.

Gründemann, T., & Burghardt, D. (2016). Visual Analysis of Thematic, Social and Geospatial Patterns of Microblogging Content Using D3. In LinkVGI workshop in association with the 19th AGILE Conference on Geographic Information Science.

Guille, A., Hacid, H., Favre, C., & Zighed, D. A. (2013). Information Diffusion in Online Social Networks: A Survey. ACM SIGMOD Record, 42(2), 17–28. https://doi.org/10.1145/2503792.2503797

Hawelka, B., Sitko, I., Beinat, E., Sobolevsky, S., Kazakopoulos, P., & Ratti, C. (2014). Geo-located Twitter as proxy for global mobility patterns. Cartography and Geographic Information Science, 41(3), 260–271. https://doi.org/10.1080/15230406.2014.890072

Hochmair, H. H., & Cvetojevic, S. (2014). Assessing the Usability of Georeferenced Tweets for the Extraction of Travel Patterns: A Case Study for Austria and Florida. GI_Forum 2014, 30–39. https://doi.org/10.1553/giscience2014s30

Hong, L., Convertino, G., & Chi, E. H. (2011). Language Matters in Twitter: A Large Scale Study. In Proceedings of the Fifth International AAAI Conference on Weblogs and Social Media (pp. 518–521).

Hübl, F., Cvetojevic, S., Hochmair, H. H., & Gernot, P. (2017). Analyzing Refugee Migration Patterns using Geo-tagged Tweets. ISPRS International Journal of Geo-Information, 6(10), 302. https://doi.org/10.3390/ijgi6100302

Hung, K.-C., Kalantari, M., & Rajabifard, A. (2016). Methods for assessing the credibility of volunteered geographic information in flood response: A case study in Brisbane, Australia. Applied Geography, 68, 37–47. https://doi.org/10.1016/j.apgeog.2016.01.005

Jahng, M. R., & Littau, J. (2016). Interacting is believing: Interactivity, social cue, and perceptions of journalistic credibility on twitter. Journalism & Mass Communication Quarterly, 93(1), 38–58. https://doi.org/10.1177/1077699015606680

Jurdak, R., Zhao, K., Liu, J., AbouJaoude, M., Cameron, M., & Newth, D. (2015). Understanding human mobility from Twitter. PLoS ONE, 10(7). https://doi.org/10.1371/journal.pone.0131469

Jurgens, D. (2013). That’s What Friends Are For: Inferring Location in Online Social Media Platforms Based on Social Relationships. In Proceedings of the 7th International AAAI Conference on Weblogs and Social Media (pp. 273–282).

King, G., Lam, P., & Roberts, M. E. (2017, March). Computer-Assisted Keyword and Document Set Discovery from Unstructured Text. American Journal of Political Science. https://doi.org/10.1111/ajps.12291

Kleinberg, J. M. (1999). Authoritative sources in a hyperlinked environment. Journal of the ACM, 46(5), 604–632. https://doi.org/10.1145/324133.324140

Koput, K. W. (2010). Social Capital: An Introduction to Managing Networks. Cheltenham, UK: Edward Elgar.

Kotzias, D., Lappas, T., & Gunopulos, D. (2014). Addressing the sparsity of location information on twitter. In CEUR Workshop Proceedings (Vol. 1133, pp. 339–346). https://doi.org/10.1.1.429.2390

Krackhardt, D. (1987). QAP Partialling as a Test of Spuriousness. Social Networks, 9(2), 171–186.

Krackhardt, D. (1988). Predicting With Networks: Nonparametric Multiple Regression Analysis of Dyadic Data. Social Networks. https://doi.org/10.1016/0378-8733(88)90004-4

Kresl, P. K. (1995). The Determinants of Urban Competitiveness: A Survey. In P. K. Kresl & G. Gappert (Eds.), North American Cities and the Global Economy (pp. 45–68). London: Sage.

Kwak, H., Lee, C., Park, H., & Moon, S. (2010). What is Twitter, a social network or a news media? Proceedings of the 19th International Conference on World Wide Web. Raleigh, North Carolina, USA: ACM. https://doi.org/10.1145/1772690.1772751

Landauer, T. K., & Dumais, S. T. (1997). A solution to Plato’s problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge. Psychological Review, 104(2), 211–240. https://doi.org/10.1037/0033-295X.104.2.211

Lansley, G., & Longley, P. A. (2016). The geography of Twitter topics in London. Computers, Environment and Urban Systems, 58, 85–96. https://doi.org/10.1016/j.compenvurbsys.2016.04.002

Lenormand, M., Gonçalves, B., Tugores, A., & Ramasco, J. J. (2015). Human diffusion and city influence. Journal of The Royal Society Interface, 12(109), 20150473. https://doi.org/10.1098/rsif.2015.0473

Lenormand, M., Tugores, A., Colet, P., & Ramasco, J. J. (2014). Tweets on the road. PLoS ONE, 9(8). https://doi.org/10.1371/journal.pone.0105407

Lerman, K., & Ghosh, R. (2010). Information Contagion: an Empirical Study of the Spread of News on Digg and Twitter Social Networks. In Proceedings of the Fourth International AAAI Conference on Weblogs and Social Media (Vol. V, pp. 90–97). Washington, D.C. https://doi.org/10.1146/annurev.an.03.100174.001431

Levinson, D. (2008). Density and dispersion: The co-development of land use and rail in London. Journal of Economic Geography, 8(1), 55–77. https://doi.org/10.1093/jeg/lbm038

Li, L., & Goodchild, M. F. (2010). The Role of Social Networks in Emergency Management. International Journal of Information Systems for Crisis Response and Management, 2(4), 48–58. https://doi.org/10.4018/jiscrm.2010100104

Li, Y., Li, Q., & Shan, J. (2017). Discover Patterns and Mobility of Twitter Users—A Study of Four US College Cities. ISPRS International Journal of Geo-Information, 6(2), 42. https://doi.org/10.3390/ijgi6020042

Liu, Y., Sui, Z., Kang, C., & Gao, Y. (2014). Uncovering Patterns of Inter-Urban Trip and Spatial Interaction from Social Media Check-In Data. PLoS ONE, 9(1), e86026.

Longley, P. A., & Adnan, M. (2016). Geo-temporal Twitter demographics. International Journal of Geographical Information Science, 30(2), 369–389. https://doi.org/10.1080/13658816.2015.1089441

Lotan, G., Graeff, E., Ananny, M., Gaffney, D., Pearce, I., & Boyd, D. (2011). The Revolutions Were Tweeted: Information Flows during the 2011 Tunisian and Egyptian Revolutions. International Journal of Communication, 5, 1375–1405. https://doi.org/1932–8036/2011FEA1375

MacEachren, A. M., Robinson, A. C., Jaiswal, A., Pezanowski, S., Savelyev, A., Blanford, J., & Mitra, P. (2011). Geo-twitter analytics: Applications in crisis management. In 25th International Cartographic Conference (pp. 3–8).

Malik, M. M., Lamba, H., Nakos, C., & Pfeffer, J. (2015). Population Bias in Geotagged Tweets. 9th International AAAI Conference on Weblogs and Social Media, 18–27.

McCulloh, I. (2010). Network Topology Effects on Correlation between Centrality Measures. Connections, 30(1), 21–28.

McFarland, D., Messing, S., Nowak, M., & Westwood, S. J. (2010). Social Network Analysis Labs in R.

Mislove, A., Lehmann, S., Ahn, Y., Onnela, J., & Rosenquist, J. N. (2011). Understanding the Demographics of Twitter Users. In Proceedings of the Fifth International AAAI Conference on Weblogs and Social Media (pp. 554–557).

Moffitt, J. (2014). Twitter Geographical Metadata. Retrieved from http://support.gnip.com/articles/geo-intro.html

Myers, S. A., Sharma, A., Gupta, P., & Lin, J. (2014). Information Network or Social Network? The Structure of the Twitter Follow Graph. In Proceedings of the 23rd International Conference on World Wide Web (pp. 493–498). Seoul, Korea: ACM.

Nielsen, R. K., & Schrøder, K. C. (2014). The Relative Importance of Social Media for Accessing, Finding, and Engaging with News: An eight-country cross-media comparison. Digital Journalism, 2(4), 472–489. https://doi.org/10.1080/21670811.2013.872420

Onnela, J. P., Saramaki, J., Hyvonen, J., Szabo, G., Lazer, D., Kaski, K., … Barabasi, A. L. (2007). Structure and tie strengths in mobile communication networks. Proceedings of the National Academy of Sciences, 104(18), 7332–7336. https://doi.org/10.1073/pnas.0610245104

Opsahl, T., Agneessens, F., & Skvoretz, J. (2010). Node centrality in weighted networks: Generalizing degree and shortest paths. Social Networks, 32(3), 245–251. https://doi.org/10.1016/j.socnet.2010.03.006

Pei, S., Muchnik, L., Andrade, J. S., Jr., Zheng, Z., & Makse, H. A. (2014). Searching for superspreaders of information in real-world social media. Scientific Reports, 4, 5547.

Romero, D. M., Meeder, B., & Kleinberg, J. (2011). Differences in the mechanics of information diffusion across topics: idioms, political hashtags, and complex contagion on twitter. WWW’11 Proceedings of the 20th International Conference on World Wide Web, 695–704. https://doi.org/10.1145/1963405.1963503

Selivanov, D. (2016). text2vec: Modern Text Mining Framework for R.

Seo, H. (2014). Visual Propaganda in the Age of Social Media: An Empirical Analysis of Twitter Images During the 2012 Israeli–Hamas Conflict. Visual Communication Quarterly, 21(3), 150–161. https://doi.org/10.1080/15551393.2014.955501

Shelton, T., Poorthuis, A., Graham, M., & Zook, M. (2014). Mapping the data shadows of Hurricane Sandy: Uncovering the sociospatial dimensions of “big data.” Geoforum, 52, 167–179. https://doi.org/10.1016/j.geoforum.2014.01.006

Signorini, A., Segre, A. M., & Polgreen, P. M. (2011). The use of Twitter to track levels of disease activity and public concern in the US during the influenza A H1N1 pandemic. PLoS ONE, 6(5), e19467. https://doi.org/10.1371/journal.pone.0019467

Snijders, T. A. B. (2011). Statistical Models for Social Networks. Annual Review of Sociology, 37(1), 131–153. https://doi.org/10.1146/annurev.soc.012809.102709

Sobolevsky, S., Szell, M., Campari, R., Couronné, T., Smoreda, Z., & Ratti, C. (2013). Delineating Geographical Regions with Networks of Human Interactions in an Extensive Set of Countries. PLoS ONE, 8(12), e81707.

Sporns, O. (2002). Graph Theory Methods for the Analysis of Neural Connectivity Patterns. In R. Kötter (Ed.), Neuroscience databases. A practical guide. (pp. 171–185). Boston, MA: Kluwer Academic Press. https://doi.org/10.1007/978-1-4615-1079-6_12

Steiger, E., Ellersiek, T., Resch, B., & Zipf, A. (2015). Uncovering latent mobility patterns from Twitter during mass events. GI_Forum: Journal for Geographic Information Science, 1, 525–534. https://doi.org/10.1553/giscience2015s525

Stephens, M., & Poorthuis, A. (2014). Follow thy neighbor: Connecting the social and the spatial networks on Twitter. Computers, Environment and Urban Systems, 53, 87–95. https://doi.org/10.1016/j.compenvurbsys.2014.07.002

Takhteyev, Y., Gruzd, A., & Wellman, B. (2012). Geography of Twitter networks. Social Networks, 34(1), 73–81. https://doi.org/10.1016/j.socnet.2011.05.006

Tan, L., & Lei, D. (2013). Exact Solutions of a Generalized Weighted Scale Free Network. Journal of Applied Mathematics, 2013, 1–6. https://doi.org/10.1155/2013/902519

Taylor, P. J. (2001). Specification of the World City Network. Geographical Analysis, 33(2), 181–194.

Tsur, O., & Rappoport, A. (2012). What’s in a Hashtag? Content based Prediction of the Spread of Ideas in Microblogging Communities. Proceedings of the Fifth ACM International Conference on Web Search and Data Mining - WSDM ’12, 643. https://doi.org/10.1145/2124295.2124320

Valle, D., Cvetojevic, S., Robertson, E. P., Reichert, B. E., Hochmair, H. H., & Fletcher, R. J. (2017). Individual Movement Strategies Revealed through Novel Clustering of Emergent Movement Patterns. Scientific Reports, 7, 44052. https://doi.org/10.1038/srep44052

Varol, O., Ferrara, E., Davis, C. A., Menczer, F., & Flammini, A. (2017). Online Human-Bot Interactions: Detection, Estimation, and Characterization.

Vidya, N. A., Fanany, M. I., & Budi, I. (2015). Twitter Sentiment to Analyze Net Brand Reputation of Mobile Phone Providers. Procedia Computer Science, 72, 519–526. https://doi.org/10.1016/j.procs.2015.12.159

Wang, X. F., & Chen, G. (2003). Complex networks: Small-world, scale-free and beyond. IEEE Circuits and Systems Magazine, 3(1), 6–20. https://doi.org/10.1109/MCAS.2003.1228503

Watts, D. J., & Strogatz, S. H. (1998). Collective dynamics of “small-world” networks. Nature, 393(6684), 440–442. https://doi.org/10.1038/30918

Weng, L., Menczer, F., & Ahn, Y.-Y. (2013). Virality Prediction and Community Structure in Social Networks. Scientific Reports, 3, 2522. https://doi.org/10.1038/srep02522

White, E. P., Enquist, B. J., & Green, J. L. (2008). On estimating the exponent of power law frequency distributions. Ecology, 89(4), 905–912.

Yang, J., & Counts, S. (2010). Predicting the Speed, Scale, and Range of Information Diffusion in Twitter. Fourth International AAAI Conference on Weblogs and Social Media, 355–358. https://doi.org/10.1016/j.adhoc.2011.06.003

Zahra, K., Ostermann, F. O., & Purves, R. S. (2017). Geographic variability of Twitter usage characteristics during disaster events. Geo-Spatial Information Science, 20(3), 231–240. https://doi.org/10.1080/10095020.2017.1371903

Zook, M. A., & Brunn, S. D. (2005). Hierarchies, Regions and Legacies: European Cities and Global Commercial Passenger Air Travel. Journal of Contemporary European Studies, 13(2), 203–220.

BIOGRAPHICAL SKETCH

Sreten Cvetojević was born in Ljubovija, Serbia. In 2011, he graduated with an Engineer’s Diploma (Dipl. Ing., equivalent to U.S. Bachelor of Science and Master of Science degrees) in Telecommunications Networks and Traffic Engineering from the University of Belgrade, Serbia. He was accepted as a Ph.D. student and research assistant at the University of Florida in 2013 and graduated with a Ph.D. in Forest Resources and Conservation with a concentration in geomatics in 2018.