The Geography of Topics from Geo-referenced Social Media Data in London

Guy Lansley
Department of Geography, University College London
@GuyLansley

AAG Annual Meeting 2015, Chicago, USA

Context

• Twitter could serve as a useful source of temporal population data at a very small area geography

• These data can be used to predict how the population negotiates travel around cities at a small-area aggregate level

• The content of the Tweets poses an interesting area of research into how people's activity and their behaviour on social media may link to time, place and space

• Such insight could be very useful to marketing firms and retailers

• This research aims to implement an unsupervised text modelling approach to cluster the Tweets into distinctive topics and analyse how they vary by time and space in Central London

Previous Research

• Geo-located Twitter data and High Street Insight

• Lansley (2014) Evaluating the utility of geo-referenced Twitter data as a source of reliable footfall insight

• Classifying Tweets using an unsupervised learning algorithm

• Lai, Cheng and Lansley (2015) Spatio-Temporal Patterns of Passengers’ Interests at London Tube Stations

Twitter Hashtags and Space

Representativeness of Twitter

• Previous research has found geo-located Twitter data sourced from the UK to be over-representative of young, White British adults, and there is also a higher penetration amongst males

• Twitter has been shown to be a useful indicator of footfall, although the proportional spatial distribution of Tweets has been found to differ from Census statistics (see below)

[Figure: panels labelled Day and Night, comparing Twitter with the Census]

Data available through the Twitter API

• User Creation Date

• Followers

• Friends

• User ID

• Language

• Location

• Name

• Screen Name

• Time Zone

• Geo Enabled

• Latitude

• Longitude

• Tweet date and time

• Tweet text

Twitter Data

• As text modelling is computationally intensive and the density of Tweets can be very low in some places, it was decided to restrict the sample to Inner London

• Tweets from 1st January 2013 until 31st December 2013 were downloaded

• To understand the typical weekday patterns, only Tweets from Tuesday, Wednesday and Thursday were used
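The weekday selection described above can be sketched in a few lines. This is only an illustration: the function name `is_midweek_2013` is hypothetical, and it assumes each Tweet's timestamp has already been parsed into a local-time `datetime`.

```python
from datetime import datetime

# Python's weekday() numbers Monday as 0, so Tuesday, Wednesday
# and Thursday are 1, 2 and 3.
MIDWEEK = {1, 2, 3}


def is_midweek_2013(timestamp: datetime) -> bool:
    """True for Tweets sent on a Tuesday, Wednesday or Thursday in 2013."""
    return timestamp.year == 2013 and timestamp.weekday() in MIDWEEK
```

Any time-zone conversion would need to happen before this check, since a Tweet sent close to midnight can fall on a different day in UTC than in London time.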

Filtering

Aim: to reduce the amount of noise in the dataset

• Tweets with fewer than 3 words were removed

• Words with fewer than 3 characters or more than 15 characters were removed

• URLs were removed

• Tweets from users with over 2000 Tweets were removed from the sample

• Tweets from false users who had repeatedly re-posted the same text were removed from the sample
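The filtering rules above could be expressed roughly as follows. This is a sketch under stated assumptions, not the pipeline actually used: the function names are hypothetical, the order of the steps is a guess (URLs and short or long words are stripped before the 3-word check), and the `max_repeats` threshold for spotting false users is invented for illustration because the slides do not give one.

```python
import re
from collections import Counter

URL_PATTERN = re.compile(r"(?:https?://|www\.)\S+")


def clean_text(text, min_chars=3, max_chars=15):
    """Strip URLs, then keep only words of 3 to 15 characters."""
    without_urls = URL_PATTERN.sub(" ", text)
    return [w for w in without_urls.split() if min_chars <= len(w) <= max_chars]


def filter_tweets(tweets, max_user_tweets=2000, max_repeats=3, min_words=3):
    """tweets: list of (user_id, text) pairs. Returns (user_id, words) pairs."""
    per_user = Counter(user for user, _ in tweets)
    # Count identical (user, text) pairs to find users re-posting one text.
    repeats = Counter(tweets)
    repeat_users = {user for (user, _), n in repeats.items() if n > max_repeats}

    kept = []
    for user, text in tweets:
        if per_user[user] > max_user_tweets or user in repeat_users:
            continue  # prolific or repetitive account: drop all its Tweets
        words = clean_text(text)
        if len(words) >= min_words:
            kept.append((user, words))
    return kept
```

For example, `filter_tweets([("a", "Lovely walk by the Thames http://t.co/abc")])` keeps the Tweet as the four words `Lovely`, `walk`, `the`, `Thames`.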

Number of Tweets

Total weekday Tweets from Greater London in 2013: 3,341,959

Total weekday Tweets from Inner London: 1,679,571

Coordinates cleaned & false users removed: 1,545,899

Strings cleaned: 1,301,004
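As a quick worked check on the counts above, the share of Tweets surviving each step can be computed directly. The stage labels are shortened versions of the rows above and the helper name `retention` is hypothetical.

```python
# Tweet counts after each stage, as reported in the table above.
STAGES = [
    ("Greater London weekday Tweets", 3_341_959),
    ("Inner London", 1_679_571),
    ("Coordinates cleaned, false users removed", 1_545_899),
    ("Strings cleaned", 1_301_004),
]


def retention(stages):
    """Percentage of the previous stage's Tweets kept at each step."""
    kept = []
    for (_, before), (label, after) in zip(stages, stages[1:]):
        kept.append((label, round(100 * after / before, 1)))
    return kept


for label, percent in retention(STAGES):
    print(f"{label}: {percent}% of previous stage")

overall = round(100 * STAGES[-1][1] / STAGES[0][1], 1)
print(f"Final sample: {overall}% of the Greater London total")
```

Roughly half of the Greater London Tweets fall within Inner London, and a little under 40% of the original total remains once all cleaning is done.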

Methods

• All of the Twitter text strings were converted into a corpus

• Converted to lower case

• Numbers were removed

• Punctuation was removed

• Stop words were removed

• The corpus was lexicalized in R

• The corpus was then run through a Latent Dirichlet Allocation (LDA) model

• LDA is an unsupervised approach to document modelling that discovers latent semantic topics in large collections of texts

• The number of topic groups (k) is predefined by the user

• 20 groups were made

• 100 subgroups were made by running additional LDA models on the Tweets from each of the 20 groups individually
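The text preparation steps listed above (lower case, numbers, punctuation, stop words) might look like this. The original work was done in R; this Python version is only an illustrative equivalent. The function name `preprocess` is hypothetical and the stop word list is a tiny stand-in, not the list used in the study.

```python
import string

# Illustrative stop words only; a real list would be far longer.
STOP_WORDS = {"the", "and", "for", "with", "this", "that", "are", "was"}

# Translation table that deletes ASCII digits and punctuation.
# Non-ASCII punctuation (such as curly quotes) is left untouched.
_DROP = str.maketrans("", "", string.digits + string.punctuation)


def preprocess(text, stop_words=STOP_WORDS):
    """Lower-case a Tweet, strip numbers and punctuation, drop stop words."""
    stripped = text.lower().translate(_DROP)
    return [word for word in stripped.split() if word not in stop_words]
```

Each Tweet then becomes a list of tokens, which is the form a topic model such as LDA takes as input.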

Latent Dirichlet Allocation

• Blei et al. (2003) Latent Dirichlet Allocation:

[Table: one row per Tweet (Tweet 1, Tweet 2, Tweet 3, …) with columns Text, Time, Date, x, y]

• Each Tweet (as an individual text document) is assigned to one topic group based on the generated probabilities from the LDA model
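To make the assignment step concrete, here is a minimal pure-Python LDA fitted by collapsed Gibbs sampling, followed by the "most probable topic" assignment. It is a sketch only: the slides do not say which inference method or R package was used, so Gibbs sampling is an assumption, as are the function names and the `alpha`, `beta` and `iterations` defaults.

```python
import random


def fit_lda(docs, k, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA.

    docs: list of token lists (one per Tweet).
    Returns one list of k topic probabilities per document.
    """
    rng = random.Random(seed)
    vocab = sorted({word for doc in docs for word in doc})
    word_id = {word: i for i, word in enumerate(vocab)}
    v = len(vocab)

    doc_topic = [[0] * k for _ in docs]       # tokens in doc d given topic t
    topic_word = [[0] * v for _ in range(k)]  # times word w is given topic t
    topic_total = [0] * k                     # tokens given topic t overall
    assignments = []

    # Start from a random topic for every token.
    for d, doc in enumerate(docs):
        topics = []
        for word in doc:
            t = rng.randrange(k)
            topics.append(t)
            doc_topic[d][t] += 1
            topic_word[t][word_id[word]] += 1
            topic_total[t] += 1
        assignments.append(topics)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, word in enumerate(doc):
                w = word_id[word]
                old = assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][old] -= 1
                topic_word[old][w] -= 1
                topic_total[old] -= 1
                # Resample its topic given all other assignments.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta) / (topic_total[t] + v * beta)
                    for t in range(k)
                ]
                new = rng.choices(range(k), weights=weights)[0]
                assignments[d][i] = new
                doc_topic[d][new] += 1
                topic_word[new][w] += 1
                topic_total[new] += 1

    return [
        [(count + alpha) / (len(doc) + k * alpha) for count in doc_topic[d]]
        for d, doc in enumerate(docs)
    ]


def assign_topics(doc_probs):
    """Assign each document to its single most probable topic."""
    return [max(range(len(p)), key=p.__getitem__) for p in doc_probs]
```

With `k=20` this would produce the top-level groups. Subgroups could then be obtained by calling `fit_lda` again on only the Tweets assigned to each group, which mirrors the two-level design described in the Methods.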

What do all the Tweets have in common?

20 Twitter Groups

1 Photography and Sights

2 Optimism, Kindness and Positivity

3 Leisure and Attractions

4 TV and Film

5 Humour and Informal Conversations

6 Transport and Travel

7 Politics, Beliefs and Current Affairs

8 Sport and Games

9 Anticipation and Socialising

10 Business, Information and Networking

11 Pessimism and Negativity

12 Music and Musicians

13 Routine Activities

14 Food and Drink

15 Body, Appearances and Clothes

16 Social Media and Apps

17 Slang and Profanities

18 Place and Check-Ins

19 Wishes and Gratitude

20 Foreign and Other

20 Twitter Groups

[Figure: one panel per group, labelled with the 20 group names listed above]

Word Clouds

[Figure: word clouds for 3. Leisure and Attractions and 14. Food and Drink]

Time Distribution

[Figure: distribution of Tweets by hour of day (0 to 23) for each of the 20 Twitter groups and for all Tweets]
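An hourly profile per group, of the kind plotted in the time distribution chart, could be derived as below. This assumes the chart shows each group's share of Tweets by hour, which the slide does not state explicitly; `hourly_shares` is a hypothetical helper.

```python
from collections import Counter, defaultdict
from datetime import datetime


def hourly_shares(tweets):
    """tweets: iterable of (datetime, group) pairs.

    Returns {group: [share of that group's Tweets in hour 0, ..., hour 23]}.
    """
    counts = defaultdict(Counter)
    for timestamp, group in tweets:
        counts[group][timestamp.hour] += 1

    shares = {}
    for group, by_hour in counts.items():
        total = sum(by_hour.values())
        shares[group] = [by_hour[hour] / total for hour in range(24)]
    return shares
```

Normalising within each group makes groups of very different sizes comparable on the same axis.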

Spatial Distributions

[Figure: maps of standardised residuals on a 200 x 200 m grid for Leisure and Attractions, Transport and Travel, and Music and Musicians]
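The slides do not say how the standardised residuals were calculated. One common choice, assumed here, is the Pearson residual (observed minus expected, divided by the square root of expected), where the expected count in a cell is the cell's total Tweets multiplied by the group's overall share. The function names are hypothetical, and coordinates are assumed to be projected in metres (for example British National Grid).

```python
import math
from collections import Counter


def grid_cell(x, y, size=200.0):
    """Index of the size x size metre cell containing point (x, y)."""
    return (math.floor(x / size), math.floor(y / size))


def standardised_residuals(points, group):
    """points: iterable of (x, y, group_label) tuples.

    Returns {cell: (observed - expected) / sqrt(expected)} for one group.
    """
    points = list(points)
    if not points:
        return {}

    totals = Counter()
    observed = Counter()
    for x, y, label in points:
        cell = grid_cell(x, y)
        totals[cell] += 1
        if label == group:
            observed[cell] += 1

    overall_share = sum(observed.values()) / len(points)
    residuals = {}
    for cell, n in totals.items():
        expected = n * overall_share
        if expected > 0:
            residuals[cell] = (observed[cell] - expected) / math.sqrt(expected)
    return residuals
```

Positive values mark cells where a group is over-represented relative to all Tweets in that cell, and negative values mark under-representation.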

Land Use

• Aim: to understand how Tweets may correspond with land use

• Tweets were intersected with the Generalised Land Use Database (GLUD) to assign each one a land use category

• The GLUD categorises all of England into polygons of 9 categories

• It was created by recoding OS MasterMap (2005)

• However, it is now 10 years out of date and there are some notable errors

[Figure legend: Domestic Buildings and Gardens, Non-Domestic Buildings, Public Green Space]
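Intersecting Tweets with land use polygons is normally done in GIS software, but the underlying test is a point-in-polygon check. The sketch below uses the standard ray-casting method on simple polygons without holes. The function names and the `(category, polygon)` input layout are assumptions for illustration, not the GLUD's actual format.

```python
def point_in_polygon(x, y, polygon):
    """Ray casting: count how many edges a ray going right from (x, y) crosses."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can be crossed.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def land_use_of(x, y, polygons):
    """polygons: list of (category, [(x, y), ...]). Returns the first match."""
    for category, polygon in polygons:
        if point_in_polygon(x, y, polygon):
            return category
    return None
```

With millions of Tweets and many polygons, a spatial index would be needed in practice; this linear scan only shows the logic.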

Land Use and Tweets

[Figure: proportional difference of Twitter groups (%), on an axis from -6 to 6, for each of the 20 groups (1 to 20) by land use category: Domestic Buildings and Gardens, Non-Domestic Buildings, Public Green Space, Rail]

Key Places

• It is also possible to collect Tweets from particular locations to observe how they compare

• We have selected all of the Tweets from the immediate vicinity of 6 unique locations in London

• Both the frequency and the content of Tweets were found to be influenced by the local activity

Twitter Groups and Key Places

[Figure: ratio (x 0.5, x 1.0, x 2.0) of each of the 20 Twitter groups (1 to 20) at Residential, The Emirates Stadium, The O2 Arena, Waterloo Station, Westfield Stratford, Soho and Canary Wharf]
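The ratio scale in the key places chart (x 0.5, x 1.0, x 2.0) suggests a comparison of each group's share at a place with its share across the whole sample. That reading is an assumption, and `topic_ratios` is a hypothetical helper written to illustrate it.

```python
from collections import Counter


def topic_ratios(place_groups, all_groups):
    """Ratio of each group's share at one place to its share overall.

    place_groups: group labels of the Tweets sent near the place.
    all_groups: group labels of every Tweet in the sample.
    A ratio of 2.0 means the group is twice as common at the place.
    """
    place_counts = Counter(place_groups)
    all_counts = Counter(all_groups)
    ratios = {}
    for group, n_all in all_counts.items():
        overall_share = n_all / len(all_groups)
        place_share = place_counts[group] / len(place_groups)
        ratios[group] = place_share / overall_share
    return ratios
```

Under this reading, a stadium would be expected to show a ratio well above 1.0 for Sport and Games, and a station a high ratio for Transport and Travel.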

100 Subgroups

Labels were inferred from the most overrepresented words

1 Photography and Sights: a Landmarks; b Outdoors; c Urban; d Instagram; e Architecture

2 Optimism, Kindness and Positivity: a Anticipation; b Mood; c Achievements; d Conversations; e Reflections

3 Leisure and Attractions: a Fashion and Shopping; b Museums and Galleries; c Nightlife; d Shows and Entertainment; e Events and Socialising

4 TV and Film: a Television; b Celebrities; c Reality; d Cinema and Film; e Reactions

5 Humour and Informal Conversations: a Opinions; b Laughter; c Chat; d Affection; e Mates

6 Transport and Travel: a Journeys; b Trains and Delays; c Public Transport; d Roads and Cycling; e Travel Incidents

7 Politics, Beliefs and Current Affairs: a Politics; b Religion; c Newspapers; d Political Awareness; e Current Affairs

8 Sport and Games: a Other Sports; b Footballers; c London Teams; d International Football; e Football Managers

9 Anticipation and Socialising: a Wishes; b The Day before; c Events; d Weekend; e Holidays

10 Business, Information and Networking: a Training; b Conference; c Brands; d Jobs and Careers; e Data and Technology

11 Pessimism and Negativity: a Problems; b Hate and Anger; c Sadness and Awkwardness; d Life and Changes; e Worry and Confusion

12 Music and Musicians: a Pop Stars and Music Videos; b Radio and Downloads; c Concerts; d Albums

13 Routine Activities: a Exercise; b Work; c Feelings; d Education; e Sleep

14 Food and Drink: a Food; b Drink; c Meals; d Coffee and Cake; e Hunger

15 Body, Appearances and Clothes: a Cosmetics; b Body and Health; c Clothes; d Cute; e Weather

16 Social Media and Apps: a Social Media Activity; b Services; c Technology and Brands; d Communications; e Trending

17 Slang and Profanities: a Street Slang; b Abuse; c People; d Jokes; e Misuse

18 Place and Check-Ins: a Events; b Routine Places; c Attractions; d Markets; e Stations

19 Wishes and Gratitude: a Friends; b Via Social Media; c People; d Celebrations; e Thanks and Affection

20 Foreign and Other: a Portuguese; b French; c Spanish; d Turkish; e Italian; f Other

Subgroups

• Topic 3 – Leisure and Attractions

[Figure legend: Museums and Galleries, Fashion and Shopping, Events and Socialising, Shows and Entertainments, Nightlife]

Topic 3 Subgroups across Central London

[Figure: four map panels: Fashion and Shopping, Museums and Galleries, Nightlife, Shows and Entertainments]

Topic 13 Subgroup D – Education

[Figure: map shaded from underrepresented to overrepresented, with labelled institutions: UCL, University of Westminster, Imperial College London, London South Bank University, Kings College London, Queen Mary, London Metropolitan University, University of Greenwich, City University, Goldsmiths, Birkbeck, SOAS, LSE, Various]

Topic 6 Subgroup B – Trains and Delays

[Figure: map shaded from underrepresented to overrepresented, with labelled stations: Clapham Junction, Victoria, Waterloo, London Bridge, Liverpool Street, Fenchurch Street, St Pancras, Kings Cross, Euston, Paddington, Marylebone, Lewisham]

Conclusions

• There is a distinctive geography of Tweets in London, which can be represented by a discrete Tweet content classification produced from a generative probabilistic model

• The composition of Tweets varies by time and space within Central London

• Land use and its associated activity correspond with the content of geo-located Tweets transmitted locally

• It may be possible to link the topics to socio-economic status by focusing on Tweets recorded from residential locations

References

• Blei, D., Ng, A. and Jordan, M. (2003) Latent Dirichlet allocation. Journal of Machine Learning Research, 3: 993–1022

• Cheng, T. and Wicks, T. (2014) Event Detection using Twitter: A Spatio-Temporal Approach. PLoS ONE, 9(6)

• Lai, J., Cheng, T. and Lansley, G. (2015) Spatio-Temporal Patterns of Passengers’ Interests at London Tube Stations. In the Proceedings of the 23rd Conference on GIS Research UK. 15th – 17th April, 2015. University of Leeds, UK

• Lansley, G. (2014) Evaluating the utility of geo-referenced Twitter data as a source of reliable footfall insight. In the Proceedings of the Association of American Geographers AGM 2014. 8th – 12th April, 2014. Tampa, USA

• Longley, P., Adnan, M. and Lansley, G. (2015) The geo-temporal demographics of Twitter usage. Environment and Planning A, 47(2): 465–484