Exploiting Human-Generated Text for Trend Mining


    Vasileios Lampos

    July, 2013


    Outline

Motivation, Aims [Facts, Questions]

Data

Nowcasting Events

Extracting Mood Patterns

Inferring Voting Intention

Conclusions


    Facts

We started working on these ideas back in 2008, when...

the Web contained 1 trillion unique pages (Google)

Social Networks were rising, e.g.
- Facebook: 100m users (2008) to >1.11 billion active users (March 2013)
- Twitter: 6m users (2008) to 554m active users (July 2013)

New technologies emerged to handle Big Data (e.g., MapReduce)

User behaviour was changing: socialising via the Web, giving up privacy (Debatin et al., 2009)


    General questions/aims

Does human-generated text posted on web platforms (or elsewhere) contain useful information?

How can we extract this information... automatically? That is, not we, but a machine.

Practical / real-life applications?

Can those large samples of human input assist studies in other scientific fields? Social Sciences, Psychology, Epidemiology


    The Data (1/3) Why Twitter?

    Twitter...

- has a lot of content that is publicly accessible
- provides a well-documented API for several forms of data collection
- contains opinions and personal statements on various domains
- is connected with current affairs (usually in real time)
- includes geo-located content
- offers the option for personalised, per-user modelling


The Data (2/3) What does a @tweet look like?

Figure 1: Some biased and anonymised examples of tweets (limit of 140 characters/tweet, # denotes a topic): (a) user will remain anonymous, (b) they live around us, (c) citizen journalism, (d) flu attitude


    The Data (3/3)

    Data Collection & Preprocessing

The easiest part of the process... not true! Storage space, crawler implementation, parallel data processing, adapting to new technologies

Data collected via Twitter's Search API:
- collective sampling
- tweets geo-located in 54 urban centres in the UK
- periodical crawling (every 3 or 5 minutes per urban centre)

Data collected via Twitter's REST API:
- user-centric sampling
- preprocessing to approximate users' locations (city & country), or manual user selection by domain experts
- get their latest tweets (3,000 or more)

    Several forms of ground truth (flu/rainfall rates, polls)


Nowcasting Events from the Social Web


    Nowcasting?

We do not predict the future, but infer the present, i.e., the very recent past

Figure 2: Nowcasting the magnitude of an event emerging in the real world from Web information

    Our case studies: nowcasting (a) flu rates & (b) rainfall rates (?!)


    What do we get in the end?

This is a regression problem, i.e., for a time interval $i$ we aim to infer $y_i \in \mathbb{R}$ using text input $\boldsymbol{x}_i \in \mathbb{R}^n$

[Plot: actual vs. inferred rainfall rate (mm) over ~30 days]

Figure 3: Inferred rainfall rates for Bristol, UK (October 2009)


    Methodology (1/5) Text in Vector Space

Candidate features (n-grams): $C = \{c_i\}$

Set of Twitter posts for a time interval $u$: $P(u) = \{p_j\}$

Presence of $c_i$ in $p_j$:
$$g(c_i, p_j) = \begin{cases} 1 & \text{if } c_i \in p_j \\ 0 & \text{otherwise} \end{cases}$$
$g$ is Boolean, so the maximum value for a score is 1

Score of $c_i$ in $P(u)$:
$$s\big(c_i, P(u)\big) = \frac{\sum_{j=1}^{|P(u)|} g(c_i, p_j)}{|P(u)|}$$
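A minimal illustration of this Boolean scoring (not the deck's actual pipeline; the tweets and the candidate term below are made-up inputs):

```python
from typing import Iterable

def ngram_score(candidate: str, tweets: Iterable[str]) -> float:
    """s(c, P(u)): fraction of the interval's tweets containing the n-gram.

    g is Boolean (presence/absence), so the score is bounded by 1.
    """
    tweets = list(tweets)
    if not tweets:
        return 0.0
    hits = sum(1 for tweet in tweets if candidate in tweet.lower())
    return hits / len(tweets)

# Hypothetical tweets collected during one time interval u
tweets_u = ["I have the flu and a fever", "lovely sunny day", "flu jab tomorrow"]
print(ngram_score("flu", tweets_u))  # 2/3: "flu" appears in two of three tweets
```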


    Methodology (2/5)

Set of time intervals: $U = \{u_k\}$ (1 hour, 1 day, ...)

Time series of candidate feature scores:
$$X(U) = \big[\boldsymbol{x}(u_1)\ \dots\ \boldsymbol{x}(u_{|U|})\big]^T, \quad \text{where } \boldsymbol{x}(u_i) = \big[s(c_1, P(u_i))\ \dots\ s(c_{|C|}, P(u_i))\big]^T$$

Target variable (event):
$$\boldsymbol{y}(U) = \big[y_1\ \dots\ y_{|U|}\big]^T$$

    Methodology (3/5) Feature selection

Solve the following optimisation problem:
$$\min_{\boldsymbol{w}} \big\|X(U)\boldsymbol{w} - \boldsymbol{y}(U)\big\|_2^2 \quad \text{s.t.}\ \|\boldsymbol{w}\|_1 \le t,\ \ t = \alpha \|\boldsymbol{w}_{\text{OLS}}\|_1,\ \alpha \in (0, 1]$$

Least Absolute Shrinkage and Selection Operator (LASSO) (Tibshirani, 1996):
$$\arg\min_{\boldsymbol{w}} \big\|X(U)\boldsymbol{w} - \boldsymbol{y}(U)\big\|_2^2 + \lambda \|\boldsymbol{w}\|_1$$

Expect a sparse $\boldsymbol{w}$ (feature selection)

Least Angle Regression (LARS) computes the entire regularisation path ($\boldsymbol{w}$ for different values of $\lambda$) (Efron et al., 2004)
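As an illustration, LARS is available off the shelf; a sketch on synthetic data (using scikit-learn, not the deck's implementation):

```python
import numpy as np
from sklearn.linear_model import lars_path

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 50))    # 100 time intervals, 50 candidate n-grams
w_true = np.zeros(50)
w_true[:3] = [2.0, -1.5, 1.0]     # only 3 features truly matter
y = X @ w_true + rng.normal(scale=0.1, size=100)

# One call returns the whole LASSO regularisation path
alphas, active, coefs = lars_path(X, y, method="lasso")
print("first features to enter the model:", active[:3])
```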


    Methodology (4/5)

LASSO is model-inconsistent: the inferred sparsity pattern may deviate from the true model, e.g., when predictors are highly correlated (Zhao and Yu, 2006)

Bootstrap LASSO (Bolasso) performs a more robust feature selection (Bach, 2008):
- in each bootstrap, the input space is sampled with replacement
- apply LASSO (LARS) to select features
- select features with nonzero weights in all bootstraps

A better alternative, soft-Bolasso, is a less strict feature selection:
- select features with nonzero weights in p% of the bootstraps (learn p using a separate validation set)

Weights of the selected features are determined via OLS regression


    Methodology (5/5) Simplified summary

Observations: $X \in \mathbb{R}^{m \times n}$ ($m$ time intervals, $n$ features)
Response variable: $\boldsymbol{y} \in \mathbb{R}^m$

For $i = 1$ to number of bootstraps:
    Form $X_i$ by sampling the rows of $X$ with replacement
    Solve LASSO (via LARS) for $X_i$ and $\boldsymbol{y}$, i.e., learn $\boldsymbol{w}_i \in \mathbb{R}^n$
    Keep the $k \le n$ features with nonzero weights
End For
Select the $v \le n$ features with nonzero weights in at least $p\%$ of the bootstraps
Learn their weights with OLS regression on $X^{(v)} \in \mathbb{R}^{m \times v}$ and $\boldsymbol{y}$
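A compact sketch of this procedure (assuming scikit-learn; the regularisation strength and the threshold p would be tuned on a validation set, as noted above):

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

def soft_bolasso(X, y, n_bootstraps=100, alpha=0.1, p=0.8, seed=0):
    """Keep features with nonzero LASSO weights in >= p of the bootstraps,
    then re-learn their weights with OLS, as in the summary above."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    nonzero_counts = np.zeros(n)
    for _ in range(n_bootstraps):
        idx = rng.integers(0, m, size=m)                 # sample rows with replacement
        lasso = Lasso(alpha=alpha).fit(X[idx], y[idx])
        nonzero_counts += lasso.coef_ != 0
    selected = np.flatnonzero(nonzero_counts >= p * n_bootstraps)
    ols = LinearRegression().fit(X[:, selected], y)      # final weights via OLS
    return selected, ols
```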


    How do we form candidate features?

Commonly formed by indexing the entire corpus (Manning, Raghavan and Schütze, 2008)

We extract them from Wikipedia, Google Search results, Public Authority websites (e.g., NHS)

Why? Reduce dimensionality to bound the error of LASSO:
$$\mathcal{L}(\hat{\boldsymbol{w}}) \le \mathcal{L}(\boldsymbol{w}) + Q, \quad \text{with } Q \propto \min\left\{\sqrt{\frac{W_1^2 \log N}{N}} + \frac{p}{N},\ \ \sqrt{\frac{W_1^2 \log N}{N}} + W_1 \sqrt{\frac{\log p}{N}}\right\}$$
$p$ candidate features, $N$ samples, empirical loss $\mathcal{L}(\cdot)$, and $\|\boldsymbol{w}\|_1 \le W_1$ (Bartlett, Mendelson and Neeman, 2011)

    Harry Potter Effect!


    The Harry Potter effect (1/2)

Figure 4: Events co-occurring (correlated) with the inference target may affect feature selection, especially when the sample size is small.

[Plot: event score vs. day number in 2009 for Flu (England & Wales), Hypothetical Event I and Hypothetical Event II]

    (Lampos, 2012a)


The Harry Potter effect (2/2)

Table 1: Top 1-grams correlated with flu rates in England & Wales (06-12/2009)

1-gram      Event                Corr. coef.
latitud     Latitude Festival    0.9367
flu         Flu epidemic         0.9344
swine                            0.9212
harri       Harry Potter movie   0.9112
slytherin                        0.9094
potter                           0.8972
benicassim  Benicàssim Festival  0.8966
graduat     Graduation (?)       0.8965
dumbledor   Harry Potter movie   0.8870
hogwart                          0.8852
quarantin   Flu epidemic         0.8822
gryffindor  Harry Potter movie   0.8813
ravenclaw                        0.8738
princ                            0.8635
swineflu    Flu epidemic         0.8633
ginni       Harry Potter movie   0.8620
weaslei                          0.8581
hermion                          0.8540
draco                            0.8533

    Solution: ground truth with some degree of variability

    (Lampos, 2012a)


    About n-grams

1-grams
- decent (dense) representation in the Twitter corpus
- unclear semantic interpretation
  Example: "I am not sick. But I don't feel great either!"

2-grams
- very sparse representation in tweets
- sometimes clearer semantic interpretation

The experimental process indicated that a hybrid combination of 1-grams and 2-grams delivers the best inference performance; refer to (Lampos, 2012a)


    Flu rates Example of selected features

Figure 5: Font size is proportional to the weight of each feature; flipped n-grams are negatively weighted. All words are stemmed (Porter, 1980).

    (Lampos and Cristianini, 2012)


    Rainfall rates Example of selected features

Figure 6: Font size is proportional to the weight of each feature; flipped n-grams are negatively weighted. All words are stemmed (Porter, 1980).

    (Lampos and Cristianini, 2012)


    Examples of inferences

[Plots of actual vs. inferred rates over ~30 days: (a) Central England & Wales (flu rate), (b) South England (flu rate), (c) Bristol (rainfall rate, mm)]

Figure 7: Examples of flu and rainfall rate inferences from Twitter content

    (Lampos and Cristianini, 2012)


    Performance figures

Table 2: RMSE for flu rate inference (5-fold cross-validation), 50m tweets, 21/06/2009-19/04/2010

Method          1-grams       2-grams       Hybrid
Baseline*       12.44 ± 2.37  13.81 ± 3.29  11.62 ± 1.58
Bolasso         11.14 ± 2.35  12.64 ± 2.57  10.57 ± 2.20
CART ensemble†  9.63 ± 5.21   13.13 ± 4.72  9.40 ± 4.21

Table 3: RMSE (in mm) for rainfall rate inference (6-fold cross-validation), 8.5m tweets, 01/07/2009-30/06/2010

Method          1-grams      2-grams      Hybrid
Baseline*       2.91 ± 0.60  3.10 ± 0.57  4.39 ± 2.99
Bolasso         2.73 ± 0.65  2.95 ± 0.55  2.60 ± 0.68
CART ensemble†  2.71 ± 0.69  2.72 ± 0.72  2.64 ± 0.63

* As implemented in (Ginsberg et al., 2009)
† Classification and Regression Trees (Breiman et al., 1984; Sutton, 2005)


    Flu Detector

    URL: http://geopatterns.enm.bris.ac.uk/epidemics

Figure 8: Flu Detector uses the content of Twitter to nowcast flu rates in several UK regions

    (Lampos, De Bie and Cristianini, 2010)


Extracting Mood Patterns from Human-Generated Content

Computing a mood score

Table 4: Mood terms from WordNet Affect

Fear        Sadness       Joy           Anger
afraid      depressed     admire        angry
fearful     discouraged   cheerful      despise
frighten    disheartened  enjoy         enviously
horrible    dysphoria     enthusiastic  harassed
panic       gloomy        exciting      irritate
...         ...           ...           ...
(92 terms)  (115 terms)   (224 terms)   (146 terms)

Mood score computation for a time interval (day) $d$ using $n$ mood terms:
$$ms_d = \frac{1}{n} \sum_{i=1}^{n} \frac{c_i^{(t_d)}}{N(t_d)}$$
$c_i^{(t_d)}$: count of term $i$ in the Twitter corpus of day $d$; $N(t_d)$: number of tweets for day $d$

Using the sample of days, compute a standardised mood score:
$$ms_d^{\text{std}} = \frac{ms_d - \mu_{ms}}{\sigma_{ms}}$$
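A small sketch of this computation, assuming the daily term counts and tweet totals are already available as arrays:

```python
import numpy as np

def standardised_mood_scores(term_counts, tweets_per_day):
    """term_counts: (days, n_terms) count of each mood term per day.
    tweets_per_day: (days,) total number of tweets per day.
    Returns the standardised mood score for every day in the sample."""
    ms = (term_counts / tweets_per_day[:, None]).mean(axis=1)  # ms_d per day
    return (ms - ms.mean()) / ms.std()                         # standardise over the sample
```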

The mood of the nation (1/5)

Figure 9: Daily time series (actual & their 14-point moving average) for the mood of Joy based on Twitter content geo-located in the UK

[Plot: 933-day time series for Joy in Twitter content (raw signal & 14-day smoothed), Jul 09 to Jan 12; annotated events: riots, budget cuts, Christmas, royal wedding, Halloween, Valentine's Day, Easter]

    (Lansdall-Welfare, Lampos and Cristianini, 2012a&b)

The mood of the nation (2/5)

Figure 10: Daily time series (actual & their 14-point moving average) for the mood of Anger based on Twitter content geo-located in the UK

[Plot: 933-day time series for Anger in Twitter content (raw signal & 14-day smoothed), Jul 09 to Jan 12, with the same event annotations]

    (Lansdall-Welfare, Lampos and Cristianini, 2012a&b)

The mood of the nation (3/5)

    Window of 100 days: 50 before & after the point of interest

$$\Delta ms_i^{\text{std}} = \overline{ms^{\text{std}}_{\,i+1:i+50}} - \overline{ms^{\text{std}}_{\,i-50:i-1}}$$
i.e., the mean standardised mood score over the 50 days after day $i$ minus the mean over the 50 days before it

[Plot: difference in mean for Anger and Fear, Jul 09 to Jan 12, with the dates of the budget cuts and the riots marked]

    Figure 11: Change point detection using a 100-day moving window

    (Lansdall-Welfare, Lampos and Cristianini, 2012a)
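A sketch of this moving-window statistic (assuming a daily standardised mood score series as input):

```python
import numpy as np

def mean_difference(ms_std, w=50):
    """For each day i, mean of the w days after minus mean of the w days
    before; spikes indicate candidate change points (as in Figure 11)."""
    out = np.full(len(ms_std), np.nan)
    for i in range(w, len(ms_std) - w):
        out[i] = ms_std[i + 1:i + w + 1].mean() - ms_std[i - w:i].mean()
    return out
```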

The mood of the nation (4/5)

Figure 12: Projections of 4-dimensional mood score signals (joy, sadness, anger and fear) on their top-2 principal components (PCA); Twitter content from 2011

[Scatter plots of the PCA projections: (a) days of the week (2011), (b) days of the year (2011), with each day numbered 1-365]

Cluster I: New Year (1), Valentine's Day (45), Christmas Eve (358), New Year's Eve (365)

Cluster II: O. bin Laden's death (122), Winehouse's death + Breivik (204), UK riots (221)

    (Lampos, 2012a)
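The projection itself is a standard PCA; a sketch with random numbers standing in for the daily 4-dimensional mood scores:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
mood = rng.normal(size=(365, 4))    # placeholder daily scores: joy, sadness, anger, fear
proj = PCA(n_components=2).fit_transform(mood)
print(proj.shape)                   # (365, 2): one 2-D point per day, as in Figure 12
```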

The mood of the nation (5/5)

    URL: http://geopatterns.enm.bris.ac.uk/mood

Figure 13: Mood of the Nation uses the content of Twitter to nowcast mood rates in several UK regions

    (Lampos, 2012a)

Circadian mood patterns (1/3)

    Compute 24-h mood score patterns

Mood score computation for a time interval $u$ within the 24-hour cycle, using $n$ mood terms (WordNet) and a sample of $D$ days:
$$Ms(u) = \frac{1}{|D|} \sum_{j=1}^{|D|} \frac{1}{n} \sum_{i=1}^{n} sf_i^{(t_j, u)}$$
$$sf_i^{(t_d, u)} = \frac{f_i^{(t_d, u)} - \mu_{f_i}}{\sigma_{f_i}}, \quad i \in \{1, \dots, n\}$$
$f_i^{(t_d, u)}$: normalised frequency of a mood term $i$ during time interval $u$ in day $d \in D$
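A sketch of one plausible implementation, assuming hourly normalised term frequencies in a (days, hours, terms) array and standardisation statistics taken over the whole sample:

```python
import numpy as np

def circadian_pattern(freqs):
    """freqs: (days, 24, n_terms) normalised hourly frequencies of mood terms.
    Standardise each term, average over terms, then over days: a 24-point Ms(u)."""
    mu = freqs.mean(axis=(0, 1))         # per-term mean over the sample
    sigma = freqs.std(axis=(0, 1))       # per-term standard deviation
    sf = (freqs - mu) / sigma
    return sf.mean(axis=2).mean(axis=0)  # shape (24,): one mood score per hour
```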

Circadian mood patterns (2/3)

    Figure 14: Circadian (24-hour) mood patterns based on UK Twitter content

Circadian mood patterns (3/3)

Figure 15: Autocorrelation of circadian mood patterns based on hourly lags, revealing daily and weekly periodicities

[Plots: autocorrelation vs. hourly lags (1-168) with confidence bounds for (a) Fear, (b) Sadness, (c) Joy, (d) Anger]

    Further analysis available in (Lampos, Lansdall-Welfare, Araya and Cristianini, 2013)
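The periodicities in Figure 15 come from plain sample autocorrelations; a minimal sketch for an hourly series:

```python
import numpy as np

def autocorrelation(x, max_lag=168):
    """Sample autocorrelation of an hourly mood series for lags 1..max_lag (one week)."""
    x = (x - x.mean()) / x.std()
    n = len(x)
    return np.array([np.dot(x[:-k], x[k:]) / n for k in range(1, max_lag + 1)])
```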

Emotion in Books

Input: Google Ngram corpus of 5m digitised books (Michel et al., 2010)
Tool: WordNet Affect (Strapparava and Valitutti, 2004)

[Plots over 1900-2000: (a) Joy minus Sadness (z-scores), (b) use of emotion-related terms through time (All, Fear, Disgust; z-scores), (c) American versus British English (z-scores)]

    Figure 16: Emotion trends in 20th century books

    (Acerbi, Lampos, Garnett and Bentley, 2013)


Inferring Voting Intention from Social Media Content

... and a new way for modelling text regression

Motivations and Aims

Social Media contain a vast amount of information about various topics (health, politics, finance)

This information ($X$) can be used to assist predictions ($y$): $f \colon X \to y$, where $f$ usually formulates a linear regression task

$X$ accounts only for word frequencies; can we incorporate user information as well?

Could we also exploit the statistical information held in multiple response variables?

Data Sets

UK case study
- 60m tweets by 42K users from 30/04/2010 to 13/02/2012
- random selection of geo-located users, distributed proportionally to regional population figures
- main language: English
- 240 unique voting intention polls from YouGov: percentages for Conservatives (CON), Labour Party (LAB) and Liberal Democrats (LIB)

Austrian case study
- 800K tweets by 1.1K users from 25/01 to 01/12/2012
- users manually selected by Austrian political analysts
- main language: German
- 98 unique voting intention polls from various pollsters: percentages for the Social Democratic Party (SPÖ), People's Party (ÖVP), Freedom Party (FPÖ) and Green Alternative Party (GRÜ)

The Bilinear Model (1/2)

The main idea is simple:
$$f(X) = \boldsymbol{u}^T X \boldsymbol{w} + \beta$$
$X \in \mathbb{R}^{m \times p}$: matrix of user-word frequencies ($m$ users, $p$ words); $\boldsymbol{u}$, $\boldsymbol{w}$: user and word weights

Our original bilinear text regression model:
$$\{\boldsymbol{w}^*, \boldsymbol{u}^*, \beta^*\} = \arg\min_{\boldsymbol{w}, \boldsymbol{u}, \beta} \sum_{i=1}^{n} \big(\boldsymbol{u}^T Q_i \boldsymbol{w} + \beta - y_i\big)^2 + \psi(\boldsymbol{w}, \lambda_1) + \psi(\boldsymbol{u}, \lambda_2)$$
$Q_i$: $X$ for time instance $i$; $\boldsymbol{y} \in \mathbb{R}^n$: response variable (voting intention); $\boldsymbol{w} \in \mathbb{R}^p$, $\boldsymbol{u} \in \mathbb{R}^m$: word and user weights; $\beta \in \mathbb{R}$: bias; $\psi(\cdot)$: a regularisation function

Elastic Net (Zou and Hastie, 2005) for $\psi(\cdot)$: Bilinear Elastic Net (BEN) (Lampos, Preoțiuc-Pietro and Cohn, 2013)
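The objective is biconvex: with $\boldsymbol{u}$ fixed it is a standard elastic-net problem in $\boldsymbol{w}$ (features $\boldsymbol{u}^T Q_i$), and vice versa. A simplified alternating sketch on synthetic data (scikit-learn's ElasticNet; not the authors' code, and the hyperparameters are arbitrary):

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(0)
n, m, p = 40, 30, 20                                 # time instances, users, words
Q = rng.poisson(0.5, size=(n, m, p)).astype(float)   # user-word frequencies per instance
u_true = rng.dirichlet(np.ones(m))                   # planted user weights
w_true = rng.normal(size=p)                          # planted word weights
y = np.einsum("m,nmp,p->n", u_true, Q, w_true) + rng.normal(scale=0.1, size=n)

u = np.ones(m) / m                                   # start from uniform user weights
for _ in range(10):                                  # alternate the two convex subproblems
    Xw = np.einsum("m,nmp->np", u, Q)                # features for w, with u fixed
    w = ElasticNet(alpha=0.01).fit(Xw, y).coef_
    Xu = np.einsum("nmp,p->nm", Q, w)                # features for u, with w fixed
    u = ElasticNet(alpha=0.01).fit(Xu, y).coef_
```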

The Bilinear Model Multi-Task Learning (2/2)

Apply an $\ell_1/\ell_2$ regulariser (Argyriou, Evgeniou and Pontil, 2008); extends the notion of Group LASSO (Yuan and Lin, 2006) to a $\tau$-dimensional $\boldsymbol{y}$

Bilinear Group $\ell_1/\ell_2$ (BGL):
$$\{W^*, U^*, \boldsymbol{\beta}^*\} = \arg\min_{W, U, \boldsymbol{\beta}} \sum_{t=1}^{\tau} \sum_{i=1}^{n} \big(\boldsymbol{u}_t^T Q_i \boldsymbol{w}_t + \beta_t - y_{ti}\big)^2 + \lambda_1 \sum_{j=1}^{p} \|W_j\|_2 + \lambda_2 \sum_{k=1}^{m} \|U_k\|_2$$

$W = [\boldsymbol{w}_1 \dots \boldsymbol{w}_\tau]$: word weight matrix, where $\boldsymbol{w}_t$ refers to the $t$-th political entity
$U = [\boldsymbol{u}_1 \dots \boldsymbol{u}_\tau]$: user weight matrix
$W_j$, $U_k$: $j$-th and $k$-th rows of $W$ and $U$ respectively
$\boldsymbol{\beta} \in \mathbb{R}^\tau$: bias term per task

(Lampos, Preoțiuc-Pietro and Cohn, 2013)
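For intuition, the row-wise $\ell_1/\ell_2$ penalty drives whole rows of $W$ (and $U$) to zero, so a word or user is either selected jointly for all parties or dropped entirely; a tiny sketch with made-up numbers:

```python
import numpy as np

def l1_l2_penalty(W):
    """Sum of row-wise L2 norms: encourages sparsity across rows (features)
    while sharing the surviving features across all tasks (columns)."""
    return np.linalg.norm(W, axis=1).sum()

W = np.array([[0.0, 0.0, 0.0],     # word dropped for every party: contributes 0
              [0.4, -0.2, 0.1]])   # word kept, with per-party weights
print(l1_l2_penalty(W))            # ~0.458
```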

Evaluation Performance Tables (1/2)

Table 5: UK case study. Average RMSEs representing the error of the inferred voting intention percentage for a 10-step validation process

Method  CON     LAB     LIB     Avg.
B       2.272   1.663   1.136   1.690
B_last  2.000   2.074   1.095   1.723
LEN     3.845   2.912   2.445   3.067
BEN     1.939   1.644   1.136   1.573
BGL     1.785*  1.595*  1.054*  1.478*

Table 6: Austrian case study

Method  SPÖ     ÖVP     FPÖ     GRÜ     Avg.
B       1.535   1.373   3.300   1.197   1.851
B_last  1.148*  1.556   1.639*  1.536   1.470
LEN     1.291   1.286   2.039   1.152*  1.442
BEN     1.392   1.310   2.890   1.205   1.699
BGL     1.619   1.005*  1.757   1.374   1.439*

* best result per column (shown in bold in the original slides)

Evaluation (2/3)

[Plots: UK voting intention % over time for CON, LAB, LIB: (a) Polls, (b) BEN inferences, (c) BGL inferences]

Figure 17: UK case study, 50 consecutive poll predictions

Evaluation (3/3)

[Plots: Austrian voting intention % over time for SPÖ, ÖVP, FPÖ, GRÜ: (a) Polls, (b) BEN inferences, (c) BGL inferences]

Figure 18: Austrian case study, 50 consecutive poll predictions

Conclusions

Social Media hold valuable information

We can develop methods to extract portions of this information automatically:
- detect, quantify and nowcast events (examples of flu and rainfall rates)
- extract collective mood patterns (we can do this for books too!)
- model other domains (such as politics)

Different types of information (word frequencies, user accounts) can be fused for improved inference performance

Side effect: user privacy

Significant collaborators...

Prof. Nello Cristianini, University of Bristol (Artificial Intelligence)
Prof. Alexander Bentley, University of Bristol (Anthropology)
Dr. Trevor Cohn, University of Sheffield (Natural Language Processing)
Dr. Alberto Acerbi, University of Bristol (Anthropology)
Daniel Preoțiuc-Pietro, University of Sheffield (Computer Science)

Last Slide!

The end. Any questions?

Download the slides from http://www.lampos.net/research/presentations-and-posters

References

Acerbi, Lampos, Garnett and Bentley. The Expression of Emotions in 20th Century Books. PLoS ONE, 2013.

Argyriou, Evgeniou and Pontil. Convex multi-task feature learning. Machine Learning, 2008.

Bach. Bolasso: Model Consistent Lasso Estimation through the Bootstrap. ICML, 2008.

Bartlett, Mendelson and Neeman. L1-regularized linear regression: persistence and oracle inequalities. PTRF, 2011.

Debatin, Lovejoy, Horn and Hughes. Facebook and Online Privacy: Attitudes, Behaviors, and Unintended Consequences. JCMC, 2009.

Efron et al. Least Angle Regression. The Annals of Statistics, 2004.

Ginsberg et al. Detecting influenza epidemics using search engine query data. Nature, 2009.

Lampos and Cristianini. Tracking the flu pandemic by monitoring the Social Web. CIP, 2010.

Lampos, De Bie and Cristianini. Flu Detector: Tracking Epidemics on Twitter. ECML PKDD, 2010.

Lampos and Cristianini. Nowcasting Events from the Social Web with Statistical Learning. ACM TIST, 2012.

Lampos. Detecting Events and Patterns in Large-Scale User Generated Textual Streams with Statistical Learning Methods. Ph.D. Thesis, University of Bristol, 2012. (a)

Lampos. On voting intentions inference from Twitter content: a case study on UK 2010 General Election. CoRR, 2012. (b)

Lampos, Preoțiuc-Pietro and Cohn. A user-centric model of voting intention from Social Media. ACL, 2013.

Lampos, Lansdall-Welfare, Araya and Cristianini. Analysing Mood Patterns in the United Kingdom through Twitter Content. CoRR, 2013.

Lansdall-Welfare, Lampos and Cristianini. Effects of the Recession on Public Mood in the UK. WWW, 2012. (a)

Lansdall-Welfare, Lampos and Cristianini. Nowcasting the mood of the nation. Significance, 2012. (b)

Manning, Raghavan and Schütze. Introduction to Information Retrieval, 2008.

Michel et al. Quantitative Analysis of Culture Using Millions of Digitized Books. Science, 2010.

Porter. An algorithm for suffix stripping. Program, 1980.

Strapparava and Valitutti. WordNet-Affect: an affective extension of WordNet. LREC, 2004.

Tibshirani. Regression Shrinkage and Selection via the LASSO. JRSS, 1996.

Yuan and Lin. Model selection and estimation in regression with grouped variables. JRSS, 2006.

Zhao and Yu. On model selection consistency of LASSO. JMLR, 2006.

Zou and Hastie. Regularization and variable selection via the elastic net. JRSS, 2005.