Analysing Word Meaning over Time by Exploiting Temporal Random Indexing

Pierpaolo Basile, Annalina Caputo and Giovanni Semeraro

Department of Computer Science – University of Bari Aldo Moro

[email protected]

Prima conferenza italiana di Linguistica Computazionale (First Italian Conference on Computational Linguistics), Pisa, 9-10 December 2014

Marty, in 2015 people will surf on the web!!!

Surf!?!?! On the web!?!?!?

Distributional Semantic Models (DSM)

• Analysis of word-usage statistics over huge corpora

• Geometrical space of concepts

• Similar words are represented close together in the space

DSM issue…

Corpus → DSM → Word Space

Word Space is a snapshot of word co-occurrences over a linguistic corpus

…DSM issue…

Corpus 1 → Word Space 1
Corpus 2 → Word Space 2
Corpus 3 → Word Space 3
Corpus 4 → Word Space 4

Difficult to compare Word Spaces built on different corpora

Temporal DSM

Corpus 1900 → Word Space 1
Corpus 1920 → Word Space 2
Corpus 1930 → Word Space 3
Corpus 1940 → Word Space 4

Each corpus contains documents from a specific time period, but the Word Spaces are still not comparable

Temporal Random Indexing (TRI)

Corpus 1900 → RI Space 1
Corpus 1920 → RI Space 2
Corpus 1930 → RI Space 3
Corpus 1940 → RI Space 4

Words (vectors) in different Word Spaces are comparable, so comparison across different time periods is possible!

Random Indexing

Corpus → Vocabulary → RI Space

Assign a random vector $c_i$ to each term, e.g. $c_i$ = <1, 0, 0, 0, -1, 0, 1, 0, -1, 0, 0, 0, 0, 0, 0>

A word vector $sv_i$ is the sum of the random vectors assigned to the co-occurring words:

$sv_i = \sum_{d \in C} \sum_{-m < i < +m} c_i$

Co-occurring words are defined as the set of m words that precede and follow $w_i$
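
A minimal Python sketch of the Random Indexing step described above, on a toy corpus. The function names, the dimension (15, with 4 non-zero entries, mirroring the example vector), the window size m = 2 and the fixed seed are illustrative assumptions and are not taken from the TRI system. The window is read here as m words on each side of the target word.

```python
import random

def random_vector(dim, nonzero, rng):
    """Sparse ternary vector: `nonzero` entries are +1 or -1, the rest are 0."""
    vec = [0] * dim
    for pos in rng.sample(range(dim), nonzero):
        vec[pos] = rng.choice((-1, 1))
    return vec

def build_ri_space(documents, dim=15, nonzero=4, m=2, seed=42):
    """Return (random_vectors, semantic_vectors) for tokenised documents."""
    rng = random.Random(seed)
    vocabulary = sorted({tok for doc in documents for tok in doc})
    # One random vector c_i per vocabulary term.
    random_vectors = {t: random_vector(dim, nonzero, rng) for t in vocabulary}
    semantic_vectors = {}
    for doc in documents:
        for i, word in enumerate(doc):
            sv = semantic_vectors.setdefault(word, [0] * dim)
            # Sum the random vectors of the words in the window around position i.
            for j in range(max(0, i - m), min(len(doc), i + m + 1)):
                if j == i:
                    continue
                for k, value in enumerate(random_vectors[doc[j]]):
                    sv[k] += value
    return random_vectors, semantic_vectors

# Toy corpus (invented for illustration).
docs = [["people", "surf", "the", "web"], ["people", "surf", "the", "waves"]]
random_vectors, space = build_ri_space(docs)
print(space["web"])
```

With this window, "web" occurs once and its vector is simply the sum of the random vectors of "surf" and "the".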

Temporal Random Indexing (TRI)…

Vocabulary: assign a random vector $c_i$ to each term

Corpus T1-T2 → RI Space T1-T2
Corpus T2-T3 → RI Space T2-T3
Corpus T3-T4 → RI Space T3-T4
Corpus Tn-1-Tn → RI Space Tn-1-Tn

…Temporal Random Indexing

• For each Word Space Tk-Tk+1

– sum random vectors taking into account only documents dk in the period Tk-Tk+1

• A word wi has several semantic vectors (sv): one for each time period

• Vectors are comparable

$sv_{i,T_k} = \sum_{d_k \in C} \sum_{-m < i < +m} c_i$
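
The idea can be sketched in Python as follows: the random vectors are assigned once for the whole vocabulary, and a separate semantic vector is accumulated for each time period. The period labels, the toy documents, the parameter values and the function names are hypothetical, and this is only a sketch, not the code of the TRI system.

```python
import random

def random_vector(dim, nonzero, rng):
    """Sparse ternary vector: `nonzero` entries are +1 or -1, the rest are 0."""
    vec = [0] * dim
    for pos in rng.sample(range(dim), nonzero):
        vec[pos] = rng.choice((-1, 1))
    return vec

def build_tri_spaces(corpus_by_period, dim=100, nonzero=4, m=2, seed=7):
    """Return (random_vectors, spaces), with one RI space per time period."""
    rng = random.Random(seed)
    vocabulary = sorted({tok for docs in corpus_by_period.values()
                         for doc in docs for tok in doc})
    # Random vectors are shared by all periods: this keeps the spaces comparable.
    random_vectors = {t: random_vector(dim, nonzero, rng) for t in vocabulary}
    spaces = {}
    for period, docs in corpus_by_period.items():
        space = {}
        for doc in docs:  # only the documents d_k of this period
            for i, word in enumerate(doc):
                sv = space.setdefault(word, [0] * dim)
                for j in range(max(0, i - m), min(len(doc), i + m + 1)):
                    if j != i:
                        for k, value in enumerate(random_vectors[doc[j]]):
                            sv[k] += value
        spaces[period] = space
    return random_vectors, spaces

# Toy corpus with two hypothetical periods.
corpus_by_period = {
    "T1-T2": [["surf", "the", "ocean", "waves"]],
    "T2-T3": [["surf", "the", "web", "pages"]],
}
random_vectors, spaces = build_tri_spaces(corpus_by_period)
print(spaces["T1-T2"]["surf"] == spaces["T2-T3"]["surf"])
```

Here "surf" gets one vector per period, and both are sums of vectors drawn from the same shared set, so they can be compared directly.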

TRI System

• The TRI system¹ performs Temporal Random Indexing:

1. Build an RI space for each year

2. Merge RI spaces to create new time periods

3. Load an RI space and fetch vectors

4. Combine vectors

5. Retrieve similar vectors

6. Extract and compare neighbourhoods of words

¹ https://github.com/pippokill/tri
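
Steps 2 and 5 can be sketched in Python under two assumptions that are not confirmed by the slides: that merging yearly spaces means summing the vectors of each word (which follows from the additive definition of the semantic vectors), and that similarity is the cosine. The function names and the three-dimensional toy vectors are invented for illustration and are not the API of the TRI system.

```python
import math

def merge_spaces(spaces):
    """Sum the vectors of each word over several RI spaces (e.g. one per year)."""
    merged = {}
    for space in spaces:
        for word, vector in space.items():
            total = merged.setdefault(word, [0] * len(vector))
            for k, value in enumerate(vector):
                total[k] += value
    return merged

def cosine(u, v):
    """Cosine similarity; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def most_similar(space, word, n=10):
    """Return the n words whose vectors are closest to the vector of `word`."""
    target = space[word]
    scored = [(other, cosine(target, vec))
              for other, vec in space.items() if other != word]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:n]

# Hypothetical yearly spaces with tiny hand-written vectors.
space_1898 = {"patria": [1, 0, 2], "gloria": [1, 0, 1]}
space_1899 = {"patria": [0, 1, 1], "guerra": [0, 3, 0]}

pre_1900 = merge_spaces([space_1898, space_1899])
neighbours = most_similar(pre_1900, "patria", n=2)
print(neighbours)
```

Summing is only meaningful because every yearly space is built from the same random vectors.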

Evaluation

• Two case studies

1. Project Gutenberg (PG): 349 Italian books from 1810 to 1922

2. ACL Anthology Network dataset (ANN): 21,212 papers published by ACL from 1960 to 2014

• Goals

– Neighbourhood: analyze the neighbourhood of a word

– Semantic shift: analyze words that clearly change their semantics

PG Dataset

• Dataset split in two time periods

– Pre 1900

– Post 1900

• Analyze the neighbourhood: “patria” (“homeland”)

• Semantic shift: “cinematografo” (“cinema”)

PG: neighbourhood of “patria”

Pre 1900    Post 1900
Libertà     Libertà
Opera       Gloria
Pari        Giustizia
Comune      Comune
Gloria      Legge
Nostra      Pari
Causa       Virtù
Italia      Onore
Giustizia   Opera
Guerra      Popolo
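
The two neighbourhoods above can be compared with plain set operations. The short Python sketch below uses the word lists from the table; the function name and the choice of the Jaccard index as an overlap measure are illustrative assumptions, not part of the TRI system.

```python
# Neighbours of "patria" as listed in the table above.
pre_1900 = ["Libertà", "Opera", "Pari", "Comune", "Gloria",
            "Nostra", "Causa", "Italia", "Giustizia", "Guerra"]
post_1900 = ["Libertà", "Gloria", "Giustizia", "Comune", "Legge",
             "Pari", "Virtù", "Onore", "Opera", "Popolo"]

def compare_neighbourhoods(before, after):
    """Split two neighbour lists into kept, lost and gained words."""
    before_set, after_set = set(before), set(after)
    kept = before_set & after_set
    return {
        "kept": kept,
        "lost": before_set - after_set,
        "gained": after_set - before_set,
        # Jaccard index: shared neighbours over all distinct neighbours.
        "jaccard": len(kept) / len(before_set | after_set),
    }

change = compare_neighbourhoods(pre_1900, post_1900)
print(sorted(change["lost"]))       # ['Causa', 'Guerra', 'Italia', 'Nostra']
print(sorted(change["gained"]))     # ['Legge', 'Onore', 'Popolo', 'Virtù']
print(round(change["jaccard"], 2))  # 0.43
```

Six of the ten neighbours are shared by the two periods; the comparison ignores rank order.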

PG: semantic shift of “cinematografo”

$T_{post1900}$: “cinematografo” is strongly related to “sonoro” (“sound”, as in sound film)

$sim(sv_{cinematografo,T_{pre1900}}, sv_{cinematografo,T_{post1900}}) = 0.4$
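
A hedged Python sketch of how such a comparison could be used to look for semantic shift: compute the similarity of each word with itself across two periods and list the least self-similar words first. It assumes that sim is the cosine similarity. The vectors are invented toy values (those for “cinematografo” were chosen so that the cosine comes out at 0.4, as on the slide) and are not the real TRI vectors; the function names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def rank_by_shift(space_before, space_after):
    """Words present in both periods, least self-similar first."""
    shared = space_before.keys() & space_after.keys()
    scores = {w: cosine(space_before[w], space_after[w]) for w in shared}
    return sorted(scores.items(), key=lambda item: item[1])

# Invented toy vectors, for illustration only.
t_pre_1900 = {"cinematografo": [1, 2, 0, 0, 0], "patria": [3, 1, 0, 2, 0]}
t_post_1900 = {"cinematografo": [2, 0, 1, 0, 0], "patria": [3, 1, 0, 1, 0]}

ranking = rank_by_shift(t_pre_1900, t_post_1900)
for word, similarity in ranking:
    print(word, round(similarity, 2))
```

A low self-similarity flags a candidate for semantic shift; inspecting the neighbourhoods in each period then shows what actually changed.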

ANN Dataset

• Dataset split into decades

• Analyze the neighbourhood: “semantics”

• Semantic shift: “bioscience”, “unsupervised”

ANN: neighbourhood of “semantics”

1960-1969       1970-1979       1980-1989       1990-1999       2000-2009       2010-2014
linguistic      natural         syntax          syntax          syntax          syntax
theory          linguistic      natural         theory          theory          theory
semantic        semantic        general         interpretation  interpretation  interpretation
syntactic       theory          theory          general         description     description
natural         syntax          semantic        linguistic      meaning         complex
linguistic      language        syntactic       description     linguistic      meaning
distributional  processing      linguistic      complex         logical         linguistic
process         syntactic       interpretation  natural         complex         logical
computational   description     model           representation  representation  structures
syntax          analysis        description     logical         structures      representation

…ANN: semantic shift “bioscience”

bioscience

– before 1990: extraterrestrial, extrasolar

– nowadays: medline, bionlp, biomedi

…ANN: semantic shift “unsupervised”

unsupervised

– before 1990: observe, partition, selective

– nowadays: supervised, disambiguation, probabilistic, algorithms, statistical

Conclusions

• Temporal Random Indexing

– build Word Spaces taking into account information about time

– developed and published an open-source framework

• Potential of our framework

– capture word usage changes over time

That’s all folks!

https://github.com/pippokill/tri