Analysing Word Meaning over Time by Exploiting Temporal Random Indexing

Pierpaolo Basile, Annalina Caputo and Giovanni Semeraro

Department of Computer Science – University of Bari Aldo Moro

[email protected]

First Italian Conference on Computational Linguistics, Pisa, 9-10 December 2014

Marty, in 2015 people will surf on the web!!!

Surf!?!?! On the web!?!?!?

Distributional Semantic Models (DSM)

• Analysis of word-usage statistics over huge corpora

• Geometrical space of concepts

• Similar words are represented close in the space

DSM issue…

Corpus → DSM → Word Space

Word Space is a snapshot of word co-occurrences over a linguistic corpus

…DSM issue…

Corpus1 → Word Space1

Corpus2 → Word Space2

Corpus3 → Word Space3

Corpus4 → Word Space4

Difficult to compare Word Spaces built on different corpora

Temporal DSM

Corpus1900 → Word Space1

Corpus1920 → Word Space2

Corpus1930 → Word Space3

Corpus1940 → Word Space4

Each corpus contains documents from a specific time period, but the Word Spaces are still not comparable

Temporal Random Indexing (TRI)

Corpus1900 → RI Space1

Corpus1920 → RI Space2

Corpus1930 → RI Space3

Corpus1940 → RI Space4

Words (vectors) in different Word Spaces are comparable, so comparison across different time periods is possible!

Random Indexing

Corpus Vocabulary

Assign a Random Vector ci to each term, e.g. ci = <1, 0, 0, 0, -1, 0, 1, 0, -1, 0, 0, 0, 0, 0, 0>

RI Space

A word vector svi is the sum of random vectors assigned to the co-occurring words

$sv_i = \sum_{d \in C} \; \sum_{-m < i < +m} c_i$

Co-occurring words are defined as the set of m words that precede and follow wi
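
A minimal sketch of the procedure above, for illustration only. The function names, the toy documents and the parameter values (dim=15 and nonzeros=4 mirror the example vector on the slide) are assumptions, not taken from the TRI code.

```python
import numpy as np

def random_vector(dim, nonzeros, rng):
    """Sparse ternary vector: a few entries are +1 or -1, the rest are 0."""
    vec = np.zeros(dim)
    positions = rng.choice(dim, size=nonzeros, replace=False)
    vec[positions] = rng.choice([-1.0, 1.0], size=nonzeros)
    return vec

def build_ri_space(documents, dim=15, nonzeros=4, m=2, seed=0):
    """Return (random vectors, word vectors) for tokenised documents."""
    rng = np.random.default_rng(seed)
    vocabulary = sorted({w for doc in documents for w in doc})
    # Step 1: one random vector per term in the vocabulary.
    random_vectors = {w: random_vector(dim, nonzeros, rng) for w in vocabulary}
    # Step 2: a word vector is the sum of the random vectors of its neighbours.
    space = {w: np.zeros(dim) for w in vocabulary}
    for doc in documents:
        for i, word in enumerate(doc):
            lo, hi = max(0, i - m), min(len(doc), i + m + 1)
            for j in range(lo, hi):
                if j != i:  # the word itself is not part of its context
                    space[word] += random_vectors[doc[j]]
    return random_vectors, space

# Toy documents (hypothetical), already tokenised.
docs = [["people", "surf", "the", "web"], ["people", "surf", "the", "waves"]]
random_vectors, space = build_ri_space(docs)
```

With m=2, the vector of "web" in this toy corpus is the sum of the random vectors of "surf" and "the", its only neighbours inside the window.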

Temporal Random Indexing (TRI)…

Vocabulary: assign a Random Vector ci to each term, shared across all time periods

Corpus T1-T2 → RI Space T1-T2

Corpus T2-T3 → RI Space T2-T3

Corpus T3-T4 → RI Space T3-T4

…

Corpus Tn-1-Tn → RI Space Tn-1-Tn

…Temporal Random Indexing

• For each Word Space Tk-Tk+1

– sum random vectors taking into account only documents dk in the period Tk-Tk+1

• A word wi has several semantic vectors (sv): one for each time period

• Vectors are comparable

$sv_{i,T_k} = \sum_{d_k \in C} \; \sum_{-m < i < +m} c_i$
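
The idea can be sketched as follows, under stated assumptions: the function name, the period labels and the toy sentences are hypothetical, and this is not the code of the TRI system. The point it illustrates is that the random vectors are drawn once and reused in every period, which is what makes vectors from different periods comparable.

```python
import numpy as np

def build_tri_spaces(corpus_by_period, dim=100, nonzeros=4, m=2, seed=0):
    """Build one RI space per time period from a shared set of random vectors."""
    rng = np.random.default_rng(seed)
    vocabulary = sorted(
        {w for docs in corpus_by_period.values() for doc in docs for w in doc}
    )
    # Random vectors are assigned once, for the whole collection.
    random_vectors = {}
    for w in vocabulary:
        vec = np.zeros(dim)
        positions = rng.choice(dim, size=nonzeros, replace=False)
        vec[positions] = rng.choice([-1.0, 1.0], size=nonzeros)
        random_vectors[w] = vec
    spaces = {}
    for period, docs in corpus_by_period.items():
        space = {}
        # Only documents of this period contribute to this space.
        for doc in docs:
            for i, word in enumerate(doc):
                context = doc[max(0, i - m):i] + doc[i + 1:i + m + 1]
                sv = space.setdefault(word, np.zeros(dim))
                for c in context:
                    sv += random_vectors[c]  # in-place update of the stored vector
        spaces[period] = space
    return random_vectors, spaces

# Toy corpus (hypothetical), one tokenised document per period.
corpus_by_period = {
    "pre1900": [["the", "cinematografo", "shows", "silent", "pictures"]],
    "post1900": [["the", "cinematografo", "shows", "sound", "pictures"]],
}
random_vectors, spaces = build_tri_spaces(corpus_by_period)
```

Each word ends up with one semantic vector per period in which it occurs, e.g. spaces["pre1900"]["cinematografo"] and spaces["post1900"]["cinematografo"].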

TRI System

• TRI system [1] performs Temporal Random Indexing

1. Build an RI space for each year

2. Merge RI spaces to create new time periods

3. Load RI space and fetch vectors

4. Combine vectors

5. Retrieve similar vectors

6. Extract and compare neighbourhood of words

[1] https://github.com/pippokill/tri
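
A rough sketch of steps 2 and 5, not the interface of the TRI tool. It assumes that merging yearly spaces means adding the vectors of the same word, which is consistent with semantic vectors being sums of random vectors; the function names and the toy two-dimensional spaces are hypothetical.

```python
import numpy as np

def merge_spaces(spaces):
    """Merge several RI spaces by summing the vectors of each word (assumption)."""
    merged = {}
    for space in spaces:
        for word, vec in space.items():
            merged[word] = merged.get(word, np.zeros_like(vec)) + vec
    return merged

def cosine(u, v):
    """Cosine similarity; 0.0 when either vector is all zeros."""
    norm = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / norm) if norm > 0 else 0.0

def nearest(space, word, n=10):
    """Return the n words whose vectors are most similar to the given word."""
    target = space[word]
    scored = [(other, cosine(target, vec))
              for other, vec in space.items() if other != word]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:n]

# Toy yearly spaces (hypothetical values).
space_1900 = {"a": np.array([1.0, 0.0]), "b": np.array([0.0, 1.0])}
space_1901 = {"a": np.array([1.0, 1.0]), "c": np.array([2.0, 2.0])}
merged = merge_spaces([space_1900, space_1901])
neighbours = nearest(merged, "a", n=2)
```

In this toy example the merged vector of "a" is [2, 1], and its closest neighbour is "c".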

Evaluation

• Two case studies

1. Project Gutenberg (PG): 349 Italian books from 1810 to 1922

2. ACL Anthology Network dataset (ANN): 21,212 papers published by ACL from 1960 to 2014

• Goals

– Neighbourhood: analyze the neighbourhood of a word

– Semantic shift: analyze words that clearly change their semantics

PG Dataset

• Dataset split in two time periods

– Pre 1900

– Post 1900

• Analyze the neighbourhood: “patria” (homeland)

• Semantic shift: “cinematografo” (cinema)

PG: neighbourhood of “patria”

Pre 1900     Post 1900
Libertà      Libertà
Opera        Gloria
Pari         Giustizia
Comune       Comune
Gloria       Legge
Nostra       Pari
Causa        Virtù
Italia       Onore
Giustizia    Opera
Guerra       Popolo
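
One simple way to read the two neighbourhoods side by side is to look at which neighbours are shared, lost and gained. The sketch below uses the ten words per period listed on this slide; the set comparison and the Jaccard overlap are an illustration added here, not an analysis reported by the authors.

```python
# Neighbours of "patria" as listed on the slide, in rank order.
pre_1900 = ["Libertà", "Opera", "Pari", "Comune", "Gloria",
            "Nostra", "Causa", "Italia", "Giustizia", "Guerra"]
post_1900 = ["Libertà", "Gloria", "Giustizia", "Comune", "Legge",
             "Pari", "Virtù", "Onore", "Opera", "Popolo"]

# Keep list order so the result follows the original ranking.
shared = [w for w in pre_1900 if w in post_1900]
lost = [w for w in pre_1900 if w not in post_1900]
gained = [w for w in post_1900 if w not in pre_1900]

# Jaccard overlap: shared neighbours over all distinct neighbours.
jaccard = len(shared) / len(set(pre_1900) | set(post_1900))

print("shared:", shared)
print("lost:  ", lost)
print("gained:", gained)
print("jaccard: %.2f" % jaccard)
```

Six of the ten neighbours appear in both periods; "Nostra", "Causa", "Italia" and "Guerra" drop out, while "Legge", "Virtù", "Onore" and "Popolo" enter.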

PG: semantic shift of “cinematografo”

T_post1900: “cinematografo” is strongly related to “sonoro” (sound)

$sim(sv_{cinematografo,\,T_{pre1900}},\; sv_{cinematografo,\,T_{post1900}}) = 0.4$
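
A similarity of this kind can be sketched as the cosine between the two vectors of the same word in two periods. The slides do not state which similarity function is used, so cosine is an assumption here, as are the function names and the toy vectors. The comparison is only meaningful when both spaces were built from the same random vectors.

```python
import numpy as np

def self_similarity(space_a, space_b):
    """Cosine similarity of each word with itself across two periods."""
    result = {}
    for word in space_a.keys() & space_b.keys():
        u, v = space_a[word], space_b[word]
        norm = np.linalg.norm(u) * np.linalg.norm(v)
        result[word] = float(u @ v / norm) if norm > 0 else 0.0
    return result

def most_shifted(space_a, space_b, n=10):
    """Words present in both periods, lowest self-similarity first."""
    sims = self_similarity(space_a, space_b)
    return sorted(sims.items(), key=lambda pair: pair[1])[:n]

# Toy spaces (hypothetical values): one stable word, one that changes direction.
pre = {"stable": np.array([1.0, 0.0, 0.0]), "shifted": np.array([1.0, 0.0, 0.0])}
post = {"stable": np.array([2.0, 0.0, 0.0]), "shifted": np.array([0.0, 1.0, 0.0])}
ranking = most_shifted(pre, post)
```

A low self-similarity suggests that the contexts of a word have changed between the two periods, so sorting by it gives candidate words for a semantic shift.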

ANN Dataset

• Dataset split in decades

• Analyze the neighbourhood: “semantics”

• Semantic shift: “bioscience”, “unsupervised”

ANN: neighbourhood of “semantics”

1960-1969       1970-1979       1980-1989       1990-1999       2000-2009       2010-2014
linguistic      natural         syntax          syntax          syntax          syntax
theory          linguistic      natural         theory          theory          theory
semantic        semantic        general         interpretation  interpretation  interpretation
syntactic       theory          theory          general         description     description
natural         syntax          semantic        linguistic      meaning         complex
linguistic      language        syntactic       description     linguistic      meaning
distributional  processing      linguistic      complex         logical         linguistic
process         syntactic       interpretation  natural         complex         logical
computational   description     model           representation  representation  structures
syntax          analysis        description     logical         structures      representation


…ANN: semantic shift “bioscience”

bioscience

before 1990: extraterrestrial, extrasolar

nowadays: medline, bionlp, biomedi

…ANN: semantic shift “unsupervised”

unsupervised

before 1990: observe, partition, selective

nowadays: supervised, disambiguation, probabilistic, algorithms, statistical

Conclusions

• Temporal Random Indexing

– build Word Spaces taking into account information about time

– developed and published an open-source framework

• Potential of our framework

– capture word usage changes over time

That’s all folks!

https://github.com/pippokill/tri