Analysing Word Meaning over Time by Exploiting Temporal Random Indexing
Pierpaolo Basile, Annalina Caputo and Giovanni Semeraro
Department of Computer Science – University of Bari Aldo Moro
First Italian Conference on Computational Linguistics, Pisa, 9-10 December 2014
Marty, in 2015 people will surf on the web!!!
Surf!?!?! On the web!?!?!?
Distributional Semantic Models (DSM)
• Analysis of word-usage statistics over huge corpora
• Geometrical space of concepts
• Similar words are represented close in the space
DSM issue…
Corpus → DSM → Word Space
Word Space is a snapshot of word co-occurrences over a linguistic corpus
…DSM issue…
Corpus1 → Word Space1
Corpus2 → Word Space2
Corpus3 → Word Space3
Corpus4 → Word Space4
Difficult to compare Word Spaces built on different corpora
Temporal DSM
Corpus1900 → Word Space1
Corpus1920 → Word Space2
Corpus1930 → Word Space3
Corpus1940 → Word Space4
Each corpus contains documents of a specific time period… …Word Spaces are still not comparable
Temporal Random Indexing (TRI)
Corpus1900 → RI Space1
Corpus1920 → RI Space2
Corpus1930 → RI Space3
Corpus1940 → RI Space4
Words (vectors) in different Word Spaces are comparable… …comparison on different time periods is possible!
Random Indexing
Corpus Vocabulary
Assign a Random Vector ci to each term
ci = <1, 0, 0, 0, -1, 0, 1, 0, -1, 0, 0, 0, 0, 0, 0>
RI Space
A word vector svi is the sum of random vectors assigned to the co-occurring words
$sv_i = \sum_{d \in C} \sum_{-m < i < +m} c_i$
Co-occurring words are defined as the set of m words that precede and follow wi
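The construction above can be sketched in a few lines of Python. This is a minimal illustration, not the code of the TRI system: the function names, the dimension (100), the number of non-zero entries (4) and the seeding of each random vector by its term are all assumptions made for the example. The window takes m words on each side, following the textual definition.

```python
import random
from collections import defaultdict

def random_vector(term, dim=100, nonzero=4):
    """Sparse ternary vector for a term (hypothetical helper).

    Seeding the generator with the term makes the vector reproducible,
    so the same term always gets the same random vector.
    """
    rng = random.Random(term)
    vec = [0] * dim
    for k, pos in enumerate(rng.sample(range(dim), nonzero)):
        vec[pos] = 1 if k % 2 == 0 else -1   # half +1, half -1
    return vec

def build_ri_space(documents, m=2, dim=100):
    """documents: list of token lists. Returns {word: semantic vector}."""
    space = defaultdict(lambda: [0] * dim)
    for tokens in documents:
        for i, word in enumerate(tokens):
            # Context window: up to m words before and after position i.
            for j in range(max(0, i - m), min(len(tokens), i + m + 1)):
                if j == i:
                    continue
                c = random_vector(tokens[j], dim)
                sv = space[word]
                for k in range(dim):
                    sv[k] += c[k]   # sum the random vectors of the neighbours
    return dict(space)

space = build_ri_space([["people", "surf", "on", "the", "web"]], m=1)
```

With m=1 the vector of "surf" is simply the sum of the random vectors of "people" and "on".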
Temporal Random Indexing (TRI)…
Vocabulary: assign a Random Vector ci to each term
Corpus T1-T2 → RI Space T1-T2
Corpus T2-T3 → RI Space T2-T3
Corpus T3-T4 → RI Space T3-T4
…
Corpus Tn-1-Tn → RI Space Tn-1-Tn
…Temporal Random Indexing
• For each Word Space Tk-Tk+1
– sum random vectors taking into account only documents dk in the period Tk-Tk+1
• A word wi has several semantic vectors (sv): one for each time period
• Vectors are comparable
$sv_{i,T_k} = \sum_{d_k \in C} \sum_{-m < i < +m} c_i$
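A possible sketch of the temporal variant follows. The key point is that the random vector of a term is the same in every period, so the resulting spaces share one coordinate system. All names, the half-open period convention (start <= year < end) and the parameters are assumptions for illustration, not the TRI system's actual API.

```python
import math
import random
from collections import defaultdict

def random_vector(term, dim=100, nonzero=4):
    """Same random vector for a term in every period (hypothetical helper)."""
    rng = random.Random(term)
    vec = [0] * dim
    for k, pos in enumerate(rng.sample(range(dim), nonzero)):
        vec[pos] = 1 if k % 2 == 0 else -1
    return vec

def build_temporal_spaces(dated_docs, periods, m=2, dim=100):
    """dated_docs: list of (year, tokens); periods: list of (start, end)."""
    spaces = {p: defaultdict(lambda: [0] * dim) for p in periods}
    for year, tokens in dated_docs:
        for start, end in periods:
            if not (start <= year < end):
                continue   # only documents of this period contribute
            space = spaces[(start, end)]
            for i, word in enumerate(tokens):
                for j in range(max(0, i - m), min(len(tokens), i + m + 1)):
                    if j != i:
                        c = random_vector(tokens[j], dim)
                        sv = space[word]
                        for k in range(dim):
                            sv[k] += c[k]
    return spaces

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0
```

Because the random vectors are shared, `cosine(spaces[p1][w], spaces[p2][w])` is meaningful for the same word w in two different periods.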
TRI System
• The TRI system [1] performs Temporal Random Indexing
1. Build an RI space for each year
2. Merge RI spaces to create new time periods
3. Load RI space and fetch vectors
4. Combine vectors
5. Retrieve similar vectors
6. Extract and compare neighbourhood of words
[1] https://github.com/pippokill/tri
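Steps 2 and 5 can be illustrated with a small sketch. It assumes that merging yearly spaces amounts to summing the vectors of each word, which follows from the additive definition of the semantic vectors; the function names and the dictionary layout are hypothetical and do not reflect the actual TRI command-line tool.

```python
import math

def merge_spaces(yearly_spaces, years):
    """Sum the word vectors of the selected years into one period space."""
    merged = {}
    for year in years:
        for word, vec in yearly_spaces.get(year, {}).items():
            if word in merged:
                merged[word] = [a + b for a, b in zip(merged[word], vec)]
            else:
                merged[word] = list(vec)   # copy, so the yearly space is untouched
    return merged

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def nearest(space, word, n=10):
    """The n words whose vectors are most similar to the vector of `word`."""
    target = space[word]
    scored = [(cosine(target, vec), other)
              for other, vec in space.items() if other != word]
    scored.sort(reverse=True)
    return [other for _, other in scored[:n]]
```

For example, `nearest(merge_spaces(yearly, range(1900, 1923)), "patria")` would list the neighbourhood of "patria" in a merged post-1900 space, given a `yearly` dictionary of per-year spaces.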
Evaluation
• Two case studies
1. Project Gutenberg (PG): 349 Italian books from 1810 to 1922
2. ACL Anthology Network dataset (ANN): 21,212 papers published by ACL from 1960 to 2014
• Goals
– Neighborhood: analyze the neighborhood of a word
– Semantic shift: analyze words that clearly change their semantics
PG Dataset
• Dataset split in two time periods
– Pre 1900
– Post 1900
• Analyze the neighbourhood: “patria”
• Semantic shift: “cinematografo”
PG: neighbourhood of “patria”
Pre 1900     Post 1900
Libertà      Libertà
Opera        Gloria
Pari         Giustizia
Comune       Comune
Gloria       Legge
Nostra       Pari
Causa        Virtù
Italia       Onore
Giustizia    Opera
Guerra       Popolo
PG: semantic shift of “cinematografo”
Post 1900: “cinematografo” strongly related to “sonoro”
$sim(sv_{cinematografo,T_{pre1900}}, sv_{cinematografo,T_{post1900}}) = 0.4$
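The similarity above is a comparison between the two vectors of the same word in two periods. A minimal sketch, assuming cosine similarity as the measure and using toy vectors rather than the real PG spaces, ranks the words shared by two periods from most to least shifted. The function names are hypothetical.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def rank_by_shift(space_before, space_after):
    """(similarity, word) pairs for shared words, lowest similarity first."""
    shared = set(space_before) & set(space_after)
    return sorted((cosine(space_before[w], space_after[w]), w) for w in shared)

# Toy vectors, invented for the example only.
before = {"stable": [1, 0], "shifted": [1, 0], "old": [0, 1]}
after = {"stable": [2, 0], "shifted": [0, 3], "new": [1, 1]}
ranked = rank_by_shift(before, after)
```

Words at the top of `ranked` are candidates for a semantic shift; words that occur in only one period ("old", "new") are skipped because they have no vector to compare against.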
ANN Dataset
• Dataset split in decades
• Analyze the neighbourhood: “semantics”
• Semantic shift: “bioscience”, “unsupervised”
ANN: neighbourhood of “semantics”
1960-1969       1970-1979    1980-1989       1990-1999       2000-2009       2010-2014
linguistic      natural      syntax          syntax          syntax          syntax
theory          linguistic   natural         theory          theory          theory
semantic        semantic     general         interpretation  interpretation  interpretation
syntactic       theory       theory          general         description     description
natural         syntax       semantic        linguistic      meaning         complex
linguistic      language     syntactic       description     linguistic      meaning
distributional  processing   linguistic      complex         logical         linguistic
process         syntactic    interpretation  natural         complex         logical
computational   description  model           representation  representation  structures
syntax          analysis     description     logical         structures      representation
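One simple way to quantify how much a neighbourhood changes is the overlap between two neighbour lists. The sketch below uses the 1960-1969 and 2010-2014 columns of the table; the choice of Jaccard overlap and the function name are assumptions for illustration, not a measure used in the presentation.

```python
# Neighbours of "semantics", copied from the first and last columns of the table.
neighbours_1960s = ["linguistic", "theory", "semantic", "syntactic", "natural",
                    "linguistic", "distributional", "process", "computational",
                    "syntax"]
neighbours_2010s = ["syntax", "theory", "interpretation", "description", "complex",
                    "meaning", "linguistic", "logical", "structures",
                    "representation"]

def neighbourhood_overlap(a, b):
    """Shared neighbours and Jaccard overlap of two neighbour lists."""
    set_a, set_b = set(a), set(b)
    shared = set_a & set_b
    return shared, len(shared) / len(set_a | set_b)

shared, jaccard = neighbourhood_overlap(neighbours_1960s, neighbours_2010s)
print(sorted(shared), jaccard)   # ['linguistic', 'syntax', 'theory'] 0.1875
```

Only three of the neighbours survive from the 1960s to the 2010s, which matches the visual impression given by the table.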
…ANN: semantic shift “bioscience”
bioscience
• before 1990: extraterrestrial, extrasolar
• nowadays: medline, bionlp, biomedi
…ANN: semantic shift “unsupervised”
unsupervised
• before 1990: observe, partition, selective
• nowadays: supervised, disambiguation, probabilistic, algorithms, statistical
Conclusions
• Temporal Random Indexing
– build Word Spaces taking into account information about time
– developed and published an open-source framework
• Potential of our framework
– capture word usage changes over time
That’s all folks!
https://github.com/pippokill/tri