A Neural Model for Language Identification in Code...

Aaron Jaech, George Mulcaire, Shobhit Hathi,Mari Ostendorf, Noah A. Smith

A Neural Model for Language Identification in Code-Switched Tweets

This talk in a nutshell• Continuous-space (neural) representations

• Hierarchical approach for multiple kinds of context • Char n-grams should be interpreted in the context of a word

• Words should be interpreted in the context of their neighbors

• Tweet-level langid for similar languages and many languages • Good performance for Spanish/English code-switching

Character embeddings

CNN over characters

Bi-LSTM over words

Predictions for each word/tweet

Standard language ID approaches• Classifiers over character n-grams and word dictionary features • Character n-grams are highly informative… but limited

• Want to reason about a broader notion of similarity

Continuous representations for language• Continuous space representations (neural networks) have led

to progress in language modeling, machine translation, and other tasks • Key advantages vs n-gram models: • Share information over longer time ranges (recurrent neural net) • Soft parameter tying

• Our main contribution: use the advantages of continuous space representations in the task of language identification

Why is a CNN like an n-gram model…• Each filter in a convolutional neural network sees a fixed-width

character sequence • Filters act like linear classifiers indicating the presence or

absence of a generalized n-gram • Max-pooling creates fixed size representation for each word

• Sharing information between similar characters and patterns • Incorporating wider context into a single feature vector

…but different?

Convolutional network example

• Each filter sees three characters at a time (WIDTH = 3) • Hypothetical example:

• Filter 1 detects “YOU” trigram • Filter 2 detects words ending in ”S” • Filter 3 detects “ING” trigram

• Pooled vector indicates presence of pattern in the input word

Input Sequence: <S> Y O U R S </S> Pooled Vector

Filter 1 Output: X X ✓ X X X X ✓Filter 2 Output: X X X X X ✓ X ✓Filter 3 Output: X X X X X X X X

Char2Vec• Embeds each space-delimited word • Multiple width filters (from 3 to 6)

capture different length patterns • Residual network allows feature

interactions • RELU used as an activation function

Are word representations enough?

• Placing words in context is nontrivial:

• Exact matches across languages

• “que dolor, actual worst headache”

• Similar patterns can be misleading

Tweet level context• Take as input the sequence of word

vectors from Char2Vec • Process with a bi-directional LSTM • LSTM input gates minimize the effect

of URLs and usernames

• Make predictions for each word • Final prediction is average of

words’ labels

• We call this method C2V2L (character to vector to language)

Development datasets

• Twitter70 covers 70 languages from five language families • TweetLID is a small set of

confusable Iberian languages plus English with some ambiguous Tweets and some code-switching • Tweet-level supervision

Twitter70 TweetLID

Tweets 58,182 14,991

Character vocab 5,796 956

Languages 70 6

Code-switching? Not Labeled Yes

Balanced? Roughly No

Two hand-labeled Twitter datasets:

Baseline models• N-gram language models: • Train a separate character 5-gram language model for each language • Each LM estimates P(TEXT|LANGUAGE) • Pick the label that has the highest posterior probability

• argmax P(TEXT|LANGUAGE) P(LANGUAGE) • Special threshold used for the unknown language label

• Convolutional Model Baseline (C2L): • Treat entire tweet as a single character sequence

• Published baseline: langid.py (Lui and Baldwin, 2012) • Popular open source classifier that uses character n-gram features

Twitter70 results• Limited by performance on most

difficult language pairs • Croatian/Bosnian, Danish/Norwegian

Twitter70 Results

C2V2L 5-gram LMslangid.py

Training Examples Languages

0-200 Oriya

200-400 Uighur, Lithuanian

400-600 Danish, Amharic, Tibetan, Korean, Basque, Chinese, Ukrainian, Tagalog

600-800

Romanian, Turkish, Marathi, Catalan, Taiwanese, Haitian, Icelandic, Malayalam, Serbian, Sinhala, Kannada, Kurdish, Urdu, Hungarian, Punjabi, Cambodian, Latvian, Lao

800-1000

Polish, Japanese, Hindi-Latin, Finnish, Tamil, Burmese, Dutch, English, Bengali, Russian, Georgian, Portuguese, Pashto, Vietnamese, Gujarati, Slovak, Italian, Telugu, French, Estonian, Divehi

1000-1200

Persian, Nepali, Greek, German, Arabic, Bulgarian, Swedish, Armenian, Norwegian, Czech, Spanish, Slovene, Indonesian, Bosnian, Welsh, Thai, Croatian

1200+ Hebrew, Hindi, Sindhi

TweetLID results• Our model beats N-gram baseline and langid.py • Previous state of the art on this dataset is 75.3 F1 (Gamallo et al., 2014) • Hierarchical model is substantially better than non-hierarchical neural

network (C2L)

Model Avg. F1 eng spa cat eus por glg und amb

N-gram LM 75.0 74.8 94.2 82.7 74.8 93.4 49.5 38.9 87.0

Langid.py 68.9 65.9 92.0 72.9 70.6 89.8 52.7 18.8 83.8

C2L 72.7 73.0 93.8 82.6 75.7 89.4 57.0 18.0 92.1

C2V2L 76.2 75.6 94.7 85.3 82.7 91.0 58.5 27.2 94.5

Adding outside data• Surprisingly, most participants in TweetLID workshop

did worse in unconstrained track • 25,000 Wikipedia sentence fragments per language • Train model on weighted mixture of wiki and tweets • Use it to initialize a second model that is fine tuned on

just tweets • N-gram baseline is interpolated between wiki LM and

Twitter LM

Added data results• C2V2L beats previous record (Gamallo et al., 2014) for using outside data • Ability to leverage wiki data is an advantage of our C2V2L model

Hurtado et al. (2014) Gamallo et al. (2014) N-gram LM C2V2L

74.475.3

76.274.7

Constrained Unconstrained

Avg. F1

Code-switching• Our model is easily adapted to

predict code-switching • Just remove the last layer

(tweet level prediction averaging)

Vamos a echar un partido de fifa contra my brother ☺

spa spa spa spa spa spa spa spa eng eng other

Input:

Predictions:

Shared task results: Spa-Eng

L1 L2 NE Other

0.9840.938

0.9820.929

Ours Best other

Shared task results: Arabic

• Poor results for Modern Standard Arabic / Dialectical Arabic — but based only on Spanish-English hyperparameters

L1 L2 NE Other

0.8280.904

0.6030.603

Ours Best other

Hyperparameter tuning helps!

• Arabic-specific hyperparameters show improvement over the Spanish-English hyperparameters (in dev set comparison)

L1 L2 NE Other

0.7650.754

0.9250.955

Old (Sp-En) New (MSA-DA)

Error analysis — examples• “Oversharing” between words

• Named entities

Error analysis — examples• Ambiguous and unknown words

• Confusing Spanish and English

Similar words• Char2Vec can deal with out of vocabulary words • Learns to ignore punctuation and capitalization • Handles non-standard tokens like usernames and hashtags

couldn't @maria_sanchez noite

can’t 0.84 @Ainhooa_Sanchez 0.85 Noite 0.99don’t 0.80 @Ronal2Sanchez 0.71 noite. 0.98ain’t 0.80 @maria_lsantos 0.68 noite? 0.98don’t 0.79 @jordi_sanchez 0.66 noite.. 0.96didn’t 0.79 @marialouca? 0.66 noite, 0.95Can’t 0.78 @mariona_g9 0.65 noitee 0.92first 0.77 @mario_casas_ 0.65 noiteee 0.90

amharic

arabic

bulgarianbengali

tibetan

bosnian

catalan

kurdish

danishgerman

divehi

english

spanishestonian

basque

persian

finnishfrench

gujaratihebrew

hindi−latin

croatian

haitian

hungarianarmenian

indonesian

icelandic

italian

japanese

georgian

cambodian

kannada

korean

lithuanian

latvian

malayalammarathi

burmese

nepali

norwegian

punjabi

polishpashto

portuguese

romanian

russian

sindhi

sinhala

slovak

slovene serbian

swedish

telugu

tagalog

turkish

uighur

ukranian urdu

vietnamese

chinese taiwanese

−400

−200

−400 −200 0 200 400

Germanic

Balto-SlavicRomantic

bosnian

serbian

persianarabic

Indo-Iranian

• Related languages often appear next to each other in the T-SNE embedding • Orthographic, not phonetic,

similarity

Language vectors

Conclusions

• Neural/continuous models have a few advantages including ability to leverage outside data, integration with later stages • Hierarchical representation is key • Performance of the model should

continue to improve as more training data becomes available • Simple approach

Future directions• Joint segmentation and language ID • Supplement code-switching data with tweet-level supervision • Other tagging tasks (with joint language ID?) • Additional features: lexicons, POS, etc.

Links:• TweetLID: http://komunitatea.elhuyar.eus/tweetlid/ • “Twitter70”: https://blog.twitter.com/2015/evaluating-

language-identification-performance (“recall oriented”) • Our paper on ArXiv: https://arxiv.org/abs/1608.03030 • Contact: gmulc@cs.washington.edu

Thank you!

A Neural Model for Language Identification in Code...

Documents

Harvard Environmental Economics Programprojects.iq.harvard.edu/files/heep/files/dp24_kremer-etal.pdf · Bill and Melinda Gates Foundation Spring Cleaning: Rural Water Impacts,

Mothers' Human Capital and the Intergenerational ... Quisimbing etal.pdf · Chronic Poverty Research Centre ISBN: 978-1-906433-62-8 Mothers’ human capital and the intergenerational

The Diverged Trypanosome MICOS Complex as a Hub for ...schneider.dcb.unibe.ch/PDF/publications/Hashimi-etal.pdf · import and assembly (MIA) pathway in yeast by interaction with the

Externalisation and Internalisation of Collective ...rhverlag.de/ArchivIndB/3_03_Arrowsmith etal.pdf · externalisation-internalisation (engagement with sectoral multi-employer bargaining)

The Second Competition on Syntax-Guided Synthesisformal.epfl.ch/synt/2015/slides/fisman-etal.pdf · The Second Competition on Syntax-Guided Synthesis. ... Constraint Programming:

Cubic Spline as a Powerful Tools for Processing ...einspem.upm.edu.my/journal/fullpaper/vol11sfeb/10. Majid Khan etal.pdf · Teknologi Petronas, Malaysia ... imizing the function

The opportunity costs of reducing carbon emissions in an ...userpage.fu-berlin.de/.../bc2008_218_Stickler-EtAl.pdf · expanses) and climate (Ratter et al., 1973) are well-suited for

Journal of Volcanology and Geothermal Researchrepo.kscnet.ru/2521/1/belousov etal.pdf · 2016-02-12 · Overview ofthe precursors anddynamics of the2012–13basalticﬁssure eruption

Comparison of small mammal species in natural versus agri ...pcag.uwinnipeg.ca/Prairie-Perspectives/PP-Vol18/Hollier-etal.pdf · The population dynamics of small mammal species are

Dynamic Behaviour of Cracked Granite and Marble Columns ...ingegneriasismica.org/wp-content/uploads/2014/07/03_minafo-etal.pdf · available stone, characterize several historical

Biophysical characteristics of coastal vegetation in Bird ...pcag.uwinnipeg.ca/Prairie-Perspectives/PP-Vol16/Werner-etal.pdf · geocoding multi-temporal remote sensed images for shoreline

Overview for the Second Shared Task on Language ...ritual.uh.edu/wp-content/uploads/2015/06/molina-EtAl.pdf · SECOND WORKSHOP ON COMPUTATIONAL APPROACHES TO CODE SWITCHING 22. Conclusion

A HISTORICAL AND STATISTICAL COMPARISON OF TORNADO …nwafiles.nwas.org/digest/papers/2010/Vol34No2/Pg145-Gagan-etal.pdf · known term “Dixie Alley.” A basic statistical analysis

POS Tagging for CS Data - RiTUALritual.uh.edu/wp-content/uploads/2015/06/alghamdi-EtAl.pdf · • MADAMIRA MSA is trained on newswire data (Penn Arabic Treebanks 1,2,3). • MADAMIRA

Networks, Fields and Organizations - eclectic.ss.uci.edueclectic.ss.uci.edu/~drwhite/pw/SNA/cohes-white-etal.pdf · Networks, Fields and Organizations: ... January 2004: Special Issue

The Sunk Cost Bias and Managerial Pricing Practicesaida.wss.yale.edu/~shiller/behmacro/2005-11/al-najjar-etal.pdf · The Sunk Cost Bias and Managerial Pricing Practices ... 2.4 Budgets

SKMBT C45417032214050 - cocovote.us 8. 46 5. 37 5 33 2. 88 .18 ... phil wyman orly taitz jonathan jaech. write-in. total 650 ... lydia a. gutierrez write-in. total

Aaron&Jaech&and&Mari&Ostendorf University&of&Washington · 2018-10-10 · high school musical horoscope chris brown high school musical funnyjunk.com homes for sale funbrain.com modular

1 Kalman Filtering with Intermittent Observationsjordan/papers/sinopoli-etal.pdf · Bruno Sinopoli, Luca Schenato, Massimo Franceschetti, Kameshwar Poolla, Michael I. Jordan, Shankar

Increasing Reading Comprehension and Engagement Through ...cori.umd.edu/research-publications/2004-guthrie-wigfield-etal.pdf · Increasing Reading Comprehension and Engagement Through