

Textual Corpora, Treebanks, and the Human Language Project

Steven Abney

Department of Linguistics
Computer Science and Engineering

School of Information

2015 Mar 30


Overview

What kinds of (annotated) text corpora do computational linguists use?

A research example: using existing treebanks to bootstrap treebanks in other languages

Extrapolating into the future: building a comprehensive multilingual dataset, a Human Genome Project for language

What a library might provide


Textual Data: Plain text

Language samples

Plain text
Preferably large amounts (1M–1B words)
In as many languages as possible

Some LDC titles:
UN Parallel Text (Complete) [1994]
European Language Newspaper Text [1995]
Japanese Business News Text [1995]
Spanish News Text [1995]
Mandarin Chinese News Text [1995]
CALLHOME Egyptian Arabic Transcripts [1997]
Portuguese Newswire Text [1999]
Korean Newswire [2000]
HUB5 German Transcripts [2003]
Chinese Gigaword [2003]
Arabic Gigaword [2003]
Czech Broadcast News Transcripts [2004]
Web 1T 5-gram Version 1 [2006]
Hungarian-English Parallel Text, Version 1.0 [2008]
Web 1T 5-gram, 10 European Languages Version 1 [2009]

Aligned texts, multiple translations


Textual Data: Lexicons

Wordnet: lexical database

hypernymy, synonymy, meronymy
“Synset” = set of words that share a meaning

Multi-lingual wordnets: synsets that cross languages (translation equivalents)

Babelnet: combines Wordnets with Wikipedia and Wiktionary

Panlex: database constructed from thousands of bilingual print dictionaries

Hypernym chain for “horse”:
entity > physical entity > object > whole > living thing > organism > animal > chordate > vertebrate > mammal > placental > ungulate > odd-toed ungulate > equine > horse
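The hypernym chain can be read as a simple parent-pointer structure. The following is a minimal sketch under that assumption; the `HYPERNYM` dictionary and the helper names `hypernym_chain` and `is_a` are illustrative and are not WordNet's actual storage format or API.

```python
# Toy hypernym table: each term maps to its immediate hypernym (parent).
# The entries mirror the "horse" chain from the slide; the dict itself is
# an illustrative stand-in, not how WordNet stores its data.
HYPERNYM = {
    "horse": "equine",
    "equine": "odd-toed ungulate",
    "odd-toed ungulate": "ungulate",
    "ungulate": "placental",
    "placental": "mammal",
    "mammal": "vertebrate",
    "vertebrate": "chordate",
    "chordate": "animal",
    "animal": "organism",
    "organism": "living thing",
    "living thing": "whole",
    "whole": "object",
    "object": "physical entity",
    "physical entity": "entity",
}


def hypernym_chain(term):
    """Return the path from the most general term down to `term`."""
    chain = [term]
    while chain[-1] in HYPERNYM:
        chain.append(HYPERNYM[chain[-1]])
    return list(reversed(chain))


def is_a(term, ancestor):
    """True if `ancestor` appears on the hypernym path of `term`."""
    return ancestor in hypernym_chain(term)


if __name__ == "__main__":
    print(" > ".join(hypernym_chain("horse")))
    print(is_a("horse", "mammal"))
```

A real wordnet allows multiple hypernyms per synset, so the single-parent table here is a simplification.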


Textual Data: Parts of speech

The Brown corpus (word, tag):

The at
Fulton np-tl
County nn-tl
Grand jj-tl
Jury nn-tl
said vbd
Friday nr
an at
investigation nn
of in
Atlanta’s np$
recent jj
primary nn
election nn
produced vbd
“ “
no at
evidence nn
” ”
that cs
any dti

Why?

Input for machine learning
Automatically train a system to label new text
First step in language-interpretation pipeline
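To make "automatically train a system to label new text" concrete, here is a minimal sketch of a most-frequent-tag baseline. This is a swapped-in illustration, not the method used in the talk; the function names `train` and `tag` are hypothetical, and the training data is a subset of the Brown sample shown on the slide.

```python
from collections import Counter, defaultdict

# A subset of the (word, tag) pairs from the Brown corpus sample.
TAGGED = [
    ("The", "at"), ("Fulton", "np-tl"), ("County", "nn-tl"),
    ("Grand", "jj-tl"), ("Jury", "nn-tl"), ("said", "vbd"),
    ("Friday", "nr"), ("an", "at"), ("investigation", "nn"),
    ("of", "in"), ("recent", "jj"), ("primary", "nn"),
    ("election", "nn"), ("produced", "vbd"), ("no", "at"),
    ("evidence", "nn"), ("that", "cs"), ("any", "dti"),
]


def train(pairs):
    """Learn the most frequent tag per word, plus a fallback tag."""
    counts = defaultdict(Counter)
    for word, word_tag in pairs:
        counts[word.lower()][word_tag] += 1
    overall = Counter(word_tag for _, word_tag in pairs)
    default_tag = overall.most_common(1)[0][0]
    model = {w: c.most_common(1)[0][0] for w, c in counts.items()}
    return model, default_tag


def tag(words, model, default_tag):
    """Label each word; unseen words get the overall most frequent tag."""
    return [(w, model.get(w.lower(), default_tag)) for w in words]


if __name__ == "__main__":
    model, default_tag = train(TAGGED)
    print(tag(["the", "jury", "said", "something"], model, default_tag))
```

Real taggers use context as well as word identity; the baseline only shows how annotated text becomes training input.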


Textual Data: Named entities

El portavoz del [ORG Consejo Político Municipal de IU] en [LOC Santander], [PER Ernesto Gómez de la Hera], explicó hoy en conferencia de prensa que este . . .

El O
portavoz O
del O
Consejo B-ORG
Político I-ORG
Municipal I-ORG
de I-ORG
IU I-ORG
en O
Santander B-LOC
, O
Ernesto B-PER
Gómez I-PER
de I-PER
la I-PER
Hera I-PER
, O
explicó O
hoy O
en O
conferencia O
de O
prensa O
que O
este O
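The token-level tags (B- begins an entity, I- continues it, O is outside) encode the same information as the bracketed spans. A minimal sketch of the conversion follows; the function name `bio_to_spans` is hypothetical, and treating a stray I- tag as the start of a new entity is an assumption made here for robustness.

```python
def bio_to_spans(tokens, tags):
    """Group B-/I- tagged tokens into (label, text) entity spans."""
    spans = []
    label, words = None, []
    for token, tag in zip(tokens, tags):
        # An I- tag continues the open entity only if the labels match.
        if tag.startswith("I-") and tag[2:] == label:
            words.append(token)
            continue
        # Anything else closes the open entity, if there is one.
        if words:
            spans.append((label, " ".join(words)))
        if tag == "O":
            label, words = None, []
        else:
            # B-X, or an I-X with no matching open entity, starts a new one.
            label, words = tag[2:], [token]
    if words:
        spans.append((label, " ".join(words)))
    return spans


TOKENS = ("El portavoz del Consejo Político Municipal de IU en Santander , "
          "Ernesto Gómez de la Hera , explicó hoy").split()
TAGS = ("O O O B-ORG I-ORG I-ORG I-ORG I-ORG O B-LOC O "
        "B-PER I-PER I-PER I-PER I-PER O O O").split()

if __name__ == "__main__":
    for entity_label, text in bio_to_spans(TOKENS, TAGS):
        print(entity_label, text)
```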


Textual Data: Constituent-structure treebanks

(S
  (NP-SBJ
    (NP (NNP Pierre) (NNP Vinken))
    (, ,)
    (ADJP
      (NP (CD 61) (NNS years))
      (JJ old))
    (, ,))
  (VP (MD will)
    (VP (VB join)
      (NP (DT the) (NN board))
      (PP-CLR (IN as)
        (NP (DT a) (JJ nonexecutive) (NN director)))
      (NP-TMP (NNP Nov.) (CD 29))))
  (. .))

Penn Treebank
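Constituent trees of this kind are commonly written as nested brackets, which makes them easy to read into a program. The following is a minimal sketch of a recursive-descent reader for that notation; the names `parse_tree` and `leaves` are hypothetical, and the tuple representation is just one convenient choice.

```python
import re

TREE = """
(S
  (NP-SBJ
    (NP (NNP Pierre) (NNP Vinken))
    (, ,)
    (ADJP (NP (CD 61) (NNS years)) (JJ old))
    (, ,))
  (VP (MD will)
    (VP (VB join)
      (NP (DT the) (NN board))
      (PP-CLR (IN as) (NP (DT a) (JJ nonexecutive) (NN director)))
      (NP-TMP (NNP Nov.) (CD 29))))
  (. .))
"""


def parse_tree(text):
    """Parse a bracketed tree into nested (label, children) tuples.

    Children are either nested tuples or plain strings (the words).
    """
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def node():
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError("expected '('")
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1  # consume ")"
        return (label, children)

    return node()


def leaves(tree):
    """Return the words of the tree in left-to-right order."""
    words = []
    for child in tree[1]:
        words.extend(leaves(child) if isinstance(child, tuple) else [child])
    return words


if __name__ == "__main__":
    tree = parse_tree(TREE)
    print(tree[0], " ".join(leaves(tree)))
```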


Textual Data: Dependency treebanks

Dependency tree (each word indented under its governor):

join
  Vinken
    Pierre
    ,
    old
      years
        61
    ,
  will
  board
    the
  director
    as
    a
    nonexecutive
  Nov.
    29

The same sentence as rows of word index, word, part of speech, and governor index (0 = root):

1 Pierre NNP 2
2 Vinken NNP 9
3 , , 2
4 61 CD 5
5 years NNS 6
6 old JJ 2
7 , , 2
8 will MD 9
9 join VB 0
10 the DT 11
11 board NN 9
12 as IN 15
13 a DT 15
14 nonexecutive JJ 15
15 director NN 9
16 Nov. NNP 9
17 29 CD 16
18 . . 9

Much more compact than constituent trees, equivalent for practical purposes

Purpose: training a parser (interpretation, translation)
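The row format (index, word, part of speech, governor index) is simple to load. Here is a minimal sketch, assuming whitespace-separated columns and a governor index of 0 for the root; the function names `read_dependencies` and `dependents` are hypothetical.

```python
ROWS = """\
1 Pierre NNP 2
2 Vinken NNP 9
3 , , 2
4 61 CD 5
5 years NNS 6
6 old JJ 2
7 , , 2
8 will MD 9
9 join VB 0
10 the DT 11
11 board NN 9
12 as IN 15
13 a DT 15
14 nonexecutive JJ 15
15 director NN 9
16 Nov. NNP 9
17 29 CD 16
18 . . 9
"""


def read_dependencies(text):
    """Read one sentence: one token per line, four columns per token."""
    tokens = []
    for line in text.strip().splitlines():
        index, form, pos_tag, head = line.split()
        tokens.append({"id": int(index), "form": form,
                       "tag": pos_tag, "head": int(head)})
    return tokens


def dependents(tokens, head_id):
    """Return the word forms governed directly by token `head_id`."""
    return [t["form"] for t in tokens if t["head"] == head_id]


if __name__ == "__main__":
    sentence = read_dependencies(ROWS)
    root = next(t for t in sentence if t["head"] == 0)
    print("root:", root["form"])
    print("dependents of root:", dependents(sentence, root["id"]))
```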


Research Example: How can we learn a parser without a treebank?

Motivations

Linguistics: ultimate subject matter is human language capacity = ability to learn language.
Google: access to data in all languages
DARPA: decision support in crisis management; information extraction from news media and social media

Treebanks

I know of treebanks for 43 languages:
Arabic; Armenian, Ancient; Basque; Bulgarian; Catalan; Chinese; Czech; Danish; Dutch; English; English, Early Modern;
English, Middle; English, Old; Estonian; Finnish; French; German; Gothic; Greek; Greek, Ancient; Hebrew; Hindi-Urdu;
Hungarian; Icelandic; Indonesian; Italian; Japanese; Karuk; Korean; Latin; Polish; Portuguese; Portuguese, Medieval;
Romanian; Russian; Slavonic, Old Church; Slovene; Spanish; Swedish; Thai; Turkish; Ugaritic; Vietnamese

But there are 6800 languages (Ethnologue)


Research Example: Learning a dependency parser for a new language

Accuracy measure: percentage of governors correctly identified

Monolingual grammatical inference: 47%
Delexicalized transfer: 52%
Multi-source delexicalized transfer: 55%
Adaptation using bitexts: 59%
Adaptation using bitexts + language relationships: 62%
Supervised training: 84%

McDonald et al 2012

Existing methods neglect bilingual dictionaries
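The accuracy measure used on this slide, the percentage of governors correctly identified, can be computed by comparing governor indices token by token. The sketch below assumes one governor index per token; the function name `attachment_accuracy` is hypothetical, the gold indices are those of the "Pierre Vinken" sentence, and the predicted indices are invented purely for illustration.

```python
def attachment_accuracy(gold_heads, predicted_heads):
    """Percentage of tokens whose predicted governor matches the gold one."""
    if not gold_heads or len(gold_heads) != len(predicted_heads):
        raise ValueError("need two non-empty sequences of equal length")
    correct = sum(g == p for g, p in zip(gold_heads, predicted_heads))
    return 100.0 * correct / len(gold_heads)


# Gold governor indices for the 18 tokens of the example sentence.
GOLD = [2, 9, 2, 5, 6, 2, 2, 9, 0, 11, 9, 15, 15, 15, 9, 9, 16, 9]

# Hypothetical parser output with two errors: token 12 ("as") is attached
# to "join" and token 16 ("Nov.") is attached to "director".
PREDICTED = [2, 9, 2, 5, 6, 2, 2, 9, 0, 11, 9, 9, 15, 15, 9, 15, 16, 9]

if __name__ == "__main__":
    print(f"{attachment_accuracy(GOLD, PREDICTED):.1f}%")
```

In practice the score is aggregated over all tokens of a test set rather than a single sentence.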


Research Example: A challenge: low-resource languages

Making the methods practical for low-resource languages

Well-resourced languages: about 50, or 0.7% of all languages
E.g., Google translates 57 languages
All but 18 are Indo-European; none are endangered.

Example: machine translation

Current methods require 2–10 million words of bitext for training
Largest source of bitext is the Bible: 0.8 M words, 459 languages (7%).
New Testament: 0.1 M words, 1213 languages (18%)
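The percentages on this slide follow from the Ethnologue total of 6800 languages quoted earlier in the talk. The short sketch below reproduces the arithmetic and the size of the gap between Bible bitext and the stated training requirement; the helper names `share` and `shortfall` are hypothetical.

```python
TOTAL_LANGUAGES = 6800  # Ethnologue figure quoted in the talk


def share(language_count, total=TOTAL_LANGUAGES):
    """Percentage of the world's languages covered."""
    return 100.0 * language_count / total


def shortfall(available_words, required_words):
    """How many times more text is required than is available."""
    return required_words / available_words


if __name__ == "__main__":
    print(f"well-resourced: {share(50):.1f}%")
    print(f"full Bible:     {share(459):.0f}%")
    print(f"New Testament:  {share(1213):.0f}%")
    # The Bible offers 0.8 M words; current methods need 2-10 M words.
    low = shortfall(800_000, 2_000_000)
    high = shortfall(800_000, 10_000_000)
    print(f"bitext needed:  {low:.1f}x to {high:.1f}x the Bible")
```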


The Human Language Project: Bootstrapping resources for low-resource languages

Where we would really like to go

Comprehensive language resources
The Human Genome Project for languages

Made urgent by language endangerment

“Low resource” = digitally endangered
33% endangered, another 10% vulnerable
Half of the world’s languages have fewer than 6,000 speakers.
4% have gone extinct since 1950
Current rate of extinction: 2 languages/month
Projections: 50–90% loss by end of century


The Human Language Project: Resources

What do we mean by resources?

Target-language plain text
Monolingual dictionaries with parts of speech
Bilingual dictionaries
Morphological paradigms
Bitext
Treebank

All expressible in a simple data format

1 Pierre NNP 2
2 Vinken NNP 9
3 , , 2
4 61 CD 5
5 years NNS 6
6 old JJ 2
7 , , 2
8 will MD 9
9 join VB 0
10 the DT 11
11 board NN 9
12 as IN 15
13 a DT 15
14 nonexecutive JJ 15
15 director NN 9
16 Nov. NNP 9
17 29 CD 16
18 . . 9
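One way to read "a simple data format" is that every resource type becomes rows of delimited fields. The sketch below assumes tab-separated columns; only the treebank column layout comes from the slide, while the bilingual-dictionary layout (source word, part of speech, translation) and the helper names `to_tsv` and `from_tsv` are hypothetical.

```python
def to_tsv(rows):
    """Serialize rows of fields as tab-separated lines."""
    return "".join("\t".join(str(field) for field in row) + "\n"
                   for row in rows)


def from_tsv(text):
    """Read tab-separated lines back into lists of string fields."""
    return [line.split("\t") for line in text.splitlines()]


# Treebank rows: index, word, part of speech, governor index (as on the slide).
TREEBANK_ROWS = [
    (1, "Pierre", "NNP", 2),
    (2, "Vinken", "NNP", 9),
]

# Bilingual dictionary rows: a hypothetical layout chosen for illustration.
DICTIONARY_ROWS = [
    ("caballo", "nn", "horse"),
]

if __name__ == "__main__":
    print(to_tsv(TREEBANK_ROWS), end="")
    print(to_tsv(DICTIONARY_ROWS), end="")
```

Fields that themselves contain tabs or newlines would need escaping, which this sketch does not attempt.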


What Libraries Might Provide: Scanned books

Traditional language description

A grammar and a lexicon, maybe a text collection
Printed books, for human consumption
All that is available for most languages

Lars Skrefsrud, A grammar of the Santhal language (1873)
Arthur Macdonell, A Sanskrit-English dictionary (1893)


What Libraries Might Provide: Scanned books

Difficulties

OCR is a huge problem: poor or nonexistent for non-Roman scripts, poor for text with diacritics
Possible alternative: crowd-sourcing
Transcription is one thing, conversion to dataset is another


What Libraries Might Provide: Libraries as digital archives?

Biggest current provider: Linguistic Data Consortium

Heavy emphasis on speech, languages with overwhelming commercial and intelligence value (English, Chinese, Arabic, western Europe)
Expensive

Language documentation archives

Archiving of traditional field notes, recordings
Small scale, little support for or awareness of computational methods
Access is often highly restricted


What Libraries Might Provide: Libraries as digital archives?

What is lacking: archives that provide

Free public access
Comprehensiveness
Machine consumption

Broader issues: incentives/hindrances to producing resources

Recognition of dataset production as publication
Production of machine-oriented datasets from copyrighted print works
Ability to publish annotation of others’ datasets
