Upload
others
View
1
Download
0
Embed Size (px)
Citation preview
Overview Textual Data Research Example Human Language Project Libraries
Textual Corpora, Treebanks, and the HumanLanguage Project
Steven Abney
Department of LinguisticsComputer Science and Engineering
School of Information
2015 Mar 30
Overview Textual Data Research Example Human Language Project Libraries
Overview
What kinds of (annotated) text corpora do computationallinguists use?
A research example: using existing treebanks to bootstraptreebanks in other languages
Extrapolating into the future: building a comprehensivemultilingual dataset – a Human Genome Project for language
What a library might provide
Overview Textual Data Research Example Human Language Project Libraries
Textual DataPlain text
Language samples
Plain textPreferably large amounts (1M–1B words)In as many languages as possible
Some LDC titles:UN Parallel Text (Complete) [1994]European Language Newspaper Text [1995]Japanese Business News Text [1995]Spanish News Text [1995]Mandarin Chinese News Text [1995]CALLHOME Egyptian Arabic Transcripts [1997]Portuguese Newswire Text [1999]Korean Newswire [2000]HUB5 German Transcripts [2003]Chinese Gigaword [2003]Arabic Gigaword [2003]Czech Broadcast News Transcripts [2004]Web 1T 5-gram Version 1 [2006]Hungarian-English Parallel Text, Version 1.0 [2008]Web 1T 5-gram, 10 European Languages Version 1 [2009]
Aligned texts, multiple translations
Overview Textual Data Research Example Human Language Project Libraries
Textual dataLexicons
Wordnet: lexical database
hypernymy, synonymy, meronymy“Synset” = set of words that share a meaning
Multi-lingual wordnets – synsets that crosslanguages (translation equivalents)
Babelnet – combines Wordnets with Wikipediaand Wiktionary
Panlex – database constructed from thousandsof bilingual print dictionaries
entityphysical entityobjectwholeliving thingorganismanimalchordatevertebratemammalplacentalungulateodd-toed ungulateequinehorse
Overview Textual Data Research Example Human Language Project Libraries
Textual dataParts of speech
the Brown corpus:
Why?
Input for machine learningAutomatically train a system tolabel new textFirst step inlanguage-interpretation pipeline
The atFulton np-tlCounty nn-tlGrand jj-tlJury nn-tlsaid vbdFriday nran atinvestigation nnof inAtlanta’s np$recent jjprimary nnelection nnproduced vbd“ “no atevidence nn” ”that csany dti
Overview Textual Data Research Example Human Language Project Libraries
Textual dataNamed entities
El portavoz delORG︷ ︸︸ ︷
Consejo Polıtico Municipal de IU en
LOC︷ ︸︸ ︷Santander,
PER︷ ︸︸ ︷Ernesto Gomez de la Hera, explico hoy en con-ferencia de prensa que este . . .
El Oportavoz Odel OConsejo B-ORGPolıtico I-ORGMunicipal I-ORGde I-ORGIU I-ORGen OSantander B-LOC, OErnesto B-PERGomez I-PERde I-PERla I-PERHera I-PER, Oexplico Ohoy Oen Oconferencia Ode Oprensa Oque Oeste O
Overview Textual Data Research Example Human Language Project Libraries
Textual dataConstituent-structure treebanks
S
NP-SBJ
NP
NNP
Pierre
NNP
Vinken
,
,
ADJP
NP
CD
61
NNS
years
JJ
old
,
,
VP
MD
will
VP
VB
join
NP
DT
the
NN
board
PP-CLR
IN
as
NP
DT
a
JJ
nonexecutive
NN
director
NP-TMP
NNP
Nov.
CD
29
.
.
Penn Treebank
Overview Textual Data Research Example Human Language Project Libraries
Textual dataDependency treebanks
join
Vinken
Pierre old
, years
61
,
will board
the
director
as a nonexecutive
Nov.
29
1 Pierre NNP 22 Vinken NNP 93 , , 24 61 CD 55 years NNS 66 old JJ 27 , , 28 will MD 99 join VB 010 the DT 1111 board NN 912 as IN 1513 a DT 1514 nonexecutive JJ 1515 director NN 916 Nov. NNP 917 29 CD 1618 . . 9
Much more compact than constituent trees, equivalent forpractical purposes
Purpose: training a parser (interpretation, translation)
Overview Textual Data Research Example Human Language Project Libraries
Research ExampleHow can we learn a parser without a treebank?
Motivations
Linguistics: ultimate subject matter is human languagecapacity = ability to learn language.Google: access to data in all languagesDARPA: decision support in crisis management; informationextraction from news media and social media
Treebanks
I know of treebanks for 43 languages:ArabicArmenian, AncientBasqueBulgarianCatalanChineseCzechDanishDutchEnglishEnglish, Early Modern
English, MiddleEnglish, OldEstonianFinnishFrenchGermanGothicGreekGreek, AncientHebrewHindi-Urdu
HungarianIcelandicIndonesianItalianJapaneseKarukKoreanLatinPolishPortuguesePortuguese, Medieval
RomanianRussianSlavonic, Old ChurchSloveneSpanishSwedishThaiTurkishUgariticVietnamese
But there are 6800 languages (Ethnologue)
Overview Textual Data Research Example Human Language Project Libraries
Research ExampleLearning a dependency parser for a new language
Accuracy measure: percentage of governors correctly identified
Monolingual grammatical inference 47%Delexicalized transfer 52%Multi-source delexicalized transfer 55%Adaptation using bitexts 59%Adaptation using bitexts + language relationships 62%Supervised training 84%
McDonald et al 2012
Existing methods neglect bilingual dictionaries
Overview Textual Data Research Example Human Language Project Libraries
Research ExampleA challenge: low-resource languages
Making the methods practical for low-resource languages
Well-resourced languages: ∼ 50 = 0.7%E.g., Google translates 57 languagesAll but 18 are Indo-european, none are endangered.
Example: machine translation
Current methods require 2–10 million words of bitext fortrainingLargest source of bitext is the Bible: 0.8 M words, 459languages (7%).New Testament: 0.1 M words, 1213 languages (18%)
Overview Textual Data Research Example Human Language Project Libraries
The Human Language ProjectBootstrapping resources for low-resource languages
Where we would really like to go
Comprehensive language resourcesThe Human Genome Project for languages
Made urgent by language endangerment
“Low resource” = digitally endangered33% endangered, another 10% vulnerableHalf of the world’s languages have fewer than 6,000 speakers.4% have gone extinct since 1950Current rate of extinction: 2 languages/monthProjections: 50–90% loss by end of century
Overview Textual Data Research Example Human Language Project Libraries
The Human Language ProjectResources
What do we mean by resources?
Target-language plain textMonolingual dictionaries with parts ofspeechBilingual dictionariesMorphological paradigmsBitextTreebank
All expressible in a simple data format
1 Pierre NNP 22 Vinken NNP 93 , , 24 61 CD 55 years NNS 66 old JJ 27 , , 28 will MD 99 join VB 010 the DT 1111 board NN 912 as IN 1513 a DT 1514 nonexecutive JJ 1515 director NN 916 Nov. NNP 917 29 CD 1618 . . 9
Overview Textual Data Research Example Human Language Project Libraries
What Libraries Might ProvideScanned books
Traditional language description
A grammar and a lexicon, maybe a text collectionPrinted books, for human consumptionAll that is available for most languages
Lars Skrefsrud, A grammar of the Santhal language (1873)Arthur Macdonell, A Sanskrit-English dictionary (1893)
Overview Textual Data Research Example Human Language Project Libraries
What Libraries Might ProvideScanned books
Difficulties
OCR is a huge problem – poor or nonexistent for non-romanscripts, poor for text with diacriticsPossible alternative: crowd-sourcingTranscription is one thing, conversion to dataset is another
Overview Textual Data Research Example Human Language Project Libraries
What Libraries Might ProvideLibraries as digital archives?
Biggest current provider: Linguistic Data Consortium
Heavy emphasis on speech, languages with overwhelmingcommercial and intelligence value (English, Chinese, Arabic,western Europe)Expensive
Language documentation archives
Archiving of traditional field notes, recordingsSmall scale, little support for or awareness of computationalmethodsAccess is often highly restricted
Overview Textual Data Research Example Human Language Project Libraries
What Libraries Might ProvideLibraries as digital archives?
What is lacking: archives that provide
Free public accessComprehensivenessMachine consumption
Broader issues: incentives/hindrances to producing resources
Recognition of dataset production as publicationProduction of machine-oriented datasets from copyrightedprint worksAbility to publish annotation of others’ datasets