Low-resource NLP
Language endangerment
How (many) endangered?
Language death
• Every 10 days or so a language dies
• Most disappear unnoticed
• Killer languages
• Worldwide phenomenon
• What’s lost when a language dies?
Low-resource languages and the military
Sample initiatives
• DARPA LORELEI (Low Resource Languages for Emergent Incidents)
  • situational awareness by identifying elements of information in foreign language and English sources, such as topics, names, events, sentiment and relationships.
• IARPA Babel
  • agile and robust speech recognition technology that can be rapidly applied to any human language
• DARPA BOLT
  1) allowing English-speakers to understand foreign-language sources of all genres, including chat, messaging and informal conversation;
  2) providing English-speakers the ability to quickly identify targeted information in foreign-language sources using natural-language queries; and
  3) enabling multi-turn communication in text and speech with non-English speakers.
  If successful, BOLT will deliver all capabilities free from domain or genre limitations.
DoD MURI (one effort)
NIST LoReHLT
New evaluation series aimed to advance HLT that provide rapid and effective response to emerging incidents where the language resources are very limited. LoReHLT16 plans to offer three evaluation tasks: machine translation (MT), named entity recognition (NER), and situation frames (SF).
Highlights:
- Three evaluation tasks: MT, NER, and SF
- Two training conditions: constrained (required) and unconstrained
- Surprise language evaluation
- Three evaluation checkpoints to gauge performance based on training resources given
2016 Schedule:
Feb 19 - May 16: Registration period (deadline extended)
Jun 2 - 8: Dry run period
Jun 29: Encrypted evaluation data released
Jul 6: Surprise language announced and evaluation begins
Jul 13: Checkpoint 1 due
Jul 20: Checkpoint 2 due
Aug 3: Checkpoint 3 due
Aug 28 - 29: Post evaluation workshop (Nashua, New Hampshire, US)
For more information about the LoReHLT16 evaluation, please see the evaluation plan at http://www.nist.gov/itl/iad/mig/lorehlt16.cfm. If you have any questions, contact us at [email protected].
Feel free to forward this message to your colleagues who may find the evaluation of interest.
Best Regards,
NIST LoReHLT team
DARPA BOLT (2011-2016)
• Operationalize language processing of all kinds (mostly for DoD)
• Machine translation, sentiment analysis, dialect recognition, prevarication detection, etc.
• Beyond the current paradigms, language resources (cf. trained on newswire)
• MT and CLIR (A), HCI English+Arabic (B), ST English+Arabic (C), Arabic dialects (D)
• Activity E: language, agents, and robotics
BOLT Activity E
• Grounded language acquisition by robots
• Deep semantics, visual+tactile input, experiential learning of objects, actions, and consequences
• Acquires language via grounding, hypothesizing, automated reasoning
• Human guides acquisition via situated, interactive instruction
• Robot demonstrates understanding via performance
Useful resources
• ACM TALLIP
• LREC
• NIST
• Occasional workshops at ACL, EMNLP, HLT
• LDC, ELRA SIG
• WALS
• Ethnologue
• NSF DEL
Syriac corpus annotation
• 100,000 words
• 15 years
• 10,000,000 words
• 1,500 years??
• Community of scholar-annotators
• Issues: who? which annotations? user interface? how trustworthy? annotation cost? return on investment?
Attribute    Value     English gloss
Word Token   LMLCCON   to your king
Prefix       L         to
Suffix       CON       your (masc. plural)
Stem         MLC       inflected “king”
Baseform     MLCA      noun form of “king”
Root         MLC       realm, kingdom, queen, promise, counsel, deliberate, reign
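One way to hold such an annotation in code is a small record type with a consistency check. This is only a sketch: the class name, field names, and the check are illustrative assumptions, and it assumes the prefix, stem, and suffix concatenate without spelling changes, which holds for this example but will not hold in general.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenAnnotation:
    """One annotated word token, mirroring the attributes in the table."""
    token: str      # full word token in transliteration
    prefix: str
    stem: str
    suffix: str
    baseform: str
    root: str
    gloss: str

    def segmentation_is_consistent(self) -> bool:
        # Assumption: affixes attach to the stem by plain concatenation.
        return self.prefix + self.stem + self.suffix == self.token


example = TokenAnnotation(
    token="LMLCCON", prefix="L", stem="MLC", suffix="CON",
    baseform="MLCA", root="MLC", gloss="to your king",
)
print(example.segmentation_is_consistent())
```

A check like this could flag annotator slips cheaply, which bears on the "how trustworthy?" issue raised above.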
Corpus annotation
[Diagram: supervised learning. Unannotated data → Annotate → Annotated data → Train model → Model]
[Diagram: active learning. Unannotated data → Score & rank instances → Best instance → Annotate → Annotated data → Update model → Model]
[Diagram: cost. The active learning loop with three costs marked: (1) selection cost, (2) annotation cost, (3) training cost]
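The active learning loop in the diagrams can be sketched in a few lines. Everything here is an illustrative assumption rather than the system behind the slides: the "model" is a toy one-dimensional threshold, the scoring rule is plain uncertainty sampling (pick the instance closest to the threshold), and `oracle` stands in for the human annotator.

```python
from typing import Callable, List, Tuple

Labeled = List[Tuple[float, int]]


def train(labeled: Labeled) -> float:
    """Toy model: a threshold halfway between the two class means."""
    pos = [x for x, y in labeled if y == 1]
    neg = [x for x, y in labeled if y == 0]
    if not pos or not neg:
        return 0.5  # fallback until both classes have been seen
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2


def active_learning(pool: List[float], oracle: Callable[[float], int],
                    seed: Labeled, rounds: int) -> Tuple[float, Labeled]:
    labeled = list(seed)
    unlabeled = list(pool)
    threshold = train(labeled)
    for _ in range(rounds):
        if not unlabeled:
            break
        # Score and rank instances, take the best one (selection cost).
        best = min(unlabeled, key=lambda x: abs(x - threshold))
        unlabeled.remove(best)
        # Ask the annotator for its label (annotation cost).
        labeled.append((best, oracle(best)))
        # Update the model (training cost).
        threshold = train(labeled)
    return threshold, labeled


def oracle(x: float) -> int:
    """Hypothetical annotator: the true class boundary is at 0.6."""
    return int(x >= 0.6)


pool = [i / 20 for i in range(1, 20)]
seed = [(0.0, 0), (1.0, 1)]
threshold, labeled = active_learning(pool, oracle, seed, rounds=5)
print(threshold, [x for x, _ in labeled[2:]])
```

The queried instances cluster around the class boundary, which is the point of the loop: annotation effort goes where the model is least sure.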
Annotation time vs. pre-labeling accuracy
Language revitalization
tiʔəʔ dbad
kʷi dscapaʔ (Pete Lomsdalen)
kʷsi sc'abiqʷ Kate Brown
kʷsi ʔəkʷyíqʷ Jennie Patense
kʷsi dscəpyíqʷ Mary Patense (Friday Consauk)
Father Eugène Casimir Chirouse, OMI
• Born May 8, 1821 in France (near Lyon)
• 1847: to WA Territory via Oregon Trail
• Only white clergy for entire Puget Sound area
• Lived among Lushootseed tribes 1857-1878
• Documented lifestyle, work, languages (especially)
• translated religious materials
• compiled dictionary, lexicons, grammar fragments
• “The Apostle of the Puget Sound Indians”
• Fondly remembered to this day
A Chirouse manuscript
Visualizing textual versions
Sample prayer (portion)
• Misereatur
2 Debalh kwe tlas oshabētem ato de Sherhk-Siam!
2 dibəɬ kʷi ɬasʔusabitəb ʔə tudiʔ səq siʔab
3 lilzirh kwēhk twal mok stam,
3 liɬʤixʷ qʷiqʼʷ dxʷʔal bəkʼʷ stab.
4 klob orhtsedelh ku …lētsǐētobolh ateto skwādăchĕlh!
4 ƛ'ubəxʷ cədiɬ kʷi ʔubalicyitubuɬ ʔə tə tudᶻəkʼʷadadcəɬ.
5 alh kwe tlas orhtobolh twal kwi tlo Darhulēchelh skwākhēd!
5 haʔɬ kʷi ɬusʔuxʷtubuɬ dxʷʔal kʷi ɬudxʷhəliʔcəɬ ckʼʷaqid.
6 Klob assista!
6 ƛ'ub ʔəsʔistəʔ.
Sample catechism Q&A
• 3. Abil gwat asra'tleldhu kwi gwas stsakus, gula asared kwi gwas krwebitsuts dhual stsaku?
3. ʔəbilʼ gʷat ʔəsxaƛ'ildxʷ kʷi gʷəscʼaʔkʷs, gʷəl ʔəsʔəxid kʷi gwəsqʷibicuc dxʷʔal cʼaʔkʷ.
Abil gwat asra'tleldhu kwi gwas stsakus, gula tlo haidhu,
ʔəbilʼ gʷat ʔəsxaƛ'ildxʷ kʷi gʷəscʼaʔkʷs, gʷəl ɬuhaydxʷ,
gula tlo tleldhu gwalh Sherk-Siam sgwadgwad,
gʷəl ɬutɬildxʷ gʷəɬ səq siʔab sgʷadgʷad,
gula tlo rahab dhual ku sgwas tskwadads,
gʷəl ɬuxahəb dxʷʔal kʷi sgʷaʔs ʤəkʼʷadads,
gula tlo whobed aku tsoku ku boku sa;
gʷəl ɬuxʷəbəd ʔə kʷi cukʷ kʷi bəkʷ saʔ;
gula tlo tibitsut dhual kwi tlo tskwakred tlhdahu halhs.
gʷəl ɬutibicut dxʷʔal kʷi ɬuckʼʷaqid tɬdəxʷ haʔɬs.
Chirouse terms re-introduced
• dxʷsyayus (God, lit. “hard worker”)
• tul šəq skayuʔ (saints, lit. “dead people from the sky”)
• liǰub (devil, cf. French “le diable”)
• sxʷətagʷəb (Milky Way, lit. “falls in the middle”)
• ʔabsčups sčusəd (comet, lit. “has+fire+star”)
• qʷəcxaʔ (lark, cf. Upper Chehalis)
• paləkʷ (pig, lit. “dig up, root up”)
Chirouse re verbs, culture
• “The following verbs express some of the principal wild Indian habits (as they are to be well remembered)”
• Sample verbs
  • To run a quill through the nostril, for ornament
  • To flatten the head of a baby
  • To blow on a sick to frighten the evil spirit
  • To louse a friend and eat the lice, to share one's love
  • To take a steam bath underground
1) The source sentences
• Randomly chosen from several sources• Conversational turns from published narratives (Hilbert)• Sample sentences from pedagogical grammars (Hess/Hilbert)• Example usage sentences from dictionary (Bates/Hess/Hilbert)
• Typed or scanned into Romanized form• No systematically balanced coverage
2) Parse the sentences
• Parse the words from each sentence using the morphology engine described earlier
• Parse the sentences using the link grammar parser described earlier
Preprocessing
1) Romanize the sentence.
tuLildExW kWi ?aciLtalbixW ?E kWi lEpEskWi? .
2) Morphologically parse the sentence.
tu+ Lil +d +ExW kWi ?aciLtal =bixW ?E kWi lEpEskWi? .
Lushootseed morphological parse
PC-KIMMO>recognize LubElEskWaxWyildutExWCEL
Lu+bE+lEs+^kWaxW+yi+il+d+ut+ExW+CEL
Fut+ANEW+PrgSttv+help+YI+il+Trx+Rfx+Inc+our
Parse tree (indentation marks depth; each leaf shows category, morpheme, gloss):
Word
  NWord
    VWord
      VTnsAsp
        FUT  Lu+  Fut+
        VWord
          VAsp0
            ANEW  bE+  ANEW+
            VWord
              VAsp2
                PROGRSTAT  lEs+  ProgrStatv+
                VWord
                  VFrame
                    VFrame
                      VFrame
                        VFrame
                          VFrame
                            VFrame
                              ROOT  ^kWaxW  help
                            VSUFYI  +yi  +yi
                          ACHV  +il  +il
                        VSUFTRX  +d  +Trx
                      VSUFRFX  +ut  +Rfx
                    NOW  +ExW  +Incho
    DET2  +CEL  +our
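The two result lines of the PC-KIMMO output above (morphemes, then glosses) can be paired up mechanically. A minimal sketch, assuming both lines use "+" purely as a separator and that "^" marks the root, as this example suggests; the function name is made up for illustration.

```python
from typing import List, Tuple


def align_morphemes(surface: str, glosses: str) -> List[Tuple[str, str]]:
    """Pair each morpheme with its gloss; both strings are '+'-separated."""
    morphs = surface.split("+")
    tags = glosses.split("+")
    if len(morphs) != len(tags):
        raise ValueError("morpheme and gloss counts differ")
    return list(zip(morphs, tags))


pairs = align_morphemes(
    "Lu+bE+lEs+^kWaxW+yi+il+d+ut+ExW+CEL",
    "Fut+ANEW+PrgSttv+help+YI+il+Trx+Rfx+Inc+our",
)
# Assumption: the morpheme carrying "^" is the root.
root = next(m for m, _ in pairs if m.startswith("^"))
for morph, gloss in pairs:
    print(f"{morph:8} {gloss}")
```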
Sample LG Lushootseed parse
linkparser> tu+ Lil +d +ExW kWi ?aciLtalbixW ?E kWi lEpEskWi?.
Found 15 linkages (15 had no P.P. violations)
Linkage 1, cost vector = (UNUSED=0 DIS=4 AND=0 LEN=19)

    +---------------------------Xp--------------------------+
    |          +-------------EX-------------+               |
    |          +---------SOs--------+       |               |
    +----Wd----+--ASP--+            |       +----P----+     |
    |      +-PT+-TX+   |   +---DT---+       |  +--DT--+     |
    |      |   |   |   |   |        |       |  |      |     |
LEFT-WALL tu+ Lil +d +ExW kWi ?aciLtalbixW ?E kWi lEpEskWi? .
Press RETURN for the next linkage.
Parsing Lushootseed
linkparser> bE+ Lil +t +Eb +ExW ?ElgWE? ?E ti?E? bE+ ?Es+ istE?.
++++Time 0.07 seconds (0.20 total)
Found 2 linkages (2 had no P.P. violations)
Linkage 1, cost vector = (UNUSED=0 DIS=2 AND=0 LEN=28)

    +----------------------------Xp---------------------------+
    |          +-----------EM----------+                      |
    |          +--------PA-------+     +---------P--------+   |
    |          +----ASP----+     |     |   +------DT------+   |
    +----Wd----+--MD--+    |     |     |   |    +----AD---+   |
    |      +-AD+-TX+  |    |     |     |   |    |    +-STV+   |
    |      |   |   |  |    |     |     |   |    |    |    |   |
LEFT-WALL bE+ Lil +t +Eb +ExW ?ElgWE? ?E ti?E? bE+ ?Es+ istE? .
Another parse
linkparser> ?u+ da?a +d ?ElgWE? ?E kWi s+ gWistalb ti?E? SukWE?.
Found 1 linkage (1 had no P.P. violations)
Unique linkage, cost vector = (UNUSED=0 DIS=4 AND=0 LEN=24)

    +----------------------------Xp---------------------------+
    |           +-------------------SOo-------------------+   |
    |           +------EX------+------P-----+             |   |
    +-----Wd----+---SOs--+     |  +----DT---+             |   |
    |      +-PRF+-TX+    |     |  |   +--NZ-+      +--DT--+   |
    |      |    |   |    |     |  |   |     |      |      |   |
LEFT-WALL ?u+ da?a +d ?ElgWE? ?E kWi s+ gWistalb ti?E? SukWE? .
Another parse
linkparser> q'ili +t +Eb +ExW ?E ti?E? s+ ?il =aXad ?E ti?E? captain.
Found 11 linkages (11 had no P.P. violations)
Linkage 3, cost vector = (UNUSED=0 DIS=6 AND=0 LEN=23)

    +------------------------------Xp------------------------------+
    |       +-------EM-------+                                     |
    |       +-----ASP----+   +-----P-----+                         |
    |       +---MD--+    |   |   +---DT--+----MV---+-----P----+    |
    +---Wd--+-TX-+  |    |   |   |    +NZ+-LX-+    |   +--DT--+    |
    |       |    |  |    |   |   |    |  |    |    |   |      |    |
LEFT-WALL q'ili +t +Eb +ExW ?E ti?E? s+ ?il =aXad ?E ti?E? captain .
3) Load parses into database
• The link structure for each word pair can be loaded into a database record
• This allows use of database manipulation techniques
• Querying, or asking about, the contents is a commonly performed task with databases
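A minimal sketch of the loading step, using SQLite from Python. The table name `gramruth2` and the columns `SentNum`, `LHLex`, `RHLex` follow the sample SQL query in this deck; the `LinkType` column, the sentence number, and the choice of SQLite are assumptions made for illustration. The link list is read off the first sample linkage (tu+ Lil +d +ExW ...).

```python
import sqlite3

# (link label, left-hand word, right-hand word), one tuple per link.
LINKS = [
    ("Xp", "LEFT-WALL", "."),
    ("EX", "Lil", "?E"),
    ("SOs", "Lil", "?aciLtalbixW"),
    ("Wd", "LEFT-WALL", "Lil"),
    ("ASP", "Lil", "+ExW"),
    ("P", "?E", "lEpEskWi?"),
    ("PT", "tu+", "Lil"),
    ("TX", "Lil", "+d"),
    ("DT", "kWi", "?aciLtalbixW"),
    ("DT", "kWi", "lEpEskWi?"),
]

conn = sqlite3.connect(":memory:")
# Hypothetical schema: one record per linked word pair.
conn.execute(
    "CREATE TABLE gramruth2 ("
    "SentNum INTEGER, LinkType TEXT, LHLex TEXT, RHLex TEXT)"
)
conn.executemany(
    "INSERT INTO gramruth2 (SentNum, LinkType, LHLex, RHLex) "
    "VALUES (1, ?, ?, ?)",
    LINKS,
)
conn.commit()

# Once loaded, ordinary SQL applies, e.g. link frequencies by type.
by_type = conn.execute(
    "SELECT LinkType, COUNT(*) FROM gramruth2 "
    "GROUP BY LinkType ORDER BY COUNT(*) DESC, LinkType"
).fetchall()
print(by_type[0])
```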
4) Query over the analyses
• Once in database format, the parses can be queried by a user using SQL:
  • Which predicates have both a negative and an aspectual marker?
  • Which sentences have two oblique complements?
  • Find questions with past tense.
  • Which words are the most complex?
Sample structures
• Longest link
linkparser> huy lEk'W +t +Eb +ExW ?E ti?E? dxWsT'alb ti?E? lEpEskWi.

    +-----------------------------------Xp----------------------------------+
    |              +-------------------------PA------------------------+    |
    |              +--------EM-------+                                 |    |
    |              +-----ASP-----+   |                                 |    |
    |              +---MD---+    |   +-------P------+                  |    |
    +---Wg--+--Wd--+--TX-+  |    |   |    +----DT---+         +---DT---+    |
    |       |      |     |  |    |   |    |         |         |        |    |
LEFT-WALL huy.a lEk'W.r +t +Eb +ExW ?E ti?E?.d dxWsT'alb.r ti?E?.d lEpEskWi .
huy ləkʷtəbəxʷ ʔə tiʔəʔ dxʷsƛ'alb tiʔəʔ ləpəskʷi
• Most complex predicate
T’u+ tu+ s+ takW +yi +Eb +s
ƛutustakʷyitəb
Lexical suffixes frequency?
10 =bixW
3 =igWEd
2 =ELdat
2 =a?kW
2 =aCi?
2 =aXad
2 =ali
1 =abac
1 =al?txW
1 =alikW
1 =alus
1 =aq
1 =gWas
1 =gWiL
1 =gWil
1 =i
1 =iC
1 =qid
1 =ucid
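A tally like this can be produced directly from the morphologically parsed sentences. The sketch below assumes, as in the parses shown earlier, that lexical suffixes are the whitespace-separated tokens beginning with "="; the function name and the two-sentence sample are illustrative only and will not reproduce the counts above.

```python
from collections import Counter
from typing import Iterable


def lexical_suffix_counts(parsed_sentences: Iterable[str]) -> Counter:
    """Count tokens that start with '=' across parsed sentences."""
    counts: Counter = Counter()
    for sentence in parsed_sentences:
        for token in sentence.split():
            if token.startswith("="):
                counts[token] += 1
    return counts


sample = [
    "tu+ Lil +d +ExW kWi ?aciLtal =bixW ?E kWi lEpEskWi? .",
    "q'ili +t +Eb +ExW ?E ti?E? s+ ?il =aXad ?E ti?E? captain .",
]
counts = lexical_suffix_counts(sample)
for suffix, n in counts.most_common():
    print(n, suffix)
```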
Corpus queries
• Which sentences have a negative and an aspectual marker?
• Which sentences have 2 oblique complements?
• Find questions with past tense.
• Bitext
  • Find all sentences where a given word is translated into English some way.
  • Show English passive middle in Lushootseed
• Frequencies of links
  • Which is more common: past or present tense?
  • Find verbs with out-of-control suffixes
  • List all reduplicated forms by pattern
Sample SQL query
• List sentences with respect to the complexity of their predicates
SELECT SentNum, COUNT(SentNum)
FROM gramruth2
WHERE (LHLex LIKE "*+" OR LHLex LIKE "*=")
   OR (RHLex LIKE "+*" OR RHLex LIKE "=*")
GROUP BY SentNum;
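The same query can be tried on a toy table. This sketch makes several assumptions: SQLite stands in for the original database, the `*` wildcards are read as "any string" and written as `%`, the `ORDER BY` clause is an addition so the most complex sentences come first, and the rows are a handful of links from two of the sample parses with made-up sentence numbers.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE gramruth2 (SentNum INTEGER, LHLex TEXT, RHLex TEXT)"
)
rows = [
    # Sentence 1: tu+ Lil +d +ExW kWi ?aciLtalbixW ...
    (1, "tu+", "Lil"), (1, "Lil", "+d"), (1, "Lil", "+ExW"),
    (1, "kWi", "?aciLtalbixW"),
    # Sentence 2: q'ili +t +Eb +ExW ?E ti?E? s+ ?il =aXad ...
    (2, "q'ili", "+t"), (2, "q'ili", "+Eb"), (2, "q'ili", "+ExW"),
    (2, "s+", "?il"), (2, "?il", "=aXad"), (2, "ti?E?", "?il"),
]
conn.executemany("INSERT INTO gramruth2 VALUES (?, ?, ?)", rows)

# Count, per sentence, the links that touch a prefix (ends in "+"),
# a suffix (starts with "+"), or a lexical suffix (marked with "=").
QUERY = """
SELECT SentNum, COUNT(SentNum) AS AffixLinks
FROM gramruth2
WHERE (LHLex LIKE '%+' OR LHLex LIKE '%=')
   OR (RHLex LIKE '+%' OR RHLex LIKE '=%')
GROUP BY SentNum
ORDER BY AffixLinks DESC
"""
result = conn.execute(QUERY).fetchall()
print(result)  # [(2, 5), (1, 3)]
```

Links between two free-standing words, such as the determiner links, are not counted, so the count tracks how much affixation each sentence carries.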
Sample statistics (tokens)
• # sentences: 500
• # morphemes: 2954
  • suffixes: 623
  • prefixes: 607
  • lexical suffixes: 58
• # words: 1625
• # S’s with only monomorphemic words: 43
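A few ratios follow directly from these token counts. The arithmetic is a sketch using only the figures above; the variable names are illustrative.

```python
sentences = 500
words = 1625
morphemes = 2954
monomorphemic_sentences = 43

words_per_sentence = words / sentences
morphemes_per_word = morphemes / words
monomorphemic_share = monomorphemic_sentences / sentences

print(f"{words_per_sentence:.2f} words per sentence")      # 3.25
print(f"{morphemes_per_word:.2f} morphemes per word")      # 1.82
print(f"{monomorphemic_share:.1%} of sentences have only monomorphemic words")  # 8.6%
```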
Sample statistics (links, partial)
777 punctuation
763 determiner
629 PP
618 subject
528 PP-object
380 aspectual
323 transitive
228 middle
205 stative
193 nominalizer
188 past
122 adverbial (sentential)
99 achievement
97 perfective
85 possessive
82 future
73 lexical suffix
66 oblique
61 habitual
53 subordinating
51 adverbial (predicate)
45 passive
39 dubitative
32 benefactive
29 progressive
29 causative
15 object
10 adjective
8 determiner (feminine)
7 partitive
2 reflexive
Living the language