Low-resource NLP

Page 1: Low-resource NLP

Low-resource NLP

Page 2: Low-resource NLP

2

Language endangerment

Page 3: Low-resource NLP

3

How (many) endangered?

Page 4: Low-resource NLP

Language death

• Every 10 days or so a language dies
• Most disappear unnoticed
• Killer languages
• Worldwide phenomenon
• What’s lost when a language dies?

Page 5: Low-resource NLP

Low-resource languages and the military

Page 6: Low-resource NLP

Sample initiatives

• DARPA LORELEI (Low Resource Languages for Emergent Incidents)
  • situational awareness by identifying elements of information in foreign language and English sources, such as topics, names, events, sentiment and relationships.
• IARPA Babel
  • agile and robust speech recognition technology that can be rapidly applied to any human language
• DARPA BOLT
  1) allowing English-speakers to understand foreign-language sources of all genres, including chat, messaging and informal conversation;
  2) providing English-speakers the ability to quickly identify targeted information in foreign-language sources using natural-language queries; and
  3) enabling multi-turn communication in text and speech with non-English speakers.
  If successful, BOLT will deliver all capabilities free from domain or genre limitations.

Page 7: Low-resource NLP

DoD MURI (one effort)

Page 8: Low-resource NLP
Page 9: Low-resource NLP
Page 10: Low-resource NLP

NIST LoReHLT

New evaluation series aimed to advance HLT that provide rapid and effective response to emerging incidents where the language resources are very limited. LoReHLT16 plans to offer three evaluation tasks: machine translation (MT), named entity recognition (NER), and situation frames (SF).

Highlights:

- Three evaluation tasks: MT, NER, and SF
- Two training conditions: constrained (required) and unconstrained
- Surprise language evaluation
- Three evaluation checkpoints to gauge performance based on training resources given

2016 Schedule:

Feb 19 - May 16: Registration period (deadline extended)
Jun 2 - 8: Dry run period
Jun 29: Encrypted evaluation data released
Jul 6: Surprise language announced and evaluation begins
Jul 13: Checkpoint 1 due
Jul 20: Checkpoint 2 due
Aug 3: Checkpoint 3 due
Aug 28 – 29: Post evaluation workshop (Nashua, New Hampshire, US)

For more information about the LoReHLT16 evaluation, please see the evaluation plan at http://www.nist.gov/itl/iad/mig/lorehlt16.cfm. If you have any questions, contact us at [email protected].

Feel free to forward this message to your colleagues who may find the evaluation of interest.

Best Regards,

NIST LoReHLT team

Page 11: Low-resource NLP

DARPA BOLT (2011-2016)

• Operationalize language processing of all kinds (mostly for DoD)
• Machine translation, sentiment analysis, dialect recognition, prevarication detection, etc.
• Beyond the current paradigms, language resources (cf. trained on newswire)
• MT and CLIR (A), HCI English+Arabic (B), ST English+Arabic (C), Arabic dialects (D)

• Activity E: language, agents, and robotics

Page 12: Low-resource NLP

BOLT Activity E

• Grounded language acquisition by robots
• Deep semantics, visual+tactile input, experiential learning of objects, actions, and consequences
• Acquires language via grounding, hypothesizing, automated reasoning
• Human guides acquisition via situated, interactive instruction
• Robot demonstrates understanding via performance

Page 14: Low-resource NLP

Syriac corpus annotation

Page 15: Low-resource NLP

• 100,000 words
• 15 years

• 10,000,000 words
• 1,500 years??

• Community of scholar-annotators
• Issues: who? which annotations? user interface? how trustworthy? annotation cost? return on investment?
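
The scaling worry on this slide is proportional arithmetic. A minimal sketch, assuming the annotation rate stays constant and that annotators work in parallel with no coordination overhead; the `annotators` parameter and the community size of 100 are hypothetical and do not come from the slides:

```python
def years_to_annotate(target_words, done_words=100_000, done_years=15, annotators=1):
    """Extrapolate annotation time at a constant per-annotator rate."""
    words_per_year = done_words / done_years  # rate implied by the slide's figures
    return target_words / (words_per_year * annotators)

# One annotator at the observed rate: roughly 1,500 years.
print(round(years_to_annotate(10_000_000)))
# A hypothetical community of 100 scholar-annotators: roughly 15 years.
print(round(years_to_annotate(10_000_000, annotators=100)))
```

The linear model ignores the issues listed above (trust, interface, cost), so it is best read as a lower bound on effort rather than a forecast.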

Page 16: Low-resource NLP

Attribute    Value     English gloss
Word Token   LMLCCON   to your king
Prefix       L         to
Suffix       CON       your (masc. plural)
Stem         MLC       inflected “king”
Baseform     MLCA      noun form of “king”
Root         MLC       realm, kingdom, queen, promise, counsel, deliberate, reign
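
The prefix/stem/suffix rows above can be illustrated with a toy affix stripper. This is only a sketch: the affix inventories are hypothetical and contain just the one prefix and one suffix shown in the table, and naive stripping like this is not the annotation method used in the project.

```python
# Hypothetical affix inventories: only the entries shown in the table above.
PREFIXES = {"L": "to"}
SUFFIXES = {"CON": "your (masc. plural)"}

def segment(token):
    """Split a transliterated token into (prefix, stem, suffix) by affix stripping."""
    prefix = next((p for p in PREFIXES if token.startswith(p)), "")
    rest = token[len(prefix):]
    # Require a non-empty stem to remain after removing the suffix.
    suffix = next((s for s in SUFFIXES if rest.endswith(s) and len(rest) > len(s)), "")
    stem = rest[:len(rest) - len(suffix)] if suffix else rest
    return prefix, stem, suffix

print(segment("LMLCCON"))
```

Greedy stripping overgenerates (any stem that happens to begin with L loses its first letter), which is one reason each analysis still needs a human annotator to confirm it.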

Page 17: Low-resource NLP

Corpus annotation

Page 18: Low-resource NLP

[Diagram] Supervised learning: Unannotated Data → Annotate → Annotated Data → Train Model → Model

Page 19: Low-resource NLP

[Diagram] Active learning: Unannotated Data → Score & Rank Instances → Best Instance → Annotate → Annotated Data → Update Model → Model (the updated model then rescores the remaining unannotated data)
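
The active learning loop can be sketched on a toy problem. The slides do not say how instances are scored, so this example assumes uncertainty sampling (pick the instance closest to the current decision boundary); the 1-D threshold model, the `oracle` function standing in for a human annotator, and all parameter values are hypothetical.

```python
def train(labeled):
    """Fit a 1-D threshold: midpoint between the largest negative and smallest positive."""
    neg = [x for x, y in labeled if y == 0]
    pos = [x for x, y in labeled if y == 1]
    return (max(neg) + min(pos)) / 2

def active_learning(pool, oracle, seed, budget):
    labeled = [(x, oracle(x)) for x in seed]      # initial annotated data
    pool = [x for x in pool if x not in seed]     # remaining unannotated data
    model = train(labeled)
    for _ in range(budget):
        if not pool:
            break
        # Score & rank: the instance nearest the boundary is the most uncertain.
        best = min(pool, key=lambda x: abs(x - model))
        pool.remove(best)
        labeled.append((best, oracle(best)))      # annotate the best instance
        model = train(labeled)                    # update the model
    return model, labeled

oracle = lambda x: int(x >= 0.62)                 # stand-in for a human annotator
pool = [i / 100 for i in range(101)]
model, labeled = active_learning(pool, oracle, seed=[0.0, 1.0], budget=8)
print(model, len(labeled))
```

With only ten labels in total the learned threshold lands close to the true boundary, whereas labeling the pool in random order would typically need many more annotations to get as close.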

Page 20: Low-resource NLP

[Diagram] Cost in the active learning loop:
1. Selection cost (Score & Rank Instances)
2. Annotation cost (Annotate)
3. Training cost (Update Model)

Page 21: Low-resource NLP

Annotation time vs. pre-labeling accuracy

Page 22: Low-resource NLP

Language revitalization

Page 23: Low-resource NLP

23

tiʔəʔ dbad

Page 24: Low-resource NLP

24

kʷi dscapaʔ (Pete Lomsdalen)

Page 25: Low-resource NLP

25

kʷsi sc'abiqʷ Kate Brown

Page 26: Low-resource NLP

26

kʷsi ʔəkʷyíqʷ Jennie Patense

Page 27: Low-resource NLP

27

kʷsi dscəpyíqʷ Mary Patense (Friday Consauk)

Page 28: Low-resource NLP

Father Eugène Casimir Chirouse, OMI

• Born May 8, 1821 in France (near Lyon)
• 1847: to WA Territory via Oregon Trail
• Only white clergy for entire Puget Sound area
• Lived among Lushootseed tribes 1857-1878
• Documented lifestyle, work, languages (especially)
  • translated religious materials
  • compiled dictionary, lexicons, grammar fragments
• “The Apostle of the Puget Sound Indians”
• Fondly remembered to this day

Page 29: Low-resource NLP

29

A Chirouse manuscript

Page 30: Low-resource NLP

Visualizing textual versions

30

Page 31: Low-resource NLP

Sample prayer (portion)

• Misereatur

2 Debalh kwe tlas oshabētem ato de Sherhk-Siam!
2 dibəɬ kʷi ɬasʔusabitəb ʔə tudiʔ səq siʔab

3 lilzirh kwēhk twal mok stam,
3 liɬʤixʷ qʷiqʼʷ dxʷʔal bəkʼʷ stab.

4 klob orhtsedelh ku …lētsǐētobolh ateto skwādăchĕlh!
4 ƛ'ubəxʷ cədiɬ kʷi ʔubalicyitubuɬ ʔə tə tudᶻəkʼʷadadcəɬ.

5 alh kwe tlas orhtobolh twal kwi tlo Darhulēchelh skwākhēd!
5 haʔɬ kʷi ɬusʔuxʷtubuɬ dxʷʔal kʷi ɬudxʷhəliʔcəɬ ckʼʷaqid.

6 Klob assista!
6 ƛ'ub ʔəsʔistəʔ.

31

Page 32: Low-resource NLP

Sample catechism Q&A

• 3. Abil gwat asra'tleldhu kwi gwas stsakus, gula asared kwi gwas krwebitsuts dhual stsaku?
3. ʔəbilʼ gʷat ʔəsxaƛ'ildxʷ kʷi gʷəscʼaʔkʷs, gʷəl ʔəsʔəxid kʷi gwəsqʷibicuc dxʷʔal cʼaʔkʷ.

Abil gwat asra'tleldhu kwi gwas stsakus, gula tlo haidhu,
ʔəbilʼ gʷat ʔəsxaƛ'ildxʷ kʷi gʷəscʼaʔkʷs, gʷəl ɬuhaydxʷ,

gula tlo tleldhu gwalh Sherk-Siam sgwadgwad,
gʷəl ɬutɬildxʷ gʷəɬ səq siʔab sgʷadgʷad,

gula tlo rahab dhual ku sgwas tskwadads,
gʷəl ɬuxahəb dxʷʔal kʷi sgʷaʔs ʤəkʼʷadads,

gula tlo whobed aku tsoku ku boku sa;
gʷəl ɬuxʷəbəd ʔə kʷi cukʷ kʷi bəkʷ saʔ;

gula tlo tibitsut dhual kwi tlo tskwakred tlhdahu halhs.
gʷəl ɬutibicut dxʷʔal kʷi ɬuckʼʷaqid tɬdəxʷ haʔɬs.

32

Page 33: Low-resource NLP

33

Chirouse terms re-introduced

• dxʷsyayus (God, lit. “hard worker”)
• tul šəq skayuʔ (saints, lit. “dead people from the sky”)
• liǰub (devil, cf. French “le diable”)
• sxʷətagʷəb (Milky Way, lit. “falls in the middle”)
• ʔabsčups sčusəd (comet, lit. “has+fire+star”)
• qʷəcxaʔ (lark, cf. Upper Chehalis)
• paləkʷ (pig, lit. “dig up, root up”)

Page 34: Low-resource NLP

Chirouse re verbs, culture

• “The following verbs express some of the principal wild Indian habits (as they are to be well remembered)”
• Sample verbs
  • To run a quill through the nostril, for ornament
  • To flatten the head of a baby
  • To blow on a sick to frighten the evil spirit
  • To louse a friend and eat the lice, to share one's love
  • To take a steam bath underground

Page 35: Low-resource NLP

35

1) The source sentences

• Randomly chosen from several sources
  • Conversational turns from published narratives (Hilbert)
  • Sample sentences from pedagogical grammars (Hess/Hilbert)
  • Example usage sentences from dictionary (Bates/Hess/Hilbert)
• Typed or scanned into Romanized form
• No systematically balanced coverage

Page 36: Low-resource NLP

36

2) Parse the sentences

• Parse the words from each sentence using the morphology engine described earlier

• Parse the sentences using the link grammar parser described earlier

Page 37: Low-resource NLP

37

Preprocessing

1) Romanize the sentence.

tuLildExW kWi ?aciLtalbixW ?E kWi lEpEskWi? .

2) Morphologically parse the sentence.

tu+ Lil +d +ExW kWi ?aciLtal =bixW ?E kWi lEpEskWi? .
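
The romanization step can be sketched as a character substitution. The slides do not give the full scheme, so the mapping below is partial and inferred from forms that appear in both spellings elsewhere in this deck (for example tiʔəʔ / ti?E? and ləpəskʷi / lEpEskWi); the ɬ → L entry is an assumption.

```python
# Partial mapping inferred from paired examples in these slides.
# The "ɬ" -> "L" entry is an assumption; the complete scheme is not in the source.
ROMANIZATION = {
    "ə": "E",
    "ʔ": "?",
    "ʷ": "W",
    "ƛ": "T",
    "ɬ": "L",
}

def romanize(text):
    """Replace each mapped character; leave all other characters unchanged."""
    return "".join(ROMANIZATION.get(ch, ch) for ch in text)

print(romanize("tiʔəʔ ləpəskʷi"))
```

A character-by-character table is enough here because every mapped symbol is a single code point; a scheme with multi-character sources would need longest-match replacement instead.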

Page 38: Low-resource NLP

Lushootseed morphological parse

PC-KIMMO>recognize LubElEskWaxWyildutExWCEL
Lu+bE+lEs+^kWaxW+yi+il+d+ut+ExW+CEL
Fut+ANEW+PrgSttv+help+YI+il+Trx+Rfx+Inc+our

[Parse tree, flattened in extraction. Leaf nodes, left to right:]

Morpheme   Node        Gloss
Lu+        FUT         Fut+
bE+        ANEW        ANEW+
lEs+       PROGRSTAT   ProgrStatv+
^kWaxW     ROOT        help
+yi        VSUFYI      +yi
+il        ACHV        +il
+d         VSUFTRX     +Trx
+ut        VSUFRFX     +Rfx
+ExW       NOW         +Incho
+CEL       DET2        +our

Nesting: Word > NWord > VWord + DET2. Each VWord level peels off one prefix (VTnsAsp, VAsp0, VAsp2), and each VFrame level one suffix, down to the ROOT.
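
The two lines PC-KIMMO prints after the `recognize` command are parallel: one morpheme per gloss, both separated by `+`. A minimal sketch of pairing them up; the `align` helper is hypothetical and not part of PC-KIMMO:

```python
segmentation = "Lu+bE+lEs+^kWaxW+yi+il+d+ut+ExW+CEL"
glosses = "Fut+ANEW+PrgSttv+help+YI+il+Trx+Rfx+Inc+our"

def align(segmentation, glosses, sep="+"):
    """Pair each morpheme with its gloss; both lines must have the same length."""
    morphs, labels = segmentation.split(sep), glosses.split(sep)
    if len(morphs) != len(labels):
        raise ValueError("segmentation and gloss lines differ in length")
    return list(zip(morphs, labels))

for morph, gloss in align(segmentation, glosses):
    print(f"{morph:8} {gloss}")
```

Pairs like these are a convenient intermediate form for loading morphological analyses into a database alongside the syntactic links.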

38

Page 39: Low-resource NLP

39

Sample LG Lushootseed parse

linkparser> tu+ Lil +d +ExW kWi ?aciLtalbixW ?E kWi lEpEskWi?.

Found 15 linkages (15 had no P.P. violations)
Linkage 1, cost vector = (UNUSED=0 DIS=4 AND=0 LEN=19)

+---------------------------Xp--------------------------+
| +-------------EX-------------+ |
| +---------SOs--------+ | |
+----Wd----+--ASP--+ | +----P----+ |
| +-PT+-TX+ | +---DT---+ | +--DT--+ |
| | | | | | | | | | |

LEFT-WALL tu+ Lil +d +ExW kWi ?aciLtalbixW ?E kWi lEpEskWi? .

Press RETURN for the next linkage.

Page 40: Low-resource NLP

Parsing Lushootseed

linkparser> bE+ Lil +t +Eb +ExW ?ElgWE? ?E ti?E? bE+ ?Es+ istE?.

++++Time 0.07 seconds (0.20 total)

Found 2 linkages (2 had no P.P. violations)
Linkage 1, cost vector = (UNUSED=0 DIS=2 AND=0 LEN=28)

+----------------------------Xp---------------------------+
| +-----------EM----------+ |
| +--------PA-------+ +---------P--------+ |
| +----ASP----+ | | +------DT------+ |
+----Wd----+--MD--+ | | | | +----AD---+ |
| +-AD+-TX+ | | | | | | +-STV+ |
| | | | | | | | | | | | |

LEFT-WALL bE+ Lil +t +Eb +ExW ?ElgWE? ?E ti?E? bE+ ?Es+ istE? .

40

Page 41: Low-resource NLP

41

Another parse

linkparser> ?u+ da?a +d ?ElgWE? ?E kWi s+ gWistalb ti?E? SukWE?.

Found 1 linkage (1 had no P.P. violations)
Unique linkage, cost vector = (UNUSED=0 DIS=4 AND=0 LEN=24)

+----------------------------Xp---------------------------+
| +-------------------SOo-------------------+ |
| +------EX------+------P-----+ | |
+-----Wd----+---SOs--+ | +----DT---+ | |
| +-PRF+-TX+ | | | +--NZ-+ +--DT--+ |
| | | | | | | | | | | |

LEFT-WALL ?u+ da?a +d ?ElgWE? ?E kWi s+ gWistalb ti?E? SukWE? .

Page 42: Low-resource NLP

42

Another parse

linkparser> bE+ Lil +t +Eb +ExW ?ElgWE? ?E ti?E? bE+ ?Es+ istE?.

Found 2 linkages (2 had no P.P. violations)
Linkage 1, cost vector = (UNUSED=0 DIS=2 AND=0 LEN=28)

+----------------------------Xp---------------------------+
| +-----------EM----------+ |
| +--------PA-------+ +---------P--------+ |
| +----ASP----+ | | +------DT------+ |
+----Wd----+--MD--+ | | | | +----AD---+ |
| +-AD+-TX+ | | | | | | +-STV+ |
| | | | | | | | | | | | |

LEFT-WALL bE+ Lil +t +Eb +ExW ?ElgWE? ?E ti?E? bE+ ?Es+ istE? .

Page 43: Low-resource NLP

43

Another parse

linkparser> q'ili +t +Eb +ExW ?E ti?E? s+ ?il =aXad ?E ti?E? captain.

Found 11 linkages (11 had no P.P. violations)
Linkage 3, cost vector = (UNUSED=0 DIS=6 AND=0 LEN=23)

+------------------------------Xp------------------------------+
| +-------EM-------+ |
| +-----ASP----+ +-----P-----+ |
| +---MD--+ | | +---DT--+----MV---+-----P----+ |
+---Wd--+-TX-+ | | | | +NZ+-LX-+ | +--DT--+ |
| | | | | | | | | | | | | |

LEFT-WALL q'ili +t +Eb +ExW ?E ti?E? s+ ?il =aXad ?E ti?E? captain .

Page 44: Low-resource NLP

44

3) Load parses into database

• The link structure for each word pair can be loaded into a database record
• This allows use of database manipulation techniques
• Querying, or asking about, the contents is a commonly performed task with databases
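
A minimal sketch of the loading step, using SQLite through Python's `sqlite3` module. The table name `gramruth2` and the columns `SentNum`, `LHLex`, and `RHLex` come from the sample SQL query later in the deck; the `LinkType` column, the choice of SQLite, and the sample rows are assumptions made for illustration.

```python
import sqlite3

def load_links(conn, links):
    """Store one row per linked word pair."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS gramruth2 "
        "(SentNum INTEGER, LHLex TEXT, RHLex TEXT, LinkType TEXT)"
    )
    conn.executemany("INSERT INTO gramruth2 VALUES (?, ?, ?, ?)", links)
    conn.commit()

# Illustrative records only; not an exact reading of the linkage diagrams.
links = [
    (1, "tu+", "Lil", "PT"),
    (1, "Lil", "+d", "TX"),
    (1, "kWi", "lEpEskWi?", "DT"),
]
conn = sqlite3.connect(":memory:")
load_links(conn, links)
print(conn.execute("SELECT COUNT(*) FROM gramruth2").fetchone()[0])
```

Storing one row per link, rather than one row per sentence, is what lets later queries count or filter on individual link types.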

Page 45: Low-resource NLP

45

Page 46: Low-resource NLP

46

4) Query over the analyses

• Once in database format, the parses can be queried by a user using SQL:
  • Which predicates have both a negative and an aspectual marker?
  • Which sentences have two oblique complements?
  • Find questions with past tense.
  • Which words are the most complex?

Page 47: Low-resource NLP

47

Sample structures

• Longest link

linkparser> huy lEk'W +t +Eb +ExW ?E ti?E? dxWsT'alb ti?E? lEpEskWi.

+-----------------------------------Xp----------------------------------+

| +-------------------------PA------------------------+ |

| +--------EM-------+ | |

| +-----ASP-----+ | | |

| +---MD---+ | +-------P------+ | |

+---Wg--+--Wd--+--TX-+ | | | +----DT---+ +---DT---+ |

| | | | | | | | | | | |

LEFT-WALL huy.a lEk'W.r +t +Eb +ExW ?E ti?E?.d dxWsT'alb.r ti?E?.d lEpEskWi .

huy ləkʷtəbəxʷ ʔə tiʔəʔ dxʷsƛ'alb tiʔəʔ ləpəskʷi

• Most complex predicate

T’u+ tu+ s+ takW +yi +Eb +s

ƛutustakʷyitəb

Page 48: Low-resource NLP

48

Lexical suffixes frequency?

10 =bixW
3 =igWEd
2 =ELdat
2 =a?kW
2 =aCi?
2 =aXad
2 =ali
1 =abac
1 =al?txW
1 =alikW
1 =alus
1 =aq
1 =gWas
1 =gWiL
1 =gWil
1 =i
1 =iC
1 =qid
1 =ucid
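
A frequency list like this can be produced directly from the morphologically parsed sentences, where lexical suffixes are the tokens that begin with `=`. A minimal sketch, assuming whitespace-separated morphemes as in the preprocessing example; the function name is hypothetical and the two input sentences are taken from earlier slides:

```python
from collections import Counter

def lexical_suffix_counts(parsed_sentences):
    """Count tokens beginning with '=', the marker used here for lexical suffixes."""
    counts = Counter()
    for sentence in parsed_sentences:
        counts.update(tok for tok in sentence.split() if tok.startswith("="))
    return counts

parsed = [
    "tu+ Lil +d +ExW kWi ?aciLtal =bixW ?E kWi lEpEskWi? .",
    "q'ili +t +Eb +ExW ?E ti?E? s+ ?il =aXad ?E ti?E? captain .",
]
for suffix, n in lexical_suffix_counts(parsed).most_common():
    print(n, suffix)
```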

Page 49: Low-resource NLP

49

Corpus queries

• Which sentences have a negative and an aspectual marker?
• Which sentences have 2 oblique complements?
• Find questions with past tense.

• Bitext
• Find all sentences where a given word is translated into English some way.
• Show English passive middle in Lushootseed

• Frequencies of links
• Which is more common: past or present tense?
• Find verbs with out-of-control suffixes
• List all reduplicated forms by pattern

Page 50: Low-resource NLP

50

Sample SQL query

• List sentences with respect to the complexity of their predicates

SELECT SentNum, COUNT(SentNum)
FROM gramruth2
WHERE (LHLex LIKE "*+" OR LHLex LIKE "*=")
   OR (RHLex LIKE "+*" OR RHLex LIKE "=*")
GROUP BY SentNum;
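
The query counts, per sentence, the links whose left word ends in `+` or `=` (a prefix) or whose right word begins with `+` or `=` (a suffix or lexical suffix). The `*` wildcards and double-quoted strings suggest an Access-style SQL dialect, which is an assumption. The sketch below runs an equivalent query in SQLite, where `LIKE` uses `%` as the wildcard; the sample rows are illustrative and the `ORDER BY` clause is an addition that sorts sentences by predicate complexity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE gramruth2 (SentNum INTEGER, LHLex TEXT, RHLex TEXT)")
# Illustrative link records; not an exact reading of the linkage diagrams.
conn.executemany("INSERT INTO gramruth2 VALUES (?, ?, ?)", [
    (1, "tu+", "Lil"), (1, "Lil", "+d"), (1, "kWi", "lEpEskWi?"),
    (2, "?u+", "da?a"), (2, "da?a", "+d"), (2, "s+", "gWistalb"),
])

QUERY = """
SELECT SentNum, COUNT(SentNum) FROM gramruth2
WHERE (LHLex LIKE '%+' OR LHLex LIKE '%=')
   OR (RHLex LIKE '+%' OR RHLex LIKE '=%')
GROUP BY SentNum
ORDER BY COUNT(SentNum) DESC;
"""
rows = conn.execute(QUERY).fetchall()
print(rows)
```

In this toy data sentence 2 has three affixal links and sentence 1 has two, so sentence 2 is listed first.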

Page 51: Low-resource NLP

51

Sample statistics (tokens)

• # sentences: 500
• # morphemes: 2954
  • suffixes: 623
  • prefixes: 607
  • lexical suffixes: 58
• # words: 1625
• # S’s with only monomorphemic words: 43
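
A few derived ratios make these token counts easier to interpret. This is a small worked calculation from the figures on the slide; it assumes that lexical suffixes are counted separately from the other suffixes, which the slide does not state.

```python
stats = {"sentences": 500, "words": 1625, "morphemes": 2954,
         "suffixes": 623, "prefixes": 607, "lexical_suffixes": 58}

words_per_sentence = stats["words"] / stats["sentences"]
morphemes_per_word = stats["morphemes"] / stats["words"]
# Assumes the three affix categories do not overlap.
affixes = stats["suffixes"] + stats["prefixes"] + stats["lexical_suffixes"]
affix_share = affixes / stats["morphemes"]

print(f"words per sentence: {words_per_sentence:.2f}")
print(f"morphemes per word: {morphemes_per_word:.2f}")
print(f"affix share of morphemes: {affix_share:.1%}")
```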

Page 52: Low-resource NLP

52

Sample statistics (links, partial)

777 punctuation
763 determiner
629 PP
618 subject
528 PP-object
380 aspectual
323 transitive
228 middle
205 stative
193 nominalizer
188 past
122 adverbial (sentential)
99 achievement
97 perfective
85 possessive
82 future
73 lexical suffix
66 oblique
61 habitual
53 subordinating
51 adverbial (predicate)
45 passive
39 dubitative
32 benefactive
29 progressive
29 causative
15 object
10 adjective
8 determiner (feminine)
7 partitive
2 reflexive

Page 53: Low-resource NLP

Living the language