
Resource-Light Bantu Part-of-Speech Tagging

Guy De Pauw (UA) Gilles-Maurice de Schryver (UGent) Janneke van de Loo (UA)

Motivation

There are many data-driven taggers available, but they need extensive annotated corpora.

Unsupervised part-of-speech tagging techniques for resource-scarce languages show limited results on Sub-Saharan languages.

Increasingly available: digital dictionaries, lexicons, word lists, ...

Research questions

• What information can we use for part-of-speech tagging?

• Can we use this information to bootstrap accurate part-of-speech taggers for the languages under investigation?

• How does this technique compare to the state-of-the-art in data-driven part-of-speech tagging?

Bag-of-Substrings

Adam/PROPNAME alionekana/V chumbani/N kwake/PRON hana/NEG fahamu/N ./FULL_STOP
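The poster does not spell out how a word is turned into a bag of substrings, so the following is only a minimal sketch under stated assumptions: each word is represented by the set of all its character substrings, and each substring is marked by position. The function name `bag_of_substrings`, the `max_len` parameter and the markers `P=` (prefix), `S=` (suffix), `I=` (word-internal) and `W=` (whole word) are hypothetical choices for illustration.

```python
def bag_of_substrings(word, max_len=None):
    """Return the set of position-marked substrings of `word`.

    Markers (an assumption of this sketch, not taken from the poster):
    P = prefix, S = suffix, I = word-internal substring, W = whole word.
    """
    n = len(word)
    features = {f"W={word}"}
    for start in range(n):
        for end in range(start + 1, n + 1):
            length = end - start
            if length == n:
                continue  # the whole word is already covered by W=
            if max_len is not None and length > max_len:
                continue  # optionally cap the substring length
            if start == 0:
                marker = "P"
            elif end == n:
                marker = "S"
            else:
                marker = "I"
            features.add(f"{marker}={word[start:end]}")
    return features


# Features for one word of the example sentence.
print(sorted(bag_of_substrings("hana")))
```

Because the features are plain substrings, no morphological analyser is needed: in a conjunctively written language, prefixes and suffixes that carry grammatical information end up among the `P=` and `S=` features automatically.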

Train a maximum entropy classifier and compare it to a memory-based tagger.
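As a rough illustration of the training step, the sketch below fits a maximum entropy model (multinomial logistic regression) on substring features with plain stochastic gradient ascent. It is not the authors' implementation: the helper names, the hyperparameters `epochs` and `learning_rate`, and the simplified unmarked `substrings` extractor are assumptions, and the toy lexicon simply reuses the tagged example sentence from the poster. The memory-based tagger used for comparison is not sketched here.

```python
import math
from collections import defaultdict


def substrings(word):
    """Simplified stand-in feature extractor: all character substrings."""
    n = len(word)
    return {word[i:j] for i in range(n) for j in range(i + 1, n + 1)}


def predict_proba(weights, tags, features):
    """Softmax over the summed weights of the active features."""
    scores = {t: sum(weights.get((f, t), 0.0) for f in features) for t in tags}
    top = max(scores.values())  # subtract the maximum for numerical stability
    exps = {t: math.exp(s - top) for t, s in scores.items()}
    total = sum(exps.values())
    return {t: e / total for t, e in exps.items()}


def train_maxent(examples, epochs=50, learning_rate=0.5):
    """Fit weights for (feature, tag) pairs on (feature_set, tag) examples."""
    tags = sorted({tag for _, tag in examples})
    weights = defaultdict(float)
    for _ in range(epochs):
        for features, gold in examples:
            probs = predict_proba(weights, tags, features)
            for tag in tags:
                # Log-likelihood gradient: observed count minus expected count.
                update = learning_rate * ((tag == gold) - probs[tag])
                for feature in features:
                    weights[(feature, tag)] += update
    return weights, tags


def tag_word(weights, tags, word):
    """Return the most probable tag for a single word."""
    probs = predict_proba(weights, tags, substrings(word))
    return max(tags, key=probs.get)


# Toy lexicon taken from the example sentence on the poster.
lexicon = [
    ("Adam", "PROPNAME"), ("alionekana", "V"), ("chumbani", "N"),
    ("kwake", "PRON"), ("hana", "NEG"), ("fahamu", "N"), (".", "FULL_STOP"),
]
weights, tags = train_maxent([(substrings(w), t) for w, t in lexicon])

# Tag a word that is not in the lexicon, relying on shared substrings.
print(tag_word(weights, tags, "nyumbani"))
```

Since the classifier sees one word at a time, a word list with tags is enough to train it; no running annotated text is required, which is the point of the resource-light setting.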

Experimental Results

Conclusion

In the absence of large, annotated corpora, the bag-of-substrings approach established a low-resource, high-accuracy bootstrapping method for part-of-speech tagging of conjunctively written Bantu languages.

Demos