A Dependency Treebank of Classical Chinese Poems
John Lee and Yin Hei Kong
The Halliday Centre for Intelligent Applications of Language Studies, Department of Chinese, Translation and Linguistics
2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 191–199, Montréal, Canada, June 3-8, 2012. © 2012 Association for Computational Linguistics

Page 1: A Dependency Treebank of Classical Chinese Poems

A Dependency Treebank of Classical Chinese Poems

John Lee and Yin Hei Kong
The Halliday Centre for Intelligent Applications of Language Studies
Department of Chinese, Translation and Linguistics

2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 191–199, Montréal, Canada, June 3-8, 2012. © 2012 Association for Computational Linguistics

Page 2: A Dependency Treebank of Classical Chinese Poems

Outline
1. Abstract
3. Treebank design
4. Data
5. Parallel Couplets

Page 3: A Dependency Treebank of Classical Chinese Poems

1. Abstract
First large-scale dependency treebank for Classical Chinese literature.
Dependency relations derived from the Stanford dependency types.
Over 32K characters of Tang poetry (唐詩).

Page 4: A Dependency Treebank of Classical Chinese Poems

3. Treebank design
Classical Chinese and Modern Chinese
◦ Similarity
  Vocabulary
  Grammar
POS tagset
◦ Based on the Penn Chinese Treebank, with a slight revision of its 33 tags (Lee, 2012)

Page 5: A Dependency Treebank of Classical Chinese Poems
Page 6: A Dependency Treebank of Classical Chinese Poems

A dependency framework is chosen for two reasons.
Free word order
◦ Dependency grammars can handle this phenomenon well.
Helpful to students

Page 7: A Dependency Treebank of Classical Chinese Poems

Dependency relations
Our set of dependency relations is based on those developed at Stanford University for Modern Chinese.
Our approach is to map their 44 dependency relations, as much as possible, to Classical Chinese.
Many of the function words that Modern Chinese uses to mark grammatical features, such as tense, voice, and case, do not exist in Classical Chinese.

Page 8: A Dependency Treebank of Classical Chinese Poems

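Dependency relations
[Figure: overview of the dependency relations, with callouts to Sections 3.1, 3.2, 3.3, 3.4, and 3.6, where the affected relations are discussed]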

Page 9: A Dependency Treebank of Classical Chinese Poems

3.1 Locative modifiers
The preposition is frequently omitted, leaving a bare locative noun phrase that modifies the verb directly.
◦ Although ‘hill’ occupies the position normally reserved for the subject, it actually indicates a location.
◦ The locative noun ‘alley’ is placed after the verb.

Page 10: A Dependency Treebank of Classical Chinese Poems

3.2 Oblique objects
The oblique object relation (obl) marks nouns that directly modify a verb.
They typically come after the verb.
◦ The noun ‘cup’ is used in an instrumental sense to modify ‘drunk’ in an obl relation.

Page 11: A Dependency Treebank of Classical Chinese Poems

3.3 Noun phrase as adverbial modifier
Floating reflexives
◦ (e.g., it is itself adequate)
Other PP-like NPs
◦ (e.g., two times a day)
Examples
◦ the noun ‘self’ as a reflexive
◦ the noun ‘year’ indicating repetition

Page 12: A Dependency Treebank of Classical Chinese Poems

3.4 Indirect objects
The double object construction contains two objects in a verb phrase.
Direct object
◦ (e.g., ‘a book’ in “he gave me a book”)
Indirect object
◦ (e.g., ‘me’ in “he gave me a book”)
Classical Chinese has no linguistic device that marks the indirect object
◦ the indirect object is unmarked;
◦ we distinguish it with the “indirect object” label (iobj).
Example: ‘word’ as the direct object, ‘person’ as the indirect object.

Page 13: A Dependency Treebank of Classical Chinese Poems

3.4 Indirect objects

Page 14: A Dependency Treebank of Classical Chinese Poems

3.5 Absence of copular verbs
In a copular construction “A is B”, A is considered the “topic” (top) of the copular verb “is” (Chang et al., 2009).
The copula, however, is rarely used in Classical Chinese (Pulleyblank, 1995).
In some cases
◦ it is replaced by an adverb that functions as a copular verb;
◦ if so, that adverb is POS-tagged as a copular verb (VC) in our treebank.
In other cases
◦ the copula is absent altogether;
◦ we expand the usage of the top relation;
◦ e.g., the relation top(‘capable’, ‘general’) would be assigned.
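The relation(head, dependent) notation seen in examples such as top(‘capable’, ‘general’) and the obl relation between ‘drunk’ and ‘cup’ can be held in a small record type. The following is only an illustrative sketch: the class name `Dependency`, the helper `relations_of`, and the argument order (head first, as in the Stanford convention) are assumptions, not the treebank's actual file format.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Dependency:
    """One labeled dependency; a hypothetical container, not the treebank format."""
    relation: str   # e.g. "top", "obl", "iobj"
    head: str       # governing word (English gloss used here)
    dependent: str  # word that depends on the head

    def __str__(self) -> str:
        # Render in the relation(head, dependent) notation.
        return f"{self.relation}({self.head}, {self.dependent})"


def relations_of(deps: List[Dependency], relation: str) -> List[Dependency]:
    """Return every dependency carrying the given relation label."""
    return [d for d in deps if d.relation == relation]


# The two glossed examples from Sections 3.2 and 3.5 (they come from different lines).
examples = [
    Dependency("top", "capable", "general"),
    Dependency("obl", "drunk", "cup"),
]

for dep in examples:
    print(dep)
```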

Page 15: A Dependency Treebank of Classical Chinese Poems

3.6 Discourse relations
Even in the absence of explicit connectives, two adjacent clauses can still hold an implicit discourse relation.

Page 16: A Dependency Treebank of Classical Chinese Poems

3.6 Discourse relations

Page 17: A Dependency Treebank of Classical Chinese Poems

4 Data
The Complete Shi Poetry of the Tang (Peng, 1960)
◦ nearly 50,000 poems
◦ more than two thousand poets

Page 18: A Dependency Treebank of Classical Chinese Poems

4.1 Material
Over 32,000 characters in 521 poems
◦ Wang Wei (王維) and Meng Haoran (孟浩然)
Dependency relations
◦ Word boundaries and POS tags
Metadata
◦ Tone: level (ping 平) or oblique (ze 仄)
◦ Title, author, and genre
  ‘recent-style’ (近體詩) or ‘ancient-style’ (古體詩)

Page 19: A Dependency Treebank of Classical Chinese Poems

4.2 Inter-annotator agreement
Two annotators, both university graduates with a degree in Chinese, created this treebank.
To measure inter-annotator agreement, we set apart a subset of about 1050 characters.
◦ Three tasks:

Task                   Agreement rate
POS tagging            95.1%
Head selection         92.3%
Dependency labeling    91.2%
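As a rough illustration of how agreement rates like those above can be computed, the sketch below takes the simple proportion of positions on which two annotators agree, separately for POS tag, head, and dependency label. The slides do not say exactly how the rates were calculated, so this per-token definition, the `Token` type, and the toy annotations are all assumptions made for the example.

```python
from typing import Dict, NamedTuple, Sequence


class Token(NamedTuple):
    pos: str    # POS tag
    head: int   # 1-based index of the head token, 0 for the root
    label: str  # dependency relation to the head


def agreement_rates(a: Sequence[Token], b: Sequence[Token]) -> Dict[str, float]:
    """Proportion of tokens on which two annotations agree, per task.

    Assumes both annotators used the same word segmentation, so that the
    two sequences align one to one.
    """
    if len(a) != len(b) or not a:
        raise ValueError("annotations must be non-empty and of equal length")
    n = len(a)
    return {
        "pos": sum(x.pos == y.pos for x, y in zip(a, b)) / n,
        "head": sum(x.head == y.head for x, y in zip(a, b)) / n,
        "label": sum(x.label == y.label for x, y in zip(a, b)) / n,
    }


# Invented toy annotations of a four-token line (not treebank data).
annotator_a = [
    Token("NN", 2, "nsubj"),
    Token("VV", 0, "root"),
    Token("AD", 4, "advmod"),
    Token("VV", 2, "conj"),
]
annotator_b = [
    Token("NN", 2, "nsubj"),
    Token("VV", 0, "root"),
    Token("VV", 4, "dep"),
    Token("VV", 2, "conj"),
]

rates = agreement_rates(annotator_a, annotator_b)
print(rates)
```

In the toy data the annotators disagree only on the third token (adverb versus verb), which affects its POS tag and label but not its head.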

Page 20: A Dependency Treebank of Classical Chinese Poems

For POS tagging, the three main error categories are confusions
◦ between adverbs (AD) and verbs with an adverbial force,
◦ between measure words (M) and nouns (NN),
◦ between adjectives (JJ) and nouns.
These differences in POS tags trickle down to head selection and dependency labeling.

Page 21: A Dependency Treebank of Classical Chinese Poems

Polysemy
◦ 簞食伊何
◦ 簞 dan: ‘bowl / blanket’; 食: ‘food’
Reading 1: ‘What food is contained in that bowl?’
◦ The relation clf is required for 簞 dan, and 伊 yi is the root word.
Reading 2: ‘What food is placed on the blanket?’
◦ Here, dan takes on the relation nn, and the root word would be 何 he instead.

Page 22: A Dependency Treebank of Classical Chinese Poems

5. Parallel Couplets

Character-level parallelism.
Phrase-level parallelism.

Page 23: A Dependency Treebank of Classical Chinese Poems

Character-level parallelism.

Requiring exactly matched POS tags yields a parallel rate of only 74% in the corpus as a whole.
‘Equivalence sets’ of POS tags
◦ Two tags in the same set are considered parallel, even though they do not match.
◦ The parallel rate increases to 87%.
The ‘equivalence sets’ are not perfect: a polysemous character may be matched through an ‘out-of-context’ meaning (jieyi 借義).
◦ Example: “欲就終焉志,恭聞智者名”
◦ 焉 yan is a sentence particle and 者 zhe is a noun, so their tags are not parallel.
◦ However, the poet apparently viewed them as parallel, because zhe can also function as a sentence particle in other contexts.
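The character-level metric can be sketched as follows: two characters in corresponding positions of a couplet count as parallel if their POS tags are identical or fall in the same equivalence set. The slides do not list the actual equivalence sets or state whether the rate is counted per character or per couplet, so the sets `EQUIVALENCE_SETS`, the per-character counting, and the toy couplet below are assumptions for illustration only.

```python
from typing import FrozenSet, List, Sequence, Tuple

Couplet = Tuple[List[str], List[str]]  # POS tags of the first and second line


def tags_parallel(tag_a: str, tag_b: str,
                  equivalence_sets: Sequence[FrozenSet[str]] = ()) -> bool:
    """True if the tags match exactly or share an equivalence set."""
    if tag_a == tag_b:
        return True
    return any(tag_a in s and tag_b in s for s in equivalence_sets)


def parallel_rate(couplets: Sequence[Couplet],
                  equivalence_sets: Sequence[FrozenSet[str]] = ()) -> float:
    """Share of corresponding character positions whose tags are parallel."""
    pairs = []
    for first, second in couplets:
        if len(first) != len(second):
            raise ValueError("lines of a couplet must have equal length")
        pairs.extend(zip(first, second))
    if not pairs:
        raise ValueError("no character pairs to compare")
    hits = sum(tags_parallel(a, b, equivalence_sets) for a, b in pairs)
    return hits / len(pairs)


# Hypothetical equivalence sets; the real ones are not given on the slides.
EQUIVALENCE_SETS = [frozenset({"NN", "NR"}), frozenset({"VV", "VA"})]

# Invented tag sequences for one pentasyllabic couplet.
toy_couplets = [
    (["NN", "VV", "AD", "VV", "NN"], ["NR", "VV", "AD", "VA", "NN"]),
]

exact = parallel_rate(toy_couplets)
relaxed = parallel_rate(toy_couplets, EQUIVALENCE_SETS)
print(exact, relaxed)
```

With exact matching, three of the five positions in the toy couplet are parallel; with the hypothetical equivalence sets, all five are, mirroring the direction of the 74% to 87% increase reported above.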

Page 24: A Dependency Treebank of Classical Chinese Poems

Character-level parallelism.

Page 25: A Dependency Treebank of Classical Chinese Poems

Phrase-level parallelism.

The character-level metric, however, still rejects some couplets that would be deemed parallel by scholars.

Most of these couplets are parallel not at the character level, but at the phrase level.

Pentasyllabic (5-character) line
◦ = disyllabic unit (the first two characters)
◦ + trisyllabic unit (the last three characters)
Example: consider two corresponding disyllabic units
◦ 抱琴 垂釣
◦ 抱/VV 琴/NN 垂/AD 釣/VV
Both units are verb phrases describing an activity (‘to hold a violin’ and ‘to fish while looking down’).
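The contrast between the two levels can be made concrete with the 抱琴 / 垂釣 example. The sketch below splits a pentasyllabic line into its disyllabic and trisyllabic units and compares units by the POS tag of their head word. Comparing head tags is an assumed stand-in for the phrase-level criterion, which the slides do not spell out, and the head positions given to the two units are likewise assumptions.

```python
from typing import List, NamedTuple, Tuple


class Unit(NamedTuple):
    tokens: List[Tuple[str, str]]  # (character, POS tag) pairs
    head: int                      # index of the head word within the unit


def split_pentasyllabic(line: str) -> Tuple[str, str]:
    """Split a 5-character line into its disyllabic and trisyllabic units."""
    if len(line) != 5:
        raise ValueError("expected a 5-character line")
    return line[:2], line[2:]


def char_level_parallel(u: Unit, v: Unit) -> bool:
    """Every corresponding character must carry the same POS tag."""
    return (len(u.tokens) == len(v.tokens)
            and all(a[1] == b[1] for a, b in zip(u.tokens, v.tokens)))


def phrase_level_parallel(u: Unit, v: Unit) -> bool:
    """Assumed criterion: the head words of the two units share a POS tag."""
    return u.tokens[u.head][1] == v.tokens[v.head][1]


# Tags from the slide; head positions are assumed (verb heads its phrase).
unit_a = Unit([("抱", "VV"), ("琴", "NN")], head=0)
unit_b = Unit([("垂", "AD"), ("釣", "VV")], head=1)

print(split_pentasyllabic("欲就終焉志"))
print(char_level_parallel(unit_a, unit_b))
print(phrase_level_parallel(unit_a, unit_b))
```

Under these assumptions the two units fail the character-level test (VV against AD, NN against VV) but pass the phrase-level one, since both are headed by a verb.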

Page 26: A Dependency Treebank of Classical Chinese Poems

5.3 Results

Page 27: A Dependency Treebank of Classical Chinese Poems

Conclusion
We have presented the first large-scale dependency treebank of Classical Chinese literature, which encodes works by two poets in the Tang Dynasty.

We have described how the dependency grammar framework has been derived from existing treebanks for Modern Chinese, and shown a high level of inter-annotator agreement. Finally, we have illustrated the utility of the treebank with a study on parallelism in Classical Chinese poetry.

Future work will focus on parsing Classical Chinese poems of other poets, and on enriching the corpus with semantic information, which would facilitate not only deeper study of parallelism but also other topics such as imagery and metaphorical coherence (Zhu and Cui, 2010).