Page 1: Asialex201103slide02

Asialex 2011 Kyoto, Japan 1

Development of the Thesaurus of Classical Japanese Poetic Vocabulary

Hilofumi Yamamoto

Tokyo Institute of Technology

Makiro Tanaka

National Institute of Japanese Language and Linguistics

22nd Aug. 2011

Page 2: Asialex201103slide02

Outline

1. Purpose of Study

• Connotation of classical poetic vocabulary

• Longitudinal study of transition of vocabulary

2. Development of Thesaurus

3. Applications

Page 3: Asialex201103slide02

Waka: Japanese Poetry

Tatsuta-Hime /
tamukuru KAMI no / areba koso /
aki no konoha no / nusa to chirurame

Because Princess Tatsuta
has a god to whom she offers brocades,
the leaves of trees
in autumn will scatter
as an offering.

Prince Kanemi, No. 298 in the Kokinshu

Page 4: Asialex201103slide02

Problem: Orthography

in hiragana

たつた

in Chinese characters

立田竜田龍田

→ All Tatsuta (place name)
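The orthography problem on this slide can be sketched as a lookup from each written variant to one canonical key. This is only an illustration under stated assumptions, not the thesaurus's actual mechanism: the table `VARIANTS`, the function `normalise`, and the key `Tatsuta` are hypothetical names chosen for readability.

```python
# Hypothetical variant table: every spelling listed on the slide maps to one key.
VARIANTS = {
    "たつた": "Tatsuta",  # hiragana
    "立田": "Tatsuta",    # Chinese characters, variant 1
    "竜田": "Tatsuta",    # Chinese characters, variant 2
    "龍田": "Tatsuta",    # Chinese characters, variant 3
}


def normalise(token: str) -> str:
    """Return the canonical key for a known variant, else the token unchanged."""
    return VARIANTS.get(token, token)


if __name__ == "__main__":
    for form in VARIANTS:
        print(form, "->", normalise(form))
```

With such a table, counts for the four spellings can be pooled under a single entry, while unlisted tokens pass through unchanged.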

Page 5: Asialex201103slide02

Problem: Unit size / attribution

The unit size and meaning of a word depend on context.

• unit → 卯の花 or 卯/の/花 (Nakano, 1998)

• orthography → さびしい/さみしい/寂しい/淋しい (sad)

• attributions → 卯の花 ∈ plant or 卯の花 ∈ food

(unohana = a deutzia or bean curd refuse)

Page 6: Asialex201103slide02

An Item of Thesaurus: God

BG-01-2030-01-030-A-かみ-神

↑ ↑ ↑ ↑ ↑ ↑ ↑ ↑

(1) (2) (3) (4) (5) (6) (7) (8)

Figure 1: Structure of an item of BG database in the case of kami (god):

(1) database ID (BG = short-unit general vocabulary);

(2) part of speech ID (01 = noun);

(3) group ID (2030 = Shinto deities and Buddhas);

(4) field ID;

(5) exact ID (030 = god);

(6) era-flag (A = contemporary, C = classic);

(7) Chinese character reading;

(8) Chinese character
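The eight-part item structure in Figure 1 can be illustrated by splitting an item on hyphens. This is a minimal sketch assuming that a hyphen never occurs inside the first seven parts; the class `MetaCode` and the function `parse_meta_code` are hypothetical names, not part of the authors' tools.

```python
from typing import NamedTuple


class MetaCode(NamedTuple):
    database: str  # (1) database ID, e.g. BG
    pos: str       # (2) part of speech ID, e.g. 01 = noun
    group: str     # (3) group ID, e.g. 2030
    field: str     # (4) field ID
    exact: str     # (5) exact ID, e.g. 030
    era: str       # (6) era-flag: A = contemporary, C = classic
    reading: str   # (7) Chinese character reading
    written: str   # (8) Chinese character


def parse_meta_code(item: str) -> MetaCode:
    """Split an item such as 'BG-01-2030-01-030-A-かみ-神' into its 8 parts."""
    parts = item.split("-", 7)
    if len(parts) != 8:
        raise ValueError(f"expected 8 hyphen-separated parts, got {len(parts)}")
    return MetaCode(*parts)


if __name__ == "__main__":
    print(parse_meta_code("BG-01-2030-01-030-A-かみ-神"))
```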

Page 7: Asialex201103slide02

Development: Thesaurus, KH, and t2c

• Thesaurus for classical poetic vocabulary

• KH (tokenizer)

• t2c (token to code converter)

Page 8: Asialex201103slide02

Materials: the Hachidaishu

• The Hachidaishu: eight anthologies compiled by imperial order between ca. 905 and 1205.

• The database: compiled by the National Institute of Japanese Literature, Japan.

• Texts are based on the Shohobonban version of the Hachidaishu.

Timeline of the eight anthologies (axis: 900 to 1250):

Anthology       Year     Years since previous
Kokinshu        •905
Gosenshu        •951     46
Shuishu         •1007    56
Goshuishu       1086     79
Kinyoshu        •1124    38
Shikashu        •1144    20
Senzaishu       1188     44
Shinkokinshu    1205     17

Page 9: Asialex201103slide02

Methods: Flowchart of data processing

(A) Corpus development

(B) Tokenisation

(C) Meta-code conversion

(D) Mathematical modelling

(E) Subtraction: CT − OP

(F) Visualisation

Page 10: Asialex201103slide02

Development: Thesaurus, KH, and t2c

• Thesaurus for classical poetic vocabulary

• KH (tokenizer)

• t2c (token to code converter)

Page 11: Asialex201103slide02

Table 1: An example of input for KH / Gosenshu No. 664

input:

000664 わすられて思ふなげきのしげるをや身をはづかしのもりといふらん

output:

000664
わすら (ラ四-未:忘る:わする:忘ら:わすら)
れ (自可受-用:る:る:れ:れ)
て (接助:て:て)
思ふ (ハ四-終体:思ふ:おもふ:思ふ:おもふ)
なげき (カ四-用:嘆く:なげく:嘆き:なげき)
の (格助:の:の)
しげる (ラ四-終体:茂る:しげる:茂る:しげる)
を (*助:を:を)
や (係助:や:や)
身 (名:身:み)
を (*助:を:を)
---
はづかし (名-地名:羽束師:はづかし)
の (格助:の:の)
---
はづかし (形シク-終:恥づかし:はづかし:恥づかし:はづかし)
の (格助:の:の)
---
もり (名:森:もり)
と (格助-引用:と:と)
いふ (ハ四-終体:言ふ:いふ:言ふ:いふ)
らん (推-終体:らむ:らむ:らむ:らむ)
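Each output token in Table 1 has the shape `surface (tag:field:...)`. The sketch below reads one such token into a small record. It assumes one token per line and ASCII parentheses. Because the slide does not say what each colon-separated field means, the fields are kept as an uninterpreted list. `parse_kh_token` is a hypothetical helper, not part of KH.

```python
import re

# surface form, a space, then a parenthesised, colon-separated annotation
TOKEN = re.compile(r"^(?P<surface>\S+)\s+\((?P<body>[^()]*)\)$")


def parse_kh_token(line: str) -> dict:
    """Parse one token such as '身 (名:身:み)' from KH-style output."""
    match = TOKEN.match(line.strip())
    if match is None:
        raise ValueError(f"not a token line: {line!r}")
    tag, *fields = match.group("body").split(":")
    return {"surface": match.group("surface"), "tag": tag, "fields": fields}


if __name__ == "__main__":
    print(parse_kh_token("身 (名:身:み)"))
    print(parse_kh_token("わすら (ラ四-未:忘る:わする:忘ら:わすら)"))
```

Lines consisting only of `---` are not token lines. In the sample they appear to separate alternative analyses of はづかし, so a caller would handle them before calling the parser.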

Page 12: Asialex201103slide02

Development: Thesaurus

[Figure: development workflow. Poem Texts → kh (Tokeniser) → t2c (Thesaurus code tagger) → Hachidaishu Thesaurus. Two feedback steps, labelled (A) and (B): add unknown entries to the Dictionary (General, Place Name, Personal Name, etc.) and add new thesaurus codes.]

Page 13: Asialex201103slide02

(A) Corpus: Poems (OP)

KW00029800|A|KANEMI NO O=kanemi no o

KW00029800|B|Tatsutahime[NOUN-PLNAME:TATSUTAHIME]/→
tamukuru[KASHIMO2-ATTR:TAMUkuru],kami[NOUN:KAMI]→
no[SUB]are[RAHEN-REAL]ba[CAUS]koso[KP]/→
aki[NOUN:AKI]no[CON],konoha[NOUN:KOnoHA]no[SUB]/→
nusa[NOUN:NUSA]to[P-CRD],chiru[RA4DAN-FF:CHIru]→
rame[CJR-REAL]/

Figure 2: Format of the database of a poem: → indicates continuing to the next line without breaks; the first line, which includes |A|, indicates the name of the poet; the second line, which includes |B|, indicates the contents of the poem and added information.
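The record layout in Figure 2 can be read with a short parser. This is a sketch under assumptions: tokens are taken to have the shape `surface[TAG]` or `surface[TAG:LEMMA]`, and the separators `/`, `,` and `→` are simply skipped. `parse_poem_record` is a hypothetical name.

```python
import re

# a romanised surface form followed by a bracketed annotation
TOKEN = re.compile(r"([A-Za-z]+)\[([^\]]*)\]")


def parse_poem_record(record: str):
    """Split 'ID|KIND|body' and extract (surface, tag, lemma) triples."""
    ident, kind, body = record.split("|", 2)
    body = body.replace("→", "")  # continuation marker carries no content
    tokens = []
    for surface, annotation in TOKEN.findall(body):
        tag, _, lemma = annotation.partition(":")
        tokens.append((surface, tag, lemma or None))
    return ident, kind, tokens


if __name__ == "__main__":
    sample = ("KW00029800|B|Tatsutahime[NOUN-PLNAME:TATSUTAHIME]/→"
              "tamukuru[KASHIMO2-ATTR:TAMUkuru],kami[NOUN:KAMI]→no[SUB]")
    print(parse_poem_record(sample))
```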

Page 14: Asialex201103slide02

(A) Corpus: Translations (CT)

$A|000298

$B|秋の末近くなって帰り道についた龍田姫が、道中の無事を願って手向け→
をする神があるからこそ、秋の木の葉が幣となって散っているのだろう。

$C|秋の歌

$D|秋の末近くなって帰り道についた龍田姫が、道中の無事を願って手向け→
をする神があるからこそ、秋の木の葉が幣となって散っているのだろう。

$I|あきのすえちかくなってかえりみちについたたつたひめが、どうちゅう→
のぶじをねがってたむけをするかみがあるからこそ、あきのこのはがぬさ→
となってちっているのだろう。

Figure 3: Format of the database of a CT

Page 15: Asialex201103slide02

(B) Tokenisation:

original text

立田姫手向ける神の有ればこそ秋の木の葉の幣と散るらめ

↓ tokenising

立田姫/手向ける/神/の/[有れ]/ば/こそ/秋/の/木の葉/の/幣/と/散る/[らめ]

↓ converting into predicative form

立田姫/手向ける/神/の/[有り]/ば/こそ/秋/の/木の葉/の/幣/と/散る/[らむ]

Figure 4: Tokenisation of poem texts
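The second step in Figure 4, replacing an inflected form with its predicative form, can be sketched as a table lookup. The table below covers only the two bracketed forms shown in the figure. How KH actually performs this conversion is not described on the slide, so `PREDICATIVE` and `to_predicative` are hypothetical.

```python
# Hypothetical lookup: inflected form -> predicative form (from Figure 4 only).
PREDICATIVE = {
    "有れ": "有り",
    "らめ": "らむ",
}


def to_predicative(tokens):
    """Replace each token that has a known predicative form; keep the rest."""
    return [PREDICATIVE.get(token, token) for token in tokens]


if __name__ == "__main__":
    tokenised = "立田姫/手向ける/神/の/有れ/ば/こそ/秋/の/木の葉/の/幣/と/散る/らめ"
    print("/".join(to_predicative(tokenised.split("/"))))
```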

Page 16: Asialex201103slide02

(C) meta-code conversion

CH-29-2130-01-010-A たつたひめ 立田姫 Tatsutahime Princess-Tatsuta

CH-29-0000-14-010-A -- 立田 -- Tatsuta Tatsuta

BG-01-2030-01-101-A -- 姫 -- hime princess

BG-02-3770-04-080-C たむくる 手向く tamukuru present(verb)

BG-01-5730-02-010-A -- 手 -- te hand

BG-02-1700-01-040-A -- 向ける -- mukeru for

BG-01-2030-01-030-A かみ 神 kami god

BG-08-0061-07-010-A の の no SUB (particle)

BG-02-1200-01-010-C あれ 有り are be

BG-08-0064-26-010-A ば ば ba because (particle)

BG-04-1120-05-150-A -- ば -- ba because (reason)

BG-08-0065-01-010-A こそ こそ koso KP (emphasis)

Figure 5: Meta-code conversion in case of OP
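The token-to-code step can be pictured as a dictionary lookup keyed by the written form. The sketch below uses four rows copied from Figure 5. The table `THESAURUS` and the function `t2c_sketch` are hypothetical stand-ins and do not reproduce the real t2c tool, whose handling of ambiguity and unknown tokens is not shown on the slide.

```python
# A few rows from Figure 5: written form -> list of candidate meta-codes.
THESAURUS = {
    "立田姫": ["CH-29-2130-01-010-A"],
    "神": ["BG-01-2030-01-030-A"],
    "の": ["BG-08-0061-07-010-A"],
    "こそ": ["BG-08-0065-01-010-A"],
}


def t2c_sketch(tokens):
    """Pair each token with its candidate codes; unknown tokens get an empty list."""
    return [(token, THESAURUS.get(token, [])) for token in tokens]


if __name__ == "__main__":
    for token, codes in t2c_sketch(["立田姫", "神", "の", "花"]):
        print(token, codes if codes else "(unknown: candidate for a new entry)")
```

Returning a list of codes leaves room for a word with more than one meaning, such as 卯の花 earlier in the slides.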

Page 17: Asialex201103slide02

(C) Structure of meta-code-1

BG-01-2030-01-030-A-かみ-神

↑ ↑ ↑ ↑ ↑ ↑ ↑ ↑

(1) (2) (3) (4) (5) (6) (7) (8)

Figure 6: Structure of an item of BG database in the case of kami (god):

(1) database ID (BG = short-unit general vocabulary);

(2) part of speech ID (01 = noun);

(3) group ID (2030 = Shinto deities and Buddhas);

(4) field ID;

(5) exact ID (030 = god);

(6) era-flag (A = contemporary, C = classic);

(7) Chinese character reading;

(8) Chinese character

Page 18: Asialex201103slide02

(C) Structure of the meta-code-2

  BG-01-2600-01-020-A yononaka (world)   (1)

= BG-01-2610-01-040-A yo (world)         (2)

+ BG-08-0010-01-021-A no (of)            (3)

+ BG-01-1770-01-080-A naka (inside)      (4)

Figure 7: Structure of an item of the semantic table in the case of a compound word, yononaka (world)
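A compound entry like the one in Figure 7 can be modelled as an item that carries its own code plus an ordered tuple of component items. This is a sketch of one possible data structure, not the layout of the actual semantic table. `Entry` and `all_codes` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Entry:
    code: str
    form: str
    gloss: str
    parts: Tuple["Entry", ...] = ()  # empty for a simple word


def all_codes(entry: Entry):
    """Code of the whole entry followed by the codes of its components."""
    return [entry.code] + [part.code for part in entry.parts]


# The four items of Figure 7.
yononaka = Entry(
    "BG-01-2600-01-020-A", "yononaka", "world",
    parts=(
        Entry("BG-01-2610-01-040-A", "yo", "world"),
        Entry("BG-08-0010-01-021-A", "no", "of"),
        Entry("BG-01-1770-01-080-A", "naka", "inside"),
    ),
)

if __name__ == "__main__":
    print(all_codes(yononaka))
```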

Page 19: Asialex201103slide02

(C) meta-code conversion-3

CH-29-2130-01-010-A たつたひめ 立田姫 Tatsutahime Princess-Tatsuta

CH-29-0000-14-010-A -- 立田 -- Tatsuta Tatsuta

BG-01-2030-01-101-A -- 姫 -- hime princess

BG-02-3770-04-080-C たむくる 手向く tamukuru present(verb)

BG-01-5730-02-010-A -- 手 -- te hand

BG-02-1700-01-040-A -- 向ける -- mukeru for

BG-01-2030-01-030-A かみ 神 kami god

BG-08-0061-07-010-A の の no SUB (particle)

BG-02-1200-01-010-C あれ 有り are be

BG-08-0064-26-010-A ば ば ba because (particle)

BG-04-1120-05-150-A -- ば -- ba because (reason)

BG-08-0065-01-010-A こそ こそ koso KP (emphasis)

Figure 8: Meta-code conversion in case of OP

Page 20: Asialex201103slide02

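[Figure: a poet, with a 10th-century field of experience, writes the OP; an expert reader, with a 20th-century field of experience (expert), reads the OP and writes the CT; a novice reader, with a 20th-century field of experience (novice), reads the CT; the OP and the CT are compared.]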

Figure 9: Schema of relationship between OP and CT

Page 21: Asialex201103slide02

Columns: (a) # of pair; (b) value of matching level (exact = 17, field = 13, group = 10); (c) # of POS; (d) element of OP; (e) # of element of OP; (f) # of element of CT; (g) element of CT

(a) (b) (c) (d) (e) <-> (f) (g)
1 17 11 立田姫 00 <-> 12 龍田姫 (Tatsutahime)
2 17 47 手 04 <-> 25 手 (hand)
3 17 47 向ける 05 <-> 26 向ける (toward)
4 17 2 神 06 <-> 32 神 (god)
5 10 61 の 07 <-> 33 が (SUB)
6 17 47 有り 08 <-> 34 ある (be)
7 10 64 ば 09 <-> 35 から (because)
8 17 65 こそ 11 <-> 36 こそ (EM)
9 17 2 秋 12 <-> 38 秋 (autumn)
10 17 71 の 13 <-> 39 の (CON)
11 17 2 木の葉 14 <-> 40 木の葉 (leaf of tree)
12 17 2 幣 19 <-> 45 幣 (present)
13 17 61 と 20 <-> 46 と (CRD)
14 17 47 散る 21 <-> 49 散る (fall)
15 13 74 らむ 22 <-> 54 う (CJR)

Figure 10: Example of the matching process
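One way to obtain the three matching levels is to compare the group, field, and exact IDs of two meta-codes from left to right. The slide gives the scores (exact = 17, field = 13, group = 10) but not the comparison rule, so the rule below is an assumption. `match_level` is a hypothetical function, and the second code in the group-match example is invented for illustration.

```python
EXACT, FIELD, GROUP, NO_MATCH = 17, 13, 10, 0


def ids(code: str):
    """Return (group, field, exact) from a code like 'BG-01-2030-01-030-A'."""
    parts = code.split("-")
    return parts[2], parts[3], parts[4]


def match_level(op_code: str, ct_code: str) -> int:
    """Assumed rule: the deepest ID level on which both codes agree."""
    group1, field1, exact1 = ids(op_code)
    group2, field2, exact2 = ids(ct_code)
    if group1 != group2:
        return NO_MATCH
    if field1 != field2:
        return GROUP
    if exact1 != exact2:
        return FIELD
    return EXACT


if __name__ == "__main__":
    kami = "BG-01-2030-01-030-A"   # 神, from Figure 5
    hime = "BG-01-2030-01-101-A"   # 姫, from Figure 5
    no = "BG-08-0061-07-010-A"     # の, from Figure 5
    other = "BG-08-0061-01-010-A"  # invented code in the same group as の
    print(match_level(kami, kami), match_level(kami, hime), match_level(no, other))
```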

Page 22: Asialex201103slide02

Residual

CT (秋の末近くなって帰り道についた)龍田姫(が道中の無事を願って)手 向け

OP ——— — — — — — — —立田姫— — — — — — —手向ける

CT (をする)神があるからこそ秋の木の葉(が)幣(となって)散っ(ているのだろ) う

OP — — 神のあれ ば こそ秋の木の葉 [の]幣と — —散る— — — — らめ

Figure 11: Example of the matching process in the case of kks 298 in Komachiya (1982)

Page 23: Asialex201103slide02

Components of OP

Table 2: Result of subtracting the elements of OP(298) from those of CT(298, koma): it indicates the ratio of the ingredients of OP(298).

OP (valid number of elements) = 16

E (ratio of exact match)     12/16 = 0.750
F (ratio of field match)      1/16 = 0.062
G (ratio of group match)      2/16 = 0.125
T (ratio of total match)     15/16 = 0.938
U (ratio of unmatched OP)    1 - T = 0.062
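The ratios in Table 2 follow from counting the matching levels listed in Figure 10. The sketch below recomputes them from the fifteen level values in that figure and the OP element count of 16. The variable names are illustrative. The unrounded value of F and U is 0.0625, which the table prints as 0.062.

```python
from collections import Counter

# Matching level of each of the 15 pairs in Figure 10 (17 exact, 13 field, 10 group).
LEVELS = [17, 17, 17, 17, 10, 17, 10, 17, 17, 17, 17, 17, 17, 17, 13]
OP_ELEMENTS = 16  # valid number of OP elements, from Table 2

counts = Counter(LEVELS)
E = counts[17] / OP_ELEMENTS  # exact match
F = counts[13] / OP_ELEMENTS  # field match
G = counts[10] / OP_ELEMENTS  # group match
T = E + F + G                 # total match
U = 1 - T                     # unmatched OP

if __name__ == "__main__":
    for name, value in [("E", E), ("F", F), ("G", G), ("T", T), ("U", U)]:
        print(f"{name} = {value:.4f}")
```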

Page 24: Asialex201103slide02

Calculation of Residual Rate

\begin{align}
D &= 1 - \frac{P}{T} \tag{1} \\
  &= 1 - \frac{16}{41} \tag{2} \\
  &= 0.61 \tag{3}
\end{align}

Page 25: Asialex201103slide02

Components of CT

Table 3: Component of CT in case of kks 298 by Komachiya (1982): fabs(D-H) stands for the absolute value of the practical value, D, minus the theoretical value, H.

CT (valid number of elements) = 41

Item                              Calculation                     Formula
W (ratio of original word use)    12/41 = 0.293                   E/CT
A (ratio of annotation)           1-0.293 = 0.707                 1-W

--- breakdown of the annotation ---

P1 (ratio of FG paraphrased)      (0.62+0.12)/0.707 = 0.073       (F+G)/A
P2 (ratio of U paraphrased)       (0.707-0.073)*0.062 = 0.040     (A-P1)*U
D (ratio of purely added)         0.707-(0.073+0.040) = 0.595     A-(P1+P2)
H (theoretical value of D)        1-16/41 = 0.610                 1-OP/CT
Gap                               fabs(0.595-0.610) = 0.015       fabs(D-H)
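The figures in Table 3 can be rechecked with a few lines of arithmetic. The sketch below follows the printed formulas for W, A, P2, D, H and Gap. For P1 it assumes the value is the number of field and group matches divided by the CT element count (3/41), because that reproduces the printed 0.073 and the "P1 3 (7.3%)" entry in Figure 12. This reading of P1 is an assumption, not a formula stated on the slide.

```python
CT = 41     # valid number of CT elements
OP = 16     # valid number of OP elements
EXACT = 12  # exact matches
FIELD = 1   # field matches
GROUP = 2   # group matches

U = (OP - EXACT - FIELD - GROUP) / OP  # ratio of unmatched OP (Table 2)

W = EXACT / CT              # ratio of original word use
A = 1 - W                   # ratio of annotation
P1 = (FIELD + GROUP) / CT   # ASSUMED reading of "ratio of FG paraphrased"
P2 = (A - P1) * U           # ratio of U paraphrased
D = A - (P1 + P2)           # ratio of purely added
H = 1 - OP / CT             # theoretical value of D
GAP = abs(D - H)            # fabs(D-H)

if __name__ == "__main__":
    for name, value in [("W", W), ("A", A), ("P1", P1), ("P2", P2),
                        ("D", D), ("H", H), ("Gap", GAP)]:
        print(f"{name} = {value:.3f}")
```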

Page 26: Asialex201103slide02

Subtraction: CT - OP

OP(298): 16 elements

Exact 12 (75.0%)

Field 1 (6.2%)

Group 2 (12.5%)

Unmatched 1 (6.2%)

CT(298, koma): 41 elements

W 12 (29.3%)

P1 3 (7.3%)

P2 1 (4.0%)

D 25 (59.5%)

Figure 12: Pie-charts illustrating the components of OP(298) and CT(298, koma)

Page 27: Asialex201103slide02

(E) Mathematical modelling

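\begin{align}
\mathrm{cw}(t_1, t_2) &= \bigl(1 + \log \mathrm{ctf}(t_1, t_2)\bigr) \sqrt{\mathrm{idf}(t_1)\,\mathrm{idf}(t_2)} \tag{4} \\
\mathrm{idf}(t) &= \log \frac{N}{\mathrm{df}(t)} \tag{5}
\end{align}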
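Equations (4) and (5) can be tried out on a toy corpus. The slide does not define the symbols, so this sketch makes several assumptions: N is the number of texts, df(t) is the number of texts containing t, ctf(t1, t2) is the number of texts in which both tokens occur, and the logarithm is natural. `cooccurrence_weights` is a hypothetical name.

```python
import math
from collections import Counter
from itertools import combinations


def cooccurrence_weights(texts):
    """Return {(t1, t2): cw} for every token pair that co-occurs in a text."""
    n = len(texts)
    df = Counter()   # number of texts containing each token
    ctf = Counter()  # number of texts containing both tokens of a pair
    for tokens in texts:
        types = sorted(set(tokens))
        df.update(types)
        ctf.update(combinations(types, 2))
    idf = {t: math.log(n / df[t]) for t in df}
    return {
        (t1, t2): (1 + math.log(count)) * math.sqrt(idf[t1] * idf[t2])
        for (t1, t2), count in ctf.items()
    }


if __name__ == "__main__":
    toy = [
        ["warbler", "spring", "flower"],
        ["warbler", "spring"],
        ["cuckoo", "summer"],
        ["cuckoo", "voice"],
    ]
    for pair, weight in sorted(cooccurrence_weights(toy).items()):
        print(pair, round(weight, 3))
```

Pairs are stored with their two tokens in sorted order, so `("spring", "warbler")` and `("warbler", "spring")` share one entry.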

Page 28: Asialex201103slide02

[Figure: two co-occurrence networks, titled warbler-CT-23-229-3.73-15 and cuckoo-CT-40-370-3.27-16. Node labels include warbler, cuckoo, spring, flower, plum, branch, green willow, spring haze, fragrance.1, Tatsuta.PN, Otowa.PN, May, midsummer rain, mountain, summer mountains, mountain cuckoo, voice, singing voice, a cry, cry.vi, sing.vi, hear, be heard.1, treetop, iris.1, last year, this morning, every morning, old age, woven hat, sew, wear in (my) hair, break off, scatter.1, flutter.2, hide.vi.2, go over, regret, separation, lure, guidance.1, send, force, touch, hand, attach, borrow, imperceptibly, near and far; the numeric weights are not recoverable from the extracted text.]

Page 29: Asialex201103slide02

Conclusion

The thesaurus annotated with meta-codes allows researchers

1. to identify different orthographies as the same word;

2. to attach an alternative semantic ID to a word which has the same form but more than one meaning (polysemic word);

3. to attach meta-codes not only to tokens recognised as single, simple words but also to longer tokens;

4. to indicate a similarity between tokens;

5. to detect common or different tokens among more than one text, which will tell us the similarities or differences between texts;

6. to indicate the relative differences between two words in literary works.

Page 30: Asialex201103slide02

Questions

• Computer Modelling of Classical Japanese Poetic Vocabulary

http://etymology.jp/waka/poem.cgi

• Inquiry:

Hilofumi Yamamoto

[email protected]

• Thank you.