Asialex 2011 Kyoto, Japan 1
Development of the Thesaurus of Classical Japanese Poetic Vocabulary
Hilofumi Yamamoto
Tokyo Institute of Technology
Makiro Tanaka
National Institute of Japanese Language and Linguistics
22nd Aug. 2011
Outline
1. Purpose of Study
• Connotation of classical poetic vocabulary
• Longitudinal study of transition of vocabulary
2. Development of Thesaurus
3. Applications
Waka: Japanese Poetry
Tatsuta-Hime / tamukuru KAMI no / areba koso / aki no konoha no / nusa to chirurame
Because Princess Tatsuta has a god to whom she offers brocades, the leaves of trees in autumn will scatter as an offering.
Prince Kanemi, No. 298 in the Kokinshu
Problem: Orthography
in hiragana
たつた
in Chinese characters
立田, 竜田, 龍田
→ All Tatsuta (place name)
Problem: Unit size / attribution
The unit size and meaning of a word depend on context.
• unit → 卯の花 or 卯/の/花 (Nakano, 1998)
• orthography → さびしい/さみしい/寂しい/淋しい(sad)
• attributions → 卯の花 ∈ plant or 卯の花 ∈ food
(unohana = a deutzia or bean curd refuse)
An Item of Thesaurus: God
BG - 01 - 2030 - 01 - 030 - A - かみ - 神
(1)  (2)  (3)    (4)  (5)   (6) (7)    (8)
Figure 1: Structure of an item of BG database in the case of kami (god):
(1) database ID (BG = short-unit general vocabulary);
(2) part of speech ID (01 = noun);
(3) group ID (2030 = Shinto deities and Buddhas);
(4) field ID;
(5) exact ID (030 = god);
(6) era-flag (A = contemporary, C = classic);
(7) Chinese character reading;
(8) Chinese character
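The eight-part layout in Figure 1 can be split mechanically on the hyphens. The following is a minimal sketch, assuming items are always written with exactly eight hyphen-separated parts as in the kami example; the class name `ThesaurusItem` and the function `parse_item` are hypothetical names chosen for illustration, not part of the thesaurus tools.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ThesaurusItem:
    """One thesaurus item, with fields numbered as in Figure 1."""
    database: str  # (1) database ID, e.g. "BG"
    pos: str       # (2) part of speech ID, e.g. "01" = noun
    group: str     # (3) group ID
    field: str     # (4) field ID
    exact: str     # (5) exact ID
    era: str       # (6) era flag: "A" = contemporary, "C" = classic
    reading: str   # (7) reading
    surface: str   # (8) Chinese character form


def parse_item(item: str) -> ThesaurusItem:
    """Split an item string into its eight parts (assumed layout)."""
    parts = item.split("-", 7)  # at most 8 parts
    if len(parts) != 8:
        raise ValueError(f"expected 8 hyphen-separated parts: {item!r}")
    return ThesaurusItem(*parts)


kami = parse_item("BG-01-2030-01-030-A-かみ-神")
print(kami.group, kami.exact, kami.era, kami.surface)
```

Keeping the parts as strings preserves leading zeros such as `030`, which matter when codes are later compared by prefix.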
Development: Thesaurus, KH, and t2c
• Thesaurus for classical poetic vocabulary
• KH (tokenizer)
• t2c (token to code converter)
Materials: the Hachidaishu
• The Hachidaishu: eight anthologies compiled by
imperial order between ca. 905 and 1205.
• The database: compiled by the National Institute of
Japanese Literature, Japan.
• The texts are taken from the Shohobonban version of the
Hachidaishu.
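Figure: Timeline of the Hachidaishu, 900–1250 (year of compilation; interval since the previous anthology)
Kokinshu (905)
Gosenshu (951), 46 years later
Shuishu (1007), 56 years later
Goshuishu (1086), 79 years later
Kinyoshu (1124), 38 years later
Shikashu (1144), 20 years later
Senzaishu (1188), 44 years later
Shinkokinshu (1205), 17 years later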
Methods: Flowchart of data processing
A. Corpus development
B. Tokenisation
C. Meta-code conversion
D. Mathematical modelling
E. Subtraction: CT − OP
F. Visualisation
Table 1: An example of input for KH / Gosenshu No. 664
input:
000664 わすられて思ふなげきのしげるをや身をはづかしのもりといふらん
output:
000664
わすら (ラ四-未:忘る:わする:忘ら:わすら)
れ (自可受-用:る:る:れ:れ)
て (接助:て:て)
思ふ (ハ四-終体:思ふ:おもふ:思ふ:おもふ)
なげき (カ四-用:嘆く:なげく:嘆き:なげき)
の (格助:の:の)
しげる (ラ四-終体:茂る:しげる:茂る:しげる)
を (*助:を:を)
や (係助:や:や)
身 (名:身:み)
を (*助:を:を)
---
はづかし (名-地名:羽束師:はづかし)
の (格助:の:の)
---
はづかし (形シク-終:恥づかし:はづかし:恥づかし:はづかし)
の (格助:の:の)
---
もり (名:森:もり)
と (格助-引用:と:と)
いふ (ハ四-終体:言ふ:いふ:言ふ:いふ)
らん (推-終体:らむ:らむ:らむ:らむ)
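Each token in the KH output of Table 1 is a surface form followed by a parenthesised, colon-separated annotation. A minimal sketch of reading one such token is given below. The meaning assigned to the fields (a grammatical tag first, then dictionary and inflected forms) is an assumption read off the example, and `parse_kh_token` is a hypothetical helper, not part of KH.

```python
import re

# surface form, optional space, then "(tag:form:form...)"
TOKEN_RE = re.compile(r"^(?P<surface>\S+)\s*\((?P<info>[^()]*)\)$")


def parse_kh_token(text: str) -> dict:
    """Parse one annotated token such as 'て (接助:て:て)'."""
    match = TOKEN_RE.match(text.strip())
    if match is None:
        raise ValueError(f"not an annotated token: {text!r}")
    # Assumption: the first field is the grammatical tag,
    # the remaining fields are written forms and readings.
    tag, *forms = match.group("info").split(":")
    return {"surface": match.group("surface"), "tag": tag, "forms": forms}


token = parse_kh_token("わすら (ラ四-未:忘る:わする:忘ら:わすら)")
print(token["surface"], token["tag"], token["forms"])
```

The `---` separator lines in the output are not tokens; a caller would need to skip them or treat them as markers of alternative analyses, as with the two readings of はづかし.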
Development: Thesaurus
Figure: Poem texts → kh (tokeniser) → t2c (thesaurus code tagger) → Hachidaishu Thesaurus
Feedback loops (A) and (B): add unknown entries to the dictionary (general, place name, personal name, etc.); add new thesaurus codes.
(A) Corpus: Poems (OP)
KW00029800|A|KANEMI NO O=kanemi no o
KW00029800|B|Tatsutahime[NOUN-PLNAME:TATSUTAHIME]/→
tamukuru[KASHIMO2-ATTR:TAMUkuru],kami[NOUN:KAMI]→
no[SUB]are[RAHEN-REAL]ba[CAUS]koso[KP]/→
aki[NOUN:AKI]no[CON],konoha[NOUN:KOnoHA]no[SUB]/→
nusa[NOUN:NUSA]to[P-CRD],chiru[RA4DAN-FF:CHIru]→
rame[CJR-REAL]/
Figure 2: Format of the database of a poem: → indicates continuing to the
next line without breaks; the first line, which includes |A|, indicates
the name of the poet; the second line which includes |B|, indicates
the contents of the poem and added information.
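The record format in Figure 2 is regular enough to read with a split on `|` and one regular expression. The following is a rough sketch, assuming the continuation arrows have already been removed so that each record sits on one line; `parse_record` is a hypothetical helper and the handling of the `/` and `,` separators is deliberately ignored.

```python
import re

# romanised word followed by its bracketed tag, e.g. kami[NOUN:KAMI]
WORD_TAG_RE = re.compile(r"([A-Za-z]+)\[([^\]]*)\]")


def parse_record(line: str):
    """Split 'ID|FIELD|CONTENT' and, for |B| lines, extract (word, tag) pairs."""
    poem_id, field, content = line.split("|", 2)
    tokens = WORD_TAG_RE.findall(content) if field == "B" else []
    return poem_id, field, content, tokens


line = ("KW00029800|B|Tatsutahime[NOUN-PLNAME:TATSUTAHIME]/"
        "tamukuru[KASHIMO2-ATTR:TAMUkuru],kami[NOUN:KAMI]no[SUB]")
poem_id, field, content, tokens = parse_record(line)
for word, tag in tokens:
    print(word, tag)
```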
(A) Corpus: Translations (CT)
$A|000298
$B|秋の末近くなって帰り道についた龍田姫が、道中の無事を願って手向け→
をする神があるからこそ、秋の木の葉が幣となって散っているのだろう。
$C|秋の歌
$D|秋の末近くなって帰り道についた龍田姫が、道中の無事を願って手向け→
をする神があるからこそ、秋の木の葉が幣となって散っているのだろう。
$I|あきのすえちかくなってかえりみちについたたつたひめが、どうちゅう→
のぶじをねがってたむけをするかみがあるからこそ、あきのこのはがぬさ→
となってちっているのだろう。
Figure 3: Format of the database of a CT
(B) Tokenisation:
original text
立田姫手向ける神の有ればこそ秋の木の葉の幣と散るらめ
↓tokenising
立田姫/手向ける/神/の/[有れ]/ば/こそ/秋/の/木の葉/の/幣/と/散る/[らめ]
↓converting into predicative form
立田姫/手向ける/神/の/[有り]/ば/こそ/秋/の/木の葉/の/幣/と/散る/[らむ]
Figure 4: Tokenisation of poem texts
(C) meta-code conversion
CH-29-2130-01-010-A たつたひめ 立田姫 Tatsutahime Princess-Tatsuta
CH-29-0000-14-010-A -- 立田 -- Tatsuta Tatsuta
BG-01-2030-01-101-A -- 姫 -- hime princess
BG-02-3770-04-080-C たむくる 手向く tamukuru present(verb)
BG-01-5730-02-010-A -- 手 -- te hand
BG-02-1700-01-040-A -- 向ける -- mukeru for
BG-01-2030-01-030-A かみ 神 kami god
BG-08-0061-07-010-A の の no SUB (particle)
BG-02-1200-01-010-C あれ 有り are be
BG-08-0064-26-010-A ば ば ba because (particle)
BG-04-1120-05-150-A -- ば -- ba because (reason)
BG-08-0065-01-010-A こそ こそ koso KP (emphasis)
Figure 5: Meta-code conversion in case of OP
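Figure 5 pairs each token with one or more meta-codes. A minimal sketch of such a lookup follows, assuming a plain dictionary keyed by the written form of the token. The codes are copied from Figure 5, but the table layout and the function `t2c_lookup` are hypothetical and do not describe how t2c is actually implemented.

```python
# Codes copied from Figure 5; a token may carry more than one code.
CODE_TABLE: dict[str, list[str]] = {
    "神": ["BG-01-2030-01-030-A"],
    "姫": ["BG-01-2030-01-101-A"],
    "の": ["BG-08-0061-07-010-A"],
    "有り": ["BG-02-1200-01-010-C"],
    "ば": ["BG-08-0064-26-010-A", "BG-04-1120-05-150-A"],
    "こそ": ["BG-08-0065-01-010-A"],
}


def t2c_lookup(tokens: list[str]) -> list[tuple[str, list[str]]]:
    """Attach every known code to each token; unknown tokens get an empty list."""
    return [(token, CODE_TABLE.get(token, [])) for token in tokens]


for token, codes in t2c_lookup(["神", "の", "有り", "ば", "こそ", "幣"]):
    print(token, codes or "UNKNOWN")
```

Tokens that come back with an empty list are the candidates for new thesaurus entries.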
(C) Structure of meta-code-1
BG - 01 - 2030 - 01 - 030 - A - かみ - 神
(1)  (2)  (3)    (4)  (5)   (6) (7)    (8)
Figure 6: Structure of an item of BG database in the case of kami (god):
(1) database ID (BG = short-unit general vocabulary);
(2) part of speech ID (01 = noun);
(3) group ID (2030 = Shinto deities and Buddhas);
(4) field ID;
(5) exact ID (030 = god);
(6) era-flag (A = contemporary, C = classic);
(7) Chinese character reading;
(8) Chinese character
(C) Structure of the meta-code-2
BG-01-2600-01-020-A yononaka (world) (1)
= BG-01-2610-01-040-A yo (world) (2)
+ BG-08-0010-01-021-A no (of) (3)
+ BG-01-1770-01-080-A naka (inside) (4)
Figure 7: Structure of an item of the semantic table in the case of a compound word, yononaka (world)
(C) meta-code conversion-3
CH-29-2130-01-010-A たつたひめ 立田姫 Tatsutahime Princess-Tatsuta
CH-29-0000-14-010-A -- 立田 -- Tatsuta Tatsuta
BG-01-2030-01-101-A -- 姫 -- hime princess
BG-02-3770-04-080-C たむくる 手向く tamukuru present(verb)
BG-01-5730-02-010-A -- 手 -- te hand
BG-02-1700-01-040-A -- 向ける -- mukeru for
BG-01-2030-01-030-A かみ 神 kami god
BG-08-0061-07-010-A の の no SUB (particle)
BG-02-1200-01-010-C あれ 有り are be
BG-08-0064-26-010-A ば ば ba because (particle)
BG-04-1120-05-150-A -- ば -- ba because (reason)
BG-08-0065-01-010-A こそ こそ koso KP (emphasis)
Figure 8: Meta-code conversion in case of OP
Diagram: poet → (write) → OP → (read) → expert reader → (write) → CT → (read) → novice reader; OP and CT are compared.
Fields of experience: poet, 10th century; expert reader, 20th century (expert); novice reader, 20th century (novice).
Figure 9: Schema of relationship between OP and CT
Columns: pair no.; value of matching level (exact = 17, field = 13, group = 10); POS no.; element of OP; element no. in OP; element no. in CT; element of CT

pair  level  POS  OP element  OP no.  <->  CT no.  CT element
1     17     11   立田姫      00      <->  12      龍田姫 (Tatsutahime)
2     17     47   手          04      <->  25      手 (hand)
3     17     47   向ける      05      <->  26      向ける (toward)
4     17     2    神          06      <->  32      神 (god)
5     10     61   の          07      <->  33      が (SUB)
6     17     47   有り        08      <->  34      ある (be)
7     10     64   ば          09      <->  35      から (because)
8     17     65   こそ        11      <->  36      こそ (EM)
9     17     2    秋          12      <->  38      秋 (autumn)
10    17     71   の          13      <->  39      の (CON)
11    17     2    木の葉      14      <->  40      木の葉 (leaf of tree)
12    17     2    幣          19      <->  45      幣 (present)
13    17     61   と          20      <->  46      と (CRD)
14    17     47   散る        21      <->  49      散る (fall)
15    13     74   らむ        22      <->  54      う (CJR)
Figure 10: Example of the matching process
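The level values 17, 13 and 10 coincide with the character lengths of a code up to and including its exact ID, field ID and group ID respectively (for example, `BG-01-2030-01-030` has 17 characters). Assuming that this is how the levels are defined, which the slides do not state, a matching level can be sketched as the longest of these prefixes that two codes share. The function name `match_level` is hypothetical.

```python
# Prefix lengths: through exact ID (17), field ID (13), group ID (10).
# Assumption: a level is the longest shared prefix among these lengths.
LEVELS = ((17, "exact"), (13, "field"), (10, "group"))


def match_level(code_a: str, code_b: str) -> int:
    """Return 17, 13 or 10 for an exact, field or group match, else 0."""
    for length, _name in LEVELS:
        if code_a[:length] == code_b[:length]:
            return length
    return 0


kami = "BG-01-2030-01-030-A"   # 神, from Figure 5
hime = "BG-01-2030-01-101-A"   # 姫, from Figure 5
no = "BG-08-0061-07-010-A"     # の, from Figure 5
same_group = "BG-08-0061-01-010-A"  # hypothetical code, for illustration only

print(match_level(kami, kami))        # exact match
print(match_level(kami, hime))        # same group and field
print(match_level(no, same_group))    # same group only
print(match_level(kami, no))          # no match
```

Comparing prefixes rather than parsed fields means the era flag never affects the result, so a classic (C) and a contemporary (A) entry with the same IDs still match exactly.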
Residual
CT (秋の末近くなって帰り道についた)龍田姫(が道中の無事を願って)手 向け
OP ——— — — — — — — —立田姫— — — — — — —手向ける
CT (をする)神があるからこそ秋の木の葉(が)幣(となって)散っ(ているのだろ) う
OP — — 神のあれ ば こそ秋の木の葉 [の]幣と — —散る— — — — らめ
Figure 11: Example of the matching process in the case of kks 298 in Komachiya (1982)
Components of OP
Table 2: Result of subtracting the elements of OP(298) from those of CT(298, koma): it indicates the ratio of the ingredients of OP(298).
OP (valid number of elements) = 16
E (ratio of exact match)     12/16 = 0.750
F (ratio of field match)      1/16 = 0.062
G (ratio of group match)      2/16 = 0.125
T (ratio of total match)     15/16 = 0.938
U (ratio of unmatched OP)    1 - T = 0.062
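The ratios in Table 2 follow directly from the level column of the matching example (twelve pairs at 17, one at 13, two at 10) and the 16 valid OP elements. A small sketch of that arithmetic is shown here; the function name `op_match_ratios` is hypothetical.

```python
from collections import Counter

LEVEL_NAMES = {17: "exact", 13: "field", 10: "group"}


def op_match_ratios(levels: list[int], op_size: int) -> dict[str, float]:
    """Compute E, F, G, T and U from matching levels and the OP element count."""
    counts = Counter(LEVEL_NAMES[level] for level in levels)
    e = counts["exact"] / op_size
    f = counts["field"] / op_size
    g = counts["group"] / op_size
    t = e + f + g          # total matched
    return {"E": e, "F": f, "G": g, "T": t, "U": 1 - t}


# Level values of the 15 matched pairs for poem 298.
levels_298 = [17, 17, 17, 17, 10, 17, 10, 17, 17, 17, 17, 17, 17, 17, 13]
ratios = op_match_ratios(levels_298, op_size=16)
for name, value in ratios.items():
    print(f"{name} = {value:.4f}")
```

The unrounded values for F and U are 0.0625, and T is 0.9375; the table reports them to three decimal places.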
Calculation of Residual Rate
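\begin{align}
D &= 1 - \frac{P}{T} \tag{1} \\
  &= 1 - \frac{16}{41} \tag{2} \\
  &= 0.61 \tag{3}
\end{align}
where P = 16 is the number of valid OP elements and T = 41 is the number of valid CT elements.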
Components of CT
Table 3: Components of CT in the case of kks 298 by Komachiya (1982): fabs(D-H) stands for the absolute value of the practical value, D, minus the theoretical value, H.
CT (valid number of elements) = 41
W (ratio of original word use)    12/41 = 0.293                      E/CT
A (ratio of annotation)           1 - 0.293 = 0.707                  1 - W
--- breakdown of the annotation ---
P1 (ratio of FG paraphrased)      (1 + 2)/41 = 0.073                 (F + G)/CT
P2 (ratio of U paraphrased)       (0.707 - 0.073) * 0.062 = 0.040    (A - P1) * U
D (ratio of purely added)         0.707 - (0.073 + 0.040) = 0.595    A - (P1 + P2)
H (theoretical value of D)        1 - 16/41 = 0.610                  1 - OP/CT
Gap                               fabs(0.595 - 0.610) = 0.015        fabs(D - H)
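The chain of values in Table 3 can be checked with a few lines of arithmetic. The sketch below assumes that P1 is the number of field and group matches divided by the CT element count (3/41), since that is what reproduces the reported 0.073 and the 7.3% shown for P1; the variable names are illustrative only.

```python
# Counts for poem 298: OP and CT element counts and match counts.
op_size, ct_size = 16, 41
exact, field, group = 12, 1, 2

u = 1 - (exact + field + group) / op_size  # unmatched ratio of OP (Table 2)

w = exact / ct_size                 # W: original word use
a = 1 - w                           # A: annotation
p1 = (field + group) / ct_size      # P1: assumed to be (F + G counts) / CT
p2 = (a - p1) * u                   # P2: paraphrase of unmatched OP
d = a - (p1 + p2)                   # D: purely added material (practical)
h = 1 - op_size / ct_size           # H: theoretical value of D
gap = abs(d - h)                    # Gap between practical and theoretical

for name, value in [("W", w), ("A", a), ("P1", p1), ("P2", p2),
                    ("D", d), ("H", h), ("Gap", gap)]:
    print(f"{name:>3} = {value:.3f}")
```

Carrying full precision through the chain and rounding only at the end gives D = 0.595, which matches the table; rounding each intermediate value first would give 0.594.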
Subtraction: CT - OP
OP(298): 16 elements
Exact 12 (75.0%)
Field 1 (6.2%)
Group 2 (12.5%)
Unmatched 1 (6.2%)
CT(298, koma): 41 elements
W 12 (29.3%)
P1 3 (7.3%)
P2 1 (4.0%)
D 25 (59.5%)
Figure 12: Pie charts illustrating the components of OP(298) and CT(298, koma)
(E) Mathematical modelling
\begin{align}
cw(t_1, t_2) &= \bigl(1 + \log ctf(t_1, t_2)\bigr) \sqrt{idf(t_1)\, idf(t_2)} \tag{4} \\
idf(t) &= \log \frac{N}{df(t)} \tag{5}
\end{align}
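Equations (4) and (5) can be tried out on a toy corpus. The slides do not define the symbols, so this sketch assumes that N is the number of poems, df(t) is the number of poems containing t, and ctf(t1, t2) is the number of poems in which both tokens occur. The function `cooccurrence_weights` and the four toy poems are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def cooccurrence_weights(poems: list[list[str]]) -> dict[tuple[str, str], float]:
    """Compute cw(t1, t2) for every token pair that co-occurs in some poem."""
    n = len(poems)
    df = Counter()   # poems containing each token
    ctf = Counter()  # poems containing each (sorted) pair of tokens
    for poem in poems:
        terms = sorted(set(poem))
        df.update(terms)
        ctf.update(combinations(terms, 2))
    idf = {t: math.log(n / df[t]) for t in df}
    return {
        (t1, t2): (1 + math.log(count)) * math.sqrt(idf[t1] * idf[t2])
        for (t1, t2), count in ctf.items()
    }


# Toy data, not taken from the Hachidaishu.
poems = [
    ["warbler", "spring", "flower"],
    ["warbler", "spring"],
    ["cuckoo", "mountain"],
    ["cuckoo", "voice", "mountain"],
]
weights = cooccurrence_weights(poems)
print(weights[("spring", "warbler")])
```

Under these definitions a token that occurs in every poem has idf 0, so every pair containing it receives weight 0 regardless of how often the pair co-occurs.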
[Figure: two word graphs, titled warbler-CT-23-229-3.73-15 and cuckoo-CT-40-370-3.27-16. Node labels include warbler, spring, flower, plum, green willow, branch, spring haze, Tatsuta.PN, cuckoo, mountain cuckoo, May, midsummer rain, summer mountains, mountain, voice, singing voice, hear, treetop and Otowa.PN, each with a numeric label.]
Conclusion
The thesaurus annotated with meta-codes allows researchers
1. to identify different orthographies as the same word;
2. to attach alternative semantic IDs to a word that has the same form but more than one meaning (polysemous word);
3. to attach meta-codes not only to tokens recognised as single/simple words but also to longer tokens;
4. to indicate similarity between tokens;
5. to detect common or different tokens across two or more texts, which reveals the similarities or differences between the texts;
6. to indicate the relative differences between two words in literary works.
Questions
• Computer Modelling of Classical Japanese Poetic
Vocabulary
http://etymology.jp/waka/poem.cgi
• Inquiry:
Hilofumi Yamamoto
• Thank you.