4
NTCIR-15 QA Lab-PoliInfo2 Dialog Topic Detection based on Discussion Structure Graph Yuya HIRAI Faculty of Information and Communication Engineering [email protected] Yo AMANO Faculty of Information and Communication Engineering [email protected] Kazuhiro TAKEUCHI Faculty of Information and Communication Engineering [email protected] ABSTRACT This paper proposes a dialog summarization method using graph representation that shows the target discussion structure. Firstly our proposed method extracts words related to the opinion and po- sition of each participant based on co-occurrence. Then, LDA (La- tent Dirichlet Allocation) is applied to classify position and opinion words, while we determine the number of topics needed for LDA as its parameter by hierarchical clustering using co-occurrence fre- quency. Finally, topic phrases as output are generated by using the dependency structure analysis. TEAM NAME TKLB SUBTASKS Dialog Topic Detection, LDA (Latent Dirichlet Analysis), Discus- sion structure Graph 1 INTRODUCTION Facing the problem caused by the new coronavirus infection (COVID19), it is more important for the national and local governments to com- municate with the residents quickly and appropriately. QALab- PoliInfo-2[2] performs the tasks (stance classification, dialog sum- marization, entity linking, and topic detection) to explore the in- formation technology in these situations. There are already two existing media of communication between them: one version is- sued by local governments conveys the contents discussed in the parliament in an easy-to-understand manner, but it takes time to publish because it is created manually. The other version quickly conveys the contents of representative questions and general ques- tions from members who attended the discussion, but there is room for improvement in terms of the ease of understanding the agenda and issues. In this paper, we propose a way to show an equivalent representa- tion of the former version by an automated processing of the latter based on a graph where nodes correspond to the word uttered by the discussion participants. We assume the proposed graph struc- ture that shows conflicting opinions and positions in the discus- sion. In other words, our purpose in this paper is to show the con- flict points in a discussion. For making the structure, we employ LDA to identify the words to attention. 2 DISCUSSION STRUCTURE IN DISCUSSION Our primary strategy for summarizing is not based on text co- herence of discussion. Our proposed method focuses on express- ing the differences of word use between its participants. Figure 1 shows our graph representation of discussion, which we refer to as discussion structure graph in this paper. In this figure, the round nodes show the utterances expressed in natural language, the square nodes show the participants, and each link shows the relationship between which participant utters a particular word. As shown in this figure, this discussion has a common topic or sub- ject, but at the same time has different positions among the partic- ipants. Specifically, we can distinguish the nodes into two classes with the adjacency relations among them in this graph represen- tation: one is the subject or topic of the discussion (gray nodes), and the other is the participant’s position or opinion (yellow and green nodes). Figure 1: Argument structure graph 3 DATA As the preprocessing for summarizing the transcripted data from the QA Lab-PoliInfo-2 Committee, we divide the data into hier- archical units that we assume to correspond to written language units. The largest unit corresponds to the segment that each speaker argues in the discussion. The next largest unit is regarded as a para- graph, and the smallest unit corresponds to a sentence in written language. In the summarization process, we use two packages of free soft- ware. One is KHcoder[1], a tool that efficiently performs sophisti- cated analyses (such as correspondence analysis, cluster analysis, multidimensional scaling, self-organizing map, co-occurrence net- work, machine learning) by GUI operation. The other is Mallet[4], which we use for performing LDA. As processing by KHcoder, it conducts morphological analysis to the input data using Chasen[3]. In this process, we exclude the words shown in Table 1 as stop words. NTCIR 15 Conference: Proceedings of the 15th NTCIR Conference on Evaluation of Information Access Technologies, December 8-11, 2020 Tokyo Japan 168

NTCIR-15 QA Lab-PoliInfo2 Dialog Topic Detection based on ...research.nii.ac.jp/ntcir/workshop/OnlineProceedings15/...TKLB SUBTASKS Dialog Topic Detection, LDA (Latent Dirichlet Analysis),

  • Upload
    others

  • View
    3

  • Download
    0

Embed Size (px)

Citation preview

Page 1: NTCIR-15 QA Lab-PoliInfo2 Dialog Topic Detection based on ...research.nii.ac.jp/ntcir/workshop/OnlineProceedings15/...TKLB SUBTASKS Dialog Topic Detection, LDA (Latent Dirichlet Analysis),

NTCIR-15 QA Lab-PoliInfo2 Dialog Topic Detection based onDiscussion Structure Graph

Yuya HIRAIFaculty of Information andCommunication Engineering

[email protected]

Yo AMANOFaculty of Information andCommunication Engineering

[email protected]

Kazuhiro TAKEUCHIFaculty of Information andCommunication Engineering

[email protected]

ABSTRACTThis paper proposes a dialog summarization method using graphrepresentation that shows the target discussion structure. Firstlyour proposedmethod extracts words related to the opinion and po-sition of each participant based on co-occurrence. Then, LDA (La-tent Dirichlet Allocation) is applied to classify position and opinionwords, while we determine the number of topics needed for LDAas its parameter by hierarchical clustering using co-occurrence fre-quency. Finally, topic phrases as output are generated by using thedependency structure analysis.

TEAM NAMETKLB

SUBTASKSDialog Topic Detection, LDA (Latent Dirichlet Analysis), Discus-sion structure Graph

1 INTRODUCTIONFacing the problem caused by the new coronavirus infection (COVID19),it is more important for the national and local governments to com-municate with the residents quickly and appropriately. QALab-PoliInfo-2[2] performs the tasks (stance classification, dialog sum-marization, entity linking, and topic detection) to explore the in-formation technology in these situations. There are already twoexisting media of communication between them: one version is-sued by local governments conveys the contents discussed in theparliament in an easy-to-understand manner, but it takes time topublish because it is created manually. The other version quicklyconveys the contents of representative questions and general ques-tions frommemberswho attended the discussion, but there is roomfor improvement in terms of the ease of understanding the agendaand issues.In this paper, we propose a way to show an equivalent representa-tion of the former version by an automated processing of the latterbased on a graph where nodes correspond to the word uttered bythe discussion participants. We assume the proposed graph struc-ture that shows conflicting opinions and positions in the discus-sion. In other words, our purpose in this paper is to show the con-flict points in a discussion. For making the structure, we employLDA to identify the words to attention.

2 DISCUSSION STRUCTURE IN DISCUSSIONOur primary strategy for summarizing is not based on text co-herence of discussion. Our proposed method focuses on express-ing the differences of word use between its participants. Figure

1 shows our graph representation of discussion, which we referto as discussion structure graph in this paper. In this figure, theround nodes show the utterances expressed in natural language,the square nodes show the participants, and each link shows therelationship between which participant utters a particular word.As shown in this figure, this discussion has a common topic or sub-ject, but at the same time has different positions among the partic-ipants. Specifically, we can distinguish the nodes into two classeswith the adjacency relations among them in this graph represen-tation: one is the subject or topic of the discussion (gray nodes),and the other is the participant’s position or opinion (yellow andgreen nodes).

Figure 1: Argument structure graph

3 DATAAs the preprocessing for summarizing the transcripted data fromthe QA Lab-PoliInfo-2 Committee, we divide the data into hier-archical units that we assume to correspond to written languageunits. The largest unit corresponds to the segment that each speakerargues in the discussion. The next largest unit is regarded as a para-graph, and the smallest unit corresponds to a sentence in writtenlanguage.In the summarization process, we use two packages of free soft-ware. One is KHcoder[1], a tool that efficiently performs sophisti-cated analyses (such as correspondence analysis, cluster analysis,multidimensional scaling, self-organizing map, co-occurrence net-work, machine learning) by GUI operation. The other is Mallet[4],which we use for performing LDA.As processing by KHcoder, it conducts morphological analysis tothe input data using Chasen[3]. In this process, we exclude thewords shown in Table 1 as stop words.

NTCIR 15 Conference: Proceedings of the 15th NTCIR Conference on Evaluation of Information Access Technologies, December 8-11, 2020 Tokyo Japan

168

Page 2: NTCIR-15 QA Lab-PoliInfo2 Dialog Topic Detection based on ...research.nii.ac.jp/ntcir/workshop/OnlineProceedings15/...TKLB SUBTASKS Dialog Topic Detection, LDA (Latent Dirichlet Analysis),

Table 1: List of stopword

KHcoder 品詞名 POS名詞 B 普通名詞 がん,みずから,きっかけ動詞 B 動詞 よい,ふさわしい形容詞 B 形容詞 する,ある,ない,いたす副詞 B 副詞 かつて,とりわけ,これから否定助動詞 助動詞 ない,ん,ぬ形容詞 (非自立) 形容詞 やすい,ほしい,ない,いたす

4 PROPOSED METHOD4.1 Discussion structure GraphFigure 2 shows an example of the exacted discussion structuregraphs from the transcription data. For any time interval of a givendiscussion, one can make this kind of graph, where it shows thewords and speakers as nodes and who said the words as links. Thegiven duration to Figure 2 is a specific day. Each ellipse in Figure2 shows three structures similar to that shown in Figure 1 that weassume to represent conflicts among participants in the previoussection.

Figure 2: An example of exact argument structure graphfrom the discussion

4.2 Clustering the wordsAs explained so far, our proposed method detects topics based on agraph representation with word nodes. To find topics expressed indifferent words in all given data, we conduct word clustering thatmight be known as the most fundamental method.Figure 3 shows a part of the results of a tree of clustering usingtheWard method. In most hierarchical clustering methods, includ-ing the clustering shown here, the criteria for whether a givenpair is similar or not is estimated using metrics based on the co-occurrence frequencies in an individual unit of text or discourse.Word co-occurrence can be a fundamental metric, but we regardthe word similarity based on the raw co-occurrence frequenciesas a rough guideline. In concrete, we use this simple clustering toestimate the number of topics in the argument. In Figure 4, thehorizontal axis shows the process of integrating the clusters from

leaves to the root in the hierarchical tree, and the number of clus-ters is plotted for each integration until the number of clusters be-comes one. Since the most massive change in the slope of plotsoccurred when the number of clusters was six in Figure 4, we re-gard that all of the given discussions can consist of six topics. Weset the number of topics specified for LDA to six. (In other words,from this examination, we assume that six topics in the LDAmodelgenerate the given discussions through 8 days.)

関係

さまざま

実現

都民

東京

推進

新た

質問

お答え

事業

活用

支援

今後

行う

実施

図る

連携

取り組む

重要

進める

取り組み

向ける

感染

ウイルス

コロナ

新型

拡大

防止

対策

影響

受ける

生活

緊急

職員

協力

多く

情報

提供

対象

利用

センター

専門

相談

ほか

応じる

充実

一層

環境

都内

活動

初め

来年度

加える

具体

効果

運営

可能

体制

強化

現在

課題

制度

地域

向上

市町村

設置

安全

整備

施設

確保

対応

状況

必要

検討

踏まえる

審査

請願

決定

委員

報告

議案

条例

児童

家庭

子供

保護

都立

学校

教育

医療

機関

検査

病院

患者

介護

高齢

人材

行政

民間

サービス

調査

結果

管理

導入

補助

促進

障害

大会

開催

引き続き

団体

意見

被害

場合

災害

発生

公園

交通

昨年

都市

計画

継続

今回

財政

求める

知事

予算

三月

都議会

特別

負担

軽減

要望

方針

示す

雇用

企業

中小

見解

伺う

社会

経済

施策

考える

戦略

世界

関係

さまざま

実現

都民

東京

推進

新た

質問

お答え

事業

活用

支援

今後

行う

実施

図る

連携

取り組む

重要

進める

取り組み

向ける

感染

ウイルス

コロナ

新型

拡大

防止

対策

影響

受ける

生活

緊急

職員

協力

多く

情報

提供

対象

利用

センター

専門

相談

ほか

応じる

充実

一層

環境

都内

活動

初め

来年度

加える

具体

効果

運営

可能

体制

強化

現在

課題

制度

地域

向上

市町村

設置

安全

整備

施設

確保

対応

状況

必要

検討

踏まえる

審査

請願

決定

委員

報告

議案

条例

児童

家庭

子供

保護

都立

学校

教育

医療

機関

検査

病院

患者

介護

高齢

人材

行政

民間

サービス

調査

結果

管理

導入

補助

促進

障害

大会

開催

引き続き

団体

意見

被害

場合

災害

発生

公園

交通

昨年

都市

計画

継続

今回

財政

求める

知事

予算

三月

都議会

特別

負担

軽減

要望

方針

示す

雇用

企業

中小

見解

伺う

社会

経済

施策

考える

戦略

世界

人0 1 2050010001500

Figure 3: A part of hierarchical clustering tree

110 120 130 140 150

1.0

1.5

2.0

Cluster merger stage (Last 50 times)

Merged

level(

Dissimilarity)

1

5

10

1520

2530

3540

4550

※プロット内の数値ラベルは 併合後のクラスター総数

Figure 4: Hierarchical process of integrating clusters

4.3 Words for each topicWe employ a generative model that assumes a probabilistic mech-anism of how participants generate their utterances, assuming thetopics to discuss. For the purpose, we use Mallet: a Java-basedpackage for machine learning applications that can perform statis-tical natural language processing, document classification, cluster-ing, topic modeling, etc. Specifically, we estimated that each wordcontributes to constructing a certain topic using topic modelinganalysis in Mallet[4], which employs LDA (Latent Dirichlet Allo-cation).Table 2 shows the results of LDA. For the number of topics, we usedsix, which results from the discussion in the previous section. Us-ing the word list in Table 2, we extract each participant’s position,and opinion nodes strongly related to constructing a specific topicfrom the discussion structure graph explained in Section 4.1.

NTCIR 15 Conference: Proceedings of the 15th NTCIR Conference on Evaluation of Information Access Technologies, December 8-11, 2020 Tokyo Japan

169

Page 3: NTCIR-15 QA Lab-PoliInfo2 Dialog Topic Detection based on ...research.nii.ac.jp/ntcir/workshop/OnlineProceedings15/...TKLB SUBTASKS Dialog Topic Detection, LDA (Latent Dirichlet Analysis),

Table 2: Word-topic list acquired with LDA

Topic Number topic words

1 東京大会取り組み実現都民企業重要活用都市社会推進支援事業今後新たさまざま戦略スポーツ連携向上2 感染コロナ新型対策支援ウイルス事業拡大対応体制検査必要防止医療対し都民状況今後確保影響3 支援学校医療教育子供相談区市児童事業対応実施今後取り組み必要保護施設おけ障害活用地域4 知事伺い考え見解求め必要東京都民党議会日本予算小池病院要望学校でき負担声つ5 整備対策地域公園事業計画推進災害防災今後都市取り組み道路安全交通進め重要連携質問施設6 委員報告議案東京令和意見審査結果条例請願議会関する決定陳情予算議長付託良一石川財政

4.4 Topic phrase generationOur output is based on the differences discovery of differences ineach participant’s positions and opinions through the graph struc-ture analysis and the weighting of words with LDA. Specifically,we extract the most weighted position or opinion node of eachparticipant in a discussion structure graph.For example, consider Figure 5 that is an enlarged graph showingthe words used by one of the participants shown in Figure 2. Inthe figure, yellow nodes indicate the words used frequently by thisparticipant, and there are many words (The more frequently thewords indicated by a node, the larger the node size). Looking upwith Table 2, which is the result of LDA; among these many words,one can find the word ’取り組み’ as the highest weight.In this way, we focus on the word ’ 取り組み’ of the participantand extract the participant’s sentence that firstly appears with theword in the discussion. We generate topic phrases for the targetparticipant based on the dependency structure of the sentence an-alyzed with Cabocha[5]. Specifically, we remove the punctuationand the trailing morphemes in segments of dependency structureto generate its proper topic phrase. The segment connected by anarrow has a dependency from the segment to the segment. In thefigure, the segment ’制度周知の’ depends on the segment ’取り組みは’ that includes the focused word. With our manually designedscripts that reforms the last segments (removing the postpositionin this case), the topic phrase ’制度周知の取り組み’ is generatedas an output.

推進

関係

安全

ほか

取り組む

一層

取り組み

百十八番(大津ひろ子君)

社会

都民

被害

情報

技術

政策

事務

実施

活用

確保

意見

強化

発生

補助

整備向上

施設

提供

目指す踏まえる

企業

予算

Figure 5: Enlarged graph of a participant

制度周知の

取り組みは

⼗分か、

政策⽬的を

達成する

ための

⼿段として

必要かを

検討する

ことも

重要です。

本当に

Figure 6: An example of dependency structure of a sentencewith ’取り組み’

5 CONCLUSIONThis paper proposed extracting topic phrases based on the co-occurrencegraph and finding conflicts of opinions and positions among par-ticipants. In the proposed method, the graph (we refer to it asdiscussion structure graph) generated with the information whospoke each word, defines the words that characterize each par-ticipant’s opinion and position based on their adjacent relations.We also conducted a topic analysis with LDA to grasp the broadertopics in all of the given discussion. The essential sentence of eachparticipant was detected using the results of LDA. In the outputprocess, the dependency structure was used for the detected essen-tial sentences, and the opinions and positions of each participantwere generated as topic phrases by the transformation rules thatwe created manually.

REFERENCES[1] Koichi Higuchi. 2016. A two-step approach to quantitative content analysis: KH

Coder tutorial using Anne of Green Gables (Part I). Ritsumeikan Social ScienceReview 52, 3 (2016), 77–91.

NTCIR 15 Conference: Proceedings of the 15th NTCIR Conference on Evaluation of Information Access Technologies, December 8-11, 2020 Tokyo Japan

170

Page 4: NTCIR-15 QA Lab-PoliInfo2 Dialog Topic Detection based on ...research.nii.ac.jp/ntcir/workshop/OnlineProceedings15/...TKLB SUBTASKS Dialog Topic Detection, LDA (Latent Dirichlet Analysis),

[2] Yasutomo Kimura, Hideyuki Shibuki, Hokuto Ototake, Yuzu Uchida, KeiichiTakamaru, Madoka Ishioroshi, Teruko Mitamura, Masaharu Yoshioka, To-moyoshi Akiba, Yasuhiro Ogawa, Minoru Sasaki, Kenichi Yokote, Tatsunori Mori,Kenji Araki, Satoshi Sekine, and Noriko Kando. 2020. Overview of the NTCIR-15QA Lab-PoliInfo-2 Task. Proceedings of The 15th NTCIR Conference (12 2020).

[3] Yuji Matsumoto, Akira Kitauchi, Tatsuo Yamashita, Yoshitaka Hirano, HiroshiMatsuda, KazumaTakaoka, andMasayuki Asahara. 1999. Japanesemorphological

analysis system ChaSen version 2.0 manual. NAIST Techinical Report (1999).[4] Andrew Kachites McCallum. 2002. MALLET: A Machine Learning for Language

Toolkit. (2002). http://mallet.cs.umass.edu.[5] Yuji Matsumoto Taku Kudo. 2002. Japanese DependencyAnalysis using Cascaded

Chunking. (2002), 63–69.

NTCIR 15 Conference: Proceedings of the 15th NTCIR Conference on Evaluation of Information Access Technologies, December 8-11, 2020 Tokyo Japan

171