
PROCEEDINGS

2018 5th International Conference on Information Science and Control Engineering

— ICISCE 2018 —

20–22 July 2018 Zhengzhou, Henan, China

Edited by Shaozi Li, Ying Dai, and Yun Cheng

Los Alamitos, California • Washington • Tokyo


Copyright © 2018 by The Institute of Electrical and Electronics Engineers, Inc. All rights reserved.

Copyright and Reprint Permissions: Abstracting is permitted with credit to the source. Libraries may photocopy beyond the limits of US copyright law, for private use of patrons, those articles in this volume that carry a code at the bottom of the first page, provided that the per-copy fee indicated in the code is paid through the Copyright Clearance Center, 222 Rosewood Drive, Danvers, MA 01923. Other copying, reprint, or republication requests should be addressed to: IEEE Copyrights Manager, IEEE Service Center, 445 Hoes Lane, P.O. Box 1331, Piscataway, NJ 08855-1331.

The papers in this book comprise the proceedings of the meeting mentioned on the cover and title page. They reflect the authors’ opinions and, in the interests of timely dissemination, are published as presented and without change. Their inclusion in this publication does not necessarily constitute endorsement by the editors, the IEEE Computer Society, or the Institute of Electrical and Electronics Engineers, Inc.

BMS Part Number: CFP18C46-CDR
ISBN-13: 978-1-5386-5500-9

Additional copies may be ordered from:

IEEE Computer Society Customer Service Center
10662 Los Vaqueros Circle
P.O. Box 3014
Los Alamitos, CA 90720-1314
Tel: +1 800 272 6657
Fax: +1 714 821 4641
http://computer.org/cspress
[email protected]

IEEE Service Center
445 Hoes Lane
P.O. Box 1331
Piscataway, NJ 08855-1331
Tel: +1 732 981 0060
Fax: +1 732 981 9667
http://shop.ieee.org/store/
[email protected]

IEEE Computer Society Asia/Pacific Office
Watanabe Bldg., 1-4-2 Minami-Aoyama
Minato-ku, Tokyo 107-0062, JAPAN
Tel: +81 3 3408 3118
Fax: +81 3 3408 3553
[email protected]

Individual paper REPRINTS may be ordered at: <[email protected]>

Editorial production by Randall Bilof
Cover art production by Annie Jiu

IEEE Computer Society

Conference Publishing Services (CPS) http://www.computer.org/cps


2018 5th International Conference on Information Science and Control Engineering

ICISCE 2018

Table of Contents

Message from the ICISCE 2018 General Chairs xxx
Message from the ICISCE 2018 Program Chairs xxxi
ICISCE 2018 Organizing Committee xxxii
ICISCE 2018 Program Committee xxxiii
ICISCE 2018 Reviewers xxxvii

Computer and Information Science

3D Waveform Oscilloscope Implemented on Coupled FPGA-GPU Embedded System 1 Nan Hu (University of Science and Technology of China), Xuehai Zhou (University of Science and Technology of China), Xi Li (University of Science and Technology of China), and Chao Wang (University of Science and Technology of China)

A 2.5D Cancer Segmentation for MRI Images Based on U-Net 6 Ke Hu (Chengdu University), Chang Liu (Chengdu University), Xi Yu (Chengdu University), Jian Zhang (Sichuan Normal University), Yu He (Chengdu University), and Hongchao Zhu (Chengdu University)

A Cheat-Proof Secret Sharing Scheme Based on Algebraic Codes 11 Gang Du (Liaoning Institute Science and Technology School)

A Deep Fully Convolution Neural Network for Semantic Segmentation Based on Adaptive Feature Fusion 16 Anbang Liu (Shandong University, Weihai), Yiqin Yang (Shandong University, Weihai), Qingyu Sun (Shandong University, Weihai), and Qingyang Xu (Shandong University, Weihai)

A Fast Image Matching Algorithm and the Application on Steel-Label Recognition 21 Fei Deng (China Academy of Engineering Physics), Linbo Wu (China Academy of Engineering Physics), Chunlei Li (China Academy of Engineering Physics), Feng Gao (China Academy of Engineering Physics), and Yunqiang Yan (China Academy of Engineering Physics)

A Feasible Method of Virtual Flow Field Simulation - Part I: An Interface from Fluent to RTT 25 Qiang Li (Jilin University), Chengxuan Jiao (Jilin University), Changhai Yang (Jilin University), Zhao Zhang (Jilin University), and Liyan Yang (Jilin University)


Improved Semantic Similarity Method Based on HowNet for Text Clustering 266 Hongmei Nie (Zhejiang Normal University), Jiaqing Zhou (Zhejiang Normal University), Qi Guo (Zhejiang Normal University), and Zhiqi Huang (Zhejiang Normal University)

Indonesians' Song Lyrics Topic Modelling Using Latent Dirichlet Allocation 270 Enrico Laoh (Universitas Indonesia), Isti Surjandari (Universitas Indonesia), and Limisgy Ramadhina Febirautami (Universitas Indonesia)

Influence of Complex Terrains on Electric Field under Overhead Lines 275 Zhenguang Liang (Shandong University), Yuze Jiang (State Grid Shandong Electric Power Research Institute), and Jinge Hong (State Grid Technology College)

Iterative Reweighted Quantile Regression Using Augmented Lagrangian Optimization for Baseline Correction 280 Quanjie Han (Chinese Academy of Sciences), Silong Peng (Chinese Academy of Sciences), Qiong Xie (Chinese Academy of Sciences), Yifan Wu (Chinese Academy of Sciences), and Genwei Zhang (Beijing Institute of Pharmaceutical Chemistry)

Labeled Bilingual Topic Model for Cross-Lingual Text Classification and Label Recommendation 285 Ming-Jie Tian (Yanbian University), Zheng-Hao Huang (Yanbian University), and Rong-Yi Cui (Yanbian University)

Lane Keeping of Intelligent Vehicle under Crosswind Based on Visual Navigation 290 Zeng Li (Changchun University of Technology), Shaosong Li (Changchun University of Technology), Zheng Li (Changchun University of Technology), Gaojian Cui (Changchun University of Technology), and Xiaodong Wu (Changchun University of Technology)

Locust Image Segmentation by Fusion of Multiple Colors 295 Wenxia Zhang (Inner Mongolia University) and Fengshan Bai (Inner Mongolia University)

Long Term Causality Analyses of Industrial Pollutants and Meteorological Factors on PM2.5 Concentrations in Zhejiang Province 301 Bocheng Wang (Communication University of Zhejiang), Shengjuan Liu (Communication University of Zhejiang), Quanquan Du (Communication University of Zhejiang), and Yanqin Yan (Communication University of Zhejiang)

Low Bits: Binary Neural Network for VAD and Wakeup 306 Dandan Song (Tsinghua University), Shouyi Yin (Tsinghua University), Peng Ouyang (Beihang University), Leibo Liu (Tsinghua University), and Shaojun Wei (Tsinghua University)

Low Slow Small Aircraft Surveillance System Based on Computer Vision 312 Yifu Xu (Beihang University), Xiling Luo (Beihang University), and Feixiang Luo (Beihang University)

Mental Fatigue Detection Based on Multi-Inter-Domain Optical Flow Characteristics 316 Qiumin Ji (Guangdong University of Technology) and Ling Zhang (Guangdong University of Technology)

Mining Association Rules in Seasonal Transaction Data 321 Sabrina Kusuma Ayu (Universitas Indonesia), Isti Surjandari (Universitas Indonesia), and Zulkarnain Zulkarnain (Universitas Indonesia)


Labeled Bilingual Topic Model for Cross-Lingual Text Classification and Label Recommendation

Ming-Jie TIAN, Zheng-Hao HUANG, Rong-Yi CUI*
Intelligent Information Processing Lab
Yanbian University, Yanji, Jilin, China

*Corresponding author

Abstract—Multi-language information resources and multi-label data are increasingly rich in news reports and scientific literature. To mine the relevance between languages and the correlations within the data, this paper proposes a Labeled Bilingual Topic Model and applies it to cross-lingual text classification and label recommendation. First, assuming that the keywords of a scientific article are relevant to the abstract of the same article, we extract the keywords, regard them as labels, and align the labels with the topics of the topic model, thereby instantiating the “latent” topics. Second, we train the proposed topic model on the abstracts of the articles. Finally, new documents are classified by a cross-lingual text classifier, and labels are recommended for them. Experimental results show that the Micro-F1 measure reaches 94.81% on the cross-lingual text classification task, and the recommended labels reflect the semantic relevance of the documents.

Keywords-topic model; label; cross-lingual text classification; label recommendation; latent topic

I. INTRODUCTION

Without any doubt, the Web is growing rapidly, as reflected by the amount of online Web content. One challenging but very desirable task accompanying this growth is organizing information written in different languages so that it is easily accessible to all users. Recently, there has been a new trend: from news to scientific literature, a significant proportion of the world’s textual data is labeled with multiple human-provided tags. This makes it possible to relate different documents and to explore the relevance between documents and their attributes.

The diversity of languages has enriched information resources, but differences between languages inevitably hinder users from using them. Take text classification as an example: it would be very helpful to leverage the data of one language for classifying the contents of another language. This task is called cross-lingual text classification, and much research has been conducted to solve it. Methods based on a bilingual dictionary [1, 2] collect the top n words of each class in documents of one language as domain features, translate the feature words into the other language through the dictionary, and finally classify documents by comparing similarities. Methods based on machine translation [3, 4] translate the target-language texts into the source language through a machine translation system and then classify them. Mimno [5] proposed Polylingual Topic Models to model differently aligned multilingual corpora, with applications to machine translation and topic-trend tracking. Ni [6] proposed a method for mining multilingual topics from Wikipedia, mapping documents in different languages into the same vector space for cross-lingual text classification.

We combine the multi-label data in news and scientific literature with the application of topic models to cross-lingual text classification, and propose the Labeled Bilingual Topic Model (LBTM). It resolves the implicit meaning of topics in “latent” topic models, giving each topic a defined semantics and a better explanation. At the same time, it mines the relevance among documents and recommends labels for new documents from the existing topics.

II. LATENT DIRICHLET ALLOCATION

The LDA topic model is a document generation model proposed by Blei [7]. It is a three-level Bayesian probability model comprising the levels of word, topic, and document. The LDA topic model introduces Dirichlet priors [8] on the document-topic and topic-word distributions, which alleviates the over-fitting that occurs when processing large-scale corpora.

The LDA topic model is a directed probabilistic graphical model [9], as shown in Fig. 1.

Given a document dataset D containing M documents, where the m-th document contains $N_m$ words, and assuming the dataset has K topics in total, the generative process can be described by the following steps:

1. For each document m, choose the length $N_m \sim \mathrm{Poisson}(\xi)$
   1.1 Choose the distribution over topics $\vec{\theta}_m \sim \mathrm{Dir}(\vec{\alpha})$
2. For each position n (n = 1 to $N_m$) in the document:
   2.1 Choose a topic $z_n \sim \mathrm{Multinomial}(\vec{\theta}_m)$
   2.2 Choose a word $w_n$ with probability $p(w_n \mid z_n, \vec{\beta})$

Here, $\mathrm{Poisson}(\cdot)$, $\mathrm{Dir}(\cdot)$, and $\mathrm{Multinomial}(\cdot)$ denote the Poisson, Dirichlet, and multinomial distributions, respectively.
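To make the generative story concrete, the toy sketch below simulates it with NumPy; the function name and parameters are our own illustration, not part of the paper.

```python
import numpy as np

def generate_lda_corpus(M, K, V, alpha, beta, mean_len, seed=0):
    """Simulate the LDA generative process above (toy illustration)."""
    rng = np.random.default_rng(seed)
    phi = rng.dirichlet(np.full(V, beta), size=K)   # topic-word distributions
    corpus = []
    for _ in range(M):
        n_words = rng.poisson(mean_len)             # N ~ Poisson(xi)
        theta = rng.dirichlet(np.full(K, alpha))    # theta_m ~ Dir(alpha)
        doc = [rng.choice(V, p=phi[rng.choice(K, p=theta)])  # z_n, then w_n
               for _ in range(n_words)]
        corpus.append(doc)
    return corpus, phi

# e.g. 100 toy documents over a 50-word vocabulary with 5 topics:
# corpus, phi = generate_lda_corpus(M=100, K=5, V=50, alpha=0.1,
#                                   beta=0.01, mean_len=40)
```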

978-1-5386-5500-9/18/$31.00 ©2018 IEEE
DOI 10.1109/ICISCE.2018.00067


Figure 1. Probabilistic Graphical Model of LDA

The parameters must be estimated when modelling the dataset with the LDA topic model; commonly used methods are Variational Bayesian inference [10], the EM algorithm [11], and collapsed Gibbs sampling [12, 13]. Collapsed Gibbs sampling can effectively sample topics from a large-scale document dataset. Parameter estimation can be considered the inverse of document generation: the parameters are estimated given the known distribution of the documents. The conditional probability of a topic assignment given the word sequence is calculated as follows.

$$p(z_i = k \mid \vec{z}_{\neg i}, \vec{w}) \propto \frac{n_{k,\neg i}^{(t)} + \beta_t}{\sum_{v=1}^{V}\left(n_{k,\neg i}^{(v)} + \beta_v\right)} \cdot \frac{n_{m,\neg i}^{(k)} + \alpha_k}{\sum_{l=1}^{K}\left(n_{m,\neg i}^{(l)} + \alpha_l\right)}$$

where $n_{k,\neg i}^{(t)}$ is the number of times word $t$ is assigned to topic $k$, and $n_{m,\neg i}^{(k)}$ is the number of words of document $m$ assigned to topic $k$, both excluding the current position $i$.
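A minimal sketch of one collapsed-Gibbs sweep implementing this conditional for symmetric priors; the count-array names (n_kw, n_dk, n_k) are our own convention, not from the paper.

```python
import numpy as np

def gibbs_step(doc_words, z, n_kw, n_dk, n_k, alpha, beta, rng):
    """One collapsed-Gibbs sweep over a single document (monolingual LDA).

    n_kw[k, v]: corpus-wide count of word v assigned to topic k (K x V)
    n_dk[k]:    count of this document's words assigned to topic k
    n_k[k]:     total number of words assigned to topic k
    """
    K, V = n_kw.shape
    for i, w in enumerate(doc_words):
        k_old = z[i]
        # remove the current assignment (the "excluding i" counts above)
        n_kw[k_old, w] -= 1; n_dk[k_old] -= 1; n_k[k_old] -= 1
        # unnormalized conditional p(z_i = k | z_-i, w)
        p = (n_kw[:, w] + beta) / (n_k + V * beta) * (n_dk + alpha)
        k_new = rng.choice(K, p=p / p.sum())
        z[i] = k_new
        n_kw[k_new, w] += 1; n_dk[k_new] += 1; n_k[k_new] += 1
```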

III. LABELED BILINGUAL TOPIC MODEL

The LDA topic model replaces high-dimensional terms with low-dimensional “latent” topics to capture the semantic information of a document. However, each “latent” topic is implicitly defined and lacks an explanation. We make flexible use of the multi-label data in scientific literature and news, such as the keywords of an article, to improve the LDA topic model, and propose the Labeled Bilingual Topic Model. Regarding the labels of an article as its “topics” makes the topics explainable: it instantiates the “latent” topics and assigns them explicit meanings. After modelling the document dataset with the proposed model, documents are represented by meaningful topics. Each document in the dataset has a probability distribution over the topics and can be represented as a vector in the vector space, which supports cross-lingual text classification and label recommendation.

A. The Model

Assume that the document dataset consists of M documents, and the content of each document is described in two languages, L1 and L2, with the same content in each language. The model has a set of language-independent “common” topics that describe the documents in both languages; each “common” topic has two different representations, one per language. The probabilistic graphical model of the Labeled Bilingual Topic Model is shown in Fig. 2.

Figure 2. Probabilistic Graphical Model of LBTM

$\vec{\alpha}$ is the hyperparameter of the Dirichlet distribution over document-topic, and $\vec{\beta}_{L_j}$ is the hyperparameter of the Dirichlet distribution over topic-word in language $L_j$ ($j = 1, 2$). $\Lambda$ is the constraint relationship between documents and topics, which is specific to each document; $\Phi$ is the hyperparameter of the Bernoulli distribution that constrains documents and topics. $\vec{\alpha}$ and $\vec{\beta}$ are set empirically. $w_{m,n}$ denotes the n-th word of the m-th document, and $z_{m,n}$ is the topic assigned to $w_{m,n}$; the parameter $\vec{\theta}_m$ is the distribution of the m-th document over the topics, and $\vec{\varphi}_{k,L_j}$ is the distribution of the k-th topic over the words of language $L_j$. Given a document dataset D of M documents, where the language-$L_j$ part of the m-th document contains $N_{m,L_j}$ words, and assuming the dataset has K topics in total, the generative process of the Labeled Bilingual Topic Model can be described by the following steps:

1. For each “common” topic z (z = 1, 2, ..., K):
   1.1 For each language $L_j$ (j = 1, 2):
       1.1.1 Choose the topic's distribution over words $\vec{\varphi}_{z,L_j} \sim \mathrm{Dir}(\vec{\beta}_{L_j})$
2. For each document m in the dataset:
   2.1 Choose the document-topic constraint $\Lambda_{m,k} \in \{0,1\} \sim \mathrm{Bernoulli}(\Phi_k)$
   2.2 Choose the distribution over “common” topics $\vec{\theta}_m \sim \mathrm{Dir}(\vec{\alpha} \mid \Lambda_m)$
3. For each position n (n = 1 to $N_{m,L_j}$) in the document:
   3.1 Choose a topic $z_{m,n} \sim \mathrm{Multinomial}(\vec{\theta}_m)$
   3.2 Choose a word $w_{m,n,L_j} \sim \mathrm{Multinomial}(\vec{\varphi}_{z_{m,n},L_j})$
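Step 2.2 is where LBTM departs from plain LDA: $\vec{\theta}_m$ places probability mass only on the topics (labels) attached to document m. A minimal sketch of one way to realize this constraint (the function name and signature are ours, not from the paper):

```python
import numpy as np

def draw_doc_topics(labels_m, K, alpha, rng):
    """Sample theta_m over K 'common' topics with mass restricted to the
    document's label set (our reading of step 2.2 above)."""
    theta = np.zeros(K)
    theta[labels_m] = rng.dirichlet(np.full(len(labels_m), alpha))
    return theta

# e.g. a document labeled with topics 3 and 7 out of K = 10 topics:
# draw_doc_topics([3, 7], K=10, alpha=0.5, rng=np.random.default_rng(0))
```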

B. Estimation

In the parameter estimation phase, the Gibbs sampling method must be modified to fit the bilingual and labeled features. The conditional probability of a topic assignment given the word sequence is extended from monolingual to bilingual, and can be calculated as follows.

$$p(z_{i,L_j} = k \mid \vec{z}_{\neg i,L_j}, \vec{w}_{L_j}) \propto \frac{n_{k,\neg i,L_j}^{(t)} + \beta_{t,L_j}}{\sum_{v=1}^{V_{L_j}}\left(n_{k,L_j}^{(v)} + \beta_{v,L_j}\right) - 1} \cdot \frac{n_{m,\neg i,L_j}^{(k)} + \alpha_k}{\sum_{j=1}^{2}\sum_{l=1}^{K}\left(n_{m,L_j}^{(l)} + \alpha_l\right) - 1}$$
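A sketch of the corresponding constrained Gibbs sweep, under our reading that the word-side counts are kept per language while the document-side counts pool both languages; all names and the signature are ours.

```python
import numpy as np

def lbtm_gibbs_step(doc, z, cand, n_kw, n_dk, n_k, alpha, beta, rng):
    """One sweep over one document's words in one language Lj (sketch).

    cand:  np.array of topic indices labeled on this document (Lambda_m);
           sampling is restricted to this set.
    n_kw:  K x V_Lj word-topic counts for language Lj
    n_k:   per-topic word totals for language Lj
    n_dk:  this document's per-topic counts, pooled over both languages,
           matching the document-side factor of the conditional above.
    """
    V = n_kw.shape[1]
    for i, w in enumerate(doc):
        k_old = z[i]
        n_kw[k_old, w] -= 1; n_dk[k_old] -= 1; n_k[k_old] -= 1
        # conditional restricted to the document's labeled topics
        p = ((n_kw[cand, w] + beta) / (n_k[cand] + V * beta)
             * (n_dk[cand] + alpha))
        k_new = cand[rng.choice(len(cand), p=p / p.sum())]
        z[i] = k_new
        n_kw[k_new, w] += 1; n_dk[k_new] += 1; n_k[k_new] += 1
```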


where $n_{k,\neg i,L_j}^{(t)}$ is the number of times word t of language $L_j$ is assigned to topic k, excluding the current assignment of t; $\sum_{v=1}^{V_{L_j}} n_{k,L_j}^{(v)} - 1$ is the total number of words of language $L_j$ assigned to topic k, excluding the current assignment; $V_{L_j}$ is the vocabulary of language $L_j$; $n_{m,\neg i,L_j}^{(k)}$ is the number of words of language $L_j$ in document m assigned to topic k, excluding the current assignment; and $\sum_{j=1}^{2}\sum_{l=1}^{K} n_{m,L_j}^{(l)} - 1$ is the total number of words of all languages in document m, excluding the current word t.

Finally, we obtain the document's distribution over the topics, $\vec{\theta}$, as follows.

$$\theta_{m,k,L_j} = \frac{n_{m,L_j}^{(k)} + \alpha_k}{\sum_{l=1}^{K}\left(n_{m,L_j}^{(l)} + \alpha_l\right)}$$

C. New Document Inference

For a new document, the distribution over the topics can be predicted from the model parameters that have already been trained, mapping the new document into the vector space whose dimensions are the topics. The probability of the current word in document d being assigned to topic k is as follows.

$$p(z_{i,L_j}^{d} = k \mid \vec{z}_{\neg i,L_j}^{d}, \vec{w}_{L_j}^{d}; \vec{z}_{L_j}, \vec{w}_{L_j}) \propto \frac{n_{k,L_j}^{(t)} + n_{k,\neg i,L_j}^{d,(t)} + \beta_{t,L_j}}{\sum_{v=1}^{V_{L_j}}\left(n_{k,L_j}^{(v)} + n_{k,L_j}^{d,(v)} + \beta_{v,L_j}\right) - 1} \cdot \frac{n_{d,\neg i}^{(k)} + \alpha_k}{\sum_{l=1}^{K}\left(n_{d}^{(l)} + \alpha_l\right) - 1}$$

Finally, the new document's distribution over the topics, $\vec{\theta}^{\,d} = \{\theta_1^{d}, \theta_2^{d}, \ldots, \theta_K^{d}\}$, can be calculated, where each component is represented as follows.

$$\theta_k^{d} = \frac{n_d^{(k)} + \alpha_k}{\sum_{l=1}^{K}\left(n_d^{(l)} + \alpha_l\right)}$$
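A sketch of fold-in inference implementing this conditional: the trained counts stay fixed while the new document contributes its own counts on top. For readability the sketch handles a single language, and the restriction of each word's sampling range to the topics it received in training follows point 5 of Sec. III-D below. All names are ours.

```python
import numpy as np

def infer_new_doc(doc, topics_seen, n_kw, n_k, K, alpha, beta, iters, rng):
    """Fold-in inference for a new document d (sketch, single language).

    n_kw (K x V) and n_k (K) are trained counts and stay fixed; the new
    document adds its own counts on top. topics_seen[w] is an np.array of
    the topics word w was assigned to during training, i.e. the sampling
    range for w.
    """
    V = n_kw.shape[1]
    z = np.array([rng.choice(topics_seen[w]) for w in doc])
    nd_k = np.bincount(z, minlength=K).astype(float)  # doc-topic counts
    nd_kw = np.zeros((K, V))                          # doc word-topic counts
    for i, w in enumerate(doc):
        nd_kw[z[i], w] += 1
    nd_kt = nd_kw.sum(axis=1)  # doc per-topic totals (would be per-language,
                               # while nd_k pools languages, in the full model)
    for _ in range(iters):
        for i, w in enumerate(doc):
            k = z[i]
            nd_kw[k, w] -= 1; nd_kt[k] -= 1; nd_k[k] -= 1
            cand = topics_seen[w]
            p = ((n_kw[cand, w] + nd_kw[cand, w] + beta)
                 / (n_k[cand] + nd_kt[cand] + V * beta)
                 * (nd_k[cand] + alpha))
            k = cand[rng.choice(len(cand), p=p / p.sum())]
            z[i] = k
            nd_kw[k, w] += 1; nd_kt[k] += 1; nd_k[k] += 1
    return (nd_k + alpha) / (nd_k + alpha).sum()      # theta^d
```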

D. Differences from the “Latent” Topic Model

Compared to the multilingual models [5, 6] derived from the “latent” topic model [7], the Labeled Bilingual Topic Model takes advantage of the multi-label data in the documents and instantiates the “latent” topics, so that the meaning of a topic is no longer implicit but explicit. The differences from methods derived from the “latent” topic model are as follows:

1. Topic size K: Determining the number of topics is one of the difficulties of the LDA topic model; the value of K must be chosen from experimental results. In LBTM the number of topics is determined: it is the number of unique labels in the document dataset.

2. Vector representation of the document: In a “latent” topic model, the topic sampling range for each word of each document is 1 to K, and documents are represented by the word-topic assignments in the document; therefore every topic component in the document's vector representation may be non-zero. In LBTM, each document is constrained to its fixed labels; only the topic components constrained to the document can be non-zero, and the rest must be zero.

3. Range of topic sampling: The conditional probability between each word of each document and the candidate topics must be calculated in each iteration. Since there is no constraint between documents and topics in a “latent” topic model, the sampling range for each word is all K topics. In LBTM, there is a fixed constraint between each document and its topics, so the sampling range for a word is the set of topics (labels) constrained to its document.

4. Sampling computational complexity: Because of the sampling range, a “latent” topic model must compute the conditional probability between each word and all topics in every iteration. In LBTM, it is only necessary to compute the conditional probabilities between each word and the topics that have a fixed constraint with its document. The proposed model therefore has an advantage in computational efficiency during sampling.

5. Inferring the topic distribution of a new document: For a new document, the topic distribution is inferred from the trained model parameters. In a “latent” topic model, the sampling range for each word of the new document is 1 to K. In LBTM, the sampling range is the set of topics that the current word was assigned to during the training phase.

IV. THE EXPERIMENT

A. Cross-Lingual Text Classification

Several previous studies, such as methods based on a bilingual dictionary (DICT) [1, 2], machine translation (MT) [3, 4], and the “latent” topic model [6], have focused on the cross-lingual text classification task. Methods that depend on an external knowledge base, such as a bilingual dictionary or a machine translation system, are expensive to scale up and hard to adapt to arbitrary languages. Their principle is to convert the cross-lingual task into a monolingual one with the help of the external knowledge base, at the risk of losing translation accuracy and semantics. In methods based on a latent topic model, each “latent” topic is implicitly defined and lacks an explanation. A comparison of these methods with LBTM is shown in Table 1.

To verify the validity and feasibility of LBTM, we carried out a cross-lingual text classification experiment. We choose Naïve Bayes as the default classifier. We train the classifier on the bilingual training dataset and test on each language of the bilingual test dataset separately. The bilingual dataset is sentence-aligned, so the content of the bilingual documents is semantically aligned. We compare LBTM with the multilingual LDA model [6] based on the “latent” topic model on the cross-lingual text classification task.
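A hypothetical sketch of one direction of this pipeline (KOR-CHN) using scikit-learn, assuming the LBTM document-topic vectors have been precomputed and saved; the file names are placeholders of ours.

```python
# Train Naive Bayes on Korean theta vectors, evaluate on the Chinese side.
import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import f1_score

theta_kor_train = np.load("theta_kor_train.npy")  # docs x K, from LBTM
y_train = np.load("y_train.npy")
theta_chn_test = np.load("theta_chn_test.npy")
y_test = np.load("y_test.npy")

# theta vectors are nonnegative, so MultinomialNB accepts them directly
clf = MultinomialNB().fit(theta_kor_train, y_train)
pred = clf.predict(theta_chn_test)
print("Micro-F1:", f1_score(y_test, pred, average="micro"))
print("Macro-F1:", f1_score(y_test, pred, average="macro"))
```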


TABLE I. COMPARISON OF SEVERAL METHODS

CLTC Method | Required Resources | Reserving Semantics | Languages | Difficulty
DICT | Bilingual dictionary | No | Relies on dictionary support | Ambiguity and polysemy of translated words
MT | Machine translation system | Yes, but relies on the MT system | Relies on MT system support | Depends on the accuracy of the MT system
Latent topic method | Bilingual corpus | Yes | Any languages with a bilingual corpus | Implicit meaning of topics
LBTM | Bilingual corpus | Yes | Any languages with a bilingual corpus | No

1) Dataset

The bilingual corpus used in the experiment is a parallel corpus of Chinese-Korean scientific literature supported by the Science and Technology Bureau of the Korean Autonomous Prefecture of Yanbian. The dataset contains the keywords and abstracts of 9000 sentence-aligned Chinese-Korean parallel articles, consisting of 6000 Ecology-class and 3000 Aerospace-class articles. The class of an article is defined by the category of the journal containing it, and the ratio of training to test set is 9:1.

2) Evaluation Metrics

To evaluate average performance across multiple categories, two conventional averaging methods, Macro-F1 and Micro-F1 [14], are used in the experiment. Macro-F1 computes statistics for each class and then takes their arithmetic mean, so it is easily affected by classes with few instances. Micro-F1 counts every instance of the test set, builds a global confusion matrix, and then calculates the overall indicator, so it is easily affected by the large classes in the dataset [15]. Macro-F1 and Micro-F1 are widely used in text classification, and we adopt both metrics to verify the classification performance of LBTM.
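For concreteness, the two measures can be computed from scratch as below; this sketch is ours, with per-class statistics averaged for Macro-F1 and pooled counts for Micro-F1.

```python
import numpy as np

def macro_micro_f1(y_true, y_pred, classes):
    """Macro-F1: mean of per-class F1. Micro-F1: F1 over pooled counts."""
    tp = fp = fn = 0
    per_class_f1 = []
    for c in classes:
        tp_c = int(np.sum((y_pred == c) & (y_true == c)))
        fp_c = int(np.sum((y_pred == c) & (y_true != c)))
        fn_c = int(np.sum((y_pred != c) & (y_true == c)))
        p = tp_c / (tp_c + fp_c) if tp_c + fp_c else 0.0
        r = tp_c / (tp_c + fn_c) if tp_c + fn_c else 0.0
        per_class_f1.append(2 * p * r / (p + r) if p + r else 0.0)
        tp, fp, fn = tp + tp_c, fp + fp_c, fn + fn_c
    macro = float(np.mean(per_class_f1))
    micro = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    return macro, micro
```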

3) Parameter Settings

The number of topics K, the Dirichlet hyperparameters α and β, and the numbers of training and test iterations must be determined in advance. The parameters of the Labeled Bilingual Topic Model and of the bilingual model based on the “latent” topic model are shown in Table 2.

TABLE II. COMPARATIVE EXPERIMENT PARAMETER SETTINGS

Parameter | LBTM | Method in literature [6]
K | 18170 | 400
α | 50/K | 50/K
β | 0.01 | 0.01
Training iterations | 1000 | 1000
Test iterations | 100 | 100

TABLE III. RESULTS OF THE EXPERIMENT

Method | Micro-F1 (KOR-CHN) | Micro-F1 (CHN-KOR) | Macro-F1 (KOR-CHN) | Macro-F1 (CHN-KOR) | Time Spent
LBTM | 94.79% | 94.81% | 92.31% | 92.41% | 8 hours
Method in [6] | 92.32% | 92.76% | 94.04% | 94.37% | 48 hours

4) Results and Analysis

The cross-lingual text classification task includes classifying Chinese documents with a classifier trained on Korean, and classifying Korean documents with a classifier trained on Chinese. The classification results and time consumed are shown in Table 3.

KOR-CHN denotes classifying the Chinese documents with the classifier trained on Korean documents; CHN-KOR denotes classifying the Korean documents with the classifier trained on Chinese documents.

LBTM achieves a 94.81% Micro-F1 measure and a 92.41% Macro-F1 measure in cross-lingual text classification, indicating that LBTM can be applied to practical classification tasks. Compared with the method in literature [6], LBTM has a higher Micro-F1 measure but a lower Macro-F1 measure. On the 6000 Ecology articles, the accuracy of LBTM is higher than the comparative experiment; on the 3000 Aerospace articles, it is lower. This confirms that the Micro-F1 measure is easily affected by the large classes in the dataset and the Macro-F1 measure by the small classes; both metrics should be considered together in model evaluation.

Our method took 8 hours to train and test the model, while the method in literature [6] took 48 hours in total, a time consumption ratio of 1:6. In the training phase, the range of topic sampling in LBTM is given, generally the number of keywords (3 to 6) per article; in the comparative experiment there is no constraint between documents and topics, so the conditional probabilities with all topics must be calculated throughout, which increases the computational complexity.

In summary, LBTM saves substantial time while maintaining high classification accuracy.


TABLE IV. RESULTS OF LITERATURE LABEL RECOMMENDATIONS

[Columns: literature title, author keywords, and recommended labels; the Chinese-language entries are not recoverable from this copy.]

B. Label Recommendation

The distribution of a document over the topics is obtained after inference. The value of each component of the vector representing the document is the proportion of the document's words assigned to that topic: the greater the component value, the more relevant the document is to the topic. We use the document dataset from the cross-lingual text classification experiment. In detail, we infer each test document, represent it as a distribution over the topics, extract the topics with the top-2 component values as the recommended labels, and compare them with the original labels. The results of label recommendation are shown in Table 4. We find that both the author keywords and the recommended labels contain “cotton” in the 1st article and “compressive sensing” in the 5th article, and there are broad semantic associations between the keywords and the recommended labels in the other articles.
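A minimal sketch of this top-2 selection (the function name and signature are ours):

```python
import numpy as np

def recommend_labels(theta_d, label_names, top_n=2):
    """Recommend the top-n topics (labels) by component value in the
    inferred document-topic distribution theta_d."""
    idx = np.argsort(theta_d)[::-1][:top_n]
    return [label_names[k] for k in idx]
```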

The keywords of an article are added by different authors, whose views on a subject inevitably differ, so the recommended labels cannot be exactly the same as the author-provided keywords. Therefore, our model is suited to auxiliary label recommendation.

V. CONCLUSION

In this paper, we combine the multi-label data of scientific literature with the application of topic models to cross-lingual text classification and propose the Labeled Bilingual Topic Model (LBTM). Compared with the “latent” topic model, we instantiate the “latent” topics so that the meaning of a topic is no longer implicit but explicit. In the phases of training the parameters and inferring the distributions of documents over topics, the given topic sampling range makes LBTM more efficient than the multilingual LDA model based on the “latent” topic model. The results on the cross-lingual text classification task indicate that LBTM can be applied in practice, and thanks to the explicit meaning of each topic it can be used for auxiliary label recommendation. Limited by the corpus, it can only be applied to bilingual topic modelling at this stage; given a multi-label parallel corpus in more than two languages, we will extend our model to the multilingual setting.

ACKNOWLEDGMENT

This research was financially supported by the State Language Commission of China. We would like to thank the editor and the referees for their careful reading and valuable comments.

REFERENCES

[1] N. Bel, C.H.A. Koster and M. Villegas, “Cross-Lingual Text Categorization,” Lecture Notes in Computer Science, 2003, pp. 126-139.

[2] J.S. Olsson and D.W. Oard, “Cross-language text classification,” Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 05), 2005, pp. 645-646.

[3] L. Rigutini, M. Maggini and B. Liu, “An EM based training algorithm for cross-language text categorization,” Proceedings of the 2005 IEEE/WIC/ACM International Conference on Web Intelligence (WI 05), 2005, pp. 529-535.

[4] B. Wei and C. Pal, “Cross lingual adaptation: An experiment on sentiment classifications,” Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics (ACL 2010), 2010, pp. 258-262.

[5] D. Mimno, H.M. Wallach, J. Naradowsky, D.A. Smith and A. McCallum, “Polylingual Topic Models,” Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, 2009, pp. 880-889.

[6] X.C. Ni, J.T. Sun, J. Hu and Z. Chen, “Mining multilingual topics from Wikipedia,” Proceedings of the 18th International World Wide Web Conference, 2009, pp. 1155-1156.

[7] D.M. Blei, A.Y. Ng and M.I. Jordan, “Latent dirichlet allocation,” Journal of Machine Learning Research, 2003, pp. 993-1022.

[8] Q. Xu, J.S. Zhou and J.J. Chen, “Dirichlet Process and Its Applications in Natural Language Processing,” Journal of Chinese Information Processing, 2009, pp. 25-33.

[9] G. Xu and H.F. Wang, “The development of topic models in natural language processing,” Chinese Journal of Computers, 2011, pp. 1423-1436.

[10] A. Fang, C. Macdonald, I. Ounis, P. Habel and X. Yang, “Exploring Time-Sensitive Variational Bayesian Inference LDA for Social Media Data,” Proceedings of the 39th European Conference on Information Retrieval, 2017, pp. 252-265.

[11] A.P. Wang, G.Y. Zhang, and F. Liu, “Research and Application of EM Algorithm,” Computer Technology and Development, 2009, pp. 108-110.

[12] G. Heinrich, “Parameter Estimation for Text Analysis,” University of Leipzig, 2008.

[13] H.Z. Yerebakan and M. Dundar, “Partially collapsed parallel Gibbs sampler for Dirichlet process mixture models,” Pattern Recognition Letters, 2017, pp. 22-27.

[14] Y. Yang, “An Evaluation of Statistical Approaches to Text Categorization,” Information Retrieval, 1999, pp. 69-90.

[15] Q.R. Zhang, L. Zhang, S.B. Dong, and J.H. Tan, “Effects of category distribution in a training set on text categorization,” Journal of Tsinghua University (Science and Technology), 2005, pp. 1802-1805.


Accession number: 20190906564164

Title: Labeled Bilingual Topic Model for Cross-Lingual Text Classification and Label Recommendation

Authors: Tian, Ming-Jie1 ; Huang, Zheng-Hao1 ; Cui, Rong-Yi1

Author affiliation: 1 Intelligent Information Processing Lab, Yanbian University, Yanji, Jilin, China

Source title: Proceedings - 2018 5th International Conference on Information Science and Control Engineering, ICISCE 2018

Abbreviated source title: Proc. - Int. Conf. Inf. Sci. Control Eng., ICISCE

Part number: 1 of 1

Issue title: Proceedings - 2018 5th International Conference on Information Science and Control Engineering, ICISCE 2018

Issue date: January 14, 2019

Publication year: 2019

Pages: 285-289

Article number: 8612565

Language: English

ISBN-13: 9781538655009

Document type: Conference article (CA)

Conference name: 5th International Conference on Information Science and Control Engineering, ICISCE 2018

Conference date: July 20, 2018 - July 22, 2018

Conference location: Zhengzhou, Henan, China

Conference code: 144426

Sponsor: et al.; Henan University of Science and Technology; Hunan University; Minjiang University; Swinburne University of Technology; Wayne State University

Publisher: Institute of Electrical and Electronics Engineers Inc.

Abstract: Labeled Bilingual Topic Model for Cross-Lingual Text Classification and Label Recommendation. © 2018 IEEE.

Number of references: 15

Main heading: Text processing

Controlled terms: Classification (of information) - Labeling


Uncontrolled terms: Cross-lingual - latent topic - Text classification - Topic Modeling

Classification code: 694.1 Packaging, General - 716.1 Information Theory and Signal Processing - 903.1 Information Sources and Analysis

DOI: 10.1109/ICISCE.2018.00067

Database: Compendex

Compilation and indexing terms, © 2019 Elsevier Inc.

Copyright © 2019 Elsevier B.V. All rights reserved.

