Research Article
Unsupervised Chunking Based on Graph Propagation from Bilingual Corpus

Ling Zhu, Derek F. Wong, and Lidia S. Chao

Natural Language Processing & Portuguese-Chinese Machine Translation Laboratory, Department of Computer and Information Science, University of Macau, Macau

Correspondence should be addressed to Ling Zhu; mb05448@umac.mo

Received 30 August 2013; Accepted 8 December 2013; Published 19 March 2014

Academic Editors: J. Shu and F. Yu

Copyright © 2014 Ling Zhu et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

This paper presents a novel approach for training an unsupervised shallow parsing model on the unannotated Chinese side of a parallel Chinese-English corpus. In this approach, no annotated Chinese data is used. The exploitation of graph-based label propagation for bilingual knowledge transfer, along with the use of the projected labels as features in an unsupervised model, contributes to better performance. Experimental comparisons with state-of-the-art algorithms show that the proposed approach achieves a considerably higher accuracy in terms of F-score.

1. Introduction

Shallow parsing, also called text chunking, plays a significant role in the natural language processing (NLP) community. It can be regarded as a classification task: a classifier is trained to segment a sentence and label each partitioned phrase with the correct chunk tag. The classifier takes the words, their POS tags, and the surrounding context as the features of an instance, whereas the chunk tag is the class label. The goal of shallow parsing is to divide a sentence into labeled, nonoverlapping, and nonrecursive chunks, and different methodologies have been proposed to this end.

Plenty of classification algorithms have been applied in the field of shallow parsing. The models are broadly categorized into three types: rule-based chunking models, machine learning-based models, and memory-based models. In particular, recent decades have witnessed remarkable advances in the state of the art on the chunking task through supervised learning approaches. A supervised chunking model, for example, the maximum entropy model (MEM) employed by [1], solves the ambiguous tagging problem by training on a corpus to compute the probability of each word in the input context, and achieves good performance.

Although supervised learning algorithms have produced state-of-the-art, high-accuracy systems for a variety of tasks in the NLP domain, performance on resource-poor languages is still unsatisfactory. A fundamental obstacle to statistical shallow parsing for the majority of the world's languages is the shortage of annotated training data. Furthermore, careful hand annotation has proved to be expensive and time-consuming. For example, over one million dollars was invested in the development of the Penn Treebank [2], and the lack of developed treebanks and tagged corpora for most languages illustrates how difficult it is to raise the investment for annotation projects. However, unannotated parallel text is broadly available because of the explosive growth of multilingual websites, news streams, and human translations of books and documents. This suggests that unsupervised methods are a useful option for inducing chunking taggers, as they do not need any annotated text for training. Unfortunately, the existing unsupervised chunking systems do not perform well enough, which makes their practical usability questionable at best.

To bridge the accuracy gap between resource-rich languages and resource-poor languages, we would like to leverage the existing resources of a resource-rich language such as English when doing shallow parsing for a resource-poor language such as Chinese. In this work, the system assumes that there is no labeled Chinese data available for training but



that we have access to parallel English data. This part of the work is closest to [3], but there are two main differences, which improve on its weakness in dealing with unaligned words.

First, a novel graph-based framework is applied to project syntactic information across language boundaries, for the following reasons:

(i) Graphs can represent complicated relationships between classes and instances. For instance, an ambiguous instance such as "interest" could belong to both the NP class and the VP class.

(ii) Chunks from multiple sources, such as different treebanks, websites, and texts, can be represented in a single graph.

(iii) The information propagation paths of a graph make it possible to make explicit the potential common information among different instances. That is to say, the algorithm relies on the assumption that neighboring cases should belong to the same class, and the relationships between data points are captured in the form of a similarity graph, with vertices corresponding to the cases and edge weights to their similarity.

In the end, a bilingual graph is constructed to represent the relationships between English and Chinese.

Second, rather than directly using these projected labels as features in a supervised training phase, we prefer to employ them in an unsupervised model. Additionally, to facilitate bilingual unsupervised chunking research and standardize best practice, a tag set consisting of six universal chunk categories is proposed in this work.

The paper is organized as follows. Section 2 introduces the overall approach of our work. Section 3 focuses on how to establish the graph and carry out the label propagation. Section 4 describes the unsupervised chunk induction and feature selection in detail. In Section 5, the evaluation results are presented, followed by a conclusion that ends this paper.

2. Approach Overview

The central aim of our work is to build a chunk tagger for Chinese sentences, assuming that there is an English chunk tagger and some parallel corpora between the two languages. Hence, the emphasis of our approach is on how to build a bilingual graph from the sentence-aligned Chinese-English corpus. Two types of vertices are employed in our graph: on the Chinese side, trigram types serve as the vertices, while on the English side the vertices are individual word types. Graph construction needs no labeled data, but it does require two similarity functions. A co-occurrence-based similarity function is applied to compute the edge weights between Chinese trigrams; this function is designed to indicate the syntactic similarity of the middle words of neighboring trigrams. A second similarity function, which leverages standard unsupervised word alignment statistics, is employed to establish a soft correspondence between Chinese and English.

Figure 1: Direct label projection from English to Chinese with position information.

Figure 2: Adjusting the chunk tag based on position information.

Since we do not have labeled Chinese data, we aim to project the syntactic information from the English side to the Chinese side. Before initializing the graph, a supervised chunking model is used to tag the English side of the parallel text. The label distributions for the English vertices are generated by aggregating the chunk labels of the English tokens to types. Considering the crossover problem and the different word orders of Chinese and English, position information is also taken into account in our work. Based on this idea, the chunk tags are projected directly from the English side to Chinese along with their positions, as shown in Figure 1. Based on the position information, we can get the exact boundary of each chunk on the Chinese side. Then an adjustment is made to assign the correct label, as shown in Figure 2. After graph initialization, label propagation is applied to transfer the chunk tags first to the peripheral Chinese vertices (i.e., the ones which are adjacent to the English vertices), followed by further propagation among all the Chinese vertices. It is worth mentioning that the chunk distributions over the Chinese trigram types are then used as features for learning a better unsupervised chunking tagger. The following sections explain these steps in detail (see Algorithm 1).

3. Graph Construction

In graph learning approaches, one constructs a graph whose vertices are labeled and unlabeled examples and whose weighted edges encode the degree to which the examples they link have the same label [4].


Algorithm 1: Graph-based unsupervised chunking approach.

Require: parallel English and foreign language data $D^e$ and $D^c$; unlabeled foreign training data $\Gamma^c$; English tagger.
Ensure: $\theta^c$, a set of parameters learned using a constrained unsupervised model.
(1) $D^{e \leftrightarrow c} \leftarrow$ word-align-bitext($D^e$, $D^c$)
(2) $D^e \leftarrow$ chunk-tag-supervised($D^e$)
(3) $A \leftarrow$ ($D^{e \leftrightarrow c}$, $D^e$)
(4) $G \leftarrow$ construct-graph($\Gamma^c$, $D^c$, $A$)
(5) $G \leftarrow$ graph-propagate($G$)
(6) $\Delta \leftarrow$ extract-word-constraints($G$)
(7) $\theta^c \leftarrow$ chunk-induce-constraints($\Gamma^c$, $\Delta$)
(8) return $\theta^c$
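As a rough illustration, the following Python sketch mirrors Algorithm 1. All six component functions are hypothetical placeholders for the steps described in Sections 3 and 4; they are passed in as callables so the skeleton itself stays self-contained.

```python
def bilingual_chunking_induction(d_e, d_c, gamma_c,
                                 word_align_bitext, chunk_tag_supervised,
                                 construct_graph, graph_propagate,
                                 extract_word_constraints, chunk_induce_constraints):
    """Mirror of Algorithm 1; the six callables stand in for the components
    described in Sections 3 and 4 (none of them is a released implementation)."""
    alignments = word_align_bitext(d_e, d_c)        # (1) e.g., GIZA++ output
    tagged_e = chunk_tag_supervised(d_e)            # (2) tag the English side
    graph = construct_graph(gamma_c, d_c, (alignments, tagged_e))  # (3)-(4)
    graph = graph_propagate(graph)                  # (5) two-stage label propagation
    constraints = extract_word_constraints(graph)   # (6) tag dictionary t_x
    return chunk_induce_constraints(gamma_c, constraints)  # (7)-(8) feature-HMM
```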

Notice that graph construction for structured prediction problems such as shallow parsing is nontrivial for the following two reasons. First, employing individual words as the vertices, without their context, makes it difficult to resolve ambiguity. Besides, if we used whole sentences as the vertices during graph construction, it would be unclear how to calculate the sequence similarity. Altun et al. [5] proposed a graph-based semisupervised learning approach that uses the similarity between labeled and unlabeled structured data in a discriminative framework. Recently, a graph over the cliques in an underlying structured model was defined by [6]. In their work, a new idea was proposed: one can use a graph over trigram types, combined with weights based on distributional similarity, to improve supervised learning algorithms.

3.1. Graph Vertices. In this work, we extend the intuitions of [6] to set up the bilingual graph. Two different types of vertices are applied for the two languages because the information transfer in our graph is asymmetric (from English to Chinese). On the Chinese side, the vertices (denoted by $D^c$) correspond to trigram types, exactly as in [6]. The English vertices (denoted by $D^e$), however, correspond to word types. The reason we do not need trigram types on the English side for disambiguation is that each English word is going to be labeled. Additionally, during label propagation and graph construction, we do not connect English words to one another, but only to the Chinese vertices.

Furthermore, the vertices are of two kinds: those extracted from the two sides of the word-aligned bitext $(D^e, D^c)$ and those from an additional unlabeled Chinese monolingual corpus $\Gamma^c$, which will be used later in the training phase. As noted, two different similarity functions will be employed to define the edge weights between vertices of the two languages and among the Chinese vertices.

3.2. Monolingual Similarity Function. This section briefly introduces how the monolingual similarity between connected pairs of Chinese trigram types is computed. Specifically, the set $V_c$ consists of all the word $n$-grams that occur in the text. Given a symmetric similarity function between types, defined below, we link types $m_i$ and $m_j$ ($m_i, m_j \in V_c$) with an edge weight $w_{m_i m_j}$ as follows:

$$w_{m_i m_j} = \begin{cases} \mathrm{sim}(m_i, m_j) & \text{if } m_i \in \kappa(m_j) \text{ or } m_j \in \kappa(m_i) \\ 0 & \text{otherwise} \end{cases} \quad (1)$$

where $\kappa(m_j)$ is the set of $k$-nearest neighbors of $m_j$ according to the given similarity. Throughout this paper we set $k = 5$.

To define the similarity function for each trigram type $m_i \in V_c$, we rely on the co-occurrence statistics of the ten features illustrated in Table 1.

Table 1: Various features for computing edge weights between Chinese trigram types.

Description                  Feature
Trigram + context            $x_1 x_2 x_3 x_4 x_5$
Trigram                      $x_2 x_3 x_4$
Left context                 $x_1 x_2$
Right context                $x_4 x_5$
Center word                  $x_3$
Trigram minus center word    $x_2 x_4$
Left word + right context    $x_2 x_4 x_5$
Right word + left context    $x_1 x_2 x_4$
Suffix                       has_suffix($x_3$)
Prefix                       has_prefix($x_3$)

Based on this monolingual similarity function, a nearest-neighbor graph can be constructed, in which the edge weight for the $n$ most similar trigram types is set to the PMI value and is 0 for all others. Finally, we use $\kappa(m)$ to denote the neighborhood of vertex $m$ and set the maximum neighborhood size to 5 in our experiments.
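To make this step concrete, the sketch below builds such a nearest-neighbor graph over trigram types. It is a minimal illustration under assumptions: each trigram type is represented by a sparse vector of PMI scores for the Table 1 features, and cosine similarity over those vectors stands in for the similarity function (the paper does not spell out this particular combination).

```python
import heapq
from collections import defaultdict

def build_knn_graph(pmi_vectors, k=5):
    """pmi_vectors: {trigram_type: {feature: PMI score}} for the Table 1 features.
    Returns symmetric edge weights linking each type to its k nearest neighbors.
    Note: the all-pairs loop is quadratic; a real system would prune candidates."""
    def cosine(u, v):
        # similarity of two sparse PMI feature vectors
        dot = sum(u[f] * v[f] for f in set(u) & set(v))
        norm = lambda w: sum(x * x for x in w.values()) ** 0.5
        return dot / ((norm(u) * norm(v)) or 1.0)

    edges = defaultdict(float)
    types = list(pmi_vectors)
    for m_i in types:
        sims = [(cosine(pmi_vectors[m_i], pmi_vectors[m_j]), m_j)
                for m_j in types if m_j != m_i]
        for sim, m_j in heapq.nlargest(k, sims):   # keep the k nearest neighbors
            if sim > 0:
                edges[(m_i, m_j)] = edges[(m_j, m_i)] = sim
    return edges
```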

3.3. Bilingual Similarity Function. We rely on high-confidence word alignments to define a similarity function between the English and Chinese vertices. Since the graph vertices are extracted from a parallel corpus, a standard word alignment technique, GIZA++ [7], is applied to align the English sentences $D^e$ and their Chinese translations $D^c$. Because the label propagation process in the graph is relied upon to provide coverage and high recall, only high-confidence alignments ($D^{e \leftrightarrow c} > 0.9$) are considered.

We can therefore extract tuples of the form $[t \leftrightarrow u]$ from these high-confidence word alignments, where $t$ is a Chinese trigram type whose middle word aligns to an English unigram type $u$. By computing the proportions of these tuple counts, we set the edge weights between the two languages with our bilingual similarity function.
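A minimal sketch of this step, assuming per-word-pair alignment posteriors are available from GIZA++ and that the bilingual edge weight is simply the normalized tuple count:

```python
from collections import Counter, defaultdict

def bilingual_edge_weights(aligned_sentences, threshold=0.9):
    """aligned_sentences: iterable of (zh_words, en_words, alignments), where
    alignments holds (zh_index, en_index, posterior) triples from the aligner."""
    counts = Counter()
    for zh, en, align in aligned_sentences:
        for i, j, posterior in align:
            # keep high-confidence links whose Chinese word has a full trigram
            if posterior <= threshold or not (0 < i < len(zh) - 1):
                continue
            trigram = (zh[i - 1], zh[i], zh[i + 1])   # Chinese trigram type t
            counts[(trigram, en[j])] += 1             # middle word aligns to u

    totals = Counter()
    for (t, u), c in counts.items():
        totals[t] += c
    weights = defaultdict(dict)
    for (t, u), c in counts.items():
        weights[t][u] = c / totals[t]                 # proportion of tuple counts
    return weights
```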

3.4. Graph Initialization. So far we have described how to construct the graph, but the graph is still completely unlabeled. For label propagation, a supervised English chunking tagger is used to label the English side of the parallel corpus in the graph initialization phase. Afterwards, we simply count the individual labels of each English token and then normalize the counts to get the tag distribution over English unigram types. These distributions are used to initialize the distributions of the English vertices in the graph. Since the English vertices are extracted from a bitext, all English word-type vertices are assigned an initial label distribution.
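For illustration, the token-to-type aggregation might look as follows (a sketch; the tagged-sentence format is an assumption of this example):

```python
from collections import Counter, defaultdict

def init_english_vertices(tagged_english_sentences):
    """tagged_english_sentences: [[(word, chunk_tag), ...], ...] produced by the
    supervised English chunker. Returns a tag distribution per word type."""
    tag_counts = defaultdict(Counter)
    for sentence in tagged_english_sentences:
        for word, tag in sentence:
            tag_counts[word][tag] += 1            # aggregate token labels to types
    return {word: {tag: n / sum(c.values()) for tag, n in c.items()}
            for word, c in tag_counts.items()}    # normalize to distributions
```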

3.5. Graph Example. Figure 3 shows a small excerpt from a Chinese-English graph. We see that only the Chinese trigrams [一项重要选择], [因素是这样], and [不断深入] are connected to English translated words. All these English vertices are labeled by a supervised tagger. In this particular case, the neighborhoods can be more diverse, and a soft label distribution is allowed over these vertices. As one can see, Figure 3 is composed of three subgraphs. In each subgraph, it is worth noting that the middle words of the Chinese trigram types have the same chunk type (as the labeled one). This reflects the fact that the monolingual similarity function guarantees that connected vertices have the same syntactic category. The label propagation process then spreads the English words' tags to the corresponding Chinese trigram vertices. After that, labels are further propagated among the Chinese vertices. This propagation conveys the tags inwards and results in tag distributions for the middle word of each Chinese trigram type. More details on how to project the chunks and propagate the labels are given in the following section.

3.6. Label Propagation. Label propagation operates on the constructed graph. The primary objective is to propagate labels from a few labeled vertices to the entire graph by optimizing a loss function based on constraints or properties derived from the graph, for example, smoothness [4, 8] or sparsity [9]. State-of-the-art label propagation algorithms include LP-ZGL [4], adsorption [10], MAD [8], and sparsity-inducing penalties [9].

Label propagation is applied in two phases to generate soft label distributions over all the vertices in the bilingual graph. In the first stage, label propagation is used to project the English chunk labels to the Chinese side. In detail, we simply transfer the label distributions from the English word types to the connected Chinese vertices (i.e., $V_c^l$) at the periphery of the graph. Note that not all Chinese vertices are connected to English words when only the high-confidence alignments are considered. In this stage, a tag distribution $d_i$ over the labels $y$ is generated, representing the proportion of times the center word $c_i \in V_c$ aligns to English words $e_y$ tagged with label $y$:

$$d_i(y) = \frac{\sum_{e_y} [c_i \leftrightarrow e_y]}{\sum_{y'} \sum_{e_{y'}} [c_i \leftrightarrow e_{y'}]} \quad (2)$$

The second phase is conventional label propagation, which propagates the labels from the peripheral vertices $V_c^l$ to all Chinese vertices in the constructed graph. The optimization over the similarity graph is based on the objective function

$$P(q) = \sum_{c_i \in V_c \setminus V_c^l} \sum_{c_j \in N(c_i)} w_{ij} \left\| q_i - q_j \right\|^2 + \lambda \left\| q_i - U \right\|^2$$

$$\text{s.t.} \quad \sum_y q_i(y) = 1 \;\; \forall c_i; \qquad q_i(y) \ge 0 \;\; \forall c_i, y; \qquad q_i = d_i \;\; \forall c_i \in V_c^l \quad (3)$$

where $q_i$ ($i = 1, \ldots, |V_c|$) are the label distributions over all Chinese vertices and $\lambda$ is a hyperparameter that will be discussed in Section 4. Note that $\|q_i - q_j\|^2 = \sum_y (q_i(y) - q_j(y))^2$ is a squared loss, used to penalize differences between neighboring vertices so that connected neighbors end up with similar label distributions. Furthermore, the second part of the regularization pushes the label distribution over all possible labels $y$ towards the uniform distribution $U$. Together, these terms make the objective function convex.

The first term in (3) is a smoothness regularizer which encourages vertices with a large $w_{ij}$ between them to have similar label distributions. The second term regularizes and encourages all type marginals to be uniform; it also ensures that the converged marginals of unlabeled vertices are uniform over all tags if those types have no path to any labeled vertex, in which case the middle word of such an unlabeled trigram may take on any possible tag.

Although a closed-form solution can be derived from the objective function above, it would require inverting a matrix of order $|V_c|$. To avoid this, we rely on an iterative update-based algorithm instead. The update is as follows:

$$q_i^{(m)}(y) = \begin{cases} d_i(y) & \text{if } c_i \in V_c^l \\ \dfrac{\gamma_i(y)}{\kappa_i} & \text{otherwise} \end{cases} \quad (4)$$


Figure 3: An example of a similarity graph over trigrams on labeled and unlabeled data.

where, for all $c_i \in V_c \setminus V_c^l$, $\gamma_i(y)$ and $\kappa_i$ are defined as follows:

$$\gamma_i(y) = \sum_{c_j \in N(c_i)} w_{ij} \, q_j^{(m-1)}(y) + \lambda U(y)$$

$$\kappa_i = \lambda + \sum_{c_j \in N(c_i)} w_{ij} \quad (5)$$

This procedure is run for 10 iterations in our experiments.
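A direct transcription of updates (4) and (5) into Python might look like this (a sketch; the graph data structures are assumed to come from the construction in Section 3):

```python
def propagate_labels(labeled, neighbors, weights, tags, lam=1.0, iterations=10):
    """Iterative updates (4)-(5). labeled: {vertex: {tag: prob}} for the peripheral
    vertices V_c^l (the d_i of eq. (2)); neighbors: {vertex: [vertex]}, where every
    listed neighbor is itself a key of `neighbors`; weights: {(v_i, v_j): w_ij}."""
    uniform = 1.0 / len(tags)
    q = {v: labeled.get(v, {t: uniform for t in tags}) for v in neighbors}
    for _ in range(iterations):
        new_q = {}
        for v, nbrs in neighbors.items():
            if v in labeled:
                new_q[v] = labeled[v]          # labeled vertices stay fixed (eq. (4))
                continue
            kappa = lam + sum(weights.get((v, n), 0.0) for n in nbrs)   # eq. (5)
            new_q[v] = {t: (sum(weights.get((v, n), 0.0) * q[n].get(t, 0.0)
                                for n in nbrs) + lam * uniform) / kappa
                        for t in tags}
        q = new_q
    return q
```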

4. Unsupervised Chunk Induction

In Section 3, the bilingual graph construction was described, whereby English chunk tags can be projected to the Chinese side by simple label propagation. This relies on high-confidence word alignments, so many Chinese vertices cannot be projected to directly from English words. To compensate, label propagation is used to propagate further among the Chinese trigram types. This section introduces the unsupervised approach used to build a practical shallow parsing system; the implementation of the chunk tag identification system is discussed and presented in detail.

4.1. Data Representation

4.1.1. Universal Chunk Tags. A universal tag set was first proposed by [11]; it consists of twelve universal part-of-speech categories and was used to evaluate their cross-lingual POS projection system on six different languages. In this work, we follow that idea but focus on universal chunk tags for English and Chinese, for several reasons. First, such a tag set is useful for building and evaluating unsupervised and cross-lingual chunk taggers. Additionally, taggers constructed on a universal tag set allow a more reasonable comparison across different supervised chunking approaches: when two different tagging standards are applied to the two languages, it is vacuous to state that "shallow parsing for language A is much harder than for language B" since the tag sets are incomparable. Finally, it also permits our model to train the chunk taggers with a common tag set across multiple languages, without having to maintain language-specific unification rules arising from differences in Treebank annotation guidelines.

The following compares the definitions of the tags used in the English Wall Street Journal (WSJ) corpus [2] and the Chinese Penn Treebank (CTB) 7.0 [12], both of which are used in our experiments.

(i) The Phrasal Tags of Chinese. In the present corpus, CTB7, each chunk is labeled with one syntactic category. A tag set of 13 categories is used to represent the shallow parsing structural information, as follows:

ADJP: adjective phrase, headed by an adjective.
ADVP: adverbial phrase, headed by an adverb.
CLP: classifier phrase, for example, (QP (CD 一) (CLP (M 系列))) (a series).
DP: determiner phrase; in the CTB7 corpus, a DP is a modifier of an NP if it occurs within an NP, for example, (NP (DP (DT 任何)) (NP (NN 人))) (any people).
DNP: phrase formed by "XP + 的 (DEG)" ('s) that modifies an NP; the XP can be an ADJP, DP, QP, NP, PP, or LCP.
DVP: phrase formed by "XP + 地 (-ly)" that modifies a VP.
FRAG: fragment, used to mark fragmented elements that cannot be built into a full structure by using null categories.
LCP: phrase formed by a localizer and its complement, for example, (LCP (NP (NR 皖南) (NN 事变)) (LC 中)) (in the Southern Anhui Incident).
LST: list marker; numbered lists, letters, or numbers which identify items in a list, together with their surrounding punctuation, are labeled as LST.
NP: noun phrase, a phrasal category that includes all constituents depending on a head noun.
PP: preposition phrase, headed by a preposition.
QP: quantifier phrase, used within NP, for example, (QP (CD 五百) (CLP (M 辆))) (500).
VP: verb phrase, headed by a verb.

(ii) The Phrasal Tags of English. In the English WSJ corpus, there are just eight categories: NP (noun phrase), PP (prepositional phrase), VP (verb phrase), ADVP (adverb phrase), ADJP (adjective phrase), SBAR (subordinating conjunction), PRT (particle), and INTJ (interjection).

As we can see, the CTB7 criteria differ significantly from those of the English WSJ corpus. Unlike in the English Treebank, Chinese constituents are normally smaller: a chunk that aligns to a base-NP phrase on the English side can be divided into a QP tag, a CLP tag, and an NP tag.

Table 2: The description of the universal chunk tags.

Tag     Description                  Words                     Example
NP      Noun phrase                  DET + ADV + ADJ + NOUN    The strange birds
PP      Preposition phrase           TO + IN                   In between
VP      Verb phrase                  ADV + VB                  Was looking
ADVP    Adverb phrase                ADV                       Also
ADJP    Adjective phrase             CONJ + ADV + ADJ          Warm and cozy
SBAR    Subordinating conjunction    IN                        Whether or not

For example, after chunking, "五百辆车" (500 cars) will be tagged as (NP (QP (CD 五百) (CLP (M 辆))) (NN 车)) on the Chinese side, while on the English side "500 cars" is denoted by a single NP tag. This nonuniform standard results in mismatched projections during the alignment phase and in difficulties of evaluation in the cross-lingual setting. To fix these problems, mappings are applied in the universal chunk tag set. The categories CLP, DP, and QP are merged into the NP phrase; that is to say, a phrase such as (NP (QP (CD 五百) (CLP (M 辆))) (NN 车)), which corresponds to the English NP chunk "500 cars", is assigned an NP tag in the universal chunk tag set. Additionally, the phrases DNP and DVP are folded into the ADJP and ADVP tags, respectively.

On the English side, the occurrence of INTJ is 0 according to our statistics, so the INTJ chunk tag can be ignored. Additionally, words belonging to a PRT tag are always regarded as a VP in Chinese; for example, (PRT (RP up)) (NP (DET the) (NNS stairs)) is aligned to "上楼梯" in Chinese, where "上" is a VP. These merges are summarized in the sketch below.
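The sketch records the merges described above as plain lookup tables; CTB7 tags whose treatment is not discussed in the text (LCP, FRAG, LST) are deliberately omitted rather than guessed.

```python
# Mapping from Treebank-specific phrasal tags to the universal chunk tags,
# as described in the text. CTB7 tags not discussed there (LCP, FRAG, LST)
# are intentionally left out of this sketch.
CTB_TO_UNIVERSAL = {
    "NP": "NP", "QP": "NP", "CLP": "NP", "DP": "NP",   # CLP, DP, QP merge into NP
    "DNP": "ADJP", "ADJP": "ADJP",                     # DNP folds into ADJP
    "DVP": "ADVP", "ADVP": "ADVP",                     # DVP folds into ADVP
    "PP": "PP", "VP": "VP",
}
WSJ_TO_UNIVERSAL = {
    "NP": "NP", "PP": "PP", "VP": "VP",
    "ADVP": "ADVP", "ADJP": "ADJP", "SBAR": "SBAR",
    "PRT": "VP",   # particles behave like VPs on the Chinese side
    # INTJ never occurs in the data and is ignored
}
```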

4.1.2. IBO2 Chunk Tags. There are many ways to encode the phrase structure of a chunk tag, such as IBO1, IBO2, IOE1, and IOE2 [13]. The one used in this study is the IBO2 encoding. This format ensures that all initial words of different base phrases receive a B tag, which identifies the boundary of each chunk. The boundary types are defined as follows:

(i) B-X represents the beginning of a chunk X;
(ii) I-X indicates a noninitial word in a phrase X;
(iii) O marks any word outside the boundary of any chunk.

Hence, an input Chinese sentence can be represented using these notations as follows:

去年 (NP-B) 实现 (VP-B) 进出口 (NP-B) 总值 (NP-I) 达 (VP-B) 一千零九十八点二亿 (NP-B) 美元 (NP-I) 。(O)
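A small helper makes the encoding concrete (a sketch; the chunk input format is an assumption of this example):

```python
def to_ibo2(chunks):
    """chunks: list of (label, words) pairs, with label None for words outside
    any chunk, e.g. [("NP", ["去年"]), ("VP", ["实现"]), (None, ["。"])].
    Returns one IBO2 tag per word, in the 'NP-B'/'NP-I'/'O' style used above."""
    tags = []
    for label, words in chunks:
        if label is None:
            tags.extend("O" for _ in words)
        else:
            tags.append(f"{label}-B")                     # initial word of the chunk
            tags.extend(f"{label}-I" for _ in words[1:])  # noninitial words
    return tags
```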

Based on the above discussion, a tag set consisting of six universal chunk categories is proposed, as shown in Table 2. While there is plenty of controversy about universal tags and what the exact tag set should be, these six chunk categories are sufficient to cover the most frequent chunk tags in English and Chinese. Furthermore, a mapping from the fine-grained chunk tags of the two languages to the universal chunk set has been developed. Hence, a dataset with a common chunk tag set for English and Chinese is obtained by combining the original Treebank data with the universal chunk tag set and the mapping between the two languages.

4.2. Chunk Induction. The label distribution of a Chinese word type $x$ can be computed by marginalizing the chunk tag distribution of the trigram types $c_i = x_{-1} x_0 x_{+1}$ over the left and right contexts:

$$p(y \mid x) = \frac{\sum_{x_{-1}, x_{+1}} q_i(y)}{\sum_{x_{-1}, x_{+1}} \sum_{y'} q_i(y')} \quad (6)$$

Then a set of possible tags $t_x(y)$ is extracted by eliminating the labels whose probability is below a threshold value $\tau$:

$$t_x(y) = \begin{cases} 1 & \text{if } p(y \mid x) \ge \tau \\ 0 & \text{otherwise} \end{cases} \quad (7)$$

How to choose $\tau$ is described in Section 5. The vectors $t_x$ cover every word occurring in the Chinese trigram types and are employed as features in the unsupervised Chinese shallow parsing.
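Equations (6) and (7) translate into a few lines (a sketch, with `q` being the propagated trigram distributions of Section 3.6 and the default `tau` matching the value chosen in Section 5):

```python
from collections import defaultdict

def extract_tag_dictionary(q, tau=0.2):
    """q: {(left, center, right): {tag: prob}} after label propagation.
    Returns t_x: {word: set of allowed tags}, per equations (6) and (7)."""
    sums = defaultdict(lambda: defaultdict(float))
    for (left, center, right), dist in q.items():
        for tag, prob in dist.items():
            sums[center][tag] += prob         # marginalize over contexts (eq. (6))
    t_x = {}
    for word, dist in sums.items():
        total = sum(dist.values())
        t_x[word] = {tag for tag, mass in dist.items()
                     if mass / total >= tau}  # threshold the distribution (eq. (7))
    return t_x
```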

Similar to the work of [14, 15], our chunk induction model is built on the feature-based hidden Markov model (HMM). A chunking HMM generates a sequence of words in order. In each generation step, conditioned on the current chunk tag $z_i$, an observed word $x_i$ and a hidden successor chunk tag $z_{i+1}$ are generated independently. The model has two types of conditional distributions, emission and transition probabilities, which are both multinomial probability distributions. Given a sentence $x$ and a state sequence $z$, a first-order Markov model defines the joint distribution

$$P_\theta(X = x, Z = z) = P_\theta(Z_1 = z_1) \cdot \prod_{i=1}^{|x|} P_\theta(Z_{i+1} = z_{i+1} \mid Z_i = z_i) \cdot P_\theta(X_i = x_i \mid Z_i = z_i) \quad (8)$$

where the second factor is the transition probability and the third factor is the emission probability. Unlike the conventional Markov model, the feature-based model replaces the emission distribution with a log-linear model:

$$P_\theta(X_i = x \mid Z_i = z) = \frac{\exp\left(\theta^\top f(x, z)\right)}{\sum_{x' \in \mathrm{Val}(X)} \exp\left(\theta^\top f(x', z)\right)} \quad (9)$$

where $\mathrm{Val}(X)$ corresponds to the entire Chinese vocabulary.

Table 3: Feature templates used in unsupervised chunking.

Basic: $(x = \cdot, z = \cdot)$
Contains digit: checks whether $x$ contains a digit, conjoined with $z$: (contains_digit($x$) $= \cdot$, $z = \cdot$)
Contains hyphen: (contains_hyphen($x$) $= \cdot$, $z = \cdot$)
Suffix: indicator features for character suffixes of up to length 1 present in $x$
Prefix: indicator features for character prefixes of up to length 1 present in $x$
POS tag: indicator feature for the word POS assigned to $x$

The advantage of using this locally normalized log-linear model is the ability to look at various aspects of the observation $x$, incorporating features of the observations. The model is then trained with the following objective function:

$$\mathcal{L}(\theta) = \sum_{i=1}^{N} \log \sum_{z} P_\theta\left(X = x^{(i)}, Z = z\right) - C \|\theta\|^2 \quad (10)$$

It is worth noting that this function marginalizes out all possible configurations $z$ for every sentence $x^{(i)}$, which leads to a nonconvex objective. We optimize this objective function using L-BFGS, a quasi-Newton method proposed by Liu and Nocedal [16]. Evidence from past experiments shows that this direct gradient algorithm performs better than a feature-enhanced modification of EM (expectation-maximization).

Moreover, this state-of-the-art model has the further advantage of making it easy to experiment with various ways of incorporating the constraint features into the log-linear model. The feature function $f_t$ consists of the relevant information extracted from the smoothed graph and eliminates the hidden states that are inconsistent with the threshold vector $t_x$.

4.3. Unsupervised Chunking Features. The feature selection process in the unsupervised chunking model affects the identification result. Feature selection is the task of identifying an optimal feature subset that best represents the underlying classification problem by filtering out irrelevant and redundant features. Unlike the CoNLL-2000 shared task on supervised chunking, which uses the WSJ (Wall Street Journal) corpus as background knowledge for constructing the chunking model, our corpus is composed only of words and does not contain any chunk information, which forces us to form an optimal feature subset for the underlying chunk pattern. We use the feature templates shown in Table 3. These are all coarse features on emission contexts, extracted from words with certain orthographic properties; only the basic feature is used for transitions. For any emission context with word $x$ and tag $z$, we construct the feature templates of Table 3.

Box 1 illustrates an example of the feature subset. The feature set of each word in a sentence is a vector of six dimensions holding the values of the corresponding features, and the label indicates which chunk label the word should receive.


Box 1: Example of the feature templates.

Sentence: 中国/NR/NP-B 建筑业/NN/NP-I 对/P/PP-B 外/NN/NP-B 开放/VV/VP-B 呈现/VV/VP-B 新/JJ/NP-B 格局/NN/NP-I
Word: 建筑业
Instance: (建筑业, NP-I); (contains_digit = N, NP-I); (contains_hyphen = N, NP-I); prefix (建); suffix (业); POS: NN

5. Experiments and Results

Before presenting the results of our experiments, we first describe the evaluation metrics, the datasets, and the baselines used for comparison.

5.1. Evaluation Metrics. The metrics for evaluating the NP chunking model are the precision rate, the recall rate, and their harmonic mean, the $F_1$-score. The tagging accuracy is defined as

$$\text{Tagging Accuracy} = \frac{\text{the number of correctly tagged words}}{\text{the number of words}} \quad (11)$$

Precision measures the percentage of labeled NP chunks that are correct; a correctly tagged word requires both the correct boundary of the NP chunk and the correct label. Precision is therefore defined as

$$\text{Precision} = \frac{\text{the number of correctly proposed tagged words}}{\text{the number of proposed chunk tags}} \quad (12)$$

Recall measures the percentage of NP chunks present in the input sentence that are correctly identified by the system:

$$\text{Recall} = \frac{\text{the number of correctly proposed tagged words}}{\text{the number of correct chunk tags}} \quad (13)$$

The $F$-measure combines the two measures above into one metric:

$$F_\beta\text{-score} = \frac{(\beta^2 + 1) \times \text{Recall} \times \text{Precision}}{\beta^2 \times \text{Recall} + \text{Precision}} \quad (14)$$

where $\beta$ weights the relative importance of recall and precision; when $\beta = 1$, precision and recall are equally important.
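Concretely, once chunks are recovered from the IBO2 tags, these metrics can be computed as follows (a sketch; $\beta = 1$ by default):

```python
def chunk_spans(tags):
    """Recover (label, start, end) spans from IBO2 tags like 'NP-B'/'NP-I'/'O'."""
    spans, start = set(), None
    for i, tag in enumerate(tags + ["O"]):            # sentinel closes a final chunk
        if start is not None and not tag.endswith("-I"):
            spans.add((tags[start].split("-")[0], start, i))
            start = None
        if tag.endswith("-B"):
            start = i
    return spans

def precision_recall_f1(gold_tags, predicted_tags, beta=1.0):
    gold, pred = chunk_spans(gold_tags), chunk_spans(predicted_tags)
    correct = len(gold & pred)                        # correct boundary and label
    p = correct / len(pred) if pred else 0.0
    r = correct / len(gold) if gold else 0.0
    denom = beta**2 * r + p
    f = (beta**2 + 1) * r * p / denom if denom else 0.0   # eq. (14)
    return p, r, f
```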

5.2. Datasets. To facilitate the bilingual graph construction and the unsupervised chunking experiments, two kinds of datasets are utilized in this work: (1) a monolingual Treebank for Chinese chunk induction and (2) a large parallel corpus of English and Chinese for bilingual graph construction and label propagation.

Table 4: (a) English-Chinese parallel corpus; (b) graph instances; (c) Chinese unlabeled dataset (CTB7 corpora).

(a)
Number of sentence pairs    Number of seeds    Number of words
10000                       27940              31678

(b)
Number of sentences    Number of vertices
17617                  185441

(c)
Dataset             Source            Number of sentences
Training dataset    Xinhua 1–321      7617
Testing dataset     Xinhua 363–403    912

For the monolingual Treebank data, we rely on the Penn Chinese Treebank 7.0 (CTB7) [12]. It consists of over one million words of annotated and parsed text from Chinese newswire, magazines, various broadcast news and broadcast conversation programs, web newsgroups, and weblogs. The parallel data comes from the UM corpus [17], which contains 100,000 pairs of English-Chinese sentences. The training and testing sets defined for the experiments, with their corresponding statistics, are shown in Table 4. In addition, Table 4 shows the detailed information of the data used for the bilingual graph construction, including the parallel corpus for the tag projection and the label propagation over Chinese trigrams.

5.3. Chunk Tag Set and HMM States. As described above, the universal chunk tag set is employed in our experiments. This set $U$ consists of the following six coarse-grained tags: NP (noun phrase), PP (prepositional phrase), VP (verb phrase), ADVP (adverb phrase), ADJP (adjective phrase), and SBAR (subordinating conjunction). Although there might be some controversy about the exact definition of such a tag set, these six categories cover the most frequent chunk tags that exist in one form or another in both the English and Chinese corpora.

For each language under consideration, a mapping from the fine-grained, language-specific chunk tags in the Treebank to the universal chunk tags is provided. Hence, the unsupervised chunking model is trained on datasets labeled with the universal chunk tags.


Table 5: Chunk tagging evaluation results for the baselines and the proposed graph-based model.

        Feature-HMM           Projection            Graph-based
Tag     Pre    Rec    F1      Pre    Rec    F1      Pre    Rec    F1
NP      0.60   0.67   0.63    0.67   0.72   0.68    0.81   0.82   0.81
VP      0.56   0.51   0.53    0.61   0.57   0.59    0.78   0.74   0.76
PP      0.36   0.28   0.32    0.44   0.31   0.36    0.60   0.51   0.55
ADVP    0.40   0.46   0.43    0.45   0.52   0.48    0.64   0.68   0.66
ADJP    0.47   0.53   0.50    0.48   0.58   0.51    0.67   0.71   0.69
SBAR    0.00   0.00   0.00    0.50   1.00   0.66    0.50   1.00   0.66
All     0.49   0.51   0.50    0.57   0.62   0.59    0.73   0.75   0.74

In this work, the number of latent HMM states for Chinese is fixed to a constant, namely, the number of universal chunk tags.

5.4. Various Models. To probe our approach with a thorough analysis, we evaluated two baselines in addition to our graph-based algorithm. We were intentionally lenient with the two baselines.

(i) Feature-HMM. The first baseline is the vanilla feature-HMM proposed by [9], which applies L-BFGS to estimate the model parameters and uses a greedy many-to-1 mapping for evaluation.

(ii) Projection. The direct projection serves as the second baseline. It incorporates the bilingual information by projecting chunk tags from the English side to the aligned Chinese text. Several rules were added to fix the unaligned-word problem. We trained a supervised feature-HMM with the same number of sentences from the bitext as in the Treebank.

(iii) Our Graph-Based Model. The full model uses both stages of label propagation (Section 3.6) before extracting the constraint features. As a result, the label distributions of all Chinese word types are included in the constraint features.

5.5. Experimental Setup. The experimental platform is implemented on top of two toolkits: Junto [18] and Ark [14, 15]. Junto is a Java-based label propagation toolkit used for the bilingual graph construction; Ark is a Java-based package used for the feature-HMM training and testing.

For the feature-HMM training, a few hyperparameters need to be set to minimize the number of free parameters in the model. C = 1.0 was used as the regularization constant in (10), and L-BFGS was run for 1000 iterations. Several threshold values of $\tau$ for extracting the vectors $t_x$ were tried, and 0.2 was found to work best; thus $\tau = 0.2$ is used for the Chinese trigram types.

5.6. Results. Table 5 shows the complete set of results. As expected, the projection baseline performs better than the feature-HMM, since it benefits from the bilingual information: it improves upon the monolingual baseline by 12% in F1-score. Among the unsupervised approaches, the feature-HMM achieves only a 50% F1-score on the universal chunk tags. Overall, the graph-based model outperforms both baselines; the improvement of the feature-HMM under the graph-based setting is statistically significant with respect to the other models. The graph-based model performs 24% better than the state-of-the-art feature-HMM and 12% better than the direct projection model.

6. Conclusion

This paper has presented an unsupervised graph-based approach to Chinese chunking that uses label propagation to project chunk information across languages. Our results suggest that it is possible to learn accurate chunk tags for unlabeled text from a bitext, that is, data with parallel English text. In the future, we plan to extend this graph-based unsupervised approach to chunking for other languages that lack ready-to-use resources.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

The authors are grateful to the Science and Technology Development Fund of Macau and the Research Committee of the University of Macau (Grant nos. MYRG076(Y1-L2)-FST13-WF and MYRG070(Y1-L2)-FST12-CS) for the funding support of their research.

References

[1] R. Koeling, "Chunking with maximum entropy models," in Proceedings of the 2nd Workshop on Learning Language in Logic and the 4th Conference on Computational Natural Language Learning, vol. 7, pp. 139–141, 2000.

[2] M. P. Marcus, M. A. Marcinkiewicz, and B. Santorini, "Building a large annotated corpus of English: the Penn Treebank," Computational Linguistics, vol. 19, no. 2, pp. 313–330, 1993.

[3] D. Yarowsky and G. Ngai, "Inducing multilingual POS taggers and NP bracketers via robust projection across aligned corpora," in Proceedings of the 2nd Meeting of the North American Chapter of the Association for Computational Linguistics (NAACL '01), pp. 200–207, 2001.

[4] X. Zhu, Z. Ghahramani, and J. Lafferty, "Semi-supervised learning using Gaussian fields and harmonic functions," in Proceedings of the 20th International Conference on Machine Learning (ICML '03), vol. 3, pp. 912–919, August 2003.

[5] Y. Altun, M. Belkin, and D. A. McAllester, "Maximum margin semi-supervised learning for structured variables," in Advances in Neural Information Processing Systems, pp. 33–40, 2005.

[6] A. Subramanya, S. Petrov, and F. Pereira, "Efficient graph-based semi-supervised learning of structured tagging models," in Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 167–176, October 2010.

[7] F. J. Och and H. Ney, "A systematic comparison of various statistical alignment models," Computational Linguistics, vol. 29, no. 1, pp. 19–51, 2003.

[8] A. Subramanya and J. Bilmes, "Semi-supervised learning with measure propagation," The Journal of Machine Learning Research, vol. 12, pp. 3311–3370, 2011.

[9] D. Das and N. A. Smith, "Semi-supervised frame-semantic parsing for unknown predicates," in Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics, pp. 1435–1444, June 2011.

[10] S. Baluja, R. Seth, D. Sivakumar, et al., "Video suggestion and discovery for YouTube: taking random walks through the view graph," in Proceedings of the 17th International Conference on World Wide Web, pp. 895–904, April 2008.

[11] S. Petrov, D. Das, and R. McDonald, "A universal part-of-speech tagset," 2011, http://arxiv.org/abs/1104.2086.

[12] N. Xue, F. Xia, F.-D. Chiou, and M. Palmer, "The Penn Chinese TreeBank: phrase structure annotation of a large corpus," Natural Language Engineering, vol. 11, no. 2, pp. 207–238, 2005.

[13] E. F. Tjong Kim Sang and H. Déjean, "Introduction to the CoNLL-2001 shared task: clause identification," in Proceedings of the Workshop on Computational Natural Language Learning, vol. 7, p. 8, 2001.

[14] T. Berg-Kirkpatrick, A. Bouchard-Côté, J. DeNero, and D. Klein, "Painless unsupervised learning with features," in Proceedings of the Human Language Technologies Conference of the North American Chapter of the Association for Computational Linguistics, pp. 582–590, Los Angeles, Calif, USA, June 2010.

[15] A. B. Goldberg and X. Zhu, "Seeing stars when there aren't many stars: graph-based semi-supervised learning for sentiment categorization," in Proceedings of the 1st Workshop on Graph Based Methods for Natural Language Processing, pp. 45–52, 2006.

[16] D. C. Liu and J. Nocedal, "On the limited memory BFGS method for large scale optimization," Mathematical Programming B, vol. 45, no. 3, pp. 503–528, 1989.

[17] L. Tian, F. Wong, and S. Chao, "An improvement of translation quality with adding key-words in parallel corpus," in Proceedings of the International Conference on Machine Learning and Cybernetics (ICMLC '10), vol. 3, pp. 1273–1278, July 2010.

[18] P. P. Talukdar and F. Pereira, "Experiments in graph-based semi-supervised learning methods for class-instance acquisition," in Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics (ACL '10), pp. 1473–1481, July 2010.

Submit your manuscripts athttpwwwhindawicom

Computer Games Technology

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Distributed Sensor Networks

International Journal of

Advances in

FuzzySystems

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014

International Journal of

ReconfigurableComputing

Hindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Applied Computational Intelligence and Soft Computing

thinspAdvancesthinspinthinsp

Artificial Intelligence

HindawithinspPublishingthinspCorporationhttpwwwhindawicom Volumethinsp2014

Advances inSoftware EngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Journal of

Computer Networks and Communications

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation

httpwwwhindawicom Volume 2014

Advances in

Multimedia

International Journal of

Biomedical Imaging

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

ArtificialNeural Systems

Advances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Computational Intelligence and Neuroscience

Industrial EngineeringJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Human-ComputerInteraction

Advances in

Computer EngineeringAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Page 2: Research Article Unsupervised Chunking Based on Graph ...downloads.hindawi.com/journals/tswj/2014/401943.pdf · Unsupervised Chunking Based on Graph Propagation from Bilingual Corpus

2 The Scientific World Journal

that we have access to parallel English dataThis part of workis closest to [3] but there are twomain differences to improvethe weakness of dealing with the unaligned words

First a novel graph-based framework is applied to projectsyntactic information across language boundaries for severalreasons as follows

(i) Graphs can indicate complicated relationshipsbetween classes and instances For instance anambiguous instance interest could belong to the classof both NP and VP

(ii) Chunks from multiple sources such as different treebanks web sitesand texts can be represented in asingle graph

(iii) The paths of information propagation of graphsmake it possible to explicit the potential commoninformation among different instances That is to saythe algorithm relies on the fact that neighbor casesshould belong to the same class and the relationshipsbetween data points are captured in the form of asimilarity graph with vertices corresponding to thecases and edge weights to their similarity

In the end a bilingual graph is constructed to represent therelationships between English and Chinese

Second rather than directly using these projected labelsas features in supervised training phases we prefer to employthem in unsupervised model Additionally to facilitate bilin-gual unsupervised chunking research and standardize bestpractice a tag set consists of eight universal chunk categoriesis proposed in this work

The paper is organized as follows Section 2 introducesthe overall approach of our work Section 3 focuses on howto establish the graph and carry out the label propagationSection 4 describes the unsupervised chunk induction andfeature selection in detail In Section 5 the evaluation resultsare presented followed by a conclusion to end this paper

2 Approach Overview

The central aim of our work is to build a chunk taggerfor Chinese sentences assuming that there is an Englishchunk tagger and some parallel corpuses between these twolanguages Hence we can conclude that the emphasis of ourapproach is how to build a bilingual graph from the sentence-aligned Chinese-English corpus Two types of vertices willbe employed in our graph trigram types are used on theChinese side corresponding to different vertices while onthe English side the vertices are individual word typesDuring the graph construction no labeled data is neededbut does require two similarity functions A cooccurrencebased similarity function is applied to compute edge weightsbetween Chinese trigrams This function is designed toindicate the syntactical similarity of the middle words ofthe neighbor trigrams A second similarity function whichleverages standard unsupervised word alignment statistics isemployed to establish a soft correspond between Chinese andEnglish

NPa-I NPa-I NPa-I

[NP-B

The

NP-I

key

NP-I

of

NP-I]a

mine

(my) (of) (key)

Figure 1 Direct label projection from English to Chinese withposition information

NP-B NP-I NP-I

[NP-BThe

NP-Ikey

NP-Iof

NP-I]amine

(my) (of) (key)

Figure 2 Adjust the Cchunk tag based on position information

Based on the reality that we do not have labeled Chinesedata we aim to project the syntactic information from theEnglish side to the Chinese side Before initializing the grapha supervised chunking model is used to tag the English sideof the parallel text The label distributions for the Englishvertices are generated by aggregating the chunk labels ofthe English tokens to types Consider the crossover problemand different word orders between Chinese and English theposition information is also considered in our work Basedon this idea the chunk tags are projected directly fromEnglish side to Chinese along with the position as shownin Figure 1 Based on the position information we can getthe exact boundary of each chunk at the Chinese side Thenan adjustment is made to assign the correct label as shownin Figure 2 After graph initialization label propagation isapplied to transfer the chunk tags to the peripheral Chinesevertices firstly (ie the ones which are adjacent to the Englishvertices) followed by further propagating among all theChinese vertices It is worth mentioning that the chunkdistributions over the Chinese trigram types are regarded asfeatures for learning a better unsupervised chunking taggerThe following sections will explain these steps in detail(see Algorithm 1)

3 Graph Construction

In graph learning approaches one constructs a graph whosevertices are labeled and unlabeled examples and whoseweighted edges encode the degree to which examples theylink have the same label [4] Notice that graph constructionused for the problems of structured prediction such as

The Scientific World Journal 3

Bilingual Chunking InductionRequire Parallel English and foreign language data119863

119890 and119863119888 unlabeled foreign training data Γ

119888 English taggerEnsure 120579119888 a set of parameters learned using a constrainedunsupervised model(1)119863119890harr119888 larr word-align-bitext (119863119890119863119888)(2)119863119890 larr chunk-tag-supervised (119863119890)(3) 119860 (119863119890harr119888119863119890)(4) 119866 larr construct-graph (Γ119888119863119888 119860)(5) 119866 larr graph-propagate (119866)(6) Δ larr extract-word-constraints (119866)(7) 120579119891 larr chunk-induce-constraints (Γ119888 Δ)(8) return 120579

119891

Algorithm 1 Graph-based unsupervised chunking approach

shallow parsing is nontrivial for the following two reasonsFirst it is necessary for resolving ambiguous problem byemploying individual words as the vertices instead of thecontext Besides if we use the whole sentences as the verticesduring graph construction it is unclear how to calculatethe sequence similarity Altun et al [5] proposed a graph-based semisupervised learning approach by using the sim-ilarity between labeled and unlabeled structured data in adiscriminative framework Recently a graph over the cliquesin an underlying structured model was defined by [6] Intheir work a new idea has been proposed that one can usea graph over trigram types combing with the weights basedon distribution similarity to improve the supervised learningalgorithms

3.1. Graph Vertices. In this work, we extend the intuitions of [6] to set up the bilingual graph. Two different types of vertices are applied for the two languages because the information transformation in our graph is asymmetric (from English to Chinese). On the Chinese side, the vertices (denoted by D_c) correspond to trigram types, the same as in [6]. The English vertices (denoted by D_e), however, correspond to word types. The reason we do not need trigram types on the English side for disambiguation is that each English word is going to be labeled. Additionally, in label propagation and graph construction, we do not connect English words to each other, but only to the Chinese vertices.

Furthermore, the vertices consist of those extracted from the two sides of the word-aligned bitext (D_e, D_c) and an additional unlabeled Chinese monolingual corpus Γ_c, which will be used later in the training phase. As noted, two different similarity functions will be employed to define the edge weights between vertices from the two languages and among the Chinese vertices.

3.2. Monolingual Similarity Function. This section briefly introduces how the monolingual similarity between connected pairs of Chinese trigram types is computed. Specifically, the set V_c consists of all the word n-grams that occur in the text.

Table 1: Various features for computing edge weights between Chinese trigram types.

Description                  Feature
Trigram + context            x_1 x_2 x_3 x_4 x_5
Trigram                      x_2 x_3 x_4
Left context                 x_1 x_2
Right context                x_4 x_5
Center word                  x_3
Trigram minus center word    x_2 x_4
Left word + right context    x_2 x_4 x_5
Right word + left context    x_1 x_2 x_4
Suffix                       has_suffix(x_3)
Prefix                       has_prefix(x_3)

Given a symmetric similarity function between types, defined below, we link types m_i and m_j (m_i, m_j ∈ V_c) with an edge weight w_{m_i m_j} as follows:

\[
w_{m_i m_j} =
\begin{cases}
\operatorname{sim}(m_i, m_j), & \text{if } m_i \in \kappa(m_j) \text{ or } m_j \in \kappa(m_i),\\
0, & \text{otherwise},
\end{cases}
\tag{1}
\]

where κ(m_j) is the set of k-nearest neighbors of m_j according to the given similarity. For all experiments in this paper, we set k = 5.

To define the similarity function for each trigram type m_i ∈ V_c, we rely on the cooccurrence statistics of the ten features illustrated in Table 1. Based on this monolingual similarity function, a nearest-neighbor graph can be built, in which the edge weight for the k most similar trigram types is set to the PMI value and is 0 for all others. Finally, we write N(m) for the neighborhood of vertex m and set the maximum neighborhood size to 5 in our experiments.
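As an illustration, here is a minimal, brute-force Python sketch of this k-nearest-neighbor graph construction. It assumes similarity is the dot product of PMI-weighted feature vectors built from the cooccurrence counts of the Table 1 features; the data structures and this particular reading of the PMI weighting are our assumptions, not the paper's exact implementation.

```python
import math
from collections import Counter, defaultdict

def build_knn_graph(trigram_feature_counts, k=5):
    """Sketch: link each trigram type to its k most similar types, where
    similarity is a sparse dot product of PMI-weighted feature vectors."""
    # trigram_feature_counts: {trigram type: Counter of feature instantiations}
    total = sum(sum(c.values()) for c in trigram_feature_counts.values())
    feature_totals = Counter()
    for c in trigram_feature_counts.values():
        feature_totals.update(c)

    def pmi_vector(t):
        counts = trigram_feature_counts[t]
        n_t = sum(counts.values())
        return {f: math.log((n / total) /
                            ((n_t / total) * (feature_totals[f] / total)))
                for f, n in counts.items()}

    vectors = {t: pmi_vector(t) for t in trigram_feature_counts}

    def sim(t1, t2):
        v1, v2 = vectors[t1], vectors[t2]
        return sum(w * v2[f] for f, w in v1.items() if f in v2)

    edges = defaultdict(dict)
    types = list(trigram_feature_counts)
    for t1 in types:  # O(n^2) brute force; real systems would use approximate NN
        scored = sorted(((sim(t1, t2), t2) for t2 in types if t2 != t1),
                        reverse=True)
        for s, t2 in scored[:k]:
            # Keep the edge if either endpoint is in the other's k-NN set,
            # mirroring the "or" condition in equation (1).
            edges[t1][t2] = s
            edges[t2][t1] = s
    return edges
```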

3.3. Bilingual Similarity Function. We rely on high-confidence word alignments to define a similarity function between the English and Chinese vertices. Since the graph vertices are extracted from a parallel corpus, a standard word alignment technique, GIZA++ [7], is applied to align the English sentences D_e and their Chinese translations D_c. Based on the idea that the label propagation process in the graph should provide coverage and high recall, only high-confidence alignments (D_{e↔c} > 0.9) are considered.

Therefore, we can extract tuples of the form [t ↔ u] based on these high-confidence word alignments, where t is a Chinese trigram type whose middle word aligns to an English unigram type u. By computing the proportions of these tuple counts, we set the edge weights between the two languages with our bilingual similarity function.
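A minimal sketch of this bilingual edge weighting, assuming the alignments arrive as (Chinese trigram, English word, posterior) triples; the input format is an illustrative assumption.

```python
from collections import Counter

def bilingual_edge_weights(aligned_pairs, min_posterior=0.9):
    """Sketch: weight edges between Chinese trigram types and English word
    types by the relative frequency of their high-confidence alignments."""
    pair_counts = Counter()
    trigram_totals = Counter()
    for trigram, word, posterior in aligned_pairs:
        if posterior > min_posterior:  # keep only high-confidence alignments
            pair_counts[(trigram, word)] += 1
            trigram_totals[trigram] += 1
    # Edge weight = proportion of the trigram's alignments going to this word.
    return {pair: n / trigram_totals[pair[0]]
            for pair, n in pair_counts.items()}
```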

3.4. Graph Initialization. So far, we have introduced how to construct the graph, but the graph is still completely unlabeled. For label propagation, we use a supervised English chunking tagger to label the English side of the parallel corpus in the graph initialization phase. Afterwards, we simply count the individual labels of each English token and then normalize the counts to get the tag distribution over the English unigram types. These distributions are used to initialize the English vertices' distributions in the graph. Since we extract the English vertices from the bitext, all English word-type vertices are assigned an initial label distribution.
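The initialization step amounts to counting and normalizing. A minimal sketch, assuming the tagged English corpus is given as a list of (word, chunk tag) sequences:

```python
from collections import Counter, defaultdict

def initialize_english_vertices(tagged_english_corpus):
    """Sketch: aggregate token-level chunk tags into a normalized tag
    distribution per English word type (the initial vertex distributions)."""
    counts = defaultdict(Counter)
    for sentence in tagged_english_corpus:   # sentence: [(word, chunk_tag), ...]
        for word, tag in sentence:
            counts[word][tag] += 1
    distributions = {}
    for word, tag_counts in counts.items():
        total = sum(tag_counts.values())
        distributions[word] = {tag: n / total for tag, n in tag_counts.items()}
    return distributions
```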

3.5. Graph Example. Figure 3 shows a small excerpt from a Chinese-English graph. We see that only the Chinese trigrams [一项 重要 选择], [因素 是 这样], and [不断 深入] are connected to English translated words. All these English vertices are labeled by a supervised tagger. In this particular case, the neighborhoods can be more diverse, and a soft label distribution is allowed over these vertices. As one can see, Figure 3 is composed of three subgraphs. In each subgraph, it is worth noting that the middle words of the Chinese trigram types have the same chunking type (as the labeled one). This exhibits the fact that the monolingual similarity function guarantees that the connected vertices have the same syntactic category. The label propagation process then spreads the English words' tags to the corresponding Chinese trigram vertices. After that, labels are further propagated among the Chinese vertices. This propagation conveys the tags inwards and results in tag distributions for the middle word of each Chinese trigram type. More details on how to project the chunks and propagate the labels are described in the following section.

3.6. Label Propagation. Label propagation operates on the constructed graph. The primary objective is to propagate labels from a few labeled vertices to the entire graph by optimizing a loss function based on constraints or properties derived from the graph, for example, smoothness [4, 8] or sparsity [9]. State-of-the-art label propagation algorithms include LP-ZGL [4], adsorption [10], MAD [8], and sparsity-inducing penalties [9].

Label propagation is applied in two phases to generate soft label distributions over all the vertices in the bilingual graph. In the first stage, label propagation is used to project the English chunk labels to the Chinese side. In detail, we simply transfer the label distributions from the English word types to the connected Chinese vertices (i.e., V_c^l) at the periphery of the graph. Note that not all the Chinese vertices are connected with English words if we consider only the high-confidence alignments. In this stage, a tag distribution d_i over the labels y is generated, which represents the proportion of times the center word c_i ∈ V_c aligns to English words e_y tagged with label y:

\[
d_i(y) = \frac{\sum_{e_y} [c_i \leftrightarrow e_y]}{\sum_{y'} \sum_{e_{y'}} [c_i \leftrightarrow e_{y'}]}.
\tag{2}
\]

The second phase is conventional label propagation, which propagates the labels from the peripheral vertices V_c^l to all Chinese vertices in the constructed graph. The optimization on the similarity graph is based on the objective function

\[
P(q) = \sum_{c_i \in V_c \setminus V_c^{l}} \Bigl( \sum_{c_j \in N(c_i)} w_{ij} \,\bigl\| q_i - q_j \bigr\|^2 + \lambda \bigl\| q_i - U \bigr\|^2 \Bigr)
\]
\[
\text{s.t. } \sum_{y} q_i(y) = 1 \quad \forall c_i; \qquad q_i(y) \ge 0 \quad \forall c_i, y; \qquad q_i = d_i \quad \forall c_i \in V_c^{l},
\tag{3}
\]

where q_i (i = 1, ..., |V_c|) are the label distributions over all Chinese vertices and λ is a hyperparameter that is discussed in Section 4. Consider that \(\|q_i - q_j\|^2 = \sum_y (q_i(y) - q_j(y))^2\) is a squared loss, which penalizes differences between neighboring vertices, making sure that connected neighbors have similar label distributions. Furthermore, the additional second part of the regularization pushes the label distribution over all possible labels y towards the uniform distribution U. All of this shows that the objective function is convex.

The first term in (3) is a smoothness regularizer which encourages vertices connected by a large w_{ij} to have similar distributions. The second part regularizes all marginals towards the uniform distribution; additionally, this term ensures that the converged marginals of unlabeled vertices will be uniform over all tags if those types have no path to any labeled vertex. This makes it possible for the middle word of such an unlabeled trigram to take on any possible tag.

However, although a closed-form solution can be derived from the objective function above, it would require inverting a matrix of order |V_c|. To avoid this, we rely on an iterative update-based algorithm instead. The update is as follows:

\[
q_i^{(m)}(y) =
\begin{cases}
d_i(y), & \text{if } c_i \in V_c^{l},\\[4pt]
\dfrac{\gamma_i(y)}{\kappa_i}, & \text{otherwise},
\end{cases}
\tag{4}
\]


(The figure shows three subgraphs of Chinese trigram vertices, e.g., [一项 重要 选择], [因素 是 这样], and [不断 深入], each surrounded by unlabeled trigram neighbors and connected to labeled English word vertices such as "important" (NP-I) and "explore" (VP-I).)

Figure 3: An example of a similarity graph over trigrams on labeled and unlabeled data.

where, for all c_i ∈ V_c ∖ V_c^l, γ_i(y) and κ_i are defined as follows:

\[
\gamma_i(y) = \sum_{c_j \in N(c_i)} w_{ij}\, q_j^{(m-1)}(y) + \lambda\, U(y), \qquad
\kappa_i = \lambda + \sum_{c_j \in N(c_i)} w_{ij}.
\tag{5}
\]

This procedure is run for 10 iterations in our experiments.
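Putting equations (4) and (5) together, a minimal sketch of the iterative update follows, with the graph given as an adjacency dictionary and the peripheral vertices held fixed at their stage-one distributions d_i; both data structures are illustrative assumptions.

```python
def propagate_labels(edges, seed_dists, labels, lam=1.0, iterations=10):
    """Sketch of the update in (4)-(5): labeled (peripheral) vertices stay
    fixed at d_i; unlabeled vertices average their neighbors' distributions,
    smoothed toward the uniform distribution U by lambda."""
    U = 1.0 / len(labels)
    # Initialize: seeds keep d_i, all other vertices start uniform.
    q = {v: dict(seed_dists[v]) if v in seed_dists else {y: U for y in labels}
         for v in edges}
    for _ in range(iterations):
        q_prev = {v: dict(dist) for v, dist in q.items()}
        for v, nbrs in edges.items():       # nbrs: {neighbor: weight w_ij}
            if v in seed_dists:
                continue                    # q_i = d_i for labeled vertices
            kappa = lam + sum(nbrs.values())
            for y in labels:
                gamma = sum(w * q_prev[u].get(y, 0.0)
                            for u, w in nbrs.items()) + lam * U
                q[v][y] = gamma / kappa
    return q
```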

4. Unsupervised Chunk Induction

In Section 3, the bilingual graph construction was described such that English chunk tags can be projected to the Chinese side by simple label propagation, relying on the high-confidence word alignments. Many Chinese vertices cannot be projected from the English words directly; to complement this, label propagation is used to further propagate among the Chinese trigram types. This section introduces the use of an unsupervised approach to build a practical shallow parsing system. The implementation of the chunk tag identification system is discussed and presented in detail.

4.1. Data Representation

4.1.1. Universal Chunk Tags. The universal tag set was first proposed by [11]; it consists of twelve universal part-of-speech categories and was used to evaluate their cross-lingual POS projection system on six different languages. In this work, we follow that idea but focus on universal chunk tags between English and Chinese, for several reasons. First, they are useful for building and evaluating unsupervised and cross-lingual chunk taggers. Additionally, taggers constructed on a universal tag set allow a more reasonable comparison across different supervised chunking approaches: since different tagging standards are applied for different languages, it is vacuous to state that "shallow parsing for language A is much harder than that for language B" when the tag sets are incomparable. Finally, it also permits our model to train the chunk taggers with a common tag set across multiple languages, without needing to maintain


language-specific unification rules arising from differences in Treebank annotation guidelines.

The following compares the definitions of the tags used in the English Wall Street Journal (WSJ) corpus [2] and the Chinese Penn Treebank (CTB) 7.0 [12], both of which are used in our experiments.

(i) The Phrasal Tags of Chinese. In the CTB7 corpus, each chunk is labeled with one syntactic category. A tag set of 13 categories is used to represent shallow parsing structural information, as follows:

ADJP: adjective phrase; phrase headed by an adjective.
ADVP: adverbial phrase; phrasal category headed by an adverb.
CLP: classifier phrase; for example, (QP (CD 一) (CLP (M 系列))) (a series).
DP: determiner phrase; in the CTB7 corpus, a DP is a modifier of an NP if it occurs within an NP. For example, (NP (DP (DT 任何)) (NP (NN 人))) (any people).
DNP: phrase formed by XP plus 的 (DEG, 's) that modifies an NP; the XP can be an ADJP, DP, QP, NP, PP, or LCP.
DVP: phrase formed by XP plus 地 (-ly) that modifies a VP.
FRAG: fragment; used to mark fragmented elements that cannot be built into a full structure by using null categories.
LCP: used to mark phrases formed by a localizer and its complement; for example, (LCP (NP (NR 皖南) (NN 事变)) (LC 中)) (in the Southern Anhui Incident).
LST: list marker; numbered lists, letters, or numbers which identify items in a list, together with their surrounding punctuation, are labeled as LST.
NP: noun phrase; phrasal category that includes all constituents depending on a head noun.
PP: preposition phrase; phrasal category headed by a preposition.
QP: quantifier phrase; used within NP; for example, (QP (CD 五百) (CLP (M 辆))) (500).
VP: verb phrase; phrasal category headed by a verb.

(ii) The Phrasal Tags of English. In the English WSJ corpus, there are just eight categories: NP (noun phrase), PP (prepositional phrase), VP (verb phrase), ADVP (adverb phrase), ADJP (adjective phrase), SBAR (subordinating conjunction), PRT (particle), and INTJ (interjection).

As we can see, the criteria of the Chinese CTB7 differ significantly from those of the English WSJ corpus. Unlike in the English Treebank, Chinese constituents are normally smaller fragments: a chunk which is aligned to a base-NP phrase on the English side can be divided into a QP tag, a CLP tag, and an NP tag.

Table 2: The description of universal chunk tags.

Tag     Description                 Words                     Example
NP      Noun phrase                 DET + ADV + ADJ + NOUN    The strange birds
PP      Preposition phrase          TO + IN                   In between
VP      Verb phrase                 ADV + VB                  Was looking
ADVP    Adverb phrase               ADV                       Also
ADJP    Adjective phrase            CONJ + ADV + ADJ          Warm and cozy
SBAR    Subordinating conjunction   IN                        Whether or not

For example, after chunking, "五百辆车" (500 cars) is tagged as (NP (QP (CD 五百) (CLP (M 辆))) (NN 车)), but on the English side, "500 cars" is denoted with just one NP tag. This nonuniform standard results in mismatched projections during the alignment phase and in difficulties of evaluation in the cross-lingual setting. To fix these problems, mappings are applied in the universal chunking tag set. The categories CLP, DP, and QP are merged into the NP phrase; that is to say, a phrase such as (NP (QP (CD 五百) (CLP (M 辆))) (NN 车)), which corresponds to the English NP chunk "500 cars", is assigned an NP tag in the universal chunk tag set. Additionally, the phrases DNP and DVP are folded into the ADJP and ADVP tags, respectively.

On the English side, the occurrence of INTJ is 0 according to our statistics, so we can ignore the INTJ chunk tag. Additionally, words belonging to a PRT tag are always regarded as a VP in Chinese. For example, (PRT (RP up)) (NP (DET the) (NNS stairs)) is aligned to "上楼梯" in Chinese, where "上" is a VP.
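A minimal sketch of the tag mappings just described; only the merges named in the text are encoded, and the identity entries and the dropping of INTJ are our reading of it.

```python
# Sketch of the fine-grained -> universal chunk tag mapping described above.
CTB_TO_UNIVERSAL = {
    "NP": "NP", "CLP": "NP", "DP": "NP", "QP": "NP",  # CLP/DP/QP merged into NP
    "DNP": "ADJP", "ADJP": "ADJP",                    # DNP folded into ADJP
    "DVP": "ADVP", "ADVP": "ADVP",                    # DVP folded into ADVP
    "PP": "PP", "VP": "VP",
}
WSJ_TO_UNIVERSAL = {
    "NP": "NP", "PP": "PP", "VP": "VP", "ADVP": "ADVP",
    "ADJP": "ADJP", "SBAR": "SBAR",
    "PRT": "VP",   # PRT behaves like a VP on the Chinese side
    # INTJ is absent: its observed frequency is 0 in this setting.
}

def to_universal(tag, table):
    """Map a Treebank chunk tag to the universal tag set (None = drop)."""
    return table.get(tag)
```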

4.1.2. IBO2 Chunk Tags. There are many ways to encode the phrase structure of chunk tags, such as IBO1, IBO2, IOE1, and IOE2 [13]. The one used in this study is the IBO2 encoding. This format ensures that all initial words of different base phrases receive a B tag, which makes it possible to identify the boundary of each chunk. The boundary types are defined as follows:

(i) B-X represents the beginning of a chunk X;
(ii) I-X indicates a noninitial word in a phrase X;
(iii) O marks any word that is outside the boundary of any chunk.

Hence, an input Chinese sentence can be represented using these notations as follows:

去年 (NP-B) 实现 (VP-B) 进出口 (NP-B) 总值 (NP-I) 达 (VP-B) 一千零九十八点二亿 (NP-B) 美元 (NP-I) 。(O)
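A minimal sketch of IBO2 encoding from chunk spans, following the label-first tag order (e.g., NP-B) used in the example above; the span representation is an illustrative assumption.

```python
def spans_to_ibo2(words, chunks):
    """Sketch: convert labeled chunk spans to IBO2 tags.
    chunks: list of (start, end, label) spans over `words`, end exclusive."""
    tags = ["O"] * len(words)                # words outside any chunk stay O
    for start, end, label in chunks:
        tags[start] = f"{label}-B"           # chunk-initial word gets B
        for i in range(start + 1, end):
            tags[i] = f"{label}-I"           # remaining words get I
    return list(zip(words, tags))

# The example sentence from the text:
words = ["去年", "实现", "进出口", "总值", "达", "一千零九十八点二亿", "美元", "。"]
chunks = [(0, 1, "NP"), (1, 2, "VP"), (2, 4, "NP"), (4, 5, "VP"), (5, 7, "NP")]
print(spans_to_ibo2(words, chunks))
```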

Based on the above discussion, a tag set that consists of six universal chunk categories is proposed, as shown in Table 2. While there is a lot of controversy about universal tags and what the exact tag set should be, these six


chunk categories are sufficient to cover the most frequent chunk tags that exist in English and Chinese. Furthermore, a mapping from the fine-grained chunk tags of the two languages to the universal chunk set has been developed. Hence, a dataset with a common chunk tag set for English and Chinese is obtained by combining the original Treebank data with the universal chunk tag set and the mapping between the two languages.

4.2. Chunk Induction. The label distribution of a Chinese word type x can be computed by marginalizing the chunk tag distribution of the trigram types c_i = x_{-1} x_0 x_{+1} over the left and right contexts as follows:

\[
p(y \mid x) = \frac{\sum_{x_{-1}, x_{+1}} q_i(y)}{\sum_{x_{-1}, x_{+1}} \sum_{y'} q_i(y')}.
\tag{6}
\]

Then a set of possible tags t_x(y) is extracted by eliminating labels whose probability is below a threshold value τ:

\[
t_x(y) =
\begin{cases}
1, & \text{if } p(y \mid x) \ge \tau,\\
0, & \text{otherwise}.
\end{cases}
\tag{7}
\]

How to choose τ is described in Section 5. The vector t_x covers every word in the Chinese trigram types and is employed as features in the unsupervised Chinese shallow parsing.
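Equations (6) and (7) reduce to a marginalization followed by a threshold. A minimal sketch, assuming the propagated trigram distributions are keyed by (left, center, right) tuples, with τ = 0.2 as reported in Section 5:

```python
from collections import defaultdict

def word_tag_constraints(trigram_dists, labels, tau=0.2):
    """Sketch of (6)-(7): marginalize trigram-type label distributions q_i
    over contexts to get p(y|x), then threshold at tau to build t_x."""
    sums = defaultdict(lambda: defaultdict(float))
    for (left, center, right), q_i in trigram_dists.items():
        for y, p in q_i.items():
            sums[center][y] += p                 # numerator of (6), per center word
    constraints = {}
    for x, label_sums in sums.items():
        total = sum(label_sums.values())         # denominator of (6)
        constraints[x] = {y: int(label_sums.get(y, 0.0) / total >= tau)
                          for y in labels}
    return constraints
```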

Similar to the work of [14, 15], our chunk induction model is built on the feature-based hidden Markov model (HMM). A chunking HMM generates a sequence of words in order. In each generation step, conditioned on the current chunk tag z_i, an observed word x_i and a hidden successor chunk tag z_{i+1} are generated independently. There are two types of conditional distributions in the model, emission and transition probabilities, which are both multinomial probability distributions. Given a sentence x and a state sequence z, a first-order Markov model defines the joint distribution

\[
P_\theta(X = x, Z = z) = P_\theta(Z_1 = z_1) \prod_{i=1}^{|x|} P_\theta(Z_{i+1} = z_{i+1} \mid Z_i = z_i)\; P_\theta(X_i = x_i \mid Z_i = z_i),
\tag{8}
\]

where the second factor is the transition probability and the third is the emission probability. Different from the conventional Markov model, the feature-based model replaces the emission part with a log-linear model, such that

\[
P_\theta(X_i = x \mid Z_i = z) = \frac{\exp\bigl(\theta^{\mathsf{T}} f(x, z)\bigr)}{\sum_{x' \in \mathrm{Val}(X)} \exp\bigl(\theta^{\mathsf{T}} f(x', z)\bigr)},
\tag{9}
\]

where Val(X) corresponds to the entire Chinese vocabulary. The advantage of using this locally normalized log-linear model is the ability to look at various aspects of the observation x, incorporating features of the observations.

Table 3: Feature templates used in unsupervised chunking.

Basic: 1(x = ·, z = ·)
Contains digit: checks whether x contains a digit, conjoined with z: 1(contains_digit(x) = ·, z = ·)
Contains hyphen: 1(contains_hyphen(x) = ·, z = ·)
Suffix: indicator features for character suffixes of up to length 1 present in x
Prefix: indicator features for character prefixes of up to length 1 present in x
POS tag: indicator feature for the word's POS assigned to x

The model is then trained with the following objective function:

\[
\mathcal{L}(\theta) = \sum_{i=1}^{N} \log \sum_{z} P_\theta\bigl(X = x^{(i)}, Z = z\bigr) - C\,\|\theta\|^2.
\tag{10}
\]

It is worth noting that this function marginalizes out all possible configurations z for each sentence x^{(i)}, which leads to a nonconvex objective. We optimize this objective function using L-BFGS, a quasi-Newton method proposed by Liu and Nocedal [16]. Evidence from past experiments shows that this direct gradient algorithm performs better than a feature-enhanced modification of EM (expectation-maximization).

Moreover, this state-of-the-art model has the additional advantage of making it easy to experiment with various ways of incorporating the constraint features into the log-linear model. The feature function f_t consists of the relevant information extracted from the smoothed graph and eliminates the hidden states which are inconsistent with the threshold vector t_x.

4.3. Unsupervised Chunking Features. The feature selection process in the unsupervised chunking model affects the identification result. Feature selection is the task of identifying an optimal feature subset that best represents the underlying classification problem, by filtering out irrelevant and redundant features. Unlike the CoNLL-2000 shared supervised chunking task, which uses the WSJ (Wall Street Journal) corpus as background knowledge to construct the chunking model, our corpus is composed of only words and does not contain any chunk information, which requires us to form an optimal feature subset for the underlying chunk pattern. We use coarse features on emission contexts, extracted from words with certain orthographic properties; only the basic feature is used for transitions. For any emission context with word x and tag z, we construct the feature templates shown in Table 3.

Box 1 illustrates an example of the feature subset. The feature set of each word in a sentence is a vector of 6 dimensions, holding the values of the corresponding features, and the label indicates which chunk label the word should belong to.


Sentence: 中国/NR/NP-B 建筑业/NN/NP-I 对/P/PP-B 外/NN/NP-B 开放/VV/VP-B 呈现/VV/VP-B 新/JJ/NP-B 格局/NN/NP-I
Word: 建筑业
Instance: (建筑业, NP-I), (N, NP-I), (N, NP-I), (建), (业), NN

Box 1: Example of the feature template.

5. Experiments and Results

Before presenting the results of our experiments, we first describe the evaluation metrics, the datasets, and the baselines used for comparison.

5.1. Evaluation Metrics. The metrics for evaluating the NP chunking model are the precision rate, the recall rate, and their harmonic mean, the F1-score. The tagging accuracy is defined as

\[
\text{Tagging Accuracy} = \frac{\text{number of correctly tagged words}}{\text{number of words}}.
\tag{11}
\]

Precision measures the percentage of labeled NP chunks that are correct; a correctly tagged word must have both the correct NP chunk boundary and the correct label. Precision is therefore defined as

\[
\text{Precision} = \frac{\text{number of correctly proposed tagged words}}{\text{number of proposed chunk tags}}.
\tag{12}
\]

Recall measures the percentage of NP chunks present in the input sentence that are correctly identified by the system:

\[
\text{Recall} = \frac{\text{number of correctly proposed tagged words}}{\text{number of correct chunk tags}}.
\tag{13}
\]

The F-measure combines the previous two measures into one metric. The F-score is defined as

\[
F_\beta = \frac{(\beta^2 + 1) \times \text{Recall} \times \text{Precision}}{\beta^2 \times \text{Recall} + \text{Precision}},
\tag{14}
\]

where β is a parameter that weights the importance of recall and precision; when β = 1, precision and recall are equally important.
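A minimal sketch of these chunk-level metrics, assuming gold and predicted chunks are given as (start, end, label) spans; the denominators follow our reading of (12) and (13), and the F formula follows (14).

```python
def chunk_prf(gold_chunks, predicted_chunks, beta=1.0):
    """Sketch of (12)-(14): precision/recall/F over chunk spans.
    A prediction counts as correct only when both boundaries and label match."""
    gold, pred = set(gold_chunks), set(predicted_chunks)
    correct = len(gold & pred)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f = (beta**2 + 1) * recall * precision / (beta**2 * recall + precision)
    return precision, recall, f
```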

5.2. Dataset. To facilitate the bilingual graph construction and the unsupervised chunking experiments, two kinds of datasets are utilized in this work: (1) a monolingual Treebank for Chinese chunk induction and (2) a large parallel corpus of English and Chinese for bilingual graph construction and label propagation.

Table 4: (a) English-Chinese parallel corpus. (b) Graph instance. (c) Chinese unlabeled dataset (CTB7 corpora).

(a)
Number of sentence pairs    Number of seeds    Number of words
10000                       27940              31678

(b)
Number of sentences    Number of vertices
17617                  185441

(c)
Dataset             Source            Number of sentences
Training dataset    Xinhua 1–321      7617
Testing dataset     Xinhua 363–403    912

For the monolingual Treebank data, we rely on the Penn Chinese Treebank 7.0 (CTB7) [12]. It consists of over one million words of annotated and parsed text from Chinese newswire, magazines, various broadcast news and broadcast conversation programs, web newsgroups, and weblogs. The parallel data come from the UM corpus [17], which contains 100,000 pairs of English-Chinese sentences. The training and testing sets are defined for the experiment, and their corresponding statistics are shown in Table 4. In addition, Table 4 also shows the detailed information of the data used for bilingual graph construction, including the parallel corpus for the tag projection and the label propagation of Chinese trigrams.

5.3. Chunk Tagset and HMM States. As described, the universal chunk tag set is employed in our experiments. This set U consists of the following six coarse-grained tags: NP (noun phrase), PP (prepositional phrase), VP (verb phrase), ADVP (adverb phrase), ADJP (adjective phrase), and SBAR (subordinating conjunction). Although there might be some controversy about the exact definition of such a tag set, these 6 categories cover the most frequent chunk tags that exist in one form or another in both the English and Chinese corpora.

For each language under consideration, a mapping from the fine-grained language-specific chunk tags in the Treebank to the universal chunk tags is provided. Hence, the unsupervised chunking model is trained on datasets labeled with the universal chunk tags.


Table 5: Chunk tagging evaluation results for various baselines and the proposed graph-based model.

Tag     Feature-HMM           Projection            Graph-based
        Pre   Rec   F1        Pre   Rec   F1        Pre   Rec   F1
NP      0.60  0.67  0.63      0.67  0.72  0.68      0.81  0.82  0.81
VP      0.56  0.51  0.53      0.61  0.57  0.59      0.78  0.74  0.76
PP      0.36  0.28  0.32      0.44  0.31  0.36      0.60  0.51  0.55
ADVP    0.40  0.46  0.43      0.45  0.52  0.48      0.64  0.68  0.66
ADJP    0.47  0.53  0.50      0.48  0.58  0.51      0.67  0.71  0.69
SBAR    0.00  0.00  0.00      0.50  1.00  0.66      0.50  1.00  0.66
All     0.49  0.51  0.50      0.57  0.62  0.59      0.73  0.75  0.74

In this work, the number of latent HMM states for Chinese is fixed to a constant, namely, the number of universal chunk tags.

5.4. Various Models. To comprehensively probe our approach with a thorough analysis, we evaluated two baselines in addition to our graph-based algorithm. We were intentionally lenient with the two baselines.

(i) Feature-HMM. The first baseline is the vanilla feature-HMM proposed by [9], which applies L-BFGS to estimate the model parameters and uses a greedy many-to-1 mapping for evaluation.

(ii) Projection. The direct projection serves as a second baseline. It incorporates the bilingual information by projecting chunk tags from the English side to the aligned Chinese texts. Several rules were added to fix the unaligned problem. We trained a supervised feature-HMM with the same number of sentences from the bitext as in the Treebank.

(iii) Our Graph-Based Model. The full model uses both stages of label propagation (Section 3.6) before extracting the constraint features. As a result, the label distributions of all Chinese word types are added to the constraint features.

5.5. Experimental Setup. The experimental platform is implemented on top of two toolkits: Junto [18] and Ark [14, 15]. Junto is a Java-based label propagation toolkit used for the bilingual graph construction; Ark is a Java-based package used for the feature-HMM training and testing.

In the feature-HMM training, a few hyperparameters need to be set to minimize the number of free parameters in the model. C = 1.0 was used as the regularization constant in (10), and L-BFGS was trained for 1000 iterations. Several threshold values of τ for extracting the vector t_x were tried, and it was found that 0.2 works best; hence τ = 0.2 is used for the Chinese trigram types.

5.6. Results. Table 5 shows the complete set of results. As expected, the projection baseline is better than the feature-HMM, being able to benefit from the bilingual information; it improves upon the monolingual baseline by 12% in F1-score. Comparing among the unsupervised approaches, the feature-HMM achieves only 50% F1-score on the universal chunk tags. Overall, the graph-based model outperforms both baselines; that is to say, the improvement of the feature-HMM with the graph-based setting is statistically significant with respect to the other models. The graph-based model performs 24% better than the state-of-the-art feature-HMM and 12% better than the direct projection model.

6. Conclusion

This paper presented an unsupervised graph-based approach to Chinese chunking that uses label propagation to project chunk information across languages. Our results suggest that it is possible to learn accurate chunk tags for unlabeled text from a bitext, that is, data with parallel English text. In the future, we plan to extend this graph-based unsupervised approach to chunking for other languages that lack ready-to-use resources.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

The authors are grateful to the Science and Technology Development Fund of Macau and the Research Committee of the University of Macau (Grant nos. MYRG076(Y1-L2)-FST13-WF and MYRG070(Y1-L2)-FST12-CS) for the funding support for their research.

References

[1] R. Koeling, "Chunking with maximum entropy models," in Proceedings of the 2nd Workshop on Learning Language in Logic and the 4th Conference on Computational Natural Language Learning, vol. 7, pp. 139–141, 2000.

[2] M. P. Marcus, M. A. Marcinkiewicz, and B. Santorini, "Building a large annotated corpus of English: the Penn Treebank," Computational Linguistics, vol. 19, no. 2, pp. 313–330, 1993.

[3] D. Yarowsky and G. Ngai, "Inducing multilingual POS taggers and NP bracketers via robust projection across aligned corpora," in Proceedings of the 2nd Meeting of the North American Chapter of the Association for Computational Linguistics (NAACL '01), pp. 200–207, 2001.

[4] X. Zhu, Z. Ghahramani, and J. Lafferty, "Semi-supervised learning using Gaussian fields and harmonic functions," in Proceedings of the 20th International Conference on Machine Learning (ICML '03), vol. 3, pp. 912–919, August 2003.

[5] Y. Altun, M. Belkin, and D. A. McAllester, "Maximum margin semi-supervised learning for structured variables," in Advances in Neural Information Processing Systems, pp. 33–40, 2005.

[6] A. Subramanya, S. Petrov, and F. Pereira, "Efficient graph-based semi-supervised learning of structured tagging models," in Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 167–176, October 2010.

[7] F. J. Och and H. Ney, "A systematic comparison of various statistical alignment models," Computational Linguistics, vol. 29, no. 1, pp. 19–51, 2003.

[8] A. Subramanya and J. Bilmes, "Semi-supervised learning with measure propagation," The Journal of Machine Learning Research, vol. 12, pp. 3311–3370, 2011.

[9] D. Das and N. A. Smith, "Semi-supervised frame-semantic parsing for unknown predicates," in Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics, pp. 1435–1444, June 2011.

[10] S. Baluja, R. Seth, D. Sivakumar, et al., "Video suggestion and discovery for YouTube: taking random walks through the view graph," in Proceedings of the 17th International Conference on World Wide Web, pp. 895–904, April 2008.

[11] S. Petrov, D. Das, and R. McDonald, "A universal part-of-speech tagset," 2011, http://arxiv.org/abs/1104.2086.

[12] N. Xue, F. Xia, F.-D. Chiou, and M. Palmer, "The Penn Chinese TreeBank: phrase structure annotation of a large corpus," Natural Language Engineering, vol. 11, no. 2, pp. 207–238, 2005.

[13] E. F. Tjong Kim Sang and H. Déjean, "Introduction to the CoNLL-2001 shared task: clause identification," in Proceedings of the Workshop on Computational Natural Language Learning, vol. 7, p. 8, 2001.

[14] T. Berg-Kirkpatrick, A. Bouchard-Côté, J. DeNero, and D. Klein, "Painless unsupervised learning with features," in Proceedings of the Human Language Technologies Conference of the North American Chapter of the Association for Computational Linguistics, pp. 582–590, Los Angeles, Calif, USA, June 2010.

[15] A. B. Goldberg and X. Zhu, "Seeing stars when there aren't many stars: graph-based semi-supervised learning for sentiment categorization," in Proceedings of the 1st Workshop on Graph-Based Methods for Natural Language Processing, pp. 45–52, 2006.

[16] D. C. Liu and J. Nocedal, "On the limited memory BFGS method for large scale optimization," Mathematical Programming B, vol. 45, no. 3, pp. 503–528, 1989.

[17] L. Tian, F. Wong, and S. Chao, "An improvement of translation quality with adding key-words in parallel corpus," in Proceedings of the International Conference on Machine Learning and Cybernetics (ICMLC '10), vol. 3, pp. 1273–1278, July 2010.

[18] P. P. Talukdar and F. Pereira, "Experiments in graph-based semi-supervised learning methods for class-instance acquisition," in Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics (ACL '10), pp. 1473–1481, July 2010.

Submit your manuscripts athttpwwwhindawicom

Computer Games Technology

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Distributed Sensor Networks

International Journal of

Advances in

FuzzySystems

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014

International Journal of

ReconfigurableComputing

Hindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Applied Computational Intelligence and Soft Computing

thinspAdvancesthinspinthinsp

Artificial Intelligence

HindawithinspPublishingthinspCorporationhttpwwwhindawicom Volumethinsp2014

Advances inSoftware EngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Journal of

Computer Networks and Communications

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation

httpwwwhindawicom Volume 2014

Advances in

Multimedia

International Journal of

Biomedical Imaging

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

ArtificialNeural Systems

Advances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Computational Intelligence and Neuroscience

Industrial EngineeringJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Human-ComputerInteraction

Advances in

Computer EngineeringAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Page 3: Research Article Unsupervised Chunking Based on Graph ...downloads.hindawi.com/journals/tswj/2014/401943.pdf · Unsupervised Chunking Based on Graph Propagation from Bilingual Corpus

The Scientific World Journal 3

Bilingual Chunking InductionRequire Parallel English and foreign language data119863

119890 and119863119888 unlabeled foreign training data Γ

119888 English taggerEnsure 120579119888 a set of parameters learned using a constrainedunsupervised model(1)119863119890harr119888 larr word-align-bitext (119863119890119863119888)(2)119863119890 larr chunk-tag-supervised (119863119890)(3) 119860 (119863119890harr119888119863119890)(4) 119866 larr construct-graph (Γ119888119863119888 119860)(5) 119866 larr graph-propagate (119866)(6) Δ larr extract-word-constraints (119866)(7) 120579119891 larr chunk-induce-constraints (Γ119888 Δ)(8) return 120579

119891

Algorithm 1 Graph-based unsupervised chunking approach

shallow parsing is nontrivial for the following two reasonsFirst it is necessary for resolving ambiguous problem byemploying individual words as the vertices instead of thecontext Besides if we use the whole sentences as the verticesduring graph construction it is unclear how to calculatethe sequence similarity Altun et al [5] proposed a graph-based semisupervised learning approach by using the sim-ilarity between labeled and unlabeled structured data in adiscriminative framework Recently a graph over the cliquesin an underlying structured model was defined by [6] Intheir work a new idea has been proposed that one can usea graph over trigram types combing with the weights basedon distribution similarity to improve the supervised learningalgorithms

31 Graph Vertices In this work we extend the intuitions of[6] to set up the bilingual graph Two different types of ver-tices are applied for each language because the informationtransformation in our graph is asymmetric (from English toChinese) On the Chinese side the vertices (denoted by 119863

119888)correspond to trigram types which are the same as in [6]The English vertices (denoted by 119863

119890) however correspondto word types The reason we do not need the trigram typesin English side to disambiguate them is that each Englishword are going to be labeled Additionally in the labelpropagation or graph construction we do not connect thevertices between any English words but only to the Chinesevertices

Furthermore there are two kinds of vertices consisting ofones extracted from the different sides of word aligned bitext(119863119890 119863119888) and an additional unlabeled Chinese monolingualcorpus Γ

119888 whichwill be used later in the training phase Suchas noted two different similarities will be employed to definethe edge weights between vertices from these two languagesand among the Chinese vertices

32 Monolingual Similarity Function In this section a briefintroduction is given as follows in terms of computingthe monolingual similarity between the connecting pairs ofChinese trigram type Specifically the set 119881119888 consists of all

Table 1 Various features for computing edge weights betweenChinese trigram types

Description FeatureTrigram + context 119909

11199092119909311990941199095

Trigram 119909211990931199094

Left context 11990911199092

Right context 11990941199095

Center word 1199093

Trigram minus center word 11990921199094

Left word + right context 119909211990941199095

Right word + left context 119909111990921199094

Suffix Has suffix (1199093)

Prefix Has prefix (1199093)

the word 119899-grams that occur in the text Given a symmetricsimilarity function between types defined below we linktypes 119898

119894and 119898

119895(119898119894 119898119895

isin 119881119888) with an edge weight 119908

119898119894119898119895

as follows

119908119898119894119898119895

= sim (119898

119894 119898119895) if 119898

119894isin 120581 (119898

119895) or 119898

119895isin 120581 (119898

119894)

0 otherwise(1)

where 120581(119898119895) is the set of 119896-nearest neighbors of119898

119894according

to the given similarity For all the parameters in this paper wedefine 119896 = 5

To define the similarity function for each trigram type as119898119894isin 119881119888 and we rely on the cooccurrence statistics of ten

features illustrated in Table 1Based on this monolingual similarity function a nearest

neighbor graph could be achieved In this graph the edgeweight for the 119899 most similar trigram types is set to the PMIvalues and is 0 for all other ones Finally we apply the function(119898) to denote the neighborhood of vertex 119898 and set themaximum number of 5 in our experiments

33 Bilingual Similarity Function We rely on high-confidence word alignments to define a similarity function

4 The Scientific World Journal

between the English and Chinese vertices Since the graphvertices are extracted from a parallel corpus a standardword alignment technique GIZA++ [7] is applied to alignthe English sentences 119863

119890 and their Chinese translations 119863119888

Based on the idea that the label propagation process in thegraph will provide coverage and high recall only a highconfidence alignment119863119890harr119888 gt 09 is considered

Therefore we can extract tuples of the form [119905 harr 119906] basedon these high-confidence word alignments where the Chi-nese trigram types that the middle word aligns to an Englishunigram type 119906 Relying on computing the proportion ofthese tuple counts we set the edge weights between the twolanguages by our bilingual similarity function

34 Graph Initialization So far we have introduced how toconstruct a graph But the graph here is unlabeled completelyFor the sake of label propagation we have to use a supervisedEnglish chunking tagger to label the English side of theparallel corpus in the graph initialization phase Afterwardswe simply count the individual labels of each English tokenand then normalize the counts to get the tag distribution overthe English unigram types These distributions are in the useof initializing the English verticesrsquo distribution in the graphConsidering that we extract the English vertices from a bitextall vertices in English word types are assigned with an initiallabel distribution

35 Graph Example Figure 3 shows a simple small excerptfrom a Chinese-English graph We see that only the Chinesetrigrams [一项重要选择] [因素是这样] and [不断深入]are connected to the English translated words All theseEnglish vertices are labeled by a supervised tagger In thisparticular case the neighborhoods can be more diverse anda soft label distribution is allowed over these vertices As onecan see Figure 3 is composed of three subgraphs In each sub-graph it is worth noting that the middle of Chinese trigramtypes has the same chunking types (with the labeled one)This exhibits the fact that themonolingual similarity functionguarantees the connected vertices having the same syntacticcategory The label propagation process then spreads theEnglish wordsrsquo tags to the corresponding Chinese trigramvertices After that labels are further propagated among theChinese vertices This kind of propagation is used to conveythese tags inwards and results in tag distributions for themiddle word for each Chinese trigram type More details onhow to project the chunks and propagate the labels will bedescribed in the following section

36 Label Propagation Label propagation operates on theconstructed graph The primary objective is to propagatelabels from a few labeled vertices to the entire graph byoptimizing a loss function based on the constraints or prop-erties derived from the graph for example smoothness [4 8]or sparsity [9] State-of-the-art label propagation algorithmsinclude LP-ZGL [4] adsorption [10] MAD [8] and sparsity-inducing penalties [9]

Label propagation is applied in two phases to generate softlabel distributions over all the vertices in the bilingual graph

In the first stage the label propagation is used to project theEnglish chunk labels to the Chinese side In detail we simplytransfer the label distributions from the English word typesto the connected Chinese vertices (ie 119881119897

119888) at the periphery

of the graph Note that not all the Chinese vertices are fullyconnected with English words if we consider only the high-confidence alignments In this stage a tag distribution 119889

119894over

the labels 119910 is generated which represents the proportion oftimes the center word 119888

119894isin 119881119888aligns to English words 119890

119910

tagged with label 119910

119889119894(119910) =

sum119890119910

[119888119894larrrarr 119890

119910]

sum1199101015840 sum1198901199101015840 [119888119894larrrarr 119890

1199101015840]

(2)

The second phase is the conventional label propagation topropagate the labels from the peripheral vertices 119881

119897

119888to all

Chinese vertices in the constructed graph The optimizationon the similarity graph is based on the objective function

119875 (119902) = sum

119888119894isin119881119888119862119897

119888

119908119894119895

10038171003817100381710038171003817119902119894minus 119902119895

10038171003817100381710038171003817

2

+ 1205821003817100381710038171003817119902119894 minus 119880

1003817100381710038171003817

2

st sum

119910

119902119894(119910) = 1 forall119888

119894

119902119894(119910) ge 0 forall119888

119894 119910

119902119894= 119889119894

forall119888119894isin 119881119897

119888

(3)

where 119902119894(119894 = 1 |119881

119888|) are the label distributions over

all Chinese vertices and 120582 is the hyperparameter thatwill be discussed in Section 4 Consider that 119902

119894minus 1199021198952

=

sum119910(119902119894(119910) minus 119902

119895(119910))2 is a squared loss which is used to penalize

the neighbor vertices to make sure that the connectedneighbors have different label distributions Furthermore theadditional second part of the regulation makes it possiblethat the label distribution over all possible labels 119910 is towardsthe uniform distribution 119880 All these show the fact that thisobjective function is convex

As we know the first term in (3) is a smoothness regu-larizer which can be used to encourage the similar verticeswhich have the large 119908

119894119895in between to be much more

similar Moreover the second part is applied to regularizeand encourage all marginal types to be uniform Additionallythis term also ensures that the unlabeled converged marginalvertices will be uniform over all tags if these types do not havea path to any labeled vertices This part makes it possible thatthe middle word of this kind unlabeled trigram takes on anypossible tag as well

However although a closed form solution can be derivedthrough the objective function mentioned above it would beimpossible without the inversion of the matrix of order |119881

119888|

To solve this problem we rely on an iterative update basedalgorithm instead The formula is as follows

119902(119898)

119894(119910) =

119888119894(119910) if 119888

119894isin 119881119897

119888

120574119894(119910)

120581119894

otherwise (4)

The Scientific World Journal 5

[Yixiang zhongyao xuanze]

[Yixiang cuowu xuanze]

[Yixiang mingzhi xuanze]

[Yixiang zhengque xuanze]

NP-INP-IImportantImportant

[Ta shi yi]

[Wo shi yi]

[Yinsu shi zheyang]

[Zhi shi zheyang] [Women shi nansheng]

[Buduan tigao]

[Buduan tansuo]

[Buduan shenru]

[Buduan bianhua]

ExploreVP-I

Figure 3 An example of similarity graph over trigram on labeled and unlabeled data

where forall119888119894isin 119881119888 119881119897

119888 120574119894(119910) and 119896

119894are defined as follows

120574119894(119910) = sum

119888119895isin119873(119888119894)

119908119894119895119902(119898minus1)

119894(119910) + 120582119880 (119910)

119896119894= 120582 + sum

119888119895isin119873(119888119894)

119908119894119895

(5)

This procedure will be processed 10 iterations in our experi-ment

4 Unsupervised Chunk Induction

In Section 3 the bilingual graph construction is describedsuch that English chunk tags can be projected to Chineseside by simple label propagation This relies on the high-confidence word alignments Many Chinese vertices couldnot be fully projected from the English words To comple-ment label propagation is used to further propagate amongthe Chinese trigram types This section introduces the usageof unsupervised approach to build a practical system inthe task of shallow parsing The implementation of how to

establish the chunk tag identification systemwill be discussedand presented in detail

41 Data Representation

411 Universal Chunk Tag The universal tag was first pro-posed by [11] that consists of twelve universal part-of-speechcategories for the sake of evaluating their cross-lingual POSprojection system for six different languages In this work wefollow the idea but focus on the universal chunk tags betweenEnglish and Chinese for several reasons First it is usefulfor building and evaluating unsupervised and cross-lingualchunk taggers Additionally taggers constructed based onuniversal tag set can result in a more reasonable comparisonacross different supervised chunking approaches Since twokinds of tagging standards are applied for the differentlanguages it is vacuous to state that ldquoshallow parsing forlanguage119860 ismuch harder than that for language119861rdquowhen thetag sets are incomparable Finally it also permits our modelto train the chunk taggers with a common tag set acrossmultiple languages which does not need tomaintain language

6 The Scientific World Journal

specific unification rules due to the differences in Treebankannotation guidelines

The following compares the definition of tags used in theEnglish Wall Street Journal (WSJ) [2] and the Chinese PennTreebank (CTB) 70 [12] that are used in our experiment

(i) The Phrasal Tag of Chinese In the present corpus CTB7each chunk is labeled with one syntactic category A tag set of13 categories is used to represent shallow parsing structuralinformation as follows

ADJPmdashadjective phrase phrase headed by an adjec-tiveADVPmdashadverbial phrase phrasal category headed byan adverbCLPmdashclassifiers phrase for example (QP (CD 一)(CLP (M系列))) (a series)DPmdashdeterminer phrase in CTB7 corpus a DP isa modifier of the NP if it occurs within a NP Forexample (NP (DP (DT 任何) (NP (NN 人))) (anypeople)DNPmdashphrase formed by XP plus (DEG的) (lsquos) thatmodifies a NP the XP can be an ADJP DP QP NPPP or LCPDVPmdashphrase formed by XP plus ldquo地 (-ly)rdquo thatmodifies a VPFRAGmdashfragment used to mark fragmented elementsthat cannot be built into a full structure by using nullcategoriesLCPmdashused to mark phrases formed by a localizerand its complement For example (LCP (NP (NR皖南) (NN 事变)) (LC 中)) (in the Southern AnhuiIncident)LSTmdashlist marker numbered list letters or numberswhich identify items in a list and their surroundingpunctuation is labeled as LSTNPmdashnoun phrases phrasal category that includes allconstituents depending on a head nounPPmdashpreposition phrase phrasal category headed bya prepositionQPmdashquantifier phrase used with NP for example(QP (CD五百) (CLP (M辆))) (500)VPmdashverb phrase phrasal category headed by a verb

(ii) The Phrasal Tag of English In the respect English WSJcorpus there are just eight categories NP (noun phrase)PP (prepositional phrase) VP (verb phrase) ADVP (adverbphrase) ADJP (adjective phrase) SBAR (subordinating con-junction) PRT (particle) and INTJ (interjection)

As we can see the criterion of Chinese CTB7 has a lotof significant differences from English WSJ corpus Unlikethe English Treebank the fragment of Chinese continents isnormally smaller A chunkwhich aligned to a base-NPphrasein the English side could be divided into a QP tag a CLPtag and an NP tag For example after chunking ldquo五百辆

Table 2 The description of universal chunk tags

Tag Description Words Example

NP Noun phrase DET + ADV +ADJ + NOUN The strange birds

PP Prepositionphrase TO + IN In between

VP Verb phrase ADV + VB Was lookingADVP Adverb phrase ADV Also

ADJP Adjective phrase CONJ + ADV +ADJ Warm and cozy

SBAR Subordinatingconjunction IN Whether or not

车 (500 cars)rdquo will be tagged as (NP (QP (CD 五百) (CLP(M辆))) (NN车)) But at the English side ldquo500 carsrdquo will bejust denoted with an NP tag This nonuniform standard willresult in a mismatch projection during the alignment phaseand the difficulty of evaluation in the cross-lingual settingTo fix these problems mappings are applied in the universalchunking tag setThe categoryiesCLPDP andQP aremergedinto an NP phrase That is to say the phrase such as (NP (QP(CD五百) (CLP (M辆))) (NN车)) which corresponds tothe English NP chunk ldquo500 carsrdquo will be assigned with anNP tag in the universal chunk tag Additionally the phrasesDNP and DVP could be included in the ADJP and ADVPtags respectively

On the English side the occurrence of INTJ is 0according to statisticsThis evidence shows thatwe can ignorethe INTJ chunk tag Additionally the words belong to a PRTtag that is always regarded as a VP in Chinese For example(PRT (RP up)) (NP (DET the) (NNS stairs)) is aligned to ldquo上楼梯rdquo in Chinese where ldquo上rdquo is a VP

412 IBO2 Chunk Tag There are many ways to encodethe phrase structure of a chunk tag such as IBO1 IBO2IOE1 and IOE2 [13] The one used in this study is theIBO2 encoding This format ensures that all initial wordsof different base phrases will receive a B tag which is ableto identify the boundaries of each chunk In addition twoboundary types are defined as follows

(i) B-X represents the beginning of a chunk X(ii) I-X indicates a noninitial word in a phrase X(iii) O any words that are out of the boundary of any

chunks

Hence the input Chinese sentence can be representedusing these notations as follows

去年 (NP-B)实现 (VP-B)进出口 (NP-B)总值 (NP-I) 达 (VP-B) 一千零九十八点二亿 (NP-B) 美元(NP-I)

∘(O)

Based on the above discussion a tag set that consistsof six universal chunk categories is proposed as shown inTable 2 While there are a lot of controversies about universaltags and what the exact tag set should be used these six

The Scientific World Journal 7

chunk categories are sufficient to embrace the most frequentchunk tags that exist in English and Chinese Furthermorea mapping from fine-grained chunk tags for these two kindlanguages to the universal chunk set has been developedHence a dataset consisting of common chunk tag for Englishand Chinese is achieved by combining with the originalTreebank data and the universal chunk tag set and mappingbetween these two languages

42 Chunk Induction The label distribution of Chinese wordtypes 119909 can be computed by marginalizing the chunk tagdistribution of trigram types 119888

119894= 119909minus11199090119909+1

over the left andright context as follows

119901 (119910 | 119909) =

sum119909minus1119909+1

119902119894(119910)

sum119909minus1119909+1 119910

1015840 119902119894(1199101015840)

(6)

Then a set of possible tags 119905119909(119910) is extracted through a way

that eliminates labels whose probability is below a thresholdvalue 120591

119905119909(119910) =

1 if 119901 (119910 | 119909) ge 120591

0 otherwise(7)

The way how to choose 120591 will be described in Section 5Additionally the vector 119905

119909will cover every word in the

Chinese trigram types and will be employed as features in theunsupervised Chinese shallow parsing

Similar to the work of [14 15] our chunk inductionmodel is constructed on the feature-based hidden Markovmodel (HMM) A chunking HMM generates a sequenceof words in order In each generation step based on thecurrent chunk tag 119911

119894conditionally an observed word 119909

119894and a

hidden successor chunk tag 119911119894+1

are generated independentlyThere are two types of conditional distributions in themodel emission and transition probabilities which are bothmultinomial probability distributionsGiven a sentence119909 anda state sequence 119911 a first-order Markov model defines a jointdistribution as

119875120579= (119883 = 119909 119885 = 119911) = 119875

120579(1198851= 1199111)

sdot

|119909|

prod

119894=1

119875120579(119885119894+1

= 119911119894+1

| 119885119894= 119911119894)

sdot 119875120579(119883119894= 119909119894| 119885119894= 119911119894)

(8)

where the second part represents the transition probabilityand the following part is emission probability which isdifferent from the conventional Markov model where thefeature-based model replaces the emission part with a log-linear model such that

119875120579(119883 = 119909 119885 = 119911) =

exp 120579Τ

119891 (119909 119911)

sum1199091015840isinVal(119883) exp 120579Τ119891 (119909 119911)

(9)

which corresponds to the entire Chinese vocabularyThe advantage of using this locally normalized log-

linear model is the ability of looking at various aspects of

Table 3 Feature template used in unsupervised chunking

Basic (119909= 119911 = )

Contains digit check if x contains digit and conjoin with 119911

(contains digit(119909)= 119911 = )Contains hypen contains hypen(119909)= 119911 =

Suffix indicator features for character suffixes of up tolength 1 present in 119909

Prefix indicator features for character prefixes of up tolength 1 present in 119909

Pos tag indicator feature for word POS assigned to 119909

the observation 119909 incorporating features of the observationsThen the model is trained by applying the following objectivefunction

L (120579) =

119873

sum

119894=1

logsum119885

119875 (120579) (119883 = 119909(119894)

119885 = 119911(119894)

) minus 1198621205792

(10)

It is worth noting that this function includes marginal-izing out all the sentence 119909rsquos and all possible configurations119911 which leads to a nonconvex objective We optimize thisobjective function by using L-BFGS a quasi-Newtonmethodproposed by Liu and Nocedal [16] The evidences of pastexperiments show that this direct gradient algorithm per-formed better than using a feature-enhanced modification ofthe EM (expectation-maximization)

Moreover, this state-of-the-art model has the further advantage that it is easy to experiment with various ways of incorporating constraint features into the log-linear model. The feature function $f_t$ consists of the relevant information extracted from the smoothed graph and eliminates the hidden states that are inconsistent with the threshold vector $t_x$.
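One simple way to realize this constraint, sketched below under our own naming, is to mask out emissions for tags that $t_x$ rules out before running inference; words without graph evidence are left unconstrained.

```python
# Sketch of the constraint feature: hidden states whose universal tag is
# ruled out by the thresholded vector t_x of Eq. (7) are masked out before
# decoding. `t` maps a word to its {tag: 0/1} vector (assumed container).

def mask_emissions(log_emit_col, word, tags, t):
    """Return one emission column with disallowed tags set to -inf."""
    allowed = t.get(word)
    if allowed is None:
        return log_emit_col                 # no graph evidence: keep all tags
    return [lp if allowed.get(tag, 0) else float("-inf")
            for lp, tag in zip(log_emit_col, tags)]
```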

4.3. Unsupervised Chunking Features. The feature selection process in the unsupervised chunking model affects the identification result. Feature selection is the task of identifying an optimal feature subset that best represents the underlying classification problem, by filtering out irrelevant and redundant features. Unlike the CoNLL-2000 shared task on supervised chunking, which uses the WSJ (Wall Street Journal) corpus as background knowledge to construct the chunking model, our corpus is composed only of words and does not contain any chunk information, which leads us to form an optimal feature subset for the underlying chunk pattern. We use the feature templates listed in Table 3. These are all coarse features on emission contexts, extracted from words with certain orthographic properties; only the basic feature is used for transitions. For any emission context with word $x$ and tag $z$, we construct the feature templates shown in Table 3.

Box 1 illustrates an example of the feature subset. The feature set of each word in a sentence is a 6-dimensional vector holding the values of the corresponding features, and the label indicates which chunk label the word should receive.


Sentence: 中国/NR/NP-B 建筑业/NN/NP-I 对/P/PP-B 外/NN/NP-B 开放/VV/VP-B 呈现/VV/VP-B 新/JJ/NP-B 格局/NN/NP-I
Word: 建筑业
Instance: (建筑业, NP-I), (N, NP-I), (N, NP-I), (建), (业), NN

Box 1: Example of the feature template.
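The templates of Table 3 and the instance in Box 1 can be reproduced with a few lines of Python. The string encoding of the features is our own; reading Box 1, the two (N, NP-I) entries appear to be the negative digit and hyphen indicators, which is an interpretation rather than something the paper states.

```python
# Sketch of the Table 3 templates for an emission context (x, z):
# basic word identity, digit and hyphen indicators conjoined with the tag,
# length-1 suffix/prefix characters, and the POS tag assigned to x.

def emission_features(x, z, pos):
    """Return the six Table 3 features for an emission context (x, z)."""
    has_digit = "Y" if any(ch.isdigit() for ch in x) else "N"
    has_hyphen = "Y" if "-" in x else "N"
    return [
        f"basic:x={x},z={z}",
        f"contains_digit={has_digit},z={z}",
        f"contains_hyphen={has_hyphen},z={z}",
        f"suffix1={x[-1]},z={z}",   # character suffix of length 1
        f"prefix1={x[0]},z={z}",    # character prefix of length 1
        f"pos={pos},z={z}",
    ]

# e.g. emission_features("建筑业", "NP-I", "NN") mirrors the 6-dimensional
# instance shown in Box 1.
```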

5. Experiments and Results

Before presenting the results of our experiments, we first describe the evaluation metrics, the datasets, and the baselines used for comparison.

5.1. Evaluation Metrics. The metrics for evaluating the NP chunking model are the precision rate, the recall rate, and their harmonic mean, the $F_1$-score. The tagging accuracy is defined as

$$\text{Tagging Accuracy} = \frac{\text{the number of correctly tagged words}}{\text{the number of words}}. \qquad (11)$$

Precision measures the percentage of labeled NP chunks that are correct; a tagged word is counted as correct only if both the boundaries and the label of its NP chunk are correct. Precision is therefore defined as

$$\text{Precision} = \frac{\text{the number of correctly proposed tagged words}}{\text{the number of proposed chunk tags}}. \qquad (12)$$

Recall measures the percentage of NP chunks present in the input sentence that are correctly identified by the system:

$$\text{Recall} = \frac{\text{the number of correctly proposed tagged words}}{\text{the number of gold chunk tags}}. \qquad (13)$$

The $F$-measure combines the previous two measures into one metric. The $F_\beta$-score is defined as

$$F_\beta = \frac{(\beta^2 + 1) \times \text{Precision} \times \text{Recall}}{\beta^2 \times \text{Precision} + \text{Recall}}, \qquad (14)$$

where $\beta$ is a parameter that weights the relative importance of recall and precision; when $\beta = 1$, precision and recall are equally important.
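For completeness, (11)-(14) translate directly into span-level code. The sketch below assumes chunks are represented as (start, end, label) triples, which is our framing; a chunk counts as correct only when boundaries and label both match.

```python
# Sketch of Eq. (12)-(14) over chunk spans. Chunks are (start, end, label)
# triples; a predicted chunk is correct only if it matches a gold chunk
# exactly, i.e. both boundaries and label agree.

def chunk_prf(gold_chunks, pred_chunks, beta=1.0):
    gold, pred = set(gold_chunks), set(pred_chunks)
    correct = len(gold & pred)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    b2 = beta ** 2
    f = (b2 + 1) * precision * recall / (b2 * precision + recall)  # Eq. (14)
    return precision, recall, f
```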

5.2. Dataset. To facilitate the bilingual graph construction and the unsupervised chunking experiments, two kinds of datasets are utilized in this work: (1) a monolingual Treebank for Chinese chunk induction and (2) a large English-Chinese parallel corpus for bilingual graph construction and label propagation.

Table 4: (a) English-Chinese parallel corpus; (b) graph instances; (c) Chinese unlabeled dataset (CTB7 corpora).

(a)
Number of sentence pairs: 10,000 | Number of seeds: 27,940 | Number of words: 31,678

(b)
Number of sentences: 17,617 | Number of vertices: 185,441

(c)
Dataset | Source | Number of sentences
Training dataset | Xinhua 1–321 | 7,617
Testing dataset | Xinhua 363–403 | 912

For the monolingual Treebank data, we rely on the Penn Chinese Treebank 7.0 (CTB7) [12], which consists of over one million words of annotated and parsed text from Chinese newswire, magazines, broadcast news and broadcast conversation programs, web newsgroups, and weblogs. The parallel data come from the UM corpus [17], which contains 100,000 English-Chinese sentence pairs. The training and testing sets defined for the experiment, together with their statistics, are shown in Table 4. Table 4 also details the data used for bilingual graph construction, including the parallel corpus for tag projection and the Chinese trigram types for label propagation.

5.3. Chunk Tagset and HMM States. As described above, the universal chunk tag set is employed in our experiments. This set $U$ consists of the following six coarse-grained tags: NP (noun phrase), PP (prepositional phrase), VP (verb phrase), ADVP (adverb phrase), ADJP (adjective phrase), and SBAR (subordinating conjunction). Although there might be some controversy about the exact definition of such a tag set, these 6 categories cover the most frequent chunk tags that exist in one form or another in both the English and Chinese corpora.

For each language under consideration, a mapping from the fine-grained, language-specific chunk tags in the Treebank to the universal chunk tags is provided. Hence, the unsupervised chunking model is trained on datasets labeled with the universal chunk tags.


Table 5: Chunk tagging evaluation results for the baselines and the proposed graph-based model.

Tag | Feature-HMM (Pre / Rec / F1) | Projection (Pre / Rec / F1) | Graph-based (Pre / Rec / F1)
NP | 0.60 / 0.67 / 0.63 | 0.67 / 0.72 / 0.68 | 0.81 / 0.82 / 0.81
VP | 0.56 / 0.51 / 0.53 | 0.61 / 0.57 / 0.59 | 0.78 / 0.74 / 0.76
PP | 0.36 / 0.28 / 0.32 | 0.44 / 0.31 / 0.36 | 0.60 / 0.51 / 0.55
ADVP | 0.40 / 0.46 / 0.43 | 0.45 / 0.52 / 0.48 | 0.64 / 0.68 / 0.66
ADJP | 0.47 / 0.53 / 0.50 | 0.48 / 0.58 / 0.51 | 0.67 / 0.71 / 0.69
SBAR | 0.00 / 0.00 / 0.00 | 0.50 / 1.00 / 0.66 | 0.50 / 1.00 / 0.66
All | 0.49 / 0.51 / 0.50 | 0.57 / 0.62 / 0.59 | 0.73 / 0.75 / 0.74

In this work, the number of latent HMM states for Chinese is fixed to a constant, namely, the number of universal chunk tags.

5.4. Various Models. To comprehensively probe our approach with a thorough analysis, we evaluated two baselines in addition to our graph-based algorithm. We were intentionally lenient with these two baselines.

(i) Feature-HMM. The first baseline is the vanilla feature-HMM proposed by [9], which applies L-BFGS to estimate the model parameters and uses a greedy many-to-1 mapping for evaluation.

(ii) Projection. The direct projection serves as the second baseline. It incorporates the bilingual information by projecting chunk tags from the English side to the aligned Chinese text. Several rules were added to fix the problem of unaligned words. We trained a supervised feature-HMM with the same number of sentences from the bitext as in the Treebank.

(iii) Our Graph-Based Model. The full model uses both stages of label propagation (3) before extracting the constraint features. As a result, the label distributions of all Chinese word types were added to the constraint features.

5.5. Experimental Setup. The experimental platform is implemented on top of two toolkits: Junto [18] and Ark [14, 15]. Junto is a Java-based label propagation toolkit used for the bilingual graph construction; Ark is a Java-based package used for feature-HMM training and testing.

For feature-HMM training, a few hyperparameters have to be set to minimize the number of free parameters in the model. $C = 1.0$ was used as the regularization constant in (10), and L-BFGS was run for 1000 iterations. Several threshold values $\tau$ for extracting the vector $t_x$ were tried, and it was found that 0.2 works best; we therefore use $\tau = 0.2$ for the Chinese trigram types.
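For reference, the few hyperparameters that fix the model's free parameters can be collected in one place. The dictionary keys below are our own naming; the values are those reported in this section, in Section 3.6 (label propagation iterations), and in Section 5.3 (HMM states).

```python
# Hyperparameter summary for the experiments; keys are illustrative names,
# values are those reported in the paper.
HYPERPARAMS = {
    "C": 1.0,                      # L2 regularization constant in Eq. (10)
    "lbfgs_max_iterations": 1000,  # L-BFGS iterations for Eq. (10)
    "tau": 0.2,                    # threshold for extracting t_x, Eq. (7)
    "label_prop_iterations": 10,   # update steps of Eq. (4), Section 3.6
    "hmm_states": 6,               # one latent state per universal chunk tag
}
```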

5.6. Results. Table 5 shows the complete set of results. As expected, the projection baseline outperforms the feature-HMM because it benefits from the bilingual information: it improves upon the monolingual baseline by 9% in $F_1$-score (0.50 to 0.59). Among the unsupervised approaches, the feature-HMM achieves only 50% $F_1$-score on the universal chunk tags. Overall, the graph-based model outperforms both baselines; the improvement of the feature-HMM under the graph-based setting is statistically significant with respect to the other models. The graph-based model performs 24% better than the state-of-the-art feature-HMM and 15% better than the direct projection model.

6. Conclusion

This paper has presented an unsupervised graph-based approach to Chinese chunking that uses label propagation to project chunk information across languages. Our results suggest that accurate chunk tags can be learned for unlabeled text from a bitext, that is, from data paired with parallel English text. In future work, we plan to extend this graph-based unsupervised approach to chunking for other languages that lack ready-to-use resources.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

The authors are grateful to the Science and Technology Development Fund of Macau and the Research Committee of the University of Macau (Grant nos. MYRG076(Y1-L2)-FST13-WF and MYRG070(Y1-L2)-FST12-CS) for the funding support for their research.

References

[1] R. Koeling, "Chunking with maximum entropy models," in Proceedings of the 2nd Workshop on Learning Language in Logic and the 4th Conference on Computational Natural Language Learning, vol. 7, pp. 139–141, 2000.

[2] M. P. Marcus, M. A. Marcinkiewicz, and B. Santorini, "Building a large annotated corpus of English: the Penn Treebank," Computational Linguistics, vol. 19, no. 2, pp. 313–330, 1993.

[3] D. Yarowsky and G. Ngai, "Inducing multilingual POS taggers and NP bracketers via robust projection across aligned corpora," in Proceedings of the 2nd Meeting of the North American Chapter of the Association for Computational Linguistics (NAACL '01), pp. 200–207, 2001.

[4] X. Zhu, Z. Ghahramani, and J. Lafferty, "Semi-supervised learning using Gaussian fields and harmonic functions," in Proceedings of the 20th International Conference on Machine Learning (ICML '03), vol. 3, pp. 912–919, August 2003.

[5] Y. Altun, M. Belkin, and D. A. McAllester, "Maximum margin semi-supervised learning for structured variables," in Advances in Neural Information Processing Systems, pp. 33–40, 2005.

[6] A. Subramanya, S. Petrov, and F. Pereira, "Efficient graph-based semi-supervised learning of structured tagging models," in Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 167–176, October 2010.

[7] F. J. Och and H. Ney, "A systematic comparison of various statistical alignment models," Computational Linguistics, vol. 29, no. 1, pp. 19–51, 2003.

[8] A. Subramanya and J. Bilmes, "Semi-supervised learning with measure propagation," The Journal of Machine Learning Research, vol. 12, pp. 3311–3370, 2011.

[9] D. Das and N. A. Smith, "Semi-supervised frame-semantic parsing for unknown predicates," in Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics, pp. 1435–1444, June 2011.

[10] S. Baluja, R. Seth, D. Sivakumar, et al., "Video suggestion and discovery for YouTube: taking random walks through the view graph," in Proceedings of the 17th International Conference on World Wide Web, pp. 895–904, April 2008.

[11] S. Petrov, D. Das, and R. McDonald, "A universal part-of-speech tagset," 2011, http://arxiv.org/abs/1104.2086.

[12] N. Xue, F. Xia, F.-D. Chiou, and M. Palmer, "The Penn Chinese TreeBank: phrase structure annotation of a large corpus," Natural Language Engineering, vol. 11, no. 2, pp. 207–238, 2005.

[13] E. F. Tjong Kim Sang and H. Déjean, "Introduction to the CoNLL-2001 shared task: clause identification," in Proceedings of the Workshop on Computational Natural Language Learning, vol. 7, p. 8, 2001.

[14] T. Berg-Kirkpatrick, A. Bouchard-Côté, J. DeNero, and D. Klein, "Painless unsupervised learning with features," in Proceedings of the Human Language Technologies Conference of the North American Chapter of the Association for Computational Linguistics, pp. 582–590, Los Angeles, Calif, USA, June 2010.

[15] A. B. Goldberg and X. Zhu, "Seeing stars when there aren't many stars: graph-based semi-supervised learning for sentiment categorization," in Proceedings of the 1st Workshop on Graph Based Methods for Natural Language Processing, pp. 45–52, 2006.

[16] D. C. Liu and J. Nocedal, "On the limited memory BFGS method for large scale optimization," Mathematical Programming B, vol. 45, no. 3, pp. 503–528, 1989.

[17] L. Tian, F. Wong, and S. Chao, "An improvement of translation quality with adding key-words in parallel corpus," in Proceedings of the International Conference on Machine Learning and Cybernetics (ICMLC '10), vol. 3, pp. 1273–1278, July 2010.

[18] P. P. Talukdar and F. Pereira, "Experiments in graph-based semi-supervised learning methods for class-instance acquisition," in Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics (ACL '10), pp. 1473–1481, July 2010.

Submit your manuscripts athttpwwwhindawicom

Computer Games Technology

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Distributed Sensor Networks

International Journal of

Advances in

FuzzySystems

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014

International Journal of

ReconfigurableComputing

Hindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Applied Computational Intelligence and Soft Computing

thinspAdvancesthinspinthinsp

Artificial Intelligence

HindawithinspPublishingthinspCorporationhttpwwwhindawicom Volumethinsp2014

Advances inSoftware EngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Journal of

Computer Networks and Communications

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation

httpwwwhindawicom Volume 2014

Advances in

Multimedia

International Journal of

Biomedical Imaging

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

ArtificialNeural Systems

Advances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Computational Intelligence and Neuroscience

Industrial EngineeringJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Human-ComputerInteraction

Advances in

Computer EngineeringAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Page 4: Research Article Unsupervised Chunking Based on Graph ...downloads.hindawi.com/journals/tswj/2014/401943.pdf · Unsupervised Chunking Based on Graph Propagation from Bilingual Corpus

4 The Scientific World Journal

between the English and Chinese vertices Since the graphvertices are extracted from a parallel corpus a standardword alignment technique GIZA++ [7] is applied to alignthe English sentences 119863

119890 and their Chinese translations 119863119888

Based on the idea that the label propagation process in thegraph will provide coverage and high recall only a highconfidence alignment119863119890harr119888 gt 09 is considered

Therefore we can extract tuples of the form [119905 harr 119906] basedon these high-confidence word alignments where the Chi-nese trigram types that the middle word aligns to an Englishunigram type 119906 Relying on computing the proportion ofthese tuple counts we set the edge weights between the twolanguages by our bilingual similarity function

34 Graph Initialization So far we have introduced how toconstruct a graph But the graph here is unlabeled completelyFor the sake of label propagation we have to use a supervisedEnglish chunking tagger to label the English side of theparallel corpus in the graph initialization phase Afterwardswe simply count the individual labels of each English tokenand then normalize the counts to get the tag distribution overthe English unigram types These distributions are in the useof initializing the English verticesrsquo distribution in the graphConsidering that we extract the English vertices from a bitextall vertices in English word types are assigned with an initiallabel distribution

35 Graph Example Figure 3 shows a simple small excerptfrom a Chinese-English graph We see that only the Chinesetrigrams [一项重要选择] [因素是这样] and [不断深入]are connected to the English translated words All theseEnglish vertices are labeled by a supervised tagger In thisparticular case the neighborhoods can be more diverse anda soft label distribution is allowed over these vertices As onecan see Figure 3 is composed of three subgraphs In each sub-graph it is worth noting that the middle of Chinese trigramtypes has the same chunking types (with the labeled one)This exhibits the fact that themonolingual similarity functionguarantees the connected vertices having the same syntacticcategory The label propagation process then spreads theEnglish wordsrsquo tags to the corresponding Chinese trigramvertices After that labels are further propagated among theChinese vertices This kind of propagation is used to conveythese tags inwards and results in tag distributions for themiddle word for each Chinese trigram type More details onhow to project the chunks and propagate the labels will bedescribed in the following section

36 Label Propagation Label propagation operates on theconstructed graph The primary objective is to propagatelabels from a few labeled vertices to the entire graph byoptimizing a loss function based on the constraints or prop-erties derived from the graph for example smoothness [4 8]or sparsity [9] State-of-the-art label propagation algorithmsinclude LP-ZGL [4] adsorption [10] MAD [8] and sparsity-inducing penalties [9]

Label propagation is applied in two phases to generate softlabel distributions over all the vertices in the bilingual graph

In the first stage the label propagation is used to project theEnglish chunk labels to the Chinese side In detail we simplytransfer the label distributions from the English word typesto the connected Chinese vertices (ie 119881119897

119888) at the periphery

of the graph Note that not all the Chinese vertices are fullyconnected with English words if we consider only the high-confidence alignments In this stage a tag distribution 119889

119894over

the labels 119910 is generated which represents the proportion oftimes the center word 119888

119894isin 119881119888aligns to English words 119890

119910

tagged with label 119910

119889119894(119910) =

sum119890119910

[119888119894larrrarr 119890

119910]

sum1199101015840 sum1198901199101015840 [119888119894larrrarr 119890

1199101015840]

(2)

The second phase is the conventional label propagation topropagate the labels from the peripheral vertices 119881

119897

119888to all

Chinese vertices in the constructed graph The optimizationon the similarity graph is based on the objective function

119875 (119902) = sum

119888119894isin119881119888119862119897

119888

119908119894119895

10038171003817100381710038171003817119902119894minus 119902119895

10038171003817100381710038171003817

2

+ 1205821003817100381710038171003817119902119894 minus 119880

1003817100381710038171003817

2

st sum

119910

119902119894(119910) = 1 forall119888

119894

119902119894(119910) ge 0 forall119888

119894 119910

119902119894= 119889119894

forall119888119894isin 119881119897

119888

(3)

where 119902119894(119894 = 1 |119881

119888|) are the label distributions over

all Chinese vertices and 120582 is the hyperparameter thatwill be discussed in Section 4 Consider that 119902

119894minus 1199021198952

=

sum119910(119902119894(119910) minus 119902

119895(119910))2 is a squared loss which is used to penalize

the neighbor vertices to make sure that the connectedneighbors have different label distributions Furthermore theadditional second part of the regulation makes it possiblethat the label distribution over all possible labels 119910 is towardsthe uniform distribution 119880 All these show the fact that thisobjective function is convex

As we know the first term in (3) is a smoothness regu-larizer which can be used to encourage the similar verticeswhich have the large 119908

119894119895in between to be much more

similar Moreover the second part is applied to regularizeand encourage all marginal types to be uniform Additionallythis term also ensures that the unlabeled converged marginalvertices will be uniform over all tags if these types do not havea path to any labeled vertices This part makes it possible thatthe middle word of this kind unlabeled trigram takes on anypossible tag as well

However although a closed form solution can be derivedthrough the objective function mentioned above it would beimpossible without the inversion of the matrix of order |119881

119888|

To solve this problem we rely on an iterative update basedalgorithm instead The formula is as follows

119902(119898)

119894(119910) =

119888119894(119910) if 119888

119894isin 119881119897

119888

120574119894(119910)

120581119894

otherwise (4)

The Scientific World Journal 5

[Yixiang zhongyao xuanze]

[Yixiang cuowu xuanze]

[Yixiang mingzhi xuanze]

[Yixiang zhengque xuanze]

NP-INP-IImportantImportant

[Ta shi yi]

[Wo shi yi]

[Yinsu shi zheyang]

[Zhi shi zheyang] [Women shi nansheng]

[Buduan tigao]

[Buduan tansuo]

[Buduan shenru]

[Buduan bianhua]

ExploreVP-I

Figure 3 An example of similarity graph over trigram on labeled and unlabeled data

where forall119888119894isin 119881119888 119881119897

119888 120574119894(119910) and 119896

119894are defined as follows

120574119894(119910) = sum

119888119895isin119873(119888119894)

119908119894119895119902(119898minus1)

119894(119910) + 120582119880 (119910)

119896119894= 120582 + sum

119888119895isin119873(119888119894)

119908119894119895

(5)

This procedure will be processed 10 iterations in our experi-ment

4 Unsupervised Chunk Induction

In Section 3 the bilingual graph construction is describedsuch that English chunk tags can be projected to Chineseside by simple label propagation This relies on the high-confidence word alignments Many Chinese vertices couldnot be fully projected from the English words To comple-ment label propagation is used to further propagate amongthe Chinese trigram types This section introduces the usageof unsupervised approach to build a practical system inthe task of shallow parsing The implementation of how to

establish the chunk tag identification systemwill be discussedand presented in detail

41 Data Representation

411 Universal Chunk Tag The universal tag was first pro-posed by [11] that consists of twelve universal part-of-speechcategories for the sake of evaluating their cross-lingual POSprojection system for six different languages In this work wefollow the idea but focus on the universal chunk tags betweenEnglish and Chinese for several reasons First it is usefulfor building and evaluating unsupervised and cross-lingualchunk taggers Additionally taggers constructed based onuniversal tag set can result in a more reasonable comparisonacross different supervised chunking approaches Since twokinds of tagging standards are applied for the differentlanguages it is vacuous to state that ldquoshallow parsing forlanguage119860 ismuch harder than that for language119861rdquowhen thetag sets are incomparable Finally it also permits our modelto train the chunk taggers with a common tag set acrossmultiple languages which does not need tomaintain language

6 The Scientific World Journal

specific unification rules due to the differences in Treebankannotation guidelines

The following compares the definition of tags used in theEnglish Wall Street Journal (WSJ) [2] and the Chinese PennTreebank (CTB) 70 [12] that are used in our experiment

(i) The Phrasal Tag of Chinese In the present corpus CTB7each chunk is labeled with one syntactic category A tag set of13 categories is used to represent shallow parsing structuralinformation as follows

ADJPmdashadjective phrase phrase headed by an adjec-tiveADVPmdashadverbial phrase phrasal category headed byan adverbCLPmdashclassifiers phrase for example (QP (CD 一)(CLP (M系列))) (a series)DPmdashdeterminer phrase in CTB7 corpus a DP isa modifier of the NP if it occurs within a NP Forexample (NP (DP (DT 任何) (NP (NN 人))) (anypeople)DNPmdashphrase formed by XP plus (DEG的) (lsquos) thatmodifies a NP the XP can be an ADJP DP QP NPPP or LCPDVPmdashphrase formed by XP plus ldquo地 (-ly)rdquo thatmodifies a VPFRAGmdashfragment used to mark fragmented elementsthat cannot be built into a full structure by using nullcategoriesLCPmdashused to mark phrases formed by a localizerand its complement For example (LCP (NP (NR皖南) (NN 事变)) (LC 中)) (in the Southern AnhuiIncident)LSTmdashlist marker numbered list letters or numberswhich identify items in a list and their surroundingpunctuation is labeled as LSTNPmdashnoun phrases phrasal category that includes allconstituents depending on a head nounPPmdashpreposition phrase phrasal category headed bya prepositionQPmdashquantifier phrase used with NP for example(QP (CD五百) (CLP (M辆))) (500)VPmdashverb phrase phrasal category headed by a verb

(ii) The Phrasal Tag of English In the respect English WSJcorpus there are just eight categories NP (noun phrase)PP (prepositional phrase) VP (verb phrase) ADVP (adverbphrase) ADJP (adjective phrase) SBAR (subordinating con-junction) PRT (particle) and INTJ (interjection)

As we can see the criterion of Chinese CTB7 has a lotof significant differences from English WSJ corpus Unlikethe English Treebank the fragment of Chinese continents isnormally smaller A chunkwhich aligned to a base-NPphrasein the English side could be divided into a QP tag a CLPtag and an NP tag For example after chunking ldquo五百辆

Table 2 The description of universal chunk tags

Tag Description Words Example

NP Noun phrase DET + ADV +ADJ + NOUN The strange birds

PP Prepositionphrase TO + IN In between

VP Verb phrase ADV + VB Was lookingADVP Adverb phrase ADV Also

ADJP Adjective phrase CONJ + ADV +ADJ Warm and cozy

SBAR Subordinatingconjunction IN Whether or not

车 (500 cars)rdquo will be tagged as (NP (QP (CD 五百) (CLP(M辆))) (NN车)) But at the English side ldquo500 carsrdquo will bejust denoted with an NP tag This nonuniform standard willresult in a mismatch projection during the alignment phaseand the difficulty of evaluation in the cross-lingual settingTo fix these problems mappings are applied in the universalchunking tag setThe categoryiesCLPDP andQP aremergedinto an NP phrase That is to say the phrase such as (NP (QP(CD五百) (CLP (M辆))) (NN车)) which corresponds tothe English NP chunk ldquo500 carsrdquo will be assigned with anNP tag in the universal chunk tag Additionally the phrasesDNP and DVP could be included in the ADJP and ADVPtags respectively

On the English side the occurrence of INTJ is 0according to statisticsThis evidence shows thatwe can ignorethe INTJ chunk tag Additionally the words belong to a PRTtag that is always regarded as a VP in Chinese For example(PRT (RP up)) (NP (DET the) (NNS stairs)) is aligned to ldquo上楼梯rdquo in Chinese where ldquo上rdquo is a VP

412 IBO2 Chunk Tag There are many ways to encodethe phrase structure of a chunk tag such as IBO1 IBO2IOE1 and IOE2 [13] The one used in this study is theIBO2 encoding This format ensures that all initial wordsof different base phrases will receive a B tag which is ableto identify the boundaries of each chunk In addition twoboundary types are defined as follows

(i) B-X represents the beginning of a chunk X(ii) I-X indicates a noninitial word in a phrase X(iii) O any words that are out of the boundary of any

chunks

Hence the input Chinese sentence can be representedusing these notations as follows

去年 (NP-B)实现 (VP-B)进出口 (NP-B)总值 (NP-I) 达 (VP-B) 一千零九十八点二亿 (NP-B) 美元(NP-I)

∘(O)

Based on the above discussion a tag set that consistsof six universal chunk categories is proposed as shown inTable 2 While there are a lot of controversies about universaltags and what the exact tag set should be used these six

The Scientific World Journal 7

chunk categories are sufficient to embrace the most frequentchunk tags that exist in English and Chinese Furthermorea mapping from fine-grained chunk tags for these two kindlanguages to the universal chunk set has been developedHence a dataset consisting of common chunk tag for Englishand Chinese is achieved by combining with the originalTreebank data and the universal chunk tag set and mappingbetween these two languages

42 Chunk Induction The label distribution of Chinese wordtypes 119909 can be computed by marginalizing the chunk tagdistribution of trigram types 119888

119894= 119909minus11199090119909+1

over the left andright context as follows

119901 (119910 | 119909) =

sum119909minus1119909+1

119902119894(119910)

sum119909minus1119909+1 119910

1015840 119902119894(1199101015840)

(6)

Then a set of possible tags 119905119909(119910) is extracted through a way

that eliminates labels whose probability is below a thresholdvalue 120591

119905119909(119910) =

1 if 119901 (119910 | 119909) ge 120591

0 otherwise(7)

The way how to choose 120591 will be described in Section 5Additionally the vector 119905

119909will cover every word in the

Chinese trigram types and will be employed as features in theunsupervised Chinese shallow parsing

Similar to the work of [14 15] our chunk inductionmodel is constructed on the feature-based hidden Markovmodel (HMM) A chunking HMM generates a sequenceof words in order In each generation step based on thecurrent chunk tag 119911

119894conditionally an observed word 119909

119894and a

hidden successor chunk tag 119911119894+1

are generated independentlyThere are two types of conditional distributions in themodel emission and transition probabilities which are bothmultinomial probability distributionsGiven a sentence119909 anda state sequence 119911 a first-order Markov model defines a jointdistribution as

119875120579= (119883 = 119909 119885 = 119911) = 119875

120579(1198851= 1199111)

sdot

|119909|

prod

119894=1

119875120579(119885119894+1

= 119911119894+1

| 119885119894= 119911119894)

sdot 119875120579(119883119894= 119909119894| 119885119894= 119911119894)

(8)

where the second part represents the transition probabilityand the following part is emission probability which isdifferent from the conventional Markov model where thefeature-based model replaces the emission part with a log-linear model such that

119875120579(119883 = 119909 119885 = 119911) =

exp 120579Τ

119891 (119909 119911)

sum1199091015840isinVal(119883) exp 120579Τ119891 (119909 119911)

(9)

which corresponds to the entire Chinese vocabularyThe advantage of using this locally normalized log-

linear model is the ability of looking at various aspects of

Table 3 Feature template used in unsupervised chunking

Basic (119909= 119911 = )

Contains digit check if x contains digit and conjoin with 119911

(contains digit(119909)= 119911 = )Contains hypen contains hypen(119909)= 119911 =

Suffix indicator features for character suffixes of up tolength 1 present in 119909

Prefix indicator features for character prefixes of up tolength 1 present in 119909

Pos tag indicator feature for word POS assigned to 119909

the observation 119909 incorporating features of the observationsThen the model is trained by applying the following objectivefunction

L (120579) =

119873

sum

119894=1

logsum119885

119875 (120579) (119883 = 119909(119894)

119885 = 119911(119894)

) minus 1198621205792

(10)

It is worth noting that this function includes marginal-izing out all the sentence 119909rsquos and all possible configurations119911 which leads to a nonconvex objective We optimize thisobjective function by using L-BFGS a quasi-Newtonmethodproposed by Liu and Nocedal [16] The evidences of pastexperiments show that this direct gradient algorithm per-formed better than using a feature-enhanced modification ofthe EM (expectation-maximization)

Moreover this state-of-the-art model also has an advan-tage that makes it easy to experiment with various waysof incorporating the constraint feature into the log-linearmodel This feature function 119891

119905consists of the relevant

information extracted from the smooth graph and eliminatesthe hidden states which are inconsistent with the thresholdvector 119905

119909

43 Unsupervised Chunking Features The feature selectionprocess in the unsupervised chunking model does affect theidentification result Feature selection is the task to identifyan optimal feature subset that best represents the underlyingclassification problem through filtering the irrelevant andredundant ones Unlike the CoNLL-2000 we shared super-vised chunking task which is using WSJ (the Wall StreetJournal) corpus as background knowledge to construct thechunking model Our corpus is composed of only wordswhich do not contain any chunk information that enables usto form an optimal feature subset for the underlying chunkpattern We use the set of features as the following featuretemplates These are all coarse features on emission contextsthat are extracted from words with certain orthographicproperties Only the basic feature is used for transitions Forany emission context with word 119909 and tag 119911 we construct thefollowing feature templates as shown in Table 3

Box 1 illustrates the example about the feature subsetThe feature set of each word in a sentence is a vector of6 dimensions which are the values of the correspondingfeatures and the label indicates which kind of chunk labelshould the word belong to

8 The Scientific World Journal

Sentence中国NRNP-B建筑业NNNP-I对PPP-B外NNNP-B开放VVVP-B呈现VVVP-B新JJNP-B格局NNNP-I

Word建筑业Instance (建筑业 NP-I) (N NP-I) (N NP-I) (建)

(业) NN

Box 1 Example of feature template

5 Experiments and Results

Before presenting the results of our experiments the eval-uation metrics the datasets and the baselines used forcomparisons are firstly described

51 Evaluation Metrics The metrics for evaluating NPchunking model constitute precision rate recall rate andtheir harmonic mean 119865

1-score The tagging accuracy is

defined as

Tagging Accuracy

=The number of correct tagged words

The number of words

(11)

Precision measures the percentage of labeled NP chunksthat are correctThe number of correct tagged words includesboth the correct boundaries of NP chunk and the correctlabel The precision is therefore defined as

Precision

=The number of correct proposed tagged words

The number of correct chunk tags

(12)

Recall measures the percentage of NP chunks presentedin the input sentence that are correctly identified by thesystem Recall is depicted as below

Recall =The number of correct proposed tagged words

The number of current chunk tags

(13)

The 119865-measure illustrates a way to combine previous twomeasures into onemetricThe formula of 119865-sore is defined as

119865120573-score =

(1205732

+ 1) times Recall times Precision1205732 times Recall + Precision

(14)

where 120573 is a parameter that weights the importance of recalland precision when 120573 = 1 precision and recall are equallyimportant

52 Dataset To facilitate bilingual graph construction andunsupervised chunking experiment two kinds of data sets areutilized in this work (1) monolingual Treebank for Chinesechunk induction and (2) large amounts of parallel corpus

Table 4 (a) English-Chinese parallel corpus (b) Graph instance (c)Chinese unlabeled dataset CTB7 corpora

(a)

Number of sentence pairs Number of seeds Number of words10000 27940 31678

(b)

Number of sentences Number of vertices17617 185441

(c)

Dataset Source Number of sentencesTraining dataset Xinhua 1ndash321 7617Testing dataset Xinhua 363ndash403 912

of English and Chinese for bilingual graph construction andlabel propagation For themonolingual Treebank data we relyon Penn Chinese Treebank 70 (CTB7) [12] It consists of overonemillion words of annotated and parsed text fromChinesenewswire magazines various broadcast news and broadcastconversation programs web newsgroups and weblogs Theparallel data came from the UM corpus [17] which contains100000 pair of English-Chinese sentences The trainingand testing sets are defined for the experiment and theircorresponding statistic information is shown in Table 4 Inaddition Table 4 also shows the detailed information of dataused for bilingual graph construction including the parallelcorpus for the tag projection and the label propagation ofChinese trigram

53 Chunk Tagset and HMM States As described the uni-versal chunk tag set is employed in our experiment Thisset 119880 consists of the following six coarse-grained tags NP(noun phrase) PP (prepositional phrase) VP (verb phrase)ADVP (adverb phrase) ADJP (adjective phrase) and SBAR(subordinating conjunction) Although there might be somecontroversies about the exact definition of such a tag set these6 categories cover the most frequent chunk tags that exist inone form or another in both of English and Chinese corpora

For each kind of languages under consideration a map-ping from the fine-grained language specific chunk tags in theTreebank to the universal chunk tags is provided Hence themodel of unsupervised chunking is trained on the datasetslabeled with the universal chunk tags

The Scientific World Journal 9

Table 5 Chunk tagging evaluation results for various baselines andproposed graph-based model

Tag Feature-HMM Projection Graph-basedPre Rec F1 Pre Rec F1 Pre Rec F1

NP 060 067 063 067 072 068 081 082 081VP 056 051 053 061 057 059 078 074 076PP 036 028 032 044 031 036 060 051 055ADVP 040 046 043 045 052 048 064 068 066ADJP 047 053 050 048 058 051 067 071 069SBAR 000 000 000 050 10 066 050 10 066All 049 051 050 057 062 059 073 075 074

In this work the number of latent HMM states forChinese is fixed to be a constant which is the number of theuniversal chunk tags

54 VariousModels To comprehensively probe our approachwith a thorough analysis we evaluated two baselines inaddition to our graph-based algorithmWewere intentionallylenient with these two baselines

(i) Feature-HMM The first baseline is the vanilla feature-HMM proposed by [9] which apply L-BFGS to estimate themodel parameters and use a greedy many-to-1 mapping forevaluation

(ii) ProjectionThe direct projection serves as a second base-line It incorporates the bilingual information by projectingchunk tags from English side to the aligned Chinese textsSeveral rules were added to fix the unaligned problem Wetrained a supervised feature-HMMwith the same number ofsentences from the bitext as it is in the Treebank

(iii) Our Graph-Based Model The full model uses bothstages of label propagation (3) before extracting the constrainfeatures As a result the label distribution of all Chinese wordtypes was added in the constrain features

55 Experimental Setup Theexperimental platform is imple-mented based on two toolkits Junto [18] and Ark [1415] Junto is a Java-based label propagation toolkit for thebilingual graph construction Ark is a Java-based package forthe feature-HMM training and testing

In the feature-HMM training we need to set a fewhyperparameters to minimize the number of free parametersin the model 119862 = 10 was used as regularization constantin (10) and trained L-BFGS for 1000 iterations Severalthreshold values for 120591 to extract the vector 119905

119909were tried and it

was found that 02 works best It indicates that 120591 = 02 couldbe used for Chinese trigram types

56 Results Table 5 shows the complete set of results Asexpected the projection baseline is better than the feature-HMM for being able to benefit from the bilingual infor-mation It greatly improves upon the monolingual baselineby 12 on 119865

1-score Comparing among the unsupervised

approaches the feature-HMM achieves only 50 of 1198651-

score on the universal chunk tags Overall the graph-basedmodel outperforms these two baselines That is to say theimprovement of feature-HMM with the graph-based settingis statistically significantwith respect to othermodels Graph-based modal performs 24 better than the state-of-the-art feature-HMM and 12 better than the direct projectionmodel

6 Conclusion

This paper presents an unsupervised graph-based Chinesechunking by using label propagation for projecting chunkinformation across languages Our results suggest that it ispossible for unlabeled text to learn accurate chunk tags fromthe bitext the data which has parallel English text In thefeature we propose an alternative graph-based unsupervisedapproach on chunking for languages that are lacking ready-to-use resource

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

Acknowledgments

The authors are grateful to the Science and TechnologyDevelopment Fund of Macau and the Research Committeeof the University of Macau (Grant nos MYRG076(Y1-L2)-FST13-WF andMYRG070(Y1-L2)-FST12-CS) for the fundingsupport for their research

References

[1] R Koeling ldquoChunking with maximum entropy modelsrdquo inProceedings of the 2ndWorkshop on Learning Language in Logicand the 4th Conference on Computational Natural LanguageLerning vol 7 pp 139ndash141 2000

[2] M P Marcus M A Marcinkiewicz and B Santorini ldquoBuildinga large annotated corpus of English the Penn TreebankrdquoComputational Linguistics vol 19 no 2 pp 313ndash330 1993

[3] D Yarowsky and G Ngai ldquoInducing multilingual POS taggersandNPbracketers via robust projection across aligned corporardquoin Proceedings of the 2ndMeeting of the North American Chapterof the Association for Computational Linguistics (NAACL rsquo01)pp 200ndash207 2001

[4] X Zhu Z Ghahramani and J Lafferty ldquoSemi-supervisedlearning using Gaussian fields and harmonic functionsrdquo inProceedings of the 20th International Conference on MachineLearning (ICML rsquo03) vol 3 pp 912ndash919 August 2003

[5] Y Altun M Belkin and D A Mcallester ldquoMaximum marginsemi-supervised learning for structured variablesrdquo in Advancesin Neural Information Processing Systems pp 33ndash40 2005

[6] A Subramanya S Petrov and F Pereira ldquoEfficient graph-based semi-supervised learning of structured tagging modelsrdquoinProceedings of theConference onEmpiricalMethods inNaturalLanguage Processing pp 167ndash176 October 2010

10 The Scientific World Journal

[7] F J Och and H Ney ldquoA systematic comparison of variousstatistical alignmentmodelsrdquoComputational Linguistics vol 29no 1 pp 19ndash51 2003

[8] A Subramanya and J Bilmes ldquoSemi-supervised learningwith measure propagationrdquo The Journal of Machine LearningResearch vol 12 pp 3311ndash3370 2011

[9] D Das and N A Smith ldquoSemi-supervised frame-semanticparsing for unknown predicatesrdquo in Proceedings of the 49thAnnualMeeting of the Association for Computational Linguisticspp 1435ndash1444 June 2011

[10] S Baluja R Seth D S Sivakumar et al ldquoVideo suggestion anddiscovery for you tube taking random walks through the viewgraphrdquo in Proceedings of the 17th International Conference onWorld Wide Web pp 895ndash904 April 2008

[11] S Petrov D Das and RMcDonald ldquoA universal part-of-speechtagsetrdquo 2011 httparxivorgabs11042086

[12] N Xue F Xia F-D Chiou and M Palmer ldquoThe Penn ChineseTreeBank phrase structure annotation of a large corpusrdquoNatural Language Engineering vol 11 no 2 pp 207ndash238 2005

[13] E F Tjong K Sang and H Dejean ldquoIntroduction to theCoNLL- shared task clause identificationrdquo in Proceedings of theWorkshop on Computational Natural Language Learning vol 7p 8 2001

[14] T Berg-Kirkpatrick A Bouchard-Cote J deNero andDKleinldquoPainless unsupervised learning with featuresrdquo in Proceedingsof the Human Language Technologies Conference of the NorthAmerican Chapter of the Association for Computational Linguis-tics pp 582ndash590 Los Angeles Calif USA June 2010

[15] A B Goldberg andX Zhu ldquoSeeing stars when there arenrsquotmanystars graph-based semi-supervised learning for sentiment cat-egorizationrdquo in Proceedings of the 1st Workshop on Graph BasedMethods for Natural Language Processing pp 45ndash52 2006

[16] DC Liu and JNocedal ldquoOn the limitedmemoryBFGSmethodfor large scale optimizationrdquoMathematical Programming B vol45 no 3 pp 503ndash528 1989

[17] L Tian F Wong and S Chao ldquoAn improvement of translationquality with adding key-words in parallel corpusrdquo in Proceed-ings of the International Conference on Machine Learning andCybernetics (ICMLC rsquo10) vol 3 pp 1273ndash1278 July 2010

[18] P P Talukdar and F Pereira ldquoExperiments in graph-based semi-supervised learning methods for class-instance acquisitionrdquo inProceedings of the 48th Annual Meeting of the Association forComputational Linguistics (ACL rsquo10) pp 1473ndash1481 July 2010

Submit your manuscripts athttpwwwhindawicom

Computer Games Technology

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Distributed Sensor Networks

International Journal of

Advances in

FuzzySystems

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014

International Journal of

ReconfigurableComputing

Hindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Applied Computational Intelligence and Soft Computing

thinspAdvancesthinspinthinsp

Artificial Intelligence

HindawithinspPublishingthinspCorporationhttpwwwhindawicom Volumethinsp2014

Advances inSoftware EngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Journal of

Computer Networks and Communications

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation

httpwwwhindawicom Volume 2014

Advances in

Multimedia

International Journal of

Biomedical Imaging

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

ArtificialNeural Systems

Advances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Computational Intelligence and Neuroscience

Industrial EngineeringJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Human-ComputerInteraction

Advances in

Computer EngineeringAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Page 5: Research Article Unsupervised Chunking Based on Graph ...downloads.hindawi.com/journals/tswj/2014/401943.pdf · Unsupervised Chunking Based on Graph Propagation from Bilingual Corpus

The Scientific World Journal 5

[Yixiang zhongyao xuanze]

[Yixiang cuowu xuanze]

[Yixiang mingzhi xuanze]

[Yixiang zhengque xuanze]

NP-INP-IImportantImportant

[Ta shi yi]

[Wo shi yi]

[Yinsu shi zheyang]

[Zhi shi zheyang] [Women shi nansheng]

[Buduan tigao]

[Buduan tansuo]

[Buduan shenru]

[Buduan bianhua]

ExploreVP-I

Figure 3 An example of similarity graph over trigram on labeled and unlabeled data

where forall119888119894isin 119881119888 119881119897

119888 120574119894(119910) and 119896

119894are defined as follows

120574119894(119910) = sum

119888119895isin119873(119888119894)

119908119894119895119902(119898minus1)

119894(119910) + 120582119880 (119910)

119896119894= 120582 + sum

119888119895isin119873(119888119894)

119908119894119895

(5)

This procedure will be processed 10 iterations in our experi-ment

4 Unsupervised Chunk Induction

In Section 3 the bilingual graph construction is describedsuch that English chunk tags can be projected to Chineseside by simple label propagation This relies on the high-confidence word alignments Many Chinese vertices couldnot be fully projected from the English words To comple-ment label propagation is used to further propagate amongthe Chinese trigram types This section introduces the usageof unsupervised approach to build a practical system inthe task of shallow parsing The implementation of how to

establish the chunk tag identification systemwill be discussedand presented in detail

41 Data Representation

411 Universal Chunk Tag The universal tag was first pro-posed by [11] that consists of twelve universal part-of-speechcategories for the sake of evaluating their cross-lingual POSprojection system for six different languages In this work wefollow the idea but focus on the universal chunk tags betweenEnglish and Chinese for several reasons First it is usefulfor building and evaluating unsupervised and cross-lingualchunk taggers Additionally taggers constructed based onuniversal tag set can result in a more reasonable comparisonacross different supervised chunking approaches Since twokinds of tagging standards are applied for the differentlanguages it is vacuous to state that ldquoshallow parsing forlanguage119860 ismuch harder than that for language119861rdquowhen thetag sets are incomparable Finally it also permits our modelto train the chunk taggers with a common tag set acrossmultiple languages which does not need tomaintain language

6 The Scientific World Journal

specific unification rules due to the differences in Treebankannotation guidelines

The following compares the definition of tags used in theEnglish Wall Street Journal (WSJ) [2] and the Chinese PennTreebank (CTB) 70 [12] that are used in our experiment

(i) The Phrasal Tag of Chinese In the present corpus CTB7each chunk is labeled with one syntactic category A tag set of13 categories is used to represent shallow parsing structuralinformation as follows

ADJPmdashadjective phrase phrase headed by an adjec-tiveADVPmdashadverbial phrase phrasal category headed byan adverbCLPmdashclassifiers phrase for example (QP (CD 一)(CLP (M系列))) (a series)DPmdashdeterminer phrase in CTB7 corpus a DP isa modifier of the NP if it occurs within a NP Forexample (NP (DP (DT 任何) (NP (NN 人))) (anypeople)DNPmdashphrase formed by XP plus (DEG的) (lsquos) thatmodifies a NP the XP can be an ADJP DP QP NPPP or LCPDVPmdashphrase formed by XP plus ldquo地 (-ly)rdquo thatmodifies a VPFRAGmdashfragment used to mark fragmented elementsthat cannot be built into a full structure by using nullcategoriesLCPmdashused to mark phrases formed by a localizerand its complement For example (LCP (NP (NR皖南) (NN 事变)) (LC 中)) (in the Southern AnhuiIncident)LSTmdashlist marker numbered list letters or numberswhich identify items in a list and their surroundingpunctuation is labeled as LSTNPmdashnoun phrases phrasal category that includes allconstituents depending on a head nounPPmdashpreposition phrase phrasal category headed bya prepositionQPmdashquantifier phrase used with NP for example(QP (CD五百) (CLP (M辆))) (500)VPmdashverb phrase phrasal category headed by a verb

(ii) The Phrasal Tag of English In the respect English WSJcorpus there are just eight categories NP (noun phrase)PP (prepositional phrase) VP (verb phrase) ADVP (adverbphrase) ADJP (adjective phrase) SBAR (subordinating con-junction) PRT (particle) and INTJ (interjection)

As we can see the criterion of Chinese CTB7 has a lotof significant differences from English WSJ corpus Unlikethe English Treebank the fragment of Chinese continents isnormally smaller A chunkwhich aligned to a base-NPphrasein the English side could be divided into a QP tag a CLPtag and an NP tag For example after chunking ldquo五百辆

Table 2 The description of universal chunk tags

Tag Description Words Example

NP Noun phrase DET + ADV +ADJ + NOUN The strange birds

PP Prepositionphrase TO + IN In between

VP Verb phrase ADV + VB Was lookingADVP Adverb phrase ADV Also

ADJP Adjective phrase CONJ + ADV +ADJ Warm and cozy

SBAR Subordinatingconjunction IN Whether or not

车 (500 cars)rdquo will be tagged as (NP (QP (CD 五百) (CLP(M辆))) (NN车)) But at the English side ldquo500 carsrdquo will bejust denoted with an NP tag This nonuniform standard willresult in a mismatch projection during the alignment phaseand the difficulty of evaluation in the cross-lingual settingTo fix these problems mappings are applied in the universalchunking tag setThe categoryiesCLPDP andQP aremergedinto an NP phrase That is to say the phrase such as (NP (QP(CD五百) (CLP (M辆))) (NN车)) which corresponds tothe English NP chunk ldquo500 carsrdquo will be assigned with anNP tag in the universal chunk tag Additionally the phrasesDNP and DVP could be included in the ADJP and ADVPtags respectively

On the English side the occurrence of INTJ is 0according to statisticsThis evidence shows thatwe can ignorethe INTJ chunk tag Additionally the words belong to a PRTtag that is always regarded as a VP in Chinese For example(PRT (RP up)) (NP (DET the) (NNS stairs)) is aligned to ldquo上楼梯rdquo in Chinese where ldquo上rdquo is a VP

412 IBO2 Chunk Tag There are many ways to encodethe phrase structure of a chunk tag such as IBO1 IBO2IOE1 and IOE2 [13] The one used in this study is theIBO2 encoding This format ensures that all initial wordsof different base phrases will receive a B tag which is ableto identify the boundaries of each chunk In addition twoboundary types are defined as follows

(i) B-X represents the beginning of a chunk X(ii) I-X indicates a noninitial word in a phrase X(iii) O any words that are out of the boundary of any

chunks

Hence the input Chinese sentence can be representedusing these notations as follows

去年 (NP-B)实现 (VP-B)进出口 (NP-B)总值 (NP-I) 达 (VP-B) 一千零九十八点二亿 (NP-B) 美元(NP-I)

∘(O)

Based on the above discussion a tag set that consistsof six universal chunk categories is proposed as shown inTable 2 While there are a lot of controversies about universaltags and what the exact tag set should be used these six

The Scientific World Journal 7

chunk categories are sufficient to embrace the most frequentchunk tags that exist in English and Chinese Furthermorea mapping from fine-grained chunk tags for these two kindlanguages to the universal chunk set has been developedHence a dataset consisting of common chunk tag for Englishand Chinese is achieved by combining with the originalTreebank data and the universal chunk tag set and mappingbetween these two languages

42 Chunk Induction The label distribution of Chinese wordtypes 119909 can be computed by marginalizing the chunk tagdistribution of trigram types 119888

119894= 119909minus11199090119909+1

over the left andright context as follows

119901 (119910 | 119909) =

sum119909minus1119909+1

119902119894(119910)

sum119909minus1119909+1 119910

1015840 119902119894(1199101015840)

(6)

Then a set of possible tags 119905119909(119910) is extracted through a way

that eliminates labels whose probability is below a thresholdvalue 120591

119905119909(119910) =

1 if 119901 (119910 | 119909) ge 120591

0 otherwise(7)

The way how to choose 120591 will be described in Section 5Additionally the vector 119905

119909will cover every word in the

Chinese trigram types and will be employed as features in theunsupervised Chinese shallow parsing

Similar to the work of [14 15] our chunk inductionmodel is constructed on the feature-based hidden Markovmodel (HMM) A chunking HMM generates a sequenceof words in order In each generation step based on thecurrent chunk tag 119911

119894conditionally an observed word 119909

119894and a

hidden successor chunk tag 119911119894+1

are generated independentlyThere are two types of conditional distributions in themodel emission and transition probabilities which are bothmultinomial probability distributionsGiven a sentence119909 anda state sequence 119911 a first-order Markov model defines a jointdistribution as

119875120579= (119883 = 119909 119885 = 119911) = 119875

120579(1198851= 1199111)

sdot

|119909|

prod

119894=1

119875120579(119885119894+1

= 119911119894+1

| 119885119894= 119911119894)

sdot 119875120579(119883119894= 119909119894| 119885119894= 119911119894)

(8)

where the second factor is the transition probability and the third factor is the emission probability. Different from the conventional Markov model, the feature-based model replaces the emission part with a log-linear model, such that

$$ P_\theta(X = x \mid Z = z) = \frac{\exp\left(\theta^{\top} f(x, z)\right)}{\sum_{x' \in \mathrm{Val}(X)} \exp\left(\theta^{\top} f(x', z)\right)}, \quad (9) $$

where Val(X) corresponds to the entire Chinese vocabulary. The advantage of using this locally normalized log-linear model is the ability to look at various aspects of the observation x, incorporating features of the observations.

Table 3: Feature templates used in unsupervised chunking.

Basic: (x = ·, z = ·)
Contains digit: checks whether x contains a digit, conjoined with z: (contains_digit(x) = ·, z = ·)
Contains hyphen: (contains_hyphen(x) = ·, z = ·)
Suffix: indicator features for character suffixes of up to length 1 present in x
Prefix: indicator features for character prefixes of up to length 1 present in x
POS tag: indicator feature for the word POS assigned to x

The model is then trained by optimizing the following objective function:

$$ \mathcal{L}(\theta) = \sum_{i=1}^{N} \log \sum_{z} P_\theta\left(X = x^{(i)}, Z = z\right) - C \lVert \theta \rVert_2^2 \quad (10) $$

It is worth noting that this function marginalizes over all possible configurations z for every sentence x^{(i)}, which leads to a nonconvex objective. We optimize this objective function using L-BFGS, a quasi-Newton method proposed by Liu and Nocedal [16]. Evidence from past experiments shows that this direct gradient approach performs better than a feature-enhanced modification of EM (expectation-maximization).
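The actual training in this paper is done with the Ark toolkit (Section 5.5). The following self-contained Python sketch only illustrates how Eqs. (8)-(10) fit together: a log-linear emission (here restricted to the basic (state, word) indicator template), a forward-algorithm marginal likelihood, and L-BFGS on the regularized objective. The toy vocabulary, sentences, and parameterization are invented; gradients are left to finite differences, which is adequate at this scale.

```python
import numpy as np
from scipy.optimize import minimize

# Toy setup: K hidden chunk states over a tiny vocabulary. Only the
# "basic" (state, word) indicator template is used for emissions here;
# the richer templates of Table 3 would simply add more feature columns.
WORDS = ["去年", "实现", "进出口", "总值", "达"]
K, V = 3, len(WORDS)
SENTS = [[0, 1, 2, 3], [2, 3, 4]]  # sentences as word indices

def emission_log_probs(theta_e):
    """Eq. (9): locally normalized log-linear emissions. With pure
    (state, word) indicators, the softmax over Val(X) is a row softmax."""
    logits = theta_e.reshape(K, V)
    return logits - np.logaddexp.reduce(logits, axis=1, keepdims=True)

def log_marginal(sent, log_pi, log_A, log_B):
    """Forward algorithm: log of the sum of Eq. (8) over all state paths."""
    alpha = log_pi + log_B[:, sent[0]]
    for w in sent[1:]:
        alpha = np.logaddexp.reduce(alpha[:, None] + log_A, axis=0) + log_B[:, w]
    return np.logaddexp.reduce(alpha)

def neg_objective(theta, C=1.0):
    """Negative of Eq. (10): regularized marginal log-likelihood."""
    pi, A, B = np.split(theta, [K, K + K * K])
    log_pi = pi - np.logaddexp.reduce(pi)
    log_A = A.reshape(K, K)
    log_A = log_A - np.logaddexp.reduce(log_A, axis=1, keepdims=True)
    log_B = emission_log_probs(B)
    ll = sum(log_marginal(s, log_pi, log_A, log_B) for s in SENTS)
    return -(ll - C * np.dot(theta, theta))

rng = np.random.default_rng(0)
theta0 = rng.normal(scale=0.1, size=K + K * K + K * V)
# L-BFGS on the nonconvex objective; finite-difference gradients suffice
# for this toy size.
result = minimize(neg_objective, theta0, method="L-BFGS-B")
print("final objective:", result.fun)
```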

Moreover, this state-of-the-art model has the further advantage that it is easy to experiment with various ways of incorporating constraint features into the log-linear model. This feature function f_t consists of the relevant information extracted from the smoothed graph and eliminates the hidden states that are inconsistent with the threshold vector t_x.

4.3. Unsupervised Chunking Features. The feature selection process in the unsupervised chunking model does affect the identification result. Feature selection is the task of identifying an optimal feature subset that best represents the underlying classification problem by filtering out irrelevant and redundant features. Unlike the CoNLL-2000 shared supervised chunking task, which uses the WSJ (Wall Street Journal) corpus as background knowledge to construct the chunking model, our corpus is composed of words only and does not contain any chunk information, so we must form an optimal feature subset for the underlying chunk patterns ourselves. We use the following feature templates. These are all coarse features on emission contexts, extracted from words with certain orthographic properties; only the basic feature is used for transitions. For any emission context with word x and tag z, we construct the feature templates shown in Table 3; a minimal sketch of these templates follows.
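The sketch below is a hypothetical rendering of the Table 3 templates for one emission context; the function name and feature-name scheme are invented for illustration.

```python
def emission_features(word, tag, pos=None):
    """Instantiate the Table 3 templates for one emission context (x, z).
    Returns indicator-feature descriptors; `pos` is the word's POS tag
    if known."""
    feats = [
        ("basic", word, tag),
        ("contains_digit", any(ch.isdigit() for ch in word), tag),
        ("contains_hyphen", "-" in word, tag),
        ("prefix", word[:1]),          # character prefix of length 1
        ("suffix", word[-1:]),         # character suffix of length 1
    ]
    if pos is not None:
        feats.append(("pos", pos))
    return feats

# The word from Box 1:
print(emission_features("建筑业", "NP-I", pos="NN"))
# [('basic', '建筑业', 'NP-I'), ('contains_digit', False, 'NP-I'),
#  ('contains_hyphen', False, 'NP-I'), ('prefix', '建'), ('suffix', '业'),
#  ('pos', 'NN')]
```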

Box 1 illustrates an example of the feature subset. The feature set of each word in a sentence is a vector of 6 dimensions, whose components are the values of the corresponding features, and the label indicates which chunk label the word should belong to.


Box 1: Example of the feature template.

Sentence: 中国/NR/NP-B 建筑业/NN/NP-I 对/P/PP-B 外/NN/NP-B 开放/VV/VP-B 呈现/VV/VP-B 新/JJ/NP-B 格局/NN/NP-I
Word: 建筑业
Instance: (建筑业, NP-I), (N, NP-I), (N, NP-I), (建), (业), NN

5. Experiments and Results

Before presenting the results of our experiments, the evaluation metrics, the datasets, and the baselines used for comparison are first described.

5.1. Evaluation Metrics. The metrics for evaluating the NP chunking model are the precision rate, the recall rate, and their harmonic mean, the F1-score. The tagging accuracy is defined as

$$ \text{Tagging Accuracy} = \frac{\text{the number of correctly tagged words}}{\text{the number of words}}. \quad (11) $$

Precision measures the percentage of labeled NP chunks that are correct; a tagged word counts as correct only if both the boundary of the NP chunk and the label are correct. Precision is therefore defined as

$$ \text{Precision} = \frac{\text{the number of correctly proposed tagged words}}{\text{the number of proposed chunk tags}}. \quad (12) $$

Recall measures the percentage of NP chunks present in the input sentence that are correctly identified by the system. Recall is defined as

$$ \text{Recall} = \frac{\text{the number of correctly proposed tagged words}}{\text{the number of correct chunk tags}}. \quad (13) $$

The F-measure combines the previous two measures into one metric. The F-score is defined as

$$ F_\beta\text{-score} = \frac{(\beta^2 + 1) \times \text{Recall} \times \text{Precision}}{\beta^2 \times \text{Recall} + \text{Precision}}, \quad (14) $$

where β is a parameter that weights the importance of recall and precision; when β = 1, precision and recall are equally important.
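A short sketch of Eqs. (12)-(14) over labeled chunk spans, assuming gold and proposed chunks are represented as (start, end, label) triples (a representation chosen here for illustration):

```python
def chunk_prf(gold, proposed, beta=1.0):
    """Precision, recall, and F_beta over labeled chunks (Eqs. (12)-(14)).

    `gold` and `proposed` are sets of (start, end, label) spans, so a chunk
    counts as correct only with both boundaries and the label right."""
    correct = len(gold & proposed)
    precision = correct / len(proposed) if proposed else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f = (beta**2 + 1) * precision * recall / (beta**2 * recall + precision)
    return precision, recall, f

gold = {(0, 1, "NP"), (1, 2, "VP"), (2, 4, "NP")}
pred = {(0, 1, "NP"), (1, 2, "NP"), (2, 4, "NP")}
print(chunk_prf(gold, pred))  # (0.667, 0.667, 0.667) up to rounding
```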

5.2. Dataset. To facilitate the bilingual graph construction and the unsupervised chunking experiment, two kinds of datasets are utilized in this work: (1) a monolingual Treebank for Chinese chunk induction and (2) a large parallel corpus

Table 4: (a) English-Chinese parallel corpus; (b) graph instance; (c) Chinese unlabeled dataset (CTB7 corpora).

(a)
Number of sentence pairs: 10000 | Number of seeds: 27940 | Number of words: 31678

(b)
Number of sentences: 17617 | Number of vertices: 185441

(c)
Dataset | Source | Number of sentences
Training dataset | Xinhua 1–321 | 7617
Testing dataset | Xinhua 363–403 | 912

of English and Chinese for bilingual graph construction and label propagation. For the monolingual Treebank data, we rely on the Penn Chinese Treebank 7.0 (CTB7) [12]. It consists of over one million words of annotated and parsed text from Chinese newswire, magazines, various broadcast news and broadcast conversation programs, web newsgroups, and weblogs. The parallel data come from the UM corpus [17], which contains 100,000 English-Chinese sentence pairs. The training and testing sets are defined for the experiment, and their corresponding statistics are shown in Table 4. In addition, Table 4 also shows the details of the data used for bilingual graph construction, including the parallel corpus for the tag projection and the label propagation over Chinese trigrams.

5.3. Chunk Tagset and HMM States. As described above, the universal chunk tag set is employed in our experiment. This set U consists of the following six coarse-grained tags: NP (noun phrase), PP (prepositional phrase), VP (verb phrase), ADVP (adverb phrase), ADJP (adjective phrase), and SBAR (subordinating conjunction). Although there might be some controversy about the exact definition of such a tag set, these 6 categories cover the most frequent chunk tags that exist in one form or another in both English and Chinese corpora.

For each language under consideration, a mapping from the fine-grained, language-specific chunk tags in the Treebank to the universal chunk tags is provided. Hence, the unsupervised chunking model is trained on datasets labeled with the universal chunk tags.


Table 5: Chunk tagging evaluation results for various baselines and the proposed graph-based model.

Tag | Feature-HMM (Pre / Rec / F1) | Projection (Pre / Rec / F1) | Graph-based (Pre / Rec / F1)
NP | 0.60 / 0.67 / 0.63 | 0.67 / 0.72 / 0.68 | 0.81 / 0.82 / 0.81
VP | 0.56 / 0.51 / 0.53 | 0.61 / 0.57 / 0.59 | 0.78 / 0.74 / 0.76
PP | 0.36 / 0.28 / 0.32 | 0.44 / 0.31 / 0.36 | 0.60 / 0.51 / 0.55
ADVP | 0.40 / 0.46 / 0.43 | 0.45 / 0.52 / 0.48 | 0.64 / 0.68 / 0.66
ADJP | 0.47 / 0.53 / 0.50 | 0.48 / 0.58 / 0.51 | 0.67 / 0.71 / 0.69
SBAR | 0.00 / 0.00 / 0.00 | 0.50 / 1.0 / 0.66 | 0.50 / 1.0 / 0.66
All | 0.49 / 0.51 / 0.50 | 0.57 / 0.62 / 0.59 | 0.73 / 0.75 / 0.74

In this work, the number of latent HMM states for Chinese is fixed to a constant, namely the number of universal chunk tags.

5.4. Various Models. To probe our approach with a thorough analysis, we evaluated two baselines in addition to our graph-based algorithm. We were intentionally lenient with these two baselines.

(i) Feature-HMM. The first baseline is the vanilla feature-HMM proposed by [9], which applies L-BFGS to estimate the model parameters and uses a greedy many-to-1 mapping for evaluation.

(ii) Projection. The direct projection serves as the second baseline. It incorporates the bilingual information by projecting chunk tags from the English side to the aligned Chinese text; several rules were added to fix the problem of unaligned words. We trained a supervised feature-HMM with the same number of sentences from the bitext as in the Treebank. A minimal sketch of the projection step follows the model descriptions below.

(iii) Our Graph-Based Model. The full model uses both stages of label propagation (3) before extracting the constraint features. As a result, the label distributions of all Chinese word types were added to the constraint features.
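For illustration, here is a minimal sketch of the direct projection used by baseline (ii), assuming word alignments are given as index pairs; it omits the paper's extra heuristic rules for unaligned words, and all names and data are invented.

```python
def project_tags(en_tags, alignments):
    """Project chunk tags from English tokens to aligned Chinese tokens.

    `en_tags` maps English token index -> chunk tag; `alignments` is a
    list of (en_index, zh_index) pairs. Unaligned Chinese tokens are left
    untagged here; the paper adds heuristic rules to repair such cases."""
    zh_tags = {}
    for en_i, zh_i in alignments:
        zh_tags.setdefault(zh_i, en_tags[en_i])  # keep the first projection
    return zh_tags

# "up the stairs" -> "上 楼梯", using the PRT-as-VP convention of Section 4:
en = {0: "VP", 1: "NP", 2: "NP"}                   # up / the / stairs
print(project_tags(en, [(0, 0), (1, 1), (2, 1)]))  # {0: 'VP', 1: 'NP'}
```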

5.5. Experimental Setup. The experimental platform is implemented on top of two toolkits: Junto [18] and Ark [14, 15]. Junto is a Java-based label propagation toolkit used for the bilingual graph construction; Ark is a Java-based package used for the feature-HMM training and testing.

In the feature-HMM training, a few hyperparameters need to be set to minimize the number of free parameters in the model. C = 1.0 was used as the regularization constant in (10), and L-BFGS was run for 1000 iterations. Several threshold values of τ for extracting the vector t_x were tried, and it was found that 0.2 works best. This indicates that τ = 0.2 is a reasonable choice for Chinese trigram types.

5.6. Results. Table 5 shows the complete set of results. As expected, the projection baseline is better than the feature-HMM, as it is able to benefit from the bilingual information; it improves upon the monolingual baseline by 12% in F1-score. Among the unsupervised approaches, the feature-HMM achieves only 50% F1-score on the universal chunk tags. Overall, the graph-based model outperforms both baselines, and the improvement of the feature-HMM under the graph-based setting is statistically significant with respect to the other models. The graph-based model performs 24% better than the state-of-the-art feature-HMM and 12% better than the direct projection model.

6. Conclusion

This paper presents an unsupervised graph-based approach to Chinese chunking that uses label propagation to project chunk information across languages. Our results suggest that it is possible to learn accurate chunk tags for unlabeled text from the bitext, that is, data paired with parallel English text. In the future, we will pursue this alternative graph-based unsupervised approach to chunking for languages that lack ready-to-use resources.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

The authors are grateful to the Science and Technology Development Fund of Macau and the Research Committee of the University of Macau (Grant nos. MYRG076(Y1-L2)-FST13-WF and MYRG070(Y1-L2)-FST12-CS) for the funding support for their research.

References

[1] R. Koeling, "Chunking with maximum entropy models," in Proceedings of the 2nd Workshop on Learning Language in Logic and the 4th Conference on Computational Natural Language Learning, vol. 7, pp. 139-141, 2000.

[2] M. P. Marcus, M. A. Marcinkiewicz, and B. Santorini, "Building a large annotated corpus of English: the Penn Treebank," Computational Linguistics, vol. 19, no. 2, pp. 313-330, 1993.

[3] D. Yarowsky and G. Ngai, "Inducing multilingual POS taggers and NP bracketers via robust projection across aligned corpora," in Proceedings of the 2nd Meeting of the North American Chapter of the Association for Computational Linguistics (NAACL '01), pp. 200-207, 2001.

[4] X. Zhu, Z. Ghahramani, and J. Lafferty, "Semi-supervised learning using Gaussian fields and harmonic functions," in Proceedings of the 20th International Conference on Machine Learning (ICML '03), vol. 3, pp. 912-919, August 2003.

[5] Y. Altun, M. Belkin, and D. A. McAllester, "Maximum margin semi-supervised learning for structured variables," in Advances in Neural Information Processing Systems, pp. 33-40, 2005.

[6] A. Subramanya, S. Petrov, and F. Pereira, "Efficient graph-based semi-supervised learning of structured tagging models," in Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 167-176, October 2010.

[7] F. J. Och and H. Ney, "A systematic comparison of various statistical alignment models," Computational Linguistics, vol. 29, no. 1, pp. 19-51, 2003.

[8] A. Subramanya and J. Bilmes, "Semi-supervised learning with measure propagation," The Journal of Machine Learning Research, vol. 12, pp. 3311-3370, 2011.

[9] D. Das and N. A. Smith, "Semi-supervised frame-semantic parsing for unknown predicates," in Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics, pp. 1435-1444, June 2011.

[10] S. Baluja, R. Seth, D. Sivakumar, et al., "Video suggestion and discovery for YouTube: taking random walks through the view graph," in Proceedings of the 17th International Conference on World Wide Web, pp. 895-904, April 2008.

[11] S. Petrov, D. Das, and R. McDonald, "A universal part-of-speech tagset," 2011, http://arxiv.org/abs/1104.2086.

[12] N. Xue, F. Xia, F.-D. Chiou, and M. Palmer, "The Penn Chinese TreeBank: phrase structure annotation of a large corpus," Natural Language Engineering, vol. 11, no. 2, pp. 207-238, 2005.

[13] E. F. Tjong Kim Sang and H. Dejean, "Introduction to the CoNLL-2001 shared task: clause identification," in Proceedings of the Workshop on Computational Natural Language Learning, vol. 7, p. 8, 2001.

[14] T. Berg-Kirkpatrick, A. Bouchard-Cote, J. DeNero, and D. Klein, "Painless unsupervised learning with features," in Proceedings of the Human Language Technologies Conference of the North American Chapter of the Association for Computational Linguistics, pp. 582-590, Los Angeles, Calif, USA, June 2010.

[15] A. B. Goldberg and X. Zhu, "Seeing stars when there aren't many stars: graph-based semi-supervised learning for sentiment categorization," in Proceedings of the 1st Workshop on Graph Based Methods for Natural Language Processing, pp. 45-52, 2006.

[16] D. C. Liu and J. Nocedal, "On the limited memory BFGS method for large scale optimization," Mathematical Programming B, vol. 45, no. 3, pp. 503-528, 1989.

[17] L. Tian, F. Wong, and S. Chao, "An improvement of translation quality with adding key-words in parallel corpus," in Proceedings of the International Conference on Machine Learning and Cybernetics (ICMLC '10), vol. 3, pp. 1273-1278, July 2010.

[18] P. P. Talukdar and F. Pereira, "Experiments in graph-based semi-supervised learning methods for class-instance acquisition," in Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics (ACL '10), pp. 1473-1481, July 2010.


Page 6: Research Article Unsupervised Chunking Based on Graph ...downloads.hindawi.com/journals/tswj/2014/401943.pdf · Unsupervised Chunking Based on Graph Propagation from Bilingual Corpus

6 The Scientific World Journal

specific unification rules due to the differences in Treebankannotation guidelines

The following compares the definition of tags used in theEnglish Wall Street Journal (WSJ) [2] and the Chinese PennTreebank (CTB) 70 [12] that are used in our experiment

(i) The Phrasal Tag of Chinese In the present corpus CTB7each chunk is labeled with one syntactic category A tag set of13 categories is used to represent shallow parsing structuralinformation as follows

ADJPmdashadjective phrase phrase headed by an adjec-tiveADVPmdashadverbial phrase phrasal category headed byan adverbCLPmdashclassifiers phrase for example (QP (CD 一)(CLP (M系列))) (a series)DPmdashdeterminer phrase in CTB7 corpus a DP isa modifier of the NP if it occurs within a NP Forexample (NP (DP (DT 任何) (NP (NN 人))) (anypeople)DNPmdashphrase formed by XP plus (DEG的) (lsquos) thatmodifies a NP the XP can be an ADJP DP QP NPPP or LCPDVPmdashphrase formed by XP plus ldquo地 (-ly)rdquo thatmodifies a VPFRAGmdashfragment used to mark fragmented elementsthat cannot be built into a full structure by using nullcategoriesLCPmdashused to mark phrases formed by a localizerand its complement For example (LCP (NP (NR皖南) (NN 事变)) (LC 中)) (in the Southern AnhuiIncident)LSTmdashlist marker numbered list letters or numberswhich identify items in a list and their surroundingpunctuation is labeled as LSTNPmdashnoun phrases phrasal category that includes allconstituents depending on a head nounPPmdashpreposition phrase phrasal category headed bya prepositionQPmdashquantifier phrase used with NP for example(QP (CD五百) (CLP (M辆))) (500)VPmdashverb phrase phrasal category headed by a verb

(ii) The Phrasal Tag of English In the respect English WSJcorpus there are just eight categories NP (noun phrase)PP (prepositional phrase) VP (verb phrase) ADVP (adverbphrase) ADJP (adjective phrase) SBAR (subordinating con-junction) PRT (particle) and INTJ (interjection)

As we can see the criterion of Chinese CTB7 has a lotof significant differences from English WSJ corpus Unlikethe English Treebank the fragment of Chinese continents isnormally smaller A chunkwhich aligned to a base-NPphrasein the English side could be divided into a QP tag a CLPtag and an NP tag For example after chunking ldquo五百辆

Table 2 The description of universal chunk tags

Tag Description Words Example

NP Noun phrase DET + ADV +ADJ + NOUN The strange birds

PP Prepositionphrase TO + IN In between

VP Verb phrase ADV + VB Was lookingADVP Adverb phrase ADV Also

ADJP Adjective phrase CONJ + ADV +ADJ Warm and cozy

SBAR Subordinatingconjunction IN Whether or not

车 (500 cars)rdquo will be tagged as (NP (QP (CD 五百) (CLP(M辆))) (NN车)) But at the English side ldquo500 carsrdquo will bejust denoted with an NP tag This nonuniform standard willresult in a mismatch projection during the alignment phaseand the difficulty of evaluation in the cross-lingual settingTo fix these problems mappings are applied in the universalchunking tag setThe categoryiesCLPDP andQP aremergedinto an NP phrase That is to say the phrase such as (NP (QP(CD五百) (CLP (M辆))) (NN车)) which corresponds tothe English NP chunk ldquo500 carsrdquo will be assigned with anNP tag in the universal chunk tag Additionally the phrasesDNP and DVP could be included in the ADJP and ADVPtags respectively

On the English side the occurrence of INTJ is 0according to statisticsThis evidence shows thatwe can ignorethe INTJ chunk tag Additionally the words belong to a PRTtag that is always regarded as a VP in Chinese For example(PRT (RP up)) (NP (DET the) (NNS stairs)) is aligned to ldquo上楼梯rdquo in Chinese where ldquo上rdquo is a VP

412 IBO2 Chunk Tag There are many ways to encodethe phrase structure of a chunk tag such as IBO1 IBO2IOE1 and IOE2 [13] The one used in this study is theIBO2 encoding This format ensures that all initial wordsof different base phrases will receive a B tag which is ableto identify the boundaries of each chunk In addition twoboundary types are defined as follows

(i) B-X represents the beginning of a chunk X(ii) I-X indicates a noninitial word in a phrase X(iii) O any words that are out of the boundary of any

chunks

Hence the input Chinese sentence can be representedusing these notations as follows

去年 (NP-B)实现 (VP-B)进出口 (NP-B)总值 (NP-I) 达 (VP-B) 一千零九十八点二亿 (NP-B) 美元(NP-I)

∘(O)

Based on the above discussion a tag set that consistsof six universal chunk categories is proposed as shown inTable 2 While there are a lot of controversies about universaltags and what the exact tag set should be used these six

The Scientific World Journal 7

chunk categories are sufficient to embrace the most frequentchunk tags that exist in English and Chinese Furthermorea mapping from fine-grained chunk tags for these two kindlanguages to the universal chunk set has been developedHence a dataset consisting of common chunk tag for Englishand Chinese is achieved by combining with the originalTreebank data and the universal chunk tag set and mappingbetween these two languages

42 Chunk Induction The label distribution of Chinese wordtypes 119909 can be computed by marginalizing the chunk tagdistribution of trigram types 119888

119894= 119909minus11199090119909+1

over the left andright context as follows

119901 (119910 | 119909) =

sum119909minus1119909+1

119902119894(119910)

sum119909minus1119909+1 119910

1015840 119902119894(1199101015840)

(6)

Then a set of possible tags 119905119909(119910) is extracted through a way

that eliminates labels whose probability is below a thresholdvalue 120591

119905119909(119910) =

1 if 119901 (119910 | 119909) ge 120591

0 otherwise(7)

The way how to choose 120591 will be described in Section 5Additionally the vector 119905

119909will cover every word in the

Chinese trigram types and will be employed as features in theunsupervised Chinese shallow parsing

Similar to the work of [14 15] our chunk inductionmodel is constructed on the feature-based hidden Markovmodel (HMM) A chunking HMM generates a sequenceof words in order In each generation step based on thecurrent chunk tag 119911

119894conditionally an observed word 119909

119894and a

hidden successor chunk tag 119911119894+1

are generated independentlyThere are two types of conditional distributions in themodel emission and transition probabilities which are bothmultinomial probability distributionsGiven a sentence119909 anda state sequence 119911 a first-order Markov model defines a jointdistribution as

119875120579= (119883 = 119909 119885 = 119911) = 119875

120579(1198851= 1199111)

sdot

|119909|

prod

119894=1

119875120579(119885119894+1

= 119911119894+1

| 119885119894= 119911119894)

sdot 119875120579(119883119894= 119909119894| 119885119894= 119911119894)

(8)

where the second part represents the transition probabilityand the following part is emission probability which isdifferent from the conventional Markov model where thefeature-based model replaces the emission part with a log-linear model such that

119875120579(119883 = 119909 119885 = 119911) =

exp 120579Τ

119891 (119909 119911)

sum1199091015840isinVal(119883) exp 120579Τ119891 (119909 119911)

(9)

which corresponds to the entire Chinese vocabularyThe advantage of using this locally normalized log-

linear model is the ability of looking at various aspects of

Table 3 Feature template used in unsupervised chunking

Basic (119909= 119911 = )

Contains digit check if x contains digit and conjoin with 119911

(contains digit(119909)= 119911 = )Contains hypen contains hypen(119909)= 119911 =

Suffix indicator features for character suffixes of up tolength 1 present in 119909

Prefix indicator features for character prefixes of up tolength 1 present in 119909

Pos tag indicator feature for word POS assigned to 119909

the observation 119909 incorporating features of the observationsThen the model is trained by applying the following objectivefunction

L (120579) =

119873

sum

119894=1

logsum119885

119875 (120579) (119883 = 119909(119894)

119885 = 119911(119894)

) minus 1198621205792

(10)

It is worth noting that this function includes marginal-izing out all the sentence 119909rsquos and all possible configurations119911 which leads to a nonconvex objective We optimize thisobjective function by using L-BFGS a quasi-Newtonmethodproposed by Liu and Nocedal [16] The evidences of pastexperiments show that this direct gradient algorithm per-formed better than using a feature-enhanced modification ofthe EM (expectation-maximization)

Moreover this state-of-the-art model also has an advan-tage that makes it easy to experiment with various waysof incorporating the constraint feature into the log-linearmodel This feature function 119891

119905consists of the relevant

information extracted from the smooth graph and eliminatesthe hidden states which are inconsistent with the thresholdvector 119905

119909

43 Unsupervised Chunking Features The feature selectionprocess in the unsupervised chunking model does affect theidentification result Feature selection is the task to identifyan optimal feature subset that best represents the underlyingclassification problem through filtering the irrelevant andredundant ones Unlike the CoNLL-2000 we shared super-vised chunking task which is using WSJ (the Wall StreetJournal) corpus as background knowledge to construct thechunking model Our corpus is composed of only wordswhich do not contain any chunk information that enables usto form an optimal feature subset for the underlying chunkpattern We use the set of features as the following featuretemplates These are all coarse features on emission contextsthat are extracted from words with certain orthographicproperties Only the basic feature is used for transitions Forany emission context with word 119909 and tag 119911 we construct thefollowing feature templates as shown in Table 3

Box 1 illustrates the example about the feature subsetThe feature set of each word in a sentence is a vector of6 dimensions which are the values of the correspondingfeatures and the label indicates which kind of chunk labelshould the word belong to

8 The Scientific World Journal

Sentence中国NRNP-B建筑业NNNP-I对PPP-B外NNNP-B开放VVVP-B呈现VVVP-B新JJNP-B格局NNNP-I

Word建筑业Instance (建筑业 NP-I) (N NP-I) (N NP-I) (建)

(业) NN

Box 1 Example of feature template

5 Experiments and Results

Before presenting the results of our experiments the eval-uation metrics the datasets and the baselines used forcomparisons are firstly described

51 Evaluation Metrics The metrics for evaluating NPchunking model constitute precision rate recall rate andtheir harmonic mean 119865

1-score The tagging accuracy is

defined as

Tagging Accuracy

=The number of correct tagged words

The number of words

(11)

Precision measures the percentage of labeled NP chunksthat are correctThe number of correct tagged words includesboth the correct boundaries of NP chunk and the correctlabel The precision is therefore defined as

Precision

=The number of correct proposed tagged words

The number of correct chunk tags

(12)

Recall measures the percentage of NP chunks presentedin the input sentence that are correctly identified by thesystem Recall is depicted as below

Recall =The number of correct proposed tagged words

The number of current chunk tags

(13)

The 119865-measure illustrates a way to combine previous twomeasures into onemetricThe formula of 119865-sore is defined as

119865120573-score =

(1205732

+ 1) times Recall times Precision1205732 times Recall + Precision

(14)

where 120573 is a parameter that weights the importance of recalland precision when 120573 = 1 precision and recall are equallyimportant

52 Dataset To facilitate bilingual graph construction andunsupervised chunking experiment two kinds of data sets areutilized in this work (1) monolingual Treebank for Chinesechunk induction and (2) large amounts of parallel corpus

Table 4 (a) English-Chinese parallel corpus (b) Graph instance (c)Chinese unlabeled dataset CTB7 corpora

(a)

Number of sentence pairs Number of seeds Number of words10000 27940 31678

(b)

Number of sentences Number of vertices17617 185441

(c)

Dataset Source Number of sentencesTraining dataset Xinhua 1ndash321 7617Testing dataset Xinhua 363ndash403 912

of English and Chinese for bilingual graph construction andlabel propagation For themonolingual Treebank data we relyon Penn Chinese Treebank 70 (CTB7) [12] It consists of overonemillion words of annotated and parsed text fromChinesenewswire magazines various broadcast news and broadcastconversation programs web newsgroups and weblogs Theparallel data came from the UM corpus [17] which contains100000 pair of English-Chinese sentences The trainingand testing sets are defined for the experiment and theircorresponding statistic information is shown in Table 4 Inaddition Table 4 also shows the detailed information of dataused for bilingual graph construction including the parallelcorpus for the tag projection and the label propagation ofChinese trigram

53 Chunk Tagset and HMM States As described the uni-versal chunk tag set is employed in our experiment Thisset 119880 consists of the following six coarse-grained tags NP(noun phrase) PP (prepositional phrase) VP (verb phrase)ADVP (adverb phrase) ADJP (adjective phrase) and SBAR(subordinating conjunction) Although there might be somecontroversies about the exact definition of such a tag set these6 categories cover the most frequent chunk tags that exist inone form or another in both of English and Chinese corpora

For each kind of languages under consideration a map-ping from the fine-grained language specific chunk tags in theTreebank to the universal chunk tags is provided Hence themodel of unsupervised chunking is trained on the datasetslabeled with the universal chunk tags

The Scientific World Journal 9

Table 5 Chunk tagging evaluation results for various baselines andproposed graph-based model

Tag Feature-HMM Projection Graph-basedPre Rec F1 Pre Rec F1 Pre Rec F1

NP 060 067 063 067 072 068 081 082 081VP 056 051 053 061 057 059 078 074 076PP 036 028 032 044 031 036 060 051 055ADVP 040 046 043 045 052 048 064 068 066ADJP 047 053 050 048 058 051 067 071 069SBAR 000 000 000 050 10 066 050 10 066All 049 051 050 057 062 059 073 075 074

In this work the number of latent HMM states forChinese is fixed to be a constant which is the number of theuniversal chunk tags

54 VariousModels To comprehensively probe our approachwith a thorough analysis we evaluated two baselines inaddition to our graph-based algorithmWewere intentionallylenient with these two baselines

(i) Feature-HMM The first baseline is the vanilla feature-HMM proposed by [9] which apply L-BFGS to estimate themodel parameters and use a greedy many-to-1 mapping forevaluation

(ii) ProjectionThe direct projection serves as a second base-line It incorporates the bilingual information by projectingchunk tags from English side to the aligned Chinese textsSeveral rules were added to fix the unaligned problem Wetrained a supervised feature-HMMwith the same number ofsentences from the bitext as it is in the Treebank

(iii) Our Graph-Based Model The full model uses bothstages of label propagation (3) before extracting the constrainfeatures As a result the label distribution of all Chinese wordtypes was added in the constrain features

55 Experimental Setup Theexperimental platform is imple-mented based on two toolkits Junto [18] and Ark [1415] Junto is a Java-based label propagation toolkit for thebilingual graph construction Ark is a Java-based package forthe feature-HMM training and testing

In the feature-HMM training we need to set a fewhyperparameters to minimize the number of free parametersin the model 119862 = 10 was used as regularization constantin (10) and trained L-BFGS for 1000 iterations Severalthreshold values for 120591 to extract the vector 119905

119909were tried and it

was found that 02 works best It indicates that 120591 = 02 couldbe used for Chinese trigram types

56 Results Table 5 shows the complete set of results Asexpected the projection baseline is better than the feature-HMM for being able to benefit from the bilingual infor-mation It greatly improves upon the monolingual baselineby 12 on 119865

1-score Comparing among the unsupervised

approaches the feature-HMM achieves only 50 of 1198651-

score on the universal chunk tags Overall the graph-basedmodel outperforms these two baselines That is to say theimprovement of feature-HMM with the graph-based settingis statistically significantwith respect to othermodels Graph-based modal performs 24 better than the state-of-the-art feature-HMM and 12 better than the direct projectionmodel

6 Conclusion

This paper presents an unsupervised graph-based Chinesechunking by using label propagation for projecting chunkinformation across languages Our results suggest that it ispossible for unlabeled text to learn accurate chunk tags fromthe bitext the data which has parallel English text In thefeature we propose an alternative graph-based unsupervisedapproach on chunking for languages that are lacking ready-to-use resource

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

Acknowledgments

The authors are grateful to the Science and TechnologyDevelopment Fund of Macau and the Research Committeeof the University of Macau (Grant nos MYRG076(Y1-L2)-FST13-WF andMYRG070(Y1-L2)-FST12-CS) for the fundingsupport for their research

References

[1] R Koeling ldquoChunking with maximum entropy modelsrdquo inProceedings of the 2ndWorkshop on Learning Language in Logicand the 4th Conference on Computational Natural LanguageLerning vol 7 pp 139ndash141 2000

[2] M P Marcus M A Marcinkiewicz and B Santorini ldquoBuildinga large annotated corpus of English the Penn TreebankrdquoComputational Linguistics vol 19 no 2 pp 313ndash330 1993

[3] D Yarowsky and G Ngai ldquoInducing multilingual POS taggersandNPbracketers via robust projection across aligned corporardquoin Proceedings of the 2ndMeeting of the North American Chapterof the Association for Computational Linguistics (NAACL rsquo01)pp 200ndash207 2001

[4] X Zhu Z Ghahramani and J Lafferty ldquoSemi-supervisedlearning using Gaussian fields and harmonic functionsrdquo inProceedings of the 20th International Conference on MachineLearning (ICML rsquo03) vol 3 pp 912ndash919 August 2003

[5] Y Altun M Belkin and D A Mcallester ldquoMaximum marginsemi-supervised learning for structured variablesrdquo in Advancesin Neural Information Processing Systems pp 33ndash40 2005

[6] A Subramanya S Petrov and F Pereira ldquoEfficient graph-based semi-supervised learning of structured tagging modelsrdquoinProceedings of theConference onEmpiricalMethods inNaturalLanguage Processing pp 167ndash176 October 2010

10 The Scientific World Journal

[7] F J Och and H Ney ldquoA systematic comparison of variousstatistical alignmentmodelsrdquoComputational Linguistics vol 29no 1 pp 19ndash51 2003

[8] A Subramanya and J Bilmes ldquoSemi-supervised learningwith measure propagationrdquo The Journal of Machine LearningResearch vol 12 pp 3311ndash3370 2011

[9] D Das and N A Smith ldquoSemi-supervised frame-semanticparsing for unknown predicatesrdquo in Proceedings of the 49thAnnualMeeting of the Association for Computational Linguisticspp 1435ndash1444 June 2011

[10] S Baluja R Seth D S Sivakumar et al ldquoVideo suggestion anddiscovery for you tube taking random walks through the viewgraphrdquo in Proceedings of the 17th International Conference onWorld Wide Web pp 895ndash904 April 2008

[11] S Petrov D Das and RMcDonald ldquoA universal part-of-speechtagsetrdquo 2011 httparxivorgabs11042086

[12] N Xue F Xia F-D Chiou and M Palmer ldquoThe Penn ChineseTreeBank phrase structure annotation of a large corpusrdquoNatural Language Engineering vol 11 no 2 pp 207ndash238 2005

[13] E F Tjong K Sang and H Dejean ldquoIntroduction to theCoNLL- shared task clause identificationrdquo in Proceedings of theWorkshop on Computational Natural Language Learning vol 7p 8 2001

[14] T Berg-Kirkpatrick A Bouchard-Cote J deNero andDKleinldquoPainless unsupervised learning with featuresrdquo in Proceedingsof the Human Language Technologies Conference of the NorthAmerican Chapter of the Association for Computational Linguis-tics pp 582ndash590 Los Angeles Calif USA June 2010

[15] A B Goldberg andX Zhu ldquoSeeing stars when there arenrsquotmanystars graph-based semi-supervised learning for sentiment cat-egorizationrdquo in Proceedings of the 1st Workshop on Graph BasedMethods for Natural Language Processing pp 45ndash52 2006

[16] DC Liu and JNocedal ldquoOn the limitedmemoryBFGSmethodfor large scale optimizationrdquoMathematical Programming B vol45 no 3 pp 503ndash528 1989

[17] L Tian F Wong and S Chao ldquoAn improvement of translationquality with adding key-words in parallel corpusrdquo in Proceed-ings of the International Conference on Machine Learning andCybernetics (ICMLC rsquo10) vol 3 pp 1273ndash1278 July 2010

[18] P P Talukdar and F Pereira ldquoExperiments in graph-based semi-supervised learning methods for class-instance acquisitionrdquo inProceedings of the 48th Annual Meeting of the Association forComputational Linguistics (ACL rsquo10) pp 1473ndash1481 July 2010

Submit your manuscripts athttpwwwhindawicom

Computer Games Technology

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Distributed Sensor Networks

International Journal of

Advances in

FuzzySystems

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014

International Journal of

ReconfigurableComputing

Hindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Applied Computational Intelligence and Soft Computing

thinspAdvancesthinspinthinsp

Artificial Intelligence

HindawithinspPublishingthinspCorporationhttpwwwhindawicom Volumethinsp2014

Advances inSoftware EngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Journal of

Computer Networks and Communications

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation

httpwwwhindawicom Volume 2014

Advances in

Multimedia

International Journal of

Biomedical Imaging

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

ArtificialNeural Systems

Advances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Computational Intelligence and Neuroscience

Industrial EngineeringJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Human-ComputerInteraction

Advances in

Computer EngineeringAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Page 7: Research Article Unsupervised Chunking Based on Graph ...downloads.hindawi.com/journals/tswj/2014/401943.pdf · Unsupervised Chunking Based on Graph Propagation from Bilingual Corpus

The Scientific World Journal 7

chunk categories are sufficient to embrace the most frequentchunk tags that exist in English and Chinese Furthermorea mapping from fine-grained chunk tags for these two kindlanguages to the universal chunk set has been developedHence a dataset consisting of common chunk tag for Englishand Chinese is achieved by combining with the originalTreebank data and the universal chunk tag set and mappingbetween these two languages

42 Chunk Induction The label distribution of Chinese wordtypes 119909 can be computed by marginalizing the chunk tagdistribution of trigram types 119888

119894= 119909minus11199090119909+1

over the left andright context as follows

119901 (119910 | 119909) =

sum119909minus1119909+1

119902119894(119910)

sum119909minus1119909+1 119910

1015840 119902119894(1199101015840)

(6)

Then a set of possible tags 119905119909(119910) is extracted through a way

that eliminates labels whose probability is below a thresholdvalue 120591

119905119909(119910) =

1 if 119901 (119910 | 119909) ge 120591

0 otherwise(7)

The way how to choose 120591 will be described in Section 5Additionally the vector 119905

119909will cover every word in the

Chinese trigram types and will be employed as features in theunsupervised Chinese shallow parsing

Similar to the work of [14 15] our chunk inductionmodel is constructed on the feature-based hidden Markovmodel (HMM) A chunking HMM generates a sequenceof words in order In each generation step based on thecurrent chunk tag 119911

119894conditionally an observed word 119909

119894and a

hidden successor chunk tag 119911119894+1

are generated independentlyThere are two types of conditional distributions in themodel emission and transition probabilities which are bothmultinomial probability distributionsGiven a sentence119909 anda state sequence 119911 a first-order Markov model defines a jointdistribution as

119875120579= (119883 = 119909 119885 = 119911) = 119875

120579(1198851= 1199111)

sdot

|119909|

prod

119894=1

119875120579(119885119894+1

= 119911119894+1

| 119885119894= 119911119894)

sdot 119875120579(119883119894= 119909119894| 119885119894= 119911119894)

(8)

where the second part represents the transition probabilityand the following part is emission probability which isdifferent from the conventional Markov model where thefeature-based model replaces the emission part with a log-linear model such that

119875120579(119883 = 119909 119885 = 119911) =

exp 120579Τ

119891 (119909 119911)

sum1199091015840isinVal(119883) exp 120579Τ119891 (119909 119911)

(9)

which corresponds to the entire Chinese vocabularyThe advantage of using this locally normalized log-

linear model is the ability of looking at various aspects of

Table 3 Feature template used in unsupervised chunking

Basic (119909= 119911 = )

Contains digit check if x contains digit and conjoin with 119911

(contains digit(119909)= 119911 = )Contains hypen contains hypen(119909)= 119911 =

Suffix indicator features for character suffixes of up tolength 1 present in 119909

Prefix indicator features for character prefixes of up tolength 1 present in 119909

Pos tag indicator feature for word POS assigned to 119909

the observation 119909 incorporating features of the observationsThen the model is trained by applying the following objectivefunction

L (120579) =

119873

sum

119894=1

logsum119885

119875 (120579) (119883 = 119909(119894)

119885 = 119911(119894)

) minus 1198621205792

(10)

It is worth noting that this function includes marginal-izing out all the sentence 119909rsquos and all possible configurations119911 which leads to a nonconvex objective We optimize thisobjective function by using L-BFGS a quasi-Newtonmethodproposed by Liu and Nocedal [16] The evidences of pastexperiments show that this direct gradient algorithm per-formed better than using a feature-enhanced modification ofthe EM (expectation-maximization)

Moreover this state-of-the-art model also has an advan-tage that makes it easy to experiment with various waysof incorporating the constraint feature into the log-linearmodel This feature function 119891

119905consists of the relevant

information extracted from the smooth graph and eliminatesthe hidden states which are inconsistent with the thresholdvector 119905

119909

43 Unsupervised Chunking Features The feature selectionprocess in the unsupervised chunking model does affect theidentification result Feature selection is the task to identifyan optimal feature subset that best represents the underlyingclassification problem through filtering the irrelevant andredundant ones Unlike the CoNLL-2000 we shared super-vised chunking task which is using WSJ (the Wall StreetJournal) corpus as background knowledge to construct thechunking model Our corpus is composed of only wordswhich do not contain any chunk information that enables usto form an optimal feature subset for the underlying chunkpattern We use the set of features as the following featuretemplates These are all coarse features on emission contextsthat are extracted from words with certain orthographicproperties Only the basic feature is used for transitions Forany emission context with word 119909 and tag 119911 we construct thefollowing feature templates as shown in Table 3

Box 1 illustrates the example about the feature subsetThe feature set of each word in a sentence is a vector of6 dimensions which are the values of the correspondingfeatures and the label indicates which kind of chunk labelshould the word belong to

8 The Scientific World Journal

Sentence中国NRNP-B建筑业NNNP-I对PPP-B外NNNP-B开放VVVP-B呈现VVVP-B新JJNP-B格局NNNP-I

Word建筑业Instance (建筑业 NP-I) (N NP-I) (N NP-I) (建)

(业) NN

Box 1 Example of feature template

5 Experiments and Results

Before presenting the results of our experiments the eval-uation metrics the datasets and the baselines used forcomparisons are firstly described

51 Evaluation Metrics The metrics for evaluating NPchunking model constitute precision rate recall rate andtheir harmonic mean 119865

1-score The tagging accuracy is

defined as

Tagging Accuracy

=The number of correct tagged words

The number of words

(11)

Precision measures the percentage of labeled NP chunksthat are correctThe number of correct tagged words includesboth the correct boundaries of NP chunk and the correctlabel The precision is therefore defined as

Precision

=The number of correct proposed tagged words

The number of correct chunk tags

(12)

Recall measures the percentage of NP chunks presentedin the input sentence that are correctly identified by thesystem Recall is depicted as below

Recall =The number of correct proposed tagged words

The number of current chunk tags

(13)

The 119865-measure illustrates a way to combine previous twomeasures into onemetricThe formula of 119865-sore is defined as

119865120573-score =

(1205732

+ 1) times Recall times Precision1205732 times Recall + Precision

(14)

where 120573 is a parameter that weights the importance of recalland precision when 120573 = 1 precision and recall are equallyimportant

52 Dataset To facilitate bilingual graph construction andunsupervised chunking experiment two kinds of data sets areutilized in this work (1) monolingual Treebank for Chinesechunk induction and (2) large amounts of parallel corpus

Table 4 (a) English-Chinese parallel corpus (b) Graph instance (c)Chinese unlabeled dataset CTB7 corpora

(a)

Number of sentence pairs Number of seeds Number of words10000 27940 31678

(b)

Number of sentences Number of vertices17617 185441

(c)

Dataset Source Number of sentencesTraining dataset Xinhua 1ndash321 7617Testing dataset Xinhua 363ndash403 912

of English and Chinese for bilingual graph construction andlabel propagation For themonolingual Treebank data we relyon Penn Chinese Treebank 70 (CTB7) [12] It consists of overonemillion words of annotated and parsed text fromChinesenewswire magazines various broadcast news and broadcastconversation programs web newsgroups and weblogs Theparallel data came from the UM corpus [17] which contains100000 pair of English-Chinese sentences The trainingand testing sets are defined for the experiment and theircorresponding statistic information is shown in Table 4 Inaddition Table 4 also shows the detailed information of dataused for bilingual graph construction including the parallelcorpus for the tag projection and the label propagation ofChinese trigram

53 Chunk Tagset and HMM States As described the uni-versal chunk tag set is employed in our experiment Thisset 119880 consists of the following six coarse-grained tags NP(noun phrase) PP (prepositional phrase) VP (verb phrase)ADVP (adverb phrase) ADJP (adjective phrase) and SBAR(subordinating conjunction) Although there might be somecontroversies about the exact definition of such a tag set these6 categories cover the most frequent chunk tags that exist inone form or another in both of English and Chinese corpora

For each kind of languages under consideration a map-ping from the fine-grained language specific chunk tags in theTreebank to the universal chunk tags is provided Hence themodel of unsupervised chunking is trained on the datasetslabeled with the universal chunk tags

The Scientific World Journal 9

Table 5 Chunk tagging evaluation results for various baselines andproposed graph-based model

Tag Feature-HMM Projection Graph-basedPre Rec F1 Pre Rec F1 Pre Rec F1

NP 060 067 063 067 072 068 081 082 081VP 056 051 053 061 057 059 078 074 076PP 036 028 032 044 031 036 060 051 055ADVP 040 046 043 045 052 048 064 068 066ADJP 047 053 050 048 058 051 067 071 069SBAR 000 000 000 050 10 066 050 10 066All 049 051 050 057 062 059 073 075 074

In this work the number of latent HMM states forChinese is fixed to be a constant which is the number of theuniversal chunk tags

54 VariousModels To comprehensively probe our approachwith a thorough analysis we evaluated two baselines inaddition to our graph-based algorithmWewere intentionallylenient with these two baselines

(i) Feature-HMM The first baseline is the vanilla feature-HMM proposed by [9] which apply L-BFGS to estimate themodel parameters and use a greedy many-to-1 mapping forevaluation

(ii) ProjectionThe direct projection serves as a second base-line It incorporates the bilingual information by projectingchunk tags from English side to the aligned Chinese textsSeveral rules were added to fix the unaligned problem Wetrained a supervised feature-HMMwith the same number ofsentences from the bitext as it is in the Treebank

(iii) Our Graph-Based Model The full model uses bothstages of label propagation (3) before extracting the constrainfeatures As a result the label distribution of all Chinese wordtypes was added in the constrain features

55 Experimental Setup Theexperimental platform is imple-mented based on two toolkits Junto [18] and Ark [1415] Junto is a Java-based label propagation toolkit for thebilingual graph construction Ark is a Java-based package forthe feature-HMM training and testing

In the feature-HMM training we need to set a fewhyperparameters to minimize the number of free parametersin the model 119862 = 10 was used as regularization constantin (10) and trained L-BFGS for 1000 iterations Severalthreshold values for 120591 to extract the vector 119905

119909were tried and it

was found that 02 works best It indicates that 120591 = 02 couldbe used for Chinese trigram types

56 Results Table 5 shows the complete set of results Asexpected the projection baseline is better than the feature-HMM for being able to benefit from the bilingual infor-mation It greatly improves upon the monolingual baselineby 12 on 119865

1-score Comparing among the unsupervised

approaches the feature-HMM achieves only 50 of 1198651-

score on the universal chunk tags Overall the graph-basedmodel outperforms these two baselines That is to say theimprovement of feature-HMM with the graph-based settingis statistically significantwith respect to othermodels Graph-based modal performs 24 better than the state-of-the-art feature-HMM and 12 better than the direct projectionmodel

6 Conclusion

This paper presents an unsupervised graph-based Chinesechunking by using label propagation for projecting chunkinformation across languages Our results suggest that it ispossible for unlabeled text to learn accurate chunk tags fromthe bitext the data which has parallel English text In thefeature we propose an alternative graph-based unsupervisedapproach on chunking for languages that are lacking ready-to-use resource

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

Acknowledgments

The authors are grateful to the Science and TechnologyDevelopment Fund of Macau and the Research Committeeof the University of Macau (Grant nos MYRG076(Y1-L2)-FST13-WF andMYRG070(Y1-L2)-FST12-CS) for the fundingsupport for their research

References

[1] R Koeling ldquoChunking with maximum entropy modelsrdquo inProceedings of the 2ndWorkshop on Learning Language in Logicand the 4th Conference on Computational Natural LanguageLerning vol 7 pp 139ndash141 2000

[2] M P Marcus M A Marcinkiewicz and B Santorini ldquoBuildinga large annotated corpus of English the Penn TreebankrdquoComputational Linguistics vol 19 no 2 pp 313ndash330 1993

[3] D Yarowsky and G Ngai ldquoInducing multilingual POS taggersandNPbracketers via robust projection across aligned corporardquoin Proceedings of the 2ndMeeting of the North American Chapterof the Association for Computational Linguistics (NAACL rsquo01)pp 200ndash207 2001

[4] X Zhu Z Ghahramani and J Lafferty ldquoSemi-supervisedlearning using Gaussian fields and harmonic functionsrdquo inProceedings of the 20th International Conference on MachineLearning (ICML rsquo03) vol 3 pp 912ndash919 August 2003

[5] Y Altun M Belkin and D A Mcallester ldquoMaximum marginsemi-supervised learning for structured variablesrdquo in Advancesin Neural Information Processing Systems pp 33ndash40 2005

[6] A Subramanya S Petrov and F Pereira ldquoEfficient graph-based semi-supervised learning of structured tagging modelsrdquoinProceedings of theConference onEmpiricalMethods inNaturalLanguage Processing pp 167ndash176 October 2010

10 The Scientific World Journal

[7] F J Och and H Ney ldquoA systematic comparison of variousstatistical alignmentmodelsrdquoComputational Linguistics vol 29no 1 pp 19ndash51 2003

[8] A Subramanya and J Bilmes ldquoSemi-supervised learningwith measure propagationrdquo The Journal of Machine LearningResearch vol 12 pp 3311ndash3370 2011

[9] D Das and N A Smith ldquoSemi-supervised frame-semanticparsing for unknown predicatesrdquo in Proceedings of the 49thAnnualMeeting of the Association for Computational Linguisticspp 1435ndash1444 June 2011

[10] S Baluja R Seth D S Sivakumar et al ldquoVideo suggestion anddiscovery for you tube taking random walks through the viewgraphrdquo in Proceedings of the 17th International Conference onWorld Wide Web pp 895ndash904 April 2008

[11] S Petrov D Das and RMcDonald ldquoA universal part-of-speechtagsetrdquo 2011 httparxivorgabs11042086

[12] N Xue F Xia F-D Chiou and M Palmer ldquoThe Penn ChineseTreeBank phrase structure annotation of a large corpusrdquoNatural Language Engineering vol 11 no 2 pp 207ndash238 2005

[13] E F Tjong K Sang and H Dejean ldquoIntroduction to theCoNLL- shared task clause identificationrdquo in Proceedings of theWorkshop on Computational Natural Language Learning vol 7p 8 2001

[14] T Berg-Kirkpatrick A Bouchard-Cote J deNero andDKleinldquoPainless unsupervised learning with featuresrdquo in Proceedingsof the Human Language Technologies Conference of the NorthAmerican Chapter of the Association for Computational Linguis-tics pp 582ndash590 Los Angeles Calif USA June 2010

[15] A B Goldberg andX Zhu ldquoSeeing stars when there arenrsquotmanystars graph-based semi-supervised learning for sentiment cat-egorizationrdquo in Proceedings of the 1st Workshop on Graph BasedMethods for Natural Language Processing pp 45ndash52 2006

[16] DC Liu and JNocedal ldquoOn the limitedmemoryBFGSmethodfor large scale optimizationrdquoMathematical Programming B vol45 no 3 pp 503ndash528 1989

[17] L Tian F Wong and S Chao ldquoAn improvement of translationquality with adding key-words in parallel corpusrdquo in Proceed-ings of the International Conference on Machine Learning andCybernetics (ICMLC rsquo10) vol 3 pp 1273ndash1278 July 2010

[18] P P Talukdar and F Pereira ldquoExperiments in graph-based semi-supervised learning methods for class-instance acquisitionrdquo inProceedings of the 48th Annual Meeting of the Association forComputational Linguistics (ACL rsquo10) pp 1473ndash1481 July 2010


Box 1: Example of the feature template. Sentence (word/POS/chunk tag): 中国/NR/NP-B 建筑业/NN/NP-I 对/P/PP-B 外/NN/NP-B 开放/VV/VP-B 呈现/VV/VP-B 新/JJ/NP-B 格局/NN/NP-I. For the word 建筑业, the extracted instance is (建筑业, NP-I), with features including the word form, its POS tag NN, and its first and last characters 建 and 业.
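Since Box 1 survives only partially in the source, the following minimal Python sketch illustrates the kind of per-word instance it describes; the exact feature set (word form, POS tag, first and last characters) and the function name are illustrative assumptions, not the authors' precise template.

```python
# Illustrative sketch of per-word instance features for chunk tagging.
# The feature set below is an assumption reconstructed from Box 1.

def word_features(words, pos_tags, i):
    """Build a feature dict for the i-th word of a segmented sentence."""
    w = words[i]
    return {
        "word": w,            # e.g. 建筑业
        "pos": pos_tags[i],   # e.g. NN
        "first_char": w[0],   # e.g. 建
        "last_char": w[-1],   # e.g. 业
    }

sentence = ["中国", "建筑业", "对", "外", "开放", "呈现", "新", "格局"]
pos = ["NR", "NN", "P", "NN", "VV", "VV", "JJ", "NN"]
print(word_features(sentence, pos, 1))
# {'word': '建筑业', 'pos': 'NN', 'first_char': '建', 'last_char': '业'}
```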

5. Experiments and Results

Before presenting the results of our experiments, the evaluation metrics, the datasets, and the baselines used for comparison are first described.

5.1. Evaluation Metrics. The metrics for evaluating the NP chunking model are the precision rate, the recall rate, and their harmonic mean, the $F_1$-score. The tagging accuracy is defined as

$$\text{Tagging Accuracy} = \frac{\text{the number of correctly tagged words}}{\text{the number of words}} \quad (11)$$

Precision measures the percentage of proposed NP chunks that are correct; a proposed tag counts as correct only if both the chunk boundary and the chunk label are correct. Precision is therefore defined as

$$\text{Precision} = \frac{\text{the number of correctly proposed chunk tags}}{\text{the number of proposed chunk tags}} \quad (12)$$

Recall measures the percentage of NP chunks present in the input sentence that are correctly identified by the system:

$$\text{Recall} = \frac{\text{the number of correctly proposed chunk tags}}{\text{the number of correct (gold) chunk tags}} \quad (13)$$

The $F$-measure combines the previous two measures into one metric. The $F_\beta$-score is defined as

$$F_\beta = \frac{(\beta^2 + 1) \times \text{Recall} \times \text{Precision}}{\beta^2 \times \text{Recall} + \text{Precision}} \quad (14)$$

where $\beta$ is a parameter that weights the relative importance of recall and precision; when $\beta = 1$, precision and recall are equally important.
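For concreteness, the three chunk-level metrics can be computed from the sets of proposed and gold chunks as in the sketch below; representing each chunk as a (start, end, label) triple is an assumption made here for illustration, not the authors' data structure.

```python
# Sketch: precision, recall, and F-beta over chunks, with each chunk
# represented as a (start, end, label) triple (illustrative assumption).

def prf(proposed, gold, beta=1.0):
    proposed, gold = set(proposed), set(gold)
    correct = len(proposed & gold)          # correct boundary AND label
    precision = correct / len(proposed) if proposed else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    b2 = beta ** 2
    f = (b2 + 1) * precision * recall / (b2 * recall + precision)  # eq. (14)
    return precision, recall, f

gold = [(0, 2, "NP"), (2, 3, "PP"), (3, 5, "VP")]
pred = [(0, 2, "NP"), (2, 3, "NP"), (3, 5, "VP")]
print(prf(pred, gold))  # (0.666..., 0.666..., 0.666...)
```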

5.2. Dataset. To facilitate the bilingual graph construction and the unsupervised chunking experiments, two kinds of datasets are utilized in this work: (1) a monolingual Treebank for Chinese chunk induction and (2) large amounts of English-Chinese parallel text for bilingual graph construction and label propagation.

Table 4: (a) English-Chinese parallel corpus; (b) graph instances; (c) Chinese unlabeled dataset (CTB7 corpora).

(a) Number of sentence pairs: 10,000; number of seeds: 27,940; number of words: 31,678.

(b) Number of sentences: 17,617; number of vertices: 185,441.

(c) Training dataset: Xinhua 1–321, 7,617 sentences; testing dataset: Xinhua 363–403, 912 sentences.

For the monolingual Treebank data, we rely on the Penn Chinese Treebank 7.0 (CTB7) [12]. It consists of over one million words of annotated and parsed text from Chinese newswire, magazines, various broadcast news and broadcast conversation programs, web newsgroups, and weblogs. The parallel data come from the UM corpus [17], which contains 100,000 pairs of English-Chinese sentences. The training and testing sets defined for the experiment, together with their corresponding statistics, are shown in Table 4. In addition, Table 4 gives the detailed information of the data used for bilingual graph construction, including the parallel corpus for the tag projection and the label propagation of Chinese trigrams.

5.3. Chunk Tagset and HMM States. As described above, the universal chunk tagset is employed in our experiments. This set $U$ consists of the following six coarse-grained tags: NP (noun phrase), PP (prepositional phrase), VP (verb phrase), ADVP (adverb phrase), ADJP (adjective phrase), and SBAR (subordinating conjunction). Although there might be some controversy about the exact definition of such a tagset, these six categories cover the most frequent chunk tags that exist in one form or another in both the English and Chinese corpora.

For each language under consideration, a mapping from the fine-grained, language-specific chunk tags in the Treebank to the universal chunk tags is provided, as sketched below. Hence, the model of unsupervised chunking is trained on datasets labeled with the universal chunk tags.
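Such a mapping is naturally expressed as a lookup table. The sketch below shows the idea; the particular fine-grained labels chosen, especially on the Chinese side, are illustrative assumptions rather than the full mapping used in the experiments.

```python
# Illustrative fragment of a fine-grained -> universal chunk tag mapping.
# The six universal tags come from the paper; which fine-grained labels
# collapse into each of them is assumed here for illustration only.

UNIVERSAL = {"NP", "PP", "VP", "ADVP", "ADJP", "SBAR"}

FINE_TO_UNIVERSAL = {
    # English (CoNLL-style) labels
    "NP": "NP", "PP": "PP", "VP": "VP",
    "ADVP": "ADVP", "ADJP": "ADJP", "SBAR": "SBAR",
    # Chinese (CTB-style) labels -- assumed examples
    "DNP": "NP", "QP": "NP", "LCP": "PP", "DVP": "ADVP",
}

def to_universal(fine_tag):
    """Map a Treebank-specific chunk tag onto the universal tagset."""
    tag = FINE_TO_UNIVERSAL.get(fine_tag)
    if tag is None or tag not in UNIVERSAL:
        raise KeyError(f"no universal mapping for {fine_tag!r}")
    return tag
```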


Table 5: Chunk tagging evaluation results for the two baselines and the proposed graph-based model (Pre = precision, Rec = recall).

Tag     Feature-HMM (Pre/Rec/F1)   Projection (Pre/Rec/F1)   Graph-based (Pre/Rec/F1)
NP      0.60 / 0.67 / 0.63         0.67 / 0.72 / 0.68        0.81 / 0.82 / 0.81
VP      0.56 / 0.51 / 0.53         0.61 / 0.57 / 0.59        0.78 / 0.74 / 0.76
PP      0.36 / 0.28 / 0.32         0.44 / 0.31 / 0.36        0.60 / 0.51 / 0.55
ADVP    0.40 / 0.46 / 0.43         0.45 / 0.52 / 0.48        0.64 / 0.68 / 0.66
ADJP    0.47 / 0.53 / 0.50         0.48 / 0.58 / 0.51        0.67 / 0.71 / 0.69
SBAR    0.00 / 0.00 / 0.00         0.50 / 1.00 / 0.66        0.50 / 1.00 / 0.66
All     0.49 / 0.51 / 0.50         0.57 / 0.62 / 0.59        0.73 / 0.75 / 0.74

In this work, the number of latent HMM states for Chinese is fixed to a constant, namely, the number of universal chunk tags.

5.4. Various Models. To comprehensively probe our approach with a thorough analysis, we evaluated two baselines in addition to our graph-based algorithm. We were intentionally lenient with these two baselines.

(i) Feature-HMM. The first baseline is the vanilla feature-HMM proposed by [9], which applies L-BFGS to estimate the model parameters and uses a greedy many-to-1 mapping for evaluation.
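Greedy many-to-1 mapping is a standard evaluation device for unsupervised taggers: each latent state is relabeled with the gold tag it co-occurs with most often, and accuracy is computed under that relabeling. A minimal sketch, not the authors' implementation:

```python
# Sketch of greedy many-to-1 evaluation for an unsupervised tagger:
# every latent state is assigned the gold tag it co-occurs with most often.
from collections import Counter, defaultdict

def many_to_one_accuracy(pred_states, gold_tags):
    cooc = defaultdict(Counter)
    for s, g in zip(pred_states, gold_tags):
        cooc[s][g] += 1
    mapping = {s: c.most_common(1)[0][0] for s, c in cooc.items()}
    correct = sum(mapping[s] == g for s, g in zip(pred_states, gold_tags))
    return correct / len(gold_tags)

# e.g. states [0, 0, 1, 2, 1] vs gold ["NP", "NP", "VP", "PP", "VP"] -> 1.0
print(many_to_one_accuracy([0, 0, 1, 2, 1], ["NP", "NP", "VP", "PP", "VP"]))
```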

(ii) Projection. The direct projection serves as the second baseline. It incorporates the bilingual information by projecting chunk tags from the English side onto the aligned Chinese text; several rules were added to fix the problem of unaligned words (see the sketch after this paragraph). We trained a supervised feature-HMM with the same number of sentences from the bitext as in the Treebank.
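A minimal sketch of direct projection under simplifying assumptions: one aligned English index per Chinese token, and a single fallback rule for unaligned tokens standing in for the several fixing rules mentioned above.

```python
# Sketch of direct tag projection across word alignments. 'alignments'
# maps each Chinese token position to the English index it is aligned to
# (None if unaligned); unaligned tokens inherit the previous tag, one
# simple stand-in for the paper's fixing rules.

def project_tags(en_tags, alignments, fallback="NP-B"):
    zh_tags = []
    for en_idx in alignments:
        if en_idx is not None:
            zh_tags.append(en_tags[en_idx])
        elif zh_tags:
            zh_tags.append(zh_tags[-1])   # inherit previous tag
        else:
            zh_tags.append(fallback)      # sentence-initial unaligned token
    return zh_tags

en = ["NP-B", "NP-I", "VP-B"]
print(project_tags(en, [0, 1, None, 2]))  # ['NP-B', 'NP-I', 'NP-I', 'VP-B']
```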

(iii) Our Graph-Based Model. The full model uses both stages of label propagation (3) before extracting the constraint features. As a result, the label distributions of all Chinese word types were added to the constraint features.

5.5. Experimental Setup. The experimental platform is implemented on top of two toolkits: Junto [18] and Ark [14, 15]. Junto is a Java-based label propagation toolkit used for the bilingual graph construction; Ark is a Java-based package used for training and testing the feature-HMM.

For the feature-HMM training, a few hyperparameters have to be set to minimize the number of free parameters in the model. $C = 1.0$ was used as the regularization constant in (10), and L-BFGS was trained for 1000 iterations. Several threshold values of $\tau$ for extracting the vector $t_x$ were tried, and it was found that 0.2 works best; $\tau = 0.2$ was therefore used for the Chinese trigram types.
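A minimal sketch of this thresholding step, assuming the propagated label distribution for a trigram type is available as a tag-to-probability mapping:

```python
# Sketch: extract a sparse tag vector t_x from a propagated label
# distribution by keeping tags with probability >= tau and renormalizing.

def extract_tag_vector(distribution, tau=0.2):
    kept = {tag: p for tag, p in distribution.items() if p >= tau}
    total = sum(kept.values())
    return {tag: p / total for tag, p in kept.items()} if kept else {}

print(extract_tag_vector({"NP": 0.55, "VP": 0.25, "PP": 0.15, "ADJP": 0.05}))
# {'NP': 0.6875, 'VP': 0.3125}
```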

5.6. Results. Table 5 shows the complete set of results. As expected, the projection baseline is better than the feature-HMM, being able to benefit from the bilingual information: it improves upon the monolingual baseline by 12% in $F_1$-score. Among the unsupervised approaches, the feature-HMM achieves only 50% $F_1$-score on the universal chunk tags. Overall, the graph-based model outperforms both baselines; that is, the improvement of the feature-HMM under the graph-based setting is statistically significant with respect to the other models. The graph-based model performs 24% better than the state-of-the-art feature-HMM and 12% better than the direct projection model.

6. Conclusion

This paper has presented an unsupervised graph-based approach to Chinese chunking that uses label propagation to project chunk information across languages. Our results suggest that it is possible to learn accurate chunk tags for unlabeled text from a bitext, that is, data that has a parallel English text. In the future, we will propose an alternative graph-based unsupervised approach to chunking for languages that lack ready-to-use resources.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

The authors are grateful to the Science and Technology Development Fund of Macau and the Research Committee of the University of Macau (Grant nos. MYRG076(Y1-L2)-FST13-WF and MYRG070(Y1-L2)-FST12-CS) for the funding support for their research.

References

[1] R. Koeling, "Chunking with maximum entropy models," in Proceedings of the 2nd Workshop on Learning Language in Logic and the 4th Conference on Computational Natural Language Learning, vol. 7, pp. 139–141, 2000.

[2] M. P. Marcus, M. A. Marcinkiewicz, and B. Santorini, "Building a large annotated corpus of English: the Penn Treebank," Computational Linguistics, vol. 19, no. 2, pp. 313–330, 1993.

[3] D. Yarowsky and G. Ngai, "Inducing multilingual POS taggers and NP bracketers via robust projection across aligned corpora," in Proceedings of the 2nd Meeting of the North American Chapter of the Association for Computational Linguistics (NAACL '01), pp. 200–207, 2001.

[4] X. Zhu, Z. Ghahramani, and J. Lafferty, "Semi-supervised learning using Gaussian fields and harmonic functions," in Proceedings of the 20th International Conference on Machine Learning (ICML '03), vol. 3, pp. 912–919, August 2003.

[5] Y. Altun, M. Belkin, and D. A. McAllester, "Maximum margin semi-supervised learning for structured variables," in Advances in Neural Information Processing Systems, pp. 33–40, 2005.

[6] A. Subramanya, S. Petrov, and F. Pereira, "Efficient graph-based semi-supervised learning of structured tagging models," in Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 167–176, October 2010.

[7] F. J. Och and H. Ney, "A systematic comparison of various statistical alignment models," Computational Linguistics, vol. 29, no. 1, pp. 19–51, 2003.

[8] A. Subramanya and J. Bilmes, "Semi-supervised learning with measure propagation," The Journal of Machine Learning Research, vol. 12, pp. 3311–3370, 2011.

[9] D. Das and N. A. Smith, "Semi-supervised frame-semantic parsing for unknown predicates," in Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics, pp. 1435–1444, June 2011.

[10] S. Baluja, R. Seth, D. Sivakumar, et al., "Video suggestion and discovery for YouTube: taking random walks through the view graph," in Proceedings of the 17th International Conference on World Wide Web, pp. 895–904, April 2008.

[11] S. Petrov, D. Das, and R. McDonald, "A universal part-of-speech tagset," 2011, http://arxiv.org/abs/1104.2086.

[12] N. Xue, F. Xia, F.-D. Chiou, and M. Palmer, "The Penn Chinese TreeBank: phrase structure annotation of a large corpus," Natural Language Engineering, vol. 11, no. 2, pp. 207–238, 2005.

[13] E. F. Tjong Kim Sang and H. Déjean, "Introduction to the CoNLL-2001 shared task: clause identification," in Proceedings of the Workshop on Computational Natural Language Learning, vol. 7, p. 8, 2001.

[14] T. Berg-Kirkpatrick, A. Bouchard-Côté, J. DeNero, and D. Klein, "Painless unsupervised learning with features," in Proceedings of the Human Language Technologies Conference of the North American Chapter of the Association for Computational Linguistics, pp. 582–590, Los Angeles, Calif, USA, June 2010.

[15] A. B. Goldberg and X. Zhu, "Seeing stars when there aren't many stars: graph-based semi-supervised learning for sentiment categorization," in Proceedings of the 1st Workshop on Graph Based Methods for Natural Language Processing, pp. 45–52, 2006.

[16] D. C. Liu and J. Nocedal, "On the limited memory BFGS method for large scale optimization," Mathematical Programming B, vol. 45, no. 3, pp. 503–528, 1989.

[17] L. Tian, F. Wong, and S. Chao, "An improvement of translation quality with adding key-words in parallel corpus," in Proceedings of the International Conference on Machine Learning and Cybernetics (ICMLC '10), vol. 3, pp. 1273–1278, July 2010.

[18] P. P. Talukdar and F. Pereira, "Experiments in graph-based semi-supervised learning methods for class-instance acquisition," in Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics (ACL '10), pp. 1473–1481, July 2010.
