
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, pages 409–420, Berlin, Germany, August 7-12, 2016. © 2016 Association for Computational Linguistics

Neural Word Segmentation Learning for Chinese

Deng Cai and Hai Zhao∗
Department of Computer Science and Engineering
Key Lab of Shanghai Education Commission for Intelligent Interaction and Cognitive Engineering
Shanghai Jiao Tong University, Shanghai, China
[email protected], [email protected]

Abstract

Most previous approaches to Chinese word segmentation formalize the problem as a character-based sequence labeling task, so that only contextual information within fixed-size local windows and simple interactions between adjacent tags can be captured. In this paper, we propose a novel neural framework which thoroughly eliminates context windows and can utilize the complete segmentation history. Our model employs a gated combination neural network over characters to produce distributed representations of word candidates, which are then given to a long short-term memory (LSTM) language scoring model. Experiments on the benchmark datasets show that, without the feature engineering that most existing approaches require, our models achieve performance competitive with or better than previous state-of-the-art methods.

1 Introduction

Most East Asian languages, including Chinese, are written without explicit word delimiters; therefore, word segmentation is a preliminary step for processing those languages. Since Xue (2003), most methods have formalized Chinese word segmentation (CWS) as a sequence labeling problem with character position tags, which can be handled with supervised learning methods such as Maximum Entropy (Berger et al., 1996; Low et al., 2005) and Conditional Random Fields (Lafferty et al., 2001; Peng et al., 2004; Zhao et al., 2006a). However, those methods heavily depend on the choice of handcrafted features.

∗ Corresponding author. This paper was partially supported by the Cai Yuanpei Program (CSC No. 201304490199 and No. 201304490171), National Natural Science Foundation of China (No. 61170114 and No. 61272248), National Basic Research Program of China (No. 2013CB329401), Major Basic Research Program of Shanghai Science and Technology Committee (No. 15JC1400103), Art and Science Interdisciplinary Funds of Shanghai Jiao Tong University (No. 14JCRZ04), and Key Project of National Society Science Foundation of China (No. 15-ZDA041).

Recently, neural models have been widely used for NLP tasks for their ability to minimize the effort of feature engineering. For the task of CWS, Zheng et al. (2013) adapted the general neural network architecture for sequence labeling proposed in (Collobert et al., 2011) and used character embeddings as input to a two-layer network. Pei et al. (2014) improved upon (Zheng et al., 2013) by explicitly modeling the interactions between the local context and the previous tag. Chen et al. (2015a) proposed a gated recursive neural network to model the feature combinations of context characters. Chen et al. (2015b) used an LSTM architecture to capture potential long-distance dependencies, which alleviates the limitation on the size of the context window but introduces another window for hidden states.

Despite the differences, all these models are designed to solve CWS by assigning labels to the characters in the sequence one by one. At each time step of inference, these models compute the tag scores of a character based on (i) context features within a fixed-size local window and (ii) the tag of the previous character.

Nevertheless, the tag-tag transition is insufficient to model the complicated influence of previous segmentation decisions, even though such decisions can sometimes be a crucial clue for later ones. The fixed context window size, which is broadly adopted by these methods for feature engineering, also restricts the flexibility of modeling diverse distances. Moreover, word-level information, which is the greater granularity unit as suggested in (Huang and Zhao, 2006), remains unexploited.


Models                                     Characters                                Words                  Tags
character based (Zheng et al., 2013), ...  c_{i-2}, c_{i-1}, c_i, c_{i+1}, c_{i+2}   -                      t_{i-1} t_i
(Chen et al., 2015b)                       c_0, c_1, ..., c_i, c_{i+1}, c_{i+2}      -                      t_{i-1} t_i
word based (Zhang and Clark, 2007), ...    c in w_{j-1}, w_j, w_{j+1}                w_{j-1}, w_j, w_{j+1}  -
Ours                                       c_0, c_1, ..., c_i                        w_0, w_1, ..., w_j     -

Table 1: Feature windows of different models. i (j) indexes the current character (word) that is under scoring.

To alleviate the drawbacks of previous methods and relax inconvenient constraints such as the fixed-size context window, this paper makes a fresh attempt to re-formalize CWS as a direct segmentation learning task. Our method does not make tagging decisions on individual characters, but directly evaluates the relative likelihood of different segmented sentences and then searches for the segmentation with the highest score. To represent a segmented sentence, a series of distributed vector representations (Bengio et al., 2003) are generated to characterize the corresponding word candidates. Such a representation setting makes the decoding quite different from that of previous methods and indeed much more challenging; however, more discriminative features can be captured.

Though the vector building is word centered, our proposed scoring model covers all three processing levels, from character and word up to sentence. First, the distributed representation starts from character embeddings, since in the context of word segmentation the n-gram data sparsity issue makes it impractical to use word vectors directly. Second, as a word candidate's representation is derived from its characters, the internal character structure is also encoded and can thus be used to judge the likelihood of the word itself. Third, to evaluate how well a segmented sentence makes sense through word interaction, an LSTM (Hochreiter and Schmidhuber, 1997) is used to chain word candidates together incrementally and construct the representation of the partially segmented sentence at each decoding step, so that the coherence between the next word candidate and the previous segmentation history can be captured.

To the best of our knowledge, our proposed approach to CWS is the first attempt that explicitly models the entire contents of the segmenter's state, including the complete history of both segmentation decisions and input characters.

Figure 1: Our framework. (The figure shows the input sequence 自然语言处理 passed through the neural network scoring model and decoder to produce the output sentence 自/然语言/处理, trained with max-margin against the golden sentence 自然/语言/处理.)

The comparisons of the feature windows used in different models are shown in Table 1. Compared to both the sequence labeling schemes and the word-based models of the past, our model thoroughly eliminates context windows and can capture the complete history of segmentation decisions, which offers more possibilities to model the segmentation context effectively and accurately.

2 Overview

We formulate the CWS problem as finding a mapping from an input character sequence x to a word sequence y, where the output sentence y* satisfies:

y* = argmax_{y ∈ GEN(x)} ∑_{i=1}^{n} score(y_i | y_1, ..., y_{i-1})

where n is the number of word candidates in y, and GEN(x) denotes the set of possible segmentations of an input sequence x. Unlike all previous works, our scoring function is sensitive to the complete contents of the partially segmented sentence.
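As a concrete illustration of this formulation, the following brute-force sketch enumerates GEN(x) and sums the incremental scores; the scoring function score_fn is only a placeholder for the neural model of Section 3, and exhaustive enumeration is for illustration only (Section 4 describes the beam-search decoder actually used).

def gen(x):
    # all possible segmentations of the character sequence x
    if not x:
        yield []
        return
    for i in range(1, len(x) + 1):
        for rest in gen(x[i:]):
            yield [x[:i]] + rest

def decode_bruteforce(x, score_fn):
    # y* = argmax over GEN(x) of the summed incremental scores
    best, best_score = None, float("-inf")
    for y in gen(x):
        s = sum(score_fn(y[i], y[:i]) for i in range(len(y)))
        if s > best_score:
            best, best_score = y, s
    return best

# e.g. decode_bruteforce("自然语言处理", lambda word, history: -abs(len(word) - 2))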

As shown in Figure 1, to solve CWS in this way, a neural network scoring model is designed to evaluate the likelihood of a segmented sentence. Based on the proposed model, a decoder is developed to find the segmented sentence with the highest score. Meanwhile, a max-margin method is utilized to perform training by comparing the structured difference between the decoder output and the golden segmentation.


Figure 2: Architecture of our proposed neural network scoring model, where c_i denotes the i-th input character, y_j denotes the learned representation of the j-th word candidate, p_k denotes the prediction for the (k+1)-th word candidate, and u is the trainable parameter vector for scoring the likelihood of individual word candidates. (The figure stacks a character lookup table, GCNN units, LSTM units, and the predicting and scoring layers over the segmented sentence.)

The following sections introduce each of these components in detail.

3 Neural Network Scoring Model

The score of a segmented sentence is computed by first mapping it into a sequence of word candidate vectors; the scoring model then takes the vector sequence as input and scores each word candidate from two perspectives: (1) how likely the word candidate itself is to be a legal word; (2) how reasonably the word candidate follows the previous segmentation history. After that, the word candidate is appended to the segmentation history, updating the state of the scoring system for subsequent judgements. Figure 2 illustrates the entire scoring neural network.

3.1 Word Score

Character Embedding. While the scores are decided at the word level, using word embeddings (Bengio et al., 2003; Wang et al., 2016) directly would lead to a notable issue: rare words and out-of-vocabulary words would be poorly estimated (Kim et al., 2015). In addition, the character-level information inside an n-gram can be helpful for judging whether it is a true word. Therefore, a lookup table of character embeddings is used as the bottom layer.

Formally, we have a character dictionary D of size |D|. Each character c ∈ D is represented as a real-valued vector (character embedding) c ∈ R^d, where d is the dimensionality of the vector space. The character embeddings are stacked into an embedding matrix M ∈ R^{d×|D|}. For a character c ∈ D, its character embedding c ∈ R^d is retrieved by the embedding layer according to its index.
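A minimal sketch of this lookup layer, assuming a toy dictionary and random initialization (in the real model M is learned during training):

import numpy as np

vocab = {ch: idx for idx, ch in enumerate("自然语言处理")}   # toy character dictionary D
d = 50
M = np.random.default_rng(0).normal(scale=0.1, size=(d, len(vocab)))  # M in R^{d x |D|}

def char_embedding(ch):
    # retrieve the embedding column of character ch by its index
    return M[:, vocab[ch]]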

Gated Combination Neural Network. To obtain a word representation from its characters, the simplest strategy is to integrate the character vectors into a word representation using a weight matrix W^{(L)} that is shared across all words of the same length L, followed by a non-linear function g(·). Specifically, if c_i (1 ≤ i ≤ L) are the d-dimensional character vectors, the corresponding word vector w is d-dimensional as well:

w = g(W^{(L)} [c_1; ...; c_L])    (1)

where [c_1; ...; c_L] denotes the concatenation of the character vectors, W^{(L)} ∈ R^{d×Ld}, and g is a non-linear function as mentioned above.

Although the mechanism above seems to work well, it cannot sufficiently model the complicated combination features in practice. Gated structures in neural networks can be useful for hybrid feature extraction (Chen et al., 2015a; Chung et al., 2014; Cho et al., 2014).


Figure 3: Gated combination neural network.

We therefore propose a gated combination neural network (GCNN) especially for character compositionality, which contains two types of gates, namely reset gates and update gates. Intuitively, the reset gates decide which parts of the character vectors should be mixed, while the update gates decide what to preserve when combining the character information. Concretely, for words of length L, the word vector w ∈ R^d is computed as follows:

w = z_N ⊙ ŵ + ∑_{i=1}^{L} z_i ⊙ c_i

where z_N and z_i (1 ≤ i ≤ L) are update gates for the new activation ŵ and the governed characters respectively, and ⊙ indicates element-wise multiplication.

The new activation ŵ is computed as:

ŵ = tanh(W^{(L)} [r_1 ⊙ c_1; ...; r_L ⊙ c_L])

where W^{(L)} ∈ R^{d×Ld} and r_i ∈ R^d (1 ≤ i ≤ L) are the reset gates for the governed characters respectively, which can be formalized as:

[r_1; ...; r_L] = σ(R^{(L)} [c_1; ...; c_L])

where R^{(L)} ∈ R^{Ld×Ld} is the coefficient matrix of the reset gates and σ denotes the sigmoid function.

The update gates can be formalized as:

[z_N; z_1; ...; z_L] = exp(U^{(L)} [ŵ; c_1; ...; c_L]) ⊙ [1/Z; 1/Z; ...; 1/Z]

where U^{(L)} ∈ R^{(L+1)d×(L+1)d} is the coefficient matrix of the update gates, and Z ∈ R^d is the normalization vector with

Z_k = ∑_{i=0}^{L} [exp(U^{(L)} [ŵ; c_1; ...; c_L])]_{d×i+k}

for 0 ≤ k < d. According to this normalization condition, the update gates are constrained by:

z_N + ∑_{i=1}^{L} z_i = 1

The gated mechanism is capable of capturing both character and character-interaction characteristics to give an efficient word representation (see Section 6.3).
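The following NumPy sketch implements the GCNN composition above; the parameter shapes follow W^{(L)}, R^{(L)} and U^{(L)} as defined in this section (one set per word length L), while the random initialization and the helper name gcnn_compose are illustrative assumptions, not the authors' released code.

import numpy as np

d, MAX_LEN = 50, 4
rng = np.random.default_rng(0)
# one parameter set per word length L, since W^(L), R^(L), U^(L) are shared
# across all words of the same length
params = {
    L: dict(W=rng.normal(scale=0.1, size=(d, L * d)),
            R=rng.normal(scale=0.1, size=(L * d, L * d)),
            U=rng.normal(scale=0.1, size=((L + 1) * d, (L + 1) * d)))
    for L in range(1, MAX_LEN + 1)
}

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gcnn_compose(chars):
    # chars: list of L character vectors (each in R^d) -> word vector w in R^d
    L = len(chars)
    W, R, U = params[L]["W"], params[L]["R"], params[L]["U"]
    c = np.concatenate(chars)                        # [c_1; ...; c_L]
    r = sigmoid(R @ c)                               # reset gates r_1, ..., r_L
    w_hat = np.tanh(W @ (r * c))                     # new activation w-hat
    # update gates: per-dimension normalization over the L+1 blocks (w-hat, c_1, ..., c_L)
    scores = np.exp(U @ np.concatenate([w_hat, c])).reshape(L + 1, d)
    z = scores / scores.sum(axis=0, keepdims=True)   # z_N, z_1, ..., z_L
    blocks = np.vstack([w_hat] + list(chars))        # rows: w-hat, c_1, ..., c_L
    return (z * blocks).sum(axis=0)                  # w = z_N * w-hat + sum_i z_i * c_i

print(gcnn_compose([rng.normal(size=d), rng.normal(size=d)]).shape)   # (50,)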

Word Score. Denote the learned vector representations for a segmented sentence y by [y_1, y_2, ..., y_n], where n is the number of word candidates in the sentence. The word score is computed as the dot product of the vector y_i (1 ≤ i ≤ n) and a trainable parameter vector u ∈ R^d:

Word Score(y_i) = u · y_i    (2)

It indicates how likely the word candidate by itself is to be a true word.

3.2 Link Score

Inspired by the recurrent neural network language model (RNN-LM) (Mikolov et al., 2010; Sundermeyer et al., 2012), we utilize an LSTM to capture the coherence within a segmented sentence.

Long Short-Term Memory Networks. The LSTM neural network (Hochreiter and Schmidhuber, 1997) is an extension of the recurrent neural network (RNN), an effective tool for sequence modeling that preserves history information in its hidden states. At each time step t, an RNN takes the input x_t and updates its recurrent hidden state h_t by

h_t = g(U h_{t-1} + W x_t + b)

where g is a non-linear function. Although an RNN is capable, in principle, of processing arbitrary-length sequences, it can be difficult to train it to learn long-range dependencies due to vanishing gradients.


Figure 4: Link scores (dashed lines).

LSTM addresses this problem by introducing a memory cell that preserves states over long periods of time, and it controls the update of the hidden state and memory cell with three types of gates, namely an input gate, a forget gate and an output gate. Concretely, each LSTM step takes x_t, h_{t-1}, c_{t-1} as input and produces h_t, c_t via the following calculations:

i_t = σ(W_i x_t + U_i h_{t-1} + b_i)
f_t = σ(W_f x_t + U_f h_{t-1} + b_f)
o_t = σ(W_o x_t + U_o h_{t-1} + b_o)
c̃_t = tanh(W_c x_t + U_c h_{t-1} + b_c)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ c̃_t
h_t = o_t ⊙ tanh(c_t)

where σ and ⊙ are respectively the element-wise sigmoid function and multiplication, and i_t, f_t, o_t, c̃_t are respectively the input gate, forget gate, output gate and memory cell activation vector at time t, all of which have the same size as the hidden state vector h_t ∈ R^H.

Link Score. LSTMs have been shown to outperform RNNs on many NLP tasks, notably language modeling (Sundermeyer et al., 2012).

In our model, the LSTM is utilized to chain word candidates together in a left-to-right, incremental manner. At time step t, a prediction p_{t+1} ∈ R^d about the next word y_{t+1} is made based on the hidden state h_t:

p_{t+1} = tanh(W_p h_t + b_p)

The link score for the next word y_{t+1} is then computed as:

Link Score(y_{t+1}) = p_{t+1} · y_{t+1}    (3)

Due to the structure of the LSTM, the prediction vector p_{t+1} carries useful information detected from the entire segmentation history, including previous segmentation decisions. In this way, our model gains the ability of sequence-level discrimination rather than local optimization.

3.3 Sentence Score

The sentence score for a segmented sentence y with n word candidates is computed by summing up the word scores (2) and link scores (3) as follows:

s(y_{[1:n]}, θ) = ∑_{t=1}^{n} (u · y_t + p_t · y_t)    (4)

where θ is the parameter set used in our model.
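To make the combined scoring concrete, here is a NumPy sketch of Equation (4): an LSTM chains the word-candidate vectors, and each step contributes a word score u·y_t plus a link score p_t·y_t. The packed weight layout and the initialization are illustrative assumptions; the word vectors would come from the GCNN of Section 3.1.

import numpy as np

d = H = 50
rng = np.random.default_rng(1)
u  = rng.normal(scale=0.1, size=d)           # word-score parameter vector u
Wx = rng.normal(scale=0.1, size=(4 * H, d))  # packed W_i, W_f, W_o, W_c
Uh = rng.normal(scale=0.1, size=(4 * H, H))  # packed U_i, U_f, U_o, U_c
b  = np.zeros(4 * H)
Wp = rng.normal(scale=0.1, size=(d, H))      # prediction layer W_p
bp = np.zeros(d)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(y, h, c):
    # one LSTM step over a word-candidate vector y
    gates = Wx @ y + Uh @ h + b
    i, f, o = (sigmoid(gates[k * H:(k + 1) * H]) for k in range(3))
    c_tilde = np.tanh(gates[3 * H:])
    c_new = f * c + i * c_tilde
    return o * np.tanh(c_new), c_new

def sentence_score(word_vectors):
    # s(y_[1:n], theta) = sum_t (u . y_t + p_t . y_t), Equation (4)
    h, c = np.zeros(H), np.zeros(H)
    score = 0.0
    for y in word_vectors:
        p = np.tanh(Wp @ h + bp)   # prediction p_t from the segmentation history
        score += u @ y + p @ y     # word score + link score
        h, c = lstm_step(y, h, c)  # append the word candidate to the history
    return score

print(sentence_score([rng.normal(size=d) for _ in range(3)]))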

4 Decoding

The total number of possible segmented sentences grows exponentially with the length of the character sequence, which makes it impractical to compute the scores of every possible segmentation. To obtain exact inference, most sequence-labeling systems address this problem with a Viterbi search, which takes advantage of the hypothesis that tag interactions only exist between adjacent characters (the Markov assumption). However, since our model is intended to capture the complete history of segmentation decisions, such dynamic programming algorithms cannot be adopted in this situation.

Algorithm 1 Beam Search.
Input: model parameters θ
       beam size k
       maximum word length w
       input character sequence c[1:n]
Output: Approx. k best segmentations
 1: π[0] ← {(score = 0, h = h_0, c = c_0)}
 2: for i = 1 to n do
 3:   ▷ Generate Candidate Word Vectors
 4:   X ← ∅
 5:   for j = max(1, i − w) to i do
 6:     w = GCNN-Procedure(c[j : i])
 7:     X.add((index = j − 1, word = w))
 8:   end for
 9:   ▷ Join Segmentation
10:   Y ← {y.append(x) | y ∈ π[x.index] and x ∈ X}
11:   ▷ Filter k-Max
12:   π[i] ← k-argmax_{y ∈ Y} y.score
13: end for
14: return π[n]

To make our model efficient in practical use, we propose a beam-search algorithm with dynamic programming motivations, as shown in Algorithm 1. The main idea is that any segmentation of the first i characters can be split into two parts: the first part consists of the characters with indexes from 0 to j and is denoted as y, while the rest is the word composed of c[j+1 : i]. The influence of the previous segmentation y can be represented as a triple (y.score, y.h, y.c), where y.score, y.h and y.c indicate the current score, current hidden state vector and current memory cell vector respectively. Beam search ensures that the total time for segmenting a sentence of n characters is proportional to w × k × n, where w and k are the maximum word length and beam size respectively.
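A Python sketch of Algorithm 1, reusing gcnn_compose, lstm_step, u, Wp and bp from the earlier sketches; the embed lookup and the Hyp container are hypothetical helpers, not the released implementation.

import numpy as np
from dataclasses import dataclass, field

_table = {}
def embed(ch, d=50, rng=np.random.default_rng(2)):
    # hypothetical character lookup: one random vector per distinct character
    if ch not in _table:
        _table[ch] = rng.normal(size=d)
    return _table[ch]

@dataclass
class Hyp:              # a partial segmentation y kept in the beam
    score: float        # y.score
    h: np.ndarray       # y.h, LSTM hidden state
    c: np.ndarray       # y.c, LSTM memory cell
    words: list = field(default_factory=list)

def beam_search(chars, k=4, max_len=4, H=50):
    beams = {0: [Hyp(0.0, np.zeros(H), np.zeros(H))]}        # pi[0]
    for i in range(1, len(chars) + 1):
        cands = []
        for j in range(max(0, i - max_len), i):               # candidate word chars[j:i]
            y = gcnn_compose([embed(ch) for ch in chars[j:i]])
            for hyp in beams[j]:                              # join with pi[j]
                p = np.tanh(Wp @ hyp.h + bp)                  # link prediction
                s = hyp.score + u @ y + p @ y                 # word score + link score
                h, c = lstm_step(y, hyp.h, hyp.c)
                cands.append(Hyp(s, h, c, hyp.words + [chars[j:i]]))
        beams[i] = sorted(cands, key=lambda x: -x.score)[:k]  # keep the k best
    return beams[len(chars)]                                  # approx. k best segmentations

# e.g. beam_search("自然语言处理")[0].words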

5 Training

We use the max-margin criterion (Taskar et al., 2005) to train our model. As reported in (Kummerfeld et al., 2015), margin methods generally outperform both likelihood and perceptron methods. For a given character sequence x^{(i)}, denote the correct segmented sentence for x^{(i)} as y^{(i)}. We define a structured margin loss Δ(y^{(i)}, ŷ) for predicting a segmented sentence ŷ:

Δ(y^{(i)}, ŷ) = ∑_{t=1}^{m} μ · 1{y^{(i)}_t ≠ ŷ_t}

where m is the length of the sequence x^{(i)} and μ is a discount parameter. The margin loss can be regarded as counting the number of incorrectly segmented characters and multiplying it by a fixed discount parameter for smoothing. Therefore, the loss is proportional to the number of incorrectly segmented characters.
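A small sketch of this margin loss, under the assumption that "incorrectly segmented characters" are counted by comparing per-character boundary labels of the two segmentations; the boundary encoding is an illustrative choice, not a detail given in the paper.

def margin_loss(gold_words, pred_words, mu=0.2):
    # Delta(y_gold, y_pred): mu times the number of wrongly segmented characters
    def boundary_tags(words):
        # encode a segmentation as one label per character: B = word start, I = inside
        tags = []
        for w in words:
            tags += ["B"] + ["I"] * (len(w) - 1)
        return tags
    g, p = boundary_tags(gold_words), boundary_tags(pred_words)
    return mu * sum(1 for a, b in zip(g, p) if a != b)

print(margin_loss(["自然", "语言", "处理"], ["自", "然语言", "处理"]))   # 0.2 * 2 = 0.4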

Given a training set Ω, the regularized objective function is the loss function J(θ) including an ℓ2 norm term:

J(θ) = (1/|Ω|) ∑_{(x^{(i)}, y^{(i)}) ∈ Ω} l_i(θ) + (λ/2) ||θ||_2^2

l_i(θ) = max_{ŷ ∈ GEN(x^{(i)})} ( s(ŷ, θ) + Δ(y^{(i)}, ŷ) − s(y^{(i)}, θ) )

where the function s(·) is the sentence score defined in Equation (4).

Due to the hinge loss, the objective function is not differentiable, so we use a subgradient method (Ratliff et al., 2007), which computes a gradient-like direction. Following (Socher et al., 2013), we use the diagonal variant of AdaGrad (Duchi et al., 2011) with minibatches to minimize the objective.

Character embedding size       d = 50
Hidden unit number             H = 50
Initial learning rate          α = 0.2
Margin loss discount           µ = 0.2
Regularization                 λ = 10^−6
Dropout rate on input layer    p = 0.2
Maximum word length            w = 4

Table 2: Hyper-parameter settings.
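For reference, the settings of Table 2 written as a plain configuration object; the class and field names are illustrative, the values are those of the table.

from dataclasses import dataclass

@dataclass
class SegmenterConfig:
    char_emb_size: int = 50        # d
    hidden_size: int = 50          # H
    learning_rate: float = 0.2     # alpha
    margin_discount: float = 0.2   # mu
    l2_reg: float = 1e-6           # lambda
    dropout_input: float = 0.2     # p
    max_word_len: int = 4          # w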

The update for the i-th parameter at time step t is as follows:

θ_{t,i} = θ_{t-1,i} − (α / sqrt(∑_{τ=1}^{t} g^2_{τ,i})) · g_{t,i}

where α is the initial learning rate and g_{τ,i} ∈ R^{|θ_i|} is the subgradient at time step τ for parameter θ_i.
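A NumPy sketch of this diagonal AdaGrad update on a flat parameter vector; the small eps term is an added numerical-stability assumption not present in the formula.

import numpy as np

def adagrad_step(theta, grad, accum, alpha=0.2, eps=1e-8):
    # theta_{t,i} = theta_{t-1,i} - alpha / sqrt(sum_tau g^2_{tau,i}) * g_{t,i}
    accum += grad ** 2                              # running sum of squared subgradients
    theta -= alpha * grad / (np.sqrt(accum) + eps)
    return theta, accum

# usage: theta, accum = adagrad_step(theta, subgrad, accum)  # subgrad from the hinge objective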

6 Experiments

6.1 Datasets

To evaluate the proposed segmenter, we use two popular datasets, PKU and MSR, from the second International Chinese Word Segmentation Bakeoff (Emerson, 2005). These datasets are commonly used by previous state-of-the-art models and neural network models.

Both datasets are preprocessed by replacing continuous English characters and digits with a unique token. All experiments are conducted with the standard Bakeoff scoring program,1 which calculates precision, recall, and F1-score.

1 http://www.sighan.org/bakeoff2003/score
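A small sketch of this preprocessing step: runs of English letters and runs of digits are each replaced by a unique token. The exact token strings are assumptions.

import re

def normalize(text):
    text = re.sub(r"[A-Za-z]+", "<ENG>", text)    # continuous English characters
    text = re.sub(r"[0-9０-９]+", "<NUM>", text)   # continuous digits (ASCII and full-width)
    return text

print(normalize("微软亚洲研究院MSRA成立于1998年"))
# -> 微软亚洲研究院<ENG>成立于<NUM>年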

6.2 Hyper-parameters

The hyper-parameters of a neural network model significantly impact its performance. To determine a suitable set of hyper-parameters, we divide the training data into two sets: the first 90% of sentences as the training set and the remaining 10% as the development set. We choose the hyper-parameters shown in Table 2.

We found that the character embedding size has a limited impact on performance as long as it is large enough; a size of 50 is chosen as a good trade-off between speed and performance. The number of hidden units is set to be the same as the character embedding size. The maximum word length determines the number of parameters in the GCNN part and the time consumed by beam search. Since words of length l > 4 are relatively rare, 0.29% in the PKU training data and 1.25% in the MSR training data, we set the maximum word length to 4 in our experiments.2

Figure 5: Performances of different beam sizes on PKU dataset (F1-score (%) over training epochs for beam sizes 2, 4 and 8).

Figure 6: Performances of different score strategies on PKU dataset (F1-score (%) over training epochs for only word score, only link score, and both).

Dropout is a popular technique for improving the performance of neural networks by reducing overfitting (Srivastava et al., 2014). We also apply dropout to the input layer of our model with a rate of 20% to avoid overfitting.
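A sketch of input-layer dropout with p = 0.2, using the standard inverted-dropout formulation as an assumption about the exact variant used.

import numpy as np

def dropout(x, p=0.2, train=True):
    # randomly zero a fraction p of the input units; rescale the rest so the
    # expected activation is unchanged at test time (inverted dropout)
    if not train or p == 0.0:
        return x
    mask = (np.random.random(x.shape) >= p).astype(x.dtype)
    return x * mask / (1.0 - p)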

6.3 Model Analysis

Beam Size. We first investigated the impact of beam size on segmentation performance. Figure 5 shows that a beam size of 4 is enough to obtain the best performance, which lets our model strike a good balance between accuracy and efficiency.

GCNN. We then studied the role of the GCNN in our model. To reveal its impact, we re-implemented a simplified version of our model that replaces the GCNN part with a single non-linear layer as in Equation (1).

2 This 4-character limitation is just for consistency across both datasets. We are aware that it is a rather strict setting, which causes additional performance loss on a dataset with a larger average word length, i.e., MSR.

Models                   P     R     F
Single layer (d = 50)    94.3  93.7  94.0
GCNN (d = 50)            95.8  95.2  95.5
Single layer (d = 100)   94.9  94.4  94.7

Table 3: Performances of different models on PKU dataset.

+Dictionary              PKU             MSR
                         ours   theirs   ours   theirs
(Chen et al., 2015a)     94.9   95.9     95.8   96.2
(Chen et al., 2015b)     94.6   95.7     95.7   96.4
This work                95.7   -        96.4   -

Table 4: Comparison of using different Chinese idiom dictionaries.3

The results are listed in Table 3 and demonstrate that performance is significantly boosted by the GCNN architecture (from 94.0% to 95.5% F1-score), while the best performance the simplified version can achieve is 94.7%, and only with a much larger character embedding size.

Link Score & Word Score. We conducted several experiments to investigate the individual effects of the link score and the word score, since these two types of scores are intended to estimate the sentence likelihood from two different perspectives: the semantic coherence between words and the existence of individual words. The learning curves of models with different scoring strategies are shown in Figure 6.

The model with only the word score can be regarded as making segmentation decisions based only on local window information; the comparison shows that such a model gives moderate performance. By contrast, the model with only the link score performs much better, close to the joint model, which demonstrates that the complete segmentation history, which cannot be effectively modeled in previous schemes, is of great value for word segmentation.

6.4 Results

3 The dictionary used in (Chen et al., 2015a; Chen et al., 2015b) has been neither publicly released nor specified as to its exact source. We had to re-run their code using our selected dictionary to make a fair comparison. Our dictionary has been submitted along with this submission.


Models                     PKU                  MSR
                           P     R     F        P     R     F
(Zheng et al., 2013)       92.8  92.0  92.4     92.9  93.6  93.3
(Pei et al., 2014)         93.7  93.4  93.5     94.6  94.2  94.4
(Chen et al., 2015a)*      94.6  94.2  94.4     94.6  95.6  95.1
(Chen et al., 2015b)*      94.6  94.0  94.3     94.5  95.5  95.0
This work                  95.5  94.9  95.2     96.1  96.7  96.4
+Pre-trained character embedding
(Zheng et al., 2013)       93.5  92.2  92.8     94.2  93.7  93.9
(Pei et al., 2014)         94.4  93.6  94.0     95.2  94.6  94.9
(Chen et al., 2015a)*      94.8  94.1  94.5     94.9  95.9  95.4
(Chen et al., 2015b)*      95.1  94.4  94.8     95.1  96.2  95.6
This work                  95.8  95.2  95.5     96.3  96.8  96.5

Table 5: Comparison with previous neural network models. Results with * are from our runs on their released implementations.5

Models                     PKU   MSR     PKU    MSR
(Tseng et al., 2005)       95.0  96.4    -      -
(Zhang and Clark, 2007)    94.5  97.2    -      -
(Zhao and Kit, 2008b)      95.4  97.6    -      -
(Sun et al., 2009)         95.2  97.3    -      -
(Sun et al., 2012)         95.4  97.4    -      -
(Zhang et al., 2013)       -     -       96.1*  97.4*
(Chen et al., 2015a)       94.5  95.4    96.4*  97.6*
(Chen et al., 2015b)       94.8  95.6    96.5*  97.4*
This work                  95.5  96.5    -      -

Table 6: Comparison with previous state-of-the-art models. Results with * used external dictionary or corpus.

We first compare our model with the latest neural network methods, as shown in Table 4. However, (Chen et al., 2015a; Chen et al., 2015b) used an extra preprocessing step to filter out Chinese idioms according to an external dictionary.4 Table 4 lists the results (F1-scores) with different dictionaries, which show that our models perform better under the same settings.

Table 5 gives comparisons with previous neural network models. In the first block of Table 5, the character embedding matrix M is randomly initialized. The results show that our proposed model outperforms previous neural network methods.

4 In detail, when a dictionary is used, a preprocessing step is performed before training and testing which scans the original text to find Chinese idioms included in the dictionary and replaces them with a unique token. This treatment does not strictly follow the convention of the closed-set setting defined by the SIGHAN Bakeoff, in which no linguistic resources, either dictionary or corpus, other than the training corpus should be adopted.

5 To make the comparison fair, we re-ran their code (https://github.com/dalstonChen) without their unspecified Chinese idiom dictionary.

Previous works have found that performance can be improved by pre-training the character embeddings on large unlabeled data. Therefore, we use the word2vec toolkit6 (Mikolov et al., 2013) to pre-train character embeddings on the Chinese Wikipedia corpus and use them for initialization. Table 5 also shows the results with additional pre-trained character embeddings. Again, our model achieves better performance than previous neural network models.

Table 6 compares our models with previous state-of-the-art systems. Recent systems such as (Zhang et al., 2013), (Chen et al., 2015b) and (Chen et al., 2015a) rely on both extensive feature engineering and external corpora to boost performance; such systems are not directly comparable with our models. In the closed-set setting, our models achieve state-of-the-art performance on the PKU dataset but only a competitive result on the MSR dataset, which can be attributed to the overly strict maximum word length setting adopted for consistency, as it is well known that the MSR corpus has a much longer average word length (Zhao et al., 2010).

6 http://code.google.com/p/word2vec/

Max. word length    F1 score    Time (Days)
4                   96.5        4
5                   96.7        5
6                   96.8        6

Table 7: Results on MSR dataset with different maximum decoding word length settings.

Table 7 shows the results on the MSR corpus with different maximum decoding word lengths, giving both F1-scores and training time. The results show that segmentation performance can indeed be further improved by allowing longer words during decoding, though longer training time is also needed. When 6-character words are allowed, the F1-score on MSR improves by a further 0.3%.

Regarding running cost, we roughly report the current computational cost on the PKU dataset.7 It takes about two days to finish 50 training epochs (for the results in Figure 6 and the last row of Table 6) using only two cores of an Intel i7-5960X CPU. The RAM requirement during training is less than 800MB, and the trained model occupies less than 4MB on disk.

7 Related Work

Neural Network Models. Most modern CWS methods have followed (Xue, 2003) and treated CWS as a sequence labeling problem (Zhao et al., 2006b). Recently, researchers have tended to explore neural network based approaches (Collobert et al., 2011) to reduce the effort of feature engineering (Zheng et al., 2013; Qi et al., 2014; Chen et al., 2015a; Chen et al., 2015b). They modeled CWS as a tagging problem as well, scoring tags on individual characters. In those models, tag scores are decided by context information within local windows, and the sentence-level score is obtained via context-independent tag transitions. Pei et al. (2014) introduced tag embeddings as input to capture the combination of context and tag history. However, in previous works, only the tag of the previous character was taken into consideration, though theoretically the complete history of actions taken by the segmenter should be considered.

7 Our code is released at https://github.com/jcyk/CWS.

Alternatives to Sequence Labeling. Besides sequence labeling schemes, Zhang and Clark (2007) proposed a word-based perceptron method, and Zhang et al. (2012) used a linear-time incremental model which can also benefit from various kinds of features, including word-based features. But both of them rely heavily on massive handcrafted features. Contemporary to this work, some neural models (Zhang et al., 2016a; Liu et al., 2016) also leverage word-level information. Specifically, Liu et al. (2016) use a semi-CRF taking segment-level embeddings as input, and Zhang et al. (2016a) use a transition-based framework.

Another notable exception is (Ma and Hinrichs, 2015), which is also an embedding-based model but models CWS as configuration-action matching. However, again, this method only uses context information within limited-size windows.

Other Techniques. The proposed model might further benefit from techniques used in recent state-of-the-art systems, such as semi-supervised learning (Zhao and Kit, 2008b; Zhao and Kit, 2008a; Sun and Xu, 2011; Zhao and Kit, 2011; Zeng et al., 2013; Zhang et al., 2013), incorporating global information (Zhao and Kit, 2007; Zhang et al., 2016b), and joint models (Qian and Liu, 2012; Li and Zhou, 2012).

8 Conclusion

This paper presents a novel neural framework for the task of Chinese word segmentation, which contains three main components: (1) a factory that produces a word representation from its governed characters; (2) a sentence-level likelihood evaluation system for segmented sentences; (3) an efficient and effective algorithm to find the best segmentation.

The proposed framework makes a fresh attempt to formalize word segmentation as a direct structured learning procedure in terms of the recent distributed representation framework.

Our system produces results that are better than the latest neural network segmenters and comparable to previous state-of-the-art systems, and the framework retains great potential for further investigation and improvement in the future.


References

Yoshua Bengio, Rejean Ducharme, Pascal Vincent, and Christian Janvin. 2003. A neural probabilistic language model. The Journal of Machine Learning Research, 3:1137–1155.

Adam L. Berger, Vincent J. Della Pietra, and Stephen A. Della Pietra. 1996. A maximum entropy approach to natural language processing. Computational Linguistics, 22(1):39–71.

Xinchi Chen, Xipeng Qiu, Chenxi Zhu, and Xuanjing Huang. 2015a. Gated recursive neural network for Chinese word segmentation. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing, pages 1744–1753.

Xinchi Chen, Xipeng Qiu, Chenxi Zhu, Pengfei Liu, and Xuanjing Huang. 2015b. Long short-term memory neural networks for Chinese word segmentation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1197–1206.

Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, pages 1724–1734.

Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. 2014. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555.

Ronan Collobert, Jason Weston, Leon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural language processing (almost) from scratch. The Journal of Machine Learning Research, 12:2493–2537.

John Duchi, Elad Hazan, and Yoram Singer. 2011. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, 12:2121–2159.

Thomas Emerson. 2005. The second international Chinese word segmentation bakeoff. In Proceedings of the Fourth SIGHAN Workshop on Chinese Language Processing, volume 133.

Sepp Hochreiter and Jurgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Chang-Ning Huang and Hai Zhao. 2006. Which is essential for Chinese word segmentation: Character versus word. In The 20th Pacific Asia Conference on Language, Information and Computation, pages 1–12.

Yoon Kim, Yacine Jernite, David Sontag, and Alexander M. Rush. 2015. Character-aware neural language models. arXiv preprint arXiv:1508.06615.

Jonathan K. Kummerfeld, Taylor Berg-Kirkpatrick, and Dan Klein. 2015. An empirical analysis of optimization for max-margin NLP. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 273–279.

John Lafferty, Andrew McCallum, and Fernando C. N. Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the Eighteenth International Conference on Machine Learning.

Zhongguo Li and Guodong Zhou. 2012. Unified dependency parsing of Chinese morphological and syntactic structures. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 1445–1454.

Yijia Liu, Wanxiang Che, Jiang Guo, Bing Qin, and Ting Liu. 2016. Exploring segment representations for neural segmentation models. arXiv preprint arXiv:1604.05499.

Jin Kiat Low, Hwee Tou Ng, and Wenyuan Guo. 2005. A maximum entropy approach to Chinese word segmentation. In Proceedings of the Fourth SIGHAN Workshop on Chinese Language Processing, volume 1612164, pages 448–455.

Jianqiang Ma and Erhard Hinrichs. 2015. Accurate linear-time Chinese word segmentation via embedding matching. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing, pages 1733–1743.

Tomas Mikolov, Martin Karafiat, Lukas Burget, Jan Cernocky, and Sanjeev Khudanpur. 2010. Recurrent neural network based language model. In 11th Annual Conference of the International Speech Communication Association, pages 1045–1048.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

Wenzhe Pei, Tao Ge, and Baobao Chang. 2014. Max-margin tensor neural network for Chinese word segmentation. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, pages 293–303.

Fuchun Peng, Fangfang Feng, and Andrew McCallum. 2004. Chinese segmentation and new word detection using conditional random fields. In Proceedings of the 20th International Conference on Computational Linguistics, page 562.

Yanjun Qi, Sujatha G. Das, Ronan Collobert, and Jason Weston. 2014. Deep learning for character-based information extraction. In Advances in Information Retrieval, pages 668–674.

Xian Qian and Yang Liu. 2012. Joint Chinese word segmentation, POS tagging and parsing. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 501–511.

Nathan D. Ratliff, J. Andrew Bagnell, and Martin Zinkevich. 2007. (Approximate) subgradient methods for structured prediction. In International Conference on Artificial Intelligence and Statistics, pages 380–387.

Richard Socher, John Bauer, Christopher D. Manning, and Andrew Y. Ng. 2013. Parsing with compositional vector grammars. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, pages 455–465.

Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958.

Weiwei Sun and Jia Xu. 2011. Enhancing Chinese word segmentation using unlabeled data. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 970–979.

Xu Sun, Yaozhong Zhang, Takuya Matsuzaki, Yoshimasa Tsuruoka, and Jun'ichi Tsujii. 2009. A discriminative latent variable Chinese segmenter with hybrid word/character information. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 56–64.

Xu Sun, Houfeng Wang, and Wenjie Li. 2012. Fast online training with frequency-adaptive learning rates for Chinese word segmentation and new word detection. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics, pages 253–262.

Martin Sundermeyer, Ralf Schluter, and Hermann Ney. 2012. LSTM neural networks for language modeling. In 13th Annual Conference of the International Speech Communication Association.

Ben Taskar, Vassil Chatalbashev, Daphne Koller, and Carlos Guestrin. 2005. Learning structured prediction models: A large margin approach. In Proceedings of the 22nd International Conference on Machine Learning, pages 896–903.

Huihsin Tseng, Pichuan Chang, Galen Andrew, Daniel Jurafsky, and Christopher Manning. 2005. A conditional random field word segmenter for SIGHAN Bakeoff 2005. In Proceedings of the Fourth SIGHAN Workshop on Chinese Language Processing, volume 171.

Peilu Wang, Yao Qian, Hai Zhao, Frank K. Soong, Lei He, and Ke Wu. 2016. Learning distributed word representations for bidirectional LSTM recurrent neural network. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.

Nianwen Xue. 2003. Chinese word segmentation as character tagging. Computational Linguistics and Chinese Language Processing, 8(1):29–48.

Xiaodong Zeng, Derek F. Wong, Lidia S. Chao, and Isabel Trancoso. 2013. Graph-based semi-supervised model for joint Chinese word segmentation and part-of-speech tagging. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, pages 770–779.

Yue Zhang and Stephen Clark. 2007. Chinese segmentation with a word-based perceptron algorithm. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pages 840–847.

Kaixu Zhang, Maosong Sun, and Changle Zhou. 2012. Word segmentation on Chinese micro-blog data with a linear-time incremental model. In Second CIPS-SIGHAN Joint Conference on Chinese Language Processing, pages 41–46.

Longkai Zhang, Houfeng Wang, Xu Sun, and Mairgup Mansur. 2013. Exploring representations from unlabeled data with co-training for Chinese word segmentation. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 311–321.

Meishan Zhang, Yue Zhang, and Guohong Fu. 2016a. Transition-based neural word segmentation. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics.

Zhisong Zhang, Hai Zhao, and Lianhui Qin. 2016b. Probabilistic graph-based dependency parsing with convolutional neural network. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics.

Hai Zhao and Chunyu Kit. 2007. Incorporating global information into supervised learning for Chinese word segmentation. In Proceedings of the 10th Conference of the Pacific Association for Computational Linguistics, pages 66–74.

Hai Zhao and Chunyu Kit. 2008a. Exploiting unlabeled text with different unsupervised segmentation criteria for Chinese word segmentation. Research in Computing Science, 33:93–104.

Hai Zhao and Chunyu Kit. 2008b. Unsupervised segmentation helps supervised learning of character tagging for word segmentation and named entity recognition. In Proceedings of the Third International Joint Conference on Natural Language Processing, pages 106–111.

Hai Zhao and Chunyu Kit. 2011. Integrating unsupervised and supervised word segmentation: The role of goodness measures. Information Sciences, 181(1):163–183.

Hai Zhao, Chang-Ning Huang, and Mu Li. 2006a. An improved Chinese word segmentation system with conditional random field. In Proceedings of the Fifth SIGHAN Workshop on Chinese Language Processing, volume 1082117.

Hai Zhao, Chang-Ning Huang, Mu Li, and Bao-Liang Lu. 2006b. Effective tag set selection in Chinese word segmentation via conditional random field modeling. In Proceedings of the 9th Pacific Association for Computational Linguistics, volume 20, pages 87–94.

Hai Zhao, Chang-Ning Huang, Mu Li, and Bao-Liang Lu. 2010. A unified character-based tagging framework for Chinese word segmentation. ACM Transactions on Asian Language Information Processing, 9(2):5.

Xiaoqing Zheng, Hanyang Chen, and Tianyu Xu. 2013. Deep learning for Chinese word segmentation and POS tagging. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 647–657.