


Automatic Judgment Prediction via Legal Reading Comprehension

Shangbang Long¹, Cunchao Tu², Zhiyuan Liu²*, Maosong Sun²

¹Peking University
²Department of Computer Science and Technology, State Key Lab on Intelligent Technology and Systems, Institute for Artificial Intelligence, Tsinghua University, Beijing, China

[email protected], [email protected], {lzy,sms}@tsinghua.edu.cn

Abstract

Automatic judgment prediction aims to predict judicial results based on case materials. It has been studied for several decades, mainly by lawyers and judges, and is considered a novel and prospective application of artificial intelligence techniques in the legal field. Most existing methods follow the text classification framework, which fails to model the complex interactions among complementary case materials. To address this issue, we formalize the task as Legal Reading Comprehension (LRC) according to the legal scenario. Following the working protocol of human judges, LRC predicts the final judgment results based on three types of information: fact description, plaintiffs' pleas, and law articles. Moreover, we propose a novel LRC model, AutoJudge, which captures the complex semantic interactions among facts, pleas, and laws. In experiments, we construct a real-world civil case dataset for LRC. Experimental results on this dataset demonstrate that our model achieves significant improvement over state-of-the-art models. We will publish all source codes and datasets of this work on github.com for further research.

1 Introduction

Automatic judgment prediction is to train a machine judge to determine whether a certain plea in a given civil case would be supported or rejected. In countries with a civil law system, e.g., mainland China, this process should be done with reference to related law articles and the fact description, as is performed by a human judge. The intuition comes from the fact that under a civil law system, law articles act as principles for juridical judgments. Such techniques would have a wide range of promising applications. On the one hand, legal consulting systems could provide better access to high-quality legal resources in a low-cost way for legal outsiders, who suffer from the complicated terminologies. On the other hand, machine judge assistants for professionals would help improve the efficiency of the judicial system. Besides, automated judgment systems can help improve juridical equality and transparency. From another perspective, there are currently seven times as many civil cases as criminal cases in mainland China, with annual rates of increase of 10.8% and 1.6% respectively, making judgment prediction in civil cases a promising application (Zhuge, 2016).

∗ Corresponding author.

Figure 1: An Example of LRC.

Previous works (Aletras et al., 2016; Katz et al., 2017; Luo et al., 2017; Sulea et al., 2017) formalize judgment prediction as a text classification task, regarding either charge names or binary judgments, i.e., support or reject, as the target classes. These works focus on situations where only one result is expected, e.g., the US Supreme Court's decisions (Katz et al., 2017) and charge name prediction for criminal cases (Luo et al., 2017). Despite these recent efforts and their progress, automatic judgment prediction in the civil law system is still confronted with two main challenges:

arXiv:1809.06537v1 [cs.AI] 18 Sep 2018


One-to-Many Relation between Case and Plea. Every single civil case may contain multiple pleas, and the result of each plea is co-determined by related law articles and specific aspects of the involved case. For example, in divorce proceedings, the judgment of alienation of mutual affection is the key factor for granting divorce, but custody of children depends on which side can provide a better environment for the children's growth as well as the parents' financial condition. Here, different pleas are independent.

Heterogeneity of Input Triple. Inputs to a judgment prediction system consist of three heterogeneous yet complementary parts, i.e., fact description, plaintiff's plea, and related law articles. Concatenating them and treating them simply as a single sequence of words, as in previous works (Katz et al., 2017; Aletras et al., 2016), would cause a great loss of information. The same holds in question answering, where the dual inputs, i.e., query and passage, should be modeled separately.

Despite the introduction of neural networks that can learn better semantic representations of the input text, it remains unsolved how to incorporate proper mechanisms to integrate the complementary triple of pleas, fact descriptions, and law articles.

Inspired by recent advances in question answering (QA) based reading comprehension (RC) (Wang et al., 2017; Cui et al., 2017; Nguyen et al., 2016; Rajpurkar et al., 2016), we propose the Legal Reading Comprehension (LRC) framework for automatic judgment prediction. LRC incorporates the reading mechanism for better modeling of the complementary inputs mentioned above, as is done by human judges when referring to legal materials in search of supporting law articles. The reading mechanism, by simulating how humans connect and integrate multiple texts, has proven to be an effective module in RC tasks. We argue that applying the reading mechanism in a proper way among the triplets can obtain a better understanding and a more informative representation of the original text, and further improve performance. To instantiate the framework, we propose an end-to-end neural network model named AutoJudge.

For experiments, we train and evaluate our models in the civil law system of mainland China. We collect and construct a large-scale real-world dataset of 100,000 case documents that the Supreme People's Court of the People's Republic of China has made publicly available. Fact descriptions, pleas, and results can be extracted easily from these case documents with regular expressions, since the original documents have special typographical characteristics indicating the discourse structure. We also take into account law articles and their corresponding juridical interpretations. We also implement and evaluate previous methods on our dataset, which prove to be strong baselines.

Our experimental results show significant improvements over previous methods. Further experiments demonstrate that our model also achieves considerable improvement over other off-the-shelf state-of-the-art models under the classification and question answering frameworks respectively. Ablation tests carried out by removing components of our model further prove its robustness and effectiveness.

To sum up, our contributions are as follows:

(1) We introduce the reading mechanism and re-formalize judgment prediction as Legal Reading Comprehension to better model the complementary inputs.

(2) We construct a real-world dataset for experiments, and plan to publish it for further research.

(3) Besides baselines from previous works, we also carry out comprehensive experiments comparing different existing deep neural network methods on our dataset. Supported by these experiments, the improvements achieved by LRC prove to be robust.

2 Related Work

2.1 Judgment Prediction

Automatic judgment prediction has been studied for decades. At the very first stage of judgment prediction studies, researchers focused on mathematical and statistical analysis of existing cases, without any conclusions or methodologies on how to predict them (Lauderdale and Clark, 2012; Segal, 1984; Keown, 1980; Ulmer, 1963; Nagel, 1963; Kort, 1957).

Recent attempts consider judgment prediction under the text classification framework. Most of these works extract efficient features from text (e.g., N-grams) (Liu and Chen, 2017; Sulea et al., 2017; Aletras et al., 2016; Lin et al., 2012; Liu and Hsieh, 2006) or case profiles (e.g., dates, terms, locations, and types) (Katz et al., 2017). All these methods require a large amount of human effort to design features or annotate cases. Besides, they also suffer from generalization issues when applied to other scenarios.

Motivated by the successful application of deep neural networks, Luo et al. (2017) introduce an attention-based neural model to predict charges of criminal cases, and verify the effectiveness of taking law articles into consideration. Nevertheless, they still fall into the text classification framework and lack the ability to handle multiple inputs with more complicated structures.

2.2 Text Classification

As the basis of previous judgment prediction works, the typical text classification task takes a single text as input and predicts the category it belongs to. Recent works usually employ neural networks to model the internal structure of a single input (Kim, 2014; Baharudin et al., 2010; Tang et al., 2015; Yang et al., 2016).

There also exists another thread of text classification called entailment prediction. Methods proposed in (Hu et al., 2014; Mitra et al., 2017) are intended for complementary inputs, but their mechanisms can be considered a simplified version of reading comprehension.

2.3 Reading Comprehension

Reading comprehension is a relevant task for modeling heterogeneous and complementary inputs, where an answer is predicted given two channels of input, i.e., a textual passage and a query. Considerable progress has been made (Cui et al., 2017; Dhingra et al., 2017; Wang et al., 2017). These models employ various attention mechanisms to model the interaction between passage and query. Inspired by the advantage of reading comprehension models in modeling multiple inputs, we apply this idea to the legal area and propose legal reading comprehension for judgment prediction.

3 Legal Reading Comprehension

3.1 Conventional Reading Comprehension

Conventional reading comprehension (He et al., 2017; Joshi et al., 2017; Nguyen et al., 2016; Rajpurkar et al., 2016) usually considers reading comprehension as predicting the answer given a passage and a query, where the answer could be a single word, a text span of the original passage, chosen from answer candidates, or generated by human annotators.

Generally, an instance in RC is represented as a triple 〈p, q, a〉, where p, q and a correspond to passage, query and answer respectively. Given a triple 〈p, q, a〉, RC takes the pair 〈p, q〉 as input and employs attention-based neural models to construct an efficient representation. Afterwards, the representation is fed into the output layer to select or generate an answer.

3.2 Legal Reading Comprehension

Existing works usually formalize judgment prediction as a text classification task and focus on extracting well-designed features of specific cases. Such simplification ignores that the judgment of a case is determined by its fact description and multiple pleas. Moreover, the final judgment should conform to the legal provisions, especially in civil law systems. Therefore, how to integrate the information (i.e., fact descriptions, pleas, and law articles) in a reasonable way is critical for judgment prediction.

Inspired by the successful application of RC, we propose a framework of Legal Reading Comprehension (LRC) for judgment prediction in the legal area. As illustrated in Fig. 1, for each plea in a given case, the prediction of the judgment result is made based on the fact description and the potentially relevant law articles.

In a nutshell, LRC can be formalized as the following quadruplet task:

〈f, p, l, r〉, (1)

where f is the fact description, p is the plea, l denotes the law articles, and r is the result. Given 〈f, p, l〉, LRC aims to predict the judgment result as

r = argmax_{r ∈ {support, reject}} P(r | f, p, l). (2)

The probability is calculated with respect to the interaction among the triple 〈f, p, l〉, which draws on the experience of the interaction between 〈passage, question〉 pairs in RC.
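In code, the quadruplet task of Eqs. (1)–(2) reduces to a binary decision over the triple 〈f, p, l〉. The following is a minimal Python sketch of this interface; the names LRCExample and predict_plea are ours for illustration and do not come from the paper:

```python
# A minimal sketch of the LRC quadruplet task of Eqs. (1)-(2).
# LRCExample and predict_plea are illustrative names, not from the paper.
from dataclasses import dataclass
from typing import List

@dataclass
class LRCExample:
    fact: List[str]   # tokenized fact description f
    plea: List[str]   # tokenized plea p
    laws: List[str]   # tokenized relevant law articles l
    result: int       # r: 1 = support, 0 = reject

def predict_plea(model, ex: LRCExample) -> int:
    """Return argmax_r P(r | f, p, l), as in Eq. (2)."""
    p_support = model(ex.fact, ex.plea, ex.laws)  # scalar probability in [0, 1]
    return 1 if p_support >= 0.5 else 0
```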

To summarize, LRC is innovative in the following aspects:

(1) While previous works fit the problem into the text classification framework, LRC re-formalizes the way to approach such problems. This new framework provides the ability to deal with the heterogeneity of the complementary inputs.

(2) Rather than employing conventional RC models to handle pair-wise text information in the legal area, LRC takes the critical law articles into consideration and models the facts, pleas, and law articles jointly for judgment prediction, which is more suitable to simulate how human judges deal with cases.

Figure 2: An overview of AutoJudge.

4 Methods

We propose a novel judgment prediction model, AutoJudge, to instantiate the LRC framework. As shown in Fig. 2, AutoJudge consists of three flexible modules: a text encoder, a pair-wise attentive reader, and an output module.

In the following parts, we give a detailed introduction to these three modules.

4.1 Text Encoder

As illustrated in Fig. 2, the text encoder aims to encode the word sequences of the inputs into continuous representation sequences.

Formally, consider a fact description f = {w^f_t}_{t=1}^m, a plea p = {w^p_t}_{t=1}^n, and the relevant law articles l = {w^l_t}_{t=1}^k, where w_t denotes the t-th word in the sequence and m, n, k are the lengths of the word sequences f, p, l respectively.

First, we convert the words to their respective word embeddings to obtain f = {w^f_t}_{t=1}^m, p = {w^p_t}_{t=1}^n and l = {w^l_t}_{t=1}^k, where w ∈ R^d. Afterwards, we employ bi-directional GRUs (Cho et al., 2014; Bahdanau et al., 2015; Chung et al., 2014) to produce the encoded representations u of all words as follows:

u^f_t = BiGRU_F(u^f_{t−1}, w^f_t),
u^p_t = BiGRU_P(u^p_{t−1}, w^p_t),
u^l_t = BiGRU_L(u^l_{t−1}, w^l_t). (3)

Note that we adopt different bi-directional GRUs to encode fact descriptions, pleas, and law articles respectively (denoted as BiGRU_F, BiGRU_P, and BiGRU_L). With these text encoders, f, p, and l are converted into u^f = {u^f_t}_{t=1}^m, u^p = {u^p_t}_{t=1}^n, and u^l = {u^l_t}_{t=1}^k.
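As a concrete reference, the following is a minimal PyTorch sketch of this encoder under the hyperparameters reported in Section 5.2 (128-dimensional embeddings, 128 hidden units per direction); the module and variable names are ours, not from the authors' released code:

```python
# A minimal PyTorch sketch of the text encoder (Section 4.1, Eq. 3).
# Sizes follow Section 5.2; the class name is illustrative.
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    def __init__(self, vocab_size=20000, emb_dim=128, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        # Three separate bi-directional GRUs: BiGRU_F, BiGRU_P, BiGRU_L.
        self.gru_f = nn.GRU(emb_dim, hidden, bidirectional=True, batch_first=True)
        self.gru_p = nn.GRU(emb_dim, hidden, bidirectional=True, batch_first=True)
        self.gru_l = nn.GRU(emb_dim, hidden, bidirectional=True, batch_first=True)

    def forward(self, fact_ids, plea_ids, law_ids):
        # Each argument: LongTensor of token ids with shape (batch, seq_len).
        u_f, _ = self.gru_f(self.embed(fact_ids))  # (batch, m, 2*hidden)
        u_p, _ = self.gru_p(self.embed(plea_ids))  # (batch, n, 2*hidden)
        u_l, _ = self.gru_l(self.embed(law_ids))   # (batch, k, 2*hidden)
        return u_f, u_p, u_l
```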

4.2 Pair-Wise Attentive Reader

How to model the interactions among the input texts is the most important problem in reading comprehension. In AutoJudge, we employ a pair-wise attentive reader to process 〈u^f, u^p〉 and 〈u^f, u^l〉 respectively. More specifically, we propose a pair-wise mutual attention mechanism to capture the complex semantic interactions between text pairs, which also increases the interpretability of AutoJudge.

4.2.1 Pair-Wise Mutual Attention

For each input pair 〈u^f, u^p〉 or 〈u^f, u^l〉, we employ pair-wise mutual attention to select relevant information from the fact description u^f and produce more informative representation sequences.

As a variant of the original attention mechanism (Bahdanau et al., 2015), we design the pair-wise mutual attention unit as a GRU with internal memories, denoted as mGRU.

Taking the representation sequence pair 〈u^f, u^p〉 for instance, mGRU stores the fact sequence u^f in its memories. For each timestamp t ∈ [1, n], it selects the relevant fact information c^f_t from the memories as follows,

c^f_t = Σ_{i=1}^m α_{t,i} u^f_i. (4)

Here, the weight α_{t,i} is the softmax value

α_{t,i} = exp(a_{t,i}) / Σ_{j=1}^m exp(a_{t,j}). (5)

Note that a_{t,j} represents the relevance between u^p_t and u^f_j. It is calculated as follows,

a_{t,j} = V^T tanh(W_f u^f_j + W_p u^p_t + U_p v^p_{t−1}). (6)

Here, v^p_{t−1} is the last hidden state of the GRU, which will be introduced in the following part. V is a weight vector, and W_f, W_p, U_p are the attention matrices of our proposed pair-wise attention mechanism.

4.2.2 Reading Mechanism

With the relevant fact information c^f_t and u^p_t, we get the t-th input of mGRU as

x^p_t = u^p_t ⊕ c^f_t, (7)

where ⊕ indicates the concatenation operation. Then, we feed x^p_t into the GRU to get the more informative representation sequence v^p = {v^p_t}_{t=1}^n as follows,

v^p_t = GRU(v^p_{t−1}, x^p_t). (8)

For the input pair 〈u^f, u^l〉, we can get v^l = {v^l_t}_{t=1}^k in the same way. Therefore, we omit the implementation details here.
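Below is a minimal PyTorch sketch of the pair-wise attentive reader, implementing Eqs. (4)–(8) with an explicitly unrolled loop. The dimensions follow Section 5.2 (256-dimensional encoder outputs and a 256-dimensional mGRU hidden state); the class name MatchGRU and the precomputation of W_f u^f_j are our own choices:

```python
# A sketch of the pair-wise attentive reader (mGRU, Eqs. 4-8).
# Class name and loop structure are our reading of the equations.
import torch
import torch.nn as nn

class MatchGRU(nn.Module):
    def __init__(self, enc_dim=256, hidden=256, attn_dim=256):
        super().__init__()
        self.W_f = nn.Linear(enc_dim, attn_dim, bias=False)
        self.W_p = nn.Linear(enc_dim, attn_dim, bias=False)
        self.U_p = nn.Linear(hidden, attn_dim, bias=False)
        self.v = nn.Linear(attn_dim, 1, bias=False)  # V^T in Eq. (6)
        self.cell = nn.GRUCell(enc_dim * 2, hidden)  # input x_t = u_t (+) c_t

    def forward(self, u_f, u_p):
        # u_f: (batch, m, enc_dim) fact memory; u_p: (batch, n, enc_dim) plea/law.
        batch, n, _ = u_p.shape
        h = u_p.new_zeros(batch, self.cell.hidden_size)
        mem = self.W_f(u_f)                    # precompute W_f u^f_j for all j
        outputs = []
        for t in range(n):
            # Eq. (6): a_{t,j} = V^T tanh(W_f u^f_j + W_p u^p_t + U_p v^p_{t-1})
            query = (self.W_p(u_p[:, t]) + self.U_p(h)).unsqueeze(1)
            scores = self.v(torch.tanh(mem + query)).squeeze(-1)  # (batch, m)
            alpha = torch.softmax(scores, dim=1)                  # Eq. (5)
            c = torch.bmm(alpha.unsqueeze(1), u_f).squeeze(1)     # Eq. (4)
            h = self.cell(torch.cat([u_p[:, t], c], dim=-1), h)   # Eqs. (7)-(8)
            outputs.append(h)
        return torch.stack(outputs, dim=1)     # v^p = {v^p_t}, (batch, n, hidden)
```

AutoJudge applies this reader to both 〈u^f, u^p〉 and 〈u^f, u^l〉; whether the two share parameters is not specified in the paper.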

Similar structures with attention mechanisms are also applied in (Wang et al., 2017; Rocktaschel et al., 2016; Wang and Jiang, 2016; Bahdanau et al., 2015) to obtain mutually aware representations in reading comprehension models, and they significantly improve the performance of this task.

4.3 Output Layer

Using the text encoder and the pair-wise attentive reader, the initial input triple 〈f, p, l〉 has been converted into two sequences, i.e., v^p = {v^p_t}_{t=1}^n and v^l = {v^l_t}_{t=1}^k, where v^l_t is defined similarly to v^p_t. These sequences preserve complex semantic information about the pleas and law articles, and filter out irrelevant information in the fact descriptions.

With these two sequences, we concatenate v^p and v^l along the sequence-length dimension to generate the sequence v* = {v_t}_{t=1}^{n+k}. Since we have already employed several GRU layers to encode the sequential inputs, another recurrent layer may be redundant. Therefore, we utilize a 1-layer CNN (Kim, 2014) to capture the local structure and generate the representation vector for the final prediction.

Assuming y ∈ [0, 1] is the predicted probability that the plea in the case sample would be supported and r ∈ {0, 1} is the gold standard, AutoJudge aims to minimize the cross-entropy loss

L = −(1/N) Σ_{i=1}^N [r_i ln y_i + (1 − r_i) ln(1 − y_i)], (9)

where N is the number of training samples. As all the calculations in our model are differentiable, we employ Adam (Kingma and Ba, 2015) for optimization.
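A minimal PyTorch sketch of this output layer follows, with the CNN settings from Section 5.2 (filter windows 1/3/4/5, 200 feature maps each, dropout 0.5); the layer names are illustrative, and Eq. (9) corresponds to the standard binary cross-entropy loss:

```python
# A sketch of the output layer (Section 4.3) and the loss of Eq. (9).
# CNN settings follow Section 5.2; names are illustrative.
import torch
import torch.nn as nn

class OutputLayer(nn.Module):
    def __init__(self, in_dim=256, windows=(1, 3, 4, 5), maps=200):
        super().__init__()
        self.convs = nn.ModuleList(
            nn.Conv1d(in_dim, maps, kernel_size=w) for w in windows)
        self.dropout = nn.Dropout(0.5)
        self.fc = nn.Linear(maps * len(windows), 1)

    def forward(self, v_p, v_l):
        # Concatenate along the sequence-length dimension: v* = [v^p ; v^l].
        v = torch.cat([v_p, v_l], dim=1).transpose(1, 2)  # (batch, in_dim, n+k)
        pooled = [torch.relu(conv(v)).max(dim=2).values for conv in self.convs]
        feat = self.dropout(torch.cat(pooled, dim=-1))
        return torch.sigmoid(self.fc(feat)).squeeze(-1)   # y in [0, 1]

# Eq. (9) is the standard binary cross-entropy:
loss_fn = nn.BCELoss()
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-4,
#                              betas=(0.9, 0.999), eps=1e-8)
```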

5 Experiments

To evaluate the proposed LRC framework and the AutoJudge model, we carry out a series of experiments on divorce proceedings, a typical yet complex field of civil cases. Divorce proceedings often come with several kinds of pleas, e.g., seeking divorce, custody of children, compensation, and maintenance, each of which focuses on different aspects and thus makes judgment prediction a challenge.

5.1 Dataset Construction for Evaluation

5.1.1 Data Collection

Since none of the datasets from previous works have been published, we decide to build a new one. We randomly collect 100,000 cases from China Judgments Online1, among which 80,000 cases are used for training and 10,000 each for validation and testing. Among the original cases, 51% are granted divorce and the others are not. There are 185,723 valid pleas in total, with 52% supported and 48% rejected. Note that, if the divorce plea in a case is not granted, the other pleas of this case will not be considered by the judge. Case materials are all natural language sentences, with on average 100.08 tokens per fact description and 12.88 per plea. There are 62 relevant law articles in total, each with 26.19 tokens on average. Note that the case documents include special typographical signals, making it easy to extract labeled data with regular expressions.

5.1.2 Data Pre-Processing

We apply some rules with legal priors to preprocess the dataset according to previous works (Liu et al., 2003, 2004; Bian and Shun-yuan, 2005), which have proved effective in our experiments.

Name Replacement2: All names in case documents are replaced with marks indicating their roles instead of simply being anonymized, e.g., <Plaintiff>, <Defendant>, <Daughter x>, and so on. Since "all are equal before the law"3, names should make no more difference than the roles their bearers take.

1 http://wenshu.court.gov.cn
2 We use regular expressions to extract names and roles from the formatted case header.
3 Constitution of the People's Republic of China.
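The exact extraction patterns are not published, so the following Python sketch is purely illustrative of the idea: read the roles and names off the formatted case header, then substitute role tags throughout the document:

```python
# A hypothetical sketch of the name-replacement step. The paper only states
# that names and roles are extracted from the case header with regular
# expressions; these patterns and role tags are illustrative assumptions.
import re

ROLE_TAGS = {"原告": "<Plaintiff>", "被告": "<Defendant>"}

def replace_names(text: str, header: str) -> str:
    name_to_tag = {}
    for role, tag in ROLE_TAGS.items():
        # e.g. header segment "原告张某，..." -> name "张某" tagged <Plaintiff>
        m = re.search(role + r"([\u4e00-\u9fa5]{2,4})[，,]", header)
        if m:
            name_to_tag[m.group(1)] = tag
    for name, tag in name_to_tag.items():
        text = text.replace(name, tag)
    return text
```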

Law Article Filtration: Since most accessible divorce proceeding documents do not contain ground-truth fine-grained articles4, we use an unsupervised method instead. First, we extract all the articles from the law text with regular expressions. Afterwards, we select the 10 most relevant articles according to the fact descriptions as follows. We obtain sentence representations with CBOW (Mikolov et al., 2013a,b) weighted by inverse document frequency, and calculate the cosine distance between cases and law articles. Word embeddings are pre-trained on Chinese Wikipedia pages5. As the final step, we extract the top 5 relevant articles for each sample respectively from the main marriage law articles and their interpretations, which are equally important. We manually check the extracted articles for 100 cases to ensure that the extraction quality is fairly good and acceptable.

The filtration process is automatic and fully unsupervised, since the original documents have no ground-truth labels for fine-grained law articles, and coarse-grained law articles only provide limited information. We also experiment with the ground-truth articles, but only a small fraction of them are fine-grained, and they are usually not available in real-world scenarios.
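A sketch of this retrieval step, assuming pretrained CBOW vectors and precomputed IDF weights are available as plain dictionaries (the data structures and function names are our assumptions):

```python
# A sketch of the unsupervised law-article filtration (Section 5.1.2):
# IDF-weighted averages of CBOW embeddings ranked by cosine similarity.
import numpy as np

def sent_vec(tokens, emb, idf, dim=128):
    # emb: dict token -> np.ndarray(dim); idf: dict token -> float
    vecs = [idf.get(t, 1.0) * emb[t] for t in tokens if t in emb]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def top_k_articles(fact_tokens, articles, emb, idf, k=5):
    # articles: list of tokenized law articles; returns the k most similar.
    f = sent_vec(fact_tokens, emb, idf)
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))
    scored = [(cos(f, sent_vec(art, emb, idf)), i)
              for i, art in enumerate(articles)]
    top = sorted(scored, key=lambda s: -s[0])[:k]
    return [articles[i] for _, i in top]

# Per the paper, this selection is applied twice: over the main marriage-law
# articles and over their juridical interpretations, taking the top 5 from each.
```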

5.2 Implementation Details

We employ Jieba6 for Chinese word segmentation and keep the top 20,000 frequent words; the other low-frequency words are replaced with the mark <UNK>. The word embedding size is set to 128. The hidden size of the GRU is set to 128 for each direction in the Bi-GRU. In the pair-wise attentive reader, the hidden state of mGRU is set to 256. In the CNN layer, filter windows are set to 1, 3, 4, and 5, with each filter containing 200 feature maps. We add a dropout layer (Srivastava et al., 2014) after the CNN layer with a dropout rate of 0.5. We use Adam (Kingma and Ba, 2015) for training and set the learning rate to 0.0001, β1 to 0.9, β2 to 0.999, ε to 1e−8, and the batch size to 64. We employ precision, recall, F1, and accuracy as evaluation metrics. We repeat all the experiments 10 times and report the average results.

4 Fine-grained articles are in the Juridical Interpretations, giving detailed explanations, while the Marriage Law only covers some basic principles.
5 https://dumps.wikimedia.org/zhwiki/
6 https://github.com/fxsjy/jieba
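The vocabulary construction step can be sketched as follows; jieba is the real segmenter named above, while the helper names are ours:

```python
# A minimal sketch of the vocabulary pipeline in Section 5.2: Jieba
# segmentation, a 20,000-word vocabulary, and <UNK> for the rest.
from collections import Counter
import jieba

def build_vocab(texts, size=20000):
    counts = Counter(tok for t in texts for tok in jieba.lcut(t))
    itos = ["<UNK>"] + [w for w, _ in counts.most_common(size)]
    return {w: i for i, w in enumerate(itos)}

def encode(text, vocab):
    return [vocab.get(tok, vocab["<UNK>"]) for tok in jieba.lcut(text)]
```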

5.3 Baselines

For comparison, we adopt and re-implement three kinds of baselines as follows:

Lexical Features + SVM: We implement an SVM with lexical features in accordance with previous works (Lin et al., 2012; Liu and Hsieh, 2006; Aletras et al., 2016; Liu and Chen, 2017; Sulea et al., 2017) and select the best feature set on the development set.

Neural Text Classification Models: We implement and fine-tune a series of neural text classifiers, including the attention-based method (Luo et al., 2017) and other methods we deem important. CNN (Kim, 2014) and GRU (Cho et al., 2014; Yang et al., 2016) take as input the concatenation of fact description and plea. Similarly, CNN/GRU+law refers to using the concatenation of fact description, plea, and law articles as input.

RC Models: We implement and train some off-the-shelf RC models, including r-net (Wang et al., 2017) and AoA (Cui et al., 2017), which are the leading models on the SQuAD leaderboard. In our initial experiments, these models take the fact description as the passage and the plea as the query. Further, law articles are added to the fact description as a part of the reading materials, which is a simple way to consider them as well.

5.4 Results and Analysis

From Table 1, we have the following observations:

(1) AutoJudge consistently and significantly outperforms all the baselines, including RC models and other neural text classification models, which shows the effectiveness and robustness of our model.

(2) RC models achieve better performance than most text classification models (excluding GRU+Attention), which indicates that the reading mechanism is a better way to integrate information from heterogeneous yet complementary inputs. On the contrary, simply adding law articles as a part of the reading materials makes no difference in performance. Note that GRU+Attention employs a similar attention mechanism to that of RC models and takes additional law articles into consideration, and thus achieves comparable performance with RC models.


Models            P     R     F1    Acc.
MaxFreq           52.2  100   68.6  52.2
SVM*              57.8  53.5  55.6  55.5
CNN               76.1  81.9  79.0  77.6
CNN+law           74.4  79.4  77.0  76.0
GRU               79.2  72.9  76.1  76.6
GRU+law           78.2  68.2  72.8  74.4
GRU+Attention*    79.1  80.7  80.0  79.1
AoA               79.3  78.9  79.2  78.3
AoA+law           79.0  79.2  79.1  78.3
r-net             79.5  78.7  79.2  78.4
r-net+law         79.3  78.8  79.0  78.3
AutoJudge         80.4  86.6  83.4  82.2

Table 1: Experimental results (%). P/R/F1 are reported for positive samples and calculated as the mean score over the 10 runs. Acc. is defined as the proportion of test samples classified correctly, equal to micro-precision. MaxFreq refers to always predicting the most frequent label, i.e., support in our dataset. * indicates methods proposed in previous works.

(3) Compared with conventional RC models, AutoJudge achieves significant improvement through the consideration of additional law articles, which reflects the difference between LRC and conventional RC models. We re-formalize LRC in the legal area to incorporate law articles via the reading mechanism, which enhances judgment prediction. Moreover, CNN/GRU+law decrease the performance by simply concatenating the original text with law articles, while GRU+Attention/AutoJudge increase the performance by integrating law articles with the attention mechanism. This shows the importance and rationality of using the attention mechanism to capture the interactions between multiple inputs.

The experiments support our hypothesis, proposed in the Introduction, that in civil cases it is important to model the interactions among case materials, and that the reading mechanism can perform the matching among them well.

5.5 Ablation Test

AutoJudge is characterized by the incorporation of the pair-wise attentive reader, law articles, and a CNN output layer, as well as some pre-processing with legal priors. We design ablation tests to evaluate the effectiveness of each of these modules. When the attention mechanism is taken off, AutoJudge degrades into a GRU with a CNN stacked on top.

Models                     F1           Acc.
AutoJudge                  83.4         82.2
w/o reading mechanism      78.9 (↓4.5)  78.2 (↓4.0)
w/o law articles           79.6 (↓3.8)  78.4 (↓3.8)
CNN→LSTM                   77.6 (↓5.8)  77.7 (↓4.5)
w/o pre-processing         81.1 (↓2.3)  80.3 (↓1.9)
w/o law article selection  80.6 (↓2.8)  80.5 (↓1.7)
with GT law articles       85.1 (↑1.7)  84.1 (↑1.9)

Table 2: Experimental results of ablation tests (%).

When law articles are taken off, the CNN output layer only takes v^p = {v^p_t}_{t=1}^n as input. Besides, our model is tested without name replacement and without the unsupervised selection of law articles (i.e., passing the whole law text) respectively. As mentioned above, our system uses law articles extracted with the unsupervised method, so we also experiment with ground-truth law articles.

Results are shown in Table 2. We can infer that:

(1) The performance drops significantly after removing the attention layer or excluding the law articles, which is consistent with the comparison between AutoJudge and the baselines. This result verifies that both the reading mechanism and the incorporation of law articles are important and effective.

(2) After replacing the CNN with an LSTM layer, performance drops by as much as 4.4% in accuracy and 5.7% in F1 score. The reason may be the redundancy of RNNs. AutoJudge has already employed several GRU layers to encode the text sequences; another RNN layer may be useless for capturing sequential dependencies, while a CNN can catch the local structure in convolution windows.

(3) Motivated by existing rule-based works, we conduct data pre-processing on cases, including name replacement and law article filtration. If we remove the pre-processing operations, the performance drops considerably. This demonstrates that applying prior knowledge in the legal field benefits the understanding of legal cases.

Performance Over Law Articles: It is intuitive that the quality of the retrieved law articles affects the final performance. As shown in Table 2, feeding in the whole law text without filtration results in worse performance. However, when we train and evaluate our model with ground-truth articles, the performance is boosted by nearly 2% in both F1 and Acc. The performance improvement is quite limited compared to that in previous work (Luo et al., 2017) for the following reasons:


(1) As mentioned above, most case documents only contain coarse-grained articles, and only a small number of them contain fine-grained ones, which carry limited information in themselves. (2) Unlike in criminal cases, where the application of an article indicates the corresponding crime, law articles in civil cases work as references and can be applied both in supported and in rejected cases. That law articles cut both ways for the judgment result is one of the characteristics distinguishing civil cases from criminal ones. We also need to remember that the performance of 84.1% in accuracy or 85.1% in F1 score is unattainable in a real-world setting for automatic prediction, where ground-truth articles are not available.

Reading Weighs More Than Correct Law Articles: In the area of civil cases, the understanding of the case materials and how they interact is a critical factor; the inclusion of law articles alone is not enough. As shown in Table 2, compared to feeding the model an unselected set of law articles, taking away the reading mechanism results in a greater performance drop7. Therefore, the ability to read, understand, and select relevant information from the complex multi-sourced case materials is necessary. It is even more important in the real world, since we do not have access to ground-truth law articles when making predictions.

5.6 Case Study

Visualization of Positive Samples: We visualize the heat maps of the attention results8. As shown in Fig. 3, a deeper background color represents a larger attention score.

The attention score is calculated with Eq. (5). We take the average of the resulting n × m attention matrix over the time dimension to obtain an attention value for each word.

The visualization demonstrates that the attention mechanism can capture relevant patterns and semantics in accordance with different pleas in different cases.
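The averaging step described above can be written compactly; the helper name is ours:

```python
# A sketch of deriving per-word heat-map values (Fig. 3) from the
# attention matrix of Eq. (5). The function name is illustrative.
import numpy as np

def word_attention(alpha: np.ndarray) -> np.ndarray:
    # alpha: (n, m) matrix; alpha[t, i] = attention of plea step t on fact word i.
    return alpha.mean(axis=0)  # (m,) one score per fact word, used for coloring
```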

Failure Analysis: As for the failed samples, the most common cause is the anonymity issue, which is also shown in Fig. 3. As mentioned above, we conduct name replacement. However, some critical elements are also anonymized by the government due to the privacy issue. These elements are sometimes important for judgment prediction. For example, determination of the key factor long-time separation depends on the explicit dates, which are anonymized.

7 3.9% vs. 1.7% in Acc, and 4.4% vs. 2.8% in F1.
8 Examples given here are all drawn from the test set, with predictions matching the real judgments.

Figure 3: Visualization of Attention Mechanism.

6 Conclusion

In this paper, we explore the task of predicting judgments of civil cases. Compared with the conventional text classification framework, our proposed Legal Reading Comprehension framework can handle multiple and complex textual inputs. Moreover, we present a novel neural model, AutoJudge, that incorporates law articles for judgment prediction. In experiments, we compare our model on divorce proceedings with state-of-the-art baselines from various frameworks. Experimental results show that our model achieves considerable improvement over all the baselines. Besides, visualization results also demonstrate the effectiveness and interpretability of our proposed model.

In the future, we plan to explore the following directions: (1) Limited by the datasets, we can only verify our proposed model on divorce proceedings. A more general and larger dataset would benefit research on judgment prediction. (2) Judicial decisions in some civil cases are not binary, but more diverse and flexible, e.g., a compensation amount. Thus, it is critical for judgment prediction to handle various judgment forms.


References

Nikolaos Aletras, Dimitrios Tsarapatsanis, Daniel Preotiuc-Pietro, and Vasileios Lampos. 2016. Predicting judicial decisions of the European Court of Human Rights: A natural language processing perspective. PeerJ Computer Science, 2.

Baharum Baharudin, Lam Hong Lee, and Khairullah Khan. 2010. A review of machine learning algorithms for text-documents classification. JAIT, 1(1):4–20.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In Proceedings of ICLR.

Guo-wei Bian and Teng Shun-yuan. 2005. Integrating query translation and text classification in a cross-language patent access system. In Proceedings of NTCIR-7 Workshop Meeting, pages 252–261.

Kyunghyun Cho, Bart Van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. Computer Science.

Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. 2014. Empirical evaluation of gated recurrent neural networks on sequence modeling. In Proceedings of NIPS.

Yiming Cui, Zhipeng Chen, Si Wei, Shijin Wang, Ting Liu, and Guoping Hu. 2017. Attention-over-attention neural networks for reading comprehension. In Proceedings of ACL.

Bhuwan Dhingra, Hanxiao Liu, Zhilin Yang, William W. Cohen, and Ruslan Salakhutdinov. 2017. Gated-attention readers for text comprehension. In Proceedings of ACL.

Wei He, Kai Liu, Yajuan Lyu, Shiqi Zhao, Xinyan Xiao, Yuan Liu, Yizhong Wang, Hua Wu, Qiaoqiao She, Xuan Liu, Tian Wu, and Haifeng Wang. 2017. DuReader: A Chinese machine reading comprehension dataset from real-world applications. arXiv preprint arXiv:1711.05073.

Baotian Hu, Zhengdong Lu, Hang Li, and Qingcai Chen. 2014. Convolutional neural network architectures for matching natural language sentences. In Proceedings of NIPS, pages 2042–2050.

Mandar Joshi, Eunsol Choi, Daniel S. Weld, and Luke Zettlemoyer. 2017. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. In Proceedings of ACL.

Daniel Martin Katz, Michael J. Bommarito II, and Josh Blackman. 2017. A general approach for predicting the behavior of the Supreme Court of the United States. PLoS ONE, 12(4).

R. Keown. 1980. Mathematical models for legal prediction. Computer/LJ, 2:829.

Yoon Kim. 2014. Convolutional neural networks for sentence classification. In Proceedings of EMNLP.

Diederik Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In Proceedings of ICLR.

Fred Kort. 1957. Predicting Supreme Court decisions mathematically: A quantitative analysis of the "right to counsel" cases. American Political Science Review, 51(1):1–12.

Benjamin E. Lauderdale and Tom S. Clark. 2012. The Supreme Court's many median justices. American Political Science Review, 106(4):847–866.

Wan-Chen Lin, Tsung-Ting Kuo, Tung-Jia Chang, Chueh-An Yen, Chao-Ju Chen, and Shou-de Lin. 2012. Exploiting machine learning models for Chinese legal documents labeling, case classification, and sentencing prediction. In Proceedings of ROCLING, page 140.

Chao-Lin Liu, Cheng-Tsung Chang, and Jim-How Ho. 2004. Case instance generation and refinement for case-based criminal summary judgments in Chinese. JISE.

Chao-Lin Liu and Jim-How Ho. 2003. Classification and clustering for case-based criminal summary judgments. In Proceedings of the International Conference on Artificial Intelligence and Law, pages 252–261.

Chao-Lin Liu and Chwen-Dar Hsieh. 2006. Exploring phrase-based classification of judicial documents for criminal charges in Chinese. In Proceedings of ISMIS, pages 681–690.

Yi Hung Liu and Yen Liang Chen. 2017. A two-phase sentiment analysis approach for judgement prediction. Journal of Information Science.

Bingfeng Luo, Yansong Feng, Jianbo Xu, Xiang Zhang, and Dongyan Zhao. 2017. Learning to predict charges for criminal cases with legal basis. In Proceedings of EMNLP.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013a. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013b. Distributed representations of words and phrases and their compositionality. In Proceedings of NIPS, pages 3111–3119.

Bhaskar Mitra, Fernando Diaz, and Nick Craswell. 2017. Learning to match using local and distributed representations of text for web search. In Proceedings of WWW, pages 1291–1299.

Stuart S. Nagel. 1963. Applying correlation analysis to case prediction. Texas Law Review, 42:1006.

Tri Nguyen, Mir Rosenberg, Xia Song, Jianfeng Gao, Saurabh Tiwary, Rangan Majumder, and Li Deng. 2016. MS MARCO: A human generated machine reading comprehension dataset. arXiv preprint arXiv:1611.09268.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of EMNLP.

Tim Rocktaschel, Edward Grefenstette, Karl Moritz Hermann, Tomas Kocisky, and Phil Blunsom. 2016. Reasoning about entailment with neural attention. In Proceedings of ICLR.

Jeffrey A. Segal. 1984. Predicting Supreme Court cases probabilistically: The search and seizure cases, 1962-1981. American Political Science Review, 78(4):891–900.

Nitish Srivastava, Geoffrey E. Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A simple way to prevent neural networks from overfitting. JMLR, 15(1):1929–1958.

Octavia Maria Sulea, Marcos Zampieri, Mihaela Vela, and Josef Van Genabith. 2017. Exploring the use of text classification in the legal domain. In Proceedings of the ASAIL Workshop.

Duyu Tang, Bing Qin, and Ting Liu. 2015. Document modeling with gated recurrent neural network for sentiment classification. In Proceedings of EMNLP, pages 1422–1432.

S. Sidney Ulmer. 1963. Quantitative analysis of judicial processes: Some practical and theoretical applications. Law & Contemporary Problems, 28:164.

Shuohang Wang and Jing Jiang. 2016. Learning natural language inference with LSTM. In Proceedings of NAACL.

Wenhui Wang, Nan Yang, Furu Wei, Baobao Chang, and Ming Zhou. 2017. Gated self-matching networks for reading comprehension and question answering. In Proceedings of ACL, volume 1, pages 189–198.

Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alexander J. Smola, and Eduard H. Hovy. 2016. Hierarchical attention networks for document classification. In Proceedings of NAACL, pages 1480–1489.

Pingping Zhuge. 2016. Chinese Law Yearbook. The Chinese Law Yearbook Press.