The ReCAP Corpus: A Corpus of Complex Argument Graphs on German Education Politics

Lorik Dumani, Manuel Biertz, Alex Witry, Anna-Katharina Ludwig, Mirko Lenz, Stefan Ollinger, Ralph Bergmann, Ralf Schenkel

{dumani,biertz,s4alwitr,s2aaludw,s4mtlenz,bergmann,schenkel}@uni-trier.de
Trier University
Trier, Germany

ABSTRACT
The automatic extraction of arguments from natural language texts is a highly researched area and more important than ever today, as it is nearly impossible to manually capture all arguments on a controversial topic in a reasonable amount of time. For testing different algorithms, such as the retrieval of the best arguments, which are still in their infancy, gold standards must exist. An argument consists of a claim or standpoint that is supported or opposed by at least one premise. The generic term for a claim or premise is Argumentative Discourse Unit (ADU). The relationships between ADUs can be specified by argument schemes and can lead to large graphs. This paper presents a corpus of 100 argument graphs in German as well as 100 in English, which is unique in its size and its utilisation of argument schemes. The corpus is built from natural language texts such as party press releases and parliamentary motions on education policies in the German federal states. Each high-quality text is represented by an argument graph created with a modified version of the annotation tool OVA. The final argument graphs result from merging two previously independently annotated graphs on the basis of detailed discussions. We obtained the argument graphs in English by using the widely used translator DeepL.

CCS CONCEPTS
• Information systems → Content analysis and feature selection; Document structure;

KEYWORDS
argument mining, argumentation schemes, gold standard corpus, knowledge representation

1 INTRODUCTION
Argument mining (AM), i.e. the automatic extraction of arguments in natural language texts, is a nascent field in computational argumentation. It goes beyond topic identification, summarisation, and stance detection and identifies the reasons provided in a text to follow a course of action or to accept a judgement. An argument is generally understood to be a combination of a claim (or conclusion) and a premise (or reason), and an inference rule linking the two [18]. Inference within this micro-structure can further be explained as an argument scheme, which can support or limit the conclusion. Besides the inference, an argument scheme also describes the pattern of premises that are pertinent for the inference. A deeper explanation of these schemes is given in Section 2 of this paper.

Research on AM needs high-quality pre-annotated corpora in order to verify the validity of approaches. Existing corpora either have serious issues with regard to their quality or are available only in English: The Internet Argument Corpus (IAC) [31], which consists of 390,704 posts in 11,800 discussions extracted from the online debate site 4forums.com, was created with “little argumentation theory sitting behind it” [17]. AIFdb [13] is a database that allows storing and retrieving argument structures in the Argument Interchange Format (AIF) [4] and could provide a good source, but, as with nearly all corpora, most of its arguments are in English only. The Potsdam Microtexts Corpus [21], being the exception to the rule, consists of short texts that respond to a trigger question. It does not fulfil our requirements in three ways: First, it does not utilise argument schemes, which are helpful for both human annotators and machine learning algorithms [16]. Second, the corpus is artificially created so that every document or graph is composed of about five elements (conclusions and premises); thus it does not provide a real-world scenario. Third, with only 112 graphs¹ and the corresponding number of premises and conclusions, the corpus is rather small [17].

Hence, we created a new corpus which avoids these pitfalls. In this paper, we provide a high-quality corpus in German and English, which we make publicly available to the argument mining and argument retrieval community. It is built from texts ranging in length from press releases to election manifestos and includes not only inferences between claims and their respective premises but also argumentation schemes, and hence additional information about the premises’ interrelations and the inference rule (or warrant) utilised. The annotations were conducted on texts in German. We translated the corpus into English using the translator DeepL² [6, 8].

This work is part of the ReCAP project [2], which belongs to the DFG priority program Robust Argumentation Machines (RATIO)³. Bergmann et al. [2] propose an architecture of an argumentation machine which should reason on a knowledge level formed by arguments and argumentation structures, e.g. the corpus presented in this paper.

The remainder of this paper is organised as follows: Section 2 gives an introduction to argumentation theory. In particular, it explains the inference schemes used in our corpus. Then, Section 3 explains our modified variant of the OVA tool [11], which was used to annotate texts, i.e. for the manual transformation of texts into graphs. Section 4 describes the annotation process and annotator training, Section 5 gives a brief insight into the corpus, and Section 6 shows some possible applications for the corpus in recent research.

¹http://angcl.ling.uni-potsdam.de/resources/argmicro.html
²https://www.deepl.com/translator
³www.spp-ratio.de

2 ARGUMENTATION THEORETICAL CONSIDERATIONS

In this section we scratch the surface of argumentation theory and describe argumentation schemes, which are important for the inferences between ADUs. We will need this as the corpus is based on this theory.

Beginning in the 1950s, argumentation theory experienced a great turn, which is largely attributed to Toulmin [28]. He argued against a scholarship largely focused on the logical soundness of arguments and dismissing anything else as fallacious, and advocated, taking the example of legal argumentation, an argumentation theory that originates from empirical practice and thus looks at argumentation as it occurs in everyday life. While the classical approach looked at the premises and the conclusion and analysed whether the latter followed logically from the former, Toulmin recognised that the step from the data to the conclusion is often only presumptive and not logically cogent. In the following decades, several approaches emerged which followed his example.

Our approach is built on the work of Walton [32]. As a dialectical theory, the approach perceives argumentation as a rule-guided exchange, is not interested in the logical correctness or rhetorical success of an argument, and has essentially a dialogue in mind. As Walton denotes: “the offering of an argument presupposes a dialogue between two sides” [32]. A dialogue is understood as a “type of goal-directed conversation” of at least two people who take turns in their contributions [32]. To think of argumentation as dialogue is intuitive as long as we analyse parliamentary debates, for instance. However, we intend to include only monologic texts in our corpus. Thus, we assume that they are written with a dialogic intention directed towards an anonymous, at least larger, audience, because the reader “should raise critical questions about the argument he or she has been presented for acceptance” [32].

An argument consists of a claim or standpoint that is supported or opposed by at least one premise [27]. The claim is the central component of an argument and often also controversial [25]. Its acceptance is either increased or decreased by premises [30]. A support or attack relation is marked by an inference from the premise to the claim. Premises and claims can be subsumed under the generic term argumentative discourse unit (ADU) [20]. Since ADUs can have both outgoing and incoming edges, large graphs (called argument graphs in the following) can emerge. The main point of an argumentative text is a claim, referred to as the major claim [26].

Walton’s major contribution is a comprehensive catalogue of argumentation schemes found in natural language texts [34], hence not derived from normative ideals (unlike Pragma-Dialectics’ argumentative patterns [29]). Walton’s argumentation schemes have been empirically discovered in natural language, thus they are less normative than other approaches [23], and are promising for machine learning and artificial intelligence applications.

“Argumentation schemes are stereotypical patterns of reasoning”⁴ and stand in the tradition of Aristotelian topoi [33]⁵. They are a combination of inference and material relations [17], and facilitate the detection of arguments for both human annotators and AI algorithms [16]. This sets them apart from other approaches, for instance the very general Toulmin model [28] or the likewise general, but (over-)complex, Argumentum Model of Topics as coherently presented by Rigotti and Greco [23]. Schemes do not only facilitate detection; they also help to avoid biases. Furthermore, too much would be lost in a simple pro/contra dichotomisation of arguments [19]. Instead, argument schemes allow for greater differentiation, e.g. whether an author reasons about the consequences of an action or whether he supports his claim with an expert’s opinion. Even some of Toulmin’s examples show great resemblance to schemes [28].

The inference built into argumentation schemes can be presumptive and defeasible, i.e. in contrast to deductive and inductive inferences, the conclusion does not follow necessarily or with statistical probability from the premise(s), but is better understood as an assumption [32].

Table 1: The scheme for argument from positive consequences.

ADU        | Description
-----------|----------------------------------------------------------------
Premise    | If A is brought about, good consequences will plausibly occur.
Conclusion | Therefore, A should be brought about.

To provide an example of such a scheme, Table 1 shows the argument from positive consequences as identified in the compendium of Walton et al. [34]. As the corpus is available in German and English and we intend to address a wider readership, all the following examples are given in English only. The argument from positive consequences consists of two elements: a premise assuming positive consequences of action A and a conclusion that A should, therefore, be conducted [34]. Thus, this scheme has a course of action as its conclusion and is inherently presumptive, since its premise tries to forecast future consequences [32]. In “real” texts, however, the premise rarely takes this form, and the conclusion is often omitted. For example, the premise “Introducing a universal basic income would improve people’s living conditions” already implies through the word “improve” that the consequences are judged to be positive; thus the sentence will rarely be succeeded by an additional value judgement reading “and improving people’s living conditions is a positive thing” to act as support for the premise. Likewise, it is not necessary to spell out the conclusion “a universal basic income should be brought about”, though it is perhaps not quite as unusual. Schemes can grow more complex when they involve several steps of reasoning.
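The descriptor pattern behind such a scheme can be pictured as a pair of text templates sharing one slot for the action A. The following sketch is purely illustrative; the names `POSITIVE_CONSEQUENCES` and `instantiate` are our own and are not taken from OVA, AIF, or the corpus tooling:

```python
# Illustrative sketch: the "argument from positive consequences" scheme from
# Table 1, modelled as descriptor templates with a shared slot for action A.
# All names here are hypothetical, not part of any annotation tool.

POSITIVE_CONSEQUENCES = {
    "premise": "If {A} is brought about, good consequences will plausibly occur.",
    "conclusion": "Therefore, {A} should be brought about.",
}

def instantiate(scheme: dict, action: str) -> dict:
    """Fill the action into every descriptor of the scheme."""
    return {role: template.format(A=action) for role, template in scheme.items()}

arg = instantiate(POSITIVE_CONSEQUENCES, "a universal basic income")
print(arg["conclusion"])
# prints: Therefore, a universal basic income should be brought about.
```

As the paragraph above notes, real texts rarely verbalise both slots; an annotator matches the scheme even when the conclusion is left implicit.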

Yet, notwithstanding their usefulness, the vast number of 60 different schemes (subschemes not included) identified by Walton et al. [34] complicates annotator training (students often have issues in differentiating the schemes [16]) and is quite demanding in the usage of computational resources [17]. Two solutions are available: either building a hierarchical classification system (as e.g. proposed by Walton and Macagno [33]) or choosing a subset of schemes adapted to the research purpose (as done by Hansen and Walton [10]).

⁴Note here that the authors use argumentation in the sense of micro-level arguments. Their argumentation schemes are hence patterns of single arguments.
⁵Hannken-Illjes describes argument schemes as formal topoi [9].

In order to store argumentation and the respective schemes, the Argument Interchange Format (AIF) [4] and its argument graphs can be used. AIF represents an abstract model which aims to represent and exchange arguments between various argumentation tools. Argument graphs store ADUs and the inference link (scheme) connecting them for multiple arguments. As they also store interrelations of arguments, they allow representation of the argumentative structure of a text (possibly concluding in one major claim of the text as a whole). The AIF ontology defines two types of nodes to build argument graphs: information nodes (I-nodes) and scheme nodes (S-nodes). I-nodes relate to content and represent ADUs that depend on the domain of discourse. Scheme nodes are divided further into subclasses: rule application nodes (RA-nodes) denote specific inference relations, and conflict application nodes (CA-nodes) denote specific conflict relations. A rephrase of one proposition is marked with a rephrase relation (MA-nodes). Please note that argument schemes are only defined for RA-nodes. As the format is extensible, there is potential for a theoretically unlimited number of scheme types. Nodes may have different attributes such as title, text, and type (e.g. decision, action, goal, belief), and these attributes may vary with the use case. A node A supports a node B if and only if there is an edge connecting A to B. This means that the edges have an associated direction.
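As a rough illustration of this node-and-edge model, the following sketch encodes one argument with an RA-node between a premise and a claim. The class and field names are our own simplification for illustration, not the official AIF ontology identifiers:

```python
from dataclasses import dataclass, field
from typing import Optional

# Minimal sketch of an AIF-style argument graph. I-nodes carry ADU text;
# S-nodes (RA/CA/MA) carry the relation; edges are directed pairs of node ids.

@dataclass
class Node:
    node_id: str
    kind: str                       # "I", "RA", "CA", or "MA"
    text: str = ""
    scheme: Optional[str] = None    # argument scheme; only defined for RA-nodes

@dataclass
class ArgumentGraph:
    nodes: dict = field(default_factory=dict)
    edges: set = field(default_factory=set)  # directed (source_id, target_id)

    def add(self, node: Node) -> None:
        self.nodes[node.node_id] = node

    def connect(self, source: str, target: str) -> None:
        self.edges.add((source, target))

    def supports(self, a: str, b: str) -> bool:
        # node A supports node B iff there is a directed edge from A to B
        return (a, b) in self.edges

# premise --> RA-node (Positive Consequences) --> claim
g = ArgumentGraph()
g.add(Node("i1", "I", "Full-time school means more than just childcare."))
g.add(Node("ra1", "RA", scheme="Positive Consequences"))
g.add(Node("i2", "I", "We should continue the expansion of full-time schools."))
g.connect("i1", "ra1")
g.connect("ra1", "i2")
```

The directed edge set makes the support test from the paragraph above a simple membership check, and attaching the scheme to the RA-node mirrors the rule that schemes are only defined for inference relations.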

3 MODIFICATIONS TO OVA
Our annotations were performed with a modified version of the Online Visualisation of Argument tool (OVA) by Arg-Tech [11]⁶, which succeeds Araucaria [22] and provides a web-based facility to annotate argumentation in text files in conformance with the AIF standard. Since argument schemes can be used independently of languages, OVA has already been used for annotations in a variety of languages, including Chinese, Hindi, and Ukrainian⁷. For our purposes, we set up an adapted version⁸, while the use and the appearance remain largely identical.

Arg-Tech provides two versions of OVA, namely OVA and OVA+. OVA+ allows the user to specify locutions and transitions to the nodes according to Inference Anchoring Theory and is designed in particular for the analysis of dialogues [11]. Since we intended to analyse large, monologic texts only, and OVA+ graphs grow utterly complex even with short texts, we decided on the use of OVA.

Figure 1 shows the graphical user interface of the modified tool. The content of a text file is taken as input on the left side. The annotator creates an ADU (blue box) by selecting a text passage and clicking on a free space in the main frame on the right side. In the example, the argument constructed from the plain text consists of a claim (at the top) and a premise (at the bottom). The text in the generated nodes can also be modified and filled with new content, e.g. to resolve pronouns. These so-called “reconstructions” ensure

⁶http://ova.arg-tech.org
⁷http://corpora.aifdb.org/araucaria
⁸http://ova.uni-trier.de

Figure 1: Example of an annotated document with the modified OVA tool

that an ADU is self-contained and understandable without other ADUs. During annotation we tried to stay close to the text to create an ideal-typical argument. In the example, the word “It” in the lower node has been modified to “Full-time school”.

Edges are formed between multiple ADUs and specify the argument schemes. These are displayed as separate nodes (green box) with arrows depicting the direction of the edge. There are several types of edges using different colouring, with inference nodes (RA, green colour) being the ones used most in our annotations. While in modified OVA inference nodes offer a list of possible schemes, all other edge types (e.g. Conflict and Rephrase) only provide one generic scheme. If a scheme has been chosen, the connected nodes can be marked according to their role, i.e. as the conclusion or one of the several premises a scheme might contain. This is, however, not mandatory, since otherwise it would not be possible to annotate enthymemes⁹. So-called descriptors – the “Description” column in Table 1 – help with the annotation to assign the roles, and have been adapted to further our annotators’ understanding.

We added the expert opinion descriptor to each scheme, since we experienced that generally an argument from expert opinion – claiming the truth of a judgement by referring to the authority of a source – also includes other schemes, e.g. argument from positive consequences. This is especially the case when the author quotes the person at the end of a sentence, e.g. “according to Kim Doe”. To account for this and to avoid complicated constructions with limited gain, it is now possible, for example, to add an expert assessment to an argument from positive consequences or any other scheme. The expert then supports the inference step included in the scheme.

In the example in Figure 1, the scheme argument from positive consequences was chosen as the inference between premise and claim. The Positive Consequences scheme consists of three descriptors. The premise has the descriptor desc_p “If A is brought about, good consequences will plausibly occur”, and the claim has the descriptor desc_c “Therefore, A should be brought about”. Optionally, a descriptor desc_e can be assigned to an ADU if the statement comes from an expert. This has the role “Expert E asserts that proposition A is true/false”. As illustrated in the example, the selection of this scheme and the assignment of the descriptors desc_p and desc_c to “Full-time school means more than just childcare” and “We should continue the expansion of full-time schools on voluntary basis as we did before”, respectively, is appropriate.

⁹Enthymemes are arguments where one or more premises or the conclusion is left implicit [34].

The modified OVA version also adds new attributes to nodes. Many texts provide arguments to convince the reader of a core thesis. This core thesis is important for further steps and can be flagged as “major claim” [26] in modified OVA. Additionally, argument nodes now have attributes indicating their start and end positions in the original text. This is important as annotators can alter the displayed text during reconstruction. The position attributes help to keep track of the original text and to measure the inter-annotator agreement.

4 BUILDING THE CORPUS

4.1 Choice of Schemes
We decided not to build a classification system, but to determine an appropriate subset of schemes. Thus, we turned to the AIFdb database and searched for the most-used argumentation schemes¹⁰. Table 2 provides an overview of the most-used schemes in the AIFdb in total, as well as the number of graphs and the different data sets a scheme occurs in. Altogether, we crawled 102 data sets with a total of 10,622 graphs from the AIFdb database. Taking Table 2 into consideration, it is obvious that the graphs are relatively small. Despite the high number of uses of the default inference scheme (i.e. no scheme used at all), we found a large overlap between the most-used schemes in AIFdb and those of Hansen and Walton [10], who, like us, studied political texts, namely from the Ontario provincial election campaign of 2011. Hence, we built a subset of 18 schemes and one residual scheme (default inference) based on the list of Hansen and Walton, which we later expanded based on our annotators’ feedback.

Table 3 illustrates the schemes used by Hansen and Walton in comparison with our scheme set. Please note that 95.3% of the total of 256 arguments they collected were classified with the scheme set as depicted in Table 3. The remaining 4.7% could not be classified at all.

4.2 Annotator Training and Process
Our annotations were conducted by two student assistants coming from political science, media studies, and law studies. Hence, they are educated and trained to work with argumentative texts written for public discourse.

The annotators completed a tutorial of about four hours in which we introduced them to the OVA tool, a pre-built set of scheme examples from our own annotations, and our extensive annotation guidelines¹¹, which are based on Stab and Gurevych [26] but also include argumentation schemes. Furthermore, theoretical fundamentals were built up for a deeper understanding of the project. The

¹⁰http://corpora.aifdb.org/
¹¹The link to the annotation guidelines can be found here: https://basilika.uni-trier.de/nextcloud/s/BHg3vyQwYxDIrpA

Table 2: Analysis of the most-used schemes on AIFdb. Schemes (the same or variants) that we also used in our scheme set are emphasised. Data set from August 2019.

Scheme                        | Total occurrences | Occurrences in unique graphs | Occurrences in data sets in AIFdb
------------------------------|-------------------|------------------------------|----------------------------------
Default Inference             | 22984             | 5175                         | 92
JP-Reason                     | 713               | 148                          | 5
Example                       | 316               | 201                          | 29
Argument To Moral Virtue      | 180               | 90                           | 3
ERPractical Reasoning         | 164               | 56                           | 3
Cause To Effect               | 97                | 40                           | 7
Expert Opinion                | 94                | 83                           | 20
Argument To Practical Wisdom  | 82                | 50                           | 3
Positive Consequences         | 63                | 43                           | 6
Evidence To Hypothesis        | 41                | 23                           | 6
Practical Reasoning           | 40                | 16                           | 7
Analogy                       | 31                | 30                           | 13
Argument To Good Will         | 30                | 26                           | 3
Argument From Authority       | 26                | 22                           | 10
Reason                        | 26                | 9                            | 1
Negative Consequences         | 21                | 15                           | 6
Popular Opinion               | 20                | 12                           | 11

Table 3: Comparison of scheme sets. ✓ marks the use of a scheme in a scheme set. (✓) denotes that it is unclear which version of the scheme is utilised. The Walton et al. [34] column is incomplete and serves only as illustration.

Scheme                           | Walton et al. [34] | Hansen & Walton [10] | ReCAP
---------------------------------|--------------------|----------------------|----------------------
Position to Know                 | ✓                  | ✓                    | ✓
Expert Opinion                   | ✓                  | ✓ (Authority)        | ✓
Popular Opinion                  |                    | ✓                    | ✓
Example                          |                    | ✓                    | ✓
Analogy                          | ✓                  | ✓                    | ✓
Alternatives (cognitive schemes) | ✓                  | (✓)                  | ✓
Alternatives (normative schemes) | ✓                  | (✓)                  |
Values                           |                    | ✓                    | ✓
Practical Reasoning              |                    | ✓                    | ✓
Cause to Effect                  | ✓                  | ✓                    | ✓
Correlation to Cause             |                    | ✓                    | ✓
Sign                             | ✓                  | ✓                    | ✓
Positive Consequences            | ✓                  | ✓                    | ✓
Negative Consequences            | ✓                  | ✓                    | ✓
Generic Ad Hominem               |                    | ✓                    | ✓
Inconsistent Commitment          | ✓                  | ✓                    | ✓
Circumstantial Ad Hominem        | ✓                  | ✓                    | ✓
Rule                             |                    | ✓                    | ✓
Fairness                         |                    | ✓                    | ✓
Unfairness                       |                    | ✓                    | ✓
Misplaced Priorities             |                    | ✓                    | ✓
Appeal to Sympathy               |                    |                      | ✓
Explanation                      |                    |                      | ✓
Residual Category                |                    | ✓ (Can't classify)   | ✓ (Default Inference)

first weeks of the annotating process were supervised and guided more closely – as problems tend to occur during the actual process of annotating – and we stayed in close contact in order to permanently receive and give feedback. For instance, this exchange produced rules to solve situations where more than one possible major claim could be identified, and a hierarchy of schemes to apply when different schemes were conceivable.

After the training, annotation work was bipartite: In a first step, the annotators worked on each text individually, based on the experience they had gained and the annotation guidelines. Secondly, both individual versions of the same graph were merged into a final one. This merging process was not automated but was performed manually and discursively by the annotators. It followed a three-step procedure applied in every merge: identifying the major claims and their concordance, comparing the overall structure of the individual graphs, and finally merging discursively on the basis of one of the annotations. This guaranteed an ongoing exchange, served as a reliability check, and provided an opportunity to discuss minor problems, e.g. in the understanding of a text. As we worked in direct and frequent exchange with the annotators, unsolvable problems emerging at this stage were discussed in bi-weekly ReCAP plenums to decide on the best solution.

While only the merged versions are to be considered the reliable gold standard, this approach nonetheless produced three high-quality annotations for each document (in German), which can be used for further tests and other purposes, as few German corpora exist.

4.3 Inter-Annotator Agreement
For the inter-annotator agreement, we evaluated two aspects: first, the segmentation into ADUs, and second, the agreement of decisions on which claim was selected as the major claim. We did not evaluate the choice of schemes because the texts were independently annotated and resulted in different graphs. For an inter-annotator agreement on schemes, the graphs would need to be identical beforehand (apart from the chosen scheme).

We measured the inter-annotator agreement for marking the ADUs with Cohen’s κ [5], which also takes coincidental agreements into account. Its formula is κ = (Pr(a) − Pr(e)) / (1 − Pr(e)), where Pr(a) is the relative observed agreement and Pr(e) is the hypothetical probability of a random agreement [3]. An agreement of κ = 1 is perfect; an agreement of κ = 0 equals coincidence. We built a set of possible ADUs that the annotators could have annotated as follows: Since ADUs are generally separated by punctuation marks, we split all 100 texts into a total of 18,920 clauses by using the separators “.”, “,”, “?”, “!”, “:”, “;”. As some ADUs also cross sentence boundaries, we also included the ADUs that were actually set by the two annotators. For each of the obtained ADUs to be checked, we then examined whether the annotators tagged their start and/or end positions. For beginning positions we reached κ = 0.5595, for ending positions κ = 0.5989, and for both together, i.e. equal ADUs, we obtained κ = 0.4982. We can observe that the agreement for ending positions of ADUs is higher than for starting positions. Both of these values are higher than the value for complete ADUs. According to Landis and Koch [12], κ ∈ [0.41, 0.60] implies a moderate agreement. Hence, it was very important that the annotators first annotated the graphs independently of each other and then came to an agreement on how the gold standard should look.
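The agreement computation described above can be sketched in a few lines; `cohens_kappa` and the toy boundary vectors below are illustrative stand-ins, not the scripts actually used for the corpus:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa = (Pr(a) - Pr(e)) / (1 - Pr(e))."""
    n = len(labels_a)
    # Pr(a): relative observed agreement
    pr_a = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    # Pr(e): chance agreement from the annotators' marginal label frequencies
    ca, cb = Counter(labels_a), Counter(labels_b)
    pr_e = sum(ca[label] * cb[label] for label in ca.keys() | cb.keys()) / n ** 2
    return (pr_a - pr_e) / (1 - pr_e)

# Toy example: for 8 candidate boundary positions (obtained e.g. by splitting
# the text at ".", ",", "?", "!", ":", ";"), 1 means the annotator marked an
# ADU start at that position, 0 means they did not.
ann1 = [1, 0, 1, 1, 0, 0, 1, 0]
ann2 = [1, 0, 0, 1, 0, 0, 1, 1]
print(cohens_kappa(ann1, ann2))  # prints: 0.5
```

Running the same computation separately over start positions, end positions, and complete (start, end) pairs yields the three κ values reported above.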

Figure 2 shows the proportions of identical and diverging major claims found in the individual phase of the annotation process over time. Besides detecting identical and divergent major claims, we


Figure 2: Proportions of identical (light grey), intersecting (medium grey), and diverging (dark grey) major claims found in the merging process, grouped by quarters.

also noticed a hybrid state of intersecting major claims: the two annotators identified two different I-nodes as the major claim, but their content was similar; we describe such pairs as "semantically identical". These intersecting major claims could differ regarding their topic, coherence, and how they were reconstructed by the annotator, yet it was clear in the merging process that they meant the same.

4.4 Reflections on the Annotation Process

Two aspects are noteworthy: First, each annotator developed an individual style of annotating, which caused large differences in the first quarter of annotations. This is also reflected in the high proportion of diverging major claims. This margin was reduced over time as both annotators benefited from the discursive merging process and individual errors gradually disappeared. However, while the annotations as a whole became more similar, a noteworthy divergence of identified major claims remained. This second trend demonstrates that identifying the main intention of a text is far more interpretative than identifying its component arguments. It is not uncommon for the annotators to identify the same premises and conclusions and nonetheless find different major claims. We found that a major claim has an average length of 135 characters and occurs after an average of 32 % of the characters in the text. We investigated this more closely and found that the largest fraction of major claims (58 out of 100) occurred in the first 25 % of the text. In 13 cases they occur between 25 and 50 % of the characters, in 11 cases between 50 and 75 %, and in 16 cases in the last 25 % of the text. Two cases could not be identified. For the distribution across text types, see Figure 4. In order to resolve divergences we developed two merging rules with regard to major claims: normativity trumps description, and more content trumps less content (as long as it still constitutes a conclusion).
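The quartile statistic above can be reproduced with a small helper; the function name and sample values here are ours, not from the corpus tooling.

```python
# Hypothetical helper (our naming) for bucketing a major claim into one of
# the four quarters of a text by its relative starting position.

def position_quarter(start_offset: int, text_length: int) -> int:
    """Return 1-4 for the quarter of the text in which the claim starts."""
    ratio = start_offset / text_length
    return min(int(ratio * 4) + 1, 4)  # clamp offset == length into quarter 4

# A claim starting after 32 % of the characters falls into the second quarter:
print(position_quarter(320, 1000))  # → 2
```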

5 THE CORPUS

Our final corpus consists of 100 argument graphs and is available on request from the authors. The text sources used are German state parliamentary motions, press releases, position papers, and party manifestos dealing with education issues in Bavaria (BY), Hamburg (HH), and Rhineland-Palatinate (RLP). Topics covered are


reforms of school systems in general, the integrative comprehensive school (RLP), the Stadtteilschule (a Hamburg alternative to the Gymnasium), and digitisation as a general challenge (particularly BY). Figure 3 shows a relatively large argumentation graph from our corpus (176 nodes, 154 edges) and illustrates the complexity.

Once we completed the annotations, we translated the texts in the nodes with DeepL and obtained an English corpus. To ensure the quality of the translated corpus, we checked a random sample of the translations and found them to be of very high quality.

Documents were available either as web pages or as PDF files, so we saved the web pages as PDFs as well. Because text positions were saved for the mapping of the nodes, the PDFs had to be transformed to text files in a standardised manner, which also removed hyphenations, kept paragraphs together, and more. For this purpose, we wrote a text extractor for the PDF files, whose output needed only slight manual editing. In total, the 100 documents contain 2,479 premises and conclusions (argumentative discourse units (ADUs) in terms of Peldszus and Stede [20]). The total number of edges in our corpus is 2,281, of which the large majority (91.1 %) are represented as inference nodes. The remainder is split between conflict nodes (4.3 %) and rephrase nodes (4.6 %). Please note that these two types are not enriched with schemes and descriptors. The average numbers of nodes and edges per graph are 25.33 and 20.78, and the median numbers are 17 and 15, respectively, while ADUs contain 1.112 sentences and 6.789 words on average (1 and 6 in the median, respectively).
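Since the graphs follow the AIF model, such node-type statistics can be computed from a JSON serialisation with the standard library alone. The snippet below uses a hand-written miniature graph; the key names ("nodes", "type") follow the common AIF JSON layout and may differ slightly in the released files.

```python
import json
from collections import Counter

# Miniature AIF-style graph, invented for illustration: two information
# nodes (I), one inference node (RA), one conflict node (CA).
raw = """
{
  "nodes": [
    {"nodeID": "1", "type": "I",  "text": "Schools need more funding."},
    {"nodeID": "2", "type": "I",  "text": "Class sizes are too large."},
    {"nodeID": "3", "type": "RA", "text": "Default Inference"},
    {"nodeID": "4", "type": "CA", "text": "Default Conflict"}
  ],
  "edges": []
}
"""

graph = json.loads(raw)
counts = Counter(node["type"] for node in graph["nodes"])
print(counts["I"], counts["RA"], counts["CA"])  # → 2 1 1
```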

The selection of the documents followed three main considerations: (1) a rough thematic coherence per state without neglecting topical diversity, (2) an experience-based intuition regarding the argumentative quality of the text, and (3) public availability. Education has been chosen as the topic since it is one of the few domains where the German federal states enjoy exclusive legislative competence in the Federal Republic of Germany (article 70 (1) in conjunction with articles 73 and 74 of the German federal constitution). We can thus profit from different education systems: the diverging education discourses, which nonetheless share a language and general concepts (like Abitur (comparable to A-levels), Grundschule (comparable to primary school), and more), enable us to test the transferability of arguments by case-based reasoning (CBR) as described by Bergmann et al. [2].

Table 4 shows the most frequently used schemes on inference nodes in our corpus. In comparison to AIFdb (Table 2), our corpus shows a different proportion of scheme usage. Default Inference has been used only 5 times in our corpus, while it is the most common inference scheme in the AIFdb collection. Our corpus specifies the inferences between claims and premises more precisely, which results in higher-quality graphs. For comparison, the schemes that we also use in our annotations are highlighted in Table 2. One common pattern in the argument structures is that the major claim forms a central premise from which multiple other claims are derived using the practical reasoning argumentation scheme.

Table 4: Analysis of used schemes in the ReCAP Corpus.

Scheme                              Total occurrences   Occurrences in unique graphs
Positive Consequences                             471                             76
Practical Reasoning                               353                             61
Negative Consequences                             297                             62
Sign                                              280                             78
Cause to Effect                                   195                             50
Example                                           173                             49
Unfairness                                         48                             23
Misplaced Priorities                               47                             31
Rule                                               41                             22
Fairness                                           31                             20
Alternatives (Cognitive Schemes)                   29                             16
Position to Know                                   28                             22
Popular Opinion                                    28                             23
Circumstantial Ad Hominem                          13                             12
Expert Opinion                                     13                              8
Danger Appeal                                      12                             10
Analogy                                             6                              6
Inconsistent Commitment                             6                              5
Default Inference                                   7                              6
Default Rephrase                                  106                             61
Default Conflict                                   97                             36

6 POSSIBLE APPLICATIONS

There is a plethora of literature on argument mining and on the "post-processing" of mined arguments, some of which has already been discussed in the introduction.

In their combined approach, Lawrence and Reed [14] use argument scheme structure as one of three layers; the other layers are identification using discourse indicators and topical similarity. In the argument scheme layer, they obtain scheme structure information with a Naïve Bayes classifier from the scikit-learn12 Python package, trained on the AIFdb data and fed with a manually crafted list of keywords for each descriptor in the scheme, enlarged by similar words from WordNet. Yet, they do this only for the two schemes argument from expert opinion and argument from positive consequences. Together with the other layers, they achieve impressive results of 0.91 precision and 0.77 recall (F1 0.83), yet they test only on 36 preconnected propositions with two schemes. If applied to our corpus, the test could be repeated on a larger scale, with more schemes, and would thus achieve greater validity. Furthermore, one could compare the performance of discourse indicators and schemes in English and German.
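Such a scheme-classification layer can be sketched along these lines. This is a toy reconstruction with invented training sentences and labels; it is not Lawrence and Reed's actual keyword features, WordNet expansion, or AIFdb training data.

```python
# Toy sketch of Naive Bayes scheme classification in the spirit of
# Lawrence and Reed [14]: bag-of-words features feeding MultinomialNB.
# Texts and labels are invented; the real model was trained on AIFdb.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_texts = [
    "according to the expert, the reform improves learning outcomes",
    "the professor states that smaller classes help pupils",
    "if we adopt the reform, graduation rates will rise",
    "introducing tablets will lead to better digital skills",
]
train_labels = [
    "expert_opinion", "expert_opinion",
    "positive_consequences", "positive_consequences",
]

clf = make_pipeline(CountVectorizer(), MultinomialNB())
clf.fit(train_texts, train_labels)

# classify a new ADU by its most probable scheme
print(clf.predict(["the minister, an expert on schooling, says this works"]))
```

On our corpus, the label set would cover the full scheme inventory of Table 4 rather than two classes.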

Lenz et al. [15] use the Potsdam Microtexts Corpus [21] to evaluate graph similarity measures, but they suffer from all the disadvantages of that corpus mentioned in Section 1. With the new ReCAP corpus, they would profit from argument graphs of greater complexity which are based on non-artificial texts.

12 https://scikit-learn.org/


Figure 3: Example of a large argumentation graph

[Figure 4: relative starting position of the major claim in characters (y-axis, 0.0-1.0), grouped by text type: motions, commentaries, position papers, press releases.]

Figure 4: Distribution of Relative Major Claim Positions Across Text Types. Total numbers: 17 motions, 9 position papers, 66 press releases, 5 newspaper commentaries, and 1 manifesto (not shown).

ArgumenText is a system presented by Stab et al. [24] "for topic-relevant argument search in heterogeneous texts"13. It follows a similar idea as the ReCAP argumentation machine in that it mines arguments from large amounts of web-scraped texts, though it does not make use of argument schemes. The system outputs ranked arguments for a given topic, i.e. a user-given query. Results for German are poor, with virtually no directly relevant arguments and many completely unrelated items in the output. Performance for English is significantly better, yet the share of unrelated items is still large14. In their own evaluation, the authors report a high recall of 0.89 compared to expert annotators but a low precision of only 0.47, which explains the large share of unrelated items. The results of Lawrence and Reed [14] indicate that precision could be improved with the help of argument schemes15.

13 Available at https://www.argumentsearch.com.
14 These impressions are based on a number of queries entered into ArgumenText.
15 Discourse indicators help as well, but they occur only infrequently and, hence, are not sufficient. Lawrence and Reed [14] report 0.04 recall for discourse indicators alone.

We also provide a Python framework16 to ensure that the graphs can be loaded, used, and processed. It is also available on PyPI17.

7 CONCLUDING REMARKS

We presented a new gold standard corpus for the training and testing of argument mining approaches which is made publicly available. Its feature set is unique: it is built from natural language texts, which are drawn from three similar yet different discourses; it is annotated with a carefully crafted list of argument schemes; and, finally, it is one of the few corpora in the German language. There are already three findings worth reporting here:

First, political science often works with party manifestos, e.g. in spatial analyses of party positions. Party manifestos are aggregated political beliefs of party members weighted by their importance, and hence in general an important object of study. An argument analysis, however, reveals that, though voluminous, manifestos read more like catalogues of demands. While certainly explainable, this goes against the political science intuition of manifestos as high-quality sources18.

Second, annotator training is key for reliable results. Ambiguities can produce great differences in human output, as our annotators can report. This issue is widely neglected in the argument mining literature, but should receive greater attention to achieve comparability between studies and reliability of checks against gold standards.

Third, a new annotator training based on a classification of argumentation schemes, as often discussed [16, 17, 23, 33], could facilitate scheme recognition by annotators and hence allow building a more comprehensive subset of schemes. In particular, it would be useful to include the "Explanation" category from Hansen and Walton [10] in future research, which would both help with the differentiation between arguments and explanations and provide contextual information in the final argument graph.

In future work we will build arbitrary queries to develop and enhance tools for information retrieval and case-based reasoning that use our corpus. In previous work, we investigated textual similarity methods for finding similar claims for query claims [7] as well as similarity methods for graph retrieval [1, 15].

ACKNOWLEDGMENTS

We would like to thank Lasse Cronqvist and Simon Jakobs for their invaluable help in finding a subject for the corpus that is perfectly suited to our task and, amongst other things, allows the synthesis

16 https://github.com/ReCAP-UTR/Argument-Graph
17 https://pypi.org/project/recap-argument-graph/
18 See e.g. the extensive work of the Manifesto Project, https://manifestoproject.wzb.eu. This impression is not only based on the manifesto in the corpus but also on the many more considered for analysis.


of arguments. We would also like to thank them for finding very competent student assistants, who created this corpus.

We would like to thank DeepL for providing their API, which we have used to translate the texts in the nodes into English.

This work has been funded by the Deutsche Forschungsgemeinschaft (DFG) within the project ReCAP, Grant Number 375342983 (2018-2020), as part of the Priority Program "Robust Argumentation Machines (RATIO)" (SPP-1999).

REFERENCES

[1] Ralph Bergmann, Mirko Lenz, Stefan Ollinger, and Maximilian Pfister. 2019. Similarity Measures for Case-Based Retrieval of Natural Language Argument Graphs in Argumentation Machines. In Proceedings of the Thirty-Second International Florida Artificial Intelligence Research Society Conference, Sarasota, Florida, USA, May 19-22, 2019, Roman Barták and Keith W. Brawner (Eds.). AAAI Press, 329–334. https://aaai.org/ocs/index.php/FLAIRS/FLAIRS19/paper/view/18189

[2] Ralph Bergmann, Ralf Schenkel, Lorik Dumani, and Stefan Ollinger. 2018. ReCAP - Information Retrieval and Case-Based Reasoning for Robust Deliberation and Synthesis of Arguments in the Political Discourse. In Proceedings of the Conference "Lernen, Wissen, Daten, Analysen", LWDA 2018, Mannheim, Germany, August 22-24, 2018. 49–60. http://ceur-ws.org/Vol-2191/paper6.pdf

[3] Elena Cabrio and Serena Villata. 2018. Five Years of Argument Mining: a Data-driven Analysis. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI 2018, July 13-19, 2018, Stockholm, Sweden. 5427–5433. https://doi.org/10.24963/ijcai.2018/766

[4] Carlos Iván Chesñevar, Jarred McGinnis, Sanjay Modgil, Iyad Rahwan, Chris Reed, Guillermo Ricardo Simari, Matthew South, Gerard Vreeswijk, and Steven Willmott. 2006. Towards an argument interchange format. Knowledge Eng. Review 21, 4 (2006), 293–316. https://doi.org/10.1017/S0269888906001044

[5] J. Cohen. 1960. A Coefficient of Agreement for Nominal Scales. Educational and Psychological Measurement 20, 1 (1960), 37.

[6] Devin Coldewey and Frederic Lardinois. 2017. DeepL schools other online translators with clever machine learning. https://techcrunch.com/2017/08/29/deepl-schools-other-online-translators-with-clever-machine-learning/. Last access: 12th March, 2020.

[7] Lorik Dumani and Ralf Schenkel. 2019. A Systematic Comparison of Methods for Finding Good Premises for Claims. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2019, Paris, France, July 21-25, 2019, Benjamin Piwowarski, Max Chevalier, Éric Gaussier, Yoelle Maarek, Jian-Yun Nie, and Falk Scholer (Eds.). ACM, 957–960. https://doi.org/10.1145/3331184.3331282

[8] Stephan Finsterbusch. 2017. Besser als Google. https://www.faz.net/aktuell/wirtschaft/unternehmen/start-up-deepl-die-zukunft-der-uebersetzung-von-sprachen-15174164.html. Last access: 12th March, 2020.

[9] Kati Hannken-Illjes. 2018. Argumentation. Narr Francke Attempto.

[10] Hans V. Hansen and Douglas Walton. 2013. Argument kinds and argument roles in the Ontario provincial election, 2011. Journal of Argumentation in Context 2, 2 (2013), 226–258. https://doi.org/10.1075/jaic.2.2.03han

[11] Mathilde Janier, John Lawrence, and Chris Reed. 2014. OVA+: an Argument Analysis Interface. In Computational Models of Argument - Proceedings of COMMA 2014, Atholl Palace Hotel, Scottish Highlands, UK, September 9-12, 2014. 463–464. https://doi.org/10.3233/978-1-61499-436-7-463

[12] J. Richard Landis and Gary G. Koch. 1977. An Application of Hierarchical Kappa-type Statistics in the Assessment of Majority Agreement among Multiple Observers. Biometrics 33, 2 (1977), 363–374. http://www.jstor.org/stable/2529786

[13] John Lawrence. 2017. About AIFdb. http://www.arg-tech.org/~john/johnlawrence.net/project.php?p=aifdb. Last access: November 08, 2019.

[14] John Lawrence and Chris Reed. 2015. Combining Argument Mining Techniques. In Proceedings of the 2nd Workshop on Argumentation Mining, ArgMining@HLT-NAACL 2015, June 4, 2015, Denver, Colorado, USA. 127–136. http://aclweb.org/anthology/W/W15/W15-0516.pdf

[15] Mirko Lenz, Stefan Ollinger, Premtim Sahitaj, and Ralph Bergmann. 2019. Semantic Textual Similarity Measures for Case-Based Retrieval of Argument Graphs. In Case-Based Reasoning Research and Development - 27th International Conference, ICCBR 2019, Otzenhausen, Germany, September 8-12, 2019, Proceedings (Lecture Notes in Computer Science), Kerstin Bach and Cindy Marling (Eds.), Vol. 11680. Springer, 219–234. https://doi.org/10.1007/978-3-030-29249-2_15

[16] Fabrizio Macagno and Douglas Walton. 2015. Classifying the Patterns of Natural Arguments. Philosophy & Rhetoric 48, 1 (2015), 26–53. http://www.jstor.org/stable/10.5325/philrhet.48.1.0026

[17] Fabrizio Macagno, Douglas Walton, and Chris Reed. 2017. Argumentation Schemes. History, Classifications, and Computational Applications. FLAP 4, 8 (2017). http://www.collegepublications.co.uk/downloads/ifcolog00017.pdf

[18] Marie-Francine Moens. 2018. Argumentation mining: How can a machine acquire common sense and world knowledge? Argument & Computation 9, 1 (2018), 1–14. https://doi.org/10.3233/AAC-170025

[19] Thomas Niehr and Karin Böke. 2010. Diskursanalyse unter linguistischer Perspektive. In Forschungspraxis (4th ed.), Reiner Keller, Andreas Hirseland, Werner Schneider, and Willy Viehöver (Eds.). Interdisziplinäre Diskursforschung, Vol. 2. VS Verlag für Sozialwissenschaften, 359–385.

[20] Andreas Peldszus and Manfred Stede. 2013. From Argument Diagrams to Argumentation Mining in Texts: A Survey. IJCINI 7, 1 (2013), 1–31. https://doi.org/10.4018/jcini.2013010101

[21] Andreas Peldszus and Manfred Stede. 2016. Rhetorical structure and argumentation structure in monologue text. In Proceedings of the Third Workshop on Argument Mining, hosted by the 54th Annual Meeting of the Association for Computational Linguistics, ArgMining@ACL 2016, August 12, Berlin, Germany. https://www.aclweb.org/anthology/W16-2812/

[22] Chris Reed and Glenn Rowe. 2004. Araucaria: Software for Argument Analysis, Diagramming and Representation. International Journal on Artificial Intelligence Tools 13, 4 (2004), 983. https://doi.org/10.1142/S0218213004001922

[23] Eddo Rigotti and Sara Greco. 2019. Inference in Argumentation. Number 34 in Argumentation Library. Springer Nature. https://doi.org/10.1007/978-3-030-04568-5

[24] Christian Stab, Johannes Daxenberger, Chris Stahlhut, Tristan Miller, Benjamin Schiller, Christopher Tauchmann, Steffen Eger, and Iryna Gurevych. 2018. ArgumenText: Searching for Arguments in Heterogeneous Sources. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: System Demonstrations. Association for Computational Linguistics, 21–25. https://www.aclweb.org/anthology/N18-5005

[25] Christian Stab and Iryna Gurevych. 2014. Identifying Argumentative Discourse Structures in Persuasive Essays. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014, October 25-29, 2014, Doha, Qatar. 46–56. https://www.aclweb.org/anthology/D14-1006/

[26] Christian Stab and Iryna Gurevych. 2016. Guidelines for Annotating Argumentation Structures in Persuasive Essays. https://www.informatik.tu-darmstadt.de/ukp/research_6/data/argumentation_mining_1/argument_annotated_essays_version_2/index.en.jsp. Last access: November 08, 2019.

[27] Manfred Stede, Stergos D. Afantenos, Andreas Peldszus, Nicholas Asher, and Jérémy Perret. 2016. Parallel Discourse Annotations on a Corpus of Short Texts. In Proceedings of the Tenth International Conference on Language Resources and Evaluation, LREC 2016, Portorož, Slovenia, May 23-28, 2016. http://www.lrec-conf.org/proceedings/lrec2016/summaries/477.html

[28] Stephen Edelston Toulmin. 2003. The Uses of Argument (updated ed.). Cambridge University Press.

[29] Frans H. van Eemeren. 2016. Identifying Argumentative Patterns. Argumentation 30, 1 (2016), 1–23. https://doi.org/10.1007/s10503-015-9377-z

[30] Frans H. van Eemeren, Bart Garssen, Erik C. W. Krabbe, A. Francisca Snoeck Henkemans, Bart Verheij, and Jean H. M. Wagemans (Eds.). 2014. Handbook of Argumentation Theory. Springer. https://doi.org/10.1007/978-90-481-9473-5

[31] Marilyn A. Walker, Jean E. Fox Tree, Pranav Anand, Rob Abbott, and Joseph King. 2012. A Corpus for Research on Deliberation and Debate. In Proceedings of the Eighth International Conference on Language Resources and Evaluation, LREC 2012, Istanbul, Turkey, May 23-25, 2012. 812–817. http://www.lrec-conf.org/proceedings/lrec2012/summaries/1078.html

[32] Douglas Walton. 2006. Fundamentals of Critical Argumentation. Cambridge University Press.

[33] Douglas Walton and Fabrizio Macagno. 2015. A classification system for argumentation schemes. Argument & Computation 6, 3 (2015), 219–245. https://doi.org/10.1080/19462166.2015.1123772

[34] Douglas Walton, Chris Reed, and Fabrizio Macagno. 2008. Argumentation Schemes. Cambridge University Press. http://www.cambridge.org/us/academic/subjects/philosophy/logic/argumentation-schemes