30
DEREKO (DEutsches REferenzKOrpus) German Reference Corpus Final Report (Part I) Stefanie Dipper, Hannah Kermes, Dr. Esther König-Baumer, Wolfgang Lezius (IMS) Frank H. Müller, Tylman Ule (SfS) Project leaders: Prof. Dr. Erhard Hinrichs (SfS), Prof. Dr. Christian Rohrer (IMS) Duration: May 1, 1999 – January 31, 2002 IMS: Institut für Maschinelle Sprachverarbeitung, Universität Stuttgart SfS: Seminar für Sprachwissenschaft, Universität Tübingen February 25, 2002

(DEutsches REferenzKOrpus) German Reference Corpus Final ... · (DEutsches REferenzKOrpus) German Reference Corpus Final Report (Part I) Stefanie Dipper, Hannah Kermes, Dr. Esther

Embed Size (px)

Citation preview

Page 1: (DEutsches REferenzKOrpus) German Reference Corpus Final ... · (DEutsches REferenzKOrpus) German Reference Corpus Final Report (Part I) Stefanie Dipper, Hannah Kermes, Dr. Esther

DEREKO

(DEutschesREferenzKOrpus)GermanReferenceCorpus

FinalReport(Part I)

StefanieDipper, HannahKermes,Dr. EstherKönig-Baumer, WolfgangLezius(IMS)

FrankH. Müller, TylmanUle (SfS)

Projectleaders:Prof.Dr. ErhardHinrichs(SfS),Prof.Dr. ChristianRohrer(IMS)Duration:May 1, 1999– January31,2002

IMS: Institut für MaschinelleSprachverarbeitung,UniversitätStuttgartSfS:Seminarfür Sprachwissenschaft,UniversitätTübingen

February25,2002

Page 2: (DEutsches REferenzKOrpus) German Reference Corpus Final ... · (DEutsches REferenzKOrpus) German Reference Corpus Final Report (Part I) Stefanie Dipper, Hannah Kermes, Dr. Esther

Contents

1 Intr oduction 2

1.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2

1.2 Cooperations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

2 Corpus Acquisition and DocumentMarkup 5

3 Linguistic Annotation 6

3.1 MorphosyntacticAnnotation . . . . . . . . . . . . . . . . . . . . . . . . . . . 6

3.2 SyntacticAnnotation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

3.2.1 Chunks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

3.2.2 TopologicalFields . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10

3.2.3 Clauses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

3.3 ImplementingRobustLarge-ScaleLinguisticAnnotation . . . . . . . . . . . . 12

4 Corpus Exploitation 17

4.1 A querylanguagefor structurallyannotatedcorpora. . . . . . . . . . . . . . . 17

4.2 ThetreebanksearchengineTIGERSearch. . . . . . . . . . . . . . . . . . . . 18

4.2.1 Searchengine. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

4.2.2 Graphviewer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

4.3 Querycollections . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22

1

Page 3: (DEutsches REferenzKOrpus) German Reference Corpus Final ... · (DEutsches REferenzKOrpus) German Reference Corpus Final Report (Part I) Stefanie Dipper, Hannah Kermes, Dr. Esther

Chapter 1

Intr oduction

1.1 Overview

The DEREKO project(DeutschesReferenzkorpus)hasbeena joint projectof the Institut fürdeutscheSprache(IDS) in Mannheim,the Institut für MaschinelleSprachverarbeitung(IMS)in Stuttgart,andtheSeminarfür Sprachwissenschaft(SfS) in Tübingen.Theprojecthasbeenfundedfrom May 1999throughJanuary2002(IDS: March2002)by theMinistry of Science,ResearchandtheArts of theStateof Baden-Württemberg.

Theprojectwassetup in orderto improve the infrastructurefor text-basedlinguistic researchanddevelopmentby building a huge,automaticallyannotatedGermantext corpusandthecor-respondingtoolsfor corpusannotationandexploitation.This raisedthefollowing issues:

1. corpusacquisition

2. corpuspreparation

3. corpusexploitation

The taskof corpusacquisitionconsistedof marketing activities andcontractnegotiationsinorderto convince publishinghousesandindividualsto grantresearchlicensesfor their texts.(Responsibility:IDS)

Corpuspreparation involvedseveralstepsof ’ text enrichment’.Themeta-information(author,dateof publication,etc.)of a text hasto beencodedin normalisedmarkup.Thetext hasto besegmented(i.e. the surfacestructureof the text hasto be detectedandmarked up, includingparagraphs,sentences,andword forms).Furthermore,in orderto make thetextsmorevaluablefor researchersinterestedin a wide rangeof linguistic phenomena,a partial syntacticanalysishasbeencarriedout, in additionto POStaggingandlemmatisation,andall additionalinfor-mationwasaddedvia a customisedmarkupscheme.(Responsibilities:IDS for themarkupofmeta-informationandsentenceandparagraphsegmentation;SfSfor linguisticannotation,withsomeinput on thelexical level from theIMS)

In orderto make useof the linguistically annotatedtext corpus,powerful specialisedtools for

2

Page 4: (DEutsches REferenzKOrpus) German Reference Corpus Final ... · (DEutsches REferenzKOrpus) German Reference Corpus Final Report (Part I) Stefanie Dipper, Hannah Kermes, Dr. Esther

corpusexploitation areneeded.The basictool developedfor this purposeis a queryengine(’TIGERSearch’),which canaccessstructuraltext annotationin an efficient manner. On thisbasis,querycollectionswerebuilt which helpanswerthequestionswhich lexicographersandlinguistsmay have, for examplein which lexical andsyntacticcontexts the word “streichen”(paint;erase)occursandhow oftenit occursin thesecontexts. (Responsibility:IMS)

Thetoolsandresourcesdevelopedfor theDEREKO projectprovideanimportantrepositoryforthe Kompetenzzentrumfür Text- und Informationstechnologie,wherethey will be developedfurtherandmadeavailableto thelanguagetechnologycommunityandto academicresearchersin thefield of linguisticsandcomputationallinguistics.

Thisdocumentandadditionaldocumentationfor DEREKO canbefoundattheprojectwebsite:

http://www.sfs.uni-tuebingen. de/de reko /

1.2 Cooperations

Thework for theDEREKO projecthasbeencarriedout in closeinteractionwith thefollowingprojects:

� TIGER(fundedby DFG)

Partners:ComputationalLinguistics,UniversitätdesSaarlandes;IMS; Institut für Ger-manistik,UniversitätPotsdam

� Kompetenzzentrumfür Text- und Informationstechnologie(funded by MWK Baden-Württemberg)

Partners:IMS andSfS

� SFB441:LinguistischeDatenstrukturen(fundedby DFG)

Partner:SfS

� TMR – LearningComputationalGrammars(fundedby EC)

Partners:University of Groningen,The Netherlands;University of Antwerp, Belgium;SRI Cambridge,United Kingdom; University College Dublin, Ireland; University ofGeneva,Switzerland;SfS;XRCE Grenoble,France

The TIGER project sharedthe developmentof the specialisedquery engineTIGERSearch.The DEREKO andTIGER corporahave the ’STTS’ tagset[26] in commonfor the part-of-speechannotation.A certainamountof exchangetook placewith respectto issuesof syntacticannotation.

TheKompetenzzentrumtook therole of theprototypicaluser. Tasksfrom thelexicon develop-mentat theIMS itself (’IMSLex’) andfrom joint projectswith dictionarypublishersgave riseto theconstructionof specificquerycollections.

3

Page 5: (DEutsches REferenzKOrpus) German Reference Corpus Final ... · (DEutsches REferenzKOrpus) German Reference Corpus Final Report (Part I) Stefanie Dipper, Hannah Kermes, Dr. Esther

Both SFB 441 and TMR-LCG developedannotationstandardsand tools jointly with theDEREKO projectandsharedtheresultingresources.

4

Page 6: (DEutsches REferenzKOrpus) German Reference Corpus Final ... · (DEutsches REferenzKOrpus) German Reference Corpus Final Report (Part I) Stefanie Dipper, Hannah Kermes, Dr. Esther

Chapter 2

Corpus Acquisition and DocumentMarkup

Due to delayscausedby negotiationswith publishers,the full DEREKO corpuswill beavail-ableonly whentheprojectendsfor theprojectpartnerIDS MannheimonMarch31,2002.Theothertwo projectpartners,IMS StuttgartandSfSTübingen,haveagreedonusingthetaznews-papercorpus(September1986to May 1999),which bothpartnershave licensed,asaninterimsolution.Thetazcorpuscontainsmorethan200million words,whichis approximately20%thesizeof thefinal DEREKO corpus.Thetazcorpuscanthereforebeconsideredto beareasonableapproximationof thefinal corpusfor testingtheannotationandquerytools.

The IDS Mannheimhascurrentlyprepareda samplecorpusof abouttenmillion wordsin thefinal DEREKO markup.Thetazcorpushasalsobeencastinto this format,andall SfSandIMStools wereadaptedaccordingly, so that oncethe DEREKO corpushasbeenpreparedby theIDS, it canbeannotatedandqueriedsubsequentlyby thetoolspreparedby SfSandIMS.

5

Page 7: (DEutsches REferenzKOrpus) German Reference Corpus Final ... · (DEutsches REferenzKOrpus) German Reference Corpus Final Report (Part I) Stefanie Dipper, Hannah Kermes, Dr. Esther

Chapter 3

Linguistic Annotation

The main goal of the DEREKO corpusis to provide a large generalpurposeresourcefor theGermanlanguage.A linguistusingsucharesourcewill expectdetailedyetreliableinformation.The stateof the art in syntacticannotation,however, shows that beyond the syntacticlevelof chunks,automaticsyntacticannotationhasto dealwith rapidly increasingambiguity, andconsequently, thequalityof automaticannotationdeclines.

For DEREKO, a finite-stateapproachto parsingwasadopted,solving both the problemsofspeedand accuracy outlined above. Finite-stategrammarscan be appliedefficiently, so thathugevolumesof text canbeprocessedquickly. Second,thephenomenathatcanbedescribedby finite-stategrammarscoincidewith thosesyntacticphenomenathat are only moderatelyambiguous.Annotationusingfinite-stategrammarsis very useful for linguistic researchandlanguagetechnologyapplications,as the overall syntacticambiguity is reduced,and furtherannotationcantake direct advantageof it, asis exemplifiede.g.by the querycollectionspre-sentedin section4.3.In thefollowing sectionsthelinguisticmarkupis presentedin moredetail(sect.3.1 and3.2), andthe robust andefficient DEREKO linguistic annotationsystemis pre-sented(sect.3.3).

3.1 Mor phosyntacticAnnotation

Partof speech(POS)annotationis a robuststandardtechniqueusedasabasisfor furtheranno-tationin theDEREKO annotationsystem.In orderto provide linguistswith morefine-grainedinformation,baseformsandmorphologicalinformation have also beenaddedfor eachwordform tokenin thecorpus.

Parts of Speech

The STTStag setdevelopedby IMS andSfS [26] definesthe setof POSlabelsusedin theDEREKO linguistic annotation.In orderto train POStaggersappropriately, a resourceof ap-prox.

�������������manuallyannotatedword form tokenshasbeensetup, consistingof texts from

newspapers(approx.� �������� tokens),novels(approx. ����������� tokens)andothertext types(ap-

6

Page 8: (DEutsches REferenzKOrpus) German Reference Corpus Final ... · (DEutsches REferenzKOrpus) German Reference Corpus Final Report (Part I) Stefanie Dipper, Hannah Kermes, Dr. Esther

prox. � �������� tokens).ThePOStaggersaretrainedwith eithernews texts, texts from novels,orall texts, andtheir mostprobableanalysesarerecordedin thelinguistic markup.Currentlythetnt andBrill taggersareused[4, 5]. Theoutputof all taggersis combinedto achieveasingleso-lution by majorityvoting[30], andPOSsarerevisedusinginput from syntacticannotation[22].All informationaddedby thetagging,correction,andvotingstepsis recordedin themarkupsothatresearchersusingDEREKO mayaccessit easilyto inspector improveresults.

Mor phology and Baseform

Reducingfull forms to baseformsprovidesa usefulabstractionfor querypurposes,so that acorpususermaysearchfor, e.g.,all examplesof a verbwithout explicitly listing all theverb’sfull forms.Consequently, baseformsareaddedin theDEREKO linguistic annotationusingtheDMOR (DeutscheMorphologie,[25]) tool from theIMS. Informationfrom DMOR is alsousedto addall possiblemorphologicalanalysesfor each(word form, POS)tuple.A mappingfromtheSTTSPOStagsto theDMOR analyseshasbeendevisedkeepingonly informationwhichis most useful for further syntacticannotation,including case,number, gender, person,andinflectiontype[29].

POSinformationis availablefor eachword form in thecorpus.Morphologicalandbaseforminformation is avaliable for all word forms coveredby DMOR making the morphosyntacticannotationprovidedin DEREKO a reliablestartingpoint for furtherannotation.1

3.2 SyntacticAnnotation

TheDEREKO linguisticannotationfollowsthemodelof topologicalfieldsstandardlyassumedbysyntacticdescriptionsfor German[14,8,12, 10]. In additionto topologicalfields,clausesandchunksareannotatedin orderto prepareunrestrictedlanguagedatafor globalsentenceanalysis.Theannotationtaskhasbeensplit to dealfirst with thosestructuresthatcanbeprocessedwithfinite-statetechniques,and that take advantageof syntacticrestrictionsonly. Verb argumentstructure,e.g.,canbeaddedlaterusingmorepowerful formalisms(see[22]).

3.2.1 Chunks

Chunksaredefinedasnon-recursivecontinuouskernelsof phrases[2]. Thismeansthatchunksmaycontainchunksof othercategoriesbut that they maynot containchunksof thesamecat-egory. A typical exampleto illustrate this is the treatmentof PP-attachmentin chunks.TheDEREKO linguistic annotationdoesnot resolve this issuebecauseof thenon-recursive natureof chunks(seefigure 3.1: a farmer from Bayrischzell). As a consequence,somephrasesmaybesplit up,becausein German,prepositionalphrasesmayalsomodify adjectival phrases.The

1In thetazcorpus,thereare �� ������� ����� wordform typesand ������ ������ ����� wordform tokens,of which �� ������� �����arenot analysedby DMOR, correspondingto ������ ����� word form types.Of these,theeightmostfrequentpunctu-ationformssumup to �� ����� ����� word form tokens,and �����! ����� typesoccuronly once.

7

Page 9: (DEutsches REferenzKOrpus) German Reference Corpus Final ... · (DEutsches REferenzKOrpus) German Reference Corpus Final Report (Part I) Stefanie Dipper, Hannah Kermes, Dr. Esther

prepositionalchunk(PC)durch jahrelangeFehlentscheidungenin theexamplein figure3.22 ispartof a complex adjectivephrasewith theadjectivehochverschuldetenasits head.Figure3.2showstwo separatechunks,however, becauseanadjectivechunkcannotcontainanotheradjec-tive chunk(like the onecontainedin the PC).As a nounchunk(NC) cannotcontainanotherNC for thesamereason,thePCin figure3.2 is not partof thefollowing NC, either. As a con-sequence,thearticleder cannotbein onechunkwith hochverschuldetenBahnasthestructureis discontinuousnow, becausethePCis a constituentseparatingthearticleandtherestof theNC. Thus,constituentslikearticlesor prepositionswhichcannotbeattachedto their respectivechunksareleft ‘stranded’.

Figure3.1:Postmodifyingprepositionalchunk

Figure3.2:Chunkstructureof a complex prepositionalphrase

Therearefivemajortypesof chunks.Verbchunks(VC )3, nounchunks(NC),adjectivechunks(AJ C),adverbchunks(AVC) andprepositionalchunks(PC).Prepositionalchunksalwayscon-tainanounchunk,andnounchunksmaycontainadverbchunksandadjectivechunks.Adjectivechunksmaycontainadverbchunks.Verbchunksarecontainedin nootherchunkandthey alsocannotcontainany otherchunk.

2at the by years-lastingmisjudgementshighly-indebtedrailways; at the railways,which washighly indebteddueto misjudgementslastingyears

3The ‘ ’ standsfor oneletter if it occurswithin a chunknameandfor oneor morelettersif it occursat thebeginningor endof a chunkname.

8

Page 10: (DEutsches REferenzKOrpus) German Reference Corpus Final ... · (DEutsches REferenzKOrpus) German Reference Corpus Final Report (Part I) Stefanie Dipper, Hannah Kermes, Dr. Esther

As the annotationof verb chunksis quite complex, it is outlinedonly in generaltermshere(pleasecf. [21] for adetaileddescription).Verbchunksplayaspecialrolein thechunkstructure,asthey areboth chunksandpartof theclausalframe(and,thus,topologicalfields).This factandthefactthatverbchunks(andtheircombination)alreadycontaina lot of informationaboutthestructureandtypeof thesentencethey occurin hasleadto a very fine-graineddistinctionamongthedifferentverbchunks,asopposedto theotherkindsof chunks.Verbchunksarecate-gorisedonthebasisof theirsyntacticdistributionandtheir innerstructure.As regardssyntacticdistribution, therearethreemaintypesof verbchunks:VCL-, VCR- andVCF-. VC L -chunksarechunksconstitutingtheleft partof theclausalframeandVC R -chunksconstitutetherightpartof theclausalframe.VC F -chunksarechunkswhich in thebasicwordorderwouldbetheright partof theclausalframebut which arefrontedand,thus,topicalised.

Figure3.3:Verbchunkin relativeclause

The following lettersin the nameof verb chunkscorrespondto the secondand third lettersin the verb’s tag, which denoteverb type (V=lexical, A=auxiliary, M=modal) andfiniteness(F=finite,I=infiniti ve,P=perfectparticiple)4. Thesequenceof theverbsin thechunktypenamecorrespondsto the syntacticdependence,which is the oppositeof the sequenceof the occur-renceof the verbsin the sentence.Thus,in the sequenceDialoge, die nicht ohneWeitereszuverstehensind theverbalcomplex zuverstehensind is assignedthechunktypeVC R for rightpart of clausalframe, AF for auxiliary finite, VZ for lexical verb/infinitive with ’zu’ (seefigure3.3: dialogues,that not without further [explanation] to understandare; dialoguesyoucannotunderstandoffhand).

4An exceptionis beingmadein the treatmentof the infinitive with ’zu’, where’I’ is replacedwith ’Z’ in thechunktypename(seefigure3.3),andwith theimperative,wherea ’B’ is assigned.

9

Page 11: (DEutsches REferenzKOrpus) German Reference Corpus Final ... · (DEutsches REferenzKOrpus) German Reference Corpus Final Report (Part I) Stefanie Dipper, Hannah Kermes, Dr. Esther

Table3.1:Overview of theChunkLabels

ChunkLabel Definition

AJAC attributiveadjectivechunkAJACTRUNC AJAC with truncatedadjectiveAJACC at leasttwo coordinatedAJACsAJVC predicativeadjectivechunk/adverbchunkAJVCTRUNC AJVC with truncatedadjective/adverbAJVCC at leasttwo coordinatedAJVCsAVC adverbchunkAVCC coordinatedAVCNC nounchunkNCTRUNC NC with truncatednounNCC two NCscoordinatedby coordinatorNCell elliptical nounchunk(i.e. withoutheadnoun)PC prepositionalchunkVC verbchunkVCTRUNC verbchunkwith truncatedverbVCL verbchunkasleft partof clausalframeVCR verbchunkasright partof clausalframeVCF verbchunkin topicalised(fronted)position

3.2.2 TopologicalFields

Topologicalfields describesectionsin the Germansentencewith respectto the distributionalpropertiesof theverb. In German,theverbcomplex is dividedinto two partsin theaffirmativesentencewith the finite verb comingfirst andthe restof the verb complex following later onto the end of the sentence.This constructionis called the Satzklammer(clausalframe), thefinite verb being the left part of the clausalframe (Linke Klammer, LK) and the non-finiteverbcomplex beingits right part(RechteKlammer, RK). Thus,in theaffirmativesentence,thesectionsprecedingthefinite verbmaybecalledtheVorfeld (front field, VF), thesectionin theclausalframe the Mittelfeld (middle field, MF) and the sectionfollowing the non-finiteverbcomplex theNachfeld(postfield, NF). If clausesarecoordinated,theremayalsobeaKOORD-Feld (coordinatorfield, KOORDF).Sentencesin which the verb is fronted (e.g. imperativesandyes/no-questions)arelike affirmative sentencesexceptthat they lack a VF. Subordinatedsentencesaredifferentin thatthefinite verbgoestogetherwith theverbcomplex (beingthelastelementin it). They alsolackaVF, andtheirLK is occupiedby thecomplementiserfield (CF).

Themodelof topologicalfieldsdescribesthedistribution of constituentsrelative to theclausalframe. It is thereforeprimarily a distributional model, not giving any accountof the verb-argumentstructureandnotrevealingtherelationbetweentheconstituentswithin thetopologicalfields,either. In fact, thevery structureof constituentswithin topologicalfields is left open.Itis, however, importantto point out thata lot of constituentorderphenomenacanbedescribedrelative to topologicalfields.

10

Page 12: (DEutsches REferenzKOrpus) German Reference Corpus Final ... · (DEutsches REferenzKOrpus) German Reference Corpus Final Report (Part I) Stefanie Dipper, Hannah Kermes, Dr. Esther

Figure3.4:Complex Sentence

The advantagesof the annotationof topologicalfields as a basisfor the annotationof verbargumentstructurecanbestbeillustratedbyanexample:Figure3.4showsasentencecontainingfive verbsin threeverb complexes.5 As the argumentsmay appearon both sidesof the verb,it is by no meansclear, wherethe respective argumentsof theverbareto be found.After theannotationof the topologicalfields,however, the scopeis reduced:The argumentsof sagte –aslong asthey aresinglephrases– mayonly occurin theprecedingVF andthefollowing MF.Thesameis trueof thefollowing clause,wherethefield structureshowsthatkönne

� � �beendet

werden is oneverbcomplex andthenit is clearthatphrasalargumentsmayonly occurin theVF andthe MF, becausethe NF containsa clause.In that subclauseit is againclear that theargumentsof theverbmustbein theMF.

Figure3.4 illustrateshow thescopeof potentialargumentsis reducedby annotatingfield struc-ture.It shouldbetakeninto account,however, that theannotationof field andclausestructureis a shallow annotationin thesamesenseastheannotationof chunksis shallow. It is, thus,notalwaysclearto whichfield or clauseasubclauseshouldbeattached.In figure3.4 it is left openwhethertheNF containingtheadverbialclauseis theNF of theprecedingsubclauseor of themainclause(i.e. theNF of thewholesentence).Theannotationalsoleavesit openwhetherthesubclauseder Streit könneauf der Stellebeigelegt werdenis a subclauseat all. Themotivationfor thisapproachis thatwithout takinginto accountvalency information,it is notclearwhethera sequenceof VF, LK andMF is a separatemain clausein an asyndeticconstruction(like inDasParlamentdebattiert,der Präsidenthandelt), or a subclause(like in figure3.4).Still, theannotationof topologicalfields prestructuresa sentenceandservesasa solid basefor furtherannotation.

3.2.3 Clauses

Therearethreedifferentkindsof clausesannotatedby theDEREKO linguistic annotationsys-tem.Generalsubclauses(SUB), relative clauses(REL) andnon-finiteclauses(INF). A clauseis definedashaving oneheadverb(with theexceptionof coordinatedheadverbs).Main clausesarenotannotatedby thesystem,becausethiswouldbebeyondthescopeof shallow annotation.As a consequence,therelationbetweensubclausesis not madeexplicit (theimplicationsbeing

5Thepresidentsaidtheattendingjournalists,thequarrel couldon thespotfinishedbe, if themselvesbothsidesagree. – Thepresidentsaidto theattendingjournalists,that thequarrel couldbefinishedimmediately, if bothsidescometo an agreement.

11

Page 13: (DEutsches REferenzKOrpus) German Reference Corpus Final ... · (DEutsches REferenzKOrpus) German Reference Corpus Final Report (Part I) Stefanie Dipper, Hannah Kermes, Dr. Esther

discussedabove).Generally, theattachmentof subclausesis left openin shallow clauseanno-tation just like the attachmentof a prepositionalchunkis left openin chunkannotation.Thismeansfor examplethatthereferencenounof a relativeclauseis not specified.

The category REL subsumesall relative clausesintroducedby relative pronouns.Clausesin-troducedby adverbial relative pronouns(i.e. PWAV) areannotatedasSUB, becausefrom theperspective of shallow annotation,it is not clearin which casesa PWAV is a relative pronounat all. Thecategory INF subsumesall non-finiteclausescontaininganinfinitive(not thosecon-taining a participle),both introducedby a complementiserandnon-introduced.The categorySUB subsumesall otherintroducedsubclauses,i.e. mainly adverbialclauses.Recursivenessishandledby applyingtheannotationcascadefor subclausestwice.As theannotationof clausesis shallow, this doesnot affect sequentialsubclauses(becausethey arenot attachedat all), butonly caseswhereembeddingoccursin the MF. Thus,in the DEREKO linguistic annotation,a subclausenever containsa subclausein its MF which againcontainsanothersubclauseinits MF. Thesecasesarerare,however, becausecentre-embeddingis limited dueto cognitivelimitations.6

3.3 Implementing Robust Lar ge-ScaleLinguistic Annotation

TheDEREKO linguistic annotationsystemhasto annotatehugevolumesof unrestrictedtext,integrating several types of linguistic information. Its implementationthereforefocusesonspeedandrobustness,takingadvantageof existing toolsandmarkupstandards.

Extending the CES for Mor e VerboseLinguistic Annotation

XML is widely usedas an encodingformat and hasproved to be very suitablefor encod-ing linguistic phenomenadueto its inherenttreestructure.TheDEREKO linguistic markupisthereforeencodedin XML on output,andalsofor processingpurposes.

TheDEREKO linguisticannotationschemeextendstheCorpusEncodingStandard(CES)[15]below thesentencelevel to cover, e.g.theannotationof chunksandtopologicalfields,or infor-mationfor POStagmajority voting. It thereforecomplementsthe IDS extensionsapplyingtosentencesandlargerunits.

TheCESis extendedusingin-line linguisticannotation.Theoriginalproposalof usingastand-off annotationschemewasnot followedmainly for thefollowing reasons:

� Theoriginal unannotatedcorpuscanbereproducedfrom theannotatedcorpus,becausee.g.all formattinginformationis preservedin themarkup.

� A separateversionof the unannotatedcorpusis mainly advisablewhenseveral orthog-onal annotationsareaddedlater. The annotationschemeusedin the DEREKO project,

6A counterexampleis a sentencelike: WennDein Freund,der, wenner denMundaufmacht, lügt, kommt,gehich. (Whenyour friend, who,whenever he themouthopens,lies, comes,I leave;I leavewhenyour friend comeswholieswheneverheopenshis mouth.)

12

Page 14: (DEutsches REferenzKOrpus) German Reference Corpus Final ... · (DEutsches REferenzKOrpus) German Reference Corpus Final Report (Part I) Stefanie Dipper, Hannah Kermes, Dr. Esther

however, is consideredto bea startingpoint for furtherannotationitself.

� It is assumedthattheDEREKO linguisticannotationwill beusefulfor mostapplications.Processinglinks from the annotationto the original corpus(i.e. usingstand-off annota-tion) may outweighthe benefitof having a smallersource,when the links have to befollowedfor thevastmajority of applications.

� Many currentXML processingtools supportefficient processingof a streamof XMLdata,andstandardsfor linking XML documentsarenot yetaswidely supported.

TheDEREKO linguistic annotationsystemcoverssentencesandsmallertextualunits:

Tokens are the basic units of all subsequentannotation.They are detectedwith a finite-state grammar recognising also more complex combinationsof tokens appearing,e.g.,in dates,or in numbersfollowed by measurementunits (i.e. a basic namedentity recog-nition is performed). Phenomenarecognised in the tokenisation step include clitics(<t i=” f=’gib’/><t i=’CLITIC’ f="’s"/> <t i=” f=’mir’/> ) andurls (<t i=’URL’ f=’http://www.sfs.uni-tuebingen. de/’ /> ). A con-servative approachwas followed, preferring to group less charactersto form a tokenwhen a decision is uncertain,so that, e.g. “17. Juni ” is recognisedas <t i=’DAY’f=’17.’/> <t i=’MONTH’ f=’Juni’/> , but, e.g. “das 100. Mal ” is tokenisedas <t i=” f=’das’/> <t i=’NUM’ f=’100’/><t i=’PUNCT’ f=’.’/> <ti=” f=’Mal’/> , becausethefull stopmayeitherbelongto theprecedingtokenor not.

Each token is assignedone or more POS tags, and each (token, POS) tuple is assignedoneor more lemmasandmorphologicalambiguityclasses.Figure3.5 shows how the XMLtree structureis usedto encodethe possibly ambiguousanalyses.In figure 3.5, the wordform “der ” has two POS analyses(article or relative pronoun)and the article is the pre-ferredreading(<P t=’ART’ r=’1’ /><P t=’PRELS’ r=’2’/> ). EachPOSin turnmay have more than one baseform,that again is connectedto one or more morphologicalanalyses(“der ” as an article may either be nominative, and the correspondingbaseformis “der ”, or dative or genitive, and the correspondingbaseformis “die ”). All informationmay comefrom morethanonesource,andeachsourcemayassigna certaintyto its analysis(<j n=’TaggerA/News’ c=’0.3’ /> ). Includingverboseinformationfor POStagging,lemmaandmorphology, offersresearcherstheopportunityto checksecond-bestPOSanalyses,or disambiguatingmorphology, which in thepresentsystemis notyet accomplished.

All whitespacein the original documentis recordedbetweenthe token (<t/> ) elements,sothattheannotationis non-destructivewith respectto thesourcetext. Tokenisationandthecor-respondingmarkupis describedin moredetail in [29].

Chunks, Topological Fields and Clauses are all encodedas separateelementswrappingthe structurethey dominate(as <ch/> , <fd/> and <cl/> elements,respectively). Theyall have an attribute (c ) encodingthe specifickind of chunk,field, or clause,as in e.g.<chc=’NC’ ><t f=’Dialoge’/></ch> (omitting sub-token information). The categories

andannotationstrategiesaredescribedin moredetail in [29].

13

Page 15: (DEutsches REferenzKOrpus) German Reference Corpus Final ... · (DEutsches REferenzKOrpus) German Reference Corpus Final Report (Part I) Stefanie Dipper, Hannah Kermes, Dr. Esther

<t f=’der’ i=’’><P t=’PRELS’ r=’2’ c=’0.2’>

<j n=’TaggerA/News’ c=’0.3’/> <j n=’TaggerB/News’ c=’0.1’/><b f=’der’> <j n=’MorphA’ c=’1’/> <m d=’nsm’/> </b><b f=’die’> <j n=’MorphA’ c=’1’/> <m d=’dsf’/> </b>

</P><P t=’ART’ r=’1’ c=’0.8’>

<j n=’TaggerA/News’ c=’0.7’/> <j n=’TaggerB/News’ c=’0.9’/><b f=’der’>

<j n=’MorphA’ c=’1’/> <m d=’nsm’/> </b><b f=’die’>

<j n=’MorphA’ c=’1’/> <m d=’dsf’/><m d=’gp0’/><m d=’gsf’/></b>

</P></t>

Figure3.5:EncodingAlternativePOS,LemmasandMorphologicalAnalyses

Theoutputproducedby theDEREKO linguisticannotationsystemcanbeimportedstraightfor-wardly into thequerytool describedin chapter4.

Ar chitecture of the DEREKO Linguistic Annotation System

Theannotationsystemis implementedasa cascadeof toolsall addingcertaininformationto acommonXML datastructure,which canbeconsideredto be theblackboard of theDEREKOlinguistic annotationsystem.Eachtool relieson informationthat hasbeenaddedin previoussteps,andtheinformationit addsis in turn usedby tools later in thecascade.Thepartsof thecascadearegroupedhierarchically, becausegroupsof tools canbe identifiedsolving a majortasklike segmentingtext, or addingmorphosyntacticandsyntacticinformationto thetext (seefigure3.6, left handside).Eachgroup,again,hassub-groupsthat solve subtasksof themajortask.Themajortaskof morphosyntacticannotation,e.g.,startswith thesubtaskof addingPOSinformation.In the POStaggingsubtask,several taggerstrainedwith differentdataaddtheirinformationoneafter theother. After all taggershave finished,thePOStag is determinedviamajority voting,finishingthePOSsubtask(seefigure3.6,right handside).

Separatinglinguistic annotationinto clear-cut tasks,sub-tasks,and tools, hasseveral advan-tages:

� HypothesisTesting

All relatedtools in a subtaskmay be easily regroupedandexecutedon the sameinputdata.For thenounchunkannotation,e.g.,severalhypothesescouldbequickly evaluated,andthebesttool andsubtasksequencewasfoundrapidly, resultingin atop-downbottom-upapproach.

� Extensibility

New processingstepsmaybeintroducedanywherewithout changingtherestof thesys-tem.Obviousplacesfor new stepsincludethe input end,wherefilters arealreadyavail-

14

Page 16: (DEutsches REferenzKOrpus) German Reference Corpus Final ... · (DEutsches REferenzKOrpus) German Reference Corpus Final Report (Part I) Stefanie Dipper, Hannah Kermes, Dr. Esther

Segmentation

Paragraph

Token

Sentence

Morphosyntax

POS

Lemma

Morphology

Syntax

Topological Fields

Clauses

Chunks

Str

eam

of X

ML

Tex

t

POS

Tagger A / News

Tagger A / Novel

Tagger B / News

Tagger B / Novel

Majority Voting

Major Task: Morphosyntax

POS Subtask

Tool: Tagger

Figure3.6:Architectureof theLinguisticAnnotationSystem

ablethat process,e.g.plain text or HTML in additionto the native DEREKO format7,or theoutputend,wheremorerefinedannotationstepsmayfollow. However, thesystemmayalsobeextendedanywheremid-stream.In orderto improveawell-known POStag-gingproblem,e.g.,a tagcorrectionsystemhasbeenappliedimmediatelyaftertheclausalframeis annotatedwithoutchangingtherestof thecascade[22].

� Robustness

Subsequentannotationstepsmaytakeadvantageof previoussteps,e.g.chunksareusuallyannotatedonly within topologicalfields.However, if any annotationstepsfail, subsequentstepsmaystill annotatetheinput text ignoringthefailedsteps.In case,e.g.,averbframeandthecorrespondingtopologicalfieldscannotbedetectedcorrectlydueto POStaggingerrors,chunkswill still be annotated,using the syntacticallyunannotatedsequenceoftokensin asentence.Theresultingsentencecontainslessdetailedbut still usefulsyntacticannotation.

7Whenprocessingthe native DEREKO format preparedby the IDS, the paragraphandsentencesegmentingsubtasksaredropped.

15

Page 17: (DEutsches REferenzKOrpus) German Reference Corpus Final ... · (DEutsches REferenzKOrpus) German Reference Corpus Final Report (Part I) Stefanie Dipper, Hannah Kermes, Dr. Esther

� Scalability

Eachmajor task,sub-task,or tool maybeexecutedon a differentcomputer, or on a dif-ferentprocessorfor multi processorsystems.Thetasksmaybeexecutedparallely, wherea singledatastreamconnectsall taskson all computers.Computersmayalsobechosenthatexcel in certaintasks(e.g.with eithera fastprocessoror quickdiskaccess).Only theslowesttool on thefastestprocessorlimits overall throughput.

� Debugging

An error may be spottedquickly by executingall tasks,subtasks,or tools individually,only stoppingwhenthe error occurs.All annotationis given in the commoninput andoutputformat,andtoolshavebeendevisedfor browsingit, thatcanbeusedto inspecttheoutputof eachsingletool.

� ReusingExistingToolsandLibraries

UsingXML astheonly connectionbetweenthetoolsletsyouincorporateagrowing num-berof general-purposeXML toolsor toolssupportingrapidprototyping(e.g.xmlperlor fsgmatch [11]). If presenttools lack sufficient speed,a specialisedtool canbe im-plementedquickly usinghigh performanceXML librarieswhich areavailableasopensourcesoftware(e.g.themajority voting tool hasbeenimplementedalongtheselines).

Thetoolsdevelopedfor theannotationof theDEREKO corpushavewideapplicabilitybeyondthetasksfor which they weredeveloped.Becauseof theirmodulardesignandimplementation,modulescanbereusedfor awidevarietyof languagetechnologyapplications,includingaspre-processorsfor deeperlinguistic analysis,termextraction,andotherinformation-retrieval tasks,as well as for the constructionof languagemodelsin speechrecognitionapplications.Theannotatedcorporathemselvescanbe usedas training materialfor statisticalparsers,asgoldstandardsfor theevaluationof NLP systems,andasdatarepositoriesfor linguistic research.

16

Page 18: (DEutsches REferenzKOrpus) German Reference Corpus Final ... · (DEutsches REferenzKOrpus) German Reference Corpus Final Report (Part I) Stefanie Dipper, Hannah Kermes, Dr. Esther

Chapter 4

Corpus Exploitation

In order to make useof a linguistically annotatedtext corpus,tools for corpusexploitationareneeded.We have developeda treebankquery language(Sect.4.1) and implementedthislanguage,i.e. thetreebankqueryengine’TIGERSearch’(Sect.4.2).Furthermore,wehavebuiltquerycollectionswhichaddressspecificlexicographictasks(Sect.4.3).For efficiency reasons,a certainlevel of the queryresultsis alsoannotatedto thecorpusin orderto complementthecorpusannotation,whichhasbeendescribedin theprevioussections.

4.1 A query languagefor structurally annotatedcorpora

In cooperationwith theTIGERproject,aspecializedcorpusquerylanguagehasbeendesigned.This languagehasbeendescribedin full detailin [18] resp.[19]. Subsequently, wewill giveanoverview of themajorfeaturesof thelanguage.

A syntacticderivationconsistsof nodesandrelationsamongnodes.NodescanbequeriedbyBooleanexpressionsover feature-valuepairs.The text corporain DEREKO have beenanno-tatedwith theword andthepos (part-of-speech)featureon the level of words,andwith thecat (syntacticcategory) featureon thelevel of derivednodes.A possiblequerywouldbe:

[word="Abend" & pos="NN"]

Relationsamongnodesareexpressedin termsof thetwo elementaryoperators. (directprece-dence)for thehorizontalorderingand> (directdominance)for theverticalordering.Thereis asetof derivednoderelationslike>* (dominance)and.* (precedence)etc.Samplequery:

[cat="PC"] > [pos="APPRART"]

Relationalexpressionscan be combinedby (restricted)Booleanexpressions(negation ex-cluded).Example:

17

Page 19: (DEutsches REferenzKOrpus) German Reference Corpus Final ... · (DEutsches REferenzKOrpus) German Reference Corpus Final Report (Part I) Stefanie Dipper, Hannah Kermes, Dr. Esther

([cat="NC"] >* [pos=/AD.*/]) & ([cat="NC"] > [pos="NE"])

The example above shows that regular expressionscan be usedto describefeaturevalues(/AD.*/ ).

Thetwo nodedescriptions([cat="NC"] canreferto differentnodes.Theidentityof thetwoNCnodescanbeenforcedby insertinga variable:

(#x1:[cat="NC"] >* [pos=/AD.*/)

& (#x1:[cat="NC"] > [pos="NE"])

In addition, the usercandefinea type hierarchy. Typesare very useful to formulatequeriesin a moreconciseway. For example,onecould definea type nominal which standsfor thepossiblylessintuitiveregularexpression"N.*" :

[pos=nominal] .* [pos="VVFIN"]

For the convenienceof the users,a seriesof predicateshasbeenadded.For example,if morecomplex annotationwereaddedto theDEREKO corpus,onecould imposethata structurebediscontinuouslike in thecaseof anNPwhich is modifiedby anextraposedrelativeclause:

#np:[cat="NP"] & (#np >RC []) & discontinuous(#np)

4.2 The tr eebanksearch engineTIGERSearch

Thequerylanguagehasbeenrealizedby a specialisedsearchengine,TIGERSearch.In orderto createa really usefultool, thefollowing moduleshavebeenadded:

� sophisticatedgraphicaluserinterface(TIGERGraphViewer) for browsing the queryre-sults

� graphicalregistry tool (TIGERRegistry) for easycorpusadministration

� XML-import of corpora

Import filters areavailablefor theDEREKO corpusformatandfor a rangeof otherwell-known corpora(PennTreebank[23], NEGRA [28], TIGER [7], SusanneandChristine[24], Penn-HelsinkiParsedCorpusof Middle English[1], VerbMobil [13])

� XML- andSVG-animation-exportof queryresults

SampleXSLT-stylesheetsfor the creationof someother formats(e.g. plain text) havebeenincluded.

18

Page 20: (DEutsches REferenzKOrpus) German Reference Corpus Final ... · (DEutsches REferenzKOrpus) German Reference Corpus Final Report (Part I) Stefanie Dipper, Hannah Kermes, Dr. Esther

The differentmoduleswork togetherasshown in Fig. 4.1. The TIGERRegistry tool providesimport filters for thetransformationof varioustreebankformatsinto theTIGER-XML format.Onthisbasis,theTIGERSearchindexercanbecalledin orderto createacorpusindex. A queryis processedby parsingit, transformingit into aJavaobject.Thisqueryobjectis thenevaluatedagainstthecorpusindex. Queryresultobjectscanbeeithertransformedinto TIGER-XML forfurtherprocessing,or canbeviewedby theGraphViewer.

NeGra

UPenn

Susanne

TIGER−XMLCorpusindex

Result objects

Queryobject

TIGER−XML

GraphViewer

TIGERSearch

TIGERRegistryInput filter

Input filter

Input filter

indexer

Query processor

Outputfilter

ParserQuery

Figure4.1:Architectureof theTIGERSearchsoftware

TheTIGERSearchsoftwarehasbeenimplementedin Java for severalreasons:

� platform-independence

TIGERSearchis availablefor all majorplatformswhichsupportJava1.3:MicrosoftWin-dows,Solaris,Linux, andMacOSX.

� realizationof softwareengineeringstandards

Java as an object-orientedlanguagemakes it much easierto write conceptuallywell-definedprograms.Variouserror sources,which occurin otherprogramminglanguages,areexcludedby therigid typecheckingby theJavacompiler.

� supportof client-serverapplicationsandmulti-threads.

SinceJava providesbuilt-in functionality for distributedcomputing,the TIGERSearchimplementationcan be easily adaptedto future requirements(web-basedapplications,parallelcomputationon largecorpora).

19

Page 21: (DEutsches REferenzKOrpus) German Reference Corpus Final ... · (DEutsches REferenzKOrpus) German Reference Corpus Final Report (Part I) Stefanie Dipper, Hannah Kermes, Dr. Esther

� Unicodesupport

Due to the Java built-in Unicodesupport,TIGERSearchcan be usedwith corporainnon-Westernscripts.(At thecurrentstage,e.g.a JapanesecorpuscouldbeimportedintoTIGERSearch,but thereis not yet aconvenientmannerto key in theJapanesecharactersfor queryingthecorpus.)

Manualsanddownloadinformationfor TIGERSearchcanbeobtainedfrom theTIGERSearchwebsite:

http://www.ims.uni-stuttgart. de/pr ojek te/TI GER/TIGERSear ch/

Subsequently, wegivea little moredetailwrt. thesearchengineandthegraphviewer.

4.2.1 Search engine

Whenwe startedto think aboutthe searchengine,we hadto decidewhetherto build a new,dedicatedenginefrom scratch,or whetherto usean existing databaseengine(c.f. [16]). Wedecidedfor thefirst alternativeof building anew enginefor thefollowing reasons:

� Genuinesearchmachinesfor XML-based(i.e.hierarchicallystructured)datawerehardlyavailablewhenwestartedtheprojectin 1999.

� Thelinguisticallymotivatedcorpusrepresentationformatwehadchosenincludedone(orseveral)complicationsin additionto thehierarchicalinformation:crossingedgesandso-called’secondary’edges[27]. It wasnot clear, whetherso-calledXML-baseddatabaseswould efficiently supportthe accessto ’near-trees’ which violate the strict embeddingconditionsof propertrees.

� Efficient databaseengines,in the way they are requiredto processlarge corpora,usu-ally arevery expensivecommercialsoftware- which cannotbeaffordedby theintendedTIGERSearchusers(researchersin thehumanities).Thereasonwhycommercialdatabasesoftwareis expensive is not only theamountof work which wentinto efficient searchal-gorithmsbut alsothe costfor moduleslike transactionmanagement,databaserecoverymechanismsetc.However, for corpusapplications,only the (efficient) searchmoduleisneeded.

� In our own implementation,the indexing mechanisms,query optimization,and queryfiltering canbetunedmoreeasilyto thepropertiesof thekind of dataandquerieswhichoccurin linguistic applications.

In order to ensureefficient queryprocessingwe have chosenan index-basedapproach.Datastructuresarebuilt over the text representationof the original corpusandtransformedinto abinarycorpusrepresentation.In addition,many partialsearchesareperformedduringindexingin orderto saveprocessingtimeduringqueryprocessing.

The query evaluationalgorithm is a straight-forward implementationof chronologicalback-tracking.A queryis serializedasa list of treenodesandtheir respective constraints.This list

20

Page 22: (DEutsches REferenzKOrpus) German Reference Corpus Final ... · (DEutsches REferenzKOrpus) German Reference Corpus Final Report (Part I) Stefanie Dipper, Hannah Kermes, Dr. Esther

is traversedfrom left to right. A querynodeis matchedwith a nodefrom thecorpus,andnoderelationsarecheckedagainstthecorpusdefinitionby accessingtheindex. Whenever a conflictoccurs,thealgorithmdiscardsthecurrentchoice,andchoosesanothermatchingnodefrom thecorpus.Oncenomorealternativesareavailable,thechoicefor thepreviousquerynodeis recon-sidered.Insteadof this ’depth-first’search,whichbuilts oneresulttreeat a time,a breadth-firstsearchbasedon ’join’ operationscouldhavebeenadoptedin theway it is usedin conventionaldatabasemanagementsystems.However, consideringthe implementationof a rathergeneralconceptof logical variables,thebacktrackingapproachseemedmorestraight-forward.In addi-tion, thedepth-firstapproachis lessspace-consuming.Thedrawbackis that,in its naiveversion,all nodesof thecorpushave to beinspected,oneaftertheother. For this reason,thebasealgo-rithm hasbeenimprovedin two respects:

� searchspacefilters

If possible,thequeryis logically weakenedto oneof its subformulas,anatomicconstraint(feature-valueequation),whichmatchesonly avery limited numberof corpusnodes,e.g.[word="Haus"] . It is sufficient to evaluatetheremainderof thequeryformulaon thissmallcorpusextract(cf. [3]).

� queryoptimization

Searchspacefilters make only sense,if they canbedeterminedat low costandproducea drasticreductionof thesearchspace.Therefore,additionalefficiency canbegainedbyreorderingthe(atomic)subformulasof aqueryaccordingto theirprospective ’evaluationcost’ in termsof numberof matchingcorpusnodes,cf. [20]. Sincethe ’cost’ can bereadoff from theindex, thesubformulareorderingoperationis fast,andhelpsto cut theevaluationtime evenin thosecases,wherea searchspacefilter cannotreducethesearchspacein a reasonablemanner.

Thereis amplespacefor further work on treebank-specificsearchspacefilters andqueryop-timization.The problemis to find filters andoptimizationstrategieswhich reducethe searcheffort drasticallywhile beingidentifiableat low cost.

4.2.2 Graph viewer

The GraphViewer allows for the graphicaldisplayof the resultsof a TIGERSearchquery. Itsmajorfeaturesare:

� navigationthroughthelist of resultsentences

� navigationthroughmatchingsubgraphs

� focussingonmatch

� implodingandexplodingof subgraphs

� tooltip function(nodefeaturesappearwhenmousepointsto thenode)

21

Page 23: (DEutsches REferenzKOrpus) German Reference Corpus Final ... · (DEutsches REferenzKOrpus) German Reference Corpus Final Report (Part I) Stefanie Dipper, Hannah Kermes, Dr. Esther

� printingandexportof variousgraphicsformats(jpeg, png,tif f, animatedSVG images)

� customisability:userscansetcolorsandotherdisplayparameteraccordingto their ownrequirements

Sincethereis aclear-cut interfacebetweentheTIGERSearchengineandtheGraphViewer, theGraphViewercanbeseenasastand-alonetool to beusedwith otherapplicationsaswell.

4.3 Query collections

Oneapplicationof a linguisticallyannotatedcorpusis to giveevidencefor certainlexical prop-ertiesof words,cf. [9]. Whendecidingaboutthecontentof a lexical entry, onehasto take intoaccountat leastthetwo following criteria:

� ’collocational’ variation

In whichsyntacticandcollocationalcontextsdoesawordoccur?

� frequency considerations

How frequentaretheseoccurrences?Are they worthbeinglisted?

Let’sconsiderthecollocationpattern

(Support)Verb . . .sicher, dass (4.1)

The rathercomprehensive DudenUniversalwörterbuch der DeutschenSprache [17] lists onlyonepossiblesupportverb(sein) for this kind of pattern:

Esist sicher, dasser zustimmt.Ich bin sicher, dasser noch kommt.

(4.2)

Corpusinvestigationshowsthat,of course,seinis themostfrequentverbin co-occurrencewithsicher, dass, but thatthereareothers,which aremoreor lesscommon:

verb freq. example

sein sicher, dass 113gelten sicher, dass 11 Esgilt alssicher, dassGeorgeBushseinVetoeinlegt.stellen sicher, dass 8 Ein Gesetzstellt sicher, dassvondenkassiertenMit-

telnzehnProzentfür Verwaltungsaufwandabgezogenwerdendürfen.

wissen sicher, dass 2 . . .wusste ich sicher, dassich jedenTag wiederkom-menkann.

22

Page 24: (DEutsches REferenzKOrpus) German Reference Corpus Final ... · (DEutsches REferenzKOrpus) German Reference Corpus Final Report (Part I) Stefanie Dipper, Hannah Kermes, Dr. Esther

Thefrequency figures(for dass-clausesin theNachfeld)havebeenextractedfrom anewspapercorpuswith 40million tokens.

In orderto extractsuchcorpusevidencein a reliablemanner, a ratherdetailedsyntacticanaly-sis is required.Thesyntacticanalysismustcover complementclauses(dass-clauses)andwordordervariationof verbsin German.As arguedfor in Sect.3.2,weusefinite-statebasedshallowsyntacticanalysismethods,and, in our case,the finite-statebasedcorpusquery languageofthe IMS CorpusWorkbench(CQP)[6]. Sincethedevelopmentof thesearchenginefor struc-turally annotatedcorpora(TIGERSearch)ran in parallel to the implementationof the querycollections,thelatterhave beenrealizedwith CQP. Thequerycollectionscanbeeasilyportedto TIGERSearch,onceit supportsmechanismsfor queryabbreviations(templates)andcorpustransformation.

Our CQPquerieshavebeengroupedin two layers:

� The first layer of CQP queriesbuilts recursive syntacticstructuresof restricteddepthof embeddingwhich canbe discoveredin a reliablemanner, i.e. which areusuallynotambiguouswith otherpossibleanalyses(asit wouldbecasewith e.g.PP-attachment).

� Thesecondlayerof CQPqueriesextractcorpusevidencefor specificlexicographictasks.

For more efficient CQP-internalprocessing,the query resultsof the first layer are addedtothe corpusannotation,so that the basicsyntacticanalysesof the first layer do not have to bereconstructedfor eachqueryof thesecondlayer.

On top of thechunkannotation,whichhasbeendescribedin Sect.3.2,thefirst layerof queriesidentifiesthefollowing syntacticphrases:

� pre-headrecursive embedding,e.g.,PPsmodifying APs,which areleft strandedby theanalysisin Sect.3.2,areassembledto form complex APs,asthetreediagramin Fig. 4.2exemplifies.This leadsto therecursive embeddingof theNP die Köpfeder Apostel(theheads[of] theapostles)in theNPkleinen,überdieKöpfederApostelgesetztenFlammen(small,above theheadsof theapostlessetflames).

� post-headrecursiveembeddingof

– genitiveNPs.In Fig. 4.2thegenitiveNPderApostel([of] theapostles)is embeddedin theNPdieKöpfeder Apostel(theheads[of] theapostles)asa modifier.

– namedentities.In the exampleeiner Fahrt desShuttle“Endeavour” (a trip of theshuttle“Endeavour”) theNP “Endeavour” is embeddedasa modifier in the largerNP

� PP-attachmentin somespecificcases,e.g.,in NP constructionsincludingparenthesis.IneineWegbeschreibungdurch dieHölle ([a] directionsthrough[the] hell), thePPdurch dieHölle is attachedto the precedingheadnoun,the whole constructionbeingsurroundedby doublequotes.

� chunkswith specificinternalstructures,e.g.,news agency markers(LONDON(adp)) ormeasureadjectivephrases(10 Hektargrößer(10acreslarger)).

23

Page 25: (DEutsches REferenzKOrpus) German Reference Corpus Final ... · (DEutsches REferenzKOrpus) German Reference Corpus Final Report (Part I) Stefanie Dipper, Hannah Kermes, Dr. Esther

Figure4.2:Full hierarchicalstructureof complex AP

Besidestheadditionalsyntacticanalysis,certainlexical propertiesof terminalnodesareadded.They areprojectedfrom theheadto thechunkswherenecessary. Thelexical informationis usedduringtheparsingprocessto specifyrestrictionsfor parsingrules.Therulethatbuilts APs,e.g.,allows only NCs with the featureattribute year to modify the adjective in pre-headposition.TheresultareAPssuchas "$#&% die "('%)"(#+* 1993 , erbaute, Brücke , (the1993built bridge;thebridgebuilt in 1993).

The lexical propertiesof chunksare also usedduring extraction.The query can either lookfor chunkswith specificlexical propertiesor it canexcludechunkswith a specificproperty.Whenlooking for subcategorizationframes,e.g.,onemightwantto excludeNPswith temporalheads,whereas,in thecaseof collocationextraction,namedentitiesarenotof interest.As thesepropertiesareannotatedalongwith thechunks,it is easyto excludethemfrom theextractionresults.

Thechunksprovidedby theannotationdescribedin Sect.3.2areusedaccordingto their cate-gory (e.g.,NC, AC, or PC)andincludedin theCQP-Analysisto build up largerchunkswherenecessaryandpossible,otherwisethey areleft asthey are.

Onceall thequeryresultsof thefirst querylayerhavebeenannotated,thequeriesof thesecondlayer which are to extract the corpusevidencecan make useof the annotatedchunks.As abig part of the analysishasalreadybeenperformedoff-line, the extraction query can leavethis partof theanalysisasideandcanitself bekeptsimpler. Insteadof rebuilding therulesforchunkssuchasNP, AP, PPanew eachtime,extractionqueriescansimply referto theannotatedstructures.

24

Page 26: (DEutsches REferenzKOrpus) German Reference Corpus Final ... · (DEutsches REferenzKOrpus) German Reference Corpus Final Report (Part I) Stefanie Dipper, Hannah Kermes, Dr. Esther

Dueto theincrementallayersof queries,wehaveabettercontrolon theway in whichsyntacticanalysisproceeds.For example,in the off-line structurebuilding process,onecan allow forovergeneralizations(by relieving somesyntacticrestrictions),which canbe correctedon thenext level of analysis.Consider, for example,the rule which builds an AP with an embeddedPP. In an intermediatestepany adjective is allowed to take any numberof precedingPPsascomplement,knowing that quite a bit of noiseis produced.At the endof the off-line work,these’insecure’structuresarecheckedfor correctness.Thestructuresareacceptedif they proveto bein a securecontext, otherwisethey arerejected.In orderto beaccepted,APsembeddingPPshave to be insidean NP with at leastone’secure’elementprecedingit. In (4.3), the APwegen desRegensnasseis rejectedand replacedby the smallerAP nasse, the samefor theinsecureNP, which is replacedby anNP to compriseonly nasseFüßeinsteadof the rejectedinsecureAP and the headnoun. In (4.4), however, the AP wegen desRegensnassenis notrejectedasthearticledieensuresthecorrectnessof theAP.

Lisa hat [NP [AP [PP wegendesRegens]nasse]Füße]bekommen.Lisa hasbecauseof therain wet feetgot.Lisahasgotwet feetbecauseof therain

(4.3)

Lisa hat [NP die [AP [PP wegendesRegens]nassen]Füße]abgetrocknet.Lisa hasthebecauseof therain wet feetdried.

Lisahasdriedthefeetthatwerewet becauseof therain.(4.4)

Note that we have implementeddedicatedgrammaticaldescriptionswhich have certainlimi-tations.For example,the querywhich extractsthe examplesfor the sicher, dasscontexts alsoextractsfalsepositiveslike theverbaufregen in Bei der letztenwar ich kein bißchenaufgeregtund mir sicher, dassder Ball reingeht. The falsepositive is a resultof the fact that the queryis still very generalandnot restrictedenough.It doesnot identify the elliptical constructionof theexample.This problemcould besolvedby restrictingthequeryto a greaterextent, i.e.by specifyingtheelementsprecedingtheadjective in moredetail.In addition,exampleswith adativeobjectwouldhave to befilteredout.

Thecorpusqueryfor thepattern(4.1) hasbeencodedin sucha mannerthat it canbeusedtoextract thesametypeof syntacticpatternfor otheradjectivesaswell. In addition,thequeryisparametrisablewith respectto the positionof the dass-clause.In total, we have implementedparametrisablecorpusqueriesfor thefollowing lexical andsyntacticphenomena:

� adjectivesin controlverbconstructionstakingothersubordinateclauses

� adjectivesin controlverbconstructionstakinginfinitival clauses

� verbnouncollocations

� verbnouncollocationfor aspecificnoun

� verbPPcollocationswith theprepositionan

� adjectivePPpairsextractedfrom complex APs,resultingin:

25

Page 27: (DEutsches REferenzKOrpus) German Reference Corpus Final ... · (DEutsches REferenzKOrpus) German Reference Corpus Final Report (Part I) Stefanie Dipper, Hannah Kermes, Dr. Esther

– adjectivePOpairs

– verbPOpairs(PPoccurringwith deverbaladjective)

In joint projectswith publishinghouses,ourquerycollectionsprovedto beof greathelpfor thelexicographers’work.

26

Page 28: (DEutsches REferenzKOrpus) German Reference Corpus Final ... · (DEutsches REferenzKOrpus) German Reference Corpus Final Report (Part I) Stefanie Dipper, Hannah Kermes, Dr. Esther

Bibliography

[1] Penn-Helsinkiparsedcorpusof Middle English,2000.

[2] StevenAbney. PartialParsingvia Finite-StateCascades.In Proceedingsof theESSLLI-96Workshopon“RobustParsing”, Prague,1996.

[3] HansArgenton.IndexierungundRetrieval vonFeature-BäumenamBeispielder linguis-tischenAnalysevonTextkorpora, volume40. infix Verlag,SanktAugustin,1998.

[4] ThorstenBrants. TnT – a statisticalpart-of-speechtagger. In Proceedingsof the 6thConferenceon AppliedNatural Language Processing(ANLP-2000),April, Seattle,WA,2000.find.

[5] Eric Brill. A simplerule-basedpart-of-speechtagger. In Proceedingsof ANLP-92,3rdConferenceon AppliedNatural LanguageProcessing, pages152–155,Trento,IT, 1992.

[6] Oliver Christ, Bruno M. Schulze,and EstherKönig. CorpusQueryProcessor(CQP).User’s Manual. Institut für Maschinelle Sprachverarbeitung,Universität Stuttgart,Stuttgart,Germany, 1999.

[7] StefanieDipper, ThorstenBrants,WolfgangLezius,OliverPlaehn,andGeorgeSmith.TheTIGER treebank.Presentedat theThird Wokshopon Linguistically InterpretedCorpora(LINC), 2001.

[8] Erich Drach. Grundgedanken der Deutschen Satzlehre. Diesterweg, Frankfurt/Main,1937.

[9] JudithEckle andUlrich Heid. Extractingraw materialfor a Germansubcategorizationlexiconfrom newspapertext. In Proceedingsof the4th InternationalConferenceonCom-putationalLexicography, COMPLEX’96, Budapest,Hungary, 1996.

[10] OskarErdmann.GrundzügederdeutschenSyntaxnach ihrergeschichtlichenEntwicklung,volumeErsteAbteilung. Cotta,Stuttgart,1886.

[11] Claire Grover, Colin Matheson,Andrei Mikheevy, andMarc Moens. Lt ttt - a flexibletokenisationtool. In 2ndInternationalConferenceon LanguageResources& Evaluation(LREC2000),31MAY- 2 JUNE2000, Athens,Greece,2000.

[12] S. H. A. Herling. überdie Topik derdeutschenSprache.In Abhandlungendesfrankfur-terischenGelehrtenvereinsfür deutsche Sprache, volumeDrittes Stück,pages296–362,394.1821.

27

Page 29: (DEutsches REferenzKOrpus) German Reference Corpus Final ... · (DEutsches REferenzKOrpus) German Reference Corpus Final Report (Part I) Stefanie Dipper, Hannah Kermes, Dr. Esther

[13] ErhardW. Hinrichs,Juli Bartels,YasuhiroKawata,Valia Kordoni,andHeike Telljohann.TheVerbmobil treebanks.In Schukat-TalamazziniErnstG. andWernerZühlke,editors,KONVENS-2000Sprachkommunikation, pages107–112.VDE-Verlag,2000.

[14] TilmanHöhle.DerBegriff ‘Mittelfeld’, AnmerkungenüberdieTheoriedertopologischenFelder. In AktendesSiebtenInternationalenGermanistenkongresses, pages329–340,Göttingen,1986.

[15] Nancy Ide, PatriceBonhomme,andLaurentRomary. XCES: An XML-basedencodingstandardfor linguistic corpora.In 2ndInternationalConferenceon Language Resources& Evaluation(LREC2000),31 May–2June2000, Athens,Greece,2000.

[16] LauraKallmeyer. A query tool for syntacticallyannotatedcorpora. In ProceedingsofJoint SIGDAT Conferenceon Empirical Methodsin Natural Language ProcessingandVeryLargeCorpora, HongKong,October2000.

[17] Annette Klosa, Kathrin Kunkel-Razum, Werner Scholze-Stubenrecht,and MatthiasWermke. DUDENDeutschesUniversalwörterbuch. Dudenverlag,Mannheim,2001.

[18] EstherKönig andWolfgangLezius. The TIGER language- a descriptionlanguageforsyntaxgraphs.Part1: User’s guidelines. Technicalreport,IMS, Universityof Stuttgart,2001.

[19] EstherKönig andWolfgangLezius. The TIGER language- a descriptionlanguageforsyntaxgraphs.Part2: Formaldefinition. Technicalreport,IMS, Universityof Stuttgart,2001.

[20] JasonMcHugh. Data Managementand queryprocessingfor semistructureddata. PhDthesis,Departmentof ComputerScience,StanfordUniversity, March2000.

[21] FrankHenrik Müller. Shallow parsingstyle-bookfor german.Technicalreport,Seminarfür Sprachwissenschaft,UniversitätTübingen,2001.

[22] Frank Henrik Müller and Tylman Ule. Satzklammerannotierenund Tagskorrigieren.Ein mehrstufigesTop-Down-Bottom-Up-Systemzur flachen,robustenAnnotierungvonSätzenim Deutschen.In HenningLobin, editor, ProceedingsderGLDV-Frühjahrstagung2001, pages225–234,Gießen,2001.Gesellschaftfür LinguistischeDatenverarbeitung.

[23] PennTreebankProject,Universityof Pennsylvania. BracketingGuidelinesfor TreebankII Style, 1995.

[24] Geoffrey Sampson. English for the Computer:The SUSANNECorpus and AnalyticScheme. ClarendonPress(Oxford UniversityPress),1995.

[25] AnneSchiller. DMOR: Benutzerhandbuch.Technicalreport,IMS, Universityof Stuttgart,1995.

[26] Anne Schiller, SimoneTeufel, Christine Stöckert, and Christine Thielen. VorläufigeGuidelinesfür dasTaggingdeutscherTextcorporamit STTS. Technicalreport,Univer-sität Stuttgart,Institut für maschinelleSprachverarbeitung,andSeminarfür Sprachwis-senschaft,UniversitätTübingen,1995.

28

Page 30: (DEutsches REferenzKOrpus) German Reference Corpus Final ... · (DEutsches REferenzKOrpus) German Reference Corpus Final Report (Part I) Stefanie Dipper, Hannah Kermes, Dr. Esther

[27] W. Skut, Brigitte Krenn, Thorsten Brants, and Hans Uszkoreit. An annotationschemefor free word order languages. In ANLP, 1997, 1997. http://www.coli.uni-sb.de/thorsten/anlp97/.

[28] Wojciech Skut, ThorstenBrants,Brigitte Krenn, and HansUszkoreit. A linguisticallyinterpretedcorpusof Germannewspapertexts. In ESSLLIWorkshopon RecentAdvancesin CorpusAnnotation, Saarbrücken,1998.Universityof Saarbrücken.

[29] Tylman Ule. DEREKO linguistic markup. Technicalreport, Seminarfür Sprachwis-senschaft,UniversitätTübingen,2001.

[30] HansvanHalteren,JakubZavrel,andWalterDaelemans.Improvingdatadrivenwordclasstaggingby systemcombination.In Proceedingsof COLING-ACL ’98, August, Montreal,Canada,1998.Associationfor ComputationalLinguistics.

29