Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics—System Demonstrations, pages 163–168, Berlin, Germany, August 7-12, 2016. ©2016 Association for Computational Linguistics

new/s/leak – Information Extraction and Visualization for Investigative Data Journalists

Seid Muhie Yimam† and Heiner Ulrich‡ and Tatiana von Landesberger⋄ and
Marcel Rosenbach‡ and Michaela Regneri‡ and Alexander Panchenko† and
Franziska Lehmann⋄ and Uli Fahrer† and Chris Biemann† and Kathrin Ballweg⋄

†FG Language Technology
Computer Science Department
Technische Universität Darmstadt

⋄Graphic Interactive Systems Group
Computer Science Department
Technische Universität Darmstadt

‡SPIEGEL-Verlag
Hamburg, Germany

Abstract

We present new/s/leak, a novel tool developed for and with the help of journalists, which enables the automatic analysis and discovery of newsworthy stories from large textual datasets. We rely on different NLP preprocessing steps such as named entity tagging, extraction of time expressions, entity networks, relations and metadata. The system features an intuitive web-based user interface that combines network visualization with data exploration methods and various search and faceting mechanisms. We report the current state of the software and exemplify it with the WikiLeaks PlusD (Cablegate) data.

1 Introduction

This paper presents new/s/leak¹, the network of searchable leaks, a journalistic software for investigating and visualizing large textual datasets (see the live demo here²). Investigating unstructured document collections is a laborious task: the sheer amount of content can be vast; for instance, the WikiLeaks PlusD³ dataset contains around 250 thousand cables. Typically, these collections largely consist of unstructured text with additional metadata such as date, location, or sender and receiver of messages. The largest part of these documents is irrelevant for journalistic investigations, concealing the crucial storylines. For instance, war crime stories in WikiLeaks were hidden and scattered among hundreds of thousands of routine conversations between officials.

¹ http://newsleak.io
² http://bev.lt.informatik.tu-darmstadt.de/newsleak/
³ https://wikileaks.org/plusd/about

Therefore, if journalists do not know in advance what to look for in the document collections, they can only vaguely target all people and organizations (named entities) of public interest.

Currently, the discovery of novel and relevant stories in large data leaks requires many people and a large time budget, both of which are typically not available to journalists: if the documents are confidential, like datasets from an informer, only a few selected journalists have access to the classified data, and those few have to carry the whole workload. If, on the other hand, the documents are publicly available (e.g. a leak posted on the web), the texts have to be analyzed under enormous time pressure, because the journalistic value of each story decreases rapidly if other media publish it first.

There is a plethora of tools (see Section 2) for data journalists that automatically reveal and visualize interesting correlations hidden within large number-centric datasets (Jänicke et al., 2015; Kucher and Kerren, 2014). However, these tools provide very limited automatic support and visual guidance for plain text collections. Some tools include shallow natural language processing, but mostly restricted to English. No existing tool works for multiple languages, handles large text collections, analyzes and visualizes named entities along with the relations between them, and allows editing what is considered an entity. Moreover, available software is usually not open source but rather expensive, and often requires substantial training, which is unsuitable for a journalist under time pressure and without prior experience with such software.


The goal of new/s/leak is to provide journalists with novel visual interactive data analysis support that combines the latest advances in natural language processing and information visualization. It enables journalists to swiftly process large collections of text documents in order to find interesting pieces of information.

This paper presents the core concepts and architecture behind new/s/leak. We also show an in-depth analysis of user requirements and how we implement and address them. Finally, we discuss our prototype, which will be made available as open source⁴ under a lenient license.

2 Related work

Kirkpatrick (2015) states that investigative journalists should look at the facts, identify what is wrong with the situation, uncover the truth, and write a story that places the facts in context. That work describes the components of a traditional story discovery engine, which, on the technical side, consists of a knowledge base, an inference engine and an easy-to-use user interface for visualization.

Discussions with our partner journalists confirm this view: newsworthy stories are not evident from leaked documents, but following up on information from leaks might reveal relevant pointers for journalistic research.

Cohen et al. (2011) outline a vision for a “cloud for the crowd” system to support collaborative investigative journalism, in which computational resources support human expertise for efficient and effective investigative journalism.

DocumentCloud⁵ and Overview⁶ are the most popular tools comparable to new/s/leak; both are designed for journalists dealing with large sets of documents. DocumentCloud is a tool for building a document archive of the material related to an investigation. Similarly to our system, people, places and organizations are recognized in documents. We put additional focus on the UI by adding a graph-based visualization and better support for document browsing. Overview is designed to help journalists find stories in large numbers of documents by topical clustering. This is a complementary approach to ours: rather than keyword-based topics, our tool centers around entities and text-extracted relations. Further, we added advanced editing capabilities that let users add entities and edit the whole network.

⁴ http://github.com/tudarmstadt-lt/newsleak
⁵ https://www.documentcloud.org/home
⁶ https://www.overviewdocs.com

Aleph.grano.cc visualizes connections of people, places and organizations extracted from text documents, but leaves the relationship types underspecified. Detective.io, a platform for collaborative network analysis, offers some of the visualizations and annotations we are working on. However, there is a crucial difference: Detective.io assumes a rigid data structure (e.g. “corporate networks”) and the user has to fill the underlying database entirely manually. Our tool targets unstructured sources and extracts networks from text, enabling navigation to the sources. Jigsaw⁷ extracts named entities from text collections and computes various plots from them, but has received little attention from journalists due to its unintuitive user interface.

With new/s/leak, we want to take existing tools a step further by combining the automation of entity and relationship extraction with an intuitive and appealing visual interface.

New/s/leak is based on two prior systems: Network of the Day⁸ (Benikova et al., 2014) and Networks of Names⁹ (Kochtchi et al., 2014). Both tools automatically analyze named entities and their relationships and present an interactive network visualization that allows retrieving the original sources for the displayed relations. In the current project, we add support for faceted data browsing, a timeline, better access to the source documents and the possibility to tag and edit visualizations.

3 Objectives and User Requirements

The objective of new/s/leak is to support investigative data journalists in finding important facts and stories in large unstructured text collections. The two key elements are an easy-to-use interactive visualization and linguistic preprocessing.

To gain a more precise focus for our development, we conducted structured interviews with potential users. They seek an answer to the question “Who does what to whom?” – possibly amended with “When and where?”. At the same time, they always need access to the source documents to verify the machine-given answers.

⁷ http://www.cc.gatech.edu/gvu/ii/jigsaw
⁸ http://tagesnetzwerk.de
⁹ http://maggie.lt.informatik.tu-darmstadt.de/thesis/master/NetworksOfNames


The outcome of these interviews is summarized in the following requirements:
1) Identify key persons, places and organizations.
2) Browse the collection, identify interesting documents and read documents in detail.
3) Analyze the connections between entities.
4) Assess the temporal evolution of the documents.
5) Explore the geographic distribution of events.
6) Annotate documents with findings and edit the data to enhance data quality.
7) Save and share selected documents.

While some of these requirements (like document browsing) are standard search features, others (like annotation and sharing) are usually not yet integrated in journalism tools. The final version of our system will include all of these features.

4 Implementation details

4.1 System

The new/s/leak tool consists of two major parts, shown in Figure 1: the backend provides various NLP tools for document preprocessing and analysis (Sec. 5); the frontend provides the interactive visualization for investigative journalism (Sec. 6). The backend and frontend components of new/s/leak are integrated in a modular way that allows, for example, adding a different relation extraction mechanism or support for multiple languages.

Figure 1: Schema of new/s/leak for visual support of investigative analysis in text collections.

4.2 Demo datasets

We demonstrate the capabilities of our system on two well-known cases:
1) The well-investigated WikiLeaks PlusD “Cablegate” collection, a collection of diplomatic notes from over 45 years originating from US embassies all over the world.
2) The Enron email dataset (Klimt and Yang, 2004), a publicly available collection of email messages from the Enron Corporation. The dataset comprises over 600,000 messages between 158 employees. This example shows the tool’s general applicability to email leaks.

5 Backend: information extraction

The new/s/leak backend uses different NLP tools for preprocessing and analysis, integrated with a relational database, a NoSQL document store and some specific retrieval methods. First, documents are pre-processed and converted to an intermediate new/s/leak document representation. This format is generic enough to represent collections from any possible source, such as emails, relational databases or XML documents.
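To make this concrete, the following is a minimal sketch of what such a source-agnostic intermediate representation could look like; the class and field names are our own illustration, not the actual new/s/leak schema.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional, Tuple

@dataclass
class Document:
    """Illustrative intermediate document representation (assumed schema).

    The idea is a generic container that emails, database rows or
    XML records can all be mapped onto before NLP processing.
    """
    doc_id: str
    content: str                        # plain text body
    created: Optional[datetime] = None  # document creation date, if known
    # metadata as (name, type, value) triples, cf. Sec. 5.1
    metadata: List[Tuple[str, str, str]] = field(default_factory=list)

def from_email(msg_id: str, subject: str, body: str,
               sender: str, date: datetime) -> Document:
    """Map an email onto the generic representation."""
    doc = Document(doc_id=msg_id, content=body, created=date)
    doc.metadata.append(("Subject", "text", subject))
    doc.metadata.append(("Sender", "text", sender))
    return doc
```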

In a second step, our NLP preprocessing extracts important entities, metadata (like time and location), term co-occurrences, keywords, and relationships among entities. We store relevant documents and extracted entities in a PostgreSQL¹⁰ database and an ElasticSearch¹¹ index. We explain the following steps using the WikiLeaks Cablegate dataset as an example, but the tool is not limited to this single dataset.

5.1 Preprocessing

The first step of the new/s/leak workflow is preprocessing the input documents and representing them in new/s/leak’s intermediate document representation. Preprocessing the Cablegate dataset requires truecasing the original documents, which are mostly in capitals. Lita et al. (2003) indicate that truecasing improves the F-score of named entity recognition by 26%. We use a frequency-based approach for case restoration based on a very large truecased background corpus. Once case is restored, we extract metadata from the document, including the document creation date, subject, message sender, document creator, and other metadata already marked in the dataset. Metadata is stored in the database as triples (n, t, v), where n is the name, t is the data type (e.g. text or date) and v is the value of the metadata. The tool further supports manual identification of metadata during the analysis and production stages.
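A minimal sketch of such frequency-based case restoration, assuming a pre-tokenized, truecased background corpus; the actual implementation may differ, e.g. in how it handles sentence-initial words:

```python
from collections import Counter, defaultdict

def build_case_lexicon(background_corpus_tokens):
    """Count the casing variants of each word in a large truecased corpus."""
    variants = defaultdict(Counter)
    for token in background_corpus_tokens:
        variants[token.lower()][token] += 1
    # keep the most frequent casing per word, e.g. "berlin" -> "Berlin"
    return {w: c.most_common(1)[0][0] for w, c in variants.items()}

def truecase(text, lexicon):
    """Restore the case of an all-caps cable using corpus frequencies."""
    restored = [lexicon.get(tok.lower(), tok.lower()) for tok in text.split()]
    return " ".join(restored)

# usage: lexicon = build_case_lexicon(tokens_of_truecased_corpus)
#        truecase("SECRET MEETING IN BERLIN", lexicon)
#        -> "secret meeting in Berlin" (sentence-initial capitalization
#           would be a separate step, omitted here)
```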

¹⁰ http://www.postgresql.org
¹¹ https://www.elastic.co


Dataset      #Documents   #Entities    #Relations
WikiLeaks       251,287   1,363,500   163,138,000
Enron           255,636     613,225    81,309,499

Table 1: Statistics on the WikiLeaks PlusD and Enron datasets.

5.2 Entity, relation, co-occurrence, keyword, and event time extraction

Recognition of named entities and related terms is a key step in the investigative data-driven journalistic process. For this purpose, we first automatically identify four classes of entities, namely person (PER), organization (ORG), location (LOC) and miscellaneous (MISC), using the named entity recognition tool from Epic¹². We assume a relationship between two entities whenever they co-occur in a document. To extract relevant keywords regarding two entities, we follow the approach by Biemann et al. (2007). Furthermore, we derive entity relation labels by computing document keywords using JTopia¹³, which extracts relevant key terms based on part-of-speech information. To label the relationship between two entities, the most frequent keywords from the documents in which the two entities appear together are used. To extract temporal expressions, we use HeidelTime (Strötgen, 2015), which disambiguates temporal expressions based on document creation times. For the WikiLeaks dataset, more than 3.9 million temporal expressions were extracted and disambiguated.
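The following sketch illustrates this relation-labeling scheme: relations come from document-level co-occurrence, and each relation is labeled with the most frequent keywords of the documents the two entities share. Function and variable names are our own illustration, not the actual new/s/leak code.

```python
from collections import Counter
from itertools import combinations

def label_relations(docs_entities, docs_keywords, top_k=3):
    """Derive labeled relations from entity co-occurrence.

    docs_entities: one set of entity names per document
    docs_keywords: one list of document keywords per document
                   (e.g. key terms from a part-of-speech-based
                   extractor such as JTopia)
    Returns {(e1, e2): (co-occurrence count, label keywords)}.
    """
    counts = Counter()
    keywords = {}
    for ents, kws in zip(docs_entities, docs_keywords):
        for pair in combinations(sorted(ents), 2):
            counts[pair] += 1
            keywords.setdefault(pair, Counter()).update(kws)
    return {
        pair: (counts[pair], [kw for kw, _ in kw_counts.most_common(top_k)])
        for pair, kw_counts in keywords.items()
    }
```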

All extraction and processing workflows of the new/s/leak components are implemented using the Apache Spark cluster computing framework for parallel computation. Table 1 shows statistics for the WikiLeaks PlusD and Enron datasets.
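As an illustration, a Spark job for this kind of per-document extraction could be structured as follows. This is a PySpark sketch with hypothetical input and output paths; the actual pipeline may use a different Spark API.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("newsleak-extraction").getOrCreate()

def extract(doc):
    # Placeholder for the per-document pipeline described above
    # (truecasing, NER, temporal tagging, keyword extraction).
    return {"doc_id": doc.get("doc_id"), "entities": [], "times": []}

# Hypothetical input location; one JSON document per line.
docs = spark.read.json("hdfs:///newsleak/documents.jsonl")

# Run the extraction in parallel over the cluster and persist results.
extracted = docs.rdd.map(lambda row: extract(row.asDict()))
extracted.saveAsTextFile("hdfs:///newsleak/extracted")
```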

6 Frontend: interactive visualization

Journalists can browse through the document collection using an interactive visual interface (see Figure 2). It enables faceted document exploration within several views:
1) Graph view: shows named entities and their relations.
2) Map view: shows the document distribution in geographic space.
3) Document timeline view: shows document frequency over time.
4) Document view: composed of a) a document list and b) the document text for reading.

¹² https://github.com/dlwh/epic
¹³ http://github.com/srijiths/jtopia

The views are interactive so that the user can browse and explore the document collection on demand. The user starts by exploring entities and their connections in the graph view or by searching for entities and keywords. All interactions in the views define a filter that constrains the current document set, which in turn changes the information displayed in the views. The user can assess document frequency on the map or in the timeline, drill down and select documents from the result list, and read them closely. User-selected entities are highlighted in the documents.

Graph view: entities and their co-occurrences. The graph view shows a set of entities as nodes and their connections as links. Node size denotes the frequency of an entity in the document collection; node color denotes the entity type. The co-occurrence of entities within the documents is shown by edge thickness and edge label. The user can explore the entities and their connections by expanding the graph along the neighbors of a selected entity (plus button). The expanded entities are slowly faded in, so that the user can easily spot which entities appeared. Moreover, the user can drill down into the data by displaying the so-called ego network of a selected entity. Clicking on nodes and edges retrieves the respective documents.
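For illustration, the ego-network drill-down corresponds to the following operation on a co-occurrence graph. This is a sketch using networkx with invented example entities; the actual frontend computes this interactively in the visualization.

```python
import networkx as nx

# Illustrative co-occurrence graph; edge weights are co-occurrence
# counts (rendered as edge thickness in the graph view).
G = nx.Graph()
G.add_edge("Angela Merkel", "Barack Obama", weight=42)
G.add_edge("Barack Obama", "Hillary Clinton", weight=17)
G.add_edge("Hillary Clinton", "Enron", weight=3)

# Drill-down: the ego network contains the selected entity and
# everything within `radius` hops of it.
ego = nx.ego_graph(G, "Barack Obama", radius=1)
print(sorted(ego.nodes()))
# ['Angela Merkel', 'Barack Obama', 'Hillary Clinton']
```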

Map view. The map view shows the document frequency distribution over geographic space. Users can hover over a country to see the number of documents mentioning this geographic area, effectively using the map for faceted search.

Document timeline. The document timeline shows the number of documents over time. We use a bar chart with a logarithmic scale, as it better fits the exponential distribution of document counts over time. Users can drill down in time to see the document distribution over years, months or days. It is also possible to select a time interval for which the corresponding documents are shown in the document view (see below).

Document view. The document view shows a list of documents with their title or subject, as selected by the currently active filters. For large document collections, the documents are loaded on demand.


Figure 2: Interactive visualization interface composed of the graph view on entities, the document distribution over time, the list of selected documents and a map of the document geo-distribution.

The user can browse the list and identify documents for close reading (bold, open-folder icon). The document text view shows the full text of the document, in which the entities displayed in the graph are underlined. The underline color corresponds to the entity type. If the user has selected entities in the graph, these entities are highlighted with the background color of the entity. This “close reading” mode enables users to verify hypotheses they perceive in the “distant reading” (Moretti, 2007) visualizations.

6.1 Faceted search

The different levels of visualization enable users to explore the dataset via navigation through the document collection (Miller and Remington, 2004). Another highly complementary paradigm for information retrieval used in our system, and also one highly expected by users, is searching. During navigation, users explore the document collection interactively, browse with the help of the visualization interfaces, zoom in and out, and gradually discover topics, entities and facts mentioned in the corpus.

In contrast, searching in our system assumes that a user issues a conventional faceted search query (Tunkelang, 2009), a free combination of full-text queries with filters on document metadata. The result of such a faceted search is the set of documents that satisfy the specified facets, i.e. search conditions formulated at query time and acting as a filter. Facets can be any metadata initially associated with the document (e.g. date, sender or destination) or extracted automatically (e.g. named entities or topics). Search results are presented to the user as a list of documents or in the form of a graph that corresponds to the document view of our system (see Figure 2).

Search and navigation complement each other during a journalistic investigation. For instance, a user can get an idea of the information available in the collection via the interactive visualization interface (see Section 6). This navigation session may trigger a concrete journalistic hypothesis. In that case, in the next step the user may issue a specific search query to find documents that support or falsify the hypothesis. To implement this search functionality, we rely on ElasticSearch.
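As an illustration, such a faceted query could be expressed against ElasticSearch as follows. This is a sketch using the elasticsearch Python client (8.x-style API); the index and field names are assumptions, not the actual new/s/leak schema.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local instance

# Full-text query combined with metadata filters (facets):
# documents matching "arms deal", sent from a given embassy in 2010.
query = {
    "bool": {
        "must": {"match": {"content": "arms deal"}},
        "filter": [
            {"term": {"sender": "Embassy Berlin"}},
            {"range": {"created": {"gte": "2010-01-01",
                                   "lte": "2010-12-31"}}},
        ],
    }
}
response = es.search(index="newsleak", query=query, size=20)
for hit in response["hits"]["hits"]:
    print(hit["_source"].get("subject"))
```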

7 User Study

We conducted a user study to analyze the usability of the functions provided in the interactive visual interface. We asked 10 volunteer students of various major subjects to perform investigative tasks using our tool. The tasks covered all views; they included finding documents of a particular date, opening a document, selecting countries on the map, showing and expanding the entity network, assessing entity types, etc.


We assessed subjective user experience.

The results showed that the implemented visualizations were intuitive and the interactive functions were easy to use for the requested tasks. The volunteers especially appreciated the drill-down in the timeline and the fading in of newly appearing nodes and edges in the graph. The legend and tooltips were frequently used as guidance in the interface. Based on user feedback, we extended the zooming and panning functions in the graph and included highlighting of open documents in the document list. We also improved the graph layout and appearance to reduce edge and node overplotting.

8 Conclusion and future work

In this paper, we presented new/s/leak, an investigative tool to support data journalists in extracting important storylines from large text collections. The backend of new/s/leak comprises different NLP tools for preprocessing, analyzing and extracting objects such as named entities, relationships, term co-occurrences, keywords and event time expressions. In the frontend, new/s/leak provides a network visualization with different views that support navigating, annotating and editing the extracted information. We also developed a demo system presenting the current state of the new/s/leak tool and conducted a user study to evaluate the effectiveness of the system.

We are currently extending the tool to meet the remaining journalists’ requirements. In particular, we are adding features for annotating entities and their relations with explanations, and for saving a particular view to share it with colleagues or for later use. Moreover, we will provide journalists with the possibility to further edit this shareable view. Additional features for manual data curation will enhance data quality for analysis while ensuring the protection of sources and compliance with legal requirements. We will also integrate an adaptive annotation machine learning approach (Yimam et al., 2016) into new/s/leak to automatically identify interesting objects based on the journalists’ interaction and feedback. Further, we will investigate pulling in additional information from linked open data and the web.

Acknowledgements

The authors are grateful to the data journalists at SPIEGEL-Verlag for their helpful insights into journalistic work and for the identification of tool requirements. The authors wish to thank Lukas Raymann, Patrick Mell, Bettina Johanna Ballin, Nils Christopher Boeschen, Patrick Wilhelmi-Dworski and Florian Zouhar for their help with the system implementation and the conduction of the user study. This work is funded by the Volkswagen Foundation under Grant No. 90 847.

References

D. Benikova, U. Fahrer, A. Gabriel, M. Kaufmann, S. M. Yimam, T. von Landesberger, and C. Biemann. 2014. Network of the Day: Aggregating and visualizing entity networks from online sources. In Proc. NLP4CMC Workshop at KONVENS, Hildesheim, Germany.

C. Biemann, G. Heyer, U. Quasthoff, and M. Richter. 2007. The Leipzig Corpora Collection – monolingual corpora of standard size. In Proc. Corpus Linguistics, Birmingham, UK.

S. Cohen, C. Li, J. Yang, and C. Yu. 2011. Computational journalism: A call to arms to database researchers. In Proc. CIDR-11, pages 148–151, Asilomar, CA, USA.

S. Jänicke, G. Franzini, M. F. Cheema, and G. Scheuermann. 2015. On Close and Distant Reading in Digital Humanities: A Survey and Future Challenges. In Proc. EuroVis, Cagliari, Italy.

K. Kirkpatrick. 2015. Putting the data science into journalism. Commun. ACM, 58(5):15–17.

B. Klimt and Y. Yang. 2004. The Enron Corpus: A New Dataset for Email Classification Research. In Proc. ECML 2004, pages 217–226, Pisa, Italy.

A. Kochtchi, T. von Landesberger, and C. Biemann. 2014. Networks of Names: Visual exploration and semi-automatic tagging of social networks from newspaper articles. Computer Graphics Forum, 33(3):211–220.

K. Kucher and A. Kerren. 2014. Text visualization browser: A visual survey of text visualization techniques. Online: textvis.lnu.se.

L. V. Lita, A. Ittycheriah, S. Roukos, and N. Kambhatla. 2003. tRuEcasIng. In Proc. ACL ’03, Sapporo, Japan.

C. S. Miller and R. W. Remington. 2004. Modeling information navigation: Implications for information architecture. Human-Computer Interaction, 19(3):225–271.

F. Moretti. 2007. Graphs, Maps, Trees: Abstract Models for a Literary History. Verso, London, UK.

J. Strötgen. 2015. Domain-sensitive Temporal Tagging for Event-centric Information Retrieval. Ph.D. thesis, University of Heidelberg.

D. Tunkelang. 2009. Faceted Search. Synthesis Lectures on Information Concepts, Retrieval, and Services, 1(1):1–80.

S. Yimam, C. Biemann, L. Majnaric, S. Sabanovic, and A. Holzinger. 2016. An adaptive annotation approach for biomedical entity and relation recognition. Brain Informatics, pages 1–12.
