

  • Helping Intelligence Analysts make Connections

    M. Shahriar Hossain, Christopher Andrews, Naren Ramakrishnan, and Chris North
    Department of Computer Science, Virginia Tech, Blacksburg, VA 24061

    Email: {msh, cpa, naren, north}@vt.edu

    Abstract

    Discovering latent connections between seemingly unconnected documents and constructing “stories” from scattered pieces of evidence are staple tasks in intelligence analysis. We have worked with government intelligence analysts to understand the strategies they use to make connections. Beyond techniques like clustering that aim to provide an initial broad summary of large document collections, an important goal of analysts in this domain is to assimilate and synthesize fine-grained information from a smaller set of foraged documents. Further, analysts’ domain expertise is crucial because it provides rich contextual background for making connections, and thus the goal of KDD is to augment human discovery capabilities, not supplant them. We describe a visual analytics system we have built, Analyst’s Workspace (AW), that integrates browsing tools with a storytelling algorithm in a large screen display environment. AW helps analysts systematically construct stories of desired fidelity from document collections and helps marshall evidence as longer stories are constructed.

    Introduction

    What do the April’07 shootings at Virginia Tech, Bernard Madoff’s Ponzi scheme uncovered in Dec’08, and the March’09 recall of Zencore plus have in common? They are all extreme happenings that lead us to ask: ‘Why didn’t somebody connect the dots?’ Our ongoing failures to do so have led to these and many other, arguably avoidable, catastrophes. Yet piecing together a story from seemingly disconnected information remains an elusive skill and an understudied task.

    Storytelling is an accepted metaphor in analytical reasoning and in visual analytics (Thomas and Cook (eds.) 2005). Many software tools exist to support story building activities (Eccles et al. 2008; Hsieh and Shipman 2002; Wright et al. 2006). Analysts are able to lay out evidence according to spatial cues and incrementally build connections between them. Such connections can then be chained together to create stories which either serve as end hypotheses or as templates of reasoning that can then be prototyped.

    Copyright © 2011, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

    However, there are severe limitations to human sensemaking capabilities, even on gigapixel-sized displays, when confronted with massive haystacks of data. Algorithmic support to help sift through the myriad of possibilities is crucial here. At the same time, storytelling is not entirely automatable, since it is an exploratory activity and the analyst brings in valuable intuition and contextual cues to direct the story building process. Hence it is imperative that we view storytelling as a collaborative enterprise between algorithmic and human capabilities.

    The focus of this paper is on exploring document collections, and we present a visual analytics system called Analyst’s Workspace (AW) that aids intelligence analysts in exploring connections and building stories between possibly disparate end points. Our key contributions are:

    1. Design considerations that have emerged from a detailed user study with five analysts working on intelligence analysis tasks.

    2. New algorithms that find stories through document collections and also help marshall evidence to support discovered stories.

    3. Implementation of both interactive visualization and algorithmic storytelling support in AW, and a case study over a public domain dataset.

    How Analysts make Connections

    We recently had the opportunity to interview and perform a study with five intelligence analysts currently employed at a government organization. The detailed results are presented and discussed in (Andrews et al. 2010). We begin by describing qualitative lessons from the interviews, followed by a study of their strategies in solving analysis tasks.

    Interviews with Analysts

    For the purpose of this paper, it suffices to note that the goal of the interviews was to typify how analysts approached the large quantities of data they were required to sift through, and to learn what tools they used and how they used them. The most interesting fact that emerged from these interviews was that the analysts largely used software tools only at the beginning and at the end of their analysis.

    Basic search tools were used to filter down a dataset at the start of their analysis. At the end of the analysis, presentation tools (such as PowerPoint) would be used to create reports. For the middle of the analytic process, where the actual sensemaking occurs, the analysts in our study reported that they tended to print out reports and other source materials. This allowed them to easily read them, annotate them with notes and highlights, sort them into physical folders, stack them in meaningful ways on the desk, and even lay all the documents out on a large table where they could be organized and rapidly skimmed.

    Figure 1: How intelligence analysts make connections (from Pirolli and Card 2005).

    A formal way to characterize the above observations is with reference to the schematic of Pirolli and Card (Pirolli and Card 2005). As Fig. 1 shows, the process by which intelligence analysts make connections is frequently tentative and evolutionary, with structures developing as understanding of the data increases. There are two ‘subloops’ in Fig. 1: information foraging and sensemaking. Most analytic systems, such as IN-SPIRE (PNNL), Jigsaw (HCII), ThemeRiver (Havre et al. 2002), and NetLens (Kang et al. 2007), focus on support for the information foraging loop, leaving the sensemaking to the analyst. Other tools, such as Analyst’s Notebook (i2group), Sentinel Visualizer (FMS, Inc.), Entity Workspace (Bier et al. 2006), and Palantir (Khurana et al. 2009), focus more on the sensemaking loop, and while many of them ostensibly support foraging, the analysts reported using these tools primarily for late stage sensemaking and presentation.

    The key problem with this separation of the two halves of the sensemaking process is that the schematic is not meant to be a state diagram; it is a representation of some of the thought processes and structures that are identifiable during sensemaking and a description of how they relate. There is an overall trend from a collection of raw data to a final report, but in between, the analyst should be ranging widely across the entire process, building up an understanding through progressive foraging and structuring.

    User Study

    The tendency of analysts to resort to non-software methods for information organization suggested to us the potential for exploring the use of large screen displays and how they can be integrated into the sensemaking process. If the sensemaking can be drawn back into the computational realm, it provides the opportunity to better support the analysts.

    Figure 2: A user works with Analyst’s Workspace on a 32 megapixel display.

    We conducted a detailed user study with a large 32 megapixel (10,240×3,200) display, which consists of a 4×2 grid of 30″ LCD panels, each with a maximum resolution of 2560×1600. All of the panels in the display are driven by a single computer, allowing us to run conventional desktop applications on the display without modification. The display is configured for single-user use and is slightly curved around the user, who sits in the center, with the freedom to rotate around to access all parts of the display (Fig. 2).

    For the study, we employed the VAST (Symposium on Visual Analytics Science and Technology) 2006 Challenge dataset. This dataset contains approximately 240 documents, which are primarily synthetic news stories from a fictitious city newspaper. Although this is a relatively small dataset, most of it is actually noise, with only about ten of the documents being relevant to uncovering the plot. Another feature of this dataset is that even if the analyst uncovers all ten documents, some analysis is still required to actually determine the nature of the synthetic threat.

    Five analysts were presented with the dataset as a directory of files, with only the search facilities of Windows XP’s File Explorer, WordPad for reading and annotating documents, and a simple image viewer for the couple of images included in the dataset. We asked them to uncover the buried plot using any approach that they desired, using the space afforded by the display in any way that they found useful.

    A key conclusion from this study was that the large display was treated in a fundamentally different way from conventional displays. Conventional displays typically constrain the user to working with one or two applications or documents at a time. Interaction in this environment is primarily application oriented. The large display, on the other hand, permits the user to work with a large number of applications and documents simultaneously. In our study, we found that this simple change encouraged users to adopt a more document-centric approach, working with the documents in a fashion more akin to the way one would interact with physical pieces of paper laid out on a physical desk. We found that our subjects freely moved documents around the space, creating a form of “semantic layer” over the document collection, in which position on the display helped to convey additional semantics, such as relationships between the documents. Using space to encode extra information about the relationship between objects has a rich history, rooted in human perceptual abilities (Kirsh 1995). A primary advantage of the use of space for this purpose is that it is very flexible and allows the user to express transitory or questionable relationships in a visually salient structure without committing to a strict and potentially confining structure (Shipman and Marshall 1999).

    Figure 3: An active session in Analyst’s Workspace. Full text documents and entities share the space, with a mixture of spatial metaphors, such as clusters, graphs, and timelines all in evidence. The yellow lines are the links of the derived social network.

    For example, most of the analysts used the space to cluster the documents that they found important. The interesting feature of these clusters is that they were frequently vague and grouped documents on an assortment of different levels. For instance, documents in the same workspace could be clustered because they related to a particular person or place, because they had a related theme such as weapons, or even because of how the analyst regarded the documents (e.g., many of the analysts created a pile of documents that they thought were probably junk but seemed related enough that they did not want to close them and lose them). Sometimes, clusters would form without the analyst having any clear thought about why the documents in the collection might fit together.

    While the study demonstrated the appeal of working spatially for sensemaking, it is worth noting that most analysts did not solve the analysis task. At the end of most sessions, the analysts had all identified the major themes and created representative structures, but they did not connect the dots to put the entire story together. Here, we can point to the impoverished foraging support, which could not help them to identify the critical linchpins that would draw the whole story together.

    The above observations motivated us to develop a visual analytics environment, Analyst’s Workspace (AW), and to add algorithmic assistance for foraging connections during exploration within AW. AW i) closely mimics information organization layouts employed by analysts, ii) relates multiple representations to accommodate different strategies of exploration, and iii) provides automated algorithmic assistance for foraging connections and hypothesis generation.

    Analyst’s Workspace

    AW provides the user with a plethora of interaction tools for use with large screen displays (e.g., familiar click-and-drag, selection rectangles, multi-click selections) as well as information organization facilities (e.g., graph layout, temporal ordering). Because these operations are local, affecting only the surrounding area or the currently selected documents, the analyst can freely mix spatial metaphors (see Fig. 3).

    While the primary visual elements in AW are full text documents, we also provide support at the entity level. Documents are marked up based on extracted entities, and the analyst can use context menus to quickly identify new entities and create aliases between entities. Double clicking an entity of interest in a document opens an entity object, which is initially displayed as a list of documents in which that entity appears. Entities can also be collapsed down to a representational icon, and AW automatically draws links between entities when they co-occur in a document. These two features allow the analyst to rapidly construct and explore social networks, which are commonly used tools in intelligence analysis.
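The co-occurrence linking described above can be illustrated with a minimal Python sketch. It is not AW's implementation: the function name `cooccurrence_links`, the input format, and the toy entities are all hypothetical, chosen only to show how links between entities could be derived from per-document entity lists.

```python
from collections import defaultdict
from itertools import combinations


def cooccurrence_links(doc_entities):
    """Map each unordered entity pair to the documents in which both appear.

    doc_entities: dict of document id -> iterable of entity names
    (a hypothetical input format, assumed for illustration).
    """
    links = defaultdict(set)
    for doc_id, entities in doc_entities.items():
        # Sort so that each pair has one canonical (a, b) ordering.
        for a, b in combinations(sorted(set(entities)), 2):
            links[(a, b)].add(doc_id)
    return dict(links)


# Hypothetical toy data, not drawn from any dataset in the paper.
docs = {
    "doc1": ["Alice", "Bob"],
    "doc2": ["Bob", "Carol", "Alice"],
    "doc3": ["Carol"],
}
links = cooccurrence_links(docs)
```

Each key in `links` corresponds to one drawable edge of the derived social network, and the attached document set records the evidence for that edge.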

    AW also provides basic facilities for text-based search. Search results are displayed as lists of matching documents in the space, like the entities. The documents are color coded to tell the analyst the state of a document: open, previously viewed, or never viewed.

    Visual links play a strong role in AW. These allow a number of relationships to be expressed, freeing spatial proximity to be used to express more complex relationships more directly related to making sense of the dataset.

    While Analyst’s Workspace is designed to support a flexible approach to sensemaking, it does encourage a particular analytic approach that we observed being used by the analysts. This is a strategy that Kang et al. (2009) referred to as “Find a Clue, Follow the Trail”. In this strategy, the analyst identifies some starting place and then branches out the investigation from that point, following keywords and entities.

    Figure 4: AW’s entity browser, here showing the people identified in the dataset, sorted by the number of documents in which each appears.

    In AW, a starting point can be provided by the entity browser (Fig. 4), which allows the analyst to order entities by the number of occurrences in the dataset. The analyst opens an entity of interest and gets a list of documents in which this entity appears. The analyst then works through these documents, opening new entities or performing searches as new clues are found. Since all of the search results are independent objects in the space and there is a visual record of which documents have been visited, AW can support both a breadth-first and a depth-first search through the information. As the investigation progresses, the analyst uses the space to arrange the information as it is uncovered, building and rebuilding structures to reflect his or her current understanding of the underlying narrative.

    While this approach has been shown to be fairly effective (Kang et al. 2009), it does not permit greater characterization of the dataset and does not support more complex questions that the analyst might ask. For example, this approach relies entirely on the analyst to pick the right keywords and entities to “chase,” and can miss less direct lines of investigation. It is common for terrorists to use multiple aliases or code words that can easily thwart this approach. However, it is possible that common patterns of behavior or other document similarities might help the analyst to uncover some of these connections.

    The analyst may also need the discovery of paths through the dataset to be more efficient. For example, the analyst may have uncovered that a revolutionary in South America shares the same last name as a farmer in the Pacific Northwest who has been implicated in some nefarious affairs, and wishes to ask if there is any link between them or if their last name is a coincidence. An exhaustive background check of the two men is possible through AW if the dataset is relatively small, but it is an indirect and time consuming process.

    Algorithmic Support for Storytelling

    We attempted to formalize and support the ways by which an analyst conducts unstructured discovery, chases leads, and marshalls evidence to support or refute potentially promising chains. Our story generation framework is exploratory in nature: given starting and ending documents of interest, it explores candidate documents for path following and uses heuristics to admissibly estimate the potential for paths to lead to a desired destination. The generated paths are then presented to the AW analyst, who can choose to revise them or adapt them for his/her purposes.

    A story between documents d1 and dn is a sequence of intermediate documents d2, d3, ..., dn−1 such that every neighboring pair of documents satisfies some user-defined criteria. Given a story connecting a start and an end document (see Fig. 7 (a)), analysts perform one of two tasks: they either aim to strengthen the individual connections, possibly leading to a longer chain (see Fig. 7 (b)), or alternatively they seek to organize evidence around the given connection (see Fig. 7 (c)). We use the notions of distance threshold and clique size to mimic these behaviors. We designed our storytelling algorithm to work with these two criteria, which are under the AW analyst’s control and open to experimentation. (These are not magic parameters whose values have to be tuned but are rather controls that mimic the natural process by which analysts tighten or strengthen their hypotheses.)

    The distance threshold refers to the maximum acceptable distance between two neighboring documents in a story. Lower distance thresholds impose stricter requirements and lead to longer paths. The clique size threshold refers to the minimum size of the clique that every pair of neighboring documents must participate in. Thus, greater clique sizes impose greater neighborhood constraints and lead to longer paths. See Fig. 7 (d) for a new path with both stricter clique size and stricter distance thresholds. These two parameters hence essentially map the story finding problem to one of uncovering clique paths in the underlying induced similarity network between documents.
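To make the two controls concrete, here is a minimal Python sketch that checks a candidate story against a distance threshold and tests whether a neighboring pair participates in a clique of a given size. The function names, the brute-force clique test, and the one-dimensional toy "documents" are assumptions made for illustration; the paper's algorithm uses a concept lattice and the Soergel distance rather than this exhaustive check.

```python
from itertools import combinations


def is_valid_story(story, dist, threshold):
    """True if every neighboring pair in the story is within the threshold."""
    return all(dist(a, b) <= threshold for a, b in zip(story, story[1:]))


def shares_clique(a, b, docs, dist, threshold, k):
    """True if a and b belong to some set of k documents that are
    pairwise within the distance threshold (brute force, for illustration)."""
    if dist(a, b) > threshold:
        return False
    others = [d for d in docs if d not in (a, b)
              and dist(a, d) <= threshold and dist(b, d) <= threshold]
    for extra in combinations(others, k - 2):
        if all(dist(x, y) <= threshold for x, y in combinations(extra, 2)):
            return True
    return False


# Hypothetical toy documents placed on a line; distance is the gap between them.
positions = {"d1": 0.0, "d2": 0.2, "d3": 0.45, "d4": 0.7, "d5": 0.9}
docs = list(positions)


def dist(a, b):
    return abs(positions[a] - positions[b])


short_story = ["d1", "d3", "d5"]
long_story = ["d1", "d2", "d3", "d4", "d5"]
```

With a threshold of 0.5 the three-document story is acceptable, while tightening the threshold to 0.3 rejects it and only the five-document story survives, which mirrors the observation that stricter thresholds lead to longer paths.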

    We use the term “clique chain” to refer to a story along with its surrounding connections of evidence. In contrast, a story only constitutes the junction points between consecutive cliques. Another way to characterize them is that a clique chain constitutes many stories.

    Fig. 5 describes the steps involved in generating stories for interaction by the AW analyst. For document modeling, we use a bag-of-words (vector) representation where the terms are weighted by tf-idf with cosine normalization. Our search framework has three key computational stages:

    1. construction of a concept lattice,

    2. generating promising candidates for path following, and

    3. evaluating candidates for potential to lead to destination.
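Before these stages run, each document is turned into a weighted term vector. The following Python sketch shows one common way to compute tf-idf weights with cosine (L2) normalization. The exact tf and idf variants are assumptions (raw term frequency and log(N/df)), since the text does not specify them, and the tokens are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return cosine-normalized tf-idf vectors as dicts of term -> weight.

    docs: dict of document id -> list of tokens, assumed to be already
    stemmed and stripped of stop words.
    Assumed variant: raw term frequency times log(N / document frequency).
    """
    n = len(docs)
    df = Counter(t for tokens in docs.values() for t in set(tokens))
    vectors = {}
    for doc_id, tokens in docs.items():
        tf = Counter(tokens)
        weights = {t: f * math.log(n / df[t]) for t, f in tf.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        # Cosine normalization: scale each vector to unit Euclidean length.
        vectors[doc_id] = (
            {t: w / norm for t, w in weights.items()} if norm > 0 else weights
        )
    return vectors


# Hypothetical toy corpus.
corpus = {
    "d1": ["moscow", "airline", "moscow"],
    "d2": ["moscow", "diamond"],
    "d3": ["diamond", "panama", "panama"],
}
vectors = tfidf_vectors(corpus)
```

The resulting sparse dictionaries can be passed directly to a distance function over weighted term vectors.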

    Figure 5: Pipeline of the storytelling framework in AW. Labels in the diagram: input documents, stop-word removal and stemming, document modeling, concept lattice generation, analyst’s input, heuristic search.

    Figure 6: A dataset and its concept lattice. The value in a cell (dj, tx) indicates the frequency of term x in doc j.

    Dataset (rows are documents, columns are terms):

          tA  tB  tC  tD  tE  tF  tG
    d1     3   0   0   2   5   0   0
    d2     0   0   0   4   0   1   3
    d3     1   0   0   4   5   1   3
    d4     2   0   5   0   0   0   0
    d5     1   0   0   2   4   5   0
    d6     0   3   0   3   1   0   0
    d7     5   4   0   2   1   2   3
    d8     0   0   0   1   2   0   0

    Concepts in the lattice:

    C1: terms ABDEFG; docs 7
    C2: terms ADEFG; docs 3, 7
    C3: terms DFG; docs 2, 3, 7
    C4: terms DF; docs 2, 5, 3, 7
    C5: terms D; docs 8, 1, 6, 2, 5, 3, 7
    C6: terms DE; docs 8, 1, 6, 5, 3, 7
    C7: terms ADE; docs 1, 5, 3, 7
    C8: terms A; docs 4, 1, 5, 3, 7
    C8: terms AC; docs 4
    C10: terms ADEF; docs 5, 3, 7
    C11: terms BDE; docs 6, 7

    Of these, the first stage can be viewed as a startup cost that can be amortized over multiple path finding tasks. The second and third stages are organized as part of an A* search algorithm that begins with the starting document, uses the concept lattice to identify candidates satisfying the distance and clique size requirements, and evaluates them heuristically for their promise in leading to the end document.

    Concept Lattice Construction

    The concept lattice is a data structure that models conceptual clusters of document and term overlaps and is used here as a quick lookup of potential neighbors that will satisfy the distance threshold and clique constraints. Given a (weighted) term-document matrix, we use the CHARM-L (Zaki and Ramakrishnan 2005) closed set mining algorithm on a boolean version of this matrix to generate a concept lattice. Each concept is a pair (document set, term set), as shown in Fig. 6. Further, we order the document list for each concept by the number of terms. Note that we can find an approximate set of nearest neighbors for a document d from the document list of the concept that contains d and has the longest term set.
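As an illustration of what a concept is, the Python sketch below enumerates the (document set, term set) pairs for the small dataset of Fig. 6 by brute force. This is not CHARM-L, which is far more efficient; the exhaustive enumeration is a stand-in that is only feasible for tiny inputs. The term sets per document are read off the concepts listed in Fig. 6. The function names are hypothetical, and skipping concepts whose document list contains only d in the neighbor lookup is an added assumption.

```python
from itertools import combinations

# Boolean term sets per document, as implied by the concepts in Fig. 6.
DOCS = {
    1: set("ADE"), 2: set("DFG"), 3: set("ADEFG"), 4: set("AC"),
    5: set("ADEF"), 6: set("BDE"), 7: set("ABDEFG"), 8: set("DE"),
}


def concepts(docs):
    """Return a dict mapping each closed term set to its document set.

    Brute force: every closed term set is the intersection of the term
    sets of some non-empty group of documents.
    """
    intents = set()
    ids = list(docs)
    for r in range(1, len(ids) + 1):
        for group in combinations(ids, r):
            common = frozenset.intersection(*(frozenset(docs[d]) for d in group))
            if common:
                intents.add(common)
    return {
        intent: frozenset(d for d, terms in docs.items() if intent <= terms)
        for intent in intents
    }


def approx_neighbors(lattice, d):
    """Documents sharing the longest closed term set with d.

    Assumption: concepts whose document set is only {d} are skipped,
    since they offer no neighbors.
    """
    candidates = [i for i, ext in lattice.items() if d in ext and len(ext) > 1]
    if not candidates:
        return frozenset()
    best = max(candidates, key=len)
    return lattice[best] - {d}


lattice = concepts(DOCS)
```

On this dataset the enumeration yields eleven concepts, and the lookup for document 5 returns documents 3 and 7, the documents that share the term set ADEF with it.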

    Successor Generation

    Successor generation is the task of, given a document, using the distance threshold and clique size requirements to identify a set of possible successors for path following. Note that this does not use the end document in its computation.

    The basic idea of our successor generation approach is, in addition to finding a good set of successor nodes for a given document d, to have a sufficient number of them so that, combinatorially, they contribute a desired number of cliques. With a clique size constraint of k, it is not sufficient to merely pick the top k neighbors of the given document, since the successor generation function expects multiple clique candidates. (Note that, even if we picked the top k neighbors, we would still need to subject them to a check to verify that every pair satisfies the distance threshold.) Given that this function expects b clique candidates (where b is the branching factor), a minimum of m documents must be identified, where m is given by the solution to the inequalities:

    \[
    \binom{m-1}{k} < b \quad \text{and} \quad \binom{m}{k} \geq b
    \]

    For a given document, we pick the top m candidate documents from the concept lattice and form combinations of size k from them to obtain b k-cliques. Since m is calculated using the two inequalities, the total number of such combinations is equal to or slightly greater than b (but never less than b). Each clique is given an average distance score calculated from the distances between the documents of the clique and the current document d. This score is used to return a priority queue of exactly b candidate k-cliques.
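The choice of m and the selection of b candidate cliques can be sketched in Python as follows. The function names and the toy distances are hypothetical, and the sketch omits the pairwise distance-threshold check on clique members mentioned earlier; it only shows the counting argument and the ranking by average distance to the current document.

```python
import heapq
from itertools import combinations
from math import comb


def min_candidates(k, b):
    """Smallest m such that C(m-1, k) < b <= C(m, k)."""
    m = k
    while comb(m, k) < b:
        m += 1
    return m


def candidate_cliques(dist_to_d, k, b):
    """Return the b best size-k combinations of the top m neighbors.

    dist_to_d: dict of neighbor id -> distance from the current document d.
    Each combination is scored by the average distance of its members to d.
    """
    m = min_candidates(k, b)
    top_m = sorted(dist_to_d, key=dist_to_d.get)[:m]

    def score(clique):
        return sum(dist_to_d[x] for x in clique) / len(clique)

    return heapq.nsmallest(b, combinations(top_m, k), key=score)


# Hypothetical distances from the current document d to six neighbors.
neighbors = {"n1": 0.1, "n2": 0.2, "n3": 0.3, "n4": 0.4, "n5": 0.5, "n6": 0.6}
cliques = candidate_cliques(neighbors, k=2, b=4)
```

For k = 2 and b = 4, three neighbors give only three pairs, so four neighbors are needed; they yield six pairs, of which the four with the lowest average distance are kept.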

    We evaluated our successor generation mechanism by comparing it to the brute force nearest neighbor search and the cover tree based (Beygelzimer et al. 2006) nearest neighbor search mechanisms. We found that our concept lattice based successor generation mechanism works faster than these other approaches (not described due to space limitations). Therefore we adopt the concept lattice in our successor generation procedure.

    Evaluating Candidates

    We now have a basket of candidates that are close to the current document, and we must determine which of these has the potential to lead to the destination document. The primary criterion of optimality for the A* search procedure of our framework is the cumulative Soergel distance of the path. The Soergel distance between two documents $d_1$ and $d_2$ is given by:

    \[
    D(d_1, d_2) = \frac{\sum_t \left| w_{t,d_1} - w_{t,d_2} \right|}{\sum_t \max\left( w_{t,d_1}, w_{t,d_2} \right)}
    \]

    where wt,di indicates the weight for term t of document di.We use the straight line Soergel distance for the heuristic

  • CIA_05 FBI_28CIA_17

    Boris Bugarov (a Russian bio-weapon scientist) was hired by Pyotr Safrygin.

    vector, moscow, pyotr, institute, live, safrygin, russia

    Algorithm connects Pieter Dopple with Safrygin, but the link is weak.

    request, live, name, september

    (a) Clique size=2, distance threshold=0.99: Example of a story with weak connections.

    (b) Clique size=2, distance threshold=0.96 : Example of a story with stricter links.

    CIA_05 FBI_28NSA_22NSA_22

    Algorithm connects Boris Bugarov (a Russian bio-weapon scientist) and PyotrSafrygin via an intercepted phone call.

    pyotr, central, airline, moscow, russia

    [Figure 7 panels: story graphs over documents such as CIA_05, FBI_28, CIA_17, CIA_14, CIA_34, NSA_14, NSA_07, NSA_18, NSA_20, and CIA_09, with edges labeled by shared terms. Recoverable panel annotations follow.]

    Preceding panel annotations: Algorithm finds Pieter Dopple involved in the stone business and money laundering. Algorithm connects these two documents based on place names but the connection is vague.

    (c) Clique size=3, distance threshold=0.99: Example of a better story with a small amount of surrounding evidence. Annotations: Boris Bugarov (a Russian bio-weapon scientist) was hired by Pyotr Safrygin. Algorithm links Pieter Dopple with diamond transactions. Pieter Dopple has relationships with militant Islamic groups.

    (d) Clique size=4, distance threshold=0.95: Example of a better story with more surrounding evidence. Annotations: Boris Bugarov (a Russian bio-weapon scientist) was hired by Pyotr Safrygin. This clique explains some intercepted phone calls involving Pyotr Safrygin, the director of security for Central Russia Airlines. The phone conversations were transaction related and mention tanzanite. Algorithm connects Pieter Dopple with diamond transactions using neighboring evidence. Pieter Dopple has relationships with militant Islamic groups.

    Figure 7: A sample story illustrating the impact of change of clique size and distance threshold. The goal is to connect bio-weapon scientist Boris Bugarov with money launderer Pieter Dopple. As the distance and clique size thresholds are experimented with, we observe surrounding evidence connecting Pieter Dopple with militant Islamic groups.

    and, because it obeys the triangle inequality, it can be shown that this will never overestimate the cost of a path from any document d to the goal. Therefore our heuristic is admissible and our A* search will yield the optimal path.

    It is important to note that our algorithm never explicitly computes or materializes the underlying network of similarities at any time. As a result, it is very easy for the AW analyst to vary the clique size and distance thresholds to analyze different stories for the same start and end pairs.
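    The role of an admissible heuristic can be illustrated with a minimal sketch. It is not AW's implementation: it assumes the heuristic is the direct Soergel distance from a document to the goal, it omits the clique constraint and the concept-lattice candidate generation, and the corpus `toy_docs` and the names `soergel` and `a_star_story` are hypothetical. Neighbors are tested on the fly, so no similarity network is stored, and setting `use_heuristic=False` gives the uninformed variant in which the heuristic returns zero for all inputs.

```python
import heapq


def soergel(u, v):
    """Soergel distance between two non-negative term-weight dicts."""
    terms = set(u) | set(v)
    num = sum(abs(u.get(t, 0.0) - v.get(t, 0.0)) for t in terms)
    den = sum(max(u.get(t, 0.0), v.get(t, 0.0)) for t in terms)
    return num / den if den else 0.0


def a_star_story(docs, start, goal, theta, use_heuristic=True):
    """Cheapest chain from start to goal whose consecutive documents are
    within distance theta. Returns (path, cost, expansions); path is None
    when no chain exists."""
    def h(d):
        # ASSUMPTION: heuristic = direct distance to the goal (zero if disabled).
        return soergel(docs[d], docs[goal]) if use_heuristic else 0.0

    frontier = [(h(start), 0.0, start, [start])]
    best_g = {start: 0.0}
    expansions = 0
    while frontier:
        _, g, node, path = heapq.heappop(frontier)
        if node == goal:
            return path, g, expansions
        if g > best_g.get(node, float("inf")):
            continue  # stale queue entry, a cheaper route was already found
        expansions += 1
        for nxt in docs:  # neighbors are computed on demand, never stored
            if nxt == node:
                continue
            step = soergel(docs[node], docs[nxt])
            if step > theta:
                continue  # pair does not meet the distance threshold
            g2 = g + step
            if g2 < best_g.get(nxt, float("inf")):
                best_g[nxt] = g2
                heapq.heappush(frontier, (g2 + h(nxt), g2, nxt, path + [nxt]))
    return None, float("inf"), expansions


# Hypothetical toy corpus: binary term weights, chosen only for illustration.
toy_docs = {
    "d0": {"a": 1, "b": 1, "c": 1},
    "d1": {"b": 1, "c": 1, "d": 1},
    "d2": {"d": 1, "e": 1, "f": 1},
    "d3": {"f": 1, "g": 1, "h": 1},
    "d4": {"x": 1, "y": 1},
}

path, cost, expansions = a_star_story(toy_docs, "d0", "d3", theta=0.85)
print(path, round(cost, 2), expansions)
```

    Because the threshold is only a function argument, rerunning the search with a different `theta` for the same start and end pair requires no precomputation, which mirrors the point made in the paragraph above.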

    Experimental Results

    We conduct both quantitative and qualitative evaluation of AW’s visual and algorithmic support for storytelling. The questions we seek to assess are:

    1. What is the interplay between distance threshold and clique size constraints in story construction? How does our heuristic fare in reference to an uninformed search and as a function of the constraints?

    2. What is the quality of stories discovered by our algorithm?

    3. How do the algorithmically discovered stories compare to those found by analysts?

    4. How can analysts mix-and-match algorithmic capabilities with their intuitive expertise in story construction?

    For our experiments, we used an analysis exercise (Hughes 2005) developed at the Joint Military Intelligence College. The exercise dataset is sometimes referred to as the Atlantic Storm dataset.

    Evaluating Story Construction

    To study the relationship between distance threshold and clique size constraints, we generated thousands of stories with different distance and clique size requirements from the Atlantic Storm dataset, and computed the maximum clique size for which at least one story was found. As expected, we see an anti-monotonic relationship and that it is more difficult to marshall evidence as distance thresholds get stricter (Fig. 8).

    To study the performance of AW’s heuristic over a non-heuristic based search, we picked 1000 random start-end document pairs from our document collection and generated stories with different distance threshold and clique size requirements. The non-heuristic search is simply a breadth-first search version of our A* search framework (in other words, the heuristic returns zero for all inputs). Fig. 9 compares average runtimes of AW’s heuristic based search against the non-heuristic search. From top to bottom, three consecutive plots of Fig. 9 depict the average runtimes respectively as functions of story length, distance threshold, and clique size. Astute readers might expect a monotonic increase of average runtime with longer stories in Fig. 9 (top). Stories tend to

    [Figure 8 plot. x-axis: Distance threshold, θ (0.76 to 0.99; smaller θ implies a stricter distance requirement). y-axis: Maximum clique size for which at least one clique-chain was found (0 to 25; larger clique size implies a stricter clique requirement).]

    Figure 8: Atlantic Storm dataset: interplay between distance threshold and clique size constraints.

    become longer with stringent distance threshold and clique size. Further stringency, however, results in broken stories (the length of the story theoretically becomes infinite). As a result, we found a smaller number of longer stories than

    [Figure 9 plots, three panels, each titled “The impact of the heuristic” and each comparing two curves, “With heuristic” and “Without heuristic”. Top: average time (sec) to discover stories of length l against story length, l (2 to 22). Middle: average time (sec) to discover stories with threshold θ against distance threshold, θ (0.82 to 0.94). Bottom: average time (sec) to discover stories with clique size k against clique size, k (2 to 10).]

    Figure 9: We used 1000 random start-end pairs to compare the performance of AW’s heuristic search against uninformed search.

    [Figure 10 dispersion plots, left and right. Axes: Sequence, i (0 to 8) and Sequence, j (1 to 9). Cell weights per diagonal line: 1/8^2, 1/(8×7), 1/(8×6), 1/(8×5), 1/(8×4), 1/(8×3), 1/(8×2), 1/8. Total weight of a diagonal line: 1/8.]

    Figure 10: (left) A dispersion plot of an ideal story. The dispersion coefficient ϑ = 1.0. (right) A dispersion plot of a non-ideal story of same length. The dispersion coefficient ϑ = 1 − 3/(8×8) − 1/(8×7) = 0.94.

    the shorter ones. In all the plots of Fig. 9, we calculated the average time over only the discovered stories. Since most of the long stories were found quickly by our algorithms, the curves of Fig. 9 (top) increase first and then decrease instead of being monotonically increasing. All the plots of Fig. 9 depict that the heuristic yields significant gains over the uninformed search.

    Evaluating Story Quality

    It is difficult to objectively evaluate the quality of stories. Here, we adopt Swanson’s complementary but disjoint (CBD) hypothesis (Swanson 1991) and assess the pairwise Soergel distance between documents in a story, between consecutive as well as non-consecutive documents. An ideal story is one that meets the Soergel distance threshold θ only between consecutive pairs whereas a non-ideal story “over-satisfies” the distance threshold and meets it even between non-consecutive pairs. As shown in Fig. 10 (left), an ideal story has only diagonal entries in its dispersion plot (contrast with Fig. 10 (right)). If n documents of a story are d_0, d_1, ..., d_{n−1}, then our formula for dispersion coefficient is given by:

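    \[
    \vartheta = 1 - \frac{1}{n-2} \sum_{i=0}^{n-3} \sum_{j=i+2}^{n-1} \mathrm{disp}(d_i, d_j)
    \]

    where

    \[
    \mathrm{disp}(d_i, d_j) =
    \begin{cases}
    \dfrac{1}{n+i-j}, & \text{if } D(d_i, d_j) \le \theta \\
    0, & \text{otherwise}
    \end{cases}
    \]

    Here D(d_i, d_j) is the Soergel distance, so a non-consecutive pair contributes a penalty exactly when it also meets the threshold; an ideal story incurs no penalty and has ϑ = 1.0, as in Fig. 10 (left).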
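    As a worked check of the dispersion coefficient, the following sketch evaluates it on a toy distance matrix. It penalizes a non-consecutive pair when its distance is within θ, which is the reading under which the ideal story of Fig. 10 scores 1.0 and the non-ideal one scores 0.94. The matrices are hypothetical: ten documents are assumed because the 1/8 factors in the Fig. 10 caption correspond to n − 2 = 8, and the positions of the over-satisfying pairs are made up for illustration (three pairs two steps apart, one pair three steps apart).

```python
def dispersion_coefficient(dist, theta):
    """Dispersion coefficient of a story given its pairwise distance matrix.

    dist[i][j] is the distance between story documents d_i and d_j.
    """
    n = len(dist)
    if n < 3:
        return 1.0  # no non-consecutive pairs exist, so nothing to penalize
    penalty = 0.0
    for i in range(n - 2):            # i = 0 .. n-3
        for j in range(i + 2, n):     # j = i+2 .. n-1 (non-consecutive pairs)
            # ASSUMPTION: a pair is penalized when it meets the threshold.
            if dist[i][j] <= theta:
                penalty += 1.0 / (n + i - j)
    return 1.0 - penalty / (n - 2)


def toy_distances(n, close_pairs, near=0.5, far=1.0):
    """Hypothetical matrix: consecutive pairs and close_pairs are 'near'."""
    dist = [[far] * n for _ in range(n)]
    for i in range(n):
        dist[i][i] = 0.0
    for i in range(n - 1):
        dist[i][i + 1] = dist[i + 1][i] = near
    for i, j in close_pairs:
        dist[i][j] = dist[j][i] = near
    return dist


ideal = toy_distances(10, [])
non_ideal = toy_distances(10, [(0, 2), (3, 5), (6, 8), (1, 4)])

print(dispersion_coefficient(ideal, theta=0.9))
print(round(dispersion_coefficient(non_ideal, theta=0.9), 2))
```

    With these inputs the penalty terms are 3/(8×8) for the three pairs two steps apart and 1/(8×7) for the pair three steps apart, matching the arithmetic in the Fig. 10 caption.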

    We also compute p-values for each generated story. Recall that at each step of the A* search we build a queue of candidate documents by investigating the corresponding

    Table 1: Sample story fragments from an analyst. How did our algorithm fare in discovering them?

    Columns: Story | Found by algorithm | Found in the clique path | Found by merging stories

    [Entries of the Story column, as document sequences; the marks in the other three columns are not recoverable.]

    FBI_30, FBI_35, FBI_41, CIA_43
    CIA_41, CIA_34, CIA_39, NSA_09
    NSA_16
    CIA_01, CIA_05, CIA_34, CIA_41, CIA_17
    CIA_39, NSA_22
    NSA_11, NSA_18, NSA_16
    CIA_06, CIA_22, CIA_21
    CIA_24, FBI_24
    NSA_06, CIA_32, CIA_42
    NSA_16, CIA_38, CIA_42

    concepts of the concept lattice. To calculate the p-value of a clique of size k, we randomly select k − 1 documents from the entire candidate pool and check if all the edges of the formed k-clique satisfy the distance threshold θ, iterating the test 50,000 times. This allows us to find p-values down to 2×10^-5. We repeat this process for every junction document of a discovered clique chain. The overall p-value of a clique chain is calculated by multiplying all the p-values of every clique of the chain.
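    The sampling procedure can be sketched as below. This is an illustration under stated assumptions rather than AW's code: the function names, the `distance` callable, and the toy pool of integers are hypothetical, and the sketch simply reports the fraction of random trials in which the junction document plus k − 1 sampled documents form a clique within θ. With 50,000 trials the smallest nonzero value it can report is 1/50,000 = 2×10^-5.

```python
import itertools
import math
import random


def clique_p_value(junction, pool, distance, k, theta, trials=50_000, rng=None):
    """Monte Carlo estimate of how often a random k-clique around `junction`
    satisfies the distance threshold on all of its edges."""
    rng = rng or random.Random(0)  # fixed seed only to make the sketch repeatable
    candidates = [d for d in pool if d != junction]
    hits = 0
    for _ in range(trials):
        group = [junction] + rng.sample(candidates, k - 1)
        if all(distance(a, b) <= theta
               for a, b in itertools.combinations(group, 2)):
            hits += 1
    return hits / trials


def chain_p_value(clique_p_values):
    """Overall p-value of a clique chain: product of the per-clique p-values."""
    return math.prod(clique_p_values)


# Hypothetical toy pool: documents are integers, distance is a scaled gap.
toy_pool = list(range(20))


def toy_distance(a, b):
    return abs(a - b) / 19.0


p1 = clique_p_value(3, toy_pool, toy_distance, k=3, theta=0.2, trials=5_000)
p2 = clique_p_value(10, toy_pool, toy_distance, k=3, theta=0.2, trials=5_000)
print(p1, p2, chain_p_value([p1, p2]))
```

    A trial count of zero hits yields 0.0 here, which should be read as "below the resolution of the test" rather than as an exact probability.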

    Story Validation

    We have depicted stories with different distance and clique size requirements in Fig. 7. The story connects a Russian bio-weapon scientist (Boris Bugarov) with a money launderer (Pieter Dopple) who has ties to militant Islamic groups. In Table 1 we compared some discovered stories with fragments put together by analysts. The inputs from the analysts are not complete stories but rather scattered,

    [Figure 11 screenshots, three steps with callouts.]

    Step 1: The analyst requests a story connecting a pair of interesting documents. Callout: The generated story between the two endpoints. The system has identified two linking documents, and connected them together into a linked story.

    Step 2: Unsatisfied with the strength of the connection, the analyst requests information about documents in the surrounding neighborhood (i.e., within the local clique). Callout: A list of the neighbors of the third document. The lines provide visual links to open documents.

    Step 3: Having explored the local neighborhood, the analyst has identified two additional documents that form a more meaningful connection and extends the original story. Callout: New connections have been manually added to extend the story.

    Figure 11: Illustration of AW usage.

    piecewise connections. The table illustrates that all the stories were discovered by our algorithm with two exceptions: the stories were not in the directly discovered path but were present in the clique chain (i.e., the story did not exhibit the same junction points), or the fragment can be discovered by merging multiple stories together. This depicts the potential of our heuristic in helping AW analysts discover stories algorithmically.

    Illustration of AW Usage

    Fig. 11 shows an example of the usage of AW and our algorithms. In this scenario, the analyst requests a story connecting a pair of interesting documents. The algorithm returns a story but the analyst is not satisfied with parts of the story. The analyst then requests information about documents in the surrounding neighborhood (i.e., within the local clique) of an intermediate document. Having explored the local neighborhood, the analyst identifies two additional documents that form a more meaningful connection and extends the original story. The two story fragments of Table 1 that were not directly found by the algorithm could be modified by the analyst to obtain more meaningful stories.

    Related Literature

    We organize related work in this space under various categories.

    Relationships via associations: Jayadevaprakash et al. (2005) advocate a transitive method to generate an association graph to find relationships between non-cooccurring text objects. The authors advocate the use of transitive methods because transitive methods do not require expensive training by human experts. Similarly, our approach does not require expensive training, but we situate our methods in a visual analytics setting with intelligence experts providing active feedback in the discovery process. Vaka and Mukhopadhyay (2009) describe a method to extract transitive associations among diseases and herbs related to Ayurveda. The method is based on a text-mining technique designed for discovering transitive associations among biological objects. It uses a vocabulary discovery method from a subset of PubMed corpora to associate herbs and diseases. Thaicharoen (2009) aims to discover relational knowledge in the form of frequent relational patterns and relational association rules from disjoint sets of literature. Although the aim of the research of Vaka and Mukhopadhyay and Thaicharoen is somewhat similar to our objective, we focus on finding connecting chains in an induced similarity network of documents rather than finding a chain of associations via external knowledge.

    Topic based hypotheses generation: Jin et al. (2007) present a tool based on link analysis and text mining methodologies to detect links between two topics across two individual documents. Srinivasan (2004) presents text mining algorithms that are built within the framework established by Swanson (1991). The algorithms generate ranked term lists where the key terms represent novel relationships between topics. Although we do not conduct explicit topic modeling in our work, the requirement to impose clique constraints in story construction essentially helps transduce slowly between topics.

    Classification and clustering for hypotheses generation: Glance et al. (2005) describe a system that gathers specific types of online content and delivers analytics based on classification, natural language processing, and other mining technologies in a marketing intelligence application. Faro et al. (2009) propose a clustering method aimed at discovering hidden relationships for hypothesis generation and suitable for semi-interactive querying. Our method does not depend on classification/clustering for information organization but harnesses CBD structures in finding chains between documents of different clusters.

    Connecting the dots: The “connecting the dots” problem has appeared in the literature in different guises and for different applications: cellular networks (Brassard and Gecsei 1980), social networks (Faloutsos et al. 2004), image collections (Heath et al. 2010), and document collections (Das-Neves et al. 2005; Kumar et al. 2006; Shahaf and Guestrin 2010). Our work explicitly harnesses CBD structures whereas many of these works focused on contexts with weaker dispersion requirements. For instance, the model proposed by Shahaf and Guestrin (2010) explicitly requires a connecting thread of commonality through all documents in a story.

    Discussion

    We have described a visual analytics system (AW) that provides both exploratory and algorithmic support for analysts in making connections. Privacy considerations prohibit us from describing the new applications that AW is being used for but the experimental results demonstrate its range of capabilities. Future work is geared toward more mixed-initiative facilities for story generation and probabilistic methods to accommodate richer forms of analyst’s feedback. We are also working toward techniques to do automatic story summarization and concept map generation.

    Acknowledgments

    This work is supported in part by the Institute for Critical Technology and Applied Science, Virginia Tech, and the US National Science Foundation through grant CCF-0937133.

    References

    Andrews, C.; Endert, A.; and North, C. 2010. Space to Think: Large High-resolution Displays for Sensemaking. In CHI ’10, 55–64.

    Beygelzimer, A.; Kakade, S.; and Langford, J. 2006. Cover Trees for Nearest Neighbor. In ICML ’06, 97–104.

    Bier, E.; Ishak, E.; and Chi, E. 2006. Entity Workspace: An Evidence File That Aids Memory, Inference, and Reading. In ISI ’06, 466–472.

    Brassard, J.-P., and Gecsei, J. 1980. Path Building in Cellular Partitioning Networks. In ISCA ’80, 44–50.

    Das-Neves, F.; Fox, E. A.; and Yu, X. 2005. Connecting Topics in Document Collections with Stepping Stones and Pathways. In CIKM ’05, 91–98.

    Eccles, R.; Kapler, T.; Harper, R.; and Wright, W. 2008. Stories in GeoTime. Info. Vis. 7(1):3–17.

    Faloutsos, C.; McCurley, K. S.; and Tomkins, A. 2004. Fast Discovery of Connection Subgraphs. In KDD ’04, 118–127.

    Faro, A.; Giordano, D.; Maiorana, F.; and Spampinato, C. 2009. Discovering Genes-diseases Associations from Specialized Literature using the Grid. Trans. Info. Tech. Biomed. 13:554–560.

    FMS, Inc. FMS Advanced Systems Group, Sentinel Visualizer. Last accessed: May 26, 2011, http://www.fmsasg.com/.

    Glance, N.; Hurst, M.; Nigam, K.; Siegler, M.; Stockton, R.; and Tomokiyo, T. 2005. Deriving Marketing Intelligence from Online Discussion. In KDD ’05, 419–428.

    Havre, S.; Hetzler, E.; Whitney, P.; and Nowell, L. 2002. ThemeRiver: Visualizing Thematic Changes in Large Document Collections. IEEE TVCG 8(1):9–20.

    HCII. Human Computer Interaction Institute, Carnegie Mellon University, Jigsaw. Last accessed: May 26, 2011, http://www.hcii.cmu.edu/mhci/projects/jigsaw.

    Heath, K.; Gelfand, N.; Ovsjanikov, M.; Aanjaneya, M.; and Guibas, L. 2010. Image Webs: Computing and Exploiting Connectivity in Image Collections. In CVPR, 3432–3439.

    Hsieh, H., and Shipman, F. M. 2002. Manipulating Structured Information in a Visual Workspace. In UIST ’02, 217–226.

    Hughes, F. J. 2005. Discovery, Proof, Choice: The Art and Science of the Process of Intelligence Analysis, Case Study 6, “All Fall Down”, Unpublished report.

    i2group. The Analyst’s Notebook. Last accessed: May 26, 2011, http://www.i2group.com/us.

    Jayadevaprakash, N.; Mukhopadhyay, S.; and Palakal, M. 2005. Generating Association Graphs of Non-cooccurring Text Objects using Transitive Methods. In SAC ’05, 141–145.

    Jin, W.; Srihari, R. K.; and Ho, H. H. 2007. A Text Mining Model for Hypothesis Generation. In ICTAI ’07, 156–162.

    Kang, H.; Plaisant, C.; Lee, B.; and Bederson, B. B. 2007. NetLens: Iterative Exploration of Content-actor Network Data. Info. Vis. 6(1):18–31.

    Kang, Y.; Görg, C.; and Stasko, J. 2009. The Evaluation of Visual Analytics Systems for Investigative Analysis: Deriving Design Principles from a Case Study. In VAST, 139–146.

    Khurana, H.; Basney, J.; Bakht, M.; Freemon, M.; Welch, V.; and Butler, R. 2009. Palantir: a Framework for Collaborative Incident Response and Investigation. In IDtrust ’09, 38–51.

    Kirsh, D. 1995. The Intelligent Use of Space. Artif. Intell. 73(1-2):31–68.

    Kumar, D.; Ramakrishnan, N.; Helm, R. F.; and Potts, M. 2006. Algorithms for Storytelling. In KDD ’06, 604–610.

    Pirolli, P., and Card, S. 2005. The Sensemaking Process and Leverage Points for Analyst Technology as Identified through Cognitive Task Analysis. In ICIA ’05.

    PNNL. Pacific Northwest National Laboratory, INSPIRE visual document analysis. Last accessed: May 26, 2011, http://in-spire.pnl.gov.

    Shahaf, D., and Guestrin, C. 2010. Connecting the Dots between News Articles. In KDD ’10, 623–632.

    Shipman, F. M., and Marshall, C. C. 1999. Formality Considered Harmful: Experiences, Emerging Themes, and Directions on the Use of Formal Representations in Interactive Systems. CSCW 8:333–352.

    Srinivasan, P. 2004. Text Mining: Generating Hypotheses from MEDLINE. J. Am. Soc. Inf. Sci. Technol. 55:396–413.

    Swanson, D. R. 1991. Complementary Structures in Disjoint Science Literatures. In SIGIR ’91, 280–289.

    Thaicharoen, S. 2009. Text Association Mining with Cross-sentence Inference, Structure-based Document Model and Multi-relational Text Mining. Ph.D. Dissertation, Univ. of Colorado at Denver.

    Thomas, J. J., and Cook (eds.), K. A. 2005. Illuminating the Path: The Research and Development Agenda for Visual Analytics. IEEE Computer Society Press.

    Vaka, H. G. G., and Mukhopadhyay, S. 2009. Hypotheses Generation Pertaining to Ayurveda Using Automated Vocabulary Generation and Transitive Text Mining. In NBIS ’09, 200–205.

    Wright, W.; Schroh, D.; Proulx, P.; Skaburskis, A.; and Cort, B. 2006. The Sandbox for Analysis: Concepts and Methods. In CHI ’06, 801–810.

    Zaki, M. J., and Ramakrishnan, N. 2005. Reasoning About Sets Using Redescription Mining. In KDD ’05, 364–373.