
InfoScout

An interactive, entity-centric person search tool

Sean McKeown
School of Computing
Edinburgh Napier University
Edinburgh, Scotland
[email protected]

Martynas Buivys and Leif Azzopardi
School of Computing Science
University of Glasgow
Glasgow, Scotland
[email protected] [email protected]

ABSTRACT
Individuals living in highly networked societies publish a large amount of personal, and potentially sensitive, information online. Web investigators can exploit such information for a variety of purposes, such as background vetting and fraud detection. However, such investigations require many hours of expensive human effort. This paper describes InfoScout, a search tool which is intended to reduce the time it takes to identify and gather subject-centric information on the Web. InfoScout collects relevance feedback from the investigator in order to re-rank search results, allowing the intended information to be discovered more quickly. Users may still direct their search as they see fit, issuing ad-hoc queries and filtering existing results by keywords. Design choices are informed by prior work and industry collaboration.

Keywords
Web Investigation; Open-Source Intelligence; People Searching; Professional Search; Entity Search

1. INTRODUCTION
The World Wide Web is an invaluable resource, providing access to a massive amount of information covering a wide variety of subjects. Ubiquitous access to the Internet has made this information more accessible than ever, while blogging, social media, and communication platforms allow users to be publishers of information, rather than simply consumers. Governments, businesses, and public bodies also contribute to this wealth of information, publishing census data, maps and financial documentation. One way in which these Big Data sources can be exploited is in a variety of Open-Source Intelligence (OSINT)¹ investigations, which make use of publicly available, heterogeneous information to pursue particular goals. The types of OSINT investigation are varied, ranging from employee vetting [4], competitive intelligence [1], fraud detection [11] and copyright protection [1], to law enforcement [7] and counter-terrorism [3].

¹ In this context, the term 'Open-Source' refers to the public availability of information, rather than a software licensing paradigm.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

SIGIR '16, July 17–21, 2016, Pisa, Italy
© 2016 Copyright held by the owner/author(s). Publication rights licensed to ACM.
ISBN 978-1-4503-4069-4/16/07 ... $15.00
DOI: http://dx.doi.org/10.1145/2911451.2911468

Open-Source information, particularly public social media data, is widely adopted for these purposes. Lexis Nexis indicates that 81% of law enforcement professionals in the United States make use of social media platforms as an investigative tool [7], with Facebook, Twitter and YouTube being the most frequented [7]. Similarly, CareerBuilder found that 43% of employers utilise social media to perform background checks on prospective employees, with half of them having rejected candidates as a result of their findings [4].

While such investigations are used in a variety of industries, they are time-consuming and fraught with problems. This paper describes an application, informed by findings from industry engagement, which is intended to reduce the time it takes for an investigator to find pages relevant to a given subject: the target of the investigation.

2. BACKGROUND
While such types of search are prominent in a variety of industries, the literature in relation to the behaviour of Web investigators is sparse. High-level descriptions of processes and techniques can be found in the NATO OSINT Handbook [9], which is set in the context of military-like operations. Appel [1] provides a similar overview from the perspective of a professional Web investigator, while providing additional discussion pertaining to industry best practices and ethical considerations.

In order to better understand the requirements of investigators, McKeown et al. [8] conducted a qualitative study which compares and contrasts the behaviour of Web investigators with that of other search domains. It was found that an investigation is split into two major phases: (i) Background Verification and (ii) Open-Ended Search. The background verification stage primarily relies on fixed resources, such as resellers of the electoral roll, address information and company directories. The latter focuses more on expanding knowledge of the subject, largely by way of general Web search engines and social media.

These investigators demonstrated low levels of search expertise, with minimal use of advanced search functionality. However, a high level of domain expertise was evident: contextual clues, personal knowledge and experience facilitated the discrimination of search results, as well as the assessment of the reliability and quality of pieces of information and sources. Investigators were aware of the notion of a digital footprint, but did not take any technical measures to conceal themselves online. Instead, they employed sensible strategies for avoiding contact with the subject of the investigation, such as carefully considering their interaction with social media profiles.

As a form of domain-specific search, such investigations possess properties of exhaustive searches, as in the patent search [6] and E-Discovery [2] domains; however, the end goal is not a complete enumeration of all relevant information. High recall is subordinate to high precision, with emphasis on finding the critical pieces of information required to fulfil the purpose of the investigation. The process, then, is more correctly characterised as a form of exploratory search.

However, such investigations are not without difficulty, with issues being highlighted in:

(i) person disambiguation and name variants;

(ii) underutilisation of technological solutions for case management and searching;

(iii) corroboration of findings and the reliability of evidence;

(iv) time constraints associated with the large volumes of data available on the Web.

2.1 Existing Tools
While investigators in McKeown et al. [8] typically only made use of a popular Web browser, general Web search engines and word processors to carry out the investigation, there are products available for such purposes.

Copernic Agent² is a meta-search engine which allows search queries to be disseminated to a wide variety of sources for OSINT information gathering. External search engines and resources are categorised thematically, allowing the user to quickly execute searches for a given type of information. However, as of January 2014, this product is no longer supported, with many of the underlying calls to search engines being non-functional at the time of writing.

Maltego³ is a proprietary OSINT and forensics tool which focuses on allowing the user to determine the links between entities. Maltego has a broad scope and can be used for a variety of OSINT purposes, providing the ability to visualise and discover real-world links between entities in social or physical networks. This is achieved by combining meta-search functionality with an underlying graph, allowing the user to explore common nodes or expand the graph to new entities. The search interface differs greatly from that of standard Web search engines, with most actions populating or operating on the graph directly. As a result, non-technical users may find this tool difficult to use.

Oryon C Portable⁴ is a portable Web browser based on Chromium which is pre-installed with a variety of add-ons and bookmarks for OSINT investigations. The default homepage provides tabs to select common search types, such as Web, Images, Blogs and Maps, while the bookmarks and other features are designed to give an investigator fast access to common resources. Specific resources are searched individually, and the browser itself is geared towards protecting the privacy of the user.

² https://www.copernic.com/en/products/agent/
³ http://www.paterva.com/web6/products/maltego.php
⁴ http://osintinsight.com/oryon.php

Person searching makes up a substantial portion of all Web searches [10, 5], and, as a result, several people search engines exist, such as Pipl⁵ and Spokeo⁶. Such search engines typically behave like federated search systems, with details such as the individual's name and location fuelling queries to underlying social media, blogging and general Web searches. Disambiguation options are limited and there is often geographical confusion in the results, with the user being left to manually sift through the documents for the correct individual.

⁵ https://pipl.com/
⁶ http://www.spokeo.com/

3. INFOSCOUT
Existing tools are insufficient because they do not adequately address the core problems faced by the investigator. Subject-centric investigations rely heavily on disambiguating namesakes in order to acquire information about a particular individual, rather than other people with the same name. This can be achieved with advanced query formulation; however, this requires some degree of search expertise on the part of the investigator. Similarly, while existing solutions allow for faster and more convenient searching of multiple resources, the user must still manually inspect all documents, or their snippets, in order to identify relevant results and extract the information, with minimal support for actually capturing evidence.

The requirements of the investigator can be described in simple economic terms: they wish to maximise the recovery of information relating to the subject and the case, while minimising the amount of effort required. InfoScout, the system described in this work, aims to address the investigator's difficulty with the large volumes of data found on the Web, allowing users to pinpoint information more readily, without the need for advanced query formulations or large quantities of existing knowledge.

InfoScout possesses the following features:

• Automatic page fetching and screenshot acquisition

• Content, entity and key term extraction

• User relevance feedback and re-ranking of search results

• Entity and term filtering of search results

• Case management

The core of InfoScout is designed around allowing the user to identify relevant documents quickly by re-ranking results based on user feedback. This allows the user to focus their time on extracting information from key documents, as opposed to locating such documents in the first place.

Initially, the user is presented with a simple interface which allows cases to be added, deleted, edited or continued. Each case has a subject as its focus, to which all search actions should pertain. The user enters ad-hoc queries and is then presented with a list of document screenshots, and links to said documents, for relevance judgements. This feedback is then used to build a ranking query which re-ranks the results using relevant terms and entities extracted from relevant documents. This inherently allows for person disambiguation, as it allows features of the subject's social network, location, hobbies and associated institutions to play a role in determining the order of documents. By marking a document containing the entity NASA as relevant, further documents with this term become more likely to be highly ranked. Such re-ranking allows results for a John Smith who works at NASA to be distinguished from those for a John Smith who works for IBM.
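To make the feedback aggregation concrete, the following TypeScript sketch accumulates signed entity weights from judged documents; the types, field names and weighting scheme are illustrative assumptions rather than InfoScout's actual code.

```typescript
// Hypothetical sketch: aggregate relevance judgements into signed
// entity weights. Not InfoScout's actual implementation.

interface JudgedDocument {
  relevant: boolean;              // the investigator's judgement
  entities: Map<string, number>;  // entity -> within-document relevance score
}

function aggregateFeedback(judged: JudgedDocument[]): Map<string, number> {
  const scores = new Map<string, number>();
  for (const doc of judged) {
    const sign = doc.relevant ? 1 : -1;
    for (const [entity, weight] of doc.entities) {
      // Relevant documents push an entity's score up; non-relevant pull it down.
      scores.set(entity, (scores.get(entity) ?? 0) + sign * weight);
    }
  }
  return scores; // e.g. Map { "NASA" => 1.7, "IBM" => -0.9 }
}
```

A positive score for NASA would then promote further NASA-related documents in the ranking, realising the John Smith example above.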

The interface for relevance feedback is simple, taking the form of a sequential suggestion of documents. The user is presented with a screenshot of the document and is given the option to mark the document as relevant, non-relevant, or to skip it altogether; see Figure 1. The re-ranking itself is transparent to the user.

The user must also be allowed to direct the search as they find relevant pieces of information about the subject. Ad-hoc queries can be submitted with new query terms, which will then be ranked using the information from previously judged documents. The user can choose to be presented with individual documents for judgement, or can explore the list of results as with a standard interface for Search Engine Results Pages (SERPs). Additionally, users may make use of extracted entities and terms in order to filter documents, as in Figure 2. This is useful in the case that the user wants to focus the investigation on particular entities or keywords, such as the city or employer of the subject. Filters are applied on the standard SERP view.

Documents which are marked as relevant are also stored as part of the case, facilitating case management while also preserving the information as it appeared at the time of capture. This particular functionality mitigates the effort of manually tracking screenshots and documents in word processors, as was done by the investigators in McKeown et al. [8], while also guarding against later attempts at evidence obfuscation by the subject.

Search results are provided solely by the Bing search engine, as it provides results from a wide range of Web sources and includes publicly indexed social media pages. Early iterations of InfoScout used a federated approach; however, this was discovered to be impractical, with API limitations and social media privacy settings negating the benefits. In practice, modern social media APIs are inherently privacy aware, such that less information is found programmatically than would otherwise be discovered by visiting the profile using a browser. Social media access in InfoScout therefore leverages the Bing search engine, which may be tailored with appropriate queries⁷. Additionally, as was likely the case with Copernic Agent, the maintenance of a large number of APIs, which are likely to change over time, was considered an unnecessary overhead. Screenshots are provided to the user in order to facilitate rapid dismissal of irrelevant documents while avoiding the technical limitations of other means of directly displaying the document in the application, such as iframes.
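As an illustration of such query tailoring, the sketch below targets a subject's profile on a given site via Bing's site: operator; the REST endpoint and subscription-key header are assumptions based on the Bing Web Search API of the period, not details drawn from InfoScout.

```typescript
// Hypothetical sketch of tailoring a Bing query towards a social media
// profile (cf. footnote 7). Endpoint and header are assumptions.

async function bingProfileSearch(subject: string, site: string): Promise<unknown> {
  const query = `site:${site} "${subject}"`; // e.g. site:plus.google.com "John Smith"
  const url = 'https://api.cognitive.microsoft.com/bing/v5.0/search?q=' +
    encodeURIComponent(query);
  const response = await fetch(url, {
    headers: { 'Ocp-Apim-Subscription-Key': process.env.BING_KEY ?? '' },
  });
  return response.json(); // ranked webPages results as JSON
}
```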

InfoScout facilitates the Open-Ended portion of the investigation, which is the most time-consuming element of a case. In doing so, the cost of integrating existing, fixed resources is mitigated, with the more constrained scope providing a cleaner and more focused solution. Similarly, no assumption is made about the financial position of the user or their subscriptions/access to popular industry resources, such as 192.com.

⁷ Such as searching for John Smith's Google+ profile with the query prefix site:plus.google.com

Figure 1: User document feedback; documents are suggested for judgement one at a time from the top of the ranking.

3.1 Implementation
InfoScout is designed as a Web application which can be run from any browser, with core components being hosted in the cloud as part of a distributed service. The core system is the OSINT API, which is used both to coordinate the various services and operations and to serve as the Web server to which the user connects. The API is implemented as a Node.js application, built using the Express framework. Re-ranking and filtering of Bing results is achieved by building an intermediate, local index of Bing results, which makes use of the ElasticSearch platform. A consumer/provider pipeline in the OSINT API processes individual Bing result URLs using the Alchemy Language API⁸, which allows the service to scale with large volumes of results. The parsed content of the documents, as well as their entities and most discriminative terms (within-document, ranked by their TF-IDF), are stored in separate fields in the index. These fields facilitate later re-ranking and querying. Screenshots of the remote pages are captured using Screenshotlayer⁹, which also stores the resulting images. Case information is stored using IBM's Compose MongoDB service.
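A minimal sketch of such an intermediate index follows, using the legacy 'elasticsearch' Node client; the index name, mapping type and field names are assumptions inferred from the description above, not InfoScout's actual schema.

```typescript
// Hypothetical sketch of the intermediate index. Index name, mapping
// type and field names are assumptions, not InfoScout's schema.

import * as elasticsearch from 'elasticsearch';

const client = new elasticsearch.Client({ host: 'localhost:9200' });

async function createCaseIndex(caseId: string): Promise<void> {
  await client.indices.create({
    index: `case-${caseId}`,
    body: {
      mappings: {
        result: {
          properties: {
            url:       { type: 'keyword' }, // the Bing result URL
            content:   { type: 'text' },    // parsed page content
            entities:  { type: 'keyword' }, // extracted entities, e.g. ["NASA"]
            key_terms: { type: 'keyword' }, // top within-document TF-IDF terms
          },
        },
      },
    },
  });
}
```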

Document re-ranking is achieved by building ranking queries for submission to the ElasticSearch index. These ranking queries take the form of ElasticSearch Boosting Queries, which are comprised of lists of positive and negative document terms and entities. Modules within the OSINT API build this query in stages from the list of judged documents on the local index. The first stage aggregates scores for all terms and entities in the index for judged documents, with relative weights being calculated by within-document relevance, as determined by the Alchemy API. The second stage dynamically bins terms and entities by score, allowing appropriate boosts to be applied to each bin. The final bins are then used to construct the re-ranking Boosting Query, which is submitted to the ElasticSearch index for an updated document ranking. The result is that the most discriminative terms and entities from judged documents are used to weight new documents for the investigator, facilitating the discovery of pertinent information. While both terms and entities are present in the re-ranking query, the boost weightings are biased towards entities, as these are fundamental to this kind of search. Similarly, filtering of results is achieved by building Bool(ean) Queries for the ElasticSearch index, making use of the extracted entities and document content.

⁸ http://www.alchemyapi.com/products/alchemylanguage
⁹ https://screenshotlayer.com/

Figure 2: Filtering of search results; applying ad-hoc filters allows the user to focus the search within retrieved documents.
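As a sketch of the two query types just described, the fragment below constructs a Boosting Query from pre-binned positive terms and entities, and a Bool Query for an entity filter; bin boundaries, boost values and field names are illustrative assumptions.

```typescript
// Hypothetical sketch of the two ElasticSearch query bodies described
// above. Boost values and field names are illustrative only.

interface Bin { items: string[]; boost: number; }

// Boosting Query: each positive bin promotes its terms/entities with its
// own boost; negatively judged items demote documents via negative_boost.
function buildBoostingQuery(positiveBins: Bin[], negative: string[]) {
  return {
    query: {
      boosting: {
        positive: {
          bool: {
            should: positiveBins.map(bin => ({
              terms: { entities: bin.items, boost: bin.boost },
            })),
          },
        },
        negative: { terms: { entities: negative } },
        negative_boost: 0.2,
      },
    },
  };
}

// Bool(ean) Query: restrict the SERP view to documents mentioning a
// chosen entity, e.g. the subject's employer or city.
function buildFilterQuery(entity: string) {
  return {
    query: {
      bool: {
        must: { match_all: {} },
        filter: { term: { entities: entity } },
      },
    },
  };
}
```

Both bodies are submitted to the local index rather than to Bing, so re-ranking and filtering incur no additional Web requests.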

4. SUMMARY AND FUTURE WORK
We have presented InfoScout, which is intended to reduce the time spent searching public resources for subject-centric information. The interface is minimalistic and the inner workings are transparent to the user, mitigating requirements for training and search expertise. Users can direct their search while reducing the number of irrelevant documents that they encounter as they pursue a particular subject or critical piece of information.

While this system is essentially a proof of concept, with enough resources it would be possible to build a new entity-centric index in the same manner as the Bing search engine. This would mitigate the need for a secondary index and mid-investigation page fetching, while providing more flexibility. Additionally, more advanced relevance feedback and re-ranking methods could be employed, such as maintaining language models for relevant/irrelevant documents. However, this would require further research into domain-specific re-ranking in order to determine what is most effective in a Web-based, entity-centric environment.

Additional features to be implemented will further improve the time-saving potential of the application by providing more intelligence via extended automatic document processing. Information extraction techniques and clustering can be utilised to generate entity profiles and associated documents, allowing the user to further focus their search on a particular namesake. Entity co-citation and query expansion using known name variants would further bolster the person disambiguation process. Case management features will also be expanded to allow for entity and document relationship handling, with a graph-based visualisation of the investigation.

While Bing is a powerful search engine, the user must still have some degree of acquaintance with its advanced operators in order to be maximally effective. To this end, it may be useful to facilitate advanced query building for the user by automatically populating site and boolean operators based on user input, as sketched below.
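A small sketch of this idea, with a hypothetical list of target sites, follows; the operator layout is illustrative only.

```typescript
// Hypothetical sketch of automatic advanced-query building: quote name
// variants, OR them together, and restrict to common social sites.

const SOCIAL_SITES = ['linkedin.com', 'twitter.com', 'facebook.com'];

function buildAdvancedQuery(name: string, variants: string[] = []): string {
  const names = [name, ...variants].map(n => `"${n}"`).join(' OR ');
  const sites = SOCIAL_SITES.map(s => `site:${s}`).join(' OR ');
  return `(${names}) (${sites})`;
  // e.g. ("John Smith" OR "Jon Smith") (site:linkedin.com OR site:twitter.com OR site:facebook.com)
}
```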

Finally, an evaluation of the system should be conducted in order to quantify its effectiveness. This could be broken into two parts: (i) an evaluation of the effectiveness of the re-ranking and user feedback, which would require the generation of an appropriate ground-truth corpus; (ii) a usability and task-based assessment conducted with industry professionals. The former would allow offline tweaking and optimisation of the underlying mechanics, while the latter would investigate the appropriateness of the interface design and determine the real-world impact on the time/effort investment of the investigator.

5. REFERENCES
[1] E. J. Appel. Internet Searches for Vetting, Investigations, and Open-source Intelligence. Taylor and Francis Group, 2011.

[2] S. Attfield and A. Blandford. Improving the cost structure of sensemaking tasks: Analysing user concepts to inform information system design. In Proc. of INTERACT '09, pages 532–545, 2009.

[3] Department of Homeland Security. DHS terrorist use of social networking: Facebook case study. Public Intelligence, 2010.

[4] J. Grasz. Number of employers passing on applicants due to social media posts continues to rise, according to new CareerBuilder survey. CareerBuilder, June 2014.

[5] R. Guha and A. Garg. Disambiguating people in search. In Proc. of the 13th World Wide Web Conference, 2004.

[6] H. Joho, L. A. Azzopardi, and W. Vanderbauwhede. A survey of patent users: An analysis of tasks, behavior, search functionality and system requirements. In Proc. of IIiX '10, pages 13–24, 2010.

[7] Lexis Nexis. Social media use in law enforcement: Crime prevention and investigative activities continue to drive usage. Nov. 2014.

[8] S. McKeown, D. Maxwell, L. Azzopardi, and W. B. Glisson. Investigating people: A qualitative analysis of the search behaviours of open-source intelligence analysts. In Proc. of the 5th Information Interaction in Context Symposium, IIiX '14, pages 175–184, New York, NY, USA, 2014. ACM.

[9] NATO. NATO Open Source Intelligence Handbook, 2001.

[10] A. Spink, B. J. Jansen, and J. Pedersen. Searching for people on web search engines. Journal of Documentation, 60(3):266–278, 2004.

[11] L. Subelj, S. Furlan, and M. Bajec. An expert system for detecting automobile insurance fraud using social network analysis. Expert Syst. Appl., 38(1):1039–1052, Jan. 2011.