This UT CID research was supported in part by the following organizations:

identity.utexas.edu

Predicting and Explaining Identity Risk, Exposure and Cost Using the Ecosystem of Identity Attributes

Razieh Nokhbeh Zaeem, Suratna Budalakoti, K. Suzanne Barber, Muhibur Rasheed, Chandrajit Bajaj

2016

UT CID Report #1603



Predicting and Explaining Identity Risk, Exposure and Cost Using the Ecosystem of Identity Attributes

Razieh Nokhbeh Zaeem, Suratna Budalakoti and K. Suzanne Barber
Center for Identity
The University of Texas at Austin
{razieh, sbudalokoti, sbarber}@identity.utexas.edu

Muhibur Rasheed and Chandrajit Bajaj
Computational Visualization Center
The University of Texas at Austin
{muhibur, bajaj}@ices.utexas.edu

Abstract: Personally Identifiable Information (PII) is commonly used in both the physical and cyber worlds to perform personal authentication. A 2014 Department of Justice report estimated that roughly 7% of American households reported some type of identity theft in the previous year, involving the theft and fraudulent use of such PII. Establishing a comprehensive map of PII attributes and their relationships is a fundamental first step to protect users from identity theft.

In this paper, we present the mathematical representation and implementation of a model of Personally Identifiable Information attributes for people, named Identity Ecosystem. Each PII attribute (e.g., name, age, and Social Security Number) is modeled as a graph node. Probabilistic relationships between PII attributes are modeled as graph edges. We have implemented this Identity Ecosystem model as a Bayesian Belief Network (with cycles allowed) and we use Gibbs sampling to approximate the posteriors in our model. We populated the model from two sources of information: 1) actual theft and fraud cases; and 2) experts’ estimates.

We have utilized our Identity Ecosystem implementation to predict as well as to explain the risk of losing PII and the liability associated with fraudulent use of these PII attributes. For better human understanding of the complex identity ecosystem, we also provide a 3D visualization of the Identity Ecosystem model and queries executed on the model. This research aims to advance a fundamental understanding of PII attributes and to lead to better methods for preventing identity theft and fraud.

I. INTRODUCTION

Identity theft is now a widespread problem in the U.S. and around the globe. It affected an estimated 17.6 million persons in 2014 [1] and topped the Federal Trade Commission’s national ranking of consumer complaints for the 15th consecutive year [2]. Identity theft is a crime that involves the fraudulent acquisition and use of a person’s personally identifiable information (PII).

In order to thwart identity thieves and fraudsters, a first but fundamental step is to understand what constitutes a person’s identity, in both the cyber and physical worlds. The cyber world has seamlessly merged into our everyday physical world, making a person’s identity a complex intermingling of their on-line and off-line attributes. Examples of on-line attributes are one’s social media accounts, on-line shopping patterns, passwords, and email accounts. Off-line attributes are those related to the physical world such as bank accounts, credit and debit cards, Social Security Number, and one’s physical characteristics.

We have designed and implemented the Identity Ecosystem at the Center for Identity at the University of Texas at Austin as a tool that models identity theft and fraud, analyzes its data, and answers several questions about identity risk and management. For example, the Ecosystem can predict the risk (i.e., probability) of breach for each PII attribute and the potential dollar-value damage to the PII owner if the attribute is fraudulently used. The risk of exposure of a PII attribute depends on the different methods by which it can be breached. The value of an attribute depends on how it can be fraudulently used, possibly to facilitate further breaches. In addition, once more information about the victim or the incident is available, the Ecosystem is able to refine the predicted risk and value to reflect the new information and converge to the risk and value in the real world.

The Identity Ecosystem stores known data about PII breaches and fraudulent usage in a probabilistic model, and performs Bayesian Network-based inference on this data to identify high-risk and high-value targets. Bayesian networks as a statistical tool are a good fit for this problem because of the highly complex interdependence between the various attributes. Based on the probabilistic analysis, the Center for Identity Ecosystem tool presents the results as an easy-to-interpret graph-based visualization, where the attributes are represented as nodes and the relations are shown as edges. The visualization enables the user to interactively play out various scenarios, and draw conclusions about the risk of exposure and value of the attributes of interest.

In this paper, we first set forth various example use cases of the Ecosystem tool (Section II). Then, we formally explain the underlying model of the Ecosystem (Section III) followed by the explanation of our source of raw data (Section IV). We present the Ecosystem implementation details (Section V), and finally conclude and propose future directions (Section VI).

II. EXAMPLE USE CASES

The Ecosystem tool can be utilized to investigate and answer many possible identity management questions. In this section, we explain a couple of sample use cases of the Ecosystem.

978-1-5090-1072-1/16/$31.00 ©2016 IEEE


Fig. 1. Risk and Value of PII attributes in the Ecosystem.

A. Risk and Values of Attributes

The Ecosystem user can analyze and make decisions about identity risk and value for individuals. The Ecosystem predicts the risk (i.e., probability) of breach for a PII attribute and a potential dollar value damage to the PII owner if the attribute is fraudulently used. PII attributes are connected in many ways. For instance, a low-value attribute might be connected to high-value attributes: a threat may gain access to the low-value attribute and then, via the links between attributes, gain access to the high-value ones. Low-value, high-risk attributes connected to high-value attributes therefore signal trouble.

The Ecosystem Graphical User Interface (GUI) displays PII attributes as nodes and various types of connections between them as edges. The GUI can color and size attribute nodes based on various properties of the attribute, for example, their risk and value. Figure 1 shows the typical set of PII attributes for a person, in which nodes are colored based on their risk of exposure (high risk of exposure in red, medium risk in yellow, and low risk in green) and are sized based on their value (the bigger the node, the higher the dollar value). The Ecosystem user can visually investigate PII attributes, their risk and value, and their connections. As stated above, small red nodes connected to big nodes indicate potential identity theft threats.
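The "small red node connected to a big node" pattern can also be checked programmatically. The following is a minimal sketch, not the Ecosystem's implementation; the attribute names, their risk and value levels, the edge, and the function name `risky_gateways` are all hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical toy data: attribute -> (risk level, value level).
ATTRS: Dict[str, Tuple[str, str]] = {
    "email address": ("high", "low"),
    "bank account": ("low", "high"),
    "zip code": ("high", "low"),
}
# Hypothetical directed edges (source attribute -> target attribute).
EDGES: List[Tuple[str, str]] = [("email address", "bank account")]


def risky_gateways(attrs, edges):
    """Return edges that lead from a high-risk, low-value attribute
    to a high-value attribute."""
    flagged = []
    for src, dst in edges:
        src_risk, src_value = attrs[src]
        _, dst_value = attrs[dst]
        if src_risk == "high" and src_value == "low" and dst_value == "high":
            flagged.append((src, dst))
    return flagged


print(risky_gateways(ATTRS, EDGES))
```

In this toy data, "zip code" is high risk and low value but has no outgoing edge to a high-value attribute, so it is not flagged.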

B. High Level Understanding of Identity

The Ecosystem distinguishes various properties of identity attributes, such as the attribute’s type. The attribute’s type for a person is divided into four categories:

• What You Are: a person’s physical characteristics, such as fingerprints.

• What You Have: credentials and numbers assigned to a person by other organizations, such as Social Security Card.

• What You Know: information known privately to a person, such as passwords.

• What You Do: a person’s behavior and action patterns, such as GPS location.

Fig. 2. High Level Understanding of PII Types and Values.

The Ecosystem tool, as shown in Figure 2, can highlight attributes of each type, or provide combinations, to answer questions like: what are the most valuable credentials a person owns?

C. Authentication by Organizations

Authentication, from the perspective of an organization, is a method for verifying that a person is who he/she claims to be, so that a resource is being accessed only by persons who have a legitimate claim to it. Usually a set of attributes is used during the enrollment process, and another (often smaller) set is used to authenticate the person after enrollment. The authentication process exposes the organization to an obvious risk: it may be possible for someone to falsely authenticate themselves and gain access to privileged data. Another risk is that the enrollment information stored by the organization exposes it to potential liability if it is accessed illegally. For this reason, it is valuable for organizations to know both the likelihood of a false authentication given the data they use for the process, and the future exposure risk and liability they may face in case of a breach.

The Ecosystem assigns different properties to PII attributes, among which Accuracy at Enrollment measures how accurately an attribute can be verified in the authentication enrollment phase. To reduce an organization’s liability while also increasing its authentication accuracy, it is best to use a low risk, low value attribute that provides high accuracy at enrollment.

D. The Breeding Relationship

To breed a PII attribute or document is to create a (legitimate or fraudulent/counterfeit) instance of it. The Ecosystem shows various relationships (edges) between PII attributes, including the breeding relationship. The Ecosystem can determine, given an attribute for a person, the probability that other attributes can be fraudulently bred. The Ecosystem user can focus on first-, second-, or higher-order connectedness of PII attributes to follow multiple steps of breeding PII attributes or documents.


E. Ecosystem Queries

The Ecosystem is capable of answering some non-trivial questions relevant to the overall risk and liability of any person or organization in terms of managing identity attributes. For instance:

• Effect of exposure: When a set of attributes is exposed, how does it affect the risk of other attributes being exposed? For instance, if the SSN of an individual is compromised, what are the most risky PII items that fraudsters might try to obtain after that? To answer this question, the user can select the query Infer probability of breach based on evidence, choose Social Security Number (SSN) as evidence in the new window that opens, and run the query. Using Bayesian inference, the Ecosystem calculates the change in the probability of exposure after the compromise of SSN, which is reflected by the change of color of the nodes. The Ecosystem also shows the predicted expected loss because of the SSN compromise.

• Cause: If a set of attributes has been exposed, what was the most likely origin of the breach? As an example, if an individual finds out that his/her credit card information is compromised, the Ecosystem can help to Detect most probable origin of breach through selecting credit card information as the evidence and running the query.

• Cost/Liability: What is the total cost/liability of an attribute being exposed, in terms of increased risk of exposure of other attributes? Which attributes have the highest cost (breach hot-spots) and should be best protected?

III. ECOSYSTEM MATHEMATICAL MODEL

In this section, we elaborate on the mathematical model behind the Identity Ecosystem.

A. Modeling Identity Attributes and Relationships

We define a person’s identity as a set of informational data that are linked to the person. Each such piece of information is called an attribute. Name, age, zip code, and Social Security Number are examples of such attributes.

Attributes can be classified in many ways depending on their properties, such as whether or not an attribute is unique to a person, whether or not an attribute is widely used, how accurately it can be verified, etc. For example, attributes like name or zip code are applicable to any person but are not unique to a person and cannot be used on their own for verification or authentication purposes. On the other hand, SSN is unique to a specific person and hence a very good candidate for authentication.

We identify several different properties for attributes:

1) Type (categorizes PII based on their nature): What You Are, What You Have, What You Know, What You Do.

2) Risk (shows the risk of exposure): Low, Medium, High.

3) Liability Value (shows the monetary loss to the individual if compromised): Low, Medium, High.

4) Possession (identifies if individuals necessarily have the PII): Essential, Accidental.

5) Verification Accuracy At Enrollment (measures how accurately one’s PII can be verified at enrollment): Low, Medium, High.

6) Prevalence (shows what percentage of the population have the PII): Ubiquitous, Common, Rare.

7) Uniqueness (shows how unique the PII is for the individuals who have it): Individual, Small Group, Large Group.

8) Verification Invasiveness (shows how invasive it is to verify one’s PII): Low, Medium, High.

The Ecosystem displays each attribute as a node. It can color or size the attributes based on their properties. Once the user selects a property on which to base the color or size, all the identity attribute nodes will be colored or sized based on their current value of the selected property, as in Figures 1 (colored based on risk and sized based on liability value) and 2 (colored based on type and sized based on liability value).

Identity attributes are related to each other in many different ways. For example, one attribute can determine another, one attribute can be used to generate another, or one attribute might be composed of many other attributes. We recognize the following relationships between identity attributes α and β:
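As an illustration of how a node carries its properties and how a selected property drives the display, consider the following minimal sketch. It covers only four of the eight properties; the class and function names, the numeric node sizes, and the example levels assigned to SSN are assumptions, while the red/yellow/green mapping follows the description of Figure 1.

```python
from dataclasses import dataclass
from enum import Enum


class Level(Enum):
    LOW = "Low"
    MEDIUM = "Medium"
    HIGH = "High"


class AttrType(Enum):
    ARE = "What You Are"
    HAVE = "What You Have"
    KNOW = "What You Know"
    DO = "What You Do"


@dataclass(frozen=True)
class Attribute:
    name: str
    type: AttrType
    risk: Level
    liability_value: Level


# Risk colors as described for Figure 1; sizes are arbitrary illustrative units.
RISK_COLOR = {Level.HIGH: "red", Level.MEDIUM: "yellow", Level.LOW: "green"}
NODE_SIZE = {Level.LOW: 1, Level.MEDIUM: 2, Level.HIGH: 3}


def node_style(attr: Attribute):
    """Color a node by its risk and size it by its liability value."""
    return RISK_COLOR[attr.risk], NODE_SIZE[attr.liability_value]


# Hypothetical property values, chosen only for the example.
ssn = Attribute("SSN", AttrType.HAVE, Level.MEDIUM, Level.HIGH)
print(node_style(ssn))
```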

1) α Breeds β means that an instance/value of α may be used in order to create a legitimate or fraudulent instance/value of β. For example, driver’s license breeds many other documents like boarding pass.

2) α Composed Of β means that for any value $\alpha_i$ of the attribute α there is a value $\beta_j$ of the attribute β such that $\beta_j$ is a proper part of $\alpha_i$. For example, full name is composed of first and last names.

3) α Changes Sensitive To β means that for any person P with attributes α and β, if the value of β changes for P, then the value of α changes for P. For example, one’s photograph changes with age or one’s driver’s license changes with address.

4) α Temporally Precedes β means that for any person P, P must possess some value of attribute α before P can possess a value of attribute β. For example, a student identification number temporally precedes a degree.

5) α Determines β means that for any person P with attributes α and β, the value of α possessed by P implies the value of β possessed by P. For example, date of birth determines age.

6) α Necessary For β means that for any person P, if P has a value for the attribute β, then P has a value for the attribute α. For example, a passport number is necessary for a passport.

7) α Probabilistically Determines β means that for any person P with attributes α and β, P’s having a given value of α implies that P probably has some particular value of β. For example, one shares her spouse’s last name with a probability, therefore one’s spouse’s last name probabilistically determines one’s last name.

The relationship between two attributes α and β is shown with a directed edge from α to β in the Ecosystem. The user can select to view one or multiple types of edges at a time.
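The seven relationship types and the ability to view a chosen subset of them can be sketched as typed directed edges plus a filter. The names `Relation`, `Edge`, and `visible_edges` are hypothetical; the three example edges are taken from the examples in the list above.

```python
from enum import Enum
from typing import List, NamedTuple, Set


class Relation(Enum):
    BREEDS = "Breeds"
    COMPOSED_OF = "Composed Of"
    CHANGES_SENSITIVE_TO = "Changes Sensitive To"
    TEMPORALLY_PRECEDES = "Temporally Precedes"
    DETERMINES = "Determines"
    NECESSARY_FOR = "Necessary For"
    PROBABILISTICALLY_DETERMINES = "Probabilistically Determines"


class Edge(NamedTuple):
    source: str          # attribute alpha
    target: str          # attribute beta
    relation: Relation   # type of the directed edge alpha -> beta


EDGES: List[Edge] = [
    Edge("driver's license", "boarding pass", Relation.BREEDS),
    Edge("date of birth", "age", Relation.DETERMINES),
    Edge("spouse's last name", "last name",
         Relation.PROBABILISTICALLY_DETERMINES),
]


def visible_edges(edges: List[Edge], selected: Set[Relation]) -> List[Edge]:
    """Keep only the edges whose relationship type is selected for display."""
    return [e for e in edges if e.relation in selected]


shown = visible_edges(
    EDGES, {Relation.DETERMINES, Relation.PROBABILISTICALLY_DETERMINES})
print([(e.source, e.target) for e in shown])
```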

B. Modeling Identity Ecosystem

We represent the Identity Ecosystem as a graph $G(V, E)$ consisting of $N$ attributes (nodes) $A_1, \ldots, A_N$ and a set of directed edges between pairs of nodes. Each edge $e \in E$ is represented as a tuple $e_{ij} = \langle i, j \rangle$ where $A_i$ is the originating node and $A_j$ is the target node such that $1 \le i, j \le N$.

We define the set of all incoming edges to $A_j$ as $IN(A_j) = \{e_{xy} \mid e_{xy} \in E \wedge y = j\}$, and let the set of all parents of $A_j$ be $PARENT(A_j) = \{A_x \mid e_{xj} \in E\}$.

Each node $A_j$ is labeled with a Boolean random variable, denoted $D(A_j)$, which is true if the attribute has been exposed/breached and false otherwise. Each edge $e_{ij}$ represents a possible path by which $A_j$ can be breached given that $A_i$ is breached.¹ For simplicity, we consider all edges to be independent.² Therefore, we can assign conditional probabilities to each edge $CP(e_{ij}) = p(D(A_j) \mid D(A_i))$.

Consequently, the Identity Ecosystem model consists of:

1) a set of nodes $V$, each corresponding to an attribute,

2) a set of edges $E$, such that a directed edge exists between any nodes $A_i$ and $A_j$ if and only if $A_i$ impacts the risk of exposure of $A_j$, and

3) a list of conditional probability estimates for each node $A_j$, representing how the parent nodes $PARENT(A_j)$ impact the risk of the child $A_j$.

Also as part of the model, each node has a prior probability $P(A_i)$ of it getting exposed on its own (as the first breach in the network). For example, a person’s date of birth has a higher prior probability of being exposed than his/her SSN, simply because people are less careful with the former than the latter.

We also assume that each node has a monetary loss value, $L(A_i)$, which represents the amount an organization/person loses (intrinsically) in case the corresponding attribute is exposed. This loss does not include any secondary loss. For example, the exposure of one’s date of birth incurs low intrinsic cost, even though there might be scenarios where it may lead to further losses in future, by leading to the exposure of other more sensitive data. The model only assumes that the intrinsic loss value is provided.
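A minimal sketch of a container for this model follows, holding priors $P(A_i)$, intrinsic losses $L(A_i)$, and per-edge conditional probabilities $CP(e_{ij})$, with helpers for $IN(A_j)$ and $PARENT(A_j)$. The class name, method names, and all numbers are hypothetical; only the ordering of the two priors (date of birth above SSN) reflects the text.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple


@dataclass
class IdentityGraph:
    prior: Dict[str, float] = field(default_factory=dict)   # P(A_i)
    loss: Dict[str, float] = field(default_factory=dict)    # L(A_i)
    cp: Dict[Tuple[str, str], float] = field(default_factory=dict)  # CP(e_ij)

    def add_node(self, name: str, prior: float, loss: float) -> None:
        self.prior[name] = prior
        self.loss[name] = loss

    def add_edge(self, src: str, dst: str, prob: float) -> None:
        # prob stands for p(D(dst) | D(src)) on the directed edge src -> dst.
        self.cp[(src, dst)] = prob

    def incoming(self, node: str) -> List[Tuple[str, str]]:
        """IN(A_j): all edges whose target is the given node."""
        return [(s, d) for (s, d) in self.cp if d == node]

    def parents(self, node: str) -> Set[str]:
        """PARENT(A_j): sources of all edges into the given node."""
        return {s for (s, d) in self.cp if d == node}


g = IdentityGraph()
g.add_node("date of birth", prior=0.30, loss=10.0)   # illustrative numbers
g.add_node("SSN", prior=0.05, loss=1000.0)
g.add_edge("date of birth", "SSN", prob=0.2)
print(g.parents("SSN"), g.incoming("SSN"))
```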

C. Background: Bayesian Network-based Inference

The formal framework we defined above has significant similarity to machine learning tools broadly known as graphical models. Graphical models allow us to represent a complex network of probabilistically dependent or correlated random variables and perform inference on the model.

¹ The probabilistically determines edges directly suit this purpose. Other types of edges, too, imply the breach of one attribute based on the breach of the other with a certain probability.

² In a more general setting where edges are not necessarily independent, e.g., where multiple attributes are needed to breed a new attribute, we define joint probability distributions on all the edges in $IN(A_j)$ for each node $A_j$. The joint probability distribution can be defined as a function of all $D(A_i)$ such that $A_i$ belongs to $PARENT(A_j)$.

Bayesian networks are a probabilistic graphical model-based approach that can very effectively represent probabilistic causal dependencies in a state space. For any set of $N$ $k$-variate discrete random variables with inter-dependencies, the joint probability distribution would need to tabulate a total of $k^N$ possible states. However, in many practical situations with a large set of random variables, many variables are independent of each other or the causality is indirect. Bayesian networks take advantage of this by only requiring the representation of direct causal dependencies.

Visually, a Bayesian network can be represented as a directed graph. Random variables are represented by nodes, while directed edges are used to represent causal dependencies, with the direction of the edge from the causal to the impacted variable. The causal variables for any node are usually referred to as its parents. The probabilistic dependence of a node on its parents is represented via a conditional probability distribution (CPD). So, if a node has $m$ parents, each of which can take $k$ states, a CPD specifying the state probabilities of the child node for each of the $k^m$ combinations of parents’ states would need to be described. For a graph with $N$ nodes, approximately $N k^m$ conditional probability values are needed. However, this value is still much smaller than the $k^N$ values that would need to be stated in the naive case of not considering parents.
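To make the size difference concrete, the two counts can be compared for a small hypothetical network. The values of $N$, $k$, and $m$ below are illustrative only and are not taken from the Ecosystem.

```python
def full_joint_size(n: int, k: int) -> int:
    """Number of states in the full joint table: k^N."""
    return k ** n


def cpd_size(n: int, k: int, m: int) -> int:
    """Approximate number of CPD entries with at most m parents: N * k^m."""
    return n * k ** m


# Illustrative values: 20 Boolean attributes, at most 3 parents each.
N, K, M = 20, 2, 3
print(full_joint_size(N, K))  # 1048576
print(cpd_size(N, K, M))      # 160
```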

Even in the absence of information about the current state of any node in a Bayesian network, prior probability estimates can be calculated in principle for each node via marginalization. In case evidence becomes available that a certain subset of variables has taken a certain value, new probability distributions incorporating this new information can be calculated for the other nodes. However, since a naive marginalization approach is usually computationally prohibitive in practice, more efficient algorithms have been developed. We use a variation of a belief propagation algorithm, the Junction Tree algorithm [3], for the Identity Ecosystem.
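The naive marginalization mentioned above can be illustrated on a tiny network. The sketch below is plain enumeration over all joint states, not the Junction Tree algorithm; the three-node chain A → B → C and every probability in it are hypothetical.

```python
from itertools import product

# Hypothetical chain A -> B -> C of Boolean "exposed" variables.
P_A = 0.1                                  # prior P(A = true)
P_B_GIVEN_A = {True: 0.8, False: 0.1}      # P(B = true | A)
P_C_GIVEN_B = {True: 0.5, False: 0.05}     # P(C = true | B)


def joint(a: bool, b: bool, c: bool) -> float:
    """Joint probability of one full assignment, via the chain rule."""
    pa = P_A if a else 1 - P_A
    pb = P_B_GIVEN_A[a] if b else 1 - P_B_GIVEN_A[a]
    pc = P_C_GIVEN_B[b] if c else 1 - P_C_GIVEN_B[b]
    return pa * pb * pc


def prob(query: dict, evidence: dict = None) -> float:
    """P(query | evidence), summing the joint over all 2^3 states."""
    evidence = evidence or {}
    num = den = 0.0
    for a, b, c in product([True, False], repeat=3):
        state = {"A": a, "B": b, "C": c}
        if any(state[k] != v for k, v in evidence.items()):
            continue  # state contradicts the evidence
        p = joint(a, b, c)
        den += p
        if all(state[k] == v for k, v in query.items()):
            num += p
    return num / den


print(prob({"C": True}))                 # marginal, no evidence
print(prob({"C": True}, {"A": True}))    # posterior once A is known exposed
```

With these numbers the marginal P(C) is 0.1265 and the posterior P(C | A) is 0.41. The loop visits every joint state, so its cost grows as $k^N$, which is why exact enumeration does not scale.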

D. Mapping the Identity Ecosystem to a Bayesian Network

The Identity Ecosystem model consists of a set of nodes $V$ with edges between them $E$, and a list of conditional probability estimates, providing for a set of known cases, the probability of a child node being exposed, given that a particular subset of one or more parent nodes have been exposed. Formally, for a node $A_i$, such a list consists of statements asserting $p(D(A_i) \mid D(R)) = m$, where $m$ is a probability value ($0 \le m \le 1$) and $R \subset PARENT(A_i)$. Note that the set of known cases is not necessarily comprehensive, i.e., there might be subsets of $PARENT(A_i)$ for which $p$ is not given. Another way to provide this list is that for each node $A_i$ with parent set $PARENT(A_i)$, a probability function $f(T_i, A_i) \to [0, 1]$ must be provided, where $T_i \subset \mathcal{P}(PARENT(A_i))$ (i.e., $T_i$ is a subset of the power set of $PARENT(A_i)$).

To complete the Bayesian network model, for each node $A_i$, this list of probability estimates should be converted to a conditional probability distribution (CPD) table. That is, a probability value must be assigned to all possible combinations in which parent attributes can become known. To generate such a complete CPD for a node, we need to construct a function $g(C_i, A_i) \to [0, 1]$, where $C_i = \mathcal{P}(PARENT(A_i))$, the power set of the set of parents of $A_i$.

For a node with $k$ parents in the Identity Ecosystem, $2 \times 2^k = 2^{k+1}$ is the number of values we need to specify for $A_i$. Assigning this probability value is straightforward for members of $C_i$ also present in $T_i$, as these are already known. For a combination $K \in C_i$ not present in $T_i$, the probability that a malicious entity successfully exposes an attribute is 1 minus the probability that it fails to do so after trying every applicable member of a set $X$, such that $X \subset T_i$ and all attributes that are part of any member of $X$ are also members of $K$. We prune this set a bit further, making the assumption that, if an exposure attempt using more information will fail, an exposure attempt using a subset of that information will fail as well. In practice this means that all members of $X$ that are a subset of another member of $X$ are removed from the set. We call this set the input set, writing it as $I(K \in C_i, T_i)$.

Thus formally, for a node $A_i$ such that $C_i$ is the power set of its parents, for any $X \in C_i$, $g(X, A_i) = f(X, A_i)$ if $X \in T_i$, else $g(X, A_i) = 1 - \prod_{Y \in I(X, T_i)} f(Y, A_i)$. This gives us a recursive definition for calculating the conditional probability distribution for a node, given a list of conditional probability estimates.
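The completion step can be sketched as follows. The sketch follows the prose description (one minus the probability that every applicable attempt fails), that is, $1 - \prod_{Y}(1 - f(Y, A_i))$ over the pruned input set; the printed formula shows the product over $f$ itself, so the complement inside the product is an assumption of this sketch, as are the function names and the example estimates. Because the variables are Boolean, only the probability of exposure is stored for each parent combination.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List, Set

Estimates = Dict[FrozenSet[str], float]  # known f(Y, A_i), keyed by parent subset


def input_set(k: FrozenSet[str], known: Estimates) -> List[FrozenSet[str]]:
    """Known subsets contained in k, pruned to those that are not a
    proper subset of another applicable known subset."""
    applicable = [y for y in known if y <= k]
    return [y for y in applicable if not any(y < z for z in applicable)]


def complete_cpd(parents: Set[str], known: Estimates) -> Estimates:
    """Assign P(child exposed | exposed parent subset) for every subset."""
    cpd: Estimates = {}
    for r in range(len(parents) + 1):
        for combo in combinations(sorted(parents), r):
            k = frozenset(combo)
            if k in known:
                cpd[k] = known[k]          # estimate given directly
            else:
                fail = 1.0                 # probability that all attempts fail
                for y in input_set(k, known):
                    fail *= 1.0 - known[y]
                cpd[k] = 1.0 - fail
    return cpd


# Hypothetical estimates for a node with parents a, b, c.
known = {
    frozenset({"a"}): 0.2,
    frozenset({"b"}): 0.5,
    frozenset({"b", "c"}): 0.9,
}
cpd = complete_cpd({"a", "b", "c"}, known)
for subset, p in cpd.items():
    print(sorted(subset), round(p, 4))
```

For the combination {a, b, c}, the applicable known subsets are {a}, {b}, and {b, c}; pruning drops {b} because it is contained in {b, c}, leaving 1 − (0.8 × 0.1) = 0.92.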

E. Using Bayesian Network to Answer Queries

The Ecosystem tool, in its present form, is capable of using Bayesian inference to perform three chief kinds of analysis (as explained in a use case in Section II-E): 1) analyzing the risk of exposure, 2) inferring the most likely source of a breach, and 3) calculating the expected cost of attributes.

1) Analyzing the Risk of Exposure: In this analysis, we are interested in finding out the effect of a breach. In other words, given that a set of attributes $BREACHED = \{A_i \mid A_i \in V \wedge D(A_i) = true\}$ have been exposed/breached, we want to know: a) for any attribute $A_j \notin BREACHED$, the expected/conditional probability $P'(A_j)$ of breach given the evidence that the attributes in $BREACHED$ have been exposed, and b) the expected increase in cost/liability $C$ due to the breach.

We model this query as an inference problem by treating the breach as input evidence. We set the exposure evidence values $D(A_i)$ for the nodes in $BREACHED$ to true, and use Bayesian inference to compute the posterior probabilities for each node in the system. Note that these posterior probabilities are exactly the $P'(A_j)$ values that we need. Now, given the $P'(A_j)$ values, it is easy to compute the percentage increase in the risk and compute an expected increase in cost/liability as $(P'(A_j) - P(A_j)) \times L(A_j)$. Hence, the total cost of the breach can be computed as $C = \sum_j (P'(A_j) - P(A_j)) \times L(A_j)$. Note that $P'(A_i) = 1$ if $A_i \in BREACHED$.
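The cost formula can be evaluated directly once the posteriors are available. In the sketch below the posteriors are given as inputs (in the Ecosystem they would come from Bayesian inference); the attribute names and all numbers are hypothetical.

```python
from typing import Dict


def breach_cost(prior: Dict[str, float],
                posterior: Dict[str, float],
                loss: Dict[str, float]) -> float:
    """C = sum_j (P'(A_j) - P(A_j)) * L(A_j) over all attributes."""
    return sum((posterior[a] - prior[a]) * loss[a] for a in prior)


# Hypothetical scenario: SSN is breached, so its posterior is 1.
prior = {"SSN": 0.05, "bank account": 0.10, "date of birth": 0.30}
posterior = {"SSN": 1.00, "bank account": 0.40, "date of birth": 0.30}
loss = {"SSN": 1000.0, "bank account": 5000.0, "date of birth": 10.0}

print(round(breach_cost(prior, posterior, loss), 2))  # 2450.0
```

Here the SSN contributes 0.95 × 1000 = 950, the bank account 0.30 × 5000 = 1500, and the date of birth nothing because its probability is unchanged.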

2) Inferring the Most Likely Source of a Breach: The next problem we address is that of finding out the source of a breach (or a set of breaches). In other words, given that a set of attributes $BREACHED = \{A_i \mid A_i \in V \wedge D(A_i) = true\}$ have been exposed/breached, we want to know the most probable source of the breach/breaches.

Again, this query can be modeled as an inference problem in a Bayesian Network. We set the evidence of exposure for the nodes in $BREACHED$ to true (i.e., set $D(A_i) = true$) and then compute the posterior probabilities for all nodes in the system using the Junction Tree algorithm [3]. During this calculation, we focus only on the ancestor nodes of the nodes that were breached. An ancestor node of a node $A_i$ is any node $A_j$ such that a directed path leads from $A_j$ to $A_i$, i.e., there is a chain of causality from $A_j$ to $A_i$ via which $A_j$ can be responsible for a breach of $A_i$. In addition, the evidence of breach information ($D(A_i)$) for all the child nodes of the breached nodes is set to zero. This ensures that the updated posterior probability estimates of all nodes follow the causal direction, in case of cycles.

Thus, for each ancestor node $A_j \in ANCESTORS(A_i)$ we compute the posterior probability $P'(A_j)$. If $P'(A_j) - P(A_j) > 0$, then $A_j$ is a possible source of the breach. To discover the most likely original source of a breach, we first select the highest common ancestors among the possible sources and then report the one with the highest increase in the posterior probability.
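The ranking step can be sketched as follows. The sketch collects the ancestors of the breached nodes, keeps those whose posterior exceeds their prior, and orders them by the increase; it takes the posteriors as given inputs and omits the "highest common ancestors" selection, so it is a simplification. The function names, graph, and numbers are hypothetical.

```python
from typing import Dict, Iterable, List, Set, Tuple


def ancestors(node: str, parents: Dict[str, List[str]]) -> Set[str]:
    """All nodes with a directed path to `node` (safe in graphs with cycles)."""
    seen: Set[str] = set()
    stack = list(parents.get(node, ()))
    while stack:
        n = stack.pop()
        if n not in seen:
            seen.add(n)
            stack.extend(parents.get(n, ()))
    return seen


def likely_sources(breached: Iterable[str],
                   parents: Dict[str, List[str]],
                   prior: Dict[str, float],
                   posterior: Dict[str, float]) -> List[Tuple[str, float]]:
    """Ancestors of the breached nodes with P' - P > 0, largest increase first."""
    breached = set(breached)
    candidates: Set[str] = set()
    for b in breached:
        candidates |= ancestors(b, parents)
    candidates -= breached
    gains = {a: posterior[a] - prior[a]
             for a in candidates if posterior[a] > prior[a]}
    return sorted(gains.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical graph: child -> list of parents.
parents = {
    "credit card": ["online shopping account", "wallet"],
    "online shopping account": ["password"],
}
prior = {"online shopping account": 0.10, "wallet": 0.05, "password": 0.20}
posterior = {"online shopping account": 0.50, "wallet": 0.05, "password": 0.35}

ranked = likely_sources(["credit card"], parents, prior, posterior)
print([name for name, _ in ranked])
```

In this example the wallet is excluded because its probability does not increase, and the online shopping account is ranked ahead of the password.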

3) Expected Cost of Attributes: For this analytic, we are interested in finding out the cost/liability of managing an attribute. In other words, we would like to find out how an attribute’s exposure increases the risk of other attributes’ getting exposed, so that an attribute incurs not only its own intrinsic cost, but also some expected costs downstream.

One might be tempted to simply use the same idea as the first analytic where we analyzed the effect of a known breach. But this situation is different. Now we want to analyze all the possible scenarios where a specific attribute $A_i$ is part of a breached set. In other words, there is no unique set to start with. Also note that the overall cost will not be the same with a different breach set. Another aspect of the problem is that, given a specific breach set, it is possible to compute the total expected cost of the breach using the same techniques as before, but it does not say how the increased cost should be apportioned between the sources of the breach. To take into account all these factors, we propose the following solution.

Note that the problem of distributing the increased cost of exposure arises because there are multiple parents for each node and any subset of them could have been in the breach set. So, instead of trying to sample all possible breach sets, we simply reverse the question and ask, given that an attribute $A_j$ is exposed, what are the expected sources and how much is each of them responsible. Let $ANCESTORS(A_j)$ be the set of ancestors of $A_j$. We compute the posterior probability $P'(A_k)$ of exposure for each attribute $A_k \in ANCESTORS(A_j)$, after setting the belief that $A_j$ is exposed. Note that since $A_j$ has its own intrinsic risk $P(A_j)$ of being exposed and an intrinsic cost $L(A_j)$, the liability attributed to all the ancestors would be $L_j = (1 - P(A_j)) \times L(A_j)$. Now we define the liability attributed to an ancestor $A_k$ for causing the breach of $A_j$ as $L_{kj} = (P'(A_k) - P(A_k)) \times L_j / S$, where $S = \sum_{\{l \mid A_l \in ANCESTORS(A_j)\}} (P'(A_l) - P(A_l))$. Finally, we define the total cost/liability of an attribute $A_i$ as $C = P(A_i) \times L(A_i) + \sum_{j \ne i} L_{ij}$.
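The apportioning of $L_j$ among ancestors and the resulting total cost can be sketched as below. Posteriors are supplied as inputs rather than inferred, and the function names, attribute names, and numbers are hypothetical.

```python
from typing import Dict, Iterable


def attributed_liability(target: str,
                         ancestor_names: Iterable[str],
                         prior: Dict[str, float],
                         posterior_given_target: Dict[str, float],
                         loss: Dict[str, float]) -> Dict[str, float]:
    """L_kj for each ancestor A_k of target A_j:
    L_j = (1 - P(A_j)) * L(A_j), split in proportion to P'(A_k) - P(A_k)."""
    names = list(ancestor_names)
    l_j = (1.0 - prior[target]) * loss[target]
    gains = {k: posterior_given_target[k] - prior[k] for k in names}
    s = sum(gains.values())
    if s == 0:
        return {k: 0.0 for k in names}   # nothing to apportion
    return {k: gains[k] * l_j / s for k in names}


def total_cost(attr: str,
               prior: Dict[str, float],
               loss: Dict[str, float],
               shares_by_target: Dict[str, Dict[str, float]]) -> float:
    """C = P(A_i) * L(A_i) + sum over j != i of L_ij."""
    own = prior[attr] * loss[attr]
    downstream = sum(shares.get(attr, 0.0)
                     for target, shares in shares_by_target.items()
                     if target != attr)
    return own + downstream


# Hypothetical numbers: two ancestors of a Social Security Card node.
prior = {"Social Security Card": 0.2, "birth certificate": 0.1,
         "driver's license": 0.2}
loss = {"Social Security Card": 1000.0, "birth certificate": 50.0,
        "driver's license": 100.0}
posterior = {"birth certificate": 0.4, "driver's license": 0.3}

shares = attributed_liability("Social Security Card",
                              ["birth certificate", "driver's license"],
                              prior, posterior, loss)
print({k: round(v, 2) for k, v in shares.items()})
print(round(total_cost("birth certificate", prior, loss,
                       {"Social Security Card": shares}), 2))
```

Here $L_j$ = 0.8 × 1000 = 800 and $S$ = 0.3 + 0.1 = 0.4, so the birth certificate is attributed 600 and the driver's license 200; the birth certificate's total cost is 0.1 × 50 + 600 = 605.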

IV. DATA SOURCES FOR ECOSYSTEM

The Identity Ecosystem provides a framework to investigate identity. However, it needs an external source of information to populate PII attributes and their relationships using real world data. To obtain a list of important PII attributes, their properties, e.g., initial risk of exposure, and their relationships, e.g., the probabilistic relationship, we used two sources of information separately. We initially asked the faculty, researchers, and students at the Center for Identity at the University of Texas at Austin to manually list PII attributes and relationships utilizing their expert knowledge in the field. However, to obtain more accurate data, we moved to utilizing the Identity Threat Assessment and Prediction (ITAP) project at the Center for Identity.

A. ITAP

ITAP is a risk assessment tool that increases fundamental understanding of identity thieves’ and fraudsters’ processes and patterns. ITAP aggregates data on identity theft from multiple sources (e.g., law enforcement, fraud cases, and news stories) to model and analyze identity vulnerabilities, the value of identity attributes, and their risk of exposure. At the Center for Identity, a team of modelers carefully analyzes identity theft and fraud news stories on a daily basis and models this information using the ITAP schema. For each case of identity theft and/or fraud, ITAP collects and analyzes tools used by criminals, types of information exploited, demographics of victims, etc. In doing so, ITAP captures and analyzes a structured computational model of identity and fraud processes and outcomes [4], [5]. ITAP currently models over 5,000 news stories that report on specific identity theft and fraud cases. ITAP readily provides the initial risk of exposure $P(A_i)$ and the intrinsic loss value $L(A_i)$ [6].

V. ECOSYSTEM IMPLEMENTATION

In this section, we review our current implementation of the Identity Ecosystem.

A. Graphical User Interface

The Ecosystem GUI (see Figure 1 as an example) consists of three panels: the main panel, the top panel, and the left panel. The main panel renders the graph of attribute nodes and edges. The left panel includes Filters, Controls, and Color/Size Options sub-panels.

• The Filters sub-panel provides the capability to filter Ecosystem nodes and edges and it has three tabs: Graph Options, Node Options, and Edge Options.

– Under Graph Options, the user can choose to simply display all the nodes, display a subtree of a selected depth rooted at a selected node (Figure 3), or find the shortest path between two nodes.

Fig. 3. The Rooted Tree of Depth 2 Rooted at the Birth Certificate Attribute.

Fig. 4. Details of the Driving License Attribute.

– Under Node Options, the user can choose to show only the nodes that have a given value for a given property, e.g., nodes with the essential value for the possession property.

– Under Edge Options, the user can choose to display no edges, all edges, or only some types of edges, e.g., the determines and probabilistically determines edges only.

• The Controls sub-panel is dedicated to general controls of the GUI such as refreshing the display and showing the details of a selected node. Such details include the value of the node properties for that attribute, the risk chart over age, and the value over time (Figure 4).

• The Color/Size Options sub-panel allows the user to color or size the nodes based on any of the node properties, hide/show node labels (i.e., attribute names), and view the index/scale of Edge Colors, Node Colors, and Node Sizes.
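The two Graph Options queries above (a rooted subtree of a selected depth and the shortest path between two nodes) can both be answered with breadth-first search over an adjacency list. The following is a minimal sketch under stated assumptions, not the tool's actual code: the class name AttributeGraphSketch, its method names, and any edges added to it are hypothetical, and edges are treated as directed and unweighted.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.LinkedList;
import java.util.List;
import java.util.Map;

/** Hypothetical adjacency-list graph of attribute names (illustration only). */
final class AttributeGraphSketch {
    private final Map<String, List<String>> adjacency = new LinkedHashMap<>();

    /** Returns the outgoing-neighbor list of a node, creating it on first use. */
    private List<String> neighbors(String node) {
        List<String> out = adjacency.get(node);
        if (out == null) {
            out = new ArrayList<>();
            adjacency.put(node, out);
        }
        return out;
    }

    /** Adds a directed edge and registers both endpoints as nodes. */
    void addEdge(String from, String to) {
        neighbors(from).add(to);
        neighbors(to);
    }

    /** Nodes reachable from root in at most maxDepth hops, mapped to their depth. */
    Map<String, Integer> subtree(String root, int maxDepth) {
        Map<String, Integer> depth = new LinkedHashMap<>();
        Deque<String> queue = new ArrayDeque<>();
        depth.put(root, 0);
        queue.add(root);
        while (!queue.isEmpty()) {
            String node = queue.remove();
            int d = depth.get(node);
            if (d == maxDepth) {
                continue; // do not expand nodes that already sit at the depth limit
            }
            for (String next : neighbors(node)) {
                if (!depth.containsKey(next)) {
                    depth.put(next, d + 1);
                    queue.add(next);
                }
            }
        }
        return depth;
    }

    /** Path with the fewest edges from source to target, or an empty list if none exists. */
    List<String> shortestPath(String source, String target) {
        Map<String, String> parent = new HashMap<>();
        Deque<String> queue = new ArrayDeque<>();
        parent.put(source, null); // the source has no predecessor
        queue.add(source);
        while (!queue.isEmpty()) {
            String node = queue.remove();
            if (node.equals(target)) {
                LinkedList<String> path = new LinkedList<>();
                for (String at = target; at != null; at = parent.get(at)) {
                    path.addFirst(at); // walk predecessors back to the source
                }
                return path;
            }
            for (String next : neighbors(node)) {
                if (!parent.containsKey(next)) {
                    parent.put(next, node);
                    queue.add(next);
                }
            }
        }
        return Collections.emptyList();
    }
}
```

For example, after adding invented edges from Birth Certificate to Driving License and from Driving License to Bank Account, subtree("Birth Certificate", 1) would contain only the root and Driving License, while shortestPath("Birth Certificate", "Bank Account") would return the three nodes in order.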

In addition, the GUI provides a top panel which includes three types of questions that the Ecosystem can answer using Bayesian inference, Specialization Charts, and Specialization Options.

Fig. 5. Asking Queries: Infer the Probability of Breach.

• The three types of questions the Ecosystem can answer are the following. (See Section II-E for a sample use case.)

– Infer probability of breach based on evidence: Figure 5 shows how the probability of breach for other attributes changes once the SSN and Social Security Card attributes have been breached. Multiple attributes can be selected as evidence at the same time. The figure also shows the potential loss after such a breach scenario.

– Detect most probable origin of a breach: Figure 6 demonstrates a scenario in which the credit card number was breached. When we ask for the most probable origin of the breach, the Medicare and Medicaid cards as well as the health insurance card are, surprisingly, the most probable initial points of entry. The bar chart of expected losses per attribute after such a breach is shown too.

– Find breach hot-spots: Continuing with the breach scenario of the credit card number, we now ask about the attributes that, if breached now, would incur the most cost, i.e., breach hot-spots. Figure 7 shows that the most valuable piece of PII after the exposure of the credit card number is the debit/credit card, i.e., credit card information other than its number.

• Specialization Charts render various types of charts using the identity information provided to the Ecosystem. Examples are risk per age, time value of attributes, age risk chart, gender risk chart, education level risk charts, etc.

• Specialization Options allow the Bayesian network to focus on the data provided about a given specialization, i.e., a sub-group of PII owners. It can focus on different age ranges, genders, education levels, professions, income groups, locations, and citizenships. Once a specialization is selected, all future Bayesian inference calculations are performed for the specific selection of PII owners.
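To make the first two query types concrete, the following is a minimal sketch of Bayesian inference by exhaustive enumeration on a toy three-node network. It is an illustration under stated assumptions rather than the Ecosystem's inference code: the class BreachInferenceSketch, the network structure (SSN and insurance card as parents of the credit card), and every probability in it are invented and do not come from ITAP. Enumeration is used only because it is easy to verify on a tiny network; it does not scale to a graph of realistic size.

```java
/**
 * Toy network SSN -> Credit Card <- Insurance Card with binary "breached" states.
 * Every probability below is invented for illustration; none comes from ITAP.
 */
final class BreachInferenceSketch {
    static final int SSN = 0, INSURANCE_CARD = 1, CREDIT_CARD = 2;
    static final int UNOBSERVED = -1;

    // Hypothetical priors P(A_i) for the two root attributes.
    static final double P_SSN = 0.10;
    static final double P_INSURANCE = 0.20;
    // Hypothetical table P(credit card breached | ssn, insurance), indexed [ssn][insurance].
    static final double[][] P_CARD = { { 0.05, 0.40 }, { 0.60, 0.80 } };

    /** Joint probability of one full assignment (1 = breached, 0 = not breached). */
    static double joint(int[] v) {
        double pSsn = v[SSN] == 1 ? P_SSN : 1 - P_SSN;
        double pIns = v[INSURANCE_CARD] == 1 ? P_INSURANCE : 1 - P_INSURANCE;
        double pCardBreached = P_CARD[v[SSN]][v[INSURANCE_CARD]];
        double pCard = v[CREDIT_CARD] == 1 ? pCardBreached : 1 - pCardBreached;
        return pSsn * pIns * pCard;
    }

    /**
     * P(query = breached | evidence), summing the joint over all 2^3 assignments.
     * evidence[i] is 0 or 1 for an observed attribute and UNOBSERVED otherwise.
     */
    static double posterior(int query, int[] evidence) {
        double matching = 0.0;
        double total = 0.0;
        for (int state = 0; state < 8; state++) {
            int[] v = { state & 1, (state >> 1) & 1, (state >> 2) & 1 };
            boolean consistent = true;
            for (int i = 0; i < v.length; i++) {
                if (evidence[i] != UNOBSERVED && evidence[i] != v[i]) {
                    consistent = false;
                }
            }
            if (!consistent) {
                continue; // skip assignments that contradict the evidence
            }
            double p = joint(v);
            total += p;
            if (v[query] == 1) {
                matching += p;
            }
        }
        return matching / total;
    }

    /** Unobserved attribute with the highest posterior breach probability. */
    static int mostProbableOrigin(int[] evidence) {
        int best = -1;
        double bestP = -1.0;
        for (int i = 0; i < evidence.length; i++) {
            if (evidence[i] != UNOBSERVED) {
                continue;
            }
            double p = posterior(i, evidence);
            if (p > bestP) {
                bestP = p;
                best = i;
            }
        }
        return best;
    }
}
```

With these made-up numbers, observing a breached SSN raises the credit card's breach probability from about 0.17 to 0.64, and observing a breached credit card points to the insurance card (posterior about 0.51) rather than the SSN (about 0.37) as the more probable origin.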

Fig. 6. Asking Queries: Detect the Most Probable Origin of a Breach.

Fig. 7. Asking Queries: Hot-spots.

B. Implementation Details

The Identity Ecosystem tool was implemented in Java 1.7 and is provided as an executable jar file. The project is divided into the following packages:

• id.ecosys.core: Implements the basic graph structure as an adjacency list.

• id.ecosys.bayes: Implements the Bayesian inference.

• id.ecosys.io: Implements the loading of the model and storing the updates.

• id.ecosys.dataFile: Includes input data needed for the Ecosystem, directly and automatically generated by ITAP.

• id.ecosys.tests: Provides tests to validate the core implementation.

• id.ecosys.util: Contains utility functions.

• id.ecosys.view: Contains the Ecosystem GUI.

• id.ecosys.visualize: Contains the entry point to the program execution and visualization.

We used Jung version 2.0.1 [7], Jung 3D version 2.0.1 [8], and Java 3D version 1.5.2 [9] for the 3D visualization. We used JFreeChart version 1.0.17 [10] for drawing and rendering the charts.


VI. CONCLUSION

This paper presents a graph-based model to represent the relationships between various personally identifiable information attributes. We mapped this model, the Identity Ecosystem, to a Bayesian network. Provided with actual input data from over 5,000 identity theft and fraud cases for the first degree probabilities of the Bayesian network, the model enables sophisticated inference that can answer many interesting questions in the identity space. In this paper, we focused on three questions: 1) given the evidence of the breach of a subset of identity attributes, what is the impact on the exposure risk of other attributes, 2) given that an attribute has been exposed, what was the most likely source of the exposure, and 3) what is the total cost, including secondary costs, of exposure of a node, and based on this cost, what are the hot-spots in the Identity Ecosystem, i.e., nodes that are both vulnerable to an attack and carry large costs in case of an exposure.

In this paper, we did not focus on the analytics that the Identity Ecosystem provides when using the ITAP data. Instead, we introduced a novel method of modeling the Identity Ecosystem and utilizing it to answer interesting questions. We see answering specific questions, and investigating the Ecosystem’s answers to them using the ITAP data, as a promising avenue for future work.

ACKNOWLEDGMENT

The authors would like to thank Liang Zhu, Shayani Deb, and Muhammad Zubair Malik for their work on the current implementation of the Ecosystem, and James Zaiss for proofreading.

REFERENCES

[1] E. Harrell, Bureau of Justice Statistics, US Dept of Justice, and Office of Justice Programs, “Victims of identity theft, 2014,” 2014.

[2] Federal Trade Commission et al., “Consumer sentinel network data book,” 2014.

[3] C. M. Bishop, “Pattern recognition,” Machine Learning, vol. 128, 2006.

[4] Y. Yang, “Mining of identity theft stories to model and assess identity threat behaviors,” Master’s thesis, The University of Texas at Austin, 2014.

[5] Y. Yang, M. Manoharan, and K. S. Barber, “Modelling and analysis of identity threat behaviors through text mining of identity theft stories,” in IEEE Joint Intelligence and Security Informatics Conference (JISIC), 2014, pp. 184–191.

[6] R. Nokhbeh Zaeem, M. Manoharan, and K. S. Barber, “Risk kit: Highlighting vulnerable identity assets for specific age groups,” European Intelligence and Security Informatics Conference (EISIC), 2016, to appear.

[7] “Jung.” [Online]. Available: http://jung.sourceforge.net

[8] “Jung 3d.” [Online]. Available: http://jung.sourceforge.net/site/jung-3d

[9] “Java 3d.” [Online]. Available: https://java.net/projects/java3d

[10] “Jfreechart.” [Online]. Available: http://www.jfree.org/jfreechart


© 2016 Proprietary, The University of Texas at Austin, All Rights Reserved.

For more information on Center for Identity research, resources and information, visit identity.utexas.edu.
