© Starrett Consulting, Inc.

Data Science, Investigations and Privacy
Current Status, Challenges and Solutions

© Starrett Consulting, Inc.

Agenda

• Global Privacy

• Big Data and Data Science Introduction – What are they?

• Data Science and Investigations – what’s the problem?

• Data-Science Investigative Tools – the solution!

• A Unique Challenge – Automation vs. Human Decisions

(Note: examples, algorithms and technologies presented may be summarized or abbreviated for efficient presentation and to accommodate communication to a lay audience)

2


Data Privacy

Data Privacy
• Right of individuals to keep their personal data from being misused or disclosed

Personal Data
• Information that identifies or relates to an identifiable individual

Sensitive Personal Data - Examples

• Personal: Digital signatures, biometric data, fingerprints, passwords

• Demographic: Birth date, marital status, race/ethnicity, health info

• Financial: Credit card info, bank account info, earnings

• Government-Issued: Social Security Number, ID #, Tax ID #, driver’s license #, passport #

4

© Starrett Consulting, Inc.

Data Protection Laws of the World

5

Source: DLA PIPER -- https://www.dlapiperdataprotection.com/

© Starrett Consulting, Inc.

Privacy Frameworks & Principles

General Principles

• Governance
• Notice
• Choice & Consent
• Collection
• Use
• Access
• Quality & Accuracy
• Retention & Disposal

6

Privacy Frameworks

• Fair Information Practice Principles (FIPPs), 1973

• OECD Privacy Framework, 1980, (updated 2013)

• APEC Privacy Framework, 2005

• Generally Accepted Privacy Principles (GAPP), 2009

• Final FTC Privacy Framework, 2012

• General Data Protection Regulation (GDPR), 2016

• Other…


EU General Data Protection Regulation (GDPR)

• New 2016 EU data protection law
  • Replaces EU Data Protection Directive 95/46/EC
  • Applies to organizations that process the personal data of individuals in the EU, wherever the organization is located
  • Enforceable from May 25, 2018

• New rights for data subjects
  • Right to data portability
  • Right to erasure

• Key operational requirements
  • Data Protection Officer (DPO)
  • Breach notification
  • Privacy Impact Assessment (PIA)
  • Data Subject consent
  • Cross-border data transfers

© Starrett Consulting, Inc.

Agenda

• Global Privacy

• Big Data and Data Science Introduction – What are they?

• Data Science and Investigations – what’s the problem?

• Data-Science Investigative Tools – the solution!

• A Unique Challenge – Automation vs. Human Decisions

8

© Starrett Consulting, Inc.

Big Data and Data Science Introduction
The Need

• Use of data science is the single most important competitive differentiator for enterprises generally.

• Legal touches every aspect of business and life.

90% or more of the information we need and use is in electronic form.

9


Big Data and Data Science Introduction
What are “Big Data” and Data Science?

• Data that is “out-of-hand” – too voluminous, complex or fast-moving for conventional methods to handle.

• Focus is on solutions found in data science:

“Data science is an interdisciplinary field about scientific methods to extract knowledge from data. It involves subjects in mathematics, statistics, information science, and computer science.”

• Thus data science is contextual.

10

© Starrett Consulting, Inc.

Big Data and Data Science Introduction
Realities

• Many technical verticals.

• Domain (legal) professional must be present to make decision on application of data-science vertical(s).

• Forensic in nature – PhD needed?

• Insight and leads generated require follow up and corroboration.

11

© Starrett Consulting, Inc.

Big Data and Data Science Introduction
Predictive Analytics Learning Progression

Math for Modelers

Statistics

Regression and Multivariate Analysis

Generalized Linear Models

Machine Learning

Advanced Topics / Horizontal Areas

12


Big Data and Data Science Introduction
The Data Science in Investigations “Golden Rule”

“Until proven otherwise, data science as used in investigations is a service, not a product!”

(It’s not the car, it’s the driver!)

[Diagram labels: Domain; Data Science]

13

© Starrett Consulting, Inc.

Big Data and Data Science Introduction
Types of Data and Analysis

• Quantitative (numbers)
• Qualitative (qualities, categories)
• Structured (e.g. columns and rows, log files)
• Unstructured (e.g. free-form text, NoSQL database)

14

© Starrett Consulting, Inc.

Big Data and Data Science Introduction
Data Science Bigger Picture

[Diagram labels: Quantitative Approaches; Structured; Unstructured; Numeric Aspects of Qualitative Features; Qualitative]

15


Big Data and Data Science Introduction
Where is identification of personal / sensitive data most challenging?

Structured Data
• Metadata
• Spreadsheets
• Relational Databases
• Key / Value Pairs
• (Et cetera)

Unstructured Data (FOCUS WILL BE HERE!)
• Free-form Text
• Natural Language
• NoSQL Database
• (Et cetera)

Identifying personal and sensitive data in structured data is much easier. So...

16

© Starrett Consulting, Inc.

Agenda

• Global Privacy

• Big Data and Data Science Introduction – What are they?

• Data Science and Investigations – what’s the problem?

• Data-Science Investigative Tools – the solution!

• A Unique Challenge – Automation vs. Human Decisions

17

© Starrett Consulting, Inc.

Data Science and Investigations
Unsupervised Learning

• Exploratory / Investigative.

• Clustering, for example:
  • K-means
  • Hierarchical
  • Document / Text

• Correlation.

18


Data Science and Investigations
Unsupervised Learning - Document Clustering

• Topic 1
  • Sub-Topic 1
  • Sub-Topic 2
  • Sub-Topic 3
• Topic 2
  • Sub-Topic 1
  • Sub-Topic 2

Clusters determined by common words, phrases, concepts, etc. found in docs
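As a rough illustration of grouping documents by shared words, here is a minimal sketch. It assumes a simple greedy grouping on word overlap (Jaccard similarity) rather than K-means or hierarchical clustering; the sample documents, the `cluster` function and the 0.3 threshold are all hypothetical.

```python
def tokenize(text):
    """Lower-case a document and return its set of distinct words."""
    return set(text.lower().split())


def jaccard(a, b):
    """Share of words two documents have in common (0.0 to 1.0)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster(docs, threshold=0.3):
    """Greedy grouping: a doc joins the first cluster holding a similar doc."""
    token_sets = [tokenize(d) for d in docs]
    clusters = []  # each cluster is a list of document indices
    for i, tokens in enumerate(token_sets):
        for members in clusters:
            if any(jaccard(tokens, token_sets[j]) >= threshold for j in members):
                members.append(i)
                break
        else:
            clusters.append([i])  # no similar cluster found, start a new one
    return clusters


# Hypothetical documents, for illustration only.
docs = [
    "wire transfer to offshore account",
    "offshore account wire instructions",
    "board meeting agenda for march",
    "agenda for the march board meeting",
]
clusters = cluster(docs)
print(clusters)
```

Production tools typically weight words (for example with TF-IDF) and use more robust algorithms, but the idea of grouping by common vocabulary is the same.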

19


© Starrett Consulting, Inc.

Data Science and Investigations: Unsupervised and Supervised Learning
Regression (predictive analytics - numeric)

[Figure: Salesperson Performance Records (6 months). Scatter plot with axes “Sales Calls per Week” and “Sales per Month (Millions)”, showing a regression line (“formula”) used in predictive analytics (compare - correlation) and two points marked as outliers.]
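A regression line and its outliers can be sketched in a few lines. The example below assumes ordinary least squares on made-up numbers (not the figures from the slide) and treats the point with the largest absolute residual as the outlier candidate; `fit_line` is a hypothetical helper name.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: sales calls per week vs. sales per month (millions).
calls = [1, 2, 3, 4, 5, 6, 7]
sales = [0.5, 1.0, 1.5, 5.0, 2.5, 3.0, 3.5]  # the fourth point is off-trend

slope, intercept = fit_line(calls, sales)

# Residual = actual value minus the value the line predicts.
residuals = [y - (slope * x + intercept) for x, y in zip(calls, sales)]

# The point furthest from the line is the first one to look at.
outlier_index = max(range(len(calls)), key=lambda i: abs(residuals[i]))
print(f"slope={slope:.3f} intercept={intercept:.3f} outlier at x={calls[outlier_index]}")
```

In an investigation, an outlier is a lead to follow up and corroborate, not a conclusion.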

21


Data Science and Investigations: Supervised Learning
Classification (predictive analytics - categorical)

“Training” Data

Skull shape | Eyebrows | Leg length | Hair length | Tail | Number of ears | Type
Pointed     | Yes      | 3 inches   | Short       | Yes  | 2              | Dog
Round       | Yes      | 2 feet     | Short       | Yes  | 2              | Dog
Triangle    | No       | 5 inches   | Medium      | Yes  | 2              | Cat
Triangle    | No       | 5 inches   | Long        | Yes  | 2              | Cat
Round       | Yes      | 1 foot     | Long        | Yes  | 2              | Dog
Triangle    | Unk      | 4 inches   | Short       | Yes  | 2              | Cat
Triangle    | No       | 5 inches   | Short       | Yes  | 2              | Cat

22

© Starrett Consulting, Inc.

Data Science and Investigations: Supervised Learning
Classification (predictive analytics - categorical)

Skull shape | Eyebrows | Leg length | Hair length | Tail | Number of ears | Predict? | Type
Pointed     | Yes      | 3 inches   | Short       | Yes  | 2              | ?        | Dog
Round       | Yes      | 2 feet     | Short       | Yes  | 2              | ?        | Dog
Triangle    | No       | 5 inches   | Medium      | Yes  | 2              | ?        | Cat
Triangle    | No       | 5 inches   | Long        | Yes  | 2              | ?        | Cat
Round       | Yes      | 1 foot     | Long        | Yes  | 2              | ?        | Dog
Triangle    | Unk      | 4 inches   | Short       | Yes  | 2              | ?        | Cat
Triangle    | No       | 5 inches   | Short       | Yes  | 2              | ?        | Cat

23

© Starrett Consulting, Inc.

Data Science and Investigations: Supervised Learning
Classification (predictive analytics - categorical)

Predictive Accuracy

Skull shape | Eyebrows | Leg length | Hair length | Tail | Number of ears | Predict? | Type
Pointed     | Yes      | 3 inches   | Short       | Yes  | 2              | Cat      | Dog
Round       | Yes      | 2 feet     | Short       | Yes  | 2              | Dog      | Dog
Triangle    | No       | 5 inches   | Medium      | Yes  | 2              | Cat      | Cat
Triangle    | No       | 5 inches   | Long        | Yes  | 2              | Cat      | Cat
Round       | Yes      | 1 foot     | Long        | Yes  | 2              | Dog      | Dog
Triangle    | Unk      | 4 inches   | Short       | Yes  | 2              | Dog      | Cat
Triangle    | No       | 5 inches   | Short       | Yes  | 2              | Cat      | Cat
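Predictive accuracy for this example is the share of rows where the “Predict?” value matches the known “Type”. A short sketch using the seven rows from the slide (the variable names are illustrative):

```python
# Values from the "Predict?" and "Type" columns, in row order.
predicted = ["Cat", "Dog", "Cat", "Cat", "Dog", "Dog", "Cat"]
actual = ["Dog", "Dog", "Cat", "Cat", "Dog", "Cat", "Cat"]

correct = sum(p == a for p, a in zip(predicted, actual))
accuracy = correct / len(actual)

# 1-based row numbers where the prediction disagrees with the known type.
misclassified = [
    row for row, (p, a) in enumerate(zip(predicted, actual), start=1) if p != a
]

print(f"{correct} of {len(actual)} correct, accuracy = {accuracy:.1%}")
print("Misclassified rows:", misclassified)
```

Five of the seven predictions match, so accuracy is about 71%; rows 1 and 6 are the misses.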

24


Data Science and Investigations: Supervised Learning (Classification)
Example – Electronic Discovery and Predictive Coding

Identification -> Processing -> Review -> Production

25

© Starrett Consulting, Inc.

Data Science and Investigations: Supervised Learning (Classification)
Example – Electronic Discovery and Predictive Coding

Document Population Requiring Review for Relevancy to Lawsuit (e.g. 1 million)

Sample (e.g. 50k)

Patterns found in sample training data are used to classify documents in population as relevant and non-relevant.

26

© Starrett Consulting, Inc.

Data Science and Investigations: Supervised Learning (Classification)
Example – Electronic Discovery and Predictive Coding

“Training” Data (50k total documents in sample)

Columns: Date | Metadata | Document Text | Relevant

Example values in the Relevant column: Yes, No, Yes, Yes, No, No, Yes

27


Data Science and Investigations: Supervised Learning (Classification)
Example – Electronic Discovery and Predictive Coding

1 million files -> Predictive Model (Classifier) -> Relevant / Non-relevant
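The train-on-a-sample, classify-the-population idea can be sketched with a tiny text classifier. This is only an illustration: it assumes a multinomial Naive Bayes model with add-one smoothing (commercial predictive-coding tools may use different algorithms), and the documents and labels are invented.

```python
import math
from collections import Counter, defaultdict


def train(labeled_docs):
    """Count words per label from (text, label) pairs tagged by reviewers."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    vocab = set()
    for text, label in labeled_docs:
        words = text.lower().split()
        word_counts[label].update(words)
        label_counts[label] += 1
        vocab.update(words)
    return word_counts, label_counts, vocab


def classify(model, text):
    """Return the label with the highest Naive Bayes log-probability."""
    word_counts, label_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in label_counts:
        score = math.log(label_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocab:
                continue  # ignore words never seen in training
            # Add-one smoothing avoids zero probabilities.
            score += math.log(
                (word_counts[label][word] + 1) / (total_words + len(vocab))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical reviewed sample (stands in for the 50k tagged documents).
sample = [
    ("merger pricing agreement draft", "relevant"),
    ("pricing agreement with competitor", "relevant"),
    ("lunch menu for friday", "non-relevant"),
    ("friday parking schedule", "non-relevant"),
]
model = train(sample)

# Hypothetical unreviewed population (stands in for the 1 million files).
population = ["revised pricing agreement", "parking menu update"]
labels = [classify(model, doc) for doc in population]
print(list(zip(population, labels)))
```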

28

© Starrett Consulting, Inc.

Data Science and Investigations: Supervised Learning (Classification)
Example – Clustering and Classification

Document Population Requiring Review for Relevancy to Lawsuit: 1 million

Cluster all 1 million documents -> 100’s of Clusters

Two clusters look interesting: CEO and CFO emails / SM

29

© Starrett Consulting, Inc.

Data Science and Investigations
Example – Clustering and Classification

• CEO Emails: Board, Vendors, Husband
• CFO Social Media: CEO’s Husband, Bank

Assume 3000 emails and social media messages total

30


Data Science and Investigations
Example – Clustering and Classification

Use CLUSTERS to help classify DOCs related to LEGAL ISSUES (compare ediscovery relevancy review where docs were randomly selected and manually tagged by attorneys)

Clusters:
• CEO Emails to Board
• CEO Emails to Husband
• CEO Emails to Vendors
• CFO SM – CEO Husband
• CFO SM – Bank

Legal issues:
• Conspiracy
• Money Laundering
• Fraud
• Contract Breach

31

© Starrett Consulting, Inc.

Data Science and Investigations
Example – Clustering and Classification

Training data (3000 total documents from clusters)

Columns: Date | Metadata | Document Text | Data Type

Example values in the Data Type column: Fraud, Conspiracy, Money Laundering, Contract Breach, Fraud, Conspiracy, Contract Breach, Fraud, Money Laundering, Conspiracy, Money Laundering, Conspiracy, Contract Breach

32

© Starrett Consulting, Inc.

Data Science and Investigations
Example – Clustering and Classification

1 million files -> Predictive Model (Classifier) ->
• Contract Breach
• Money Laundering
• Conspiracy
• Fraud
• None of the above

33


Data Science and Investigations – Information Retrieval
Diverse Data into NoSQL Database (Text Repository) to Search Engine

Original Files (examples of fields):
• Date, Author, Title, Last edit, Body
• Part Name, Part No., Price, Dept.
• To, From, CC, Sent, Subject, Body
• Date, URL, Title, Web Page Text
• Etc.

Original Files -> Text Repository (NoSQL Database; format: e.g. JSON, XML) -> Search Engine

Compare SQL, which has a fixed, “structured” schema, vs. a diverse, schema-less, “unstructured” NoSQL database (“document database”)
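To make the schema-less idea concrete, the sketch below stores four differently shaped records as JSON-style documents and searches across all of them. The field names follow the slide; every value, the `type` field and the `search` helper are hypothetical, and a plain Python list stands in for a real NoSQL database and search engine.

```python
import json

# Each record keeps its own fields; no shared, fixed schema is required.
repository = [
    {"type": "report", "date": "2017-01-05", "author": "A. Analyst",
     "title": "Q4 summary", "last_edit": "2017-01-09",
     "body": "Vendor payments rose sharply."},
    {"type": "part", "part_name": "Widget", "part_no": "W-100",
     "price": 4.25, "dept": "Assembly"},
    {"type": "email", "to": "cfo@example.com", "from": "ceo@example.com",
     "cc": "", "sent": "2017-02-01", "subject": "Vendor invoice",
     "body": "Please approve the payment."},
    {"type": "web_page", "date": "2017-03-12",
     "url": "https://example.com/news", "title": "Press release",
     "text": "New vendor partnership announced."},
]


def search(records, term):
    """Return the type of every record with the term in any field value."""
    term = term.lower()
    return [
        record["type"]
        for record in records
        if any(term in str(value).lower() for value in record.values())
    ]


serialized = json.dumps(repository, indent=2)  # the JSON text a repository would hold
hits = search(repository, "vendor")
print(hits)
```

A real search engine would build an index instead of scanning every record on each query.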

34

© Starrett Consulting, Inc.

Data Science and Investigations – Information Retrieval
Search – Index Search

35

© Starrett Consulting, Inc.

Data Science and Investigations – Information Retrieval
Information Retrieval / Relevancy Ranking

Sparse Term-Matrix to Inverted Index

Doc   | apple | dog | cat | blue | simple | ran | time
Doc 1 | 0     | 1   | 0   | 0    | 1      | 0   | 0
Doc 2 | 0     | 0   | 1   | 0    | 0      | 0   | 0
Doc 3 | 1     | 0   | 0   | 1    | 0      | 0   | 1
Doc 4 | 0     | 0   | 0   | 0    | 0      | 1   | 0
Doc 5 | 0     | 1   | 0   | 0    | 1      | 0   | 0
Doc 6 | 1     | 0   | 0   | 0    | 0      | 0   | 1

Word   | Doc
apple  | 3, 6
dog    | 1, 5
cat    | 2
blue   | 3
simple | 1, 5
ran    | 4
time   | 3, 6

Common words (e.g. ‘to’, ‘the’, ‘a’) and punctuation marks are often removed here. Other conversions, such as converting all characters to lower case and reducing words to their root forms, are also common. Compound words (phrases) and other additions can be included as well.
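The conversion from term matrix to inverted index can be sketched directly from the slide's example data. The variable names are illustrative, and the final lookup shows why the inverted form is convenient for search.

```python
terms = ["apple", "dog", "cat", "blue", "simple", "ran", "time"]

# Rows of the sparse term matrix: 1 means the term occurs in the document.
matrix = {
    1: [0, 1, 0, 0, 1, 0, 0],
    2: [0, 0, 1, 0, 0, 0, 0],
    3: [1, 0, 0, 1, 0, 0, 1],
    4: [0, 0, 0, 0, 0, 1, 0],
    5: [0, 1, 0, 0, 1, 0, 0],
    6: [1, 0, 0, 0, 0, 0, 1],
}

# Invert: for each term, list the documents that contain it.
inverted = {term: [] for term in terms}
for doc_id in sorted(matrix):
    for term, present in zip(terms, matrix[doc_id]):
        if present:
            inverted[term].append(doc_id)

# A two-word query becomes a set intersection of two short lists.
both = sorted(set(inverted["apple"]) & set(inverted["time"]))

for term in terms:
    print(term, inverted[term])
print("apple AND time:", both)
```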

36


Data Science and Investigations
Information Retrieval / Relevancy Ranking - TF-IDF

Term Frequency

• The more often a word appears in a document, the more important that word is to the document.

Inverse Document Frequency

• Terms that appear frequently across all documents carry little information, so their weight is reduced.

TF-IDF

• Terms that appear often in a doc are important; those that appear often across the document collection are not. Each word receives an “importance score”.
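A minimal sketch of the scoring idea, assuming one common formulation (term frequency as a share of the document's words, inverse document frequency as the natural log of total documents over documents containing the term). Other weighting variants exist, and the three short documents are invented for illustration.

```python
import math

# Hypothetical mini-collection.
docs = {
    "doc1": "dog dog cat the",
    "doc2": "dog the the",
    "doc3": "cat the bird",
}
tokenized = {name: text.split() for name, text in docs.items()}


def tf(term, doc_name):
    """Term frequency: share of the document's words that are this term."""
    words = tokenized[doc_name]
    return words.count(term) / len(words)


def idf(term):
    """Inverse document frequency: lower for terms found in many docs."""
    containing = sum(1 for words in tokenized.values() if term in words)
    return math.log(len(tokenized) / containing) if containing else 0.0


def tfidf(term, doc_name):
    return tf(term, doc_name) * idf(term)


for term in ["dog", "cat", "the"]:
    print(term, round(tfidf(term, "doc1"), 3))
```

Here “the” appears in every document, so its score is zero, while “dog” outranks “cat” in doc1 because it appears there twice.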

37

© Starrett Consulting, Inc.

Data Science and Investigations – Information Retrieval / Relevancy Ranking – TF-IDF

Document | Dog | Cat | Other words ->
Doc 1    | 5   | 5   |
Doc 2    | 4   | 5   |
Doc 3    | 3   | 1   |
Doc 4    | 2   | 0   |

Search: Documents returned in search for “Dog” and “Cat” sorted by relevancy. TF-IDF scores for terms in documents “weight” individual docs up or down.

OR

Classify: Documents with similar word combinations can be “grouped” together. This approximates “classifying” like documents together. Remember electronic discovery example?

38

© Starrett Consulting, Inc.

Data Science and Investigations – Information Extraction
Named Entity Extraction

Original Files (examples of fields):
• Date, Author, Title, Last edit, Body
• Part Name, Part No., Price, Dept.
• To, From, CC, Sent, Subject, Body
• Date, URL, Title, Web Page Text
• Etc.

Original Files -> Text Repository (NoSQL Database; format: e.g. JSON, XML) -> Search Engine

Information Extraction occurs on text repository, NOT on original files or search engine index

39


Data Science and Investigations – Information Extraction
Named Entity Types (examples)

NE Type      | Examples
ORGANIZATION | ACFE, American Bar Association
PERSON       | Donald Trump, Hillary Clinton
LOCATION     | Mississippi River, Mt. Whitney
DATE         | 12-06-1970, January 15th, 2013
TIME         | Four fifty p.m., 0200 hours
MONEY        | $43.15, 90,000 YEN
FACILITY     | Lincoln Memorial, U.S. Treasury Bldg.

Some commercially available named entity tools have almost 1000 types of entities!

40

© Starrett Consulting, Inc.

Data Science and Investigations – Information Extraction
Named Entity Extraction

Entity Extraction steps:

Raw Text from NoSQL Document Database (not search engine) -> Sentence Segmentation -> Tokenization -> Parts-of-Speech Tagging -> Entity Detection

41

© Starrett Consulting, Inc.

Data Science and Investigations – Information Extraction
Named Entity Extraction – POS Tagging

Sentence: We saw the brown cats

• We: Pronoun (noun phrase)
• saw: Verb
• the: Determiner
• brown: Adjective
• cats: Noun
• the brown cats: Noun phrase

42


Data Science and Investigations – Information Extraction
Named Entity Extraction – Entity identification

Machine learning determines that each named-entity type follows certain parts-of-speech patterns, for example:

(PERSON Donald/N J./N Trump/N)

(PERSON = /N + /N + /N)
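The pattern idea can be illustrated with a toy matcher. Note the assumptions: the part-of-speech tags below are assigned by hand, the N + N + N pattern is hard-coded rather than learned, and `find_person_candidates` is a hypothetical helper. Real entity extractors learn many such patterns (and other cues) from annotated text.

```python
def find_person_candidates(tagged, pattern=("N", "N", "N")):
    """Return word sequences whose tags match the pattern exactly."""
    size = len(pattern)
    matches = []
    for start in range(len(tagged) - size + 1):
        window = tagged[start:start + size]
        if tuple(tag for _, tag in window) == pattern:
            matches.append(" ".join(word for word, _ in window))
    return matches


# Hand-tagged tokens (word, tag); tags are illustrative, not tool output.
tagged_sentence = [
    ("Yesterday", "ADV"),
    ("Donald", "N"),
    ("J.", "N"),
    ("Trump", "N"),
    ("spoke", "V"),
]

candidates = find_person_candidates(tagged_sentence)
print(candidates)
```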

43

© Starrett Consulting, Inc.

Data Science and Investigations – Information Extraction

Keyphrase Extraction

• Extracts key words and word combinations. Often identified using TF-IDF-like methods.

• Useful in identifying concepts, important terms, topics, code words and other “lingo”.

• Also used in document classification and clustering.

• Machine learning techniques can be used to create keyphrase extraction tools.

44

© Starrett Consulting, Inc.

Data Science and Investigations – Information Extraction

Other Available Information

Categories

Input - www.cnn.com

Output - /news/art and entertainment/movies and tv/television/news/international news

Concepts

Input - "Natural language processing uses machine learning to analyze text."

Output - Linguistics, Natural language processing, machine learning

45


Data Science and Investigations – Information Extraction

Other Available Information

Emotion

Input - "I love cities, but I hate the country"

Output - "cities": joy, "country": anger

Metadata

Input - "https://www.starrettconsultinginc.com"

Output:

• Author: Paul Starrett

• Title: A state-of-the-art investigations and consulting firm

• Publication date: March 1, 2016

46

© Starrett Consulting, Inc.

Data Science and Investigations – Information Extraction

Other Available Information

Semantic Roles

Input - "In 2016, Trump ran for president"

Output:

• Subject: Trump

• Action: ran

• Object: for president

Sentiment

Input - "Thank you and enjoy your trip!"

Output - Positive sentiment (score: 0.81)

47

© Starrett Consulting, Inc.

Data Science and Investigations – Information Extraction

Other Available Information

• Other information:
  • Geospatial data – physical addresses converted to GPS coordinates; distances can be calculated.
  • Topic modeling.
  • Lexical analysis.

• Many of the above resources are developed using machine learning, just as named entity extraction is.

• The above information can be used in predictive models to identify sensitive / private data.
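For the geospatial item, once addresses have been converted to coordinates (a geocoding step assumed to have happened elsewhere), distance is a short calculation. The sketch below uses the haversine great-circle formula with a mean Earth radius of 6371 km; the function name is illustrative and the coordinates are approximate city-center values.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance between two latitude/longitude points, in km."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlambda = math.radians(lon2 - lon1)
    a = (
        math.sin(dphi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    )
    return 2 * radius_km * math.asin(math.sqrt(a))


# Approximate coordinates for San Francisco and Los Angeles.
san_francisco = (37.7749, -122.4194)
los_angeles = (34.0522, -118.2437)

distance = haversine_km(*san_francisco, *los_angeles)
print(f"{distance:.0f} km")
```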

48


Data Science and Investigations – Graphs
Nodes and Edges

Node 1 -- Edge (relationship) --> Node 2

A node can be anything: Person, Location, Project, Concept, Association, Account, Document, etc.
An edge can be anything: Ownership, Parent / child, Lawyer / client, Knows, Member, Married, etc.

Example: John Doe -- Owns --> 123 Main St.
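Nodes and edges map naturally onto (source, relationship, target) triples. In the sketch below only the John Doe / Owns / 123 Main St. edge comes from the slide; the other two edges and the `reachable` helper are hypothetical, added to show how links can be followed.

```python
from collections import defaultdict

edges = [
    ("John Doe", "Owns", "123 Main St."),   # from the slide
    ("John Doe", "Married", "Jane Doe"),    # hypothetical
    ("Jane Doe", "Member", "Acme LLC"),     # hypothetical
]

# Adjacency list: node -> list of (relationship, neighbor).
graph = defaultdict(list)
for source, relationship, target in edges:
    graph[source].append((relationship, target))


def reachable(start):
    """All nodes that can be reached by following edges from start."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for _, target in graph.get(node, []):
            if target not in seen:
                seen.add(target)
                stack.append(target)
    return seen


print(graph["John Doe"])
print(sorted(reachable("John Doe")))
```

Following edges more than one step out is what surfaces indirect connections, here John Doe's link to Acme LLC through Jane Doe.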

50

© Starrett Consulting, Inc.

Data Science and Investigations – Graphs
Exploratory and Predictive

Exploratory uses

• For investigations, due diligence and to conduct research for predictive models.

Predictive analytics and graphs

• Typically used inside enterprise / government infrastructure to identify threats.

• Think anomaly detection and machine learning.

51


Data Science and Investigations

Graphs (Example)

• Trump organizations’ links to advisors and auditors.

• Exploration only.

(Data courtesy Bureau Van Dijk (www.bvdinfo.com). Visualization rendered in Polinode (www.polinode.com). Graph created by Starrett Consulting, Inc.)

52

© Starrett Consulting, Inc.

Data Science and Investigations
Summary

• Use unsupervised methods to summarize and categorize data for focus and prioritization.
  • Helps identify legal, regulatory and policy issues along with identifying supporting facts.

• Use supervised methods to identify certain information in other data.
  • Helps “mine” other data to capture or identify known facts or issues (often as identified in unsupervised learning).

53

© Starrett Consulting, Inc.

Data Science and Investigations
What’s the Problem?

• How do we investigate without running afoul of privacy and compliance regulations?

• Why not apply certain data-science investigative tools to this problem?

• Certain tools are perfect for identifying and gathering personal and sensitive data!

• Hence, the solution…….

• But wait, we’re not done! Enter GDPR (stay tuned!)

54


Agenda

• Global Privacy

• Big Data and Data Science Introduction – What are they?

• Data Science and Investigations – what’s the problem?

• Data-Science Investigative Tools – the solution!

• A Unique Challenge – Automation vs. Human Decisions

55

© Starrett Consulting, Inc.

Data-Science Investigative Tools as Solution
Previous tools used to find personal and sensitive data

• Use information extraction to find data that is personal and sensitive.

• Use information retrieval (search technologies) to further refine personal and sensitive data.

• No reason clustering and graph databases cannot be used.

56

© Starrett Consulting, Inc.

Data-Science Investigative Tools as Solution – Information Retrieval
Diverse Data into NoSQL Database (Text Repository) to Search Engine

Original Files (examples of fields):
• Date, Author, Title, Last edit, Body
• Part Name, Part No., Price, Dept.
• To, From, CC, Sent, Subject, Body
• Date, URL, Title, Web Page Text
• Etc.

Original Files -> Text Repository (NoSQL Database)

Information Extraction at Document Level: Categories, Concepts, Emotion, Entities, Keywords, Metadata, Keyphrases, Semantic Roles, Sentiment

Refine with:
• Clustering?
• Search engine?

57


Data-Science Investigative Tools as Solution – Information Extraction
High-Level Flow

Use EXTRACTED data from a document ->
• Categories
• Concepts
• Emotion
• Entities
• Keywords
• Metadata
• Keyphrases
• Semantic Roles
• Sentiment

To help identify DOCS containing personal / sensitive data ->
• Health
• Name / ID
• Sex Life
• Psychological
• Location
• Political opinion
• (Etc.)

(This process often involves active human review, i.e. whether extracted data will classify a document as containing Health, Name / ID, Sex Life (etc.) data.)
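One simple way to picture the step from extracted data to a sensitive-data label is a term lookup. This is a sketch only: the category names come from the slide, but the term lists, the `CATEGORY_TERMS` mapping and the `tag_document` helper are hypothetical, and in practice a trained classifier plus human review would replace or supplement such rules.

```python
# Hypothetical terms that hint at each category of sensitive data.
CATEGORY_TERMS = {
    "Health": {"diagnosis", "prescription", "surgery"},
    "Location": {"address", "gps", "itinerary"},
    "Political opinion": {"ballot", "campaign", "party membership"},
}


def tag_document(extracted_terms):
    """Return categories whose hint terms overlap the extracted terms."""
    terms = {term.lower() for term in extracted_terms}
    return sorted(
        category
        for category, hints in CATEGORY_TERMS.items()
        if terms & hints
    )


# Keywords / keyphrases as they might come out of an extraction tool.
doc_a_terms = ["Diagnosis", "GPS", "invoice"]
doc_b_terms = ["invoice", "shipment"]

tags_a = tag_document(doc_a_terms)
tags_b = tag_document(doc_b_terms)
print(tags_a, tags_b)
```

Documents that receive no tag are not proven clean; sampling and human review of those documents remain part of the process.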

58

© Starrett Consulting, Inc.

Data-Science Investigative Tools as Solution: Supervised Learning
Classification (predictive analytics - categorical)

Predictive Accuracy

Skull shape | Eyebrows | Leg length | Hair length | Tail | Number of ears | Predict? | Type
Pointed     | Yes      | 3 inches   | Short       | Yes  | 2              | Cat      | Dog
Round       | Yes      | 2 feet     | Short       | Yes  | 2              | Dog      | Dog
Triangle    | No       | 5 inches   | Medium      | Yes  | 2              | Cat      | Cat
Triangle    | No       | 5 inches   | Long        | Yes  | 2              | Cat      | Cat
Round       | Yes      | 1 foot     | Long        | Yes  | 2              | Dog      | Dog
Triangle    | Unk      | 4 inches   | Short       | Yes  | 2              | Dog      | Cat
Triangle    | No       | 5 inches   | Short       | Yes  | 2              | Cat      | Cat

Remember this? Except now we go from two “classes” (dog / cat) to Health, Name / ID, Sex Life, Psychological, Location, Political opinion, etc.

59

© Starrett Consulting, Inc.

Data-Science Investigative Tools as Solution – Supervised Learning (Classification)
Example – Classifying Personal / Sensitive Data

Columns: Date | Metadata | Document Text | Data Type

Example values in the Data Type column: Health, Location, Sex Life, Name / ID, Psychological, Health, Sex Life, Health, Religious Belief, Sex Life, Psychological, Location, Health

60


Data-Science Investigative Tools as Solution
GDPR data classification

Files Stream -> Predictive Model (Classifier) ->
• Health
• Location
• Sex Life
• Name / ID
• Psychological
• (Etc.)

[Diagram labels: Personal; Sensitive]

This example is too coarse and generic to be a “real-world” example, but it communicates the basic concept. Major solution providers, though, are using this same basic concept for information-governance data classification.

61

© Starrett Consulting, Inc.

Agenda

• Global Privacy

• Big Data and Data Science Introduction – What are they?

• Data Science and Investigations – what’s the problem?

• Data-Science Investigative Tools – the solution!

• A Unique Challenge – Automation vs. Human Decisions

62

© Starrett Consulting, Inc.

A Unique Challenge – Automation vs. Human Decisions: GDPR (abbreviated!)

Generally:

• Individuals have the right not to be subject to a decision when:

• It is based solely on automated processing.

• It produces a legal effect or a similarly significant effect on the individual.

(Don’t investigations fit this definition?)

• You must ensure that individuals can:

• Obtain human intervention.

• Express their point of view.

• Obtain an explanation of the decision and challenge it.

63


A Unique Challenge – Automation vs. Human Decisions

GDPR and DPA – using personal data to profile

• Safeguards against the risk that a potentially damaging decision is taken without human intervention.

• Establish if any of your processing operations amount to automated decision making.

64

© Starrett Consulting, Inc.

A Unique Challenge – Automation vs. Human Decisions: GDPR (abbreviated!)

Profiling

• Any form of automated processing to evaluate personal aspects of an individual in order to analyze / predict:

• Performance at work.

• Economic situation.

• Health.

• Personal preferences.

• Reliability.

• Behavior.

• Location.

• Movements.

(Again, don’t investigations fit this definition?)

65

© Starrett Consulting, Inc.

A Unique Challenge – Automation vs. Human Decisions: GDPR (abbreviated!)

When Profiling, You Must:

• Ensure processing is fair and transparent by providing meaningful information about the logic involved, as well as the significance and the envisaged consequences.

• Use appropriate mathematical or statistical procedures for the profiling.

• Implement appropriate technical and organizational measures to enable inaccuracies to be corrected and minimize the risk of errors.

• Not make such decisions about a child or base them on special categories of data (exceptions apply).

66


A Unique Challenge – Automation vs. Human Decisions: Others

Credit applications and “adverse inference”

• May need to explain automated decisions.

Employment decisions (e.g. resume recommendations)

• Do algorithms or predictive models inadvertently discriminate?

67

© Starrett Consulting, Inc.

A Unique Challenge – Automation vs. Human Decisions: Solutions!

• Keeping any machine learning effort conventional and straightforward.

• Dog vs. wolf example.

• Intuitive assessments are key in interpretability.

• This starts at outset of automation design.

68

© Starrett Consulting, Inc.

A Unique Challenge – Automation vs. Human Decisions: Solutions!

• What factors (features) are used?

• Choice of machine learning algorithm.

• How is sampling done (if at all)?

• Details of predictive model testing and validation.

• Software writes logs that detail the decision process in a form that can be interpreted in lay terms.
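A decision log of this kind might look like the sketch below. It assumes a deliberately simple linear scoring model so that every feature's contribution can be written out in plain terms; the feature names, weights, threshold and `decide` function are all hypothetical.

```python
import logging

logging.basicConfig(level=logging.INFO, format="%(message)s")
logger = logging.getLogger("decision_log")

# Hypothetical feature weights for an illustrative linear scoring model.
WEIGHTS = {
    "mentions_health_term": 0.6,
    "contains_id_number": 0.8,
    "internal_memo": -0.3,
}


def decide(doc_id, features, threshold=1.0):
    """Score a document and log each feature's contribution to the result."""
    contributions = {
        name: WEIGHTS[name] * value for name, value in features.items()
    }
    score = sum(contributions.values())
    decision = "flag for human review" if score >= threshold else "no action"
    for name, contribution in contributions.items():
        logger.info(
            "doc=%s feature=%s value=%s contribution=%+.2f",
            doc_id, name, features[name], contribution,
        )
    logger.info(
        "doc=%s score=%.2f threshold=%.2f decision=%s",
        doc_id, score, threshold, decision,
    )
    return decision, score


decision_a, score_a = decide(
    "doc-001",
    {"mentions_health_term": 1, "contains_id_number": 1, "internal_memo": 0},
)
decision_b, score_b = decide(
    "doc-002",
    {"mentions_health_term": 1, "contains_id_number": 0, "internal_memo": 1},
)
```

Because the outcome here is a referral to a person rather than a final decision, the log also documents where human intervention enters the process.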

69


THE END!

QUESTIONS?

70

408.803.2288
© Starrett Consulting, Inc.