Handling Personal Information in LinkedIn’s Content Ingestion System
David Max, Senior Software Engineer
About Me
• Software Engineer at LinkedIn NYC since 2015
• Content Ingestion team
• Office Hours – Thursday 11:30–12:00
• LinkedIn: www.linkedin.com/in/davidpmax/
About LinkedIn New York Engineering
• Located in Empire State Building
• Approximately 100 engineers and 1000 employees total
• Multiple teams: front end, back end, and data science
New York Engineering
Disclaimers
• I’m not a lawyer
• Some details omitted
• I am not a spokesperson for official LinkedIn policy
Our Mission
Create economic opportunity for every member of the global workforce
• World’s largest professional network: more than 546M members
• More than 70% of members reside outside the U.S.
• More than 200 countries and territories worldwide
General Data Protection Regulation
• Applies to all companies worldwide that process personal data of EU citizens.
• Widens definition of personal data.
• Introduces restrictive data handling principles.
• Enforceable from May 25, 2018.
Handling Personally Identifiable Information (PII)
• Data Minimization – Limit personal data collection, storage, and usage
• Consent – Cannot use collected data for a different purpose
• Retention – Do not hold data longer than necessary
• Deletion – Must delete data upon request
Handling PII in Content Ingestion
Content Ingestion Data Protection
Babylonia: Data Minimization, Consent, Retention, Deletion
What is Content Ingestion?
Content Ingestion
Babylonia
url: https://www.youtube.com/watch?v=MS3c9hz0bRg
title: "SATURN 2017 Keynote: Software is Details"
image: https://i.ytimg.com/vi/MS3c9hz0bRg/hqdefault.jpg?sqpoaymwEYCKgBEF5IVfKriqkDCwgBFQAAiEIYAXAB&rs=AOn4CLClwjQlBmMeoRCePtHaThN-qXRHqg
• Extracts metadata from web pages
• Source of Truth for 3rd party content
• Also contains metadata for some public 1st party content
• Used by LinkedIn services for sharing, decorating, and embedding content
• Data also feeds into content understanding and relevance models
How does PII get into Babylonia?
Ingesting 1st party pages containing publicly viewable member PII:
• Profile pages
• Published posts
• SlideShare content
When a Member Account is Closed
What happens:
• Babylonia (along with other systems) is notified that the member’s account is closed
• Other systems take down the member’s content (e.g. public profile page, published posts, etc.)
What Babylonia needs to do:
• Remove scraped data relating to the member pages that have been taken down
• Notify downstream systems that might be holding a copy of the data
Babylonia Datasets
• Espresso database
• HDFS (ETL)
• Brooklin data change events
Downstream and Upstream Datasets
[Diagram: upstream 1st party web pages (profile, job, article, publishing) feed into Babylonia; Babylonia’s data flows out through the Espresso database (online service), Brooklin data change events (near line), and HDFS ETL snapshots (offline).]
Challenges of member PII in Babylonia
• Need to identify URLs that contain a member’s PII
• My post might contain your PII
• Connection between member and the URL resides in the upstream system
Option #1: Require Upstream Systems to Notify Babylonia
Pros:
• Simple – Babylonia waits to be told specifically which URLs should be purged
• Babylonia only does extra work when a URL needs to be purged
• Puts responsibility where the knowledge is
Cons:
• Requires additional work by every system that exposes PII in publicly accessible web pages
• If the notification is missed, how will Babylonia know?
• 1st party URLs sometimes change as upstream systems are changed – need to correctly handle old URLs too
Option #2: Actively Refetch Every 1st Party URL
Pros:
• Simple logic: Page gone? Purge the page.
• Requires little additional work from upstream systems
• Works also for old 1st party URLs
Cons:
• There are a lot of 1st party URLs in Babylonia
• Continuous polling of all 1st party URLs consumes a lot of resources just for the sake of the very few URLs that are actually affected
• Extra work to avoid false positives or false negatives
Option #3: Eliminate Member PII in Babylonia
Pros:
• The easiest data to delete is data that isn’t in your system to begin with
• Gets closer to a Single Source of Truth (SSOT) for all 1st party content – better for consistency, not only for compliance
Cons:
• Babylonia is relied upon by numerous systems to have content for URLs – excluding 1st party content will affect member experience
• No substitute currently available
• Difficult to achieve based on URL – can’t always tell by looking at a URL if it resolves to 1st party content (e.g. shortlinks)
Blended Approach
• Option 1 – Having upstream systems notify is best, but might miss some pages
• Option 2 – Active refetching is thorough but expensive; must be used to catch pages that don’t support notifications
• Option 3 – Some pages won’t work with active refetching, e.g. pages that still return HTTP status code 200 even when the data has been removed; these must be blocked
Classification of Ingested URLs
• URL → 3rd party or 1st party
• 1st party → blocked or whitelisted
• Whitelisted → actively refetched or notified by upstream
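The decision tree above can be sketched as a small classifier. This is a minimal illustration, not Babylonia’s actual logic: the domain list and whitelisted path prefixes here are hypothetical stand-ins for the real classification rules.

```python
from urllib.parse import urlparse

# Hypothetical 1st party domains and whitelisted path prefixes;
# the real rules live inside Babylonia, not in a static table.
FIRST_PARTY_DOMAINS = {"www.linkedin.com", "www.slideshare.net"}
WHITELISTED_PATH_PREFIXES = ("/in/", "/pulse/")

def classify_url(url: str) -> str:
    """Classify a URL per the tree above: 3rd party, or 1st party
    (blocked by default, whitelisted only if explicitly allowed)."""
    parsed = urlparse(url)
    if parsed.netloc not in FIRST_PARTY_DOMAINS:
        return "3rd-party"
    if parsed.path.startswith(WHITELISTED_PATH_PREFIXES):
        return "1st-party-whitelisted"
    return "1st-party-blocked"  # restriction is the default
```

Note that the blocked bucket is the fall-through case, matching the "restriction is the default until proven safe" stance described later in the talk.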
Option 1 – Upstream Notification
• Upstream system sends a Kafka message
• Babylonia consumes message and purges data
• Open source – kafka.apache.org
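The notification flow can be sketched end to end. In production this goes over Kafka; here the broker is replaced by a plain list so the sketch is self-contained, and the topic name and message fields are assumptions, not LinkedIn’s actual schema.

```python
import json

PURGE_TOPIC = "babylonia-purge-requests"  # hypothetical topic name

def build_purge_message(url: str, reason: str) -> bytes:
    """Upstream side: serialize a purge notification for a taken-down URL."""
    return json.dumps({"url": url, "reason": reason}).encode("utf-8")

def handle_purge_message(store: dict, payload: bytes) -> bool:
    """Babylonia side: consume a notification and purge the URL's metadata.
    Returns True if a record was actually removed."""
    event = json.loads(payload)
    return store.pop(event["url"], None) is not None

# In-memory stand-ins for the Espresso-backed store and the Kafka broker.
store = {"https://www.linkedin.com/in/closed-member/": {"title": "..."}}
broker = [build_purge_message("https://www.linkedin.com/in/closed-member/",
                              "account-closed")]
for payload in broker:
    handle_purge_message(store, payload)
```

The key property is that the upstream system, which knows the member-to-URL mapping, decides *what* to purge; Babylonia only has to act on the message.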
Option 2 – Active Refetching
[Diagram: an offline job over HDFS ETL snapshots builds a RefetchURL table; a Kafka push job emits refetch messages; the refetch process refetches each URL and issues takedown requests for deleted pages, updating the Espresso database.]
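The core of the refetch process is simple: refetch each 1st party URL and purge it if the page is gone. A minimal sketch, with the HTTP client replaced by an injected status function so it runs standalone; names and the 404/410 check are assumptions about what "page gone" means here.

```python
from typing import Callable, Dict, Iterable, List

def refetch_and_purge(store: Dict[str, dict],
                      urls: Iterable[str],
                      fetch_status: Callable[[str], int]) -> List[str]:
    """Refetch each URL; if the page is gone (HTTP 404/410), purge its
    metadata and record a takedown. Returns the list of purged URLs."""
    purged = []
    for url in urls:
        if fetch_status(url) in (404, 410):
            store.pop(url, None)   # takedown request for the deleted page
            purged.append(url)
    return purged

# Stub fetcher standing in for a real HTTP client.
statuses = {"https://example.com/alive": 200,
            "https://example.com/deleted": 404}
store = {url: {"title": "..."} for url in statuses}
purged = refetch_and_purge(store, list(statuses), statuses.get)
```

The expense called out in the cons list lives in `fetch_status`: every polled URL costs a fetch, even though only a tiny fraction ever needs purging.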
Option 3 – Whitelist
• Block all 1st party URLs that can’t meet minimal requirements
• Mainly must return a 404 for an invalid or deleted URL
• Ensures new 1st party URLs are onboarded before being ingested
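The "minimal requirements" check at onboarding time can be sketched as a probe: a URL pattern only qualifies for the whitelist if deleted pages really return 404 rather than a soft-200 "not found" page. The function name and probe shape are hypothetical; only the requirement itself comes from the talk.

```python
from typing import Callable

def meets_whitelist_bar(live_url: str, deleted_url: str,
                        fetch_status: Callable[[str], int]) -> bool:
    """Onboarding probe sketch: a 1st party URL pattern qualifies only if
    live pages return 200 and an invalid/deleted URL returns a real 404
    (not a soft-200 error page that active refetching cannot detect)."""
    return fetch_status(live_url) == 200 and fetch_status(deleted_url) == 404

# Stub fetchers standing in for a real HTTP client.
well_behaved = {"https://example.com/live": 200,
                "https://example.com/gone": 404}.get
soft_200 = lambda url: 200  # returns 200 even for deleted pages
```

A pattern that fails this probe stays blocked, which is exactly the failure mode Option 3 exists to catch.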
Managing PII in Datasets
Espresso Datasets
What is Espresso?
• LinkedIn’s distributed NoSQL database
• Data stored in Avro format (JSON)
• Indexed by specific primary key fields
Challenges
• Reference to PII is not always in the key
• ETL snapshots of Espresso datasets become offline datasets
Offline (HDFS) Datasets
• Files of Avro (JSON) records
Challenges
• Need to read the whole record to see if it has PII
• Files are not conducive to removing one record from the middle
• A dataset can be the source for downstream jobs that also need to be purged
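Because a record can’t be deleted from the middle of a file, the offline purge amounts to streaming the dataset and writing a new copy that omits purged records. A minimal sketch, with JSON lines standing in for Avro container files; field names are hypothetical.

```python
import json
from typing import Iterable, List, Set

def rewrite_without_purged(src_lines: Iterable[str],
                           purged_urls: Set[str]) -> List[str]:
    """Offline purge sketch: read every record in full (the URL/PII
    reference is inside the record, not in a key) and keep only those
    not slated for purging. The output replaces the original file."""
    kept = []
    for line in src_lines:
        record = json.loads(line)          # must deserialize the whole record
        if record.get("url") not in purged_urls:
            kept.append(line)
    return kept

dataset = [
    json.dumps({"url": "https://example.com/a", "title": "A"}),
    json.dumps({"url": "https://example.com/gone", "title": "B"}),
]
kept = rewrite_without_purged(dataset, {"https://example.com/gone"})
```

The same rewrite then has to cascade to downstream jobs that consumed the old file, which is why tracking lineage matters in the next section.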
Which datasets contain member PII?
Data Discovery: WhereHows
• Data discovery and lineage tool
• Central location for all schemas
• Documents the meaning of each column
• Traces downstream/upstream lineage of datasets
• Tags every column that can contain a member reference or PII
• Open source – github.com/linkedin/wherehows
Dali (Data Access at LinkedIn)
• Interface for accessing datasets
• Combines dataset schema with WhereHows metadata
• Defines output virtual datasets while preserving data tags
• Supports defining virtual datasets where PII is excluded or obfuscated
[Diagram: a Dali reader combines the raw dataset with WhereHows metadata.]
Access Control
Only systems that handle PII properly are allowed access.
Access Control List (ACL)
• Restricts access to PII data to a known list of authorized systems
• We only approve access for systems that can handle PII properly
• Ensures that member PII can’t leak into untracked systems/datasets
• Acts as a list of downstream services
Keeping Track of Personal Information in Babylonia
WhereHows
• Field tagging for fields containing PII
• Know where the PII is
Dali
• Downstreams use Dali, which preserves the WhereHows tagging on new virtual datasets
• Keeps tags with the data as it moves from one dataset to another
ACL
• Controls the spread of PII data, to authorized readers only
• Serves as a list of current downstream systems to notify when data is purged
Apache Gobblin
• Framework for transforming large datasets
• Data lifecycle management
• Uses WhereHows tags to identify data in our Espresso or offline datasets that need to be purged
• Open source - gobblin.apache.org
Tagging in WhereHows and Gobblin
• Created tags representing ingested content URLs in WhereHows
• Enables downstream systems to onboard with Espresso auto purge and Gobblin by tagging columns in their tables as containing a URL or Ingested Content URN (Uniform Resource Name)
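The tag-driven approach can be sketched as a lookup: given column tags, a purge job can discover every place an ingested content URL might live. The tag name, dataset names, and mapping shape below are all hypothetical illustrations of the idea, not WhereHows’ actual data model.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical WhereHows-style metadata: "dataset.column" -> set of tags.
COLUMN_TAGS: Dict[str, Set[str]] = {
    "shares.content_url":     {"INGESTED_CONTENT_URL"},
    "shares.member_id":       {"MEMBER_ID"},
    "articles.canonical_url": {"INGESTED_CONTENT_URL"},
    "articles.body":          set(),
}

def purge_targets(tag: str) -> List[Tuple[str, str]]:
    """Find every (dataset, column) carrying the given tag, so an
    auto-purge job knows where to look without per-dataset code."""
    hits = []
    for qualified, tags in COLUMN_TAGS.items():
        if tag in tags:
            dataset, column = qualified.split(".")
            hits.append((dataset, column))
    return sorted(hits)
```

Because the tags travel with the data (via Dali’s virtual datasets), new downstream datasets become purge targets automatically instead of relying on developers to register them by hand.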
Compliance Comes First
• Choose an implementation where restriction is the default until proven safe
• Whitelisting ensures all allowed 1st party URLs meet a minimum technical bar for ingestion
• Simplicity of active refetching helps keep the bar low enough to include most content safely
Constraints
• Added constraints to the system
• Developer restrictions
• Made certain kinds of things harder to do
Bigger Picture
“Constraints can act as guide rails that point a system where you want it to go.”
– George Fairbanks
Constraints / Guide Rails
• A constrained system is easier to predict and control
• Make the wrong things harder to do
• Give guidance to all developers on how things are supposed to be done
Manifest Guide Rails in the Code
• Constraints should manifest in some explicit way
• Counter-example: “No backwards-incompatible schema changes”
• Hard to tell what developers refrained from doing
• WhereHows, Dali, and ACLs make metadata and the rules explicit and thus easier to perpetuate
Architecture Hoisting
A design technique where the responsibility for a guide rail is moved away from developer vigilance into code, with the goal of achieving a global property on the system.
• Make use of the framework to manage PII
• Requires developers to think about PII concerns up front to access the data
• Once set up, developers can focus less on managing PII because the architecture is handling it
• Users of the framework can automatically benefit from future enhancements
Thank you