Handling Personal Information in LinkedIn’s Content Ingestion System
David Max, Senior Software Engineer
About Me
• Software Engineer at LinkedIn NYC since 2015
• Content Ingestion team
• Office Hours – Thursday 11:30–12:00
• LinkedIn: www.linkedin.com/in/davidpmax/
About LinkedIn New York Engineering
• Located in Empire State Building
• Approximately 100 engineers and 1000 employees total
• Multiple teams: front end, back end, and data science
New York Engineering
Disclaimers
• I’m not a lawyer
• Some details omitted
• I am not a spokesperson for official LinkedIn policy
Our Mission
Create economic opportunity for every member of the global workforce
• World’s largest professional network: more than 546M members
• More than 70% of members reside outside the U.S.
• More than 200 countries and territories worldwide
General Data Protection Regulation
• Applies to all companies worldwide that process personal data of EU citizens.
• Widens definition of personal data.
• Introduces restrictive data handling principles.
• Enforceable from May 25, 2018.
Handling Personally Identifiable Information (PII)
• Data Minimization – Limit personal data collection, storage, and usage
• Consent – Cannot use collected data for a different purpose
• Retention – Do not hold data longer than necessary
• Deletion – Must delete data upon request
Handling PII in Content Ingestion
Content Ingestion Data Protection
Babylonia: Data Minimization, Consent, Retention, Deletion
What is Content Ingestion?
Content Ingestion
Babylonia
url: https://www.youtube.com/watch?v=MS3c9hz0bRg
title: "SATURN 2017 Keynote: Software is Details"
image: https://i.ytimg.com/vi/MS3c9hz0bRg/hqdefault.jpg?sqpoaymwEYCKgBEF5IVfKriqkDCwgBFQAAiEIYAXAB&rs=AOn4CLClwjQlBmMeoRCePtHaThN-qXRHqg
• Extracts metadata from web pages
• Source of Truth for 3rd party content
• Also contains metadata for some public 1st party content
• Used by LinkedIn services for sharing, decorating, and embedding content
• Data also feeds into content understanding and relevance models
How does PII get into Babylonia?
Ingesting 1st party pages containing publicly viewable member PII:
• Profile pages
• Published posts
• SlideShare content
When a Member Account is Closed
What happens:
• Babylonia (along with other systems) is notified that the member’s account is closed
• Other systems take down the member’s content (e.g. public profile page, published posts, etc.)
What Babylonia needs to do:
• Remove scraped data relating to the member pages that have been taken down
• Notify downstream systems that might be holding a copy of the data
Babylonia Datasets
• Espresso database
• HDFS (ETL)
• Brooklin data change events
Downstream and Upstream Datasets
[Diagram: upstream 1st party web pages (profile, job, article, publishing) feed into Babylonia; Babylonia’s data flows out through the Espresso database (online service), Brooklin data change events (near line), and HDFS ETL snapshots (offline).]
Challenges of member PII in Babylonia
• Need to identify URLs that contain a member’s PII
• My post might contain your PII
• Connection between member and the URL resides in the upstream system
Option #1: Require Upstream Systems to Notify Babylonia
Pros:
• Simple – Babylonia waits to be told specifically which URLs should be purged
• Babylonia only does extra work when a URL needs to be purged
• Puts responsibility where the knowledge is
Cons:
• Requires additional work by every system that exposes PII in publicly accessible web pages
• If the notification is missed, how will Babylonia know?
• 1st party URLs sometimes change as upstream systems are changed – need to correctly handle old URLs too
Option #2: Actively Refetch Every 1st Party URL
Pros:
• Simple logic: Page gone? Purge the page.
• Requires little additional work from upstream systems
• Works also for old 1st party URLs
Cons:
• There are a lot of 1st party URLs in Babylonia
• Continuous polling of all 1st party URLs consumes a lot of resources just for the sake of the very few URLs that are actually affected
• Extra work to avoid false positives or false negatives
Option #3: Eliminate Member PII in Babylonia
Pros:
• The easiest data to delete is data that isn’t in your system to begin with
• Gets closer to a Single Source of Truth (SSOT) for all 1st party content – better for consistency, not only for compliance
Cons:
• Babylonia is relied upon by numerous systems to have content for URLs – excluding 1st party content will affect member experience
• No substitute currently available
• Difficult to achieve based on URL – can’t always tell by looking at a URL if it resolves to 1st party content (e.g. shortlinks)
Blended Approach
• Option 1 – Having upstream systems notify is best, but might miss some pages
• Option 2 – Active refetching is thorough but expensive; must be used to catch pages that don’t support notifications
• Option 3 – Some pages won’t work with active refetching, e.g. pages that still return HTTP status code 200 even when the data has been removed; these must be blocked
Classification of Ingested URLs
• URL → 3rd party or 1st party
• 1st party → blocked or whitelisted
• Whitelisted → actively refetched or notified by upstream
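The decision tree above can be sketched as a small classifier. This is a minimal illustration, not Babylonia’s actual logic: the domain list and whitelisted path prefixes here are hypothetical stand-ins for the real classification rules.

```python
from urllib.parse import urlparse

# Hypothetical 1st party domains and whitelisted path prefixes;
# the real rules live inside Babylonia, not in a static table.
FIRST_PARTY_DOMAINS = {"www.linkedin.com", "www.slideshare.net"}
WHITELISTED_PATH_PREFIXES = ("/in/", "/pulse/")

def classify_url(url: str) -> str:
    """Classify a URL per the tree above: 3rd party, or 1st party
    (blocked by default, whitelisted only if explicitly allowed)."""
    parsed = urlparse(url)
    if parsed.netloc not in FIRST_PARTY_DOMAINS:
        return "3rd-party"
    if parsed.path.startswith(WHITELISTED_PATH_PREFIXES):
        return "1st-party-whitelisted"
    return "1st-party-blocked"  # restriction is the default
```

Note that the blocked bucket is the fall-through case, matching the "restriction is the default until proven safe" stance described later in the talk.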
Option 1 – Upstream Notification
• Upstream system sends a Kafka message
• Babylonia consumes message and purges data
• Open source – kafka.apache.org
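The notification flow can be sketched end to end. In production this goes over Kafka; here the broker is replaced by a plain list so the sketch is self-contained, and the topic name and message fields are assumptions, not LinkedIn’s actual schema.

```python
import json

PURGE_TOPIC = "babylonia-purge-requests"  # hypothetical topic name

def build_purge_message(url: str, reason: str) -> bytes:
    """Upstream side: serialize a purge notification for a taken-down URL."""
    return json.dumps({"url": url, "reason": reason}).encode("utf-8")

def handle_purge_message(store: dict, payload: bytes) -> bool:
    """Babylonia side: consume a notification and purge the URL's metadata.
    Returns True if a record was actually removed."""
    event = json.loads(payload)
    return store.pop(event["url"], None) is not None

# In-memory stand-ins for the Espresso-backed store and the Kafka broker.
store = {"https://www.linkedin.com/in/closed-member/": {"title": "..."}}
broker = [build_purge_message("https://www.linkedin.com/in/closed-member/",
                              "account-closed")]
for payload in broker:
    handle_purge_message(store, payload)
```

The key property is that the upstream system, which knows the member-to-URL mapping, decides *what* to purge; Babylonia only has to act on the message.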
Option 2 – Active Refetching
[Diagram: an offline job over HDFS ETL snapshots builds a RefetchURL table; a Kafka push job emits refetch messages; the refetch process refetches each URL and issues takedown requests for deleted pages, updating the Espresso database.]
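The core of the refetch process is simple: refetch each 1st party URL and purge it if the page is gone. A minimal sketch, with the HTTP client replaced by an injected status function so it runs standalone; names and the 404/410 check are assumptions about what "page gone" means here.

```python
from typing import Callable, Dict, Iterable, List

def refetch_and_purge(store: Dict[str, dict],
                      urls: Iterable[str],
                      fetch_status: Callable[[str], int]) -> List[str]:
    """Refetch each URL; if the page is gone (HTTP 404/410), purge its
    metadata and record a takedown. Returns the list of purged URLs."""
    purged = []
    for url in urls:
        if fetch_status(url) in (404, 410):
            store.pop(url, None)   # takedown request for the deleted page
            purged.append(url)
    return purged

# Stub fetcher standing in for a real HTTP client.
statuses = {"https://example.com/alive": 200,
            "https://example.com/deleted": 404}
store = {url: {"title": "..."} for url in statuses}
purged = refetch_and_purge(store, list(statuses), statuses.get)
```

The expense called out in the cons list lives in `fetch_status`: every polled URL costs a fetch, even though only a tiny fraction ever needs purging.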
Option 3 – Whitelist
• Block all 1st party URLs that can’t meet minimal requirements
• Mainly must return a 404 for an invalid or deleted URL
• Ensures new 1st party URLs are onboarded before being ingested
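The "minimal requirements" check at onboarding time can be sketched as a probe: a URL pattern only qualifies for the whitelist if deleted pages really return 404 rather than a soft-200 "not found" page. The function name and probe shape are hypothetical; only the requirement itself comes from the talk.

```python
from typing import Callable

def meets_whitelist_bar(live_url: str, deleted_url: str,
                        fetch_status: Callable[[str], int]) -> bool:
    """Onboarding probe sketch: a 1st party URL pattern qualifies only if
    live pages return 200 and an invalid/deleted URL returns a real 404
    (not a soft-200 error page that active refetching cannot detect)."""
    return fetch_status(live_url) == 200 and fetch_status(deleted_url) == 404

# Stub fetchers standing in for a real HTTP client.
well_behaved = {"https://example.com/live": 200,
                "https://example.com/gone": 404}.get
soft_200 = lambda url: 200  # returns 200 even for deleted pages
```

A pattern that fails this probe stays blocked, which is exactly the failure mode Option 3 exists to catch.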
Managing PII in Datasets
Espresso Datasets
What is Espresso?
• LinkedIn’s distributed NoSQL database
• Data stored in Avro format (JSON)
• Indexed by specific primary key fields
Challenges
• Reference to PII is not always in the key
• ETL snapshots of Espresso datasets become offline datasets
Offline (HDFS) Datasets
• Files of Avro (JSON) records
Challenges
• Need to read the whole record to see if it has PII
• Files are not conducive to removing one record from the middle
• A dataset can be the source for downstream jobs that also need to be purged
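Because a record can’t be deleted from the middle of a file, the offline purge amounts to streaming the dataset and writing a new copy that omits purged records. A minimal sketch, with JSON lines standing in for Avro container files; field names are hypothetical.

```python
import json
from typing import Iterable, List, Set

def rewrite_without_purged(src_lines: Iterable[str],
                           purged_urls: Set[str]) -> List[str]:
    """Offline purge sketch: read every record in full (the URL/PII
    reference is inside the record, not in a key) and keep only those
    not slated for purging. The output replaces the original file."""
    kept = []
    for line in src_lines:
        record = json.loads(line)          # must deserialize the whole record
        if record.get("url") not in purged_urls:
            kept.append(line)
    return kept

dataset = [
    json.dumps({"url": "https://example.com/a", "title": "A"}),
    json.dumps({"url": "https://example.com/gone", "title": "B"}),
]
kept = rewrite_without_purged(dataset, {"https://example.com/gone"})
```

The same rewrite then has to cascade to downstream jobs that consumed the old file, which is why tracking lineage matters in the next section.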
Which datasets contain member PII?
Data Discovery: WhereHows
• Data discovery and lineage tool
• Central location for all schemas
• Documents the meaning of each column
• Traces downstream/upstream lineage of datasets
• Tags every column that can contain a member reference or PII
• Open source – github.com/linkedin/wherehows
Dali (Data Access at LinkedIn)
• Interface for accessing datasets
• Combines dataset schema with WhereHows metadata
• Defines output virtual datasets while preserving data tags
• Supports defining virtual datasets where PII is excluded or obfuscated
[Diagram: a Dali reader combines the raw dataset with WhereHows metadata.]
Access Control
Only systems that handle PII properly are allowed access.
Access Control List (ACL)
• Restricts access to PII data to a known list of authorized systems
• We only approve access for systems that can handle PII properly
• Ensures that member PII can’t leak into untracked systems/datasets
• Acts as a list of downstream services
Keeping Track of Personal Information in Babylonia
WhereHows
• Field tagging for fields containing PII
• Know where the PII is
Dali
• Downstreams use Dali, which preserves the WhereHows tagging on new virtual datasets
• Keeps tags with the data as it moves from one dataset to another
ACL
• Controls the spread of PII data, to authorized readers only
• Serves as a list of current downstream systems to notify when data is purged
Apache Gobblin
• Framework for transforming large datasets
• Data lifecycle management
• Uses WhereHows tags to identify data in our Espresso or offline datasets that need to be purged
• Open source - gobblin.apache.org
Tagging in WhereHows and Gobblin
• Created tags representing ingested content URLs in WhereHows
• Enables downstream systems to onboard with Espresso auto purge and Gobblin by tagging columns in their tables as containing a URL or Ingested Content URN (Uniform Resource Name)
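The tag-driven approach can be sketched as a lookup: given column tags, a purge job can discover every place an ingested content URL might live. The tag name, dataset names, and mapping shape below are all hypothetical illustrations of the idea, not WhereHows’ actual data model.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical WhereHows-style metadata: "dataset.column" -> set of tags.
COLUMN_TAGS: Dict[str, Set[str]] = {
    "shares.content_url":     {"INGESTED_CONTENT_URL"},
    "shares.member_id":       {"MEMBER_ID"},
    "articles.canonical_url": {"INGESTED_CONTENT_URL"},
    "articles.body":          set(),
}

def purge_targets(tag: str) -> List[Tuple[str, str]]:
    """Find every (dataset, column) carrying the given tag, so an
    auto-purge job knows where to look without per-dataset code."""
    hits = []
    for qualified, tags in COLUMN_TAGS.items():
        if tag in tags:
            dataset, column = qualified.split(".")
            hits.append((dataset, column))
    return sorted(hits)
```

Because the tags travel with the data (via Dali’s virtual datasets), new downstream datasets become purge targets automatically instead of relying on developers to register them by hand.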
Compliance Comes First
• Choose an implementation where restriction is the default until proven safe
• Whitelisting ensures all allowed 1st party URLs meet a minimum technical bar for ingestion
• Simplicity of active refetching helps keep the bar low enough to include most content safely
Constraints
• Added constraints to the system
• Developer restrictions
• Made certain kinds of things harder to do
Bigger Picture
“Constraints can act as guide rails that point a system where you want it to go.”
– George Fairbanks
Constraints / Guide Rails
• A constrained system is easier to predict and control
• Make the wrong things harder to do
• Give guidance to all developers on how things are supposed to be done
Manifest Guide Rails in the Code
• Constraints should manifest in some explicit way
• Counter-example: “No backwards-incompatible schema changes”
• Hard to tell what developers refrained from doing
• WhereHows, Dali, and ACLs make metadata and the rules explicit and thus easier to perpetuate
Architecture Hoisting
A design technique where the responsibility for a guide rail is moved away from developer vigilance into code, with the goal of achieving a global property on the system.
• Make use of the framework to manage PII
• Requires developers to think about PII concerns up front to access the data
• Once set up, developers can focus less on managing PII because the architecture is handling it
• Users of the framework can automatically benefit from future enhancements
Thank you