The Big Data Analytics Ecosystem at LinkedIn Rajappa Iyer September 17, 2013

LinkedIn has several data-driven products that improve the experience of its users, whether they are professionals or enterprises. Supporting these products is a large ecosystem of systems and processes that deliver data and insights to them in a timely manner. This talk provides an overview of the components of this ecosystem, including Hadoop, Teradata, Kafka, Databus, Camus, and Lumos.

Page 1: The Big Data Analytics Ecosystem at LinkedIn

The Big Data Analytics Ecosystem at LinkedIn

Rajappa Iyer, September 17, 2013

Page 2: The Big Data Analytics Ecosystem at LinkedIn

Agenda

LinkedIn by the numbers
An Overview of Data Driven Products / Insights
The Big Data Analytics Ecosystem
– Storage and Compute Platforms
– Data Transport Pipelines
– Data Processing Pipelines
– Operational Tooling - Metadata

Q&A

Page 3: The Big Data Analytics Ecosystem at LinkedIn

LinkedIn: The World’s Largest Professional Network

238M+ Members Worldwide
2 New Members Per Second
100M+ Monthly Unique Visitors
3M+ Company Pages

Connecting Talent with Opportunity. At scale…

Page 4: The Big Data Analytics Ecosystem at LinkedIn

Insights

(Analysts and Data Scientists)

Data Driven Products and Insights

Products for Members

(Professionals)

Products for Enterprises

(Companies)

Data,Platforms,Analytics

Page 5: The Big Data Analytics Ecosystem at LinkedIn

Products for Members

Page 6: The Big Data Analytics Ecosystem at LinkedIn

Products for Enterprises

Sell - Sales Navigator
Market - Marketing Solutions
Hire - Talent Solutions

Page 7: The Big Data Analytics Ecosystem at LinkedIn

Examples of Business Insights

Page 8: The Big Data Analytics Ecosystem at LinkedIn

Example of Deeper Insight

Job Migration After Financial Collapse

Page 9: The Big Data Analytics Ecosystem at LinkedIn

A Simplified Overview of Data Flow

Page 10: The Big Data Analytics Ecosystem at LinkedIn

LinkedIn Confidential ©2013 All Rights Reserved 10

Storage and Compute Platforms

Most data in Avro format
Access via Hive and Pig
Most ETL processes run here
Specialized batch processing
Algorithmic data mining

Page 11: The Big Data Analytics Ecosystem at LinkedIn

Storage and Compute Platforms

Integrated Data Warehouse

Standard BI Tools

Interactive Querying (Low latency)

Workload Management

Page 12: The Big Data Analytics Ecosystem at LinkedIn

Transport Pipeline - Kafka

High-volume, low-latency messaging system

Horizontally scalable
Automatic load balancing
Rewindability
Intra-cluster replication
Mainly used for log aggregation and queuing
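
Rewindability can be pictured with a toy append-only log in which each consumer tracks its own offset. The sketch below is an illustration only, not Kafka's client API: `TopicLog`, `Consumer`, and the event names are hypothetical, and the log stands in for a single partition held in memory.

```python
from dataclasses import dataclass, field


@dataclass
class TopicLog:
    """Toy append-only log; a stand-in for one partition, not Kafka's API."""
    messages: list = field(default_factory=list)

    def append(self, message):
        """Append a message and return its offset."""
        self.messages.append(message)
        return len(self.messages) - 1

    def read(self, offset, max_messages=100):
        """Return up to max_messages starting at offset; the log is not mutated."""
        return self.messages[offset:offset + max_messages]


@dataclass
class Consumer:
    log: TopicLog
    offset: int = 0  # position of the next message this consumer will read

    def poll(self, max_messages=100):
        """Read the next batch and advance this consumer's offset."""
        batch = self.log.read(self.offset, max_messages)
        self.offset += len(batch)
        return batch

    def rewind(self, offset):
        """Move back to an earlier offset to re-read retained messages."""
        if not 0 <= offset <= len(self.log.messages):
            raise ValueError("offset out of range")
        self.offset = offset


log = TopicLog()
for event in ["page_view", "click", "search"]:
    log.append(event)

consumer = Consumer(log)
first_pass = consumer.poll()   # reads all three events
consumer.rewind(1)             # go back to the second event
replayed = consumer.poll()     # re-reads "click" and "search"
```

Because reading never removes messages, rewinding is just resetting an integer, which is what makes replay after a downstream failure cheap in this model.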

Page 13: The Big Data Analytics Ecosystem at LinkedIn

Transport Pipeline - Databus

Timeline consistent data change capture

Works with Oracle, MySQL, Espresso…
Transactional semantics
In-order, at least once delivery
Low latency
Has scaled to 100s of sources
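
In-order, at-least-once delivery means a consumer may see the same change more than once, so it is usually written to be idempotent. A minimal sketch of that idea follows, assuming each change event carries a monotonically increasing sequence number; the function `apply_changes`, the tuple layout, and the sample data are hypothetical and do not reflect the Databus client library.

```python
def apply_changes(events, state, last_applied_seq=0):
    """Apply change events idempotently.

    events: iterable of (seq, key, value) tuples in commit order; duplicates
    may appear because delivery is at least once. A value of None is a delete.
    Returns the highest sequence number applied.
    """
    for seq, key, value in events:
        if seq <= last_applied_seq:
            continue  # redelivered event, already applied
        if value is None:
            state.pop(key, None)
        else:
            state[key] = value
        last_applied_seq = seq
    return last_applied_seq


profiles = {}
delivered = [
    (1, "member:1", "Engineer"),
    (2, "member:2", "Analyst"),
    (2, "member:2", "Analyst"),   # duplicate after a retry
    (3, "member:1", "Manager"),
    (4, "member:2", None),        # delete
]
last_seq = apply_changes(delivered, profiles)

# Replaying part of the stream after a restart changes nothing.
last_seq_after_replay = apply_changes(delivered[2:], profiles, last_seq)
```

Persisting `last_seq` alongside the applied state is what lets a restarted consumer skip events it has already seen.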

Page 14: The Big Data Analytics Ecosystem at LinkedIn

Processing Pipeline: Camus

Camus: Framework for ingesting Kafka streams to HDFS

Page 15: The Big Data Analytics Ecosystem at LinkedIn

Camus: Features

Highly scalable due to adaptive input format
– Handled 10x increase in data volume without change
Restartable with checkpointing
Robust auditing support
Plays nicely with Hive and Pig
– Avro format support
– Hive metastore registration
Open source
– https://github.com/linkedin/camus
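
To make "restartable with checkpointing" concrete, here is a small sketch of an ingest loop that saves per-partition offsets after each run and resumes from them. It is not Camus code: `run_ingest`, the JSON checkpoint file, the partition names, and the in-memory `sink` (standing in for files written to HDFS) are all assumptions made for illustration.

```python
import json
import os
import tempfile


def load_checkpoint(path):
    """Return saved per-partition offsets, or an empty dict on first run."""
    if not os.path.exists(path):
        return {}
    with open(path) as f:
        return json.load(f)


def save_checkpoint(path, offsets):
    """Write to a temp file, then rename, so a crash cannot leave a partial file."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(offsets, f)
    os.replace(tmp, path)


def run_ingest(partitions, checkpoint_path, sink, batch_size=2):
    """Copy one batch per partition to sink, resuming from the checkpoint."""
    offsets = load_checkpoint(checkpoint_path)
    for name, messages in partitions.items():
        start = offsets.get(name, 0)
        batch = messages[start:start + batch_size]
        sink.extend(batch)               # stand-in for writing files to HDFS
        offsets[name] = start + len(batch)
    save_checkpoint(checkpoint_path, offsets)
    return offsets


partitions = {"p0": ["a", "b", "c"], "p1": ["x"]}
sink = []
with tempfile.TemporaryDirectory() as workdir:
    ckpt = os.path.join(workdir, "offsets.json")
    run_ingest(partitions, ckpt, sink)                  # first run
    final_offsets = run_ingest(partitions, ckpt, sink)  # resumes where it left off
```

In this sketch a crash between the sink write and the checkpoint write would cause the batch to be ingested again on restart, so a real pipeline would pair checkpointing with an audit or deduplication step.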

Page 16: The Big Data Analytics Ecosystem at LinkedIn

Processing Pipeline: Lumos

Lumos: Framework to ingest database data to HDFS

[Diagram: Lumos data flow. Labels shown: PROD Oracle, PROD Espresso, Databus, DB Extract, External Data, Inc/Full (internal), Staging Data (internal), Virtual Snapshot Materializer, Lazy Snapshot Materializer, Published Virtual Snapshot, Pig/Hive Loaders, DWH processes, Meta-Data, ETL Hadoop Cluster]

Page 17: The Big Data Analytics Ecosystem at LinkedIn

Lumos: Features

Supports Espresso, Oracle and MySQL as sources

Full snapshots and incremental dumps
Automatic type translation for most database types
Provides LAST UPDATE semantics for data
Supports low latency requirements
– Reader API performs just-in-time compaction
Snapshot constructed two ways:
– On demand compaction for upserts
– Periodic snapshotting that reflects deletes as well
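
On-demand compaction with LAST UPDATE semantics can be sketched as a merge of a base snapshot with incremental rows in which the most recently modified row per key wins. The code below is an assumption-laden illustration, not the Lumos reader API: `compact` and the row fields `id`, `title`, and `modified` are hypothetical, and, like the upsert path described above, it does not handle deletes.

```python
def compact(snapshot_rows, incremental_rows):
    """Merge rows so each key keeps the row with the greatest modified time.

    Rows are dicts with hypothetical fields "id" and "modified". On a tie,
    the later row in iteration order wins, so increments override the snapshot.
    """
    latest = {}
    for row in list(snapshot_rows) + list(incremental_rows):
        current = latest.get(row["id"])
        if current is None or row["modified"] >= current["modified"]:
            latest[row["id"]] = row
    return sorted(latest.values(), key=lambda r: r["id"])


snapshot = [
    {"id": 1, "title": "Engineer", "modified": 100},
    {"id": 2, "title": "Analyst", "modified": 100},
]
increments = [
    {"id": 1, "title": "Senior Engineer", "modified": 150},
    {"id": 3, "title": "Recruiter", "modified": 160},
    {"id": 1, "title": "Staff Engineer", "modified": 170},
]

view = compact(snapshot, increments)
titles = [row["title"] for row in view]
```

Doing this merge at read time trades some reader CPU for fresher data, since new increments are visible without waiting for the next full snapshot.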

Page 18: The Big Data Analytics Ecosystem at LinkedIn

Operational Support - Metadata

ETL pipeline is a complex graph of workflows

– Our comprehensive dashboard production flow is nearly 30 levels deep with complex dependencies

To manage this, we needed to capture:
– Process dependencies
– Data dependencies
– Process execution history
– Data load status
– Data consumption status (watermarks)
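
Once process dependencies are captured, the depth of a flow (the "levels deep" figure above) can be computed from the dependency graph. The following sketch uses Python's standard `graphlib` module (Python 3.9+); the function `workflow_levels` and the workflow names are hypothetical examples, not LinkedIn's actual flows.

```python
from graphlib import TopologicalSorter


def workflow_levels(dependencies):
    """Return the level of each workflow in a dependency graph.

    dependencies maps workflow -> set of upstream workflows it waits on.
    A workflow with no upstream is level 1; otherwise it is one level below
    its deepest upstream workflow.
    """
    levels = {}
    # static_order() yields every workflow after all of its upstreams.
    for node in TopologicalSorter(dependencies).static_order():
        upstream = dependencies.get(node, ())
        levels[node] = 1 + max((levels[u] for u in upstream), default=0)
    return levels


deps = {
    "load_events": set(),
    "load_members": set(),
    "sessionize": {"load_events"},
    "dashboard": {"sessionize", "load_members"},
}

levels = workflow_levels(deps)
depth = max(levels.values())
```

`TopologicalSorter` raises `graphlib.CycleError` if the captured dependencies contain a cycle, which is itself a useful sanity check on the metadata.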

Page 19: The Big Data Analytics Ecosystem at LinkedIn

Operational Metadata – v1

Capture process dependency graph
– Also capture useful metadata such as process owners
Capture stats for each execution of a workflow
– Time of execution
– Status
– Pointer to error logs
Has proved quite useful for monitoring critical chains

Page 20: The Big Data Analytics Ecosystem at LinkedIn

Operational Metadata – v2

For each flow, capture input and output data elements
For each execution, capture stats on data element, e.g.
– Number of records / lines read
– Number of records / lines written
– Error counts
– Last processed record
  – Can be time based or sequence based
  – This can be per flow as more than one flow can consume a data element
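
The per-execution stats above can be sketched as a small store keyed by (flow, data element), so that two flows reading the same element keep independent watermarks. Everything in this example is hypothetical: `ExecutionStats`, `MetadataStore`, the flow and element names, and the date-style integer watermarks are illustrations, and the sketch assumes the caller records stats only after a successful run.

```python
from dataclasses import dataclass


@dataclass
class ExecutionStats:
    records_read: int
    records_written: int
    error_count: int
    last_processed: int  # time-based or sequence-based watermark


class MetadataStore:
    """In-memory stand-in for an operational metadata repository."""

    def __init__(self):
        self._history = {}  # (flow, element) -> list of ExecutionStats

    def record(self, flow, element, stats):
        """Append the stats of one execution of a flow over a data element."""
        self._history.setdefault((flow, element), []).append(stats)

    def watermark(self, flow, element, default=0):
        """Return the last processed position for this flow and element."""
        runs = self._history.get((flow, element))
        return runs[-1].last_processed if runs else default


store = MetadataStore()
store.record("daily_dashboard", "page_views", ExecutionStats(1000, 990, 10, 20130916))
store.record("ad_hoc_report", "page_views", ExecutionStats(500, 500, 0, 20130910))

dashboard_mark = store.watermark("daily_dashboard", "page_views")
report_mark = store.watermark("ad_hoc_report", "page_views")
new_flow_mark = store.watermark("new_flow", "page_views")
```

A job would read its watermark at startup and process only data beyond it, which is the basis for the restartable and catch-up behavior.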

Page 21: The Big Data Analytics Ecosystem at LinkedIn

Operational Metadata – The Payoff

Restartable ETL jobs
– Process new data since the last successful run
Catch up mode for ETL jobs
– Single run can consume data from multiple intervals in one batch
– Next run will resume from correct place
Data freshness and availability dashboard
Coarse form of data lineage
– Impact analysis for unfortunately all-too-common changes upstream
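
Impact analysis over coarse lineage amounts to walking downstream from a changed data element through the captured inputs and outputs of each flow. The sketch below assumes flows are described as a mapping from flow name to its input and output elements; the function `impacted` and all flow and element names are hypothetical.

```python
from collections import deque


def impacted(flows, changed_element):
    """Find flows and data elements downstream of a changed element.

    flows maps flow name -> (set of input elements, set of output elements).
    Returns (affected flow names, affected data elements).
    """
    affected_flows, affected_elements = set(), set()
    queue = deque([changed_element])
    while queue:
        element = queue.popleft()
        for name, (inputs, outputs) in flows.items():
            if element in inputs and name not in affected_flows:
                affected_flows.add(name)
                for out in outputs:
                    if out not in affected_elements:
                        affected_elements.add(out)
                        queue.append(out)  # keep walking downstream
    return affected_flows, affected_elements


flows = {
    "ingest_members": ({"member_db"}, {"members_raw"}),
    "build_dim": ({"members_raw"}, {"dim_member"}),
    "dashboard": ({"dim_member", "page_views"}, {"exec_dashboard"}),
    "ads_report": ({"ad_clicks"}, {"ads_summary"}),
}

hit_flows, hit_elements = impacted(flows, "members_raw")
```

The lineage is coarse because it works at the level of whole data elements, not individual columns or records.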

Page 22: The Big Data Analytics Ecosystem at LinkedIn

Putting it all Together

Online Data Stores: Espresso, Voldemort
Data Transport Pipelines: Kafka, Databus
Data Processing Pipelines: Camus, Lumos
Offline Storage / Compute: Hadoop, Teradata
Analytics Applications
Operational Metadata and Tooling

Page 23: The Big Data Analytics Ecosystem at LinkedIn

`whoami`

Sr. Manager / DWH Architect @ LinkedIn since 2011

Prior to that:
– Director of Engineering at Digg
– Enterprise Data Architect at eBay

www.linkedin.com/in/rajappaiyer/

Page 24: The Big Data Analytics Ecosystem at LinkedIn

Questions?

More at data.linkedin.com
We’re Hiring