Karthik Ranganathan, former lead engineer at Facebook and now at Nutanix, uses Facebook as a business use case to explain modern data centers in his presentation.
1

How Big Data Technologies Power Facebook
Kannan Muthukkaruppan & Karthik Ranganathan (Jun/20/2013)
Presented by Karthik Ranganathan, September 2013
2
Introduction
Email: [email protected]
Twitter: @KarthikR
Current: Member of Technical Staff, Nutanix
Background: Technical Engineering Lead at Facebook. Co-built Cassandra for Facebook Inbox Search and improved the performance and resiliency of HBase for Facebook Messages and Search Indexing.
3
Agenda
• Big data at Facebook
• HBase use cases
  – OLTP
  – Analytics
• Operating at scale
• The Nutanix solution
4
Big Data at Facebook
OLTP
• User databases (MySQL)
• Photos (Haystack)
• Facebook Messages, Operational Data Store (HBase)

Warehouse
• Hive

Analytics
• Graph Search Indexing
5
HBase in a nutshell
• Apache project, modeled after BigTable
• Distributed, large-scale data store
• Built on top of Hadoop DFS (HDFS)
• Efficient at random reads and writes
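To make the data model above concrete, here is a minimal in-memory sketch of the BigTable/HBase idea: cells are addressed by (row, column family, qualifier) and carry versioned, timestamped values, with reads returning the newest version. The class and method names are illustrative, not HBase's actual API.

```python
import time

class MiniHBaseTable:
    """Toy model of HBase's data model: (row, cf, qualifier) -> versioned values."""

    def __init__(self):
        self._cells = {}  # (row, cf, qualifier) -> list of (ts, value), newest first

    def put(self, row, cf, qualifier, value, ts=None):
        ts = ts if ts is not None else time.time()
        versions = self._cells.setdefault((row, cf, qualifier), [])
        versions.append((ts, value))
        versions.sort(key=lambda v: -v[0])  # keep newest version first

    def get(self, row, cf, qualifier):
        versions = self._cells.get((row, cf, qualifier))
        return versions[0][1] if versions else None  # latest version wins

table = MiniHBaseTable()
table.put("user1", "info", "name", "alice", ts=1)
table.put("user1", "info", "name", "alice v2", ts=2)
print(table.get("user1", "info", "name"))  # -> alice v2
```

Real HBase additionally sorts rows lexicographically and shards them into regions, which is what makes the random reads and writes mentioned above efficient at scale.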
6
FB’s Largest HBase Application
Facebook Messages
7
The New Facebook Messages
8
Why HBase?
Evaluated a bunch of different options
• MySQL, Cassandra, building a custom storage system for messages

• Horizontal scalability
• Automatic failover and load balancing
• Optimized for write-heavy workloads
• HDFS already battle-tested at Facebook
• HBase’s strong consistency model
9
Quick stats (as of Nov 2011)
Traffic to HBase
• Billions of messages per day
• 75B+ RPCs per day

Usage pattern
• 55% reads, 45% writes
• Average write: 16 key-values (KVs) across multiple column families (CFs)
10
Data Sizes
7PB+ online data
• ~21PB with replication
• LZO compressed
• Excludes backups

Growth rate
• 500TB+ per month
• ~20PB of raw disk per year!
11
Growing with size
Constant need for new features as the system grows

Read and write path improvements
• Performance optimizations
• IOPS reduction
• New database file format

Intelligent data and compute placement
• Shard-level block placement
• Locality-based load balancing
12
Other OLTP use cases of HBase
• Operational Data Store
• Multi-tenant KeyValue store
• Site integrity – fighting spam
13
Warehouse use cases of HBase
Graph Search Indexing
• Complex application logic
• Multiple verticals

Hive over HBase
• Real-time data ingest
• Enables real-time analytics
14
Real-time monitoring and anomaly detection
Operational Data Store
15
ODS: Facebook’s #1 Debugging Tool
Collects metrics from production servers
Supports complex aggregations and transformations
Really well-designed UI
16
Quick stats
Traffic to HBase
• 150B+ ops per day

Usage pattern
• Heavy reads of recent data
• Frequent MR jobs for rollups
• TTL to expire older data
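The ODS pattern above can be sketched in a few lines: raw metric points carry a TTL and eventually expire, while periodic rollup jobs fold them into coarser buckets that are kept longer. The function names and bucket sizes here are assumptions for illustration, not ODS's actual interface.

```python
import time

def rollup(points, bucket_secs=60):
    """Aggregate (timestamp, value) points into per-bucket averages."""
    buckets = {}
    for ts, value in points:
        bucket = int(ts // bucket_secs) * bucket_secs
        total, count = buckets.get(bucket, (0.0, 0))
        buckets[bucket] = (total + value, count + 1)
    return {b: total / count for b, (total, count) in buckets.items()}

def expire(points, ttl_secs, now=None):
    """Drop raw points older than the TTL, as HBase's TTL would."""
    now = now if now is not None else time.time()
    return [(ts, v) for ts, v in points if now - ts <= ttl_secs]

points = [(0, 10.0), (30, 20.0), (90, 30.0)]
print(rollup(points, bucket_secs=60))  # -> {0: 15.0, 60: 30.0}
```

In HBase itself the TTL is a column-family setting and the rollups run as MapReduce jobs; the logic is the same shape as this sketch.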
17
Real-time Analytics
Facebook Insights
18
Real-time URL/Domain Insights
Deep analytics for websites
• Facebook widgets

Massive scale
• Billions of URLs
• Millions of increments/sec
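One common way to absorb millions of increments per second, sketched here under assumed names, is to buffer counts in memory per (URL, metric) and flush them to the store in batches, turning many tiny writes into fewer larger ones. This is an illustration of the general technique, not Facebook Insights' actual code.

```python
from collections import Counter

class BatchedCounters:
    """Buffer increments in memory and flush them to a backing store in batches."""

    def __init__(self, store, flush_every=1000):
        self.store = store            # stand-in for an HBase counter table
        self.flush_every = flush_every
        self.pending = Counter()      # (url, metric) -> buffered delta
        self.seen = 0

    def incr(self, url, metric, delta=1):
        self.pending[(url, metric)] += delta
        self.seen += 1
        if self.seen >= self.flush_every:
            self.flush()

    def flush(self):
        # one write per distinct (url, metric) instead of one per increment
        for key, delta in self.pending.items():
            self.store[key] = self.store.get(key, 0) + delta
        self.pending.clear()
        self.seen = 0

store = {}
counters = BatchedCounters(store, flush_every=2)
counters.incr("example.com/a", "clicks")
counters.incr("example.com/a", "likes")   # second increment triggers a flush
print(store[("example.com/a", "clicks")])  # -> 1
```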
19
Detailed Insights
Tracks many metrics
• Clicks, likes, shares, impressions
• Referral traffic

Detailed breakdown
• Age buckets, gender, location
20
Controlled Multi-tenancy
Generic KeyValue Store
21
A Multi-tenant solution on HBase
Generic Key-Value store
• Multiple apps on the same cluster
• Transparent schema design
• Simple API:

put(appid, key, value)
value = get(appid, key)
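A minimal sketch of the API above: every key is namespaced by its appid so tenants share one backing store without colliding, and a per-app kill switch (mentioned later in the deck) can shut off one tenant's traffic without touching the others. The class and its internals are illustrative assumptions, not Facebook's implementation.

```python
class MultiTenantKV:
    """Toy multi-tenant key-value store: one shared store, keys namespaced by appid."""

    def __init__(self):
        self._store = {}        # stand-in for a single shared HBase table
        self._disabled = set()  # appids currently switched off

    def put(self, appid, key, value):
        if appid in self._disabled:
            raise RuntimeError(f"app {appid!r} is disabled")
        self._store[(appid, key)] = value  # appid prefix isolates tenants

    def get(self, appid, key):
        if appid in self._disabled:
            raise RuntimeError(f"app {appid!r} is disabled")
        return self._store.get((appid, key))

    def kill_switch(self, appid):
        """Disable one misbehaving app without affecting other tenants."""
        self._disabled.add(appid)

kv = MultiTenantKV()
kv.put("spam_fighter", "k1", "v1")
print(kv.get("spam_fighter", "k1"))  # -> v1
```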
22
Architecture
[Architecture diagram: writes go via put(appid, key, value) directly to HBase; reads via get(appid, key) are served through Memcache, which sits in front of HBase.]
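The read/write split in the architecture above can be sketched as a cache-aside pattern: writes go straight to HBase and invalidate the cached copy, while reads try Memcache first and fall back to HBase on a miss. The dict-backed stores here are stand-ins for the real systems, and the exact invalidation policy is an assumption.

```python
hbase = {}     # stand-in for the HBase table
memcache = {}  # stand-in for the Memcache tier

def put(appid, key, value):
    hbase[(appid, key)] = value
    memcache.pop((appid, key), None)  # invalidate any stale cached copy

def get(appid, key):
    cached = memcache.get((appid, key))
    if cached is not None:
        return cached                   # cache hit: HBase not touched
    value = hbase.get((appid, key))     # cache miss: read from HBase
    if value is not None:
        memcache[(appid, key)] = value  # populate cache for the next read
    return value

put("insights", "url1", {"clicks": 3})
print(get("insights", "url1"))  # -> {'clicks': 3} (read from HBase, now cached)
```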
23
Multi-tenancy Issues
Not a self-service model
• Each app is reviewed

Global and per-app metrics
• Monitor RPCs by type, latencies, errors
• Friendly names for apps

If things go wrong
• Per-app kill switch
24
Powering FB’s Semantic Search Engine
Graph Search Indexing
25
Framework to build search indexes
Multiple, independent input sources HBase stores document info Output is the search index image
rowKey = document idvalue = terms, document
data
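The row layout above (rowKey = document id, value = terms plus document data) can be folded into an inverted index by scanning the table and mapping each term to the documents containing it. This is an illustrative sketch of that data flow; in practice the scan runs as a MapReduce job over the HBase cluster, and these names are assumptions.

```python
from collections import defaultdict

# rowKey = document id, value = terms + document data, per the slide
doc_table = {
    "doc1": {"terms": ["big", "data"], "data": "..."},
    "doc2": {"terms": ["data", "hbase"], "data": "..."},
}

def build_index(table):
    """Invert document rows into term -> sorted list of document ids."""
    index = defaultdict(list)
    for doc_id in sorted(table):          # deterministic scan order
        for term in table[doc_id]["terms"]:
            index[term].append(doc_id)
    return dict(index)

print(build_index(doc_table)["data"])  # -> ['doc1', 'doc2']
```

The resulting mapping is the "search index image" of the slide: a serialized form of this term-to-documents structure shipped to the search serving tier.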
26
Architecture
[Architecture diagram: multiple document sources (source 1, source 2, …) feed an HBase cluster; an MR cluster reads from HBase and produces the search index image files.]
27
Do’s and Do-Not’s From Experience
Operating at Scale
28
Design for failures(!)
Architect for failures and manageability

No single point of failure
• Killing any process is legit

Minimize manual intervention
• Especially for frequent failures

Uptime is important
• Rolling upgrades are the norm
• Need to survive rack failures
29
Dashboard and Metrics
Single place to graph/report everything
• RPC calls
• SLA misses
• Latencies, p99, errors
• Per-request profiling
• Cluster and node health
• Network utilization
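The p99 figure on dashboards like the one above is just a percentile over a window of request latencies. As a small sketch, here is the nearest-rank method; real monitoring systems typically use streaming approximations instead, so treat this as illustrative.

```python
def percentile(samples, pct):
    """Nearest-rank percentile: the ceil(pct/100 * n)-th smallest sample."""
    ordered = sorted(samples)
    rank = max(1, -(-len(ordered) * pct // 100))  # ceiling division
    return ordered[int(rank) - 1]

latencies_ms = list(range(1, 101))  # one request at each of 1..100 ms
print(percentile(latencies_ms, 99))  # -> 99
```

p99 is reported alongside the average because a handful of slow requests (tail latency) can hurt users badly while barely moving the mean.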
30
Health Checks
Constantly monitor nodes

Auto-exclude nodes on failure
• Machine not ssh-able
• Hardware failures (HDD failure, etc.)
• Do NOT exclude on rack failures

Auto-include nodes once repaired
Rate limit remediation of nodes
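The health-check rules above can be sketched as a small state machine: failed nodes are auto-excluded and auto-included once repaired, but exclusions are rate limited so a correlated event (such as a rack failure) cannot remove a large fraction of the cluster at once. The class, thresholds, and method names are illustrative assumptions.

```python
class HealthChecker:
    """Toy auto-exclusion with a cap on how many nodes may be excluded at once."""

    def __init__(self, nodes, max_excluded_fraction=0.1):
        self.nodes = set(nodes)
        self.excluded = set()
        self.max_excluded = max(1, int(len(nodes) * max_excluded_fraction))

    def report_failure(self, node):
        # rate limit: refuse to exclude beyond the configured fraction,
        # which is what protects against reacting to a whole-rack failure
        if len(self.excluded) >= self.max_excluded:
            return False  # node stays in; escalate for manual attention
        self.excluded.add(node)
        return True

    def report_repaired(self, node):
        self.excluded.discard(node)  # auto-include once fixed

hc = HealthChecker([f"node{i}" for i in range(20)], max_excluded_fraction=0.1)
print(hc.report_failure("node3"))  # -> True
print(hc.report_failure("node7"))  # -> True
print(hc.report_failure("node9"))  # -> False (exclusion cap of 2 reached)
```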
31
In a nutshell…
• Use commodity hardware
• Scaling out is #1
• Efficiency is #2 (though pretty close behind scale-out)
• Design for failures: frequent failures must be auto-handled
• Metrics, Metrics, Metrics!
32
Overview through comparison
The Nutanix Solution
33
Nutanix compared with HBase
Evaluated a bunch of different options
• MySQL, Cassandra, building a custom storage system for messages

Horizontal scalability
• Just add more nodes to scale out

Automatic failover and load balancing
• When a node goes down, others take its place automatically
• Load of the node that went down is distributed across many others
34
Nutanix compared with HBase philosophy
Optimized for write-heavy workloads
• Nutanix is optimized for virtualized environments, with read- and write-heavy workloads
• Transparent use of flash to boost performance

HDFS already battle-tested at Facebook
• Nutanix is also quite battle-tested

HBase’s strong consistency model
• Nutanix is also strongly consistent
35
Other aspects of Nutanix
Architected for failures and manageability
• No single point of failure
• Minimal manual intervention for frequent failures

Uptime is important
• Rolling upgrades are the norm
• Need to survive rack failures

Single place to graph/report everything
• Prism UI to report on and manage the entire cluster

Constantly monitor nodes
• Auto-exclude nodes on failure
36
In a nutshell about Nutanix…
Runs on commodity hardware

Scaling out is #1
• Drop-in scale-out for nodes

Efficiency is #2
• Constant work on perf improvements

Design for failures
• Frequent failures auto-handled
• Alerts in UI for many other states

Metrics, Metrics, Metrics!
• Prism UI gives insights into cluster health
37
Questions?
38
Thank You
NUTANIX INC. – CONFIDENTIAL AND PROPRIETARY