View
439
Download
1
Category
Preview:
Citation preview
EverestScaling to Petabytes
Yahoo!
May 2008
2
Everest Architecture
Clients
Query Server
Storage Server
• Massively Parallel (Tens of PB)– Commodity Clusters
– Multi-tier scalability– Distributed Columnar Storage
• Smart– Optimized compression– Parallel Vector Query Processing– Query and Storage optimizations
– Query Expression and Columnar caching
• Leverage PostgreSQL– Tools and Connectivity (ODBC)– extensibility– UDF & UDAF framework
• Inexpensive– COTS
PostgreSQL Lib
Everest Extensions
Distributed QP
PostgreSQL Server
Node Storage Manager
Volume
Volume
Volume
PERL/Ruby.DBI ADO.NETODBC
Segment Manager
Scripts /Apps
PQLib
PgAdmin
Segmentation Platform
Logical Storage Manager
Storage Provider
Storage Proxy
Asynchronous Communications
Trans Server
QP
Shared Memory
Trans Proxy Mgmt Proxy LSM Proxy
MgmtServices
Storage Cache
Storage Provider
Storage Proxy
Asynchronous Communications
Storage Cache
Chunk Storage
Volume Storage Manager
Volume Metadata
Storage Services
3
Performance and Scale
Data size Everest (min)
Vendor A (min)
Vendor B (min)
90 TB(600 B rows)
177 414 325
30 TB(200 B rows)
60 95 91
HW Cost (1 PB)
250 1200 1200
• Proven Petabytes scale in production– Approaching 2 PB, projected to grow > 30 PB by 2009– Largest table: 3.5 Trillion rows (time partitioned)
• 10x Price-Performance relative to commercial systems
Performance comparison
0
100
200
300
400
500
30 TB 90 TB
Data size
Res
po
nse
Tim
e (m
in)
Vendor A
Vendor B
Everest
4
Everest Performance Advantages
• Source of Performance and Scale– Distributed Compressed Columnar Storage– Highly Parallel and Asynchronous
• Multi threaded Query Execution as well as Storage
– Vector Query Processing– Multi-level data partitioning and query partitioning– Cluster-level Compressed Columnar caching– Query expression caching– Yahoo! specific language extensions and UDF & UDAF
Recommended