UNIFY DATA AT MEMORY SPEED
Haoyuan (HY) Li @ AMPLab End of Project Celebration
November 18th, 2016
HISTORY
Trex 12-13
Tachyon 13-15
Alluxio 15-
FASTEST-GROWING BIG DATA PROJECT
• Fastest growing open-source project in the big data ecosystem
• 400+ contributors from 100+ organizations
• Running in the world’s largest production clusters
• Join the community!
CURRENT STATUS
FOUNDER
Haoyuan Li, CEO
Alluxio (formerly Tachyon) co-creator; joined the AMPLab Ph.D. program in 2011
INVESTOR
TEAM
From AMD, Dell, Google, Palantir, Uber, Yahoo; experts in distributed systems
MSs and PhDs in CS from CMU, Stanford, UC Berkeley
Top 10 committers of the Alluxio open source project
We are hiring!
COMPANY
Founded 2015
BIG DATA ECOSYSTEM YESTERDAY
BIG DATA ECOSYSTEM TODAY
BIG DATA ECOSYSTEM ISSUES
BIG DATA ECOSYSTEM WITH ALLUXIO
Enabling applications to access data from any storage system at memory speed
Application-facing interfaces:
• Native File System
• Hadoop Compatible File System
• Native Key-Value Interface
• FUSE Compatible File System
Storage-facing interfaces:
• HDFS Interface
• Amazon S3 Interface
• Swift Interface
• GlusterFS Interface
WHY ALLUXIO
Co-located compute and data with memory-speed access to data
Virtualized across different storage systems under a unified namespace
Scale-out architecture
File system API, software only
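As a rough illustration of the "unified namespace" point above, the sketch below maps paths in a single namespace onto different backing stores using longest-prefix matching. It is a toy model, not Alluxio's API: the `UnifiedNamespace` class, its `mount` and `resolve` methods, and the HDFS and S3 URIs are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class UnifiedNamespace:
    """Toy mount table: maps paths in one namespace to URIs in backing stores."""

    mounts: Dict[str, str] = field(default_factory=dict)

    def mount(self, mount_point: str, storage_uri: str) -> None:
        # Normalize so "/logs/" and "/logs" refer to the same mount point.
        self.mounts[mount_point.rstrip("/") or "/"] = storage_uri.rstrip("/")

    def resolve(self, path: str) -> str:
        # Pick the longest mount point that is a prefix of the path.
        best = None
        for point in self.mounts:
            prefix = "/" if point == "/" else point + "/"
            if path == point or path.startswith(prefix):
                if best is None or len(point) > len(best):
                    best = point
        if best is None:
            raise KeyError(f"no mount covers {path}")
        suffix = path[len(best):].lstrip("/")
        base = self.mounts[best]
        return f"{base}/{suffix}" if suffix else base


ns = UnifiedNamespace()
ns.mount("/", "hdfs://namenode:9000/warehouse")  # hypothetical HDFS root
ns.mount("/logs", "s3://example-bucket/logs")    # hypothetical S3 bucket
print(ns.resolve("/tables/users.parquet"))
print(ns.resolve("/logs/2016/11/18.gz"))
```

An application written against the single namespace only sees paths such as `/logs/...`; which storage system serves them is decided by the mount table.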
ALLUXIO BENEFITS
Unification: New workflows across any data in any storage system
Performance: Orders of magnitude improvement in run time
Flexibility: Choice in compute and storage; grow each independently, buy only what is needed
TRUSTED BY WORLD-LEADING COMPANIES
ALLUXIO USE CASES
Accelerating I/O to and from remote storage
Managing data across disparate storage systems
Sharing data across workloads at memory speed
ACCELERATE I/O TO/FROM REMOTE STORAGE
Baidu’s PMs and analysts run interactive queries to gain insights into their products and business
• 200+ node deployment
• 2+ petabytes of storage
• Mix of memory + HDD
Alluxio over Baidu File System
The performance was amazing. With Spark SQL alone, it took 100-150 seconds to finish a query; using Alluxio, where data may hit local or remote Alluxio nodes, it took 10-15 seconds. - Baidu
RESULTS
• Data queries are now 30x faster with Alluxio
• The Alluxio cluster runs stably, providing over 50TB of RAM space
• With Alluxio, batch queries that usually lasted over 15 minutes became interactive queries taking less than 30 seconds
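As a quick check of the figures quoted above, the speedups can be worked out directly. This is only arithmetic on the numbers in the slide; pairing the ends of the two time ranges is an assumption, and the `speedup` helper is a name introduced here for the example.

```python
def speedup(before_seconds: float, after_seconds: float) -> float:
    """Return how many times faster the 'after' run is than the 'before' run."""
    return before_seconds / after_seconds


# Interactive query times from the Baidu quote, in seconds.
spark_sql_alone = (100, 150)
with_alluxio = (10, 15)

# Assumption: compare like with like (fast vs. fast, slow vs. slow).
typical = [speedup(b, a) for b, a in zip(spark_sql_alone, with_alluxio)]

# Extremes: fastest baseline vs. slowest Alluxio run, and the reverse.
worst_case = speedup(spark_sql_alone[0], with_alluxio[1])
best_case = speedup(spark_sql_alone[1], with_alluxio[0])

# Batch figure: over 15 minutes down to under 30 seconds.
batch_lower_bound = speedup(15 * 60, 30)

print(f"typical: {typical[0]:.0f}x to {typical[1]:.0f}x")
print(f"range: {worst_case:.1f}x to {best_case:.1f}x")
print(f"batch: at least {batch_lower_bound:.0f}x")
```

On these numbers, the quoted query times correspond to roughly a 10x improvement, while the 30x bullet appears to line up with the batch-to-interactive figure.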
SHARE DATA ACROSS JOBS @ MEMORY SPEED
Barclays uses queries and machine learning to train models for risk management
• 6-node deployment
• 1TB of storage
• Memory only
Alluxio over a relational database
Thanks to Alluxio, we now have the raw data immediately available at every iteration and we can skip the costs of loading in terms of time waiting, network traffic, and RDBMS activity. - Barclays
RESULTS
• Barclays workflow iteration time decreased from hours to seconds
• Alluxio enabled workflows that were impossible before
• By keeping data only in memory, the I/O cost of loading and storing in Alluxio is now on the order of seconds
MANAGE DATA ACROSS STORAGE SYSTEMS
Qunar uses real-time machine learning for their website ads.
• 200+ node deployment
• 6 billion logs (4.5 TB) daily
• Mix of memory + HDD
We’ve been running Alluxio in production for over 9 months. Alluxio’s unified namespace enables different applications and frameworks to easily interact with data from different storage systems. - Qunar
RESULTS
• Efficient data sharing among Spark Streaming, Spark batch and Flink jobs
• Improved system performance, with 15x to 300x speedups
• The tiered storage feature manages storage resources including memory, SSD and disk
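The tiered storage idea in the last bullet can be sketched as a small cache in which new blocks land in the fastest tier and the least recently written block is pushed down a tier when space runs out. This is a simplified model under stated assumptions, not Alluxio's implementation: the `TieredStore` class, the block-count capacities, and the eviction rule are all chosen here for illustration.

```python
from collections import OrderedDict


class TieredStore:
    """Toy tiered cache: blocks cascade from faster tiers to slower ones."""

    def __init__(self, capacities):
        # capacities: list of (tier_name, max_blocks), fastest tier first.
        self.tiers = [(name, cap, OrderedDict()) for name, cap in capacities]

    def put(self, block_id, data):
        # Drop any stale copy, then place the block in the fastest tier.
        for _, _, blocks in self.tiers:
            blocks.pop(block_id, None)
        self._place(block_id, data, 0)

    def _place(self, block_id, data, level):
        if level >= len(self.tiers):
            return  # fell off the slowest tier: no longer cached
        _, cap, blocks = self.tiers[level]
        blocks[block_id] = data
        if len(blocks) > cap:
            # Evict the oldest block in this tier to the next tier down.
            victim_id, victim = blocks.popitem(last=False)
            self._place(victim_id, victim, level + 1)

    def tier_of(self, block_id):
        for name, _, blocks in self.tiers:
            if block_id in blocks:
                return name
        return None


store = TieredStore([("MEM", 2), ("SSD", 2), ("HDD", 2)])
for block in ["a", "b", "c"]:
    store.put(block, b"payload")
print(store.tier_of("a"), store.tier_of("c"))
```

With a memory tier of two blocks, writing a third block pushes the oldest one down to SSD while the newest stays in memory.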
ALLUXIO, INC PRODUCT OFFERINGS
Axis: Capability/Value
Alluxio Open Source: Technology Validation
• Open Source
Alluxio Community Edition (ACE): Accelerate Adoption
• Alluxio Manager
• Open Source
Alluxio Enterprise Edition (AEE): Enterprise Deployment
• Kerberos Authentication • Data Replication • Support
• Alluxio Manager
• Open Source
GOING INTO THE FUTURE
Congrats to the AMPLab!
Thank you!
Contact: [email protected] or [email protected]