© 2017 MapR Technologies MapR Confidential 1
Distributed Deep Learning on MapR Converged Data Platform
Powered By NVIDIA GPUs
Dong Meng, Data Scientist, MapR Technologies ([email protected])
Roadmap
• Enterprise Big Data Journey
• Distributed Deep Learning
• Apply Container Technology and NVIDIA GPUs
• Demos
Enterprise Big Data Journey
Great buildings have great foundations
Big Data Technology Development
• 2003: Google published "The Google File System"
• 2004: Google published the "MapReduce" paper
• 2008: Hadoop became a top-level Apache project
• 2009: Facebook created Hive
• 2010: Facebook implemented HBase
• 2011: LinkedIn presented Kafka
• 2014: Spark 1.0 was released
• 2014: Google released Kubernetes (written in Go), drawing on its experience running Borg
MapR Development
• 2011: Industrial-grade data platform for big data analytics 1.0 – MapR File System
• 2012: Industry's first visual big data ops dashboard in MapR Control System
• 2013: Industrial-grade NoSQL key-value store – MapR-DB (implements the HBase API)
• 2014: Global multi-datacenter replication; Fast Ingest 1.0
• 2015: Schema-free SQL engine for big data – Apache Drill; global table replication
• 2016: Global streaming – MapR Streams; JSON document DB; Fast Ingest 2.0; Spyglass monitoring
• 2017: Persisted data access for Docker containers; MapR Edge for edge computing
MapR Converged Data Platform
Files, tables, and streams together on the same platform:
• Engines: Event Data Streams, Analytics & Machine Learning Engines, Operational Database, Cloud-scale Data Store
• Shared services: High Availability, Real Time, Security & Governance, Multi-tenancy, Disaster Recovery, Global Namespace
• Converge-X™ Data Fabric
• Runs on-premise, in the cloud, or hybrid
An Enterprise Foundation to Operationalize Data
Supports open-source APIs (HDFS API, POSIX/NFS, HBase API, JSON API, Kafka API) with patented speed, scale, and reliability:
• Serves both next-gen applications and existing enterprise applications
• Engines: Event Data Streams, Analytics & Machine Learning Engines, Operational Database, Cloud-scale Data Store
• Shared services: High Availability, Real Time, Security & Governance, Multi-tenancy, Disaster Recovery, Global Namespace
• Converge-X™ Data Fabric in the data center, on-premise, in the cloud, or hybrid
From a User Perspective
A cluster contains volume mount points; volumes contain directories; directories hold files, tables, and streams.
What we have observed:
• Enterprise customers are gaining real value from big data platforms as the technologies mature
• Data pipelines are moving from batch to real-time streaming
• The volume of data collected is growing from TB to PB to EB
• Customers are more motivated than ever to build intelligent applications on the data they collect
• Interest in machine learning is rising, and customers are adding GPU cards to their clusters
Distributed Deep Learning
More data beats complex algorithms
("The Unreasonable Effectiveness of Data," published by Google researchers)
Distributed Machine Learning
• Mahout: MapReduce-based algorithms
• GraphLab: graph-based parallel framework
• Spark: in-memory dataflow system
• H2O: distributed machine learning library and server
Why Deep Learning
• A machine learning algorithm's performance depends on the representation of the data it is given (feature engineering)
• Finding the right representations (features) is difficult
• Deep learning:
  – Learns multiple levels of representation
  – Learns high-level abstractions
• Enabled by the growth of data volumes and computing power (GPUs)
Distributed Deep Learning System
From a system design perspective:
• Distributed data storage: HDFS, MapR-FS, Ceph
• Consistency: parameter server
• Fault tolerance: checkpoint and reload
• Resource management: YARN, Mesos, Kubernetes
• Programming model: TensorFlow, Apache MXNet, Torch, Caffe
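The checkpoint-reload approach to fault tolerance can be sketched in a few lines. This is a pure-Python toy, not tied to MapR or any framework; the file layout and parameter dict are hypothetical stand-ins for real model state, but the atomic-rename pattern is the standard way to avoid a worker reloading a half-written checkpoint.

```python
import os
import pickle
import tempfile

def save_checkpoint(path, step, params):
    """Atomically persist training state so a restarted worker can resume."""
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump({"step": step, "params": params}, f)
    os.replace(tmp, path)  # atomic rename: readers never see a partial file

def load_checkpoint(path):
    """Return (step, params), or fresh state if no checkpoint exists yet."""
    if not os.path.exists(path):
        return 0, {"w": 0.0}
    with open(path, "rb") as f:
        state = pickle.load(f)
    return state["step"], state["params"]

# Simulated training loop that survives a restart.
ckpt = os.path.join(tempfile.mkdtemp(), "model.ckpt")
step, params = load_checkpoint(ckpt)
for step in range(step, 5):
    params["w"] += 0.1                    # stand-in for a gradient update
    save_checkpoint(ckpt, step + 1, params)

step2, params2 = load_checkpoint(ckpt)    # "reload" after a crash
print(step2, round(params2["w"], 1))
```

Rerunning the loop after a simulated failure picks up at the saved step rather than step 0, which is all checkpoint-reload fault tolerance requires of the storage layer.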
Distributed Deep Learning Models
Data parallelism:
• Partition the data
• Each worker trains a full model replica on its own mini-batches
Model parallelism:
• Partition the model
• Assign the training work for different layers to different devices
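The data-parallel scheme above can be illustrated with a minimal NumPy sketch: the batch is partitioned across simulated workers, each computes a gradient on its shard, and the averaged gradient drives the update. With equal-sized shards this is mathematically identical to one large-batch gradient step; the linear-regression model is just a hypothetical stand-in for a deep network.

```python
import numpy as np

def worker_grad(w, X, y):
    """One worker: gradient of mean squared error on its data shard."""
    return 2 * X.T @ (X @ w - y) / len(y)

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w

w = np.zeros(3)
lr = 0.1
for _ in range(200):
    # Data parallelism: partition the batch across 4 "workers",
    # compute shard gradients independently, then average them.
    shards = np.array_split(np.arange(len(y)), 4)
    grads = [worker_grad(w, X[i], y[i]) for i in shards]
    w -= lr * np.mean(grads, axis=0)

print(np.round(w, 2))
```

In a real system the shard gradients would be computed on separate GPUs or nodes and combined by a parameter server or all-reduce step; only the communication changes, not the math.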
Distributed Deep Learning System
From a user perspective:
• Ease of expression: for lots of crazy ML ideas/algorithms
• Scalability: can run big experiments quickly
• Portability: can run on a wide variety of platforms
• Reproducibility: easy to share and reproduce research
• Production readiness: go from research to real products
Apply Container Technology and NVIDIA GPUs
Deep Learning Development Environment
Individual research:
• Laptop / dev boxes
Research lab setting:
• HPC clusters, high-speed job execution, small teams
Enterprise deep learning system:
• Distributed data storage, consistency, fault tolerance, resource management, programming model, versioning and deployment
• Containers, Kubernetes, MapR Data Platform
Containers Are Great for Deployment and Research
Advantages:
• Reproducible work environments
• Ease of deployment
• Isolation of individual workspaces
• Run across heterogeneous environments
• Facilitate collaboration
Stateful Containers for Deep Learning
(Diagram: containerized workspaces backed by persistent storage holding audio data, video feeds, and sensor data)
Advantages:
• Containerized workspaces
• Keep your work between sessions
• Manage work across many projects
• Work with versioned datasets and models
• Share work across containers, projects, and/or teams
Microservices for Production Deep Learning
(Diagram: microservices built on event streams and databases)
Advantages:
• Deploy models to production as microservices
• Use files, real-time streams, and databases in production
• Scale horizontally
• Support both real-time and batch
• May or may not be stateful
Kubernetes for Container Orchestration
• Deploy once, run many times
• Roll out new versions / roll back to old versions
• Dynamic scheduling and elastic scaling
• Automatic fault recovery and scalable computing
• Isolation and quotas
• Manage GPU resources
• Mount MapR volumes as persistent volumes
Kubernetes in a Heterogeneous GPU Cluster
• Kubernetes master: CPU-only
• Workers: NVIDIA driver, CUDA, cuDNN
• The kubelet config on GPU workers needs updating:
  --feature-gates=Accelerators=true
• Note: containers also need to be configured for GPU support separately
(Diagram: Frederic Tausch on GitHub)
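A 2017-era pod spec requesting a GPU might look like the following sketch (shown here as a Python dict for illustration). With the Accelerators feature gate enabled, GPUs were exposed through the alpha resource name `alpha.kubernetes.io/nvidia-gpu`; the image name and host library path are hypothetical and vary by installation.

```python
# Hedged sketch of a GPU pod spec under the 2017 Accelerators feature gate.
# Image name and hostPath are hypothetical placeholders.
pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "tf-worker-gpu"},
    "spec": {
        "containers": [{
            "name": "tf-worker",
            "image": "example/tensorflow-gpu:latest",  # hypothetical image
            "resources": {
                # Alpha GPU resource name used with the Accelerators gate.
                "limits": {"alpha.kubernetes.io/nvidia-gpu": 1},
            },
            # The container must also see the host's driver libraries,
            # typically via a hostPath volume mount.
            "volumeMounts": [{"name": "nvidia-libs",
                              "mountPath": "/usr/local/nvidia"}],
        }],
        "volumes": [{"name": "nvidia-libs",
                     "hostPath": {"path": "/var/lib/nvidia"}}],  # hypothetical
    },
}
print(pod["spec"]["containers"][0]["resources"]["limits"])
```

Serialized to YAML, this is what would be handed to `kubectl create`; later Kubernetes releases replaced the alpha resource name with device plugins and `nvidia.com/gpu`.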
Architecture
• Data layer
• Orchestration layer
• Application layer
MapR as the Infrastructure for Distributed DL
• A converged data platform for both big data and deep learning
• Volume topology collocates the data with GPU computation for training
• A performant POSIX file system adapts quickly to new deep learning technologies and frameworks
• The mirroring feature enables model training and deployment across global data centers
• The snapshot feature enables research reproducibility
• MapR-DB and MapR Streams provide the infrastructure for intelligent applications built on deep learning models
MapR Volumes
• Volumes are logical groupings of containers, mounted at a directory path (just like in Linux)
• No size limit
• Each volume can have its own management policies
• No limit on the number of volumes
• Containers are replicated on other nodes
• Snapshots, mirroring, data placement (restricted to specific nodes), and data auditing
(Diagram: volume mount points such as /projects/project1, /users/jsmith, and /users/mjohnson spread across data nodes)
Pattern 1: Separate Clusters
• MapR Converged Data Platform tier
• Dockerized GPU-based NVIDIA tier, with NVIDIA DGX systems acting as MapR clients
Pattern 2: Collocated MapR + Kubernetes
MapR Converged Data Platform nodes powered by NVIDIA GPU cards
Demos
Demo 1: Parameter Server and Workers
• Pod 1: TF parameter server
• Pod 2: TF worker 1
• Pod 3: TF worker 2
• MapR volume /projects/project1 mounted as a Kubernetes persistent volume
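The pod layout behind this demo boils down to a cluster definition that TensorFlow's distributed runtime consumes (e.g. via `tf.train.ClusterSpec` or the `TF_CONFIG` environment variable). A minimal sketch, assuming hypothetical pod DNS names:

```python
import json

# One TF parameter server and two workers, each in its own Kubernetes pod.
# Hostnames are hypothetical pod DNS names on port 2222.
cluster = {
    "ps": ["tf-ps-0:2222"],
    "worker": ["tf-worker-0:2222", "tf-worker-1:2222"],
}

# Each pod would export its own role and index before starting training;
# here, worker 0's view of the cluster:
tf_config = json.dumps({"cluster": cluster,
                        "task": {"type": "worker", "index": 0}})
print(tf_config)
```

The parameter server holds the shared model variables; the workers pull them, compute gradients on their data shards, and push updates back, with the shared MapR volume providing a common location for checkpoints.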
Demo 2: Real-Time Streaming Face Detection
Global data plane [on-prem, hybrid, cloud]:
• MapR Edge (small footprint at the edge): cameras capture video feeds; a model deployed at the edge scores the data feed; convergence at the edge
• Reliable replication from the edge to the core
• MapR Converged Enterprise Edition: aggregate and analyze data at the core and refine the models
• Send the updated models back to the edge
Demo 2: Real-Time Streaming Face Detection
• Producers capture video feeds at the MapR Edge
• The DL model processes the streams
• Consumers output the processed videos
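The producer / model / consumer pipeline above can be sketched with plain in-memory queues standing in for the stream topics. In practice MapR Streams is accessed through the Kafka API; here `detect_faces` is a hypothetical stand-in for the real detection model, and the queues replace the raw and processed topics.

```python
from queue import Queue

# Two "topics": raw video frames in, scored frames out.
raw_topic, processed_topic = Queue(), Queue()

def detect_faces(frame):
    """Hypothetical stand-in for the real DL face-detection model."""
    return {"frame": frame, "faces": frame % 3}  # dummy "detection" result

# Producer: capture frames into the raw topic.
for frame in range(5):
    raw_topic.put(frame)

# Model service: consume raw frames, publish scored frames.
while not raw_topic.empty():
    processed_topic.put(detect_faces(raw_topic.get()))

# Consumer: read the processed results for output.
results = []
while not processed_topic.empty():
    results.append(processed_topic.get())

print(len(results))
```

Swapping the queues for Kafka-API producers and consumers against MapR Streams topics gives the demo's actual shape, with each stage running as its own container so it can scale independently.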