63

Building machine-learning infrastructure on

  • Upload
    others

  • View
    2

  • Download
    0

Embed Size (px)

Citation preview

PowerPoint Presentation© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Building machine-learning infrastructure on Amazon EKS with Kubeflow
C O N 3 0 6
Yaniv Donenfeld
Jeremie Vallee
• The mission
ML services
P O L L Y
A M A Z O N
T R A N S C R I B E
A M A Z O N
T R A N S L A T E
A M A Z O N
C O M P R E H E N D
& C O M P R E H E N D
M E D I C A L
A M A Z O N
L E X F O R E C A S TA M A Z O N
R E K O G N I T I O N
I M A G E
A M A Z O N
R E K O G N I T I O N
V I D E O
A M A Z O N
T E X T R A C T A M A Z O N
P E R S O N A L I Z E
Ground Truth Notebooks Algorithms + Marketplace Reinforcement learning Training Optimization Deployment HostingAmazon SageMaker
F P G A SE C 2 P 3
& P 3 D N
E C 2 G 4
E C 2 C 5
I N F E R E N T I AA W S I o T
G R E E N G R A S S
E L A S T I C
I N F E R E N C E
D L
C O N T A I N E R S
& A M I s
K U B E R N E T E S
S E R V I C E
A M A Z O N
E L A S T I C
C O N T A I N E R
S E R V I C E
The AWS ML stack
Composability Portability Scalability
O N - P R E M I S E S C L O U D
http://www.shutterstock.com/gallery-635827p1.html
Managed Kubernetes control plane, attach data plane
Native upstream Kubernetes experience
Integrates with additional AWS services
Which user are you?
• Infrastructure abstraction
• Experiment management
• Hyperparameter tuning/optimization
• An end to end platform
• Infrastructure abstraction
• Resource and quota management
• Live code, equations, visualizations, and narrative text
• 40+ programming languages
• Sharing and collaboration
and results
support
Training
Data Acquisition1
Data Ingestion2
Data Pre-processing3
integration
integration
Ajay Vohra
Principal SA -
tracking experiments, jobs, and runs
• An engine for scheduling multi-step
ML workflows
pipelines and components
Running SageMaker – pipeline component
12/4/19 (Wednesday) 4:00 PM - Bellagio, Monet 1
What’s next?
• kfctl for deployment and upgrades
• TFJob and PyTorch for distributed training (already 1.0)
• Jupyter notebook controller and web app
• …
• KFServing for model deployment and inference
• …
Call to Action
12/5/19 (Thursday) 3:15 PM - MGM, Level 1, Grand Ballroom 120
Community
Online workshop
https://eksworkshop.com/kubeflow/
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
We believe it is possible to put an accessible and affordable health service in the hands of every person on Earth
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Our challenges
Our challenges
15+ EKS clusters
5 AWS Regions
• Providing a safe and secure environment for our research teams
• Data locality
• Improving overall engineering efficiency
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
AI platform: The big picture
Providing our teams with a single interface to experiment, train, tune, validate, and track their AI models
AI platform
Treat it as a production platform
Pay for what you use, and
scale as much as you need
It must be able to run any tool or service our
users need
anywhere we operate
• Encrypt by default
• NVidia device plugin DaemonSet
• Label: GPU type and brand
nvidia-tesla-k80 # Add a toleration, allowing the pod to run on GPU nodes
tolerations:
nodeSelector:
• Virtual Private Cloud (VPC)
• Span over at least 3 Availability Zones (AZ) … if you can!
• 1 Auto Scaling Group per AZ per instance type
• Use NAT gateways
• Kubernetes
• Massive scale
• Multi-user: plug in your enterprise OIDC
• Open-source: deploy it anywhere you have Kubernetes
• We mostly use Jupyter Notebooks, TensorFlow jobs, hyperparameter tuning
AI platform: Deploying Kubeflow
• Need faster storage?
• Encrypted EFS partition
• Monitor Cluster State (MLOps)
• Collect metrics from jobs
• Automated Dashboards via ConfigMaps
• Gives us ability to add new tools fast
• Ex: deploying Argo for workflow management
• Use case:
• Migrating our model Clinical Validation pipeline to Argo on EKS
AI platform: Single cluster
• Per microservice tracking of:
• Engineering documents
• Data management
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Next steps
• Better metadata tracking for our models
• Integration with Kubeflow pipelines
Thank you!
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Yaniv Donenfeld
Jeremie Vallee
jeremie.vallee@babylon health.com