
Pluto A Distributed Heterogeneous Deep Learning Framework

Jun Yang, Yan Chen
Large Scale Learning, Alibaba Cloud

Outline

•  PAI (Platform of Artificial Intelligence)
   •  PAI Overview
   •  Deep Learning with PAI
   •  Pluto

•  PAI DL Application
   •  Chatbot Engine

•  Summary

Machine Learning Platforms


PAI Overview

[Architecture diagram: the PAI stack]
•  Frontend: PAI Web Console, PAI IDE, PAI SDK
•  Algorithms: Feature Engineering, Statistic Methods, Machine Learning, Deep Learning, ……
•  Serving
•  Distributed Computing: Fuxi Scheduler; MR/MPI/PS/Graph/Pluto…
•  Data Storage: OSS; streaming data (DataHub/TT/Kafka); database (ODPS/RDS)
•  CPU/GPU/FPGA/ASIC/…

Tutorial: data.aliyun.com


PAI Project

[Screenshot: PAI project workspace with Search, Experiment, Data Source, Component, Model, and Serving panels]

Machine Learning with PAI

[Diagram: algorithm components in PAI]
•  Data Preprocessing: Sampling & Filtering, Data Merge, Fill Missing Values, Normalization
•  Feature Engineering: Feature Transformation, Feature Selection, Feature Importance, Feature Generation
•  Statistics: Correlation Coefficients, Histogram, Hypothesis Test, Visualization
•  Modeling: Binary Classification, Multi-class Classification, Clustering, Regression, Prediction, Evaluation
•  Deep Learning (Pluto): DNN, CNN, RNN, A La Carte
•  Application: NLP, Search & Rec., Image Processing, Network Analysis, Finance

Deep Learning with PAI


PAI TensorFlow

•  Rich data IO
•  Distributed job optimization (multiple GPUs/CPUs)
•  Easy model serving
•  Hyper-parameter tuning

Pluto


Single-card Optimization

•  Compiler-oriented strategy
   •  Fuse small ops into bigger ones
      •  Reduces CUDA kernel launch overhead
   •  Prepare data layouts that are friendly to the low-level computation libraries
•  Memory optimization (again with compiler-oriented tactics)
   •  Dependency analysis
   •  Lifetime analysis (see the sketch below)
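As a rough illustration of the lifetime-analysis idea, here is a minimal sketch (my own simplification, not Pluto's allocator) that reuses a buffer for tensors whose lifetimes do not overlap; tensor sizes are ignored for brevity.

    # Greedy buffer reuse driven by tensor lifetimes (first/last use step).
    def assign_buffers(lifetimes):
        """lifetimes: dict tensor_name -> (first_use_step, last_use_step)."""
        buffers = []      # buffers[i] = last step at which buffer i is still in use
        assignment = {}   # tensor_name -> buffer index
        # Visit tensors in the order they are produced, as a scheduler would.
        for name, (start, end) in sorted(lifetimes.items(), key=lambda kv: kv[1][0]):
            for i, busy_until in enumerate(buffers):
                if busy_until < start:      # this buffer is free before the tensor is born
                    buffers[i] = end        # reuse it
                    assignment[name] = i
                    break
            else:
                buffers.append(end)         # no reusable buffer: allocate a new one
                assignment[name] = len(buffers) - 1
        return assignment, len(buffers)

    # Four intermediate tensors end up packed into two buffers.
    lifetimes = {"conv1": (0, 2), "relu1": (2, 4), "conv2": (4, 6), "relu2": (6, 7)}
    print(assign_buffers(lifetimes))   # ({'conv1': 0, 'relu1': 1, 'conv2': 0, 'relu2': 1}, 2)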


Multi-card Optimization

•  Heuristic-based model parallelism
   •  Both the model weights and the feature maps are taken into consideration
   •  The memory-allocator strategy is taken into consideration
   •  A greedy allocation algorithm (see the sketch after this list)
      •  With pre-run support
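A minimal sketch of a greedy placement in this spirit; the "most free memory wins" rule and the cost model below are my assumptions, not Pluto's actual algorithm, and in practice the per-op sizes would come from the pre-run rather than static estimates.

    # Greedy model-parallel placement over a fixed number of cards.
    def greedy_placement(ops, gpu_mem_bytes, num_gpus=4):
        """ops: list of (op_name, weight_bytes, feature_map_bytes) in topological order."""
        free = [gpu_mem_bytes] * num_gpus
        placement = {}
        for name, weight_bytes, feature_map_bytes in ops:
            need = weight_bytes + feature_map_bytes              # weights AND feature maps count
            gpu = max(range(num_gpus), key=lambda g: free[g])    # card with the most free memory
            if free[gpu] < need:
                raise MemoryError(f"{name} does not fit on any card")
            free[gpu] -= need
            placement[name] = gpu
        return placement

    ops = [("embedding", 2_000_000_000, 100_000_000),
           ("fc1", 800_000_000, 200_000_000),
           ("fc2", 800_000_000, 200_000_000)]
    print(greedy_placement(ops, gpu_mem_bytes=12_000_000_000))
    # {'embedding': 0, 'fc1': 1, 'fc2': 2}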


Multi-card Optimization

•  Hybrid parallelism
   •  A mixture of data parallelism and model parallelism
   •  For communication-intensive parts, consider model parallelism
   •  For computation-intensive parts, consider data parallelism (see the sketch after this list)
•  Tricks
   •  Integrates seamlessly with the computation-graph style
   •  Works better with pyramid-shaped networks
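A toy illustration of the rule of thumb above; the byte-counting cost model and the example layer sizes are my assumptions, not Pluto's planner.

    # Pick a parallelism strategy per layer by comparing the traffic each scheme would generate.
    def choose_strategy(layer, batch_size, num_gpus):
        # Data parallelism exchanges the layer's gradients; model parallelism exchanges
        # its activations for every sample in the mini-batch.
        data_parallel_comm = layer["param_bytes"] * (num_gpus - 1)
        model_parallel_comm = layer["activation_bytes"] * batch_size * (num_gpus - 1)
        return "data_parallel" if data_parallel_comm <= model_parallel_comm else "model_parallel"

    layers = [
        {"name": "conv1", "param_bytes": 100_000, "activation_bytes": 400_000},   # computation-heavy
        {"name": "fc6", "param_bytes": 400_000_000, "activation_bytes": 20_000},  # communication-heavy
    ]
    for layer in layers:
        print(layer["name"], choose_strategy(layer, batch_size=32, num_gpus=4))
    # conv1 data_parallel
    # fc6 model_parallel

Convolutions, with few parameters but heavy computation, land on the data-parallel side, while large fully-connected layers land on the model-parallel side, which is exactly the split the slide suggests.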


Multi-card Optimization

•  Hybrid parallelism (cont.)

[Charts: hybrid-parallelism results on M40 and K40 GPUs]

Multi-cards Optimization

•  Late-multiply •  Customized for fully-connected layers •  Trade-off between computation and communication

Wavg: [Nl ,Nl+1], X:[M, Nl], E:[M, Nl+1], here Nl,Nl+1 layer sizes, M is mini-batch size
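My reading of the trade-off, stated as an assumption since the slide only gives the shapes: instead of every card exchanging the full gradient X^T E of shape [N_l, N_{l+1}], the cards exchange the much smaller X and E and perform the matrix multiply after the exchange ("late"). A back-of-the-envelope comparison of values moved per card:

    # Communication volume per card, counted in number of values exchanged.
    def early_multiply_volume(n_l, n_lp1):
        # Exchange the local gradient dW = X^T @ E, shape [N_l, N_{l+1}].
        return n_l * n_lp1

    def late_multiply_volume(m, n_l, n_lp1):
        # Exchange X ([M, N_l]) and E ([M, N_{l+1}]) instead; multiply after the exchange.
        return m * (n_l + n_lp1)

    # Example: a 4096 x 4096 fully-connected layer with per-card mini-batch 32.
    print(early_multiply_volume(4096, 4096))       # 16,777,216 values
    print(late_multiply_volume(32, 4096, 4096))    # 262,144 values (64x fewer)

The saving holds whenever M is small relative to the layer sizes, which is the fully-connected regime the slide targets; the price is redoing the multiply on each card after the exchange.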


Multi-card Optimization

•  Late-multiply (cont.)


Multi-card Optimization

•  Heuristic-based MA
   •  Automatic batch-size selection
   •  Learning-rate auto-tuning
   •  Works better with sequential models (see the sketch below)
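A minimal sketch of synchronous model averaging, assuming "MA" stands for model averaging; the averaging step is standard, while the linear batch-size and learning-rate scaling is only an illustrative heuristic, not Pluto's auto-tuner.

    import numpy as np

    def average_models(worker_weights):
        """worker_weights: list of {param_name: np.ndarray}, one dict per card."""
        return {name: np.mean([w[name] for w in worker_weights], axis=0)
                for name in worker_weights[0]}

    def heuristic_schedule(base_batch, base_lr, num_cards):
        # Illustrative heuristic: grow the effective batch with the number of cards
        # and scale the learning rate in the same proportion.
        return base_batch * num_cards, base_lr * num_cards

    batch_size, lr = heuristic_schedule(base_batch=32, base_lr=0.01, num_cards=4)
    print(batch_size, lr)   # 128 0.04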


Multi-card Optimization

•  Heuristic-based MA (cont.)

[Charts: wall-clock training time and model metrics for heuristic-based MA]


Inference Optimization

•  Quantization
   •  Significantly reduces model size (~4X; see the sketch below)
   •  Around 2X speed-up on average

•  Binarized Neural Networks
   •  Binarize the model weights
   •  Convert floating-point computation into bit manipulation
   •  Both model size and computation speed are significantly improved
   •  The training process needs to be manipulated to compensate for accuracy loss
   •  Works better with CNNs, but for RNNs…
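As a rough illustration of where the 4X comes from (a hypothetical symmetric per-tensor int8 scheme, not necessarily the one used in PAI): each float32 weight is stored as one int8 plus a shared scale.

    import numpy as np

    def quantize_int8(weights):
        scale = np.abs(weights).max() / 127.0                       # map the largest |w| to 127
        q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
        return q, scale

    def dequantize(q, scale):
        return q.astype(np.float32) * scale

    w = np.random.randn(1024, 1024).astype(np.float32)
    q, scale = quantize_int8(w)
    print(w.nbytes / q.nbytes)                       # 4.0: the model-size reduction
    print(np.abs(dequantize(q, scale) - w).max())    # small per-weight quantization error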

PAI DL Application


AliMe – Personal Assistant Bot in E-commerce

•  AliMe for Customers
•  AliMe for Sellers
•  AliMe for Enterprises

(From 海青 @ the Computing Conference, 云栖大会)

Open-Domain Conversations

•  Retrieval model
   •  Learning to rank
•  Generation model
   •  Sequence-to-Sequence (Seq2Seq) model
   •  Recurrent neural networks: LSTM, GRU (our choice)
   (Cho et al., 2014)

[Diagram: the query is matched against a knowledge base of QA pairs (Q1-A1: s1, Q2-A2: s2, Q3-A3: s3, ..., Qn-An: sn) and the top-scoring answer A1 is returned]

A Hybrid Conversation Model Based on Seq2Seq

•  Overview (a minimal sketch of the decision rule follows below)

[Diagram: the query goes to the retrieval module (IR over QA pairs built from the knowledge base, chat logs, and SNS data); the candidates are reranked by a Seq2Seq-based model; if the top score exceeds a threshold T, the reranked answer is output, otherwise the Seq2Seq generation model produces the answer]

[AliMe Chat: Minghui Qiu et al., ACL 2017]
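A minimal sketch of the decision rule in the diagram; retrieve, rerank_score, and generate are hypothetical stand-ins for the retrieval module and the Seq2Seq models, not AliMe's real API.

    def answer(query, retrieve, rerank_score, generate, threshold):
        candidates = retrieve(query)                         # IR over the QA knowledge base
        if candidates:
            best = max(candidates, key=lambda c: rerank_score(query, c))
            if rerank_score(query, best) > threshold:        # confident retrieved answer
                return best
        return generate(query)                               # otherwise fall back to generation

    # Toy usage with stub components.
    print(answer(
        "When will my order arrive?",
        retrieve=lambda q: ["Usually within 3 days.", "Please check the tracking page."],
        rerank_score=lambda q, c: 0.9 if "3 days" in c else 0.4,
        generate=lambda q: "Let me check that for you.",
        threshold=0.7,
    ))
    # Usually within 3 days.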

PAI DL Support for AliMe

•  Both the offline training and the online serving are backed by PAI
•  Through heuristic-based MA, the offline training task gets a 2.8X convergence speed-up in a 4-card setting
•  Through quantization, the online serving task gets a 1.5X speed-up on commodity CPU servers


Conclusion

•  PAI DL
   •  End-to-end machine learning platform
   •  Supports big data analytics
   •  Optimized deep learning algorithms
   •  Scheduling on the CPU/GPU cloud
   •  More data intelligence…

•  Pluto
   •  The distributed optimization engine of PAI DL

•  PAI DL Application
   •  PAI DL makes it easy to build DL methods for industrial applications

Scan the barcode! Start your trial!

Data intelligence within reach (数据智能 触手可及)


We are hiring! :)

muzhuo.yj@alibaba-inc.com chenyan.cy@alibaba-inc.com


References

•  AliMe Chat: A Sequence to Sequence and Rerank based Chatbot Engine, Minghui Qiu et al., ACL 2017.

•  Deep Learning with PAI: a Case Study of AliMe, Minghui Qiu et al., Deep Learning Summit 2017.

•  TensorFlow in AliMe, Jun Yang et al., Shanghai GDG, Mar. 2017.

Thanks!
