Please note
• IBM’s statements regarding its plans, directions, and intent are subject to change or withdrawal without notice at IBM’s sole discretion.
• Information regarding potential future products is intended to outline our general product direction and it should not be relied on in making a purchasing decision.
• The information mentioned regarding potential future products is not a commitment, promise, or legal obligation to deliver any material, code or functionality. Information about potential future products may not be incorporated into any contract.
• The development, release, and timing of any future features or functionality described for our products remains at our sole discretion.
• Performance is based on measurements and projections using standard IBM benchmarks in a controlled environment. The actual throughput or performance that any user will experience will vary depending upon many factors, including considerations such as the amount of multiprogramming in the user’s job stream, the I/O configuration, the storage configuration, and the workload processed. Therefore, no assurance can be given that an individual user will achieve results similar to those stated here.
4/11/16 1
Agenda
• What is Apache Spark?
• How does Spark perform on OpenPOWER?
• Leveraging OpenPOWER innovation under Spark
• Questions
What is Apache Spark?
• Unified Analytics Platform
– Combine streaming, graph, machine learning and SQL analytics on a single platform
– Simplified, multi-language programming model
– Interactive and Batch
• In-Memory Design
– Pipelines multiple iterations on a single copy of data in memory
– Superior Performance
– Natural Successor to MapReduce
Fast and general engine for large-scale data processing
[Diagram: Spark Core API (R, Scala, SQL, Python, Java); components: Spark SQL, Streaming, MLlib, GraphX]
The following charts compare performance results for multiple Spark workloads from SparkBench, using data sizes from 100GB to 10TB (https://github.com/SparkTC/spark-bench)
7-node cluster of Intel Haswell servers
• E5-2620 V3
• 12-core
• 256GB
vs 7-node cluster of OpenPOWER servers
• POWER8 S812LC
• 10-core
• 256GB
• Machine Learning (Spark MLlib)
– Matrix Factorization
– Logistic Regression
– Support Vector Machine
• SQL (Spark SQL)
sqlContext.sql("SELECT COUNT(*) FROM orderTab").count()
sqlContext.sql("SELECT COUNT(*) FROM orderTab where bid>5000").count()
sqlContext.sql("SELECT * FROM oitemTab WHERE price>250").count()
sqlContext.sql("SELECT * FROM oitemTab WHERE price>500").count()
sqlContext.sql("SELECT * FROM orderTab r JOIN oitemTab s ON r.oid = s.oid").count()
• Graph (Spark GraphX)
– Page Rank
– Triangle Count
– Singular Value Decomp++
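To make the shape of the five SQL statements concrete, here is a minimal sketch that runs the same query strings on toy data. It uses Python's built-in sqlite3 module rather than Spark SQL, so it says nothing about Spark performance. The table and column names (orderTab, oitemTab, oid, bid, price) come from the queries themselves; the column types and every row value are hypothetical. Each benchmark statement ends in .count(), which in Spark counts the rows of the query result, so the sketch counts returned rows in the same way.

```python
import sqlite3

# The five query strings quoted in the benchmark description, unchanged.
QUERIES = [
    "SELECT COUNT(*) FROM orderTab",
    "SELECT COUNT(*) FROM orderTab where bid>5000",
    "SELECT * FROM oitemTab WHERE price>250",
    "SELECT * FROM oitemTab WHERE price>500",
    "SELECT * FROM orderTab r JOIN oitemTab s ON r.oid = s.oid",
]


def build_toy_db():
    """Create an in-memory database with hypothetical rows (not benchmark data)."""
    conn = sqlite3.connect(":memory:")
    # Column types are assumptions; only the column names appear in the queries.
    conn.execute("CREATE TABLE orderTab (oid INTEGER, bid INTEGER)")
    conn.execute("CREATE TABLE oitemTab (oid INTEGER, price REAL)")
    conn.executemany(
        "INSERT INTO orderTab VALUES (?, ?)",
        [(1, 4000), (2, 6000), (3, 9000)],
    )
    conn.executemany(
        "INSERT INTO oitemTab VALUES (?, ?)",
        [(1, 100.0), (1, 300.0), (2, 600.0), (4, 700.0)],
    )
    return conn


def result_row_counts(conn):
    """Run each query and count the rows it returns, mirroring the trailing .count()."""
    return [len(conn.execute(query).fetchall()) for query in QUERIES]


counts = result_row_counts(build_toy_db())
print(counts)
```

On this toy data the two COUNT(*) queries each return a single result row, so their row count is 1 regardless of table size; the filter and join queries return one row per match.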
Measuring Performance of Spark on POWER
System Performance of Spark on POWER
4/6/2016 5
[Chart: Relative System Performance (y-axis, 0 to 3) by Spark Workloads (x-axis), grouped as Machine Learning, SQL, and Graph. Bars: E5-2620 v3; 100GB Mat. Fact.; 100GB (in mem) LR; 1TB (in mem) LR; 1TB (50/50) LR; 1TB SVM; 10TB LR; 1TB 5 query; 2TB 5 query; 130GB Page Rank; 1TB Triangle Cnt; 1TB SVD++; AVERAGE. Callout: 1.7X]
Price Performance of Spark on POWER
• Spend 33% less on infrastructure supporting the same amount of workload
• Spend the same on infrastructure but host 50% more workload
* Based on preliminary SoftLayer pricing targets; subject to change
[Chart: price performance by workload group (Machine Learning, SQL, Graph). Callout: 1.5X]
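The two bullets on this slide follow from a single ratio. As a small worked sketch, assuming the 1.5X callout is the price-performance ratio (work done per unit of spend, relative to the comparison system), the arithmetic looks like this. The function names are illustrative and not taken from the presentation.

```python
def cost_reduction_at_equal_workload(price_perf_ratio):
    """Fraction of spend saved when hosting the same amount of workload."""
    # Same work at ratio-times the work per dollar costs 1/ratio as much.
    return 1.0 - 1.0 / price_perf_ratio


def extra_workload_at_equal_spend(price_perf_ratio):
    """Fraction of additional workload hosted for the same spend."""
    return price_perf_ratio - 1.0


ratio = 1.5  # assumed to be the price-performance callout on the slide
saving = cost_reduction_at_equal_workload(ratio)
extra = extra_workload_at_equal_spend(ratio)
print(f"{saving:.0%} less spend, {extra:.0%} more workload")
# prints "33% less spend, 50% more workload"
```

Both statements describe the same advantage from two directions: 1 - 1/1.5 is about 0.33, and 1.5 - 1 is 0.5.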
POWER Advantages for Spark
• Streaming and SQL benefit from High Thread Density and Concurrency
– Processing multiple packets of a stream and different stages of a message stream pipeline
– Processing multiple rows from a query
• Machine Learning benefits from Large Caches and Memory Bandwidth
– Iterative Algorithms on the same data
– Fewer core pipeline stalls and overall higher throughput
• Graph also benefits from Large Caches, Memory Bandwidth and Higher Thread Strength
– Flexibility to go from 8 SMT threads per core to 4 or 2
– Manage Balance between thread performance and throughput
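The thread-density trade-off can be put in numbers. The sketch below counts hardware threads for the SMT modes named on this slide, assuming each node exposes the 10 cores listed for the POWER8 S812LC and the cluster has the 7 nodes of the benchmark configuration. It is plain arithmetic, and the function name is illustrative.

```python
CORES_PER_NODE = 10  # POWER8 S812LC core count from the benchmark configuration
NODES = 7            # cluster size from the benchmark configuration


def hardware_threads(smt_mode, cores=CORES_PER_NODE, nodes=NODES):
    """Return (threads per node, threads per cluster) for a given SMT mode."""
    if smt_mode not in (2, 4, 8):
        raise ValueError("this sketch covers only the SMT modes named on the slide")
    per_node = cores * smt_mode
    return per_node, per_node * nodes


for mode in (8, 4, 2):
    per_node, per_cluster = hardware_threads(mode)
    print(f"SMT{mode}: {per_node} threads/node, {per_cluster} threads/cluster")
```

Dropping from SMT8 to SMT4 or SMT2 halves or quarters the thread count, which gives each remaining thread a larger share of the core. That is the balance between per-thread performance and throughput that the slide refers to.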
Leveraging OpenPOWER Innovation
[Chart: "x Degrees of Separation on Spark"; Run time (ms) (y-axis, 0 to 400000) vs. Total Heap Memory (x-axis); series: Disk, CAPI/Flash]
CAPI Flash for RDD Cache = 4X memory reduction at equal performance
RDMA for Spark Shuffle = 30% Better Response Time, Lower CPU Utilization, Lower Memory Footprint
CAPI Flash and RDMA can be Leveraged Transparently to Spark Applications
Accelerating Spark with GPUs
• Adverse Drug Reaction Prediction built on Spark
• 25X Speed up for Building Model stage (using Spark MLlib Logistic Regression)
• Again, Transparent to the Spark Application
• Game changer for Personalized Medicine
Summary
• Spark is a new disruptive technology for big data analytics
• OpenPOWER systems can provide leadership performance and economics for Spark deployments
• OpenPOWER innovations can provide additional acceleration and value to Spark
Notices and Disclaimers (1 of 2)
Copyright © 2016 by International Business Machines Corporation (IBM). No part of this document may be reproduced or transmitted in any form without written permission from IBM.
U.S. Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM.
Information in these presentations (including information relating to products that have not yet been announced by IBM) has been reviewed for accuracy as of the date of initial publication and could include unintentional technical or typographical errors. IBM shall have no responsibility to update this information. THIS DOCUMENT IS DISTRIBUTED "AS IS" WITHOUT ANY WARRANTY, EITHER EXPRESS OR IMPLIED. IN NO EVENT SHALL IBM BE LIABLE FOR ANY DAMAGE ARISING FROM THE USE OF THIS INFORMATION, INCLUDING BUT NOT LIMITED TO, LOSS OF DATA, BUSINESS INTERRUPTION, LOSS OF PROFIT OR LOSS OF OPPORTUNITY. IBM products and services are warranted according to the terms and conditions of the agreements under which they are provided.
Any statements regarding IBM's future direction, intent or product plans are subject to change or withdrawal without notice.
Performance data contained herein was generally obtained in controlled, isolated environments. Customer examples are presented as illustrations of how those customers have used IBM products and the results they may have achieved. Actual performance, cost, savings or other results in other operating environments may vary.
References in this document to IBM products, programs, or services do not imply that IBM intends to make such products, programs or services available in all countries in which IBM operates or does business.
Workshops, sessions and associated materials may have been prepared by independent session speakers, and do not necessarily reflect the views of IBM. All materials and discussions are provided for informational purposes only, and are neither intended to, nor shall constitute legal or other guidance or advice to any individual participant or their specific situation.
It is the customer’s responsibility to ensure its own compliance with legal requirements and to obtain advice of competent legal counsel as to the identification and interpretation of any relevant laws and regulatory requirements that may affect the customer’s business and any actions the customer may need to take to comply with such laws. IBM does not provide legal advice or represent or warrant that its services or products will ensure that the customer is in compliance with any law.
Notices and Disclaimers (2 of 2)
Information concerning non-IBM products was obtained from the suppliers of those products, their published announcements or other publicly available sources. IBM has not tested those products in connection with this publication and cannot confirm the accuracy of performance, compatibility or any other claims related to non-IBM products. Questions on the capabilities of non-IBM products should be addressed to the suppliers of those products. IBM does not warrant the quality of any third-party products, or the ability of any such third-party products to interoperate with IBM’s products. IBM EXPRESSLY DISCLAIMS ALL WARRANTIES, EXPRESSED OR IMPLIED, INCLUDING BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE.
The provision of the information contained herein is not intended to, and does not, grant any right or license under any IBM patents, copyrights, trademarks or other intellectual property right.
IBM, the IBM logo, ibm.com, Aspera®, Bluemix, Blueworks Live, CICS, Clearcase, Cognos®, DOORS®, Emptoris®, Enterprise Document Management System™, FASP®, FileNet®, Global Business Services ®, Global Technology Services ®, IBM ExperienceOne™, IBM SmartCloud®, IBM Social Business®, Information on Demand, ILOG, Maximo®, MQIntegrator®, MQSeries®, Netcool®, OMEGAMON, OpenPower, PureAnalytics™, PureApplication®, pureCluster™, PureCoverage®, PureData®, PureExperience®, PureFlex®, pureQuery®, pureScale®, PureSystems®, QRadar®, Rational®, Rhapsody®, Smarter Commerce®, SoDA, SPSS, Sterling Commerce®, StoredIQ, Tealeaf®, Tivoli®, Trusteer®, Unica®, urban{code}®, Watson, WebSphere®, Worklight®, X-Force® and System z® Z/OS, are trademarks of International Business Machines Corporation, registered in many jurisdictions worldwide. Other product and service names might be trademarks of IBM or other companies. A current list of IBM trademarks is available on the Web at "Copyright and trademark information" at: www.ibm.com/legal/copytrade.shtml.
Revolutionizing the Datacenter
Join the Conversation #OpenPOWERSummit
Thank You! Questions?