Please note - OpenPOWER Foundation

Please note

•  IBM’s statements regarding its plans, directions, and intent are subject to change or withdrawal without notice at IBM’s sole discretion.

•  Information regarding potential future products is intended to outline our general product direction and it should not be relied on in making a purchasing decision.

•  The information mentioned regarding potential future products is not a commitment, promise, or legal obligation to deliver any material, code or functionality. Information about potential future products may not be incorporated into any contract.

•  The development, release, and timing of any future features or functionality described for our products remains at our sole discretion.

•  Performance is based on measurements and projections using standard IBM benchmarks in a controlled environment. The actual throughput or performance that any user will experience will vary depending upon many factors, including considerations such as the amount of multiprogramming in the user’s job stream, the I/O configuration, the storage configuration, and the workload processed. Therefore, no assurance can be given that an individual user will achieve results similar to those stated here.

4/11/16 1

Agenda

•  What is Apache Spark?

•  How does Spark perform on OpenPOWER?

•  Leveraging OpenPOWER innovation under Spark

•  Questions

What is Apache Spark?

•  Unified Analytics Platform

–  Combine streaming, graph, machine learning and SQL analytics on a single platform

–  Simplified, multi-language programming model

–  Interactive and Batch

•  In-Memory Design

–  Pipelines multiple iterations on a single copy of data in memory

–  Superior Performance

–  Natural Successor to MapReduce


[Diagram: the Spark stack, captioned "Fast and general engine for large-scale data processing". Spark Core API with language bindings for R, Scala, SQL, Python, and Java; libraries on top: Spark SQL, Streaming, MLlib, GraphX]

Measuring Performance of Spark on POWER

The following charts show performance results comparing multiple Spark workloads from SparkBench, using data sizes from 100GB to 10TB (https://github.com/SparkTC/spark-bench).

7-node cluster of Intel Haswell servers

•  E5-2620 V3

•  12-core

•  256GB

vs

7-node cluster of OpenPOWER servers

•  POWER8 S812LC

•  10-core

•  256GB

Workloads:

•  Machine Learning (Spark MLlib)

–  Matrix Factorization

–  Logistic Regression

–  Support Vector Machine

•  SQL (Spark SQL)

sqlContext.sql("SELECT COUNT(*) FROM orderTab").count()

sqlContext.sql("SELECT COUNT(*) FROM orderTab where bid>5000").count()

sqlContext.sql("SELECT * FROM oitemTab WHERE price>250").count()

sqlContext.sql("SELECT * FROM oitemTab WHERE price>500").count()

sqlContext.sql("SELECT * FROM orderTab r JOIN oitemTab s ON r.oid = s.oid").count()

•  Graph (Spark GraphX)

–  Page Rank

–  Triangle Count

–  Singular Value Decomposition++ (SVD++)
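The five SQL query shapes listed for this benchmark can be tried on toy data without a cluster. The sketch below is only an illustration: it uses Python's built-in sqlite3 module in place of Spark SQL, and the table contents are made up. Only the table names (orderTab, oitemTab) and column names (oid, bid, price) come from the slide; the helper result_rows is hypothetical.

```python
import sqlite3

# Toy stand-ins for the benchmark tables. Only the table and column names
# come from the slide; the row values are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orderTab (oid INTEGER, bid INTEGER);
    CREATE TABLE oitemTab (oid INTEGER, price REAL);
    INSERT INTO orderTab VALUES (1, 4000), (2, 6000), (3, 9000);
    INSERT INTO oitemTab VALUES (1, 100.0), (1, 300.0), (2, 600.0), (3, 700.0);
""")

# The same five query texts as on the slide.
QUERIES = [
    "SELECT COUNT(*) FROM orderTab",
    "SELECT COUNT(*) FROM orderTab WHERE bid > 5000",
    "SELECT * FROM oitemTab WHERE price > 250",
    "SELECT * FROM oitemTab WHERE price > 500",
    "SELECT * FROM orderTab r JOIN oitemTab s ON r.oid = s.oid",
]

def result_rows(sql):
    """Count the rows in the result set, mirroring the trailing .count() call."""
    return len(conn.execute(sql).fetchall())

row_counts = [result_rows(q) for q in QUERIES]
print(row_counts)
```

Note that for the two SELECT COUNT(*) queries the result set has a single row, so counting result rows gives 1 regardless of table size; the scan, filter, and join shapes are where the row count tracks the data.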

System Performance of Spark on POWER

4/6/2016 5

[Chart: Relative System Performance (y-axis, 0 to 3) by Spark workload (x-axis), grouped as Machine Learning, SQL, and Graph. X-axis categories: E5-2620 v3, 100GB Mat. Fact., 100GB (in mem) LR, 1TB (in mem) LR, 1TB (50/50) LR, 1TB SVM, 10TB LR, 1TB 5 query, 2TB 5 query, 130GB Page Rank, 1TB Triangle Cnt, 1TB SVD++, AVERAGE. Callout: 1.7X]

Price Performance of Spark on POWER

•  Spend 33% less on infrastructure supporting the same amount of workload

•  Spend the same on infrastructure but host 50% more workload

* Based on preliminary SoftLayer pricing targets; subject to change

[Chart: price performance by workload group (Machine Learning, SQL, Graph). Callout: 1.5X]
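The two spending claims on this slide are consistent with a 1.5X price-performance ratio. The sketch below works through that arithmetic, assuming the 1.5X callout is the ratio of workload hosted per unit of spend; the function names are hypothetical.

```python
def spend_reduction(price_perf_ratio):
    """Fraction of spend saved when hosting the same workload."""
    return 1.0 - 1.0 / price_perf_ratio

def extra_workload(price_perf_ratio):
    """Fraction of additional workload hosted for the same spend."""
    return price_perf_ratio - 1.0

ratio = 1.5  # assumed to be the price-performance callout on the chart
print(f"Same workload: spend {spend_reduction(ratio):.0%} less")
print(f"Same spend: host {extra_workload(ratio):.0%} more workload")
```

With a ratio of 1.5, the same workload costs 1/1.5 of the baseline, which is about 33% less, and the same budget hosts 50% more.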

POWER Advantages for Spark

•  Streaming and SQL benefit from High Thread Density and Concurrency

–  Processing multiple packets of a stream and different stages of a message stream pipeline

–  Processing multiple rows from a query

•  Machine Learning benefits from Large Caches and Memory Bandwidth

–  Iterative Algorithms on the same data

–  Fewer core pipeline stalls and overall higher throughput

•  Graph also benefits from Large Caches, Memory Bandwidth and Higher Thread Strength

–  Flexibility to go from 8 SMT threads per core to 4 or 2

–  Manage Balance between thread performance and throughput
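To make the thread-density trade-off concrete, the sketch below counts hardware threads at each SMT mode named on the slide. It assumes the "10-core" figure from the cluster description is the per-node core count and that all 7 nodes run the same mode; the function name is hypothetical.

```python
# Figures taken from the slides: 7 nodes, 10 cores per POWER8 S812LC node,
# and SMT modes of 8, 4, or 2 threads per core.
# Assumption: "10-core" is the per-node core count.
NODES = 7
CORES_PER_NODE = 10
SMT_MODES = (8, 4, 2)

def hardware_threads(smt_mode, cores=CORES_PER_NODE, nodes=NODES):
    """Return (threads per node, threads across the cluster) for an SMT mode."""
    if smt_mode not in SMT_MODES:
        raise ValueError(f"SMT mode not covered by this sketch: {smt_mode}")
    per_node = cores * smt_mode
    return per_node, per_node * nodes

for mode in SMT_MODES:
    per_node, per_cluster = hardware_threads(mode)
    print(f"SMT{mode}: {per_node} threads per node, {per_cluster} across the cluster")
```

Fewer threads per core leaves more of each core's resources to every thread, which is the balance between thread performance and throughput that the slide refers to.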


Leveraging OpenPOWER Innovation


[Chart: "x Degrees of Separation on Spark". Run time (ms), 0 to 400000, versus Total Heap Memory; series: Disk, CAPI/Flash]

CAPI Flash for RDD Cache = 4X memory reduction at equal performance

RDMA for Spark Shuffle = 30% Better Response Time, Lower CPU Utilization, Lower Memory Footprint

CAPI Flash and RDMA can be Leveraged Transparently to Spark Applications

Accelerating Spark with GPUs

•  Adverse Drug Reaction Prediction built on Spark

•  25X Speed up for Building Model stage (using Spark MLlib Logistic Regression)

•  Again, Transparent to the Spark Application

•  Game changer for Personalized Medicine
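The 25X figure applies to the model-building stage only, so the end-to-end gain depends on how much of the total run time that stage takes. The sketch below applies Amdahl's law to show the relationship. The stage fractions used are hypothetical, since the slide does not state how much of the run time the stage accounts for.

```python
def overall_speedup(stage_fraction, stage_speedup=25.0):
    """Amdahl's law: end-to-end speedup when only one stage is accelerated.

    stage_fraction is the share of original run time spent in the
    accelerated stage (a hypothetical input; not given on the slide).
    """
    if not 0.0 <= stage_fraction <= 1.0:
        raise ValueError("stage_fraction must be between 0 and 1")
    return 1.0 / ((1.0 - stage_fraction) + stage_fraction / stage_speedup)

# Illustrative fractions only, not measurements from the presentation.
for fraction in (0.5, 0.9, 1.0):
    print(f"stage share {fraction:.0%}: overall {overall_speedup(fraction):.2f}X")
```

The closer the model-building stage is to the whole job, the closer the end-to-end speedup gets to 25X.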


Summary

•  Spark is a new disruptive technology for big data analytics

•  OpenPOWER systems can provide leadership performance and economics for Spark deployments.

•  OpenPOWER innovations can provide additional acceleration and value to Spark


Notices and Disclaimers (1 of 2)

Copyright © 2016 by International Business Machines Corporation (IBM). No part of this document may be reproduced or transmitted in any form without written permission from IBM.

U.S. Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM.

Information in these presentations (including information relating to products that have not yet been announced by IBM) has been reviewed for accuracy as of the date of initial publication and could include unintentional technical or typographical errors. IBM shall have no responsibility to update this information. THIS DOCUMENT IS DISTRIBUTED "AS IS" WITHOUT ANY WARRANTY, EITHER EXPRESS OR IMPLIED. IN NO EVENT SHALL IBM BE LIABLE FOR ANY DAMAGE ARISING FROM THE USE OF THIS INFORMATION, INCLUDING BUT NOT LIMITED TO, LOSS OF DATA, BUSINESS INTERRUPTION, LOSS OF PROFIT OR LOSS OF OPPORTUNITY. IBM products and services are warranted according to the terms and conditions of the agreements under which they are provided.

Any statements regarding IBM's future direction, intent or product plans are subject to change or withdrawal without notice.

Performance data contained herein was generally obtained in controlled, isolated environments. Customer examples are presented as illustrations of how those customers have used IBM products and the results they may have achieved. Actual performance, cost, savings or other results in other operating environments may vary.

References in this document to IBM products, programs, or services do not imply that IBM intends to make such products, programs or services available in all countries in which IBM operates or does business.

Workshops, sessions and associated materials may have been prepared by independent session speakers, and do not necessarily reflect the views of IBM. All materials and discussions are provided for informational purposes only, and are neither intended to, nor shall constitute legal or other guidance or advice to any individual participant or their specific situation.

It is the customer’s responsibility to ensure its own compliance with legal requirements and to obtain advice of competent legal counsel as to the identification and interpretation of any relevant laws and regulatory requirements that may affect the customer’s business and any actions the customer may need to take to comply with such laws. IBM does not provide legal advice or represent or warrant that its services or products will ensure that the customer is in compliance with any law.


Notices and Disclaimers (2 of 2)

Information concerning non-IBM products was obtained from the suppliers of those products, their published announcements or other publicly available sources. IBM has not tested those products in connection with this publication and cannot confirm the accuracy of performance, compatibility or any other claims related to non-IBM products. Questions on the capabilities of non-IBM products should be addressed to the suppliers of those products. IBM does not warrant the quality of any third-party products, or the ability of any such third-party products to interoperate with IBM’s products. IBM EXPRESSLY DISCLAIMS ALL WARRANTIES, EXPRESSED OR IMPLIED, INCLUDING BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE.

The provision of the information contained herein is not intended to, and does not, grant any right or license under any IBM patents, copyrights, trademarks or other intellectual property right.

IBM, the IBM logo, ibm.com, Aspera®, Bluemix, Blueworks Live, CICS, Clearcase, Cognos®, DOORS®, Emptoris®, Enterprise Document Management System™, FASP®, FileNet®, Global Business Services ®, Global Technology Services ®, IBM ExperienceOne™, IBM SmartCloud®, IBM Social Business®, Information on Demand, ILOG, Maximo®, MQIntegrator®, MQSeries®, Netcool®, OMEGAMON, OpenPower, PureAnalytics™, PureApplication®, pureCluster™, PureCoverage®, PureData®, PureExperience®, PureFlex®, pureQuery®, pureScale®, PureSystems®, QRadar®, Rational®, Rhapsody®, Smarter Commerce®, SoDA, SPSS, Sterling Commerce®, StoredIQ, Tealeaf®, Tivoli®, Trusteer®, Unica®, urban{code}®, Watson, WebSphere®, Worklight®, X-Force® and System z® Z/OS, are trademarks of International Business Machines Corporation, registered in many jurisdictions worldwide. Other product and service names might be trademarks of IBM or other companies. A current list of IBM trademarks is available on the Web at "Copyright and trademark information" at: www.ibm.com/legal/copytrade.shtml.


Revolutionizing the Datacenter

Join the Conversation #OpenPOWERSummit

Thank You! Questions?
