© 2015 IBM Corporation Apache Spark and z Systems December 2015 Mythili Venkatakrishnan [email protected]

© 2015 IBM Corporation

Apache Spark and z Systems

December 2015

Mythili [email protected]

What is Spark?

• An Apache Foundation open source project; not a product

• An in-memory compute engine that works with data; not a data store

• Enables highly iterative analysis on large volumes of data at scale

• Unified environment for data scientists, developers and data engineers

• Radically simplifies the process of developing intelligent apps fueled by data


Imagine the possibilities…

Fraud andrisk detection

Real-time traffic flow optimization

Accurate and timely threat detection

Understand and act on

customer sentiment

Low-latency network analysis

Predict and act on intent to purchase


•© 2015 IBM Corporation4

Fast Leverages aggressively cached in-memory

distributed computing and JVM threads Faster than MapReduce for some workloads

Ease of use (for programmers) Written in Scala, an object-oriented, functional

programming language Scala, Python and Java APIs Scala and Python interactive shells Runs on Hadoop, Mesos, standalone or cloud

General purpose Covers a wide range of workloads Provides SQL, streaming and complex analytics

from http://spark.apache.org

Logistic regression in Hadoop and Spark

Apache Spark is…

IBM’s Commitment to Spark:

Founding Member of AMPLab

Establish Spark Technology Center

Contributing to the core

Open Source SystemML

Educate one million data professionals


Brief History of Spark2002 – MapReduce @ Google

2004 – MapReduce paper

2006 – Hadoop @ Yahoo

2008 – Hadoop Summit2010 – Spark paper

2011 – Hadoop 1.0 GA

2014 – Apache Spark top-level

2014 – 1.2.0 release in December

2015 – 1.3.0 release in March

2015 – 1.4.0 release in June

2015 – 1.5.0 release in September

Activity for 6 months in 2014(from Matei Zaharia – 2014 Spark Summit)

Spark is Active

Most active project in Apache Software Foundation

One of top 3 most active Apache projects

Databricks founded by the creators of Spark from UC Berkeley’s AMPLab

Spark empowers users to accelerate the insight economy

Data Scientist

Data EngineerApp Developer

“the convincer”

“the builder”“the thinker”

What they want to do: Identify patterns, trends, risks, and

opportunities in data Discover new actionable insights

How Spark can help: Supports the entire data science workflow:

from data access and integration, to machine learning, to visualization

Provides a growing library of machine learning algorithms

What they want to do: Bridge between the Data Scientist and the App Developer Implement machine learning algorithms at scale Put the right data system to work for the job at hand

How Spark can help: Abstract data access complexity (Spark doesn’t care what your

data store is) Enables solutions at web-scale

What they want to do: Build applications that lever advanced

analytics in partnership with the data scientist and data engineer

Follow agile design methodologies Optimize performance and meet

SLAs

How Spark can help: Supports the top analytics app

languages such as Python and Scala Eliminates programming complexity

with libraries such as MLlib and simplifies DevOps

Makes it easy to embed advanced analytics into applications


Spark’s basic unit of data

Immutable, fault tolerant collection of elements that can be operated on in parallel across a cluster

Fault tolerance– If data in memory is lost it will be recreated from lineage

Caching, persistence (memory, spilling, disk) and check-pointing

Many database or file type can be supported

An RDD is physically distributed across the cluster, but manipulated as one logical entity:

– Spark will “distribute” any required processing to all partitions where the RDD exists and perform necessary redistributions and aggregations as well.

– Example: Consider a distributed RDD “Names” made of names

MichaelJacquesDirk

CindyDanSusan

DirkFrankJacques

Partition 1 Partition 2 Partition 3

Names

Resilient Distributed Datasets (RDDs)


Spark SQL– Provide for relational queries expressed in SQL, HiveQL and

Scala– Seamlessly mix SQL queries with Spark programs – Provide a single interface for efficiently working with

structured data including Apache Hive, Parquet and JSON files

– Standard connectivity through JDBC/ODBC– Rocket Software has announced intent to make

available a Rocket Mainframe Data Service for Apache Spark z/OS

Spark R– Spark R is an R package that provides a light-weight front-

end to use Apache Spark from R – Spark R exposes the Spark API through the RDD class and

allows users to interactively run jobs from the R shell on a cluster.

– Goal to make Spark R production ready – Rocket Software has announced intent to

support R on z/OS, already available for Linux on z

Common, popular methods to access data


Spark Streaming

Run a streaming computation as a series of very small, deterministic batch jobs

Chop up live stream into batches of X seconds

Spark treats each batch of data as RDDs and processes them using RDD operations

The process results of the RDD operations are returned in batches

Combine live data streams with historical dataGenerate historical data models with SparkUse data models to process live data

Combine Streaming with MLlib algorithmsOffline learning, online predictions Actionable information


Spark and Hadoop

– Spark analytics *can* use information from Hadoop based file systems – though not required

– Hadoop’s MapReduce processing is disk-dependent, Spark relies on in-memory capabilities; this can result in more ‘real-time’ capabilities for Spark

– Spark comes with a set of libraries and functions for streaming, SQL, graph, and machine learning; Hadoop has an ecosystem of other tools to support these (Storm, Hive, Giraph, Mahout) which are integrated separately

– Spark can use HDFS, but is not limited to particular data store format

Spark analytics

Spark analytics

Spark in-memory Spark Resilient Distributed Data (RDDs)Spark in-memory Spark Resilient Distributed Data (RDDs)

Hadoop Distributed File Systems (HDFS) Hadoop Distributed File Systems (HDFS)

Spark analytics

Spark analytics

Spark analytics

Spark analytics

Other File systems, DBs, etc.Other File systems, DBs, etc.


Spark Transform actions across multiple data environments: apply a map function to each, filter, sample, union, intersect, joins…

Apply Spark Actions, for example Analytic Algorithms found in MLIB: clustering, PCA, logistic regression, classification, random forest…

Results of analysis

Data: from structured data, unstructured, any data environment supported by Spark

z Systems•DB2, IMS, VSAM•Adabas•Syslog•PDSE •Etc

Distributed:•Oracle•Filesystems•Hadoop •Etc

RDDRDDRDD’RDD’

Apache Spark – Analytics Across Heterogeneous Data Environments

Data did not need to move, centralize, or leave platform enforced security, governance….


Highly technical skills. Limited uses.

Time consuming and complex.

Not focused on insight.

Inflexible and varied

programming models.

Analytics platforms difficult to configure, setup, and

deploy.

Steep learning curve, intense programming.

Not designed for data

scientists.

Separate and fragmented big data and

analytics work flows.

Complex & disjoint

platforms. Mix of workloads.

80% of time spend on data

preparation

Today’s solutions are not for agile data science and app development.Main focus is on data management and predictability.


IBM z Systems analytics and Spark: well-matched

Faster analytics, both batch and real-timeFederated data access while preserving consistent APIs Historical, OLTP and streaming data Java based, can embed optimized nodes within transaction JVMsUse data in-place, avoiding unnecessary ETLsInfrastructure can be leveraged & optimized ‘underneath’ APIs .. e.g. zEDC, memory

management enhancements, SMT2, SIMDAvailable for Linux on z System now …..Available end of year for z/OS

Apache Spark is emerging as key to analytics ecosystem

File System SchedulerNetwork Serialization

MLlib

RDDcache

SQLGraphXStream

Compression

Look familiar?


LoZ

Leverage LoZ virtualization benefits

Power

SparkSpark SparkSpark

SparkSpark SparkSpark

x86

Leverage call center, external, social, sentiment data …

DB2DB2 VSAM VSAM

z/OS

Leverage z/OS data and transactions

IBM DB2 Analytics Accelerator

z Systems & Apache Spark: Strategic Direction

IMSIMS

Fast, expressive, cluster computing system leveraging in-memory framework for analytics

Federation across platform Spark implementations will initially need external orchestration

Spark

Spark

Linux on z Systems

Spark

DB2

CICSCICS WASWASIMSIMS

SparkSparkSparkSparkSparkSpark


Apache Spark on z/OS – Available Year End 2015

Key Business Transaction

Systems

Type 2Type 4

VSAM

Java

Apache Spark Core

Spark Stream

Spark SQL MLib GraphX

RDDRDD

RDDRDD

RDDRDD

RDDRDD

z/OS

Rocket Mainframe Data Service for Apache Spark z/OS

Spark Applications: IBM and Partners

and more . . .

SMFIMSDB2 z/OS

Value:Optimized access and z/OS governed ‘in-memory’ capabilities for core business data, leveraging open source analytic frameworks

Consistent analytic interfaces for SQL, Graph, Machine Learning across multiple data and system environments

Integration of analytics across core systems, social data, website information, etc.

NEW! Rocket Software Integrates with Apache Spark z/OS via the Mainframe Data Service for Apache Spark z/OS

Distributed

Oracle

HDFS


Data: example DB2

Core transactions

z/OS Distributed

SMS gateway

Social Data

Analytics to Qualify

candidate for promotion

offer

CICS Banking Process:

•Process transaction•Score risk of fraud•Qualify & Initiate promotion offer

Optimized access and z/OS governed ‘in-memory’ capabilities for core business data, leveraging open source analytic frameworks


Almost entirely all zIIP eligible

Leverage of emerging Spark skills and commercial solution ecosystem built on Apache Spark for fast ROI and agility

Integration of analytics heterogeneous data across core systems, social data, website information, etc.

Optimized access and z/OS governed ‘in-memory’ capabilities for core business data, leveraging open source analytic frameworks


Almost entirely all zIIP eligible

Leverage of emerging Spark skills and commercial solution ecosystem built on Apache Spark for fast ROI and agility

Integration of analytics heterogeneous data across core systems, social data, website information, etc.

z/OS

Example: Integration of Spark Analytics with Transaction Systems

Initiate Offer

Federated Analytics, Data in Place

Twitter 27

Spark z/OS

REST or Java interface

Data: example IMS


Spark z/OS Demo: Business Use Case

18

Financial Institution:offers both retail / consumer banking as

well as investment services

Financial Institution:offers both retail / consumer banking as

well as investment services

Business Critical Information owned by the organization:

Credit Card Info

Trade Transaction Info

Customer Info

Stock Price History – public data

Social Media Data: Twitter

GOALGOAL Produce right-time offers for cross sell or upsell, tailored for individual customers based on: customer profile, trade transaction history, credit card purchases and social media sentiment and information

View DemoView Demo https://youtu.be/sDmWcuO5Rk8

https://youtu.be/sDmWcuO5Rk8


Demo: Configuration

19

DB2 z/OSDB2 z/OS

IMSIMS

VSAMVSAM

Cloudant Linux on zCloudant

Linux on z

Sp

ark Job

S

erverS

park Jo

b

Server JDBCJDBC

JDBCJDBC

JDBC with

Rocket MDSS

JDBC with

Rocket MDSS

IMS Transaction

system

IMS Transaction

system

Credit Card Info

Open Source Visualization Libraries

Customer Info

Trade Transaction Info

Linux on z z/OS

CICSTransaction

system

CICSTransaction

system

REST API

REST API


Analytics across OLTP & Warehouse information Leverages aggressively cached in-memory distributed computing and JVM threads

Analytics combining business-owned data and external / social data Written in Scala, an object-oriented, functional

Analytics of real-time transactions via streaming, combine with OLTP & social Covers a wide range of workloads Provides SQL, streaming and complex analytics

Custom analytics of SMF leveraging real-time as well as archived data Covers a wide range of workloads Provides SQL, streaming and complex analytics

Use Cases for Apache Spark z/OS, Leveraging data-in-place analytics:

© 2015 IBM Corporation21

Related Links

Here is Ross Mauri‘s July 27 blog announcement: http://ibm.co/1U2y7Cw

Rocket Software’s Announce:

http://blog.rocketsoftware.com/blog/2015/07/26/get-closer-to-your-data-with-spark/#.Venw5kZ6wo1

http://ibm.co/1U2y7Cw

http://ibm.co/1U2y7Cw

For More Information please contact…

Len Santalucia, CTO & Business Development Manager Vicom Infinity, Inc. One Penn Plaza – Suite 2010 New York, NY 10119 917-856-4493 mobile [email protected] About Vicom InfinityAccount Presence Since Late 1990’sIBM Premier Business PartnerReseller of IBM Hardware, Software, and MaintenanceVendor Source for the Last 10 Generations of Mainframes/IBM StorageProfessional and IT Architectural ServicesVicom Family of Companies Also Offer Leasing & Financing, Computer Services, and IT Staffing & IT Project Management

mailto:[email protected]

Documents

© 2015 IBM Corporation Apache Spark and z Systems December 2015 Mythili Venkatakrishnan [email protected]