Scala: Lingua Franca of Fast Data - Meetup (files.meetup.com/7770922/20160524 IBM Fast Data...)

Scala: Lingua Franca of Fast Data

Jamie Allen, Sr. Director of Global Solutions Architects

Agenda

• Why Scala?
• Who is doing this?
• What is Fast Data?
• Architecting for Fast Data

Tradeoffs

• Cloud portability versus native control
• Application correctness versus speed of development
• Modularity versus global namespace
• Concise syntax versus boilerplate
• Multi-threaded simplicity via abstractions versus low-level control

Scala is the local optimum

• REPL
• Type safety
• Modularity
• Concise syntax
• Multi-threaded simplicity
• Data-centric semantics
• Managed runtime for cloud portability
• Ecosystem
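To make the type-safety, concise-syntax, and data-centric points concrete, here is a minimal sketch in plain Scala. The `Reading` type and the `averageBySensor` function are hypothetical names chosen for illustration; they do not come from the talk.

```scala
// Hypothetical record type, for illustration only.
final case class Reading(sensor: String, value: Double)

object CollectionsSketch {
  // Average value per sensor, written as a pipeline of collection operations.
  // The compiler checks every field access, so a typo such as `_.sensr`
  // would be rejected at compile time.
  def averageBySensor(readings: Seq[Reading]): Map[String, Double] =
    readings
      .groupBy(_.sensor)
      .map { case (sensor, rs) => sensor -> rs.map(_.value).sum / rs.size }

  def main(args: Array[String]): Unit = {
    val readings = Seq(
      Reading("a", 1.0),
      Reading("b", 4.0),
      Reading("a", 3.0)
    )
    println(averageBySensor(readings))
  }
}
```

The same `groupBy`/`map` vocabulary is what Spark exposes on distributed data, which is one reason the collections semantics matter for this audience.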

The JVM is a primary reason for Scala’s success

Why not Java?

• No REPL or Notebook
• Not a data-centric language, particularly collections semantics

Why not Python?

• Data-centric language, has all of the wonderful collections semantics we want
• No type safety
• No modularity

Why not Go or C++?

• Weak type safety
• Collections are too elemental
• Native execution is a non-starter, so Go is the only option
• Garbage collection is not generational

Scala is NOT the end of the road

• Scala just so happened to fit well in this space
    • Performance
    • Correctness
    • Conciseness
• Scala will evolve
• Other languages will come in time

Who is doing this?

One Caveat: Apache Beam and TensorFlow

Why Scala?

"At the time we started, I really wanted a PL that supports a language-integrated interface (where people write functions inline, etc)… However, I also wanted to be on the JVM in order to easily interact with the Hadoop filesystem and data formats for that. Scala was the only somewhat popular JVM language that offered this kind of functional syntax and was also statically typed (letting us have some control over performance), so we chose that. Today there might be an argument to make the first version of the API in Java with Java 8, but we also benefitted from other aspects of Scala in Spark, like type inference, pattern matching, actor libraries, etc."

Matei Zaharia, creator of Spark

What is Fast Data?

A bit of history: Hadoop

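[Diagram: Hadoop cluster. A master node runs the Resource Manager and Name Node; each slave (worker) node runs a Node Manager and a Data Node over local disks. Flume and Sqoop feed data in, and MR job #1 and MR job #2 run on the cluster.]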

Hadoop strengths

• Lowest capital expenditure for big data
• Excellent for ingesting and integrating diverse datasets
• Flexible
    • Classic analytics (aggregations and data warehousing)
    • Machine learning

Hadoop weaknesses

• Complex administration
• YARN requires dedicated cluster
• MapReduce foibles
    • Poor performance
    • Imperative programming model
    • No stream processing support

Fast Data with Spark

Spark

• Up to 100x faster than Hadoop MapReduce when used as its replacement for in-memory workloads
• Uses far fewer machines and resources
• Incredible support from the community and enterprise

Spark use cases

• Primarily anomaly detection
    • Risk management
    • Fraud detection
    • Odds recalculation
• Spam filters
• Update search engine results quickly

Type safety

• Spark had it with RDDs
• It was removed with the DataFrames API
• It was brought back with Datasets, but not as comprehensively as with RDDs
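The difference between the typed and untyped styles can be sketched in plain Scala. This is an analogy only, not Spark code: `Payment`, `largeTyped`, and `largeUntyped` are hypothetical names, a `Seq[Payment]` stands in for a typed collection such as an RDD or Dataset of a case class, and a `Seq[Map[String, Any]]` stands in for rows addressed by column name as in a DataFrame.

```scala
import scala.util.Try

// Hypothetical record type; plain Scala standing in for Spark's APIs.
final case class Payment(user: String, amount: Double)

object TypeSafetySketch {
  // Typed style: the field is checked by the compiler, so a misspelling
  // such as `_.amout` would fail to compile.
  def largeTyped(payments: Seq[Payment]): Seq[Payment] =
    payments.filter(_.amount > 100.0)

  // Untyped style: the column name is just a string, so a typo is only
  // discovered when the code runs. The failure is captured in a Try.
  def largeUntyped(
      rows: Seq[Map[String, Any]],
      column: String
  ): Try[Seq[Map[String, Any]]] =
    Try(rows.filter(row => row(column).asInstanceOf[Double] > 100.0))

  def main(args: Array[String]): Unit = {
    val rows = Seq(
      Map[String, Any]("user" -> "u1", "amount" -> 250.0),
      Map[String, Any]("user" -> "u2", "amount" -> 20.0)
    )
    println(largeUntyped(rows, "amount").isSuccess) // correct column name
    println(largeUntyped(rows, "amout").isFailure)  // typo surfaces at runtime
  }
}
```

In real Spark code the untyped failure would typically appear when the query is analyzed rather than as a missing map key, but the trade-off is the same: errors move from compile time to run time.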

Why not Flink?

• Flink has much better stream handling for low-latency systems, which Spark currently lacks
    • Event timing
    • Watermarks
    • Triggers
• Exactly-once semantics
• Pipeline portability via Apache Beam integration
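Event time, watermarks, and triggers are easier to follow with a toy model. The sketch below is plain Scala and deliberately simplified; it is not the Flink API. `Event`, `TumblingCounter`, and `maxOutOfOrderness` are hypothetical names, and event times are assumed to be non-negative milliseconds. The watermark trails the largest event time seen by a fixed bound, and a window is emitted (the "trigger") once the watermark passes its end.

```scala
// Hypothetical event carrying only its event time (ms), for illustration.
final case class Event(eventTime: Long)

// Counts events per tumbling event-time window.
final class TumblingCounter(windowSize: Long, maxOutOfOrderness: Long) {
  private var open = Map.empty[Long, Int] // window start -> event count
  private var watermark = Long.MinValue

  /** Feeds one event and returns the (windowStart, count) pairs that the
    * advancing watermark has just closed, ordered by window start. */
  def onEvent(e: Event): Seq[(Long, Int)] = {
    val start = e.eventTime - (e.eventTime % windowSize)
    // An event whose window already lies behind the watermark is late; drop it.
    if (start + windowSize > watermark)
      open = open.updated(start, open.getOrElse(start, 0) + 1)
    // The watermark only moves forward.
    watermark = math.max(watermark, e.eventTime - maxOutOfOrderness)
    // Trigger: emit every window whose end is at or before the watermark.
    val (closed, stillOpen) =
      open.partition { case (s, _) => s + windowSize <= watermark }
    open = stillOpen
    closed.toSeq.sortBy(_._1)
  }
}

object WatermarkSketch {
  def main(args: Array[String]): Unit = {
    val counter = new TumblingCounter(windowSize = 10, maxOutOfOrderness = 5)
    // Event times arrive out of order; 8 is tolerated, 3 arrives too late.
    Seq(1L, 12L, 8L, 16L, 3L, 30L).foreach { t =>
      println(s"t=$t closed=${counter.onEvent(Event(t))}")
    }
  }
}
```

Processing-time systems would instead assign each event to a window by arrival time, which is why out-of-order data is the case that separates the two models.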


Architecting for Fast Data

This isn’t enough

Old and busted

"Traditional application architectures and platforms are obsolete."

Gartner

How do we avoid messing this up?

We want isolation

• At the API
• In our source
• For our data

Wikipedia, Creative Commons, created by DFoerster

We want realistic data management

• Use CQRS and Event Sourcing, not CRUD
• Transactions, especially distributed, will not work
• Consistency is an anti-pattern at scale
• Distributed locks and shared data will limit you
• Data fabrics break all of these conventions
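As a minimal sketch of the CQRS and Event Sourcing point, assume a hypothetical bank-account domain (`AccountEvent`, `Deposited`, `Withdrawn`, `Balance`, and the functions below are illustrative names, not from the talk). The write side validates a command and appends events instead of updating a row in place; the read side derives state by folding over the log.

```scala
// Hypothetical account domain, for illustration only.
sealed trait AccountEvent
final case class Deposited(amount: BigDecimal) extends AccountEvent
final case class Withdrawn(amount: BigDecimal) extends AccountEvent

final case class Balance(value: BigDecimal)

object EventSourcingSketch {
  // Command side: validate against current state and return the events to
  // append. Nothing is overwritten, unlike a CRUD update.
  def withdraw(state: Balance, amount: BigDecimal): Either[String, Seq[AccountEvent]] =
    if (amount <= 0) Left("amount must be positive")
    else if (amount > state.value) Left("insufficient funds")
    else Right(Seq(Withdrawn(amount)))

  // Query side: current state is a left fold over the append-only event log.
  def replay(log: Seq[AccountEvent]): Balance =
    log.foldLeft(Balance(0)) {
      case (b, Deposited(a)) => Balance(b.value + a)
      case (b, Withdrawn(a)) => Balance(b.value - a)
    }

  def main(args: Array[String]): Unit = {
    val log = Seq[AccountEvent](Deposited(100), Withdrawn(30))
    val state = replay(log)
    println(state)
    println(withdraw(state, 100)) // rejected, the log is unchanged
    println(withdraw(state, 50))  // accepted, yields an event to append
  }
}
```

Because the log is never mutated, a mistake is corrected by appending a compensating event rather than by rolling back a transaction.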

"Think in terms of compensation, not prevention."

Kevin Webber, Lightbend

We want ACID v2

• Associativity, not Atomicity
• Commutativity, not Consistency
• Idempotent, not Isolation
• Distributed, not Durable
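A small sketch shows why these properties matter for distributed state. Assume, hypothetically, that each replica tracks the highest sequence number seen per device (`State` and `merge` are illustrative names). Merging by per-key maximum is associative, commutative, and idempotent, so replicas can exchange state in any order, any grouping, and any number of times and still converge.

```scala
object Acid2Sketch {
  // Hypothetical replica state: highest sequence number seen per device.
  type State = Map[String, Long]

  // Merge keeps the per-key maximum. Because max is associative, commutative
  // and idempotent, so is this merge.
  def merge(a: State, b: State): State =
    (a.keySet ++ b.keySet).map { k =>
      k -> math.max(a.getOrElse(k, Long.MinValue), b.getOrElse(k, Long.MinValue))
    }.toMap

  def main(args: Array[String]): Unit = {
    val a: State = Map("x" -> 1L, "y" -> 5L)
    val b: State = Map("x" -> 3L)
    val c: State = Map("y" -> 4L, "z" -> 2L)

    println(merge(merge(a, b), c) == merge(a, merge(b, c))) // associativity
    println(merge(a, b) == merge(b, a))                     // commutativity
    println(merge(merge(a, b), b) == merge(a, b))           // idempotence
  }
}
```

An operation such as "add 1 to the counter" has none of these guarantees under redelivery, which is why it needs the coordination that the classic ACID properties imply.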

Wikipedia, Creative Commons, created by Weston.pace

New hotness

Fast Data Architecture

[Diagram: Fast Data Architecture. Labels shown: Internet; HTTP/REST; Web Services; Reactive Services; Logs and Other Files; Actors; Cluster … Persist; Akka Streams; Streaming; SQL; MLlib; GraphX; HDFS, S3, CFSv2; SQL/NoSQL; Mesos, YARN on Bare Metal, Cloud.]

Learning Spark

• Go to http://bigdatauniversity.com, built by IBM
