Spark on Dataproc - Israel Spark Meetup at Taboola


Vadim Solovey <vadim@doit-intl.com>

Google Cloud Dataproc: Spark and Hadoop with superfast start-up, easy management, and per-minute billing.

Copyright 2015 Google Inc

<vadim@doit-intl.com>

Google Developer Expert & Trainer

CTO of DoIT International

Agenda

01  Google Dataproc Overview
02  Features
03  Demo
04  Roadmap
05  Q&A
06  Try Google Dataproc

Google Cloud Dataproc is a fast, easy-to-use, low-cost, and fully managed service that lets you run Spark and Hadoop on Google Cloud Platform.

Cloud Dataproc


Google Cloud Platform product areas: Management, Mobile, Services, Compute, Big Data, Storage, Developer Tools


Dataproc 101

Easy to Use
Easily create and scale clusters to run native workloads (see the gcloud example below):
• Spark
• PySpark
• Spark SQL
• MapReduce
• Hive
• Pig
• More via Initialization Actions

Integrated
Integration with Cloud Platform provides immense scalability, ease of use, and multiple channels for cluster interaction and management.

Low Cost
Low-cost data processing with:
• Low and fixed price
• Minute-by-minute billing
• Fast cluster provisioning, execution, and removal
• Ability to manually scale clusters based on needs
• Preemptible instances
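As a concrete sketch of "easy to use" (cluster name, zone, and sizes are illustrative; when this deck was written the commands still lived under gcloud beta dataproc):

# Create a managed Spark/Hadoop cluster in about a minute
gcloud dataproc clusters create demo-cluster \
    --zone europe-west1-b \
    --num-workers 2 \
    --worker-machine-type n1-standard-4

# Delete it when the work is done, so billing stops
gcloud dataproc clusters delete demo-cluster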


Competitive Highlights

Cluster start time (elapsed time from cluster creation until it is ready)
Cloud Dataproc: < 90 seconds. Amazon EMR: ~360 seconds.
Customer impact: faster data processing workflows, because less time is spent waiting for clusters to provision and start executing applications.

Billing unit of measure (increment used for billing the service when active)
Cloud Dataproc: per minute. Amazon EMR: per hour.
Customer impact: reduced costs for running Spark and Hadoop, because you pay for what you actually use rather than a figure rounded up to the hour.

Preemptible VMs (clusters can utilize preemptible VMs)
Cloud Dataproc: yes. Amazon EMR: kind of :-)
Customer impact: lower total operating costs for Spark and Hadoop processing by leveraging the cost benefits of preemptibles.

Job output & cancellation (job output is easy to find and jobs are cancelable without SSH)
Cloud Dataproc: yes. Amazon EMR: no.
Customer impact: higher productivity, because finding job output does not require reviewing log files and canceling jobs does not require SSH (see the example below).
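To make that last row concrete, a minimal sketch of driving jobs purely through the gcloud CLI (cluster name, script path, and job ID are placeholders, not from the deck):

# Submit a PySpark job; driver output streams straight back to the terminal
gcloud dataproc jobs submit pyspark gs://my-bucket/jobs/wordcount.py --cluster demo-cluster

# Inspect and cancel jobs without ever SSH-ing into the cluster
gcloud dataproc jobs list --cluster demo-cluster
gcloud dataproc jobs kill <job-id>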

02 Features


Packaging & Versioning

● Spark 1.5.2 w/ PySpark & Spark SQL

● Hadoop 2.7.1

● Pig 0.15

● Hive 1.2.1

● YARN Resource Manager

● Debian 8-based OS

● Google connectors for Cloud Storage, BigQuery, Bigtable, etc.


Features

Integrated
Integrated with Cloud Storage, Cloud Logging, BigQuery, and more.

Anytime Scaling
Manually scale clusters up or down based on need, even while jobs are running (see the sketch after this list).

Tools
UI, API & CLI for rapid development, including Initialization Actions & the Job Output Driver.

Global Availability
Available in every Google Cloud zone in the United States, Europe, and Asia.
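A hedged sketch of Anytime Scaling from the CLI (cluster name and sizes are invented; the preemptible-worker flag spelling has varied between gcloud releases):

# Grow the primary worker pool while jobs keep running
gcloud dataproc clusters update demo-cluster --num-workers 10

# Add preemptible workers for cheap extra throughput
gcloud dataproc clusters update demo-cluster --num-preemptible-workers 20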


Initialization Action Example

#!/bin/bash
# Only run on the master node
ROLE=$(/usr/share/google/get_metadata_value attributes/dataproc-role)
if [[ "${ROLE}" == 'Master' ]]; then
  apt-get install -y build-essential python-dev libpng-dev libfreetype6-dev \
    libxft-dev pkg-config python-matplotlib python-requests
  curl https://bootstrap.pypa.io/get-pip.py | python

  mkdir IPythonNB
  pip install "ipython[notebook]"
  ipython profile create default

  echo "c = get_config()" > /root/.ipython/profile_default/ipython_notebook_config.py
  echo "c.NotebookApp.ip = '*'" >> /root/.ipython/profile_default/ipython_notebook_config.py

  # Setup script for IPython Notebook so it uses the cluster's Spark
  cat > /root/.ipython/profile_default/startup/00-pyspark-setup.py <<'_EOF'
import os
import sys
spark_home = '/usr/lib/spark/'
os.environ["SPARK_HOME"] = spark_home
sys.path.insert(0, os.path.join(spark_home, 'python'))
sys.path.insert(0, os.path.join(spark_home, 'python/lib/py4j-0.8.2.1-src.zip'))
execfile(os.path.join(spark_home, 'python/pyspark/shell.py'))
_EOF

  nohup ipython notebook --no-browser --ip='*' --port=8123 > /var/log/python_notebook.log &
fi


Off-the-Shelf Initialization Actions
https://github.com/GoogleCloudPlatform/dataproc-initialization-actions

Pull requests are welcome!

Jupyter, Facebook Presto, Zeppelin, Kafka, Zookeeper
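Wiring one of these off-the-shelf actions into a new cluster is a single flag at creation time; a sketch (the gs:// path follows the convention of the public repository above and should be checked against it):

gcloud dataproc clusters create demo-cluster \
    --initialization-actions gs://dataproc-initialization-actions/jupyter/jupyter.sh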


Available Datastores

BigQuery, Bigtable, Cloud SQL, Datastore, Cloud Storage, Nearline
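Because the Cloud Storage connector ships on every Dataproc cluster, gs:// paths can be used wherever an HDFS path would go; a small illustration (bucket name is hypothetical):

# Browse a bucket with the standard Hadoop CLI
hadoop fs -ls gs://my-bucket/input/

# Read the same data from Spark, exactly like an HDFS path:
#   sc.textFile("gs://my-bucket/input/*").count()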


GCS Connector Performance (I): Recommendation Engine Use Case (1 file, 500 GB)


GCS Connector Performance (II): Sessionization Use Case (14,800 files, 1 GB each)


GCS Connector Performance (III): Document Clustering Use Case (31,000 files, 250 MB each)


Additional Integrations

Cloud Logging, Cloud Monitoring


Spark & BigQuery Integration Example

// Assumed context: an interactive spark-shell session on the cluster, so `sc` is the
// shell's SparkContext, `conf` its Hadoop configuration, and `projectId` the GCP project ID.
import com.google.cloud.hadoop.io.bigquery.{BigQueryConfiguration, GsonBigQueryInputFormat}
import com.google.gson.JsonObject
import org.apache.hadoop.io.LongWritable

val fullyQualifiedInputTableId = "publicdata:samples.shakespeare"
val fullyQualifiedOutputTableId = "<your-fully-qualified-table-id>"
val outputTableSchema = "[{'name': 'Word','type': 'STRING'},{'name': 'Count','type': 'INTEGER'}]"
val jobName = "wordcount"

// Set the job-level projectId.
conf.set(BigQueryConfiguration.PROJECT_ID_KEY, projectId)

// Use the systemBucket for temporary BigQuery export data used by the InputFormat.
val systemBucket = conf.get("fs.gs.system.bucket")
conf.set(BigQueryConfiguration.GCS_BUCKET_KEY, systemBucket)

// Configure input and output for BigQuery access.
BigQueryConfiguration.configureBigQueryInput(conf, fullyQualifiedInputTableId)
BigQueryConfiguration.configureBigQueryOutput(conf, fullyQualifiedOutputTableId, outputTableSchema)

val fieldName = "word"

val tableData = sc.newAPIHadoopRDD(
  conf,
  classOf[GsonBigQueryInputFormat],
  classOf[LongWritable],
  classOf[JsonObject])
tableData.cache()
tableData.count()
tableData.map(entry => (entry._1.toString(), entry._2.toString())).take(10)
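The snippet above assumes an interactive spark-shell session; to run the same logic non-interactively, one option is to package it as a jar and hand it to the jobs API (class and jar names here are placeholders):

gcloud dataproc jobs submit spark \
    --cluster demo-cluster \
    --class com.example.bigquery.WordCount \
    --jars gs://my-bucket/jars/bigquery-wordcount.jar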

03 Demo


Pricing Example

A 35-minute Spark job running on 14x 16-core workers (224 cores)

[ Crunching 3TB TeraSort ]


Pricing

Pricing Example

Function                   | Machine Type  | # in Cluster | vCPUs | Instances Price | Dataproc Price
Master Node                | n1-standard-4 | 1            | 4     | $0.20           | $0.04
Worker Nodes               | n1-highmem-16 | 4            | 64    | $4.032          | $0.64
Worker Nodes (Preemptible) | n1-highmem-16 | 10           | 160   | $3.80           | $1.60
Cluster Total              | n/a           | 15           | 224   | $4.88           |

Pricing Details

Dataproc per-hour price (USD): $0.01 per Compute Engine vCPU (any machine type)

35% to 300% less than AWS EMR (c3.2xlarge | m2.4xlarge)
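As a rough back-of-the-envelope check (my arithmetic, not a figure from the deck), the Dataproc fee alone for the 35-minute, 224-core job above is about:

$0.01 per vCPU-hour × 224 vCPUs × (35 / 60) hours ≈ $1.31

billed in one-minute increments, on top of the regular Compute Engine charges for the underlying instances.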

04 Roadmap


Roadmap (Q1 2015)

More Pre-Installed Engines, Frameworks & Tools (via initialization scripts): Mahout, Hue, Cloudera, MapR, and others.

Performance: further improve performance of jobs running directly on Google Cloud Storage. The ultimate goal is to make GCS the default storage for Dataproc and to provide 2x the performance of local HDFS (when not using Local SSD).

More Native Datastores: Spanner, Google ML.

06 Try Google Dataproc in 2015


AWS EMR Customer?

Get $1,000 to test Google Dataproc


Not an AWS EMR Customer?

Get $1,000* to test Google Dataproc


* Agree to a 1-hour meeting @ Google Tel-Aviv to discuss your Big Data needs


goo.gl/mFwCYa (promo code: "1K-Dataproc")

05 Q&A

goo.gl/mFwCYa
