Spark on Dataproc - Israel Spark Meetup at Taboola


Vadim Solovey <vadim@doit-intl.com>

Google Cloud Dataproc: Spark and Hadoop with superfast start-up, easy management, and per-minute billing.

Copyright 2015 Google Inc

<vadim@doit-intl.com>

Google Developer Expert & Trainer

CTO of DoIT International

Agenda

01  Google Dataproc Overview
02  Features
03  Demo
04  Roadmap
05  Q&A
06  Try Google Dataproc

Google Cloud Dataproc is a fast, easy-to-use, low-cost, and fully managed service that lets you run Spark and Hadoop on Google Cloud Platform.

Cloud Dataproc


Google Cloud Platform product areas: Management, Mobile, Services, Compute, Big Data, Storage, Developer Tools


Dataproc 101

Easy to Use
Easily create and scale clusters to run native workloads (see the gcloud example below):
• Spark
• PySpark
• Spark SQL
• MapReduce
• Hive
• Pig
• More via Initialization Actions

Integrated
Integration with Cloud Platform provides immense scalability, ease of use, and multiple channels for cluster interaction and management.

Low Cost
Low-cost data processing with:
• Low and fixed price
• Minute-by-minute billing
• Fast cluster provisioning, execution, and removal
• Ability to manually scale clusters based on needs
• Preemptible instances
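As a concrete sketch of "easy to use" (cluster name, zone, and sizes are illustrative; when this deck was written the commands still lived under gcloud beta dataproc):

# Create a managed Spark/Hadoop cluster in about a minute
gcloud dataproc clusters create demo-cluster \
    --zone europe-west1-b \
    --num-workers 2 \
    --worker-machine-type n1-standard-4

# Delete it when the work is done, so billing stops
gcloud dataproc clusters delete demo-cluster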


Competitive Highlights

Cluster start time (elapsed time from cluster creation until it is ready)
Cloud Dataproc: < 90 seconds. Amazon EMR: ~360 seconds.
Customer impact: faster data processing workflows, because less time is spent waiting for clusters to provision and start executing applications.

Billing unit of measure (increment used for billing the service when active)
Cloud Dataproc: per minute. Amazon EMR: per hour.
Customer impact: reduced costs for running Spark and Hadoop, because you pay for what you actually use rather than a figure rounded up to the hour.

Preemptible VMs (clusters can utilize preemptible VMs)
Cloud Dataproc: yes. Amazon EMR: kind of :-)
Customer impact: lower total operating costs for Spark and Hadoop processing by leveraging the cost benefits of preemptibles.

Job output & cancellation (job output is easy to find and jobs are cancelable without SSH)
Cloud Dataproc: yes. Amazon EMR: no.
Customer impact: higher productivity, because finding job output does not require reviewing log files and canceling jobs does not require SSH (see the example below).
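To make that last row concrete, a minimal sketch of driving jobs purely through the gcloud CLI (cluster name, script path, and job ID are placeholders, not from the deck):

# Submit a PySpark job; driver output streams straight back to the terminal
gcloud dataproc jobs submit pyspark gs://my-bucket/jobs/wordcount.py --cluster demo-cluster

# Inspect and cancel jobs without ever SSH-ing into the cluster
gcloud dataproc jobs list --cluster demo-cluster
gcloud dataproc jobs kill <job-id>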

02 Features


Packaging & Versioning

● Spark 1.5.2 w/ PySpark & Spark SQL

● Hadoop 2.7.1

● Pig 0.15

● Hive 1.2.1

● YARN Resource Manager

● Debian 8-based OS

● Google connectors for Cloud Storage, BigQuery, Bigtable, etc.


Features

Integrated
Integrated with Cloud Storage, Cloud Logging, BigQuery, and more.

Anytime Scaling
Manually scale clusters up or down based on need, even while jobs are running (see the sketch after this list).

Tools
UI, API & CLI for rapid development, including Initialization Actions & the Job Output Driver.

Global Availability
Available in every Google Cloud zone in the United States, Europe, and Asia.
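A hedged sketch of Anytime Scaling from the CLI (cluster name and sizes are invented; the preemptible-worker flag spelling has varied between gcloud releases):

# Grow the primary worker pool while jobs keep running
gcloud dataproc clusters update demo-cluster --num-workers 10

# Add preemptible workers for cheap extra throughput
gcloud dataproc clusters update demo-cluster --num-preemptible-workers 20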


Initialization Action Example

#!/bin/bash
# Only run on the master node
ROLE=$(/usr/share/google/get_metadata_value attributes/dataproc-role)
if [[ "${ROLE}" == 'Master' ]]; then
  apt-get install -y build-essential python-dev libpng-dev libfreetype6-dev \
    libxft-dev pkg-config python-matplotlib python-requests
  curl https://bootstrap.pypa.io/get-pip.py | python

  mkdir IPythonNB
  pip install "ipython[notebook]"
  ipython profile create default

  echo "c = get_config()" > /root/.ipython/profile_default/ipython_notebook_config.py
  echo "c.NotebookApp.ip = '*'" >> /root/.ipython/profile_default/ipython_notebook_config.py

  # Setup script for IPython Notebook so it uses the cluster's Spark
  cat > /root/.ipython/profile_default/startup/00-pyspark-setup.py <<'_EOF'
import os
import sys
spark_home = '/usr/lib/spark/'
os.environ["SPARK_HOME"] = spark_home
sys.path.insert(0, os.path.join(spark_home, 'python'))
sys.path.insert(0, os.path.join(spark_home, 'python/lib/py4j-0.8.2.1-src.zip'))
execfile(os.path.join(spark_home, 'python/pyspark/shell.py'))
_EOF

  nohup ipython notebook --no-browser --ip='*' --port=8123 > /var/log/python_notebook.log &
fi


Off-the-Shelf Initialization Actions
https://github.com/GoogleCloudPlatform/dataproc-initialization-actions

Pull requests are welcome!

Jupyter, Facebook Presto, Zeppelin, Kafka, Zookeeper
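Wiring one of these off-the-shelf actions into a new cluster is a single flag at creation time; a sketch (the gs:// path follows the convention of the public repository above and should be checked against it):

gcloud dataproc clusters create demo-cluster \
    --initialization-actions gs://dataproc-initialization-actions/jupyter/jupyter.sh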


Available Datastores

BigQuery, Bigtable, Cloud SQL, Datastore, Cloud Storage, Nearline
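Because the Cloud Storage connector ships on every Dataproc cluster, gs:// paths can be used wherever an HDFS path would go; a small illustration (bucket name is hypothetical):

# Browse a bucket with the standard Hadoop CLI
hadoop fs -ls gs://my-bucket/input/

# Read the same data from Spark, exactly like an HDFS path:
#   sc.textFile("gs://my-bucket/input/*").count()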


GCS Connector Performance (I): Recommendation Engine Use Case (1 file, 500 GB)


GCS Connector Performance (II): Sessionization Use Case (14,800 files, 1 GB each)


GCS Connector Performance (III): Document Clustering Use Case (31,000 files, 250 MB each)


Additional Integrations

Cloud Logging, Cloud Monitoring


Spark & BigQuery Integration Example

// Assumed context: an interactive spark-shell session on the cluster, so `sc` is the
// shell's SparkContext, `conf` its Hadoop configuration, and `projectId` the GCP project ID.
import com.google.cloud.hadoop.io.bigquery.{BigQueryConfiguration, GsonBigQueryInputFormat}
import com.google.gson.JsonObject
import org.apache.hadoop.io.LongWritable

val fullyQualifiedInputTableId = "publicdata:samples.shakespeare"
val fullyQualifiedOutputTableId = "<your-fully-qualified-table-id>"
val outputTableSchema = "[{'name': 'Word','type': 'STRING'},{'name': 'Count','type': 'INTEGER'}]"
val jobName = "wordcount"

// Set the job-level projectId.
conf.set(BigQueryConfiguration.PROJECT_ID_KEY, projectId)

// Use the systemBucket for temporary BigQuery export data used by the InputFormat.
val systemBucket = conf.get("fs.gs.system.bucket")
conf.set(BigQueryConfiguration.GCS_BUCKET_KEY, systemBucket)

// Configure input and output for BigQuery access.
BigQueryConfiguration.configureBigQueryInput(conf, fullyQualifiedInputTableId)
BigQueryConfiguration.configureBigQueryOutput(conf, fullyQualifiedOutputTableId, outputTableSchema)

val fieldName = "word"

val tableData = sc.newAPIHadoopRDD(
  conf,
  classOf[GsonBigQueryInputFormat],
  classOf[LongWritable],
  classOf[JsonObject])
tableData.cache()
tableData.count()
tableData.map(entry => (entry._1.toString(), entry._2.toString())).take(10)
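The snippet above assumes an interactive spark-shell session; to run the same logic non-interactively, one option is to package it as a jar and hand it to the jobs API (class and jar names here are placeholders):

gcloud dataproc jobs submit spark \
    --cluster demo-cluster \
    --class com.example.bigquery.WordCount \
    --jars gs://my-bucket/jars/bigquery-wordcount.jar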

03 Demo


Pricing Example

A 35-minute Spark job running on 14x 16-core workers (224 cores)

[ Crunching 3TB TeraSort ]


Pricing

Pricing Example

Function                   | Machine Type  | # in Cluster | vCPUs | Instances Price | Dataproc Price
Master Node                | n1-standard-4 | 1            | 4     | $0.20           | $0.04
Worker Nodes               | n1-highmem-16 | 4            | 64    | $4.032          | $0.64
Worker Nodes (Preemptible) | n1-highmem-16 | 10           | 160   | $3.80           | $1.60
Cluster Total              | n/a           | 15           | 224   | $4.88           |

Pricing Details

Dataproc per-hour price (USD): $0.01 per Compute Engine vCPU (any machine type)

35% to 300% less than AWS EMR (c3.2xlarge | m2.4xlarge)
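As a rough back-of-the-envelope check (my arithmetic, not a figure from the deck), the Dataproc fee alone for the 35-minute, 224-core job above is about:

$0.01 per vCPU-hour × 224 vCPUs × (35 / 60) hours ≈ $1.31

billed in one-minute increments, on top of the regular Compute Engine charges for the underlying instances.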

04 Roadmap


Roadmap (Q1 2015)

More Pre-Installed Engines, Frameworks & Tools (via initialization scripts): Mahout, Hue, Cloudera, MapR, and others.

Performance: further improve performance of jobs running directly on Google Cloud Storage. The ultimate goal is to make GCS the default storage for Dataproc and to provide 2x the performance of local HDFS (when not using Local SSD).

More Native Datastores: Spanner, Google ML.

06 Try Google Dataproc in 2015


AWS EMR Customer?

Get $1,000 to test Google Dataproc


Not an AWS EMR Customer?

Get $1,000* to test Google Dataproc


* Agree to a 1-hour meeting @ Google Tel-Aviv to discuss your Big Data needs


goo.gl/mFwCYa (promo code: "1K-Dataproc")

05 Q&A

goo.gl/mFwCYa
