Why GCP over AWS -...


April, 2018

© Copyright 2018 Koivu Solutions Oy CONFIDENTIAL 1

GCP Workshop

Koivu Solutions Oy

• From Pori, since 2017

• A Solutions Company

• Knowledge to help define the digital roadmap

• Capabilities and tools to rapidly develop digital pilots to showcase benefits

• Deployment experience to expand breakthrough digital technologies to the entire organization

• Extensive Fortune Global 1000 experience

• Expertise developed across many digital assessments in multiple industries

• More than 25 years of enterprise solutions development experience

• More than 100 enterprise customer projects delivered across 4 continents


Koivu Founding Team – Born Global


Sami Lahti, Senior Technology Architect. Former JDA, i2, Innomat

Samu Lahti, Senior Software Architect. Former JDA, i2, Tieto

Chris Morhard, Senior Business Architect. Former E2open, MercuryGate, JDA, i2

Janne Salmi, Senior Business Architect. Former Chainalytics, ROCE Partners, McKinsey

Harri Rajala, Senior Software Architect. Former JDA, i2, Innomat

Our team’s work has been for industry giants

…and many more

The stop-and-go mentality is the Achilles' heel of today’s enterprise software

• Nightly batch processing stops the business. A 24-hour day is not a 24-hour business day.

• Shrinking nightly windows for batch data processing put pressure on systems to be available earlier the next morning, with data good enough for tomorrow’s business.

• These systems are sequence-oriented, incapable of parallel data processing, and hard to scale. Peak-load capacity must always be online.

[Figure: a repeating cycle of "stop and load" followed by "run"]

The modern way to build enterprise applications is to connect the dots on a cloud platform


IoT and Data Analytics solutions are meant to be real-time.


A message queue is a core component of a modern enterprise application

• When do you need messages to be delivered?

• People send messages all the time: email, SMS, chat, comments, …

• These are just a fraction of the messages that systems send to each other.

• Traditionally, many message-type processes have been handled as batch loads, for example nightly. Today, real-time operation is a business requirement.

• A message hub receives messages and passes them on for others to process.

• It works as a shock absorber for load fluctuation.

• Message content can be ‘anything’.

• Message buffering and guaranteed delivery enable high availability for the complete communication chain.

• Publishers and subscribers work independently, without knowing about each other.

• There can be many topics / channels on which messages are published.

• A new consumer can join and listen to the messages without changes to the publisher(s).
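The publish/subscribe idea can be sketched in a few lines of Python. This is a toy in-memory hub, not the Cloud Pub/Sub API; the class name, method names and the "orders" topic are hypothetical and chosen only for illustration. It shows per-subscriber buffering and a consumer joining without any change to the publisher.

```python
from collections import defaultdict, deque


class MessageHub:
    """Toy in-memory hub: each topic buffers messages per subscriber."""

    def __init__(self):
        # topic name -> {subscriber name -> queue of pending messages}
        self._queues = defaultdict(dict)

    def subscribe(self, topic, subscriber):
        # A new consumer joins without any change to the publishers.
        self._queues[topic].setdefault(subscriber, deque())

    def publish(self, topic, message):
        # The publisher does not know who (if anyone) is listening.
        for queue in self._queues[topic].values():
            queue.append(message)

    def pull(self, topic, subscriber):
        # Each subscriber drains its own buffer at its own pace.
        queue = self._queues[topic][subscriber]
        return queue.popleft() if queue else None


hub = MessageHub()
hub.subscribe("orders", "billing")
hub.publish("orders", {"id": 1})
hub.subscribe("orders", "analytics")  # joins later, publisher unchanged
hub.publish("orders", {"id": 2})
```

In this sketch a subscriber only receives messages published after it subscribed, and a slow subscriber simply accumulates a backlog in its own queue.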


Google Cloud Pub/Sub is a highly available messaging hub for asynchronous communication

• Cloud Pub/Sub is a simple, reliable, scalable foundation for stream analytics and event-driven computing systems.

• As part of Google Cloud’s stream analytics solution, the service ingests event streams and delivers them to Cloud Dataflow for processing and BigQuery for analysis as a data warehousing solution.


Spotify is a music streaming service that also generates a huge number of messages.

“Whenever a user performs an action in the Spotify client—such as listening to a song or searching for an artist—a small piece of information, an event, is sent to our servers. Event delivery, the process of making sure that all events gets transported safely from clients all over the world to our central processing system, is an interesting problem. ”

https://labs.spotify.com/2016/03/03/spotifys-event-delivery-the-road-to-the-cloud-part-ii/

• Spotify is now on their 3rd generation of messaging solutions (2016).

• How do they handle millions of messages 24/7?


Spotify did a massive Pub / Sub Performance Test

• Spotify used 29 servers to send messages to Pub/Sub, generating a continuous test load of 2 million messages per second.

• “Enabling batching and compression on the Event Service machines resulted in ~1Gbps of network traffic towards Pub/Sub.”

• “Pub/Sub passed the test with flying colours. We published 2M messages [per second] without any service degradation and received almost no server errors from the Pub/Sub backend.”

• At current Google prices (2018) this would cost about $17 per hour.

• $0.04 per GB

• About 330 TB per month at this rate.
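The cost figures can be checked with a short calculation. This is a sketch under stated assumptions: traffic is taken as a steady 1 Gbps, the slide's price is applied per GiB (2^30 bytes), and a month is taken as 730 hours. None of these assumptions are confirmed by the source beyond the quoted numbers.

```python
# Assumptions: ~1 Gbps sustained traffic and a flat $0.04 per GiB (from the slide,
# not from a current price list); 730 hours is an average month.
BITS_PER_SECOND = 1e9
PRICE_PER_GIB_USD = 0.04
HOURS_PER_MONTH = 730

bytes_per_hour = BITS_PER_SECOND / 8 * 3600
gib_per_hour = bytes_per_hour / 2**30
cost_per_hour = gib_per_hour * PRICE_PER_GIB_USD
tb_per_month = bytes_per_hour * HOURS_PER_MONTH / 1e12  # decimal terabytes

print(f"${cost_per_hour:.2f} per hour, about {tb_per_month:.1f} TB per month")
```

Under these assumptions the result lands close to the slide's figures of roughly $17 per hour and 330 TB per month.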


The second part of the test was to let Pub/Sub buffer messages for one hour before starting to consume them.

• 7.2 billion messages buffered

• Consumption then ran at only a slightly higher rate than publishing.

• It took 8 hours to catch up on the backlog.

• No messages were observed to be lost!

• Pub/Sub managed the backlog and hid the ‘blackout’ of the back-end system.
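The backlog numbers follow from simple arithmetic. The sketch below assumes constant publish and consume rates, which the source does not state, and derives how much faster than publishing the consumers must have run to drain one hour of backlog in eight hours.

```python
PUBLISH_RATE = 2_000_000       # messages per second, from the test
BUFFER_SECONDS = 3600          # one hour with no consumers
CATCH_UP_SECONDS = 8 * 3600    # reported time to drain the backlog

# Messages accumulated while nothing was consuming.
backlog = PUBLISH_RATE * BUFFER_SECONDS

# While draining, the backlog shrinks by (consume rate - publish rate) per second.
extra_rate = backlog / CATCH_UP_SECONDS
consume_rate = PUBLISH_RATE + extra_rate

print(backlog)                      # 7200000000
print(consume_rate / PUBLISH_RATE)  # 1.125
```

So under the constant-rate assumption, consumers running about 12.5% faster than the publishers are enough to clear the backlog in eight hours.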


What did Spotify need to do to set up, run and maintain Pub/Sub for a scalability test of this size?

Absolutely nothing.

Pub/Sub is a fully managed, configuration-free, global service operated entirely by Google.

No capacity reservations, no system configuration, no administration.

It just runs, like a telecommunication switch for phones.

“Based on these tests, we felt confident that Cloud Pub/Sub was the right choice for us. Latency was low and consistent, and the only capacity limitations we encountered was the one explicitly set by the available quota. In short, choosing Cloud Pub/Sub rather than Kafka 0.8 for our new event delivery platform was an obvious choice.”


Serverless – NoOps


BigQuery is Google's serverless, highly scalable, low cost enterprise data warehouse designed to make all your data analysts productive.

• Because there is no infrastructure to manage, you can focus on analyzing data to find meaningful insights using familiar SQL and you don't need a database administrator.

• BigQuery enables you to analyze all your data by creating a logical data warehouse over managed, columnar storage as well as data from object storage, and spreadsheets.

• BigQuery makes it easy to securely share insights within your organization and beyond as datasets, queries, spreadsheets and reports.

• BigQuery allows organizations to capture and analyze data in real-time using its powerful streaming ingestion capability so that your insights are always current.

• BigQuery is free for up to 1TB of data analyzed each month and 10GB of data stored.


Public data set used: New York Taxi Trips -2016


Table ID: nyc-tlc:yellow.trips

Table Size: 130 GB

Long Term Storage Size: 130 GB

Number of Rows: 1,108,779,463

A data set of more than a billion rows

This dataset includes trip records from all trips completed in yellow taxis in NYC since 2009. Records include fields capturing pick-up and drop-off dates/times, pick-up and drop-off locations, trip distances, itemized fares, rate types, payment types, and driver-reported passenger counts. The data used in the attached datasets were collected and provided to the NYC Taxi and Limousine Commission (TLC) by technology providers authorized under the Taxicab Passenger Enhancement Program (TPEP).

Calculate average trip price by passenger count

• Categorize all 1.1 billion rows by passenger count.

• Calculate the average price, distance and time for each category.

• This uses values from every one of the 1.1 billion rows.

• 41 GB of data scanned.

• 6 seconds execution time.

• Theoretical query cost… invisible.

• (Data set cost: none; Google provides public data sets.)
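To put a number on "invisible", the query cost can be estimated from the bytes scanned. The per-terabyte price below is an assumed on-demand list price and should be checked against current BigQuery pricing. The free-tier figure is the 1 TB per month quoted earlier in this deck, and decimal units are used for simplicity.

```python
SCANNED_GB = 41            # from the slide
PRICE_PER_TB_USD = 5.0     # ASSUMPTION: on-demand price per TB scanned
FREE_TB_PER_MONTH = 1      # free tier quoted earlier in the deck

# Cost of one run if it were billed at the assumed on-demand price.
cost_per_run = SCANNED_GB / 1000 * PRICE_PER_TB_USD

# How many such runs fit inside the monthly free tier.
free_runs_per_month = int(FREE_TB_PER_MONTH * 1000 // SCANNED_GB)

print(f"about ${cost_per_run:.2f} per run, {free_runs_per_month} runs within the free tier")
```

Under these assumptions a full scan of the relevant columns costs on the order of twenty cents, and a couple of dozen runs per month stay inside the free tier.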


SELECT
  CASE
    WHEN passenger_count <= 0 THEN 'Unknown'
    WHEN passenger_count >= 7 THEN '>= 7'
    ELSE STRING(passenger_count)
  END AS PassengerCount,
  COUNT(passenger_count) AS TripCount,
  ROUND(AVG(total_amount),2) AS AverageTotalAmout,
  ROUND(AVG(trip_distance),2) AS AverageTripDistance,
  TIME(SEC_TO_TIMESTAMP(AVG((dropoff_datetime-pickup_datetime)/1000/1000))) AS AverageTripDuration
FROM
  [nyc-tlc:yellow.trips]
GROUP BY
  PassengerCount
ORDER BY
  PassengerCount

Demo can be run online using this SQL.

Demo Recording


Add more calculations without increasing the data scanned: no extra cost

• Also find the maximum trip price.

• The same column is used for both the average and the maximum calculation.

• No extra costs.

• Less than 6 seconds execution time.


SELECT
  CASE
    WHEN passenger_count <= 0 THEN 'Unknown'
    WHEN passenger_count >= 7 THEN '>= 7'
    ELSE STRING(passenger_count)
  END AS PassengerCount,
  COUNT(passenger_count) AS TripCount,
  ROUND(AVG(total_amount),2) AS AverageTotalAmout,
  ROUND(MAX(total_amount),2) AS MaxTotalAmout,
  ROUND(AVG(trip_distance),2) AS AverageTripDistance,
  TIME(SEC_TO_TIMESTAMP(AVG((dropoff_datetime-pickup_datetime)/1000/1000))) AS AverageTripDuration
FROM
  [nyc-tlc:yellow.trips]
GROUP BY
  PassengerCount
ORDER BY
  PassengerCount

Adding results from a new column means that more data is scanned

• Also find the average tip.

• A new column is introduced into the query.

• More data is scanned, because BigQuery is a columnar store and reads every column the query references.

• Still less than 6 seconds execution time, thanks to parallel processing.


SELECT
  CASE
    WHEN passenger_count <= 0 THEN 'Unknown'
    WHEN passenger_count >= 7 THEN '>= 7'
    ELSE STRING(passenger_count)
  END AS PassengerCount,
  COUNT(passenger_count) AS TripCount,
  ROUND(AVG(total_amount),2) AS AverageTotalAmout,
  ROUND(MAX(total_amount),2) AS MaxTotalAmout,
  ROUND(AVG(tip_amount),2) AS AverageTipAmout,
  ROUND(AVG(trip_distance),2) AS AverageTripDistance,
  TIME(SEC_TO_TIMESTAMP(AVG((dropoff_datetime-pickup_datetime)/1000/1000))) AS AverageTripDuration
FROM
  [nyc-tlc:yellow.trips]
GROUP BY
  PassengerCount
ORDER BY
  PassengerCount

Serverless – NoOps


Google Cloud Dataflow is a service for real-time data stream processing

• Cloud Dataflow is a fully-managed service for transforming and enriching data in stream (real time) and batch (historical) modes with equal reliability and expressiveness -- no more complex workarounds or compromises needed.

• And with its serverless approach to resource provisioning and management, you have access to virtually limitless capacity to solve your biggest data processing challenges, while paying only for what you use.


• Hours of nightly runs are spread around the clock as small ‘mini-batches’ that take minutes or seconds.

• Windowing for data processing is continuous, and data volumes fluctuate up and down, which requires a dynamic software architecture.

The delay in reacting to new data and making intelligent decisions shrinks from days to minutes.

A dynamic solution runs continuously and manages changes using small time windows
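The idea of small time windows can be illustrated in plain Python. This is not the Dataflow API, only a sketch of fixed (tumbling) windows; the function name and the sample events are hypothetical. Each event is assigned to the window that contains its timestamp, and values are summed per window.

```python
from collections import defaultdict


def window_sums(events, window_seconds=60):
    """Group (timestamp_seconds, value) events into fixed windows and sum them."""
    sums = defaultdict(float)
    for timestamp, value in events:
        # The start of the window that contains this timestamp.
        window_start = int(timestamp // window_seconds) * window_seconds
        sums[window_start] += value
    return dict(sorted(sums.items()))


events = [(5, 1.0), (42, 2.0), (61, 4.0), (119, 0.5), (120, 3.0)]
result = window_sums(events)
print(result)  # {0: 3.0, 60: 4.5, 120: 3.0}
```

A streaming engine does the same grouping continuously and also has to deal with late and out-of-order data, which this sketch ignores.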


Dataflow combines batch and stream operations when needed

• The same logic, and therefore the same code, is used to process both batch and stream data.

• No double maintenance for two different types of operation.
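A minimal sketch of "one transform, two modes", in plain Python rather than the Dataflow SDK. The field names speed_ms and speed_kmh are hypothetical. The same function is applied to a bounded list (batch) and to a generator standing in for an unbounded source (stream).

```python
from typing import Iterable, Iterator


def to_kmh(readings: Iterable[dict]) -> Iterator[dict]:
    """One transform, written once, applied to any iterable of readings."""
    for reading in readings:
        yield {**reading, "speed_kmh": round(reading["speed_ms"] * 3.6, 1)}


# Batch: a complete, bounded collection.
batch = [{"speed_ms": 10.0}, {"speed_ms": 25.0}]
batch_result = list(to_kmh(batch))


# Stream: simulated here by a generator that yields readings one at a time.
def live_source():
    for speed in (5.0, 30.0):
        yield {"speed_ms": speed}


stream_result = list(to_kmh(live_source()))
```

Because the transform only depends on iteration, it never needs to know whether its input is finite.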


Example Dataflow implementation for real-time metrics and log data processing

• Statistics calculation for 300+ metrics arriving at 20 messages per second:

• Sums, averages, 10%, 90%, medians

• Baseline compute, top contributor compute

• Minute, hourly, daily values

• Window statistics and cumulative statistics

• Bottom-up statistics from instance, environment, application, customer, company.

• Log, exception and error file processing.

• Real-time processing of a constant data stream costs about $300 per month.
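The per-metric statistics listed above can be sketched with the Python standard library. This is an illustration only, not the implementation behind the slide; the function name and sample values are hypothetical, and the 10% and 90% figures are read here as the 10th and 90th percentiles.

```python
import statistics


def summarize(values):
    """Return sum, average, 10th/90th percentile and median for one metric."""
    deciles = statistics.quantiles(values, n=10)  # 9 cut points: 10%, 20%, ..., 90%
    return {
        "sum": sum(values),
        "average": statistics.fmean(values),
        "p10": deciles[0],
        "median": statistics.median(values),
        "p90": deciles[-1],
    }


samples = list(range(1, 101))  # hypothetical metric values for one window
summary = summarize(samples)
```

In a windowed pipeline, a function like this would be applied to the values of each metric within each minute, hour or day.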

• Great performance. Cloud Dataflow is 2-3x faster and cheaper than Hadoop when evaluating classic MapReduce based pipelines, such as PageRank and WordCount.

• And with dynamic work rebalancing, Cloud Dataflow effectively optimizes resource utilization which provides additional performance gains without requiring manual intervention.


Customers are saying…


Hands on Workshop


Real Time Stream Processing - Internet of Things

[Architecture diagram. Stages: Ingest, Pipelines, Storage, Analytics, Application & Presentation. Device paths: standard devices over HTTPS; constrained devices (non-TCP, e.g. BLE) through a gateway; IoT Core as entry point. Services shown: App Engine, Container Engine, Compute Engine, Cloud Functions, Cloud Pub/Sub, Cloud Dataflow, Cloud Storage, Cloud Datastore, Cloud Bigtable, BigQuery, Cloud Dataproc, Cloud Datalab, Data Studio, Monitoring, Logging.]

Real Time Stream Processing - Internet of Things

[Architecture diagram. Stages: Ingest, Pipelines, Analytics. Components shown: standard devices over HTTPS, Cloud Functions, Cloud Pub/Sub, BigQuery, Data Studio, with callouts numbered 1 to 4.]


Demo Data

• Road traffic data

• https://www.liikennevirasto.fi/avoindata/tietoaineistot/lam-tiedot

• http://digitraffic.liikennevirasto.fi/

• LAM stations (automatic traffic measurement stations, induction loops) in Pori, Nakkila, Karkkila and Vihti: vehicle speeds by vehicle type on 1 April 2018.

• http://digitraffic.liikennevirasto.fi/tieliikenne/#ajantasaiset-lam-mittaustiedot

GitHub Link

• Guidance to workshop details and source code:

• https://github.com/koivusolutions/koivu-workshops

• Open this in another browser window.


Cloud Console - https://console.cloud.google.com/

BigQuery – Create New Dataset

Create Table


Import Data to Table (Create Table)


Preview Table


Query Table


Data Studio - https://datastudio.google.com


Add Data Source


Create Data Connector


Creating Data Connector


Creating Data Connector


Add Table


Add Chart


Add Filter


Extra Report Challenge


Pub/Sub Real-Time Data into Pipeline - https://console.cloud.google.com/cloudpubsub


Create Topic


Manual operations to Pub/Sub


Cloud Functions - Workshop


Cloud Functions - Create
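The actual workshop function lives in the GitHub repository linked earlier; the following is only a hedged sketch of the general shape of a Pub/Sub-triggered function. It assumes the message payload arrives base64-encoded in the event's "data" field and contains JSON. The field names station and speed_kmh are hypothetical, and the sink parameter stands in for the real BigQuery insert so that the example runs without any cloud libraries.

```python
import base64
import json


def message_to_row(event):
    """Decode a Pub/Sub-style event into a flat row (dict)."""
    payload = base64.b64decode(event["data"]).decode("utf-8")
    return json.loads(payload)


def handle_message(event, context=None, sink=print):
    # `sink` is a stand-in for the real BigQuery insert call (assumption).
    row = message_to_row(event)
    sink(row)
    return row


# Simulate one published message with hypothetical fields.
body = json.dumps({"station": "Pori", "speed_kmh": 78}).encode("utf-8")
sample_event = {"data": base64.b64encode(body).decode("ascii")}

rows = []
handle_message(sample_event, sink=rows.append)
```

Keeping the decode step separate from the write step makes it easy to test the function locally before deploying it.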


Publish Message from UI


Did Cloud Function Activate?


Do you see data?


What Next?

• Dataprep

• Dataflow

• Datalab

• Machine Learning

• Bigtable

• IoT Core

• …
