
How Google Does Big Data - DevNexus 2014


Google confidential │ Do not distribute

How Google Does Big Data
Solving Big Data Problems at Google Scale

James Chittenden
Cloud Platform Engineer

Agenda

1. Quick Google Cloud Platform Overview
2. Google & Big Data
3. Running Hadoop on GCE & Leveraging BigQuery
4. Q & A


Google has been running some of the world’s largest distributed systems with unique and stringent requirements.

Images by Connie Zhou


"This is what makes Google Google: its physical network, its thousands of fiber miles, and those many thousands of servers that, in aggregate, add up to the mother of all clouds."

- Wired, 'Google Throws Open Doors To Its Top Secret Data Center', October 2012

For the past 15 years, Google has been building out the world’s fastest, most powerful, highest-quality cloud infrastructure.

Cloud Platform is built on the same infrastructure that powers Google


Google's Global OpenFlow Network

Google Innovations in Software

[Timeline figure, 2002 to 2012: GFS, MapReduce, Bigtable, Dremel, Colossus, Spanner, Cloud Platform]

Investing In Our Cloud

$2.9B in additional data center investments worldwide

Storage: Cloud Storage, Cloud SQL, Cloud Datastore
Compute: Compute Engine, App Engine
App Services: BigQuery, Cloud Endpoints


Complex Big Data Landscape

The Flood of BIG DATA

● Online ad impressions: 5.3 trillion online ad impressions in 2012 [1]
● Email: 191 billion emails sent every day [2]
● Online ad spend: US $109.7 billion estimated for 2014 [3]
● Mobile ads: US $14.3 billion spent in U.S. on mobile ads in 2013 [4]
● Facebook: 500+ terabytes of data uploaded daily [5]

How does Google process Big Data?

Big Data processing example - application logs

[Diagram, shown in several builds: User requests → Application servers; App persistence → BigTable / GFS; App logging → Logs cluster; MapReduce; Dremel ← SQL ← Analysts & reporting]

Note: diagram is dramatically simplified
Interop NYC - Oct 2, 2012

How does this apply to my business?

Big Data Solutions with Google Cloud Platform

● Sensor Data: How can I capture more customer interactions?
● Social Data: What do customers say about my business?
● Internal Data: How can I combine all of my transactional data with other data?
● Google Data: How can Google data help me with multi-channel modeling?

Sample use cases - BigQuery usage today

● Display ads analytics (global top-5 media agency): Analyze global campaign performance for F500 clients. 1 client = 20GB/day of DoubleClick impressions logs.
● Ads network reporting (3rd party mobile ads): Deliver x-platform performance analytics dashboards. 1B events/day x 100s of ads customers.
● Fleet reservations (online travel operations): Monitor customer demand vs supply shortfalls. 10,000 routes x 1000s customers = millions of daily events.
● Revenue optimization (holiday/travel properties): Correlate marketing effectiveness vs global reservations. 10MBs / day from multiple data warehouses.
● Mobile app statistics (online reading vendor): Usage analysis on 60M installs; 10M active users. 2B API requests/day, 20GB log data/day.

Today we see the same trends happening in industry

Opportunities
● Single place to capture data
● Combine data from different sources
● Detect patterns and correlations
● Easily share data insights with org
● Distributed decision making

Challenges
● The data is large: up to 100-200 TB
● High rate of growth: up to 100-200 GB / day
● Non-relational: unstructured or semi-structured
● Multiple sources: SaaS, mobile devices, legacy

Processing Data: MapReduce

● 2004: Google releases the MapReduce* framework paper

● 2011: Big Data "analysis" using Hadoop and other Google-inspired technology

[Figure, shown in several builds: a six-stage analytics pipeline mapped onto Google Cloud Platform products]

1. Data Collection: Google App Engine (web apps, auto-scale web services, REST API)
2. ETL: Google Compute Engine (scalable VMs; collection, transformation, data processing, cleansing)
3. Raw Data Storage: Google Cloud Storage (huge storage capacity; consistent, fast speeds; multiple storage types; TBs of data; BigQuery staging; Cloud Storage Hadoop connector)
4. Aggregation: Google Compute Engine (scalable VMs)
5. Analytics Storage: Google BigQuery (ad-hoc queries, REST API)
6. Visualization: interactive dashboards + apps, BI tools, Google Spreadsheets, visualization partners


Google Compute Engine

Demo: Hadoop on Google Compute Engine

BigData at Google: Running Hadoop on Google Compute Engine

1. Set up Hadoop

    ./compute_cluster_for_hadoop.py setup google-platform-demo hadoop-demo-jameschi

2. Start Cluster

    ./compute_cluster_for_hadoop.py start google-platform-demo hadoop-demo-jameschi 500

3. Start MapReduce

    ./compute_cluster_for_hadoop.py mapreduce google-platform-demo hadoop-demo-jameschi \
        --input gs://hadoop-demo-jameschi-input \
        --output gs://hadoop-demo-jameschi-output \
        --mapper sample/shortest-to-longest-mapper.pl \
        --reducer sample/shortest-to-longest-reducer.pl \
        --mapper-count 5 \
        --reducer-count 1

4. Shutdown Cluster

    ./compute_cluster_for_hadoop.py shutdown google-platform-demo

Google Compute Engine Costs

Total Costs for Running Hadoop:

5 node Hadoop cluster x 20 minutes = $0.21
5,000 node Hadoop cluster x 20 min = $8.67

How Google Approaches Analytics

● MapReduce-based analysis can be slow for ad-hoc queries

● Managing data centers and tuning software takes time & money

● Analytics tools should be services

Dremel: ad-hoc query system for terabyte datasets

● Query execution tree
● Column-oriented records
● ...and a lot of nodes


Google BigQuery: An Overview

What is BigQuery?

BigQuery: a fully managed data analytics service in the cloud. Unlimited storage. Interactive analysis on multi-terabyte datasets.

[Diagram: Store all your data in the cloud (scalable storage; sources: AdWords, DCLK, Analytics, corporate data, 3rd party data) → Analyze interactively / mash it up (SQL, API) → Securely share/distribute the results (Google Spreadsheets, App Engine app, co-workers, other BI tools)]

Cloud-based data analytics pipeline

[Diagram: Store (Logstore: log data; Cloud Storage: unstructured data; Datastore: structured data) → Extract & Transform (backends + MapReduce, Hadoop; application level code, custom logic & 3rd party libraries) → Analyze interactively (BigQuery, via API and SQL) → Serve (interactive dashboards + apps, BI tools, Google Spreadsheets)]

Origins of BigQuery

How can a business analyst find the Top 20 Apps in a matter of seconds?

    SELECT top(appId, 20) AS app, count(*) AS count
    FROM installlog.2012
    ORDER BY count DESC;

Result in ~20 seconds!
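To make the semantics of the Top 20 query concrete, here is a minimal Python sketch of the same "most frequent value with its count" computation on a tiny in-memory sample. The sample records and app IDs are made up for illustration, and `top_apps` is a hypothetical helper, not part of BigQuery; the real query runs over the full install log inside the service.

```python
from collections import Counter

def top_apps(install_log, n=20):
    """Return the n most frequently installed app IDs with their counts."""
    counts = Counter(record["appId"] for record in install_log)
    # most_common sorts by count, highest first, like ORDER BY count DESC.
    return counts.most_common(n)

# Hypothetical sample rows standing in for the install log table.
sample_log = [
    {"appId": "maps"}, {"appId": "mail"}, {"appId": "maps"},
    {"appId": "chat"}, {"appId": "maps"}, {"appId": "mail"},
]

for app, count in top_apps(sample_log, n=2):
    print(app, count)
```

The sketch holds every count in one process; the point of the slide is that BigQuery answers the same question over a table far too large for that.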


How do you find slow-running servers from billions of log entries in seconds?

    SELECT count(*) AS count, source_machine AS machine
    FROM product.product_log.live
    WHERE elapsed_time > 4000
    GROUP BY source_machine
    ORDER BY count DESC

Result in ~20 seconds!
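The shape of the slow-server query can be tried locally before running it at scale. The sketch below is an assumption-laden stand-in: it uses Python's built-in `sqlite3` module rather than BigQuery, a hypothetical single table named `product_log`, and a handful of invented rows. Only the filter, grouping, and ordering mirror the query on the slide.

```python
import sqlite3

# In-memory database with a hypothetical, tiny stand-in for the log table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE product_log (source_machine TEXT, elapsed_time INTEGER)"
)
conn.executemany(
    "INSERT INTO product_log VALUES (?, ?)",
    [
        ("web-1", 5200), ("web-1", 300), ("web-2", 4100), ("web-1", 9000),
        ("web-3", 120), ("web-2", 4500), ("web-1", 4001),
    ],
)

# Same structure as the slide's query: filter slow requests, count per machine.
SLOW_QUERY = """
SELECT count(*) AS count, source_machine AS machine
FROM product_log
WHERE elapsed_time > 4000
GROUP BY source_machine
ORDER BY count DESC
"""

slow_machines = conn.execute(SLOW_QUERY).fetchall()
for count, machine in slow_machines:
    print(machine, count)
```

Machines with no request above the threshold (here `web-3`) do not appear in the result at all, which is what makes the output a short list of suspects.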


BigQuery -- Sweetspots

Use cases/applications where:

● Interactive analysis on large data sets is a requirement
● Data mashups at scale and with ease are important
● Elasticity of environment is important
● Minimal administration is required
● Analysts or data scientists need to be empowered with self-service capabilities

Demo: Ad-hoc Analysis with BigQuery

Why is BigQuery so Special?

ZERO Administration for Performance and Scale

No complex data architecture required
● Simple denormalized data structure
● Data can be loaded in CSV or JSON format

Supports Open Standards
● SQL-like language
● Standard BI/ETL tools supported
● REST API support for integrating analytics programmatically

Nested Field Support
● Analyze details for headers (items of the order, clicks of the session) in a single SQL statement
● Ingest Web data (typically in JSON format) directly into BigQuery

Query Result Caching at Scale
● Caching of query results at scale (24 hours, best effort)
● Option to re-generate results / update cache

Real-time Data Ingest and Reporting
● Data ingest at 100 rows/second (with peak 1000 rows/second) into BigQuery
● Support for real-time to near real-time query/reporting
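As an illustration of the nested field idea (a header record that carries its own detail rows), the following Python sketch builds two order records with a repeated `items` field, serializes them as newline-delimited JSON (one record per line, a common layout for JSON loads), and totals a value across the nested items. All field names and values are hypothetical, and the aggregation is done locally in Python only to show what "items of the order" analysis means.

```python
import json

# Hypothetical order records: each header row carries a nested, repeated "items" field.
orders = [
    {"order_id": 1, "customer": "alice",
     "items": [{"sku": "A-1", "qty": 2}, {"sku": "B-7", "qty": 1}]},
    {"order_id": 2, "customer": "bob",
     "items": [{"sku": "A-1", "qty": 5}]},
]

def to_ndjson(records):
    """Serialize records as newline-delimited JSON, one record per line."""
    return "\n".join(json.dumps(record, sort_keys=True) for record in records)

def quantity_per_sku(records):
    """Walk the nested items of every order and total the quantity per SKU."""
    totals = {}
    for order in records:
        for item in order["items"]:
            totals[item["sku"]] = totals.get(item["sku"], 0) + item["qty"]
    return totals

print(to_ndjson(orders))
print(quantity_per_sku(orders))
```

Keeping the items inside the order record avoids a separate detail table and a join, which is the convenience the slide is pointing at.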


And many more...

● SQL-like query language and REST-based APIs
● High performance queries on strings (use of REGEXP and pattern matching)
● Data window functions and many aggregation capabilities
● Table wildcards
● Table decorators for time-based queries

Integration with Google Stack
● Google Analytics Premium (hit level data) - BigQuery
● Google App Engine Datastore to BigQuery - Trusted Tester

What BigQuery is Not

Relational Database Management System (RDBMS)
● Data is stored in BigQuery in columnar fashion
● BigQuery tables are immutable (no updates or deletes)*

Not a replacement for existing data warehouses en masse
● Specifically designed for large data sets
● Queries requiring massive processing

On-premise solution or appliance
● It is available only in Google Cloud in a fully hosted fashion
● Data needs to be in Google Cloud (BigQuery) for analysis

* alternate ways to support these capabilities

Columnar Storage and Tree Architecture of Dremel

Why can Dremel be as drastically fast as the examples show?

The answer lies in two core technologies that give Dremel this unprecedented performance:

● Columnar Storage. Data is stored in a columnar fashion, which makes it possible to achieve very high compression ratios and scan throughput.

● Tree Architecture. A tree of servers is used for dispatching queries and aggregating results across thousands of machines in a few seconds.
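The two ideas can be sketched together in a few lines of Python. This is a toy model under stated assumptions, not Dremel's implementation: the rows, field names, shard size, and the two-level tree are all invented for illustration. Each "leaf" holds a shard in column-oriented form and reads only the two columns the query needs; partial counts are then merged level by level up to a "root".

```python
from collections import Counter

# Hypothetical log rows; field names are illustrative only.
rows = [
    {"machine": "web-1", "elapsed_ms": 5200, "url": "/a"},
    {"machine": "web-2", "elapsed_ms": 300, "url": "/b"},
    {"machine": "web-1", "elapsed_ms": 9000, "url": "/a"},
    {"machine": "web-3", "elapsed_ms": 4100, "url": "/c"},
    {"machine": "web-2", "elapsed_ms": 4500, "url": "/b"},
    {"machine": "web-1", "elapsed_ms": 120, "url": "/a"},
    {"machine": "web-3", "elapsed_ms": 50, "url": "/c"},
    {"machine": "web-2", "elapsed_ms": 7000, "url": "/b"},
]

def to_columns(records):
    """Pivot row-oriented records into one list per column."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns

def leaf_scan(shard, threshold):
    """Leaf node: read only the two needed columns and count slow requests."""
    partial = Counter()
    for machine, elapsed in zip(shard["machine"], shard["elapsed_ms"]):
        if elapsed > threshold:
            partial[machine] += 1
    return partial

def merge(partials):
    """Intermediate or root node: combine partial counts from child nodes."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

# One column-oriented shard of two rows per leaf.
shards = [to_columns(rows[i:i + 2]) for i in range(0, len(rows), 2)]
leaf_results = [leaf_scan(shard, threshold=4000) for shard in shards]
# Two-level tree: pairs of leaves feed an intermediate node, then the root.
mid_results = [merge(leaf_results[i:i + 2]) for i in range(0, len(leaf_results), 2)]
root = merge(mid_results)
print(dict(root))
```

The `url` column is never touched by the scan, which is the columnar benefit in miniature, and no node ever sees more than its children's partial results, which is what lets the tree fan out across many machines.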

Dremel: Key to Run Business at “Google Speed”

Google has been using Dremel in production since 2006 and has been continuously evolving it for the last 6 years. Examples of applications include:

● Analysis of crawled web documents
● Tracking install data for applications in the Android Market
● Crash reporting for Google products
● OCR results from Google Books
● Spam analysis
● Debugging of map tiles on Google Maps
● Tablet migrations in managed Bigtable instances
● Results of tests run on Google’s distributed build system
● Disk I/O statistics for hundreds of thousands of disks
● Resource monitoring for jobs run in Google’s data centers
● Symbols and dependencies in Google’s codebase

BigQuery Sample Use Cases

Sensor Data Analytics

Data Sensing Lab

[Pipeline figure: Generate (sensors) → Ingest/Process (Endpoints/App Engine) → Store (Datastore) → Compute/Analyze (Compute Engine/BigQuery) → Visualize]

Mobile and Social Gaming User Analysis

● Notice trend change
● Slice user data, identify segments
● Compare segments vs general population

● BigQuery is designed as an interactive data analysis tool for large datasets

● MapReduce is designed as a programming framework to batch process large datasets
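To show what "a programming framework to batch process large datasets" asks of the programmer, here is a minimal single-process Python sketch of the map, shuffle, and reduce phases using a word count. It is an illustration only: the function names are hypothetical, and a real MapReduce or Hadoop job distributes these phases across many machines rather than running them in one loop.

```python
from collections import defaultdict

def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1

def shuffle(pairs):
    """Shuffle: group emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reducer: collapse all values for one key into a single result."""
    return key, sum(values)

def run_job(lines):
    """Run the three phases in order over the whole input batch."""
    pairs = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())

word_counts = run_job(["big data at Google", "big queries on big data"])
print(word_counts)
```

Every new question means writing and running another job like this one over the whole input, whereas the interactive model described for BigQuery expresses the question as a single SQL statement.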

BigQuery versus MapReduce


BigQuery: Key Differentiators

Agile: Big Data Analytics Hosted Environment
● Fully managed cloud service
● Interactive querying on terabytes of data
● Always on

Cost Effective: Low TCO / TCD
● No CapEx, as there is no hardware/software to own
● No administration or management overhead
● Pay only for what you use

Non-disruptive, fits right into IT landscape
● Expanding ISV ecosystem (BI/ETL)
● Use of standards (SQL-like, REST API, etc.)
● Integrated with other Google services (GCE, GAE, etc.)

Secure and Reliable
● Uptime SLA (99.9% uptime)
● OAuth 2.0 support, user/group ACL
● Uses Google's core network/data center backbone

Innovative
● Innovation momentum: 2 releases since launch
● New capabilities/concepts, e.g. queryable nested fields
● e.g. use of RegExp without performance loss

Recent Manufacturer POC

[Architecture diagram components: CSV, Local Processing, Cloud Storage, AppEngine, GCE, Fusion Tables, BigQuery, BIME Dashboards]

● Achieved an ingestion rate of 200,000 rows per second; it took about 10 minutes from when the file was first dropped into GCS to when the data could be queried in BigQuery.

● Demonstrated ingesting 120 million rows in 10 minutes into BigQuery.
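The two figures quoted for the POC are consistent with each other, as a quick arithmetic check shows. The constant names below are just labels for the numbers quoted on the slide.

```python
ROWS_PER_SECOND = 200_000  # ingestion rate quoted for the POC
WINDOW_MINUTES = 10        # load window quoted for the POC

# Rows ingested over the whole window at the sustained rate.
rows_ingested = ROWS_PER_SECOND * WINDOW_MINUTES * 60
print(f"{rows_ingested:,} rows")
```

At the sustained rate, the 10-minute window works out to the 120 million rows reported.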

● Leveraged Google Fusion Tables to easily collaborate on datasets with business and partners.

● Also discovered that the manufacturer had machines that consumed 250+ billion gallons of fuel as of July 30th.

● Spun up a BIME dashboard to interactively connect to the BigQuery dataset.

● With dashboards connected to BQ (BIME, Tableau, QlikView, etc.), we can drill into billions of rows of near real-time data with ease.

BigQuery: Ever Expanding Ecosystem of Partners


Thank You


cloud.google.com
