Upload
james-chittenden
View
108
Download
4
Embed Size (px)
DESCRIPTION
Citation preview
Google confidential │ Do not distribute Google confidential │ Do not distribute
How Google Does Big DataSolving Big Data Problems at Google Scale
James ChittendenCloud Platform Engineer
Google confidential │ Do not distribute
Quick Google Cloud Platform Overview
Google & Big Data
Running Hadoop on GCE & Leveraging BigQuery
Q & A
1
2
3
4
Agenda
Google confidential │ Do not distribute
Quick Google Cloud Platform Overview
Google & Big Data
Running Hadoop on GCE & Leveraging BigQuery
Q & A
1
2
3
4
Agenda
Google confidential │ Do not distribute
Google has been running some of the world’s largest distributed systems with unique and stringent requirements.
Images by Connie Zhou
Google confidential | Do not distribute
"This is what makes Google Google: its physical network, its thousands of fiber miles, and those many thousands of servers that, in aggregate, add up to the mother of all clouds."
- Wired 'Google Throws Open Doors To Its Top Secret Data Center', Wired, October 2012
Google confidential │ Do not distribute
For the past 15 years, Google has been building out the world’s fastest, most powerful, highest quality cloud infrastructure on the planet.
Images by Connie Zhou
Google confidential | Do not distribute
Cloud Platform is built on the same infrastructure that powers Google
Images by Connie Zhou
Google confidential | Do not distribute
2002 2004 2006 2008 2010 2012
ColossusDremelMapReduce
SpannerGFS Big Table
Cloud Platform
Google Innovations in Software
Google confidential | Do not distributeCloud Platform
Investing In Our Cloud $2.9B in additional data center investments worldwide
Google confidential | Do not distributeCloud Platform
Storage App ServicesCompute
Cloud Storage
Cloud SQL
Cloud Datastore
Compute Engine
App Engine
BigQuery
Cloud Endpoints
Google confidential │ Do not distribute
Quick Google Cloud Platform Overview
Google & Big Data
Running Hadoop on GCE & Leveraging BigQuery
Q & A
1
2
3
4
Agenda
ONLINE AD IMPRESSIONS
5.3TRILLION ONLINE AD
IMPRESSIONS IN 2012 1
191BILLION EMAILS
SENT EVERY DAY 2
ONLINEAD SPEND
US $109.7BILLION ESTIMATED
FOR 2014 3MOBILE
AD
US $14.3BILLION SPENT IN
U.S. ON MOBILE ADS IN 2013 4
500+TERABYTES OF DATAUPLOADED DAILY 5
The Flood of BIG DATA
Application servers
User requests
Note: diagram is dramatically simplifiedInterop NYC - Oct 2, 2012
Big Data processing example - application logs
Application servers
BigTable / GFS
Logs clusterUser requests
App logging
App persistence
Note: diagram is dramatically simplifiedInterop NYC - Oct 2, 2012
Big Data processing example - application logs
Application servers
BigTable / GFS
Logs clusterUser requests
App logging
App persistence
Dremel
MapReduce
MapReduce
Note: diagram is dramatically simplifiedInterop NYC - Oct 2, 2012
Big Data processing example - application logs
Application servers
BigTable / GFS
Logs clusterUser requests
App logging
App persistence
DremelAnalysts
& reporting
SQL
MapReduce
MapReduce
Note: diagram is dramatically simplifiedInterop NYC - Oct 2, 2012
Big Data processing example - application logs
Application servers
BigTable / GFS
Logs clusterUser requests
App logging
App persistence
DremelAnalysts
& reporting
SQL
MapReduce
Note: diagram is dramatically simplifiedInterop NYC - Oct 2, 2012
Big Data processing example - application logs
MapReduce
Big DataSolutionswith Google Cloud Platform
Sensor DataHOW CAN I CAPTURE
MORE CUSTOMER INTERACTIONS?
Social DataWHAT DO CUSTOMERS
SAY ABOUT MY BUSINESS?
Internal DataHOW CAN I COMBINE ALLOF MY TRANSACTIONAL
DATA WITH OTHER DATA?
Google DataHOW CAN GOOGLE
DATA HELP ME WITH MULTI-CHANNEL
MODELING? 01001010110010001001000100100110001110011v10011000100110001 01001010110010001001000100010001 01100011100010 01100011100001110 10011001000011001100101 01
Display ads analytics(global top-5 media agency)
Ads network reporting(3rd party mobile ads)
Fleet reservations(online travel operations)
Revenue optimization(holiday/travel properties)
Analyze global campaign performance for F500 clients1 client = 20GB/day of DoubleClick impressions logs
Deliver x-platform performance analytics dashboards 1B events/day x 100s of ads customers
Monitor customer demand vs supply shortfalls10,000 routes x 1000s customers = millions of daily events
Correlate marketing effectiveness vs global reservations10MBs / day from multiple data warehouses
Mobile app statistics(online reading vendor)
Usage analysis on 60M installs; 10M active users2B API requests/day, 20GB log data/day
Interop NYC - Oct 2, 2012
Sample use cases - BigQuery usage today
Opportunities Challenges
Single place to capture data
Combine data from different sources
Detect patterns and correlations
Easily share data insights with org
Distributed decision making
The data is large
High rate of growth
Non-relational
Multiple sources
Up to 100-200 TB
Up to 100-200 GB / day
Unstructured or semi-structured
SaaS, Mobile devices, legacy
Interop NYC - Oct 2, 2012
Today we see the same trends happening in industry
● 2004: Google releases the MapReduce* framework paper
● 2011: Big Data "analysis" using Hadoop and other Google inspired technology
Processing Data: MapReduce
Google Big Query
Google Compute Engine Scalable VMs
TBs of Data
Ad-hoc queries
Data Collection
ETL
Raw Data Storage
Aggregation
Analytics Storage
Visualization
Google Cloud Storage
1
2
3
4
5
6
Interactive Dashboards + apps
BI tools
Google Spreadsheets
4
Visualization
3
Raw Data Storage
5
6
2
Google App Engine Web Apps1
Auto-scaleweb services
Data processingCleansing
Cloud StorageHadoop Connector
REST API
Raw Data StorageBig Query Staging
Collection Transformation
Data Ingestion
Data Collection
ETL
Raw Data Storage
Aggregation
Analytics Storage
Visualization
1
2
3
4
5
6
Interactive Dashboards + apps
BI tools
Google Spreadsheets
Visualization6Google Compute Engine
Collection Transformation
Data processingCleansing
2
1
4
Scalable VMs
Data Collection
ETL
Raw Data Storage
Aggregation
Analytics Storage
Visualization
1
2
3
4
5
6
Interactive Dashboards + apps
BI tools
Google Spreadsheets
Visualization6
Raw Data StorageBig Query Staging
3
1
Google Cloud Storage
Huge Storage Capacity
Consistent, fast speeds
Multiple storage types
TBs of Data
Data Collection
ETL
Raw Data Storage
Aggregation
Analytics Storage
Visualization
1
2
3
4
5
6
Interactive Dashboards + apps
BI tools
Google Spreadsheets
Visualization6
1
Google BigQuery Ad-hoc queries
5
Visualization Partners
AdHoc QueriesREST API
Google confidential │ Do not distribute
Quick Google Cloud Platform Overview
Google & Big Data
Running Hadoop on GCE & Leveraging BigQuery
Q & A
1
2
3
4
Agenda
Google confidential | Do not distribute
BigData at Google Running Hadoop on Google Compute Engine
./compute_cluster_for_hadoop.py setup google-platform-demo hadoop-demo-jameschi
1. Set up Hadoop
Google confidential | Do not distribute
./compute_cluster_for_hadoop.py start google-platform-demo hadoop-demo-jameschi 500
2. Start Cluster
BigData at Google Running Hadoop on Google Compute Engine
Google confidential | Do not distribute
./compute_cluster_for_hadoop.py mapreduce google-platform-demo hadoop-demo-jameschi \--input gs://hadoop-demo-jameschi-input \--output gs://hadoop-demo-jameschi-output \--mapper sample/shortest-to-longest-mapper.pl \--reducer sample/shortest-to-longest-reducer.pl \--mapper-count 5 \--reducer-count 1
3. Start MapReduce
BigData at Google Running Hadoop on Google Compute Engine
Google confidential | Do not distribute
./compute_cluster_for_hadoop.py shutdown google-platform-demo
4. Shutdown Cluster
BigData at Google Running Hadoop on Google Compute Engine
Google confidential | Do not distribute
Total Costs for Running Hadoop:
5 node Hadoop cluster x 20 minutes = $0.215,000 node Hadoop cluster x 20 min = $8.67
Google Compute Engine Costs
● MapReduce based analysis can be slow for ad-hoc queries
● Managing data centers and tuning software takes time & money
● Analytics tools should be services
How Google Approaches Analytics
DremelAd-hoc query system for terabyte datasets
● Query execution tree● Column Oriented records● ...and a lot of nodes
How Google Approaches Analytics
Application servers
BigTable / GFS
Logs clusterUser requests
App logging
App persistence
DremelAnalysts
& reporting
SQL
MapReduce
Note: diagram is dramatically simplifiedInterop NYC - Oct 2, 2012
Big Data processing example - application logs
MapReduce
Google confidential | Do not distribute
BigQuery: a fully-managed data analytics service in the cloud.Unlimited storage. Interactive analysis on multi-terabyte datasets.
Scalable Storage
Google Spreadsheets
App EngineApp
Co-Workers
AdwordsDCLKAnalytics
Corporate data3rd party data
API
Analyze interactivelyMash it up
Securely Share/
distribute the resultsStore all your data
in the cloud
SQL
Other BI Tools
What is BigQuery?Google BigQuery: An Overview
Store
Backends + MapReduce
Extract &Transform
Hadoop
BigQuery
APISQL
Analyze interactively Serve
Logstore
Cloud Storage
Datastore
Log data
Unstructured data
Structured data Interactive Dashboards + apps
Application level code
Custom logic & 3rd party libraries
BI tools
Google Spreadsheets
Interop NYC - Oct 2, 2012
Cloud-based data analytics pipeline
Google confidential | Do not distribute
How can a business analyst find Top 20 Apps in matter of
seconds?
Origins of BigQueryGoogle BigQuery: An Overview
Google confidential | Do not distribute
How can a business analyst find Top 20 Apps in matter of seconds?
SELECT top(appId, 20) AS app, count(*) AS countFROM installlog.2012;ORDER BY count DESC
Result in ~20 seconds!
Origins of BigQueryGoogle BigQuery: An Overview
Google confidential | Do not distribute
How do you find slow running servers from billions of log entries -
in seconds?
Origins of BigQueryGoogle BigQuery: An Overview
Google confidential | Do not distribute
How do you find slow running servers from billions of log entries - in seconds?
SELECT count(*) AS count, source_machine AS machineFROM product.product_log.liveWHERE elapsed_time > 4000GROUP BY source_machineORDER BY count DESC
Result in ~20 seconds!
Origins of BigQueryGoogle BigQuery: An Overview
Google confidential | Do not distribute
Google BigQuery
Origins of BigQueryGoogle BigQuery: An Overview
Google confidential | Do not distribute
Use cases/applications where
Interactive analysis on large data sets is a requirement
Data mashups at scale and ease is important
Elasticity of environment is important
Minimal administration is required
Empowering Analysts or Data scientist with self service capabilities
BigQuery -- SweetspotsGoogle BigQuery: An Overview
Google confidential | Do not distribute
ZERO Administration for Performance and Scale0No complex data architecture required.
● Simple denormalized data structure● Data can be loaded in csv or JSON format
Supports Open Standards● SQL like language● Standard BI/ETL tools supported● REST API Support for integrating analytics programmatically
Why BigQuery is Special?Google BigQuery: An Overview
Google confidential | Do not distribute
Nested Field Support● Analyze details for headers (items of the order, clicks of the session in
single SQL statement)● Ingest Web data (Typically in JSON format) directly into BigQuery
Query Result Caching at Scale● Caching of query results at Scale (24 hours best effort)● Option to re-generate results / update cache
Real-time Data Ingest and Reporting● Data ingest at 100 rows/second (with peak 1000 rows/second) into
BigQuery● Support for real-time to near real-time query/reporting
Why BigQuery is Special?Google BigQuery: An Overview
Google confidential | Do not distribute
And many more...
SQL like Query Language and REST based APISHigh performance queries on strings (use of REGEXP and pattern matching)Data window functions and many aggregation capabilitiesTable WildcardsTable decorators for time based queries..
Integration with Google Stack
Google Analytics Premium (hit level data) - BigQuery Google App Engine Datastore to BigQuery - Trusted Tester
Why BigQuery is Special?Google BigQuery: An Overview
Google confidential | Do not distribute
Relational Database Management System (RDBMS)● Data is stored in BigQuery in Columnar fashion● BigQuery tables are immutable (no updates or deletes)*
Not a replacement for existing Data warehouse en-mass● Specifically designed for large data sets● Queries requiring massive processing
On-premise solution or Appliance● It is available only in Google Cloud in a fully hosted fashion● Data needs to be in Google Cloud (BigQuery) for analysis
* alternate ways to support these capabilities* alternate ways to support these capabilities
* alternate ways to support these capabilities
Why BigQuery is Not?Google BigQuery: An Overview
Google confidential | Do not distribute
Why Dremel can be so drastically fast as the examples show?
The answer can be found in two core technologies which gives Dremel this unprecedented performance:
● Columnar Storage. Data is stored in a columnar storage fashion which makes possible to achieve very high compression ratio and scan throughput.
● Tree Architecture is used for dispatching queries and aggregating results across thousands of machines in a few seconds.
Columnar Storage and Tree Architecture of Dremel
Google confidential | Do not distribute
Google has been using Dremel in production since 2006 and has been continuously evolving it for the last 6 years. Examples of applications include:
● Analysis of crawled web documents● Tracking install data for applications in the Android Market● Crash reporting for Google products● OCR results from Google Books● Spam analysis● Debugging of map tiles on Google Maps● Tablet migrations in managed Bigtable instances● Results of tests run on Google’s distributed build system● Disk I/O statistics for hundreds of thousands of disks● Resource monitoring for jobs run in Google’s data centers● Symbols and dependencies in Google’s codebase
Dremel: Key to Run Business at “Google Speed”
Google confidential | Do not distribute
Generate Ingest/Process Store Compute Visualize
Sensors Endpoints/App Engine DatastoreCompute Engine/
BigQuery
Analyze
Data Sensing Lab
Sensor Data Analytics
Google confidential | Do not distribute
Notice trend change
Slice user data, identify segments
Compare segments vs general population
Mobile and Social Gaming User Analysis
Google confidential | Do not distribute
● BigQuery is designed as an interactive data analysis tool for large datasets
● MapReduce is designed as a programming framework to batch process large datasets
BigQuery versus MapReduce
Google confidential | Do not distribute
Agile : BigData Analytic Hosted Environment● Fully managed cloud service ● Interactive querying on Terabytes of data● Always on
Cost Effective: Low TCO / TCD● No CapEx as no hardware/software to own● No administration or management overhead● Pay for only what you use.
Non disruptive, fits right into IT landscape● Expanding ISV ecosystem (BI/ETL)● Use of standards (SQL like, REST API etc...)● Integrated with other Google Services (GCE, GAE, GCE etc...)
BigQuery: Key Differentiators
Google confidential | Do not distribute
Secure and Reliable● Uptime SLA (99.9% uptime)● OAUTH 2.0 Support, User/Group ACL ● Uses Google's core network/data center backbone
Innovative● Innovation momentum: 2 release since launch● New capabilities/concepts: e.g. Queryable Nested fields● e.g. Use of RegExp without performance loss
BigQuery versus MapReduce
Recent Manufacturer POC
Local Processing
Cloud StorageAppEngine GCE
Fusion Tables
Big Query
Cloud Storage
BIME Dashboards
CSV
● Achieved a 200,000 rows per second ingestion rate, and it took about 10 minutes from when the file is first dropped into GCS to when you can query the data in BigQuery.
● Demonstrated ingesting 120 million rows in 10 minutes into BigQuery.
Recent Manufacturer POC
● Leveraged Google Fusion Tables to easily collaborate on datasets with business and partners.
● Also discovered manufacturer had machines that consumed 250+ billion gallons of fuel as of July 30th.
Recent Manufacturer POC
● Spun up BIME Dashboard to interactively connect to BigQuery dataset.
● With dashboards connected to BQ (BIME, Tableau, QlikView, etc), we can drill into billions of rows near real-time data with ease.
Recent Manufacturer POC
Google confidential │ Do not distribute
Quick Google Cloud Platform Overview
Google & Big Data
Running Hadoop on GCE & Leveraging BigQuery
Q & A
1
2
3
4
Agenda