Dashboard Engine for Hadoop
June 2015
Matt McDevitt, Sr. Project Manager
Pavan Challa, Sr. Data Engineer
Think Big Start Smart Scale Fast
CONFIDENTIAL | 2
Agenda
• Think Big Overview
• Engagement Model
• Solution Offerings
• Dashboard Engine
• Demo
• Q&A
© 2015 Think Big, a Teradata Company
• Founded in 2010, acquired in 2014, International in 2015
• First and leading professional services firm exclusively focused on big data
• End to End Services: Strategy, Design, Implementation, IP/Software, Support and Managed Services
• Academy to scale delivery capability
• Extend and integrate open source with UDA
• Team-based delivery with Solution Center
• Growing quickly: we’re hiring!
Think Big Overview
Think Big Engagement Model
Big Data Program Mgt
Business Analytics
Managed Services
Data Engineering
Think Big Analytics VELOCITY Methodology
• Solutions
• Planning and Design
• Prioritization
• Capability Backlog
• Grooming for engineering
• Engineering
• Sprint(s)
• Releases
• Quality Assurance & Test
• Managed Support
• Break Fix
• Sustaining Engineering
• New Models
• New Analytics
• New Insights
• New Data Requirements
• New Data
• Big Data Approach
• Use Cases
• Roadmap
• Data Science
• Discovery
• R&D
Big Data Lab
1. Big Data Strategy Roadmap
2. Data Lake Starter Program
3. Data Lake Optimization
4. Data Lake Managed Services
5. Presto for the Enterprise – new as of June 10, 2015
6. Big Data Managed Services
7. Think Big Academy
Think Big Solution Offerings
Custom Analytics Solution Services
• Device Data Manufacturing Operations
• Omni-Channel Marketing Analytics
• Financial Services Fraud/Risk Analytics
• Healthcare personalization
• Device Data Behavior Analytics
• IT Threat Detection
• Public Sector Risk Analysis
• Gaming Analytics
MAKING BIG DATA COME ALIVE
Data Lake Implementation
Data Lake: Starter Program
− Stand up a Data Lake and build 3 governed batch data ingest streams
− Includes Services and Subscription Software Frameworks
Data Lake: Optimization
− Add governance to your Data Lake
− For Data Lakes not originally built by Think Big
Data Lake: Dashboard Engine Reporting
− Install and configure the Dashboard Engine on the Data Lake to build dashboard analytics with deep dimensional rollup reporting, using Tableau on Hadoop
Data Lake: Security
− Data Security & InfoSec, Cluster Hardening, Perimeter, Connectivity
Data Lake: Managed Services
− Only for Data Lakes that Think Big Designs and Builds
− On Premise, Public Cloud (AWS) and Private Cloud (Teradata and Altiscale)
Data Lake Program Offers
Think Big Data Lake Starter Program (8 Week Engagement)
Objective: Design, Develop and Deploy Data Lake Ingestion with Governance
Phases (2 weeks each):
• Design
• Build & Test
• Integrate & Tune
• Assess, Mentor & Plan
Activities:
• Collaborative workshops with business groups
• Identification and prioritization of high-value data streams
• Gap analysis
• Develop Ingest workflows
• Install Metadata and Info Security Services
• Prepare Cluster for Integration test
• Install Ingest & System Test
• Begin Profiling Data
• Learn about Information Security and data wrangling
• Begin Building DL Reporting
• Final tuning, assessment and next steps
Timeline labels:
• Develop & Unit Testing
• Data Stream Prioritization
• Info Security Objectives
• Data Profiling and Capability
• Follow-up Roadmap
• Executive Presentation
• Software Component Installation
• Data Sources
• Organization & Training
• Cluster configuration & Integration
• System Integration Testing
Enterprise Data Lake
Information Sources
Evaluate Source Data
Ingest
Collect & Manage
Metadata
Apply Structure
Sequence
Compress
Automate
Protect
Prepare Data for Ingest
Prepare Source Metadata
Perimeter-Authentication-Authorization
InfoSec
Downstream Applications
Dashboard Engine
Think Big Enterprise Data Lake
Data Lab
Data Repository
Security, Archival
RainStor – System of Record, Archive
Governed Ingestion
CDC
Buffer Server
Spark
Msg Queue
Kafka
Experimental Data
Raw Data
Processing
Derived Views
Loom – integrated Metadata, Lineage, Wrangling
Metadata Repository
Dashboard Engine
API
Realtime Processing
API
Discovery Zone
Statistics
Machine Learning
Graph
Analytics
Why a Dashboard Engine?
Events Hadoop
• Near real-time analytics
• Easily scales to 100s of simultaneous users
• Query latency typically under 100 ms
• Deep dimensional drill-down
• Works with popular BI tools
− JavaScript, jQuery
− Tableau
− others to be announced soon
Think Big Dashboard Engine Strengths
Using Tableau without Dashboard Engine
Hadoop
Middle Tier Server
Extract
• Queryable data is limited by the size of the server.
• Doesn’t scale as the number of users grows.
• For the time the query is running, most or all of the cluster is dedicated to that one query.
− Has limitations if the cluster has other loads
− Has limitations for simultaneous dashboard users
• Low latencies possible only if all the event data is in RAM at query time.
Using Impala without Think Big Dashboard Engine
Dashboard Engine Architecture
• Uses the power of Apache Spark to pre-aggregate data
• Scales as event volume grows.
• Scales as number of users grows.
Think Big’s Dashboard Engine for Hadoop
API
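The pre-aggregation idea can be illustrated in a few lines. This is a minimal sketch in plain Python rather than Spark, and it is not the engine's actual code: the event fields and the `pre_aggregate` helper are hypothetical names chosen for illustration. It sums a metric per day for every combination of dimensions, so that a dashboard query becomes a lookup instead of a scan.

```python
from collections import Counter
from itertools import combinations

# Hypothetical flight events; the field names are illustrative only.
events = [
    {"date": "2014-01-04", "State": "NY", "Airport": "JFK", "Arrivals": 1},
    {"date": "2014-01-04", "State": "NY", "Airport": "LGA", "Arrivals": 1},
    {"date": "2014-01-04", "State": "CA", "Airport": "SFO", "Arrivals": 1},
    {"date": "2014-01-04", "State": "NY", "Airport": "JFK", "Arrivals": 1},
]


def pre_aggregate(events, dimensions, metric):
    """Sum `metric` per day for every subset of `dimensions`."""
    cube = Counter()
    for event in events:
        for size in range(len(dimensions) + 1):
            for subset in combinations(dimensions, size):
                # A cell is identified by the day plus the chosen dimension values.
                cell = (event["date"],) + tuple((d, event[d]) for d in subset)
                cube[cell] += event[metric]
    return cube


cube = pre_aggregate(events, ["State", "Airport"], "Arrivals")

# A dashboard query is now a single lookup rather than a scan over events.
ny_arrivals = cube[("2014-01-04", ("State", "NY"))]
```

The trade-off is storage: every event contributes to one cell per dimension subset, which is why the cube can be much larger than the raw data.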
Store cube data
[Figure: cube cells stored as key/value pairs]
Keys: Arrivals-2014-01-02, Arrivals-2014-01-03, Arrivals-2014-01-04
Values shown: 14147, 14158, 14269
Keys: Arrivals-s:CA-2014-01-02, Arrivals-s:CA-2014-01-03, Arrivals-s:CA-2014-01-04
Values shown: 2053, 1911, 1965
Keys: Arrivals-a:SFO-s:CA-2014-01-02, Arrivals-a:SFO-s:CA-2014-01-03, Arrivals-a:SFO-s:CA-2014-01-04
Values shown: 429, 479, 433
…
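The key layout in the figure can be sketched as follows. This is a hedged illustration, not the engine's storage code: it assumes the `a:` and `s:` prefixes stand for airport and state, the `cube_key` and `total` helpers are hypothetical, and the counts in the stand-in store are made up.

```python
def cube_key(metric, day, airport=None, state=None):
    """Build a key such as 'Arrivals-a:SFO-s:CA-2014-01-02'."""
    parts = [metric]
    if airport is not None:
        parts.append(f"a:{airport}")  # assumed meaning: airport
    if state is not None:
        parts.append(f"s:{state}")    # assumed meaning: state
    parts.append(day)
    return "-".join(parts)


# Stand-in for the key/value store; these counts are made up.
store = {
    cube_key("Arrivals", "2014-01-02", state="CA"): 10,
    cube_key("Arrivals", "2014-01-03", state="CA"): 12,
}


def total(metric, days, **dims):
    """Answer a time-range query with one key lookup per day."""
    return sum(store.get(cube_key(metric, day, **dims), 0) for day in days)


ca_total = total("Arrivals", ["2014-01-02", "2014-01-03", "2014-01-04"], state="CA")
```

With keys of this shape, the cost of a query depends on the number of days requested, not on the number of underlying events.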
• Aggregate API that understands metrics, dimensions, time ranges.
• Relational API that understands (some) SQL.
API - Connecting to the Dashboard Engine
Aggregate API
SQL API
Demo
• Running on a 16-node cluster (TD Appliance for Hadoop)
• Processes and stores all data in ~2 hours
Flight Data Statistics for Demo
                 Rows          Storage space
Flight records   160 million   30 GB
MOLAP cube       35 billion    2.1 TB
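A quick calculation puts these figures in proportion. The sketch below uses only the numbers in the table and assumes decimal units (1 GB = 10^9 bytes, 1 TB = 10^12 bytes); the variable names are illustrative.

```python
# Figures from the table above, assuming decimal units.
flight_rows = 160_000_000
flight_bytes = 30 * 10**9      # 30 GB
cube_rows = 35_000_000_000
cube_bytes = 21 * 10**11       # 2.1 TB

row_expansion = cube_rows / flight_rows            # cube rows per flight record
storage_expansion = cube_bytes / flight_bytes      # cube size relative to raw data
bytes_per_cube_row = cube_bytes / cube_rows
bytes_per_flight_row = flight_bytes / flight_rows

print(f"{row_expansion:.2f}x rows, {storage_expansion:.0f}x storage")
print(f"{bytes_per_cube_row:.0f} bytes per cube row, "
      f"{bytes_per_flight_row:.1f} bytes per flight record")
```

Under these assumptions the cube holds roughly 219 rows per flight record and takes 70 times the storage, which is the space paid for lookup-speed queries.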
• Sends SQL queries to the API
SQL Query to REST API Example
SELECT FlightData.Date AS "none_Date_ok",
FlightData.State AS "none_State_nk",
SUM(FlightData.Arrivals) AS "sum_Arrivals_nk"
FROM "default"."FlightData" "FlightData"
GROUP BY "none_Date_ok", "none_State_nk"
• Translated to Aggregate API queries
http://10.25.12.241:52080/clickstream/aggregate/v1/?period=day&start=1970-01-01&dimension=State:&metric=Arrivals
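The translation from a grouped SQL query to an Aggregate API request can be sketched as below. This is an illustration under assumptions, not the engine's translator: the `aggregate_url` helper and its arguments are hypothetical, the parameter names are taken from the URLs on these slides, and it is assumed that grouping by date maps to `period=day`.

```python
from urllib.parse import urlencode

# Endpoint as shown on the slide.
BASE_URL = "http://10.25.12.241:52080/clickstream/aggregate/v1/"


def aggregate_url(metric, group_by, start, period="day", filters=None, end=None):
    """Build an Aggregate API URL from the parts of a GROUP BY query."""
    params = [("period", period), ("start", start)]
    if end is not None:
        params.append(("end", end))
    # Grouped dimensions are listed with an empty value, e.g. "State:".
    for name in group_by:
        params.append(("dimension", f"{name}:"))
    # Filtered dimensions carry a value, e.g. "State:NY".
    for name, value in (filters or {}).items():
        params.append(("dimension", f"{name}:{value}"))
    params.append(("metric", metric))
    return BASE_URL + "?" + urlencode(params, safe=":")


# SUM(Arrivals) grouped by Date and State, as in the SQL example.
url = aggregate_url("Arrivals", ["State"], "1970-01-01")
```

Passing `filters={"State": "NY"}` with `group_by=["Airport"]` would produce the per-airport form of the request, minus the `headers=on` flag.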
<index name="AirportsByState">
  <periods>
    <period>day</period>
  </periods>
  <indexDimensions>
    <dimension name="State" />
  </indexDimensions>
  <listDimensions>
    <dimension name="Airport" />
  </listDimensions>
</index>
Example index: List all Airports for a specific State
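An index definition like this one can be read with Python's standard XML parser. The sketch below only shows how the pieces of the definition fit together; the `parse_index` helper and the dictionary layout it returns are hypothetical and not part of the Dashboard Engine.

```python
import xml.etree.ElementTree as ET

# Index definition copied from the slide.
INDEX_XML = """
<index name="AirportsByState">
  <periods><period>day</period></periods>
  <indexDimensions><dimension name="State" /></indexDimensions>
  <listDimensions><dimension name="Airport" /></listDimensions>
</index>
"""


def parse_index(xml_text):
    """Extract the name, periods and dimension lists from an <index> element."""
    root = ET.fromstring(xml_text)

    def names(tag):
        return [d.get("name") for d in root.findall(f"{tag}/dimension")]

    return {
        "name": root.get("name"),
        "periods": [p.text for p in root.findall("periods/period")],
        "index_dimensions": names("indexDimensions"),  # dimensions to look up by
        "list_dimensions": names("listDimensions"),    # dimensions to enumerate
    }


index = parse_index(INDEX_XML)
```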
Aggregate use: Show arrivals for all airports for NY
http://10.25.12.241:52080/clickstream/aggregate/v1/?period=day&start=2014-01-04&end=2014-01-05&dimension=Airport:&dimension=State:NY&metric=Arrivals&headers=on
Day Start    Airport  State  Arrivals
2014-01-04   ALB      NY     20
2014-01-04   ART      NY     1
2014-01-04   BUF      NY     40
...
2014-01-04   JFK      NY     167
2014-01-04   LGA      NY     206
2014-01-04   ROC      NY     17
2014-01-04   SWF      NY     2
2014-01-04   SYR      NY     14
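A client could post-process such a result as sketched below. The slide does not show the wire format, so this assumes whitespace-delimited text with one header line; the `parse_rows` helper is hypothetical, and only the rows visible on the slide are used (the elided rows are not filled in).

```python
# Rows copied from the slide; rows elided there ("...") are not filled in.
RESPONSE = """\
Day Start Airport State Arrivals
2014-01-04 ALB NY 20
2014-01-04 ART NY 1
2014-01-04 BUF NY 40
2014-01-04 JFK NY 167
2014-01-04 LGA NY 206
2014-01-04 ROC NY 17
2014-01-04 SWF NY 2
2014-01-04 SYR NY 14
"""


def parse_rows(text):
    """Parse whitespace-delimited rows, skipping the header line (assumed format)."""
    rows = []
    for line in text.strip().splitlines()[1:]:
        day, airport, state, arrivals = line.split()
        rows.append({"day": day, "airport": airport,
                     "state": state, "arrivals": int(arrivals)})
    return rows


rows = parse_rows(RESPONSE)
shown_total = sum(r["arrivals"] for r in rows)     # total of the rows shown only
busiest = max(rows, key=lambda r: r["arrivals"])   # busiest airport among them
```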
<index name="ListFlightNoCarrierCityState">
  <periods>
    <period>day</period>
  </periods>
  <indexDimensions>
  </indexDimensions>
  <listDimensions>
    <dimension name="State" />
    <dimension name="City" />
    <dimension name="Carrier" />
    <dimension name="FlightNo" />
  </listDimensions>
</index>
Index: List Flight No / Carrier / City / State combinations
Dimensions use: Show all Flight/Carrier/City/State
http://10.25.12.241:52080/clickstream/dimensions/v1/?period=day&start=2014-01-04&end=2014-01-05&dimension=State:&dimension=City:&dimension=Carrier:&dimension=FlightNo:
"results":[
["AK","Anchorage, AK","AS","101"],
["AK","Anchorage, AK","AS","102"],
["AK","Anchorage, AK","AS","103"],
["AK","Anchorage, AK","AS","106"],
["AK","Anchorage, AK","AS","108"],
...
["AL","Huntsville, AL","DL","1782"],
["AL","Huntsville, AL","DL","2077"],
...
["WY","Rock Springs, WY","OO","7413"]]
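One plausible use of this result is to populate dashboard filters. The sketch below groups flight numbers by state and carrier; it uses only the rows visible on the slide, and the grouping itself is an illustrative choice rather than something the Dimensions API does.

```python
from collections import defaultdict

# Rows visible on the slide, in [State, City, Carrier, FlightNo] order.
results = [
    ["AK", "Anchorage, AK", "AS", "101"],
    ["AK", "Anchorage, AK", "AS", "102"],
    ["AK", "Anchorage, AK", "AS", "103"],
    ["AK", "Anchorage, AK", "AS", "106"],
    ["AK", "Anchorage, AK", "AS", "108"],
    ["AL", "Huntsville, AL", "DL", "1782"],
    ["AL", "Huntsville, AL", "DL", "2077"],
    ["WY", "Rock Springs, WY", "OO", "7413"],
]

# Group flight numbers under each (state, carrier) pair.
flights_by_state_carrier = defaultdict(list)
for state, city, carrier, flight_no in results:
    flights_by_state_carrier[(state, carrier)].append(flight_no)

states = sorted({row[0] for row in results})  # values for a State filter
```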
<index name="ListFlightNoByCarrierState">
  <periods>
    <period>day</period>
  </periods>
  <indexDimensions>
    <dimension name="State" />
    <dimension name="Carrier" />
  </indexDimensions>
  <listDimensions>
    <dimension name="FlightNo" />
  </listDimensions>
</index>
Index Question
Q: How do you drill down to a list of Delta flights that caused delays in Colorado?
A: Create the ListFlightNoByCarrierState index shown on this slide, rerun the index creation step, then query the delay metrics for the given state and carrier while listing flight numbers with dimension=FlightNo:
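The reasoning behind this answer can be sketched as a lookup over the three index definitions from these slides. The matching rule here is an assumption made for illustration (filtered dimensions must equal the index dimensions, and listed dimensions must be among the list dimensions); the engine's real rule is not stated in the deck, and `covering_indexes` is a hypothetical helper.

```python
# The three indexes defined on the slides, reduced to their dimension sets.
INDEXES = {
    "AirportsByState": {"index": {"State"}, "list": {"Airport"}},
    "ListFlightNoCarrierCityState": {
        "index": set(),
        "list": {"State", "City", "Carrier", "FlightNo"},
    },
    "ListFlightNoByCarrierState": {
        "index": {"State", "Carrier"},
        "list": {"FlightNo"},
    },
}


def covering_indexes(filtered, listed):
    """Return the indexes that could serve a query (assumed matching rule)."""
    return [
        name
        for name, idx in INDEXES.items()
        if set(filtered) == idx["index"] and set(listed) <= idx["list"]
    ]


# Drill-down from the question: filter by state and carrier, list flight numbers.
drill_down = covering_indexes(filtered=["State", "Carrier"], listed=["FlightNo"])
```

Under this rule, neither of the two earlier indexes covers the drill-down, which is why a new index has to be created and the index creation step rerun.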
Questions?
DATA ANALYTICS
DATA ENGINEERS
DATA SOLUTIONS
Think Big International
We are hiring!!!
http://thinkbigcareers.teradata.com/