Building a Hybrid Data Pipeline for Salesforce and Hadoop
Sumit Sarkar, Chief Data Evangelist for Progress
@SAsInSumit
www.linkedin.com/in/meetsumit
© 2016 Progress Software Corporation and/or its subsidiaries or affiliates. All rights reserved.
Agenda
Overview of project
Hybrid pipeline for ingestion of SaaS sources into Hadoop
Hybrid pipeline to access Hadoop from Salesforce Cloud
Best practices and lessons learned
Overview of project
Need answers at marketing speed:
1. Invite contacts to our webinar who identified specific interests with active opportunities, and include related contacts on that same project.
2. Identify common ProductA pages that convert to ProductB evaluations (i.e. measure cross-sell potential).
3. What web content did the 1100 webinar attendees view following the webinar?
4. Analyze lead histories to track accuracy in lead routing assignment (Salesforce limits the values that can be reported against).
5. Which of our web content is most visited for sales opportunities that were closed/won?
6. Create a list based on content consumption and 2017 survey answers.
7. What content was consumed across our strategic accounts?
8. Who complained about a broken link in the survey?
9. …
Analytics Today
Marketing
Data Management Platform
Embedded Insights
Operational Insights
Information Technology
Data Warehouse
Enterprise Reporting/Analytics
Enterprise Data Integration
LoB
Desktop Analytics / Spreadmarts
Sumologic Analytics
Mixed access to Martech/IT analytics
Thus, we embarked on a Data Lake for Progress
Detailed data, such as activities in Eloqua, was not suitable to store in the Corporate Data Warehouse or the Marketing DMP
Data fragmented across sales CRM and service CX; marketing automation; web analytics; usage for cloud apps; survey platforms; webinar data, etc.
Did not know what questions to ask in advance: how to define a star schema?
Many analytics tools across Progress
Emerging data science expertise
Started with a pilot to begin experimenting rather than wait to build a business case for funding
Overview of Progress Marketing Data Lake
Progress Corporate Firewall
Hybrid pipeline for ingestion of SaaS sources into Hadoop
Data Collection Process
Sample Data: Oracle Eloqua Profiler data
Sample Data: CRM
Sample Data: Web traffic
Sample Data: Survey
Other popular responses
Linode
Pironet
Redhat OpenShift
OpenStack
Cloud Share
Thomson Reuters Elektron
SAP HANA
Claro Cloud
Hybrid Environments can Limit Access to Data
Locked behind the firewall
Locked behind other clouds
Not trivial to get data across different marketing data sources
Data Source: API
Eloqua: Web Services API (REST/SOAP); Bulk and non-Bulk APIs; no query language
Oracle Service Cloud: Web Services APIs (REST/SOAP); ROQL
Google Analytics: Hypercube (query limits of 10 metrics grouped by a max of 7 dimensions)
Veeva CRM: SOAP, BULK, Metadata APIs; SOQL
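One way to live with the Google Analytics query limits quoted above (10 metrics, 7 dimensions per query) is to split a wide request into several narrower ones. The sketch below is illustrative only: the function name and the metric and dimension names are hypothetical, and it does not call any Google API.

```python
MAX_METRICS = 10      # per-query metric limit quoted on the slide
MAX_DIMENSIONS = 7    # per-query dimension limit quoted on the slide


def split_metrics(metrics, dimensions):
    """Split one wide request into batches that each respect the limits.

    Every batch repeats the same dimensions so the results can be
    joined back together on those dimensions afterwards.
    """
    if len(dimensions) > MAX_DIMENSIONS:
        raise ValueError(
            f"{len(dimensions)} dimensions requested; limit is {MAX_DIMENSIONS}"
        )
    return [
        {"metrics": metrics[i:i + MAX_METRICS], "dimensions": list(dimensions)}
        for i in range(0, len(metrics), MAX_METRICS)
    ]


# Hypothetical request: 23 metrics grouped by 2 dimensions.
batches = split_metrics([f"metric{i}" for i in range(23)], ["date", "pagePath"])
```

Each batch can then be sent as its own query and the partial results merged on the shared dimensions.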
REST API for bulk export to support analytics
Hybrid pipeline for ingestion of cloud data sources to ground Hadoop system
Define SaaS data model for data integration
Optimize SQL request against SaaS APIs
Mapping JDBC user auth to SaaS APIs
Standard JDBC Client for Apache Sqoop
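As a rough illustration of the last step, a Sqoop import over a generic JDBC driver can be assembled as below. This is a minimal sketch, not the configuration used in the project: the JDBC URL, driver class, user, table name, and paths are all hypothetical placeholders, and only standard `sqoop import` options are used.

```python
import shlex


def build_sqoop_import(jdbc_url, driver_class, username, password_file,
                       table, target_dir, num_mappers=1):
    """Return the argument list for a `sqoop import` of one table over JDBC."""
    return [
        "sqoop", "import",
        "--connect", jdbc_url,
        "--driver", driver_class,          # generic JDBC driver class
        "--username", username,
        "--password-file", password_file,  # keeps the password off the command line
        "--table", table,
        "--target-dir", target_dir,        # HDFS directory for the imported files
        "--num-mappers", str(num_mappers),
    ]


cmd = build_sqoop_import(
    jdbc_url="jdbc:example:saas://host.example.com",  # hypothetical URL
    driver_class="com.example.jdbc.SaasDriver",       # hypothetical class
    username="pipeline_user",                         # hypothetical user
    password_file="/user/pipeline/.saas_password",    # hypothetical path
    table="CONTACT",                                  # hypothetical table
    target_dir="/data/lake/raw/contact",              # hypothetical path
)
print(shlex.join(cmd))
```

The sketch defaults to a single mapper because a parallel import also needs a primary key or an explicit `--split-by` column, which a SaaS-backed table may not expose.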
Sample Insights: Trends in 2016 for product usage following related events
March &
September
June
Hybrid pipeline to access Hadoop from Salesforce Cloud
Connect Salesforce to Big Data
Success Scoring
Personalization
Archived Insight
360 Reporting
Corporate Firewall
?
Salesforce Connect maps Salesforce external objects to data tables in external systems. Instead of copying the data into your organization, Salesforce Connect accesses the data on demand and in real time. The data is never stale, and we access only what you need.
Recommended when:
• You have a large amount of data that you don’t want to copy into your Salesforce organization.
• You need small amounts of data at any one time.
• You want real-time access to the latest data.
Salesforce Connect integration for Big Data
OData: an open protocol to allow the creation and consumption of queryable and interoperable RESTful APIs in a simple and standard way.
OASIS Standard REST API (“SQL for the web”)
Ratified as an OASIS standard in February 2014
Operations built on REST principles
Uniform URL conventions
Surface metadata in standard way
Access requires OData endpoint
First member to join OData Technical Committee
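To make the "uniform URL conventions" point concrete, the sketch below builds an OData query URL from the standard `$select`, `$filter`, and `$top` system query options. The service root, entity set, and field names are hypothetical and do not come from the presentation.

```python
from urllib.parse import quote, urlencode


def odata_query_url(service_root, entity_set, select=None, filter_expr=None, top=None):
    """Build an OData query URL using standard system query options."""
    options = {}
    if select:
        options["$select"] = ",".join(select)   # project only these properties
    if filter_expr:
        options["$filter"] = filter_expr        # server-side row filter
    if top is not None:
        options["$top"] = str(top)              # cap the number of rows returned
    url = f"{service_root.rstrip('/')}/{entity_set}"
    if options:
        # Keep "$", "," and "'" readable; spaces are percent-encoded.
        url += "?" + urlencode(options, quote_via=quote, safe="$,'")
    return url


url = odata_query_url(
    "https://example.com/odata/",   # hypothetical service root
    "WebActivity",                  # hypothetical entity set
    select=["Email", "PageUrl"],
    filter_expr="AccountId eq '001'",
    top=50,
)
print(url)
```

Asking only for the columns and rows needed fits the "small amounts of data at any one time" guidance for on-demand access.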
For Data Behind a Firewall, there is no Common Access Approach for Clouds
1. Network Based VPN
2. SSH Tunneling
3. Reverse Proxy Servers
Firewall Becoming Barrier for Hybrid Data Tech Adoption
Source: “The 2017 State of the Firewall,” produced by FireMon
Hybrid pipeline for on-demand access to Hadoop system on the ground from Salesforce Cloud
Publish OData endpoint for on-demand access
On-premises data gateway for firewall friendly connection
Mapping user auth to Hadoop ecosystem
Reverse Engineer OData REST API entity data model
Firewall Friendly Architecture for On-premises Data Gateway
On-premises data gateway for firewall friendly connection
Getting Started with Salesforce External Object Reports
Report with data blended from Standard and External Objects (pulled on-demand from on-premises Data Lake)
Best practices and lessons learned
What worked in building the Data Lake
Building advocacy for marketing data: Brought together several teams across Product, Sales, Marketing Operations, Engineering, and others interested in the data available for analysis.
Trends and Research: Revenue attribution to content using activity data correlated across opportunities and content consumed.
Supplement existing CRM analytics: Analyze detailed CRM lead histories to measure lead routing effectiveness. Salesforce reports do not support analysis on values of detailed lead activity fields.
Identify new and highly focused segments: Able to leverage SMEs to identify laser-focused targets, e.g. which leads have the specific tech stack that the next webinar is targeting.
Lessons Learned ingesting SaaS data into Hadoop
Infrastructure: On-premises Hadoop cluster was not ideal for LoB to manage and tune for the pilot, but we did not have approval to land LoB data in the cloud. Need additional support to go live.
Data Quality: Data lake dumps raw data from source systems, so we end up with, for example, activity from automated tests that can skew data.
Metadata: Schema changes to source objects impacted ingestion, so that needs to be planned for with SaaS applications.
Data Integration: Even with a standard pipeline in place, there are still limits that apply for initial load on certain objects, as many APIs are primarily designed for application integration.
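One way to plan for source schema changes is to compare the columns seen on the current run with those recorded on the previous run before ingesting. The sketch below is an assumption about how such a check could look, not a description of what the project did. The column names and types are hypothetical.

```python
def diff_schema(expected, actual):
    """Compare two {column: type} mappings and report what changed."""
    added = sorted(set(actual) - set(expected))
    removed = sorted(set(expected) - set(actual))
    retyped = sorted(
        col for col in set(expected) & set(actual) if expected[col] != actual[col]
    )
    return {"added": added, "removed": removed, "retyped": retyped}


# Hypothetical schemas for one source object, from the last run and this run.
last_run = {"Id": "string", "Email": "string", "Score": "int"}
this_run = {"Id": "string", "Email": "string", "Score": "double", "Region": "string"}

drift = diff_schema(last_run, this_run)
print(drift)
```

A non-empty result could pause the load for that object or trigger a table update, depending on how strict the pipeline needs to be.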
Lessons Learned accessing Hadoop from Salesforce
1. Mapping OData entities to Big Data objects
2. Primary keys for Big Data entities
3. HiveServer1 vs HiveServer2 for concurrency
4. External Objects have limits and a 2-minute max timeout
5. Native Reporting support was added in Winter '17
6. Search considerations
7. Need agile OData service with Data Lake
8. Data Governance and Masking
9. CRM User Experience (strategies to improve performance)
Accessing external Big Data objects
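On point 2: OData entities need a key, while tables in a data lake often have none. The presentation does not say how keys were assigned. One possible approach, sketched here purely as an assumption, is to derive a deterministic key from the columns that together identify a row. The column names are hypothetical.

```python
import hashlib


def synthetic_key(row, key_columns):
    """Derive a stable key by hashing the columns that identify a row."""
    parts = ["" if row[col] is None else str(row[col]) for col in key_columns]
    # The unit separator keeps ("ab", "c") distinct from ("a", "bc").
    joined = "\x1f".join(parts)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


# Hypothetical web-activity rows with no natural single-column key.
row_a = {"email": "a@example.com", "page": "/docs", "ts": "2016-06-01T10:00:00"}
row_b = {"email": "a@example.com", "page": "/docs", "ts": "2016-06-01T10:05:00"}

key_a = synthetic_key(row_a, ["email", "page", "ts"])
key_b = synthetic_key(row_b, ["email", "page", "ts"])
```

The same row always yields the same key, so a record opened from Salesforce resolves to the same entity on every request.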
Design Patterns for external objects
Enable Separate Loading of Related Lists of External Objects
Performance Tuning
Tips for Related Lists
in Account
000148978
Decrease latency accessing Big Data over Hive
Tuning details
1. Use Apache Tez as execution engine for Hive
2. Use the ORC file storage format
3. Use vectorized query execution (available since Hive 0.13)
4. Performance Tuning (Partitions, Indexes, Buckets, Block Sizes, etc.)
5. Consider another query interface (e.g. Apache HAWQ)
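The first three tips correspond to standard Hive settings and DDL. The sketch below only assembles the HiveQL statements as strings. How they are submitted (Beeline, JDBC, or another client) is left open, and the table names are hypothetical.

```python
def hive_tuning_statements(source_table, orc_table):
    """Return HiveQL statements for Tez, vectorization, and an ORC copy."""
    return [
        # 1. Run queries on Tez instead of MapReduce.
        "SET hive.execution.engine=tez",
        # 3. Process rows in batches rather than one at a time.
        "SET hive.vectorized.execution.enabled=true",
        # 2. Rewrite the raw table into the columnar ORC format.
        f"CREATE TABLE {orc_table} STORED AS ORC AS SELECT * FROM {source_table}",
    ]


# Hypothetical table names.
statements = hive_tuning_statements("web_activity_raw", "web_activity_orc")
for stmt in statements:
    print(stmt + ";")
```

Pointing the OData-facing queries at the ORC copy rather than the raw table is one way to help stay inside the external object timeout mentioned earlier.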
Roadmap
Sumit Sarkar
Product Marketing
@SAsInSumit
linkedin.com/in/meetsumit