Building a Hybrid Data Pipeline for Salesforce and Hadoop Sumit Sarkar, Chief Data Evangelist for Progress @SAsInSumit www.linkedin.com/in/meetsumit

Page 1: Building a Hybrid Data Pipeline for Salesforce and Hadoop

Building a Hybrid Data Pipeline for Salesforce and Hadoop

Sumit Sarkar, Chief Data Evangelist for Progress

@SAsInSumit

www.linkedin.com/in/meetsumit

Page 2: Building a Hybrid Data Pipeline for Salesforce and Hadoop

© 2016 Progress Software Corporation and/or its subsidiaries or affiliates. All rights reserved.2

Agenda

Overview of project

Hybrid pipeline for ingestion of SaaS sources into Hadoop

Hybrid pipeline to access Hadoop from Salesforce Cloud

Best practices and lessons learned

Page 3: Building a Hybrid Data Pipeline for Salesforce and Hadoop

Overview of project

Page 4: Building a Hybrid Data Pipeline for Salesforce and Hadoop

Need answers at marketing speed...

1. Invite contacts to our webinar who identified specific interests with active opportunities, and include related contacts to that same project?

2. Identify common ProductA pages that convert to ProductB evaluations (i.e., measure cross-sell potential)

3. What web content did the 1100 webinar attendees view following the webinar?

4. Analyze lead histories to track accuracy in lead routing assignment (Salesforce limits values that can be reported against)

5. Identify which of our web content is most visited for sales opportunities that were closed/won?

6. Create a list based on content consumption and 2017 survey answers?

7. What content was consumed across our strategic accounts?

8. Who complained about a broken link in the survey?

9. …

Page 5: Building a Hybrid Data Pipeline for Salesforce and Hadoop

Analytics Today

Marketing: Data Management Platform; Embedded Insights; Operational Insights

Information Technology: Data Warehouse; Enterprise Reporting/Analytics; Enterprise Data Integration

LoB: Desktop Analytics / Spreadmarts; Sumologic Analytics

Mixed access to Martech/IT analytics

Page 6: Building a Hybrid Data Pipeline for Salesforce and Hadoop

Thus, we embarked on a Data Lake for Progress

Detailed data such as activities in Eloqua were not suitable to store in Corporate Data Warehouse or Marketing DMP

Data fragmented across sales CRM and service CX; marketing automation; web analytics; usage for cloud apps; survey platforms; webinar data, etc.

Did not know what questions to ask in advance – how to define star schema?

Many analytics tools across Progress

Emerging data science expertise

Started with Pilot to start experimenting rather than wait to build a business case for funding

Page 7: Building a Hybrid Data Pipeline for Salesforce and Hadoop

Overview of Progress Marketing Data Lake

Progress Corporate Firewall

Page 8: Building a Hybrid Data Pipeline for Salesforce and Hadoop

Hybrid pipeline for ingestion of SaaS sources into Hadoop

Page 9: Building a Hybrid Data Pipeline for Salesforce and Hadoop

Data Collection Process

Page 10: Building a Hybrid Data Pipeline for Salesforce and Hadoop

Sample Data: Oracle Eloqua Profiler data

Page 11: Building a Hybrid Data Pipeline for Salesforce and Hadoop

Sample Data: CRM

Page 12: Building a Hybrid Data Pipeline for Salesforce and Hadoop

Sample Data: Web traffic

Page 13: Building a Hybrid Data Pipeline for Salesforce and Hadoop

Sample Data: Survey

Other popular responses

Linode

Pironet

Redhat OpenShift

OpenStack

Cloud Share

Thomson Reuters Elektron

SAP HANA

Claro Cloud

Page 14: Building a Hybrid Data Pipeline for Salesforce and Hadoop

Hybrid Environments can Limit Access to Data

Locked behind the firewall

Locked behind other clouds

Page 15: Building a Hybrid Data Pipeline for Salesforce and Hadoop

Not trivial to get data across different marketing data sources

Data Source: API

Eloqua: Web Services API (REST/SOAP); Bulk and non-Bulk APIs; no query language

Oracle Service Cloud: Web Services APIs (REST/SOAP); ROQL

Google Analytics: Hypercube (query limits of 10 metrics grouped by max of 7 dimensions)

Veeva CRM: SOAP, BULK, Metadata APIs; SOQL
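
The Google Analytics limit listed above (10 metrics per query, grouped by at most 7 dimensions) means a wide extract has to be split into several requests. The following is a minimal sketch of that batching only; the metric and dimension names are hypothetical, and the code plans the requests without calling any Google API.

```python
from typing import Dict, List

MAX_METRICS = 10     # per-query metric limit quoted on the slide
MAX_DIMENSIONS = 7   # per-query dimension limit quoted on the slide


def plan_queries(metrics: List[str], dimensions: List[str]) -> List[Dict[str, List[str]]]:
    """Split a wide metric list into requests that respect both limits."""
    if len(dimensions) > MAX_DIMENSIONS:
        raise ValueError(f"at most {MAX_DIMENSIONS} dimensions per query")
    return [
        {"metrics": metrics[i:i + MAX_METRICS], "dimensions": list(dimensions)}
        for i in range(0, len(metrics), MAX_METRICS)
    ]


# Hypothetical names, for illustration only.
plans = plan_queries([f"metric{n}" for n in range(23)], ["date", "pagePath"])
print([len(p["metrics"]) for p in plans])  # prints [10, 10, 3]
```

Each planned request repeats the same dimensions, so the partial results can be joined back together on those dimension values after download.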

Page 16: Building a Hybrid Data Pipeline for Salesforce and Hadoop

REST API for bulk export to support analytics

Page 17: Building a Hybrid Data Pipeline for Salesforce and Hadoop

Hybrid pipeline for ingestion of cloud data sources to ground Hadoop system

Define SaaS data model for data integration

Optimize SQL request against SaaS APIs

Mapping JDBC user auth to SaaS APIs

Standard JDBC Client for Apache Sqoop
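
As a rough illustration of the last box, the sketch below assembles a `sqoop import` command that reads one table through a generic JDBC driver. The JDBC URL, driver class, user, paths, and table name are placeholders invented for this example and are not taken from the slides; only the Sqoop options themselves (`--connect`, `--driver`, `--username`, `--password-file`, `--table`, `--target-dir`, `--num-mappers`) are standard.

```python
import shlex
from typing import List


def sqoop_import_args(jdbc_url: str, driver_class: str, username: str,
                      password_file: str, table: str, target_dir: str,
                      num_mappers: int = 1) -> List[str]:
    """Build the argv for a `sqoop import` that reads one table over JDBC."""
    return [
        "sqoop", "import",
        "--connect", jdbc_url,
        "--driver", driver_class,
        "--username", username,
        "--password-file", password_file,  # file holding the password
        "--table", table,
        "--target-dir", target_dir,
        "--num-mappers", str(num_mappers),
    ]


# All values below are placeholders, not real endpoints or driver names.
args = sqoop_import_args(
    jdbc_url="jdbc:example://saas-gateway.example.com:443;source=crm",
    driver_class="com.example.jdbc.SaaSDriver",
    username="pipeline_user",
    password_file="/user/pipeline/.saas_password",
    table="CONTACT",
    target_dir="/data/lake/raw/crm/contact",
)
print(shlex.join(args))
```

The sketch defaults to a single mapper; running with more mappers would normally also need a split column (`--split-by`) when the source table exposes no primary key.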

Page 18: Building a Hybrid Data Pipeline for Salesforce and Hadoop

Sample Insights: Trends in 2016 for product usage following related events

March & September

June

Page 19: Building a Hybrid Data Pipeline for Salesforce and Hadoop

Hybrid pipeline to access Hadoop from Salesforce Cloud

Page 20: Building a Hybrid Data Pipeline for Salesforce and Hadoop

Connect Salesforce to Big Data

Success Scoring

Personalization

Archived Insight

360 Reporting

Corporate Firewall

?

Page 21: Building a Hybrid Data Pipeline for Salesforce and Hadoop

Salesforce Connect integration for Big Data

Salesforce Connect maps Salesforce external objects to data tables in external systems. Instead of copying the data into your organization, Salesforce Connect accesses the data on demand and in real time. The data is never stale, and we access only what you need.

Recommended when:

• You have a large amount of data that you don’t want to copy into your Salesforce organization.

• You need small amounts of data at any one time.

• You want real-time access to the latest data.

Page 22: Building a Hybrid Data Pipeline for Salesforce and Hadoop

OData: an open protocol to allow the creation and consumption of queryable and interoperable RESTful APIs in a simple and standard way.

OASIS Standard REST API (“SQL for the web”)

Ratified as an OASIS standard in February 2014

Operations built on REST principles

Uniform URL conventions

Surface metadata in a standard way

Access requires OData endpoint

First member to join OData Technical Committee
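
To make the "uniform URL conventions" point concrete, here is a small sketch that builds an OData query URL from the standard system query options `$select`, `$filter`, and `$top`. The service root, entity set, and property names are hypothetical; nothing here is specific to any particular OData product.

```python
from typing import Optional, Sequence
from urllib.parse import quote, urlencode


def odata_query_url(service_root: str, entity_set: str,
                    select: Optional[Sequence[str]] = None,
                    filter_expr: Optional[str] = None,
                    top: Optional[int] = None) -> str:
    """Compose an OData query URL from a few standard query options."""
    options = {}
    if select:
        options["$select"] = ",".join(select)
    if filter_expr:
        options["$filter"] = filter_expr
    if top is not None:
        options["$top"] = str(top)
    url = f"{service_root.rstrip('/')}/{entity_set}"
    if options:
        # Keep "$", "," and "'" readable; spaces become %20.
        url += "?" + urlencode(options, quote_via=quote, safe="$,'")
    return url


# Hypothetical endpoint and entity names.
url = odata_query_url(
    "https://gateway.example.com/api/odata4/datalake/",
    "WebActivity",
    select=["ContactId", "PagePath"],
    filter_expr="Industry eq 'Software'",
    top=10,
)
print(url)
```

By the same conventions, appending `$metadata` to the service root returns the service's data model, which is how a client discovers the available entities.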

Page 23: Building a Hybrid Data Pipeline for Salesforce and Hadoop

For Data Behind a Firewall, there is no Common Access Approach for Clouds

1. Network Based VPN

2. SSH Tunneling

3. Reverse Proxy Servers

Page 24: Building a Hybrid Data Pipeline for Salesforce and Hadoop

Firewall Becoming Barrier for Hybrid Data Tech Adoption

Source: “The 2017 State of the Firewall” produced by FireMon

Page 25: Building a Hybrid Data Pipeline for Salesforce and Hadoop

Hybrid pipeline for on-demand access to Hadoop system on the ground from Salesforce Cloud

Publish OData endpoint for on-demand access

On-premises data gateway for firewall friendly connection

Mapping user auth to Hadoop ecosystem

Reverse Engineer OData REST API entity data model

Page 26: Building a Hybrid Data Pipeline for Salesforce and Hadoop

Firewall Friendly Architecture for On-premises Data Gateway

On-premises data gateway for firewall friendly connection

Page 27: Building a Hybrid Data Pipeline for Salesforce and Hadoop

Getting Started with Salesforce External Object Reports

Report with data blended from Standard and External Objects (pulled on-demand from on-premises Data Lake)

Page 28: Building a Hybrid Data Pipeline for Salesforce and Hadoop

Best practices and lessons learned

Page 29: Building a Hybrid Data Pipeline for Salesforce and Hadoop

What worked in building the Data Lake

Building advocacy for marketing data: Brought together several teams across Product, Sales, Marketing Operations, Engineering, and others interested in the data available for analysis.

Trends and Research: Revenue attribution to content using activity data correlated across opportunities and content consumed.

Supplement existing CRM analytics: Analyze detailed CRM lead histories to measure lead routing effectiveness. Salesforce reports do not support analysis on values of detailed lead activity fields.

Identify new and highly focused segments: Able to leverage SMEs to identify laser-focused targets, i.e., which leads have the specific tech stack that the next webinar is targeting.

Page 30: Building a Hybrid Data Pipeline for Salesforce and Hadoop

Lessons Learned ingesting SaaS data into Hadoop

Infrastructure: On-premises Hadoop cluster was not ideal for LoB to manage and tune for the pilot, but we did not have approval to land LoB data in the cloud. Need additional support to go live.

Data Quality: Data lake dumps raw data from source systems, so we end up with activity from automated tests that can skew data, for example.

Metadata: Schema changes to source objects impacted ingestion, so that needs to be planned for with SaaS applications.

Data Integration: Even with a standard pipeline in place, there are still limits that apply for initial load on certain objects, as many APIs are primarily designed for application integration.
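
One way to plan for the schema changes mentioned on this slide is to compare the columns seen on the previous ingestion run with the columns seen now, and flag any difference before loading. This is only a sketch of that idea, not the process the team used; the column names are hypothetical.

```python
from typing import Dict, Iterable, List


def schema_drift(previous: Iterable[str], current: Iterable[str]) -> Dict[str, List[str]]:
    """Report columns added to or removed from a source object between two loads."""
    prev, curr = set(previous), set(current)
    return {"added": sorted(curr - prev), "removed": sorted(prev - curr)}


# Hypothetical column lists for a SaaS object, captured on two ingestion runs.
last_run = ["Id", "Email", "LeadSource", "CreatedDate"]
this_run = ["Id", "Email", "LeadSource__c", "CreatedDate", "Region"]

drift = schema_drift(last_run, this_run)
if drift["added"] or drift["removed"]:
    print("schema changed:", drift)
```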

Page 31: Building a Hybrid Data Pipeline for Salesforce and Hadoop

Lessons Learned accessing Hadoop from Salesforce

1. Mapping OData entities to Big Data objects

2. Primary keys for Big Data entities

3. HiveServer1 vs HiveServer2 for concurrency

4. External Objects have limits and a 2-minute max timeout

5. Native Reporting support was added in Winter ’17

6. Search considerations

7. Need agile OData service with Data Lake

8. Data Governance and Masking

9. CRM User Experience (strategies to improve performance)

Accessing external Big Data objects
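
Items 1 and 2 arise because an OData entity needs a key, while Hive tables usually carry no declared primary key. One possible workaround, shown here purely as an assumption and not as the approach the slides describe, is to derive a deterministic surrogate key by hashing the columns that together identify a row. The row and column names are hypothetical.

```python
import hashlib
from typing import Mapping, Sequence


def surrogate_key(row: Mapping[str, object], key_columns: Sequence[str]) -> str:
    """Derive a stable key from the columns that together identify a row."""
    # Join with a separator that is unlikely to occur in the data (assumption).
    raw = "\x1f".join(str(row[col]) for col in key_columns)
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


# Hypothetical web-activity row; column names are illustrative only.
row = {"contact_id": "003A000001", "page": "/products/a",
       "viewed_at": "2016-06-01T12:00:00Z"}
key = surrogate_key(row, ["contact_id", "page", "viewed_at"])
print(key)
```

Because the key depends only on the chosen column values, the same row yields the same key on every load, which keeps record identifiers stable on the Salesforce side.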

Page 32: Building a Hybrid Data Pipeline for Salesforce and Hadoop

Design Patterns for external objects

Enable Separate Loading of Related Lists of External Objects

Performance Tuning Tips for Related Lists in Account

000148978

Page 33: Building a Hybrid Data Pipeline for Salesforce and Hadoop

Decrease latency accessing Big Data over Hive

Tuning details

1. Use Apache Tez as the execution engine for Hive

2. Use ORCFile, a newer storage format

3. Use vectorized query execution (Hive 0.13)

4. Performance tuning (partitions, indexes, buckets, block sizes, etc.)

5. Consider another query interface (e.g., Apache HAWQ)
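
The first three tuning items map onto standard Hive settings and DDL. The sketch below just assembles the corresponding HiveQL statements as strings; the table names are hypothetical, and whether these settings help depends on the cluster and Hive version in use.

```python
from typing import List


def hive_tuning_statements(source_table: str, orc_table: str) -> List[str]:
    """Return HiveQL statements for items 1-3 of the tuning list."""
    return [
        "SET hive.execution.engine=tez",
        "SET hive.vectorized.execution.enabled=true",
        # Copy the data into ORC so later queries read the columnar format.
        f"CREATE TABLE {orc_table} STORED AS ORC AS SELECT * FROM {source_table}",
    ]


# Table names are hypothetical.
statements = hive_tuning_statements("web_activity_raw", "web_activity_orc")
for stmt in statements:
    print(stmt + ";")
```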

Page 34: Building a Hybrid Data Pipeline for Salesforce and Hadoop

Roadmap

Sumit Sarkar

Product Marketing

@SAsInSumit

linkedin.com/in/meetsumit

Page 35: Building a Hybrid Data Pipeline for Salesforce and Hadoop