Ken Owens, CTO, Cisco Intercloud Services, 07/15/15
How Cisco Migrated from MapReduce Jobs to Spark Jobs
Cisco and/or its affiliates. All rights reserved.Presentation_ID Cisco Public
Introduction
• Trends
• Alignment to Business Outcomes
• Services vs. Legos
• Platform
• Software Defined Disruption
The Uber Trend: Exponential Rise in Connectivity
• 30M new devices connected every week
• 78% of workloads processed in cloud DCs by 2018
• 5TB+ of data per person by 2020
• 180B mobile apps downloaded in 2015
• 277X data created by IoE devices v. end-user
Source: IDC
Exponential Growth Drives Opportunities
[Figure labels: Exponential Trend, Linear Trend, Knee of Curve, Disruptive Stress/Opportunity. Source: Peter Diamandis, BOLD]
When Products Become Cloud-enabled, They Become 10X More Valuable
[Figure: product price pairs: $23.19 / $249.00, $18.01 / $199.00, $5.99 / $59.99]
A Broader Perspective than Hybrid Cloud Is Required…
[Figure labels: SaaS, PaaS, IaaS; Data Center, Cloud, Edge / IoT]
© 2015 Cisco and/or its affiliates. All rights reserved. Cisco PublicPresentation ID
CIOs Need to Embrace Both Traditional and Hyperscale Application Deployment
Hyperscale applications serving several thousands of users very quickly
• IoE and increasing connectivity driving the need for such workloads
• Hadoop, mobile back-ends, gaming, social
• Small (~10%), yet rapidly growing percentage of applications in the cloud
Traditional enterprise applications
• ERP, CRM, applications that leverage traditional databases
• Majority of applications being run for/by enterprises today
Application Portability and Interoperability Is the Key
Traditional Applications: ERP, Financial, Client/Server, CRM, email, …
Cloud Native Applications: IoT, Big Data, Analytics, Gaming, ...
[Figure labels: SaaS, PaaS, IaaS; Data Center, Cloud, Edge / IoT]
Bimodal IT Is the New Normal
45% of CIOs currently have a second fast/agile mode of operation
• Traditional Mode: Requires Reliability (ITIL, CMMI, COBIT)
• Nonlinear Mode: Accept Instability (DevOps, automation, reusable)
[Figure labels: Systems of Record, Systems of Differentiation, Systems of Innovation; Change, Governance]
Source: Gartner, Lydia Leong
We Must Connect the Clouds
Islands of Isolated PC LAN Networks (1990s)
• Multiple LANs using a multitude of protocols
The Internet
• Connected using industry-standard IP protocol
• IP Based, Open Standards
World of Isolated Clouds (2000s)
• Individual custom-built clouds without consistent APIs
The Intercloud
• Connected for application acceleration with Open APIs
• Web-scale Architecture, API-Driven Automation
• Open, Secure, Compliant, Hybrid IT
Use Case: Customer Interaction Analytics
Omni-Channel Customer Journeys
The customers’ interaction with Cisco across multiple touch points to get the desired business outcome.
Channels: Server Logs, Social & Chat, Mobile, Event Streams, Call Center
Interaction Touch points: S/W Download, Open Trouble Ticket, Assign Engineer, Update Trouble Ticket, Close Trouble Ticket, Resolve Trouble Ticket, Read Support Documents, View Design Documents, View Tech Documents, New Registration, Bug Search, FAQs, Contract Details, Product Details, Device Coverage
Journey: Case Resolution, Software Upgrade
Customer Interaction Analytics: From Journey to Outcome…
Customer Journeys
• Software Upgrades
• Bug Inquiry
• Software Inquiry
• Trouble Ticket Lifecycle
• Device Troubleshooting
• New Registration
• Contract Renewal
Behavioral Insights
• Customer Interest Analytics
• Customer Experience Analytics
• Resource Forecasting
• Security and Compliance
Impact
• Boost Self Service
• Real-time Content Optimization & Recommendation
• Context Based Predictive Alerts
• Implicit Personalization
Customer Interaction Analytics Big Data Platform
Synthesize customer journey maps into behavioral insights.
Data Sources: Server Logs, Call Center, Mobility, Social, Event Streams
Data Ingestion: CiscoDV, Kafka, Redis, ETL
Analytics Model: Build Model, Activity Refinement, Activity Synthesis, Synthesized Insights
Real-time Processing, Batch Analytics
Insight Services: CiscoDV
Interact: Impala, Hive, Pig, ES, Zoomdata, Platfora
AWS and CIS Intercloud Solution
AWS Platform
Component            Cloud::Hadoop       Cloud::Queries         Cloud::Streams
                     (Batch Analytics)   (Interactive Queries)  (Near Real-time Analytics)
Virtual Machines     30                  6                      5
AWS Instance Sizing  m3.2xlarge          c3.xlarge              m3.xlarge
Virtual Cores        8/VM                4/VM                   4/VM
RAM                  30GB/VM             7.5GB/VM               15GB/VM
Disk                 1.5 TB/VM           1.5 TB/VM              1.5 TB/VM
Case for Cisco Intercloud Services for Analytics…
Cisco Security and Compliance requirements
• Workloads that deal with personally identifiable data and Cisco confidential content cannot be uploaded to AWS. Cisco internal cloud solution is a better fit.
Customer journey beyond the enterprise
• Applications are hosted on AWS
• Partner systems hosted on AWS and other cloud providers
Presence in AWS and other cloud services is required to support these scenarios for end-to-end customer journey insights.
Data virtualization integrated in the CIS Analytics Stack
• Connect data from multiple clouds and multiple big data platforms
Integrated visualization toolset
CIS Analytics Platform
CIS Analytics Platform Requirements
Infra Provisioning: Deploy a virtual private cloud (VPC) on CIS with compute, storage and memory requirements comparable to the current production system.
OpenStack: Icehouse OpenStack with Neutron, Nova, and Swift installed.
Big Data Ecosystem: Cloudera’s Hadoop distribution version CDH 5.1.3, ELK Stack, Apache Kafka and Apache Storm.
Data Virtualization & Cloud Integration: Access to data services and data stores via Cisco Data Virtualization.
Runtime Services: Foundational PaaS capabilities including SLAs for uptime, performance, latency, data retention, issue escalation and support priorities, issue resolution, problem management, deployment process, patch management.
API Services: Provide both fine-grained and coarse-grained access to all service layers of the CIS Analytics Platform. In the hybrid cloud model it must support interoperability across platform service providers and promote the cloud concepts of extensibility and flexibility.
AWS to CIS Migration – Success Criteria
Successful synthesis of customer interaction data
Successful automation of the end-to-end data process pipeline
Build behavioral insight services
Access to data and services via data discovery and visualization tools
Meet the performance, scale and platform stability requirements
Successful deployment of CiscoDV on CIS
Connect HDFS and Hive DS with CiscoDV via Hive and Impala
Build and expose insight services for consumption by limited users
AWS and CIS Data Node Sizing Comparison: Hadoop Cluster for Batch and Query Analytics
Sizing  Node Service            Instance Type  vCPU  Mem  Storage  Number of Data Nodes  Comments
AWS     Data Nodes/Node Master  m3.2xlarge     8     30   2x80 GB  30                    Each Hadoop data node has 1500GB of EBS available for HDFS storage
CCS     Data Nodes/Node Master  GP-2XLarge     8     32   50       35                    Each Hadoop data node has 1500GB of EBS available for HDFS storage; less than AWS sizing (storage)
Pilot Test Data
• Test performed on one day’s production data
• Total no. of records processed: 110,852,667
• Total data size: 32GB
• Total no. of M/R jobs in the data pipeline: 17
• Two test cycles
  • Cycle 1: Heterogeneous CCS nodes (vCPUs, storage, memory)
  • Cycle 2: Homogeneous CCS nodes
CIS Performance of Batch Analytics – Limited Test
Test Details by M/R job
Job Name                     Cycle 1 (CCS nodes)   Cycle 2 (CCS nodes)
                               12   18   24   30     18   24   30   35
New_cleanse                   249  176  143  117     82   67   55   51
Process_private_ip             27   14   11   10      7    5    6    6
join_web_and_ip_data          142   95   76   61     49   40   34   29
combine_ip_decorated_files     26   14   11   10      9    7    8    7
filterBotEntries               34   19   15   13     10    8    7    7
sessionize                     71   64   69   62     60   63   15   13
firstActivitiesFilter          26   15   13   10      9    8    6    6
allOtherActivitiesFilter       29   18   13   13     11    9    7    6
matchFirstActivities           21   13   11   13     13   11    8    8
buildActivities                27   15   12   10      7    6    9    9
filterBUG                       8    5    3    2      3    3    4    4
filterSEA                       8    5    3    2      3    3    4    4
filterTCO                       8    5    3    2      3    3    4    4
filterTDV                       8    5    3    2      3    3    4    4
filterWDV                       8    5    3    2      3    3    4    4
filterMOD                       8    5    3    2      3    3    4    4
filterTOOL                      8    5    3    2      3    3    4    4
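The scaling behaviour in the per-job table above can be checked with a little arithmetic. The sketch below copies the cycle 1 figures for two jobs from that table and computes speedup and parallel efficiency relative to the 12-node run. The slide does not state the time unit, so the values are treated as plain durations; the function and variable names are illustrative, not part of the original pipeline.

```python
# Cycle 1 durations copied from the table above (time unit not stated on the slide).
CYCLE1_NODES = [12, 18, 24, 30]
CYCLE1_DURATIONS = {
    "New_cleanse": [249, 176, 143, 117],
    "sessionize": [71, 64, 69, 62],
}


def scaling(nodes, durations):
    """Return (node_count, speedup, efficiency) tuples relative to the first run."""
    base_nodes, base_time = nodes[0], durations[0]
    rows = []
    for n, t in zip(nodes, durations):
        speedup = base_time / t
        # Efficiency of 1.0 would mean the duration shrinks in proportion to added nodes.
        efficiency = speedup / (n / base_nodes)
        rows.append((n, round(speedup, 2), round(efficiency, 2)))
    return rows


if __name__ == "__main__":
    for job, durations in CYCLE1_DURATIONS.items():
        print(job, scaling(CYCLE1_NODES, durations))
```

On these figures New_cleanse keeps most of its efficiency as nodes are added, while sessionize barely improves across cycle 1.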
PoC: Analytics with Spark on CIS
Existing code
• Made in Ruby with Wukong to run on Hadoop
• A history of changes and modifications
• Script-based, steps communicate via intermediary files
Goal
• Revise, rethink and reimplement with Spark on CIS
• Open for advanced cloud analytics
• Improve maintainability by moving away from aging Ruby on Hadoop
Ruby Flow
[Pipeline diagram. Sections: Cleanse, Sessionize. Box labels: logs, cleanse, private, web, decorate, cleansed, sessionize (cookie, time), sessioned, first, others, bots, onlyBots, match 1st (IP, UA, time), build actions, merge session, PSV, add to hive, bug, tool, 1..7. Annotations: "Main computation happens here"; "global sort in time, global group by IP".]
Pipeline steps
• Pre-process log records (‘cleanse’)
• Extract HTTP sessions (‘sessionize’)
• Extract user actions, such as ‘search’, ‘download patch’, ‘open manual’, ‘open a bug’
Ruby: Scripts with temp files
• Each box on the figure is a script in a separate file
• They pipe gigabytes of data as input and output
• Random matching of nodes to data for sessionizing
• Lots of redundant shuffling
Spark Flow
[Same pipeline diagram as the Ruby flow. Annotation: "global partition by IP, local sort in time".]
Same flow, but each box is a Java or Scala function
No intermediate temp files
• Steps are chained by Spark, often without any need for intermediate data
• If still needed, the data is stored in memory and local disk as much as possible
Local computation
• Cleansing is computed on nodes local to data blocks (same as Ruby)
• Sessions are built per IP
  • On separate nodes, each handling a single IP range
  • Once copied to the node on partitioning, the data remains local
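The "partition by IP, then sort locally in time" idea can be sketched without a cluster. The plain-Python example below is only an illustration of the technique, not the team's Java/Scala code: the record layout `(ip, timestamp, url)`, the CRC-based partitioner, and the 30-minute session gap are all assumptions, since the slides do not state the actual session rule. Spark provides operations of this kind (for example `repartitionAndSortWithinPartitions` on pair RDDs), but the slides do not say which calls were used.

```python
import zlib
from collections import defaultdict

# Assumption: a gap of more than 30 minutes starts a new session.
# The slides do not state the real session rule.
SESSION_GAP_SECONDS = 30 * 60


def partition_by_ip(records, num_partitions):
    """Route every record of a given IP to the same partition."""
    partitions = defaultdict(list)
    for record in records:
        ip = record[0]
        partitions[zlib.crc32(ip.encode()) % num_partitions].append(record)
    return partitions


def sessionize_partition(records, gap=SESSION_GAP_SECONDS):
    """Sort one partition locally by (ip, time) and split it into sessions."""
    sessions, current = [], []
    for record in sorted(records, key=lambda r: (r[0], r[1])):
        if current:
            prev_ip, prev_time = current[-1][0], current[-1][1]
            if record[0] != prev_ip or record[1] - prev_time > gap:
                sessions.append(current)
                current = []
        current.append(record)
    if current:
        sessions.append(current)
    return sessions


def sessionize(records, num_partitions=4):
    """Partition once by IP, then build sessions independently per partition."""
    sessions = []
    for part in partition_by_ip(records, num_partitions).values():
        sessions.extend(sessionize_partition(part))
    return sessions


if __name__ == "__main__":
    logs = [
        ("10.0.0.1", 0, "/search"),
        ("10.0.0.2", 10, "/download"),
        ("10.0.0.1", 60, "/bug"),
        ("10.0.0.1", 4000, "/manual"),
    ]
    for session in sessionize(logs):
        print(session)
```

Because all records of an IP land in one partition, each partition can be sorted and sessionized on its own, with no cluster-wide sort.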
Runtime comparison
Volumes
• Logs of a single day: 52 GB
• Total of 110 mil records
• Where 53 mil records are kept after pre-filtering
• Producing over 1 mil user actions
• Cluster of 30 nodes
Ruby Runtime: 140 min
Spark Runtime: 7 min (20 times faster)
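The stated speedup follows directly from the two runtimes. As a quick check, the sketch below recomputes it and derives an approximate records-per-second rate from the "110 mil records" figure; the rounded record count and the helper name are assumptions made for illustration.

```python
RECORDS = 110_000_000  # "110 mil records" from the slide, rounded
RUBY_MINUTES = 140
SPARK_MINUTES = 7


def records_per_second(records, minutes):
    """Average throughput over the whole run."""
    return records / (minutes * 60)


speedup = RUBY_MINUTES / SPARK_MINUTES

if __name__ == "__main__":
    print("speedup:", speedup)
    print("Ruby records/s:", round(records_per_second(RECORDS, RUBY_MINUTES)))
    print("Spark records/s:", round(records_per_second(RECORDS, SPARK_MINUTES)))
```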
Why?
Extracting sessions means sort in time and group by IP
• Ruby: sorting in time and per-IP grouping is performed across the whole cluster (very bad, lots of IO)
• Spark is good at dealing with partitions:
  • per-IP groups are placed on different machines (partitions)
  • global sort in time is replaced by many local per-IP sorts done on machines responsible for extracting sessions for specific groups of IP addresses
Other improvements
• Avoid redundant temp files, redundant (de)-serialization of objects (comes with Java/Scala)
• Stages keep data in memory when possible (comes with Spark)
• Cache results of user agent resolution that are heavy on regular expressions
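Caching user agent resolution pays off because the same user agent strings recur across many log records. The sketch below shows one way to memoize a regex-heavy lookup using Python's `functools.lru_cache`; the patterns, labels, and function name are hypothetical, as the slides do not show the actual rules or the caching mechanism used in the Java/Scala code.

```python
import re
from functools import lru_cache

# Hypothetical patterns for illustration; the real rule set is not shown in the slides.
_UA_PATTERNS = [
    ("bot", re.compile(r"bot|crawler|spider", re.IGNORECASE)),
    ("firefox", re.compile(r"Firefox/\d+")),
    ("chrome", re.compile(r"Chrome/\d+")),
]


@lru_cache(maxsize=100_000)
def resolve_user_agent(user_agent):
    """Classify a user agent string; repeated strings are served from the cache."""
    for label, pattern in _UA_PATTERNS:
        if pattern.search(user_agent):
            return label
    return "other"


if __name__ == "__main__":
    agents = ["Mozilla/5.0 Firefox/38.0", "Googlebot/2.1", "Mozilla/5.0 Firefox/38.0"]
    print([resolve_user_agent(a) for a in agents])
    print(resolve_user_agent.cache_info())
```

The regular expressions run only once per distinct user agent string; later occurrences cost a dictionary lookup.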
CiscoDV on CIS
Data Virtualization for Intercloud Analytics
Customer Benefits
• Discover data beyond the enterprise: Virtual integration that combines traditional enterprise data, Big Data stores on CIS and AWS, cloud data from SaaS providers, and Cisco customers and partners
• Seamless interoperability offers easy access to data across distributed data sources in the intercloud analytics platform
• Universal data governance maximizes enforcement of data security rules
• Analytics Data Hubs: Deployment flexibility to build hybrid/virtual sandboxes that enable nimble data discovery and rapid data analytics to support multiple LOBs
• Deliver data to any number of analytics tools.
Use Case 1: Get Case Interactions
Use Case Description: Number of cases opened by company X that are currently open (other variations would include cases by company, trends, etc.).
CiscoDV Value: CiscoDV enforces data security rules to restrict access on the intercloud platform to customer sensitive data.
Data Sources: SalesForce
Intercloud Solution: CIS CiscoDV service can access the “sanitized” version of CSOne data through JDBC from RIDES (SWTG CiscoDV) API.
Connection Type: DV on hybrid cloud Enterprise data store
Use Case 2: Get Customer Journey
Use Case Description: Customer interactions on the web pertaining to bug search and case submission process. Foundational data can be used to explore trends and feed into content recommendation models.
CiscoDV Value: Direct access to data on CIS Intercloud Analytics Platform
Data Sources: SAS Analytics
Intercloud Solution: By direct network access to the Impala Server, the CIS CiscoDV server connects to the Impala Service in Hadoop, also on CIS, as a Data Source. SQL queries configured in CiscoDV execute Impala queries.
Connection Type: DV on hybrid cloud VPC Big Data platform
Use Case 3: Get Bug Interactions
Use Case Description: Another foundational data service that provides a breakdown of customer exposure or interest in bugs. The service can be refined further to look at trends specific to a company or a product for further analytics.
CiscoDV Value: Real-time data federation that accesses extremely large data in CIS Intercloud Analytics platform and joins that with Bug Data accessed via departmental CiscoDV instance (RIDES)
Data Sources: SASA Analytics and QDDTS via RIDES
Intercloud Solution: By building on the access to the Impala Server, the DV server can join the Bug Data from the Enterprise Data Stores with the HDFS data to provide a federated view.
Connection Type: DV on hybrid cloud VPC Big Data platform and Enterprise data store
CiscoDV on Intercloud Analytics Platform (CIS)
Scenario 1: CIS CiscoDV to Cisco Enterprise Data Store
Scenario 2: CIS CiscoDV to Impala and Hive on CIS Intercloud Analytics Platform
Scenario 3: CIS CiscoDV to Hive on AWS Big Data Cluster