ENTERPRISE PERFORMANCE
ASSURANCE BASED ON BIG
DATA ANALYTICS
St Louis CMG
Boris Zibitsker, PhD www.beznext.com
© 2016
Enterprise Performance Assurance Based on Big Data Analytics
www.BezNext.com
| Optimizing
Business and IT
All Rights Reserved
Abstract
Today’s fast-paced businesses have to make business decisions in real time. That
creates pressure on IT leaders to develop Big Data infrastructure and applications
capable of processing large volumes of data from different sources, applying advanced
analytics and presenting recommendations in real time. These applications work in
distributed, multi-tier, virtualized, parallel processing environments where each cluster
and system has its own performance management tools and dedicated performance
repositories. This creates many obstacles for Application Performance Management and
Capacity Management of Big Data environments.
In this presentation we will discuss the challenges of creating an Enterprise Performance
Assurance platform that processes streams of measurement data in memory using Kafka
and Spark and stores filtered and aggregated data in a Data Lake. We will review case
studies illustrating the application of Big Data advanced analytics, including descriptive,
diagnostic, predictive and prescriptive analytics, to organizing Enterprise Performance
Assurance processes.
Outline
• Problem
• Performance Assurance for Big Data World
• Data Collection
• Role of Data Lake
• Data Aggregation and Transformation
• Application of Advanced Analytics for:
• Workload Characterization
• Identification of Seasonal Peaks
• Workload Forecasting
• Performance Prediction
• Performance Management
• Workload Management
• Dynamic Capacity Planning
PROBLEM
Challenges
• Each line of business uses different applications and systems and has different SLGs
• Real time decisions
• Business transactions often access several systems
• Complexity
• Cost
• Growth
Business Transactions Are Often Processed by Different Systems
• Clouds
• Data Centers
• Systems
• Hardware
• Software
• Subsystems
• Workloads
• Applications
• Data
• Networks
• Workload SLGs
Risk of Performance Surprises
• Software performance engineering
  • POC – feasibility study, selection of the platform and infrastructure
  • New application design, development, testing and implementation
  • Application modification
• Dynamic capacity management
  • Workload management – change of priorities, concurrency and resource allocation
  • Performance management – change of OS and software subsystem parameters, application tuning
• Capacity planning
  • Hardware upgrade
  • Moving workload from one platform to another
  • Software upgrade
SOLUTION
Applying Big Data Infrastructure and Advanced Analytics
• Data Collection
  • Streaming performance measurement data from different systems, subsystems and workloads
  • In-memory processing using Kafka and Spark to enable dynamic capacity management
  • Storing aggregated data in the Data Lake
  • Using Reservoirs for specific applications
• Applying machine learning algorithms for
  • Workload characterization using Descriptive Analytics
  • Workload forecasting and seasonal peaks determination
  • Determining anomalies and their root causes
  • Tuning of OS/Linux, software subsystems and applications
  • Performance prediction
  • Workload management - Priorities and Concurrency
  • Development of recommendations – Prescriptive Analytics
Multi-Criteria Optimization Problem
• Major Criteria
  • Response Time, Throughput and Cost
• Major Variables
  • CPU time, I/O, Memory and Network demand by workload
  • Hardware and software configuration
  • Availability - frequency of errors by workloads and applications and frequency of hardware outages
  • Power consumption
• Plan
  • Workload and volume of data growth
  • New applications implementation
  • Seasonal peaks
  • Moving workloads between systems
• Options
  • Hardware, software and virtual configuration
  • Application tuning
• Major Limitations
  • Budget
  • SLGs
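A minimal sketch of how the criteria and limitations above could be combined, assuming each option already has a predicted cost, response time and throughput. The `Option` class, the `choose_option` function and all numbers are hypothetical illustrations, not the presenter's method: options that violate the budget or the SLGs are discarded, and the cheapest remaining one is chosen.

```python
from dataclasses import dataclass


@dataclass
class Option:
    name: str
    cost: float           # total cost of the option (arbitrary currency units)
    response_time: float  # predicted response time, seconds
    throughput: float     # predicted throughput, requests per hour


def choose_option(options, budget, slg_response_time, min_throughput):
    """Return the cheapest option that satisfies budget and SLGs, or None."""
    feasible = [
        o for o in options
        if o.cost <= budget
        and o.response_time <= slg_response_time
        and o.throughput >= min_throughput
    ]
    # Prefer lower cost; break ties by lower response time.
    return min(feasible, key=lambda o: (o.cost, o.response_time), default=None)


# Hypothetical options with made-up predictions.
options = [
    Option("tune application", cost=20.0, response_time=4.5, throughput=900.0),
    Option("add nodes", cost=120.0, response_time=2.0, throughput=1500.0),
    Option("move workload", cost=60.0, response_time=2.8, throughput=1200.0),
]

best = choose_option(options, budget=100.0, slg_response_time=3.0, min_throughput=1000.0)
print(best.name if best else "no feasible option")
```

Treating budget and SLGs as hard constraints and cost as the objective is only one way to handle several criteria; a weighted score would be another.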
Performance Assurance Technology and Process
Technology
• Big Data Infrastructure
  • Data streaming
  • In memory processing
  • Data storing - Big Data Lakes
  • Workload aggregation and characterization across all systems
  • Electrical power consumption management
• Advanced Analytics
  • Descriptive, Diagnostic, Predictive, Prescriptive and Control
  • Applications, Data and Systems life cycle
  • Recommender
  • Automation
Process
• Software Performance Engineering
  • Design and development of new applications for performance
  • Predicting new application implementation impact
• Dynamic Capacity Management
  • Performance Management
  • Workload Management
  • Long term and short term capacity planning
[Figure: from Data through Analytics and Human Input to Decision and Action, ranging from Decision Support to Decision Automation. Descriptive: What happened? Diagnostic: Why did it happen? Predictive: What will happen? Prescriptive: What should I do?]
• Big Data advanced analytics makes it possible to implement self-healing, self-adapting systems based on predictive and prescriptive analytics and to automate Infrastructure & Operations Management. Source: Gartner Group
Open Source Solutions
• AT&T
  • Open Source Solution
  • ECOMP (Enhanced Control, Orchestration, Management & Policy) Architecture White Paper
  • http://about.att.com/content/dam/snrdocs/ecomp.pdf
• Oracle:
  • Accelerating EPM Deployment With Planning in the Cloud
  • https://go.oracle.com/LP=13101?elqCampaignId=22192&src1=ad:pas:go:dg:epm&src2=wwmk15047923mpp013&SC=sckw=WWMK15047923MPP013&mkwid=shwlMGmQb|pcrid|75197069299|pkw|enterprise%20performance|pmt|p|pdv|c|sckw=srch:enterprise%20performance
• TeamQuest
  • CMG webinar on July 20th
DATA COLLECTION
WORKLOAD AGGREGATION
AND CHARACTERIZATION
Workload Aggregation and Characterization Process
[Figure: Agents collect data from Big Data Clusters, Teradata/Oracle/DB2 EDW and other IT platforms. The data passes through Data Transformation into the Data Lake, then through Workload Aggregation & Characterization, and feeds Software Performance Engineering, Dynamic Capacity Management and Capacity Planning.]
Each Workload on Each System has Unique Performance, Resource Utilization and Data Usage Profiles
Workload Characterization by System
• Building workloads’ profiles
• Performance
• Resource utilization
• Data usage
• Results are used as input for:
• Workload forecasting
• Performance management
• Workload management
• Capacity planning
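As an illustration of building workload profiles by system, the sketch below aggregates raw request measurements into one profile per (system, workload) pair. The record fields (`resp_time`, `cpu_time`, `io_ops`), the `build_profiles` function and the sample values are assumptions made for this example; a production pipeline would read the records from the Data Lake.

```python
from collections import defaultdict


def build_profiles(records):
    """Aggregate raw measurements into one profile per (system, workload)."""
    totals = defaultdict(
        lambda: {"requests": 0, "resp_time": 0.0, "cpu_time": 0.0, "io_ops": 0.0}
    )
    for r in records:
        t = totals[(r["system"], r["workload"])]
        t["requests"] += 1
        t["resp_time"] += r["resp_time"]
        t["cpu_time"] += r["cpu_time"]
        t["io_ops"] += r["io_ops"]

    profiles = {}
    for key, t in totals.items():
        n = t["requests"]
        profiles[key] = {
            "requests": n,                          # throughput for the interval
            "mean_resp_time": t["resp_time"] / n,   # performance
            "mean_cpu_time": t["cpu_time"] / n,     # resource utilization
            "mean_io_ops": t["io_ops"] / n,         # data usage proxy
        }
    return profiles


# Hypothetical measurements for one collection interval.
records = [
    {"system": "EDW", "workload": "Sales", "resp_time": 2.0, "cpu_time": 0.5, "io_ops": 100},
    {"system": "EDW", "workload": "Sales", "resp_time": 4.0, "cpu_time": 1.5, "io_ops": 300},
    {"system": "Hadoop", "workload": "Marketing", "resp_time": 10.0, "cpu_time": 4.0, "io_ops": 1000},
]
profiles = build_profiles(records)
print(profiles[("EDW", "Sales")])
```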
Applying ML Algorithms for Seasonal Peaks
Determination
WORKLOADNAME  PARAMETER     PERIOD  MEANDURATION  MEANAMPLITUDE  STD     MIN     MAX      95PERCENTILE  PEAKRANGE  PEAKLENGTHSTD
ACT           TOTALCPUTIME  1       2             739970         369825  27034   1333714  1206858       18-19      0.5
APPLDEV       TOTALCPUTIME  1       5             77480          16444   2       251513   17875         14-18      1.581
CAT           TOTALCPUTIME  1       3             492061         242717  40725   1350987  846437        19-21      1
CLIENTRPTG    TOTALCPUTIME  1       1             667757         201179  3815    1015400  719991        15-15      0
LOAD          TOTALCPUTIME  1       1             1353761        627225  836171  3179082  2903503       15-17      0
FIN           TOTALCPUTIME  1       1             332444         231195  17520   1050619  846173        16-16      0
FRAUD         TOTALCPUTIME  1       7             1060           201     38      859      611           16-22      1.972
HELPDESK      TOTALCPUTIME  1       1             24408          40362   2       150933   150933        18-18      0
HR            TOTALCPUTIME  1       3             507048         166211  607     822801   665044        15-17      1
IT            TOTALCPUTIME  1       2             4688           1805    15      5180     5180          18-19      0.5
MKT           TOTALCPUTIME  1       2             151394         189874  261     679415   592849        18-19      0.5
DBA           TOTALCPUTIME  7       1             685778         236532  54058   868273   838730        125-125    1.213
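The slides do not state which algorithm produces the peak ranges in the table. One simple way to derive a daily peak range such as "18-19" is sketched below under that caveat: average each hour of the day across several days, find the hour with the highest mean, and extend the range to neighbouring hours that stay above a fraction of that maximum. The function name, the 0.8 fraction and the sample data are assumptions.

```python
def daily_peak_range(days, fraction=0.8):
    """days: list of 24-element lists of hourly values (e.g. total CPU time).

    Returns (start_hour, end_hour) of the contiguous range around the busiest
    hour in which the mean hourly value stays at or above fraction * maximum.
    """
    n = len(days)
    hourly_mean = [sum(day[h] for day in days) / n for h in range(24)]
    peak_hour = max(range(24), key=lambda h: hourly_mean[h])
    threshold = fraction * hourly_mean[peak_hour]

    start = end = peak_hour
    while start > 0 and hourly_mean[start - 1] >= threshold:
        start -= 1
    while end < 23 and hourly_mean[end + 1] >= threshold:
        end += 1
    return start, end


def make_day(peaks, base=100.0):
    """Build a synthetic day: a flat baseline with a few elevated hours."""
    day = [base] * 24
    for hour, value in peaks.items():
        day[hour] = value
    return day


# Two synthetic days with elevated load at 18:00 and 19:00.
days = [make_day({18: 1000.0, 19: 900.0}), make_day({18: 1200.0, 19: 1000.0})]
start, end = daily_peak_range(days)
print(f"{start}-{end}")  # 18-19
```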
CPU Utilization by Business Workloads During
Seasonal Peak
[Figure: CPU utilization by business workload (Sales, Marketing) during the seasonal peak]
PERFORMANCE MANAGEMENT
Diagnostics and Root Cause Analysis
Anomaly Detection
• Short term performance prediction uses a linear regression model
• If the measured value for an hour is significantly greater than the predicted value, it is flagged as an anomaly
Date Time        Workload Name  Parameter Name  Root Cause                                    Real Value  Expected Value
1/13/2016 21:00  HR-BATCH       MEANRESPTIME    -                                             200518.75   76990.13
1/14/2016 9:00   HR-BATCH       MEANRESPTIME    HR-BATCH MEANIOOPS                            759813      155484.41
1/14/2016 10:00  HR-BATCH       MEANRESPTIME    HR-BATCH MEANIOOPS | HR-BATCH TOTALEXECCOUNT  1390956.67  414569.87
1/15/2016 0:00   HR-BATCH       MEANRESPTIME    HR-BATCH MEANCPUTIME | HR-BATCH MEANIOOPS     1194485     176682.57
1/15/2016 9:00   HR-BATCH       MEANRESPTIME    -                                             175011.82   53677.7
1/15/2016 10:00  HR-BATCH       MEANRESPTIME    HR-BATCH MEANCPUTIME | HR-BATCH MEANIOOPS     364970.63   35
1/16/2016 10:00  HR-BATCH       MEANRESPTIME    -                                             180744.17   82025.79
1/17/2016 3:00   HR-BATCH       MEANRESPTIME    HR-BATCH MEANIOOPS | HR-BATCH TOTALEXECCOUNT  1021060     189942.42
1/17/2016 10:00  HR-BATCH       MEANRESPTIME    HR-BATCH TOTALEXECCOUNT                       101169.3    40438.32
1/17/2016 16:00  HR-BATCH       MEANRESPTIME    HR-BATCH MEANIOOPS                            858115      172927.81
1/17/2016 19:00  HR-BATCH       MEANRESPTIME    -                                             649193.33   239180.25
1/18/2016 2:00   HR-BATCH       MEANRESPTIME    HR-BATCH MEANIOOPS                            544447.38   167940.06
1/18/2016 6:00   HR-BATCH       MEANRESPTIME    HR-BATCH MEANIOOPS                            1017892.31  225624.01
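The anomaly rule (a measurement significantly greater than the linear regression prediction) could be sketched as follows. The slides do not define "significantly"; this example assumes a threshold of k residual standard deviations above the predicted value, and the function names and history values are hypothetical.

```python
import math


def fit_line(ys):
    """Ordinary least squares fit of y = a + b*x for x = 0, 1, ..., n-1."""
    n = len(ys)
    x_mean = (n - 1) / 2
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in range(n))
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(ys))
    b = sxy / sxx
    a = y_mean - b * x_mean
    return a, b


def is_anomaly(history, actual, k=3.0):
    """Return (flag, expected) for the hour following `history`.

    history needs at least three hourly values. The measurement is flagged
    when it exceeds the predicted value by more than k standard deviations
    of the regression residuals (an assumed definition of "significantly").
    """
    a, b = fit_line(history)
    n = len(history)
    residuals = [y - (a + b * x) for x, y in enumerate(history)]
    sigma = math.sqrt(sum(r * r for r in residuals) / (n - 2))
    expected = a + b * n  # prediction for the next hour
    return actual > expected + k * sigma, expected


# Hypothetical hourly mean response times with a mild upward trend.
history = [100.0, 104.0, 98.0, 106.0, 102.0, 108.0, 104.0, 110.0]
flag, expected = is_anomaly(history, actual=300.0)
print(flag, round(expected, 2))
```

The one-sided test mirrors the slide: only values above the prediction are flagged, so unusually fast hours are not reported.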
Anomaly Detection
• Determining significant changes in response time (RT), throughput and resource utilization
Root Cause Determination
• For a workload with a response time anomaly, check whether throughput, CPU time and the number of I/O operations also show an anomaly, and find the users and programs responsible for it
• Check whether other workloads, users and programs had a throughput, CPU time or I/O operations anomaly at the same time
DateTime         WorkloadCause                ProgramName     UserName
1/14/2016 9:00   CIIOUT-BATCH MEANIOOPS       None            CII_SL_NO_DSHBRD_OUT
1/14/2016 10:00  CIIOUT-BATCH MEANIOOPS       JOBSERVERCHILD  CII_SL_NO_DSHBRD_OUT
1/14/2016 10:00  CIIOUT-BATCH TOTALEXECCOUNT  JOBSERVERCHILD  CII_SL_NO_DSHBRD_OUT
1/15/2016 0:00   CIIOUT-BATCH MEANCPUTIME     JOBSERVERCHILD  CII_SL_ALL_RPT_OUT
1/15/2016 0:00   CIIOUT-BATCH MEANIOOPS       JOBSERVERCHILD  CII_SL_ALL_RPT_OUT
1/15/2016 10:00  CIIOUT-BATCH MEANCPUTIME     CRPROC          CII_SL_ALL_RPT_OUT
1/15/2016 10:00  CIIOUT-BATCH MEANIOOPS       CRPROC          CII_SL_ALL_RPT_OUT
1/17/2016 3:00   CIIOUT-BATCH MEANIOOPS       JOBSERVERCHILD  CII_SL_NO_DSHBRD_OUT
1/17/2016 3:00   CIIOUT-BATCH TOTALEXECCOUNT  JOBSERVERCHILD  CII_SL_NO_DSHBRD_OUT
1/17/2016 10:00  CIIOUT-BATCH TOTALEXECCOUNT  JOBSERVERCHILD  CII_SL_NO_DSHBRD_OUT
1/17/2016 16:00  CIIOUT-BATCH MEANIOOPS       None            CII_SL_NO_DSHBRD_OUT
1/18/2016 2:00   CIIOUT-BATCH MEANIOOPS       JOBSERVERCHILD  CII_SL_NO_DSHBRD_OUT
1/18/2016 6:00   CIIOUT-BATCH MEANIOOPS       None            CII_SL_ALL_RPT_OUT
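The root cause check described on this slide can be read as a join on the hour: for each response time anomaly, list every other metric (of the same or another workload) that was also anomalous at that time. The sketch below assumes anomalies are already available as (timestamp, workload, metric) tuples; the function name and the ISO timestamp format are illustrative choices, and the sample tuples reuse values from the anomaly table.

```python
def root_cause_candidates(anomalies, target_metric="MEANRESPTIME"):
    """Map each (timestamp, workload) with a target-metric anomaly to the
    other (workload, metric) pairs that were anomalous in the same hour."""
    result = {}
    for ts, workload, metric in sorted(anomalies):
        if metric != target_metric:
            continue
        causes = sorted(
            (w, m) for t, w, m in anomalies if t == ts and m != target_metric
        )
        result[(ts, workload)] = causes
    return result


# Anomalies flagged per hour, workload and metric.
anomalies = {
    ("2016-01-13 21:00", "HR-BATCH", "MEANRESPTIME"),
    ("2016-01-14 09:00", "HR-BATCH", "MEANRESPTIME"),
    ("2016-01-14 09:00", "HR-BATCH", "MEANIOOPS"),
    ("2016-01-14 10:00", "HR-BATCH", "MEANRESPTIME"),
    ("2016-01-14 10:00", "HR-BATCH", "MEANIOOPS"),
    ("2016-01-14 10:00", "HR-BATCH", "TOTALEXECCOUNT"),
}

for key, causes in root_cause_candidates(anomalies).items():
    print(key, causes or "no candidate found")
```

An empty candidate list corresponds to the rows in the table whose root cause column is blank.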
Root Cause Determination Algorithms
• Decision Trees
• Logistic Regression Analysis
• Predictive analytics
[Figure: decision tree. The leaf nodes and the branches leading to them identify the root cause.]
DYNAMIC CAPACITY
MANAGEMENT
Short Term Prediction
Variable       Learning Interval (hours)  Prediction Interval (hours)
Response Time  720 - 4320                 1 - 72
Throughput     720 - 4320                 1 - 72
CPU Time       720 - 4320                 1 - 72
# I/Os         720 - 4320                 1 - 72
Workload Management
Priorities
Concurrency
Resource Allocation
Examples of Workload Management by Queues
• By Organization 100%
  • Marketing 33%
  • Finance 33%
  • Sales 33%
• By Type of Workload 100%
  • Near Real Time 70%
  • Batch 30%
• Hybrid 100%
  • Marketing 20%
    • Batch 15%
    • Real-Time 5%
  • Finance 40%
    • Real-Time 10%
    • Batch 30%
  • Sales – Batch 40%
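A queue hierarchy like the ones above is easy to get wrong when shares are edited by hand. The sketch below checks that the shares of the child queues add up to the share of their parent. It assumes the percentages on the slide are shares of the whole cluster; the tuple layout and the `check_shares` function are hypothetical.

```python
def check_shares(node, tolerance=0):
    """node is (name, share_percent, children).

    Returns the names of queues whose children's shares do not add up to the
    queue's own share.
    """
    name, share, children = node
    if not children:
        return []
    problems = []
    if abs(sum(child[1] for child in children) - share) > tolerance:
        problems.append(name)
    for child in children:
        problems.extend(check_shares(child, tolerance))
    return problems


hybrid = ("Hybrid", 100, [
    ("Marketing", 20, [("Batch", 15, []), ("Real-Time", 5, [])]),
    ("Finance", 40, [("Real-Time", 10, []), ("Batch", 30, [])]),
    ("Sales", 40, [("Batch", 40, [])]),
])
by_organization = ("By Organization", 100, [
    ("Marketing", 33, []), ("Finance", 33, []), ("Sales", 33, []),
])

print(check_shares(hybrid))           # []
print(check_shares(by_organization))  # ['By Organization']
```

The "By Organization" example is reported because three shares of 33% add up to 99%, a rounding of one third each; passing `tolerance=1` accepts it.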
Examples of Resource Manager Scheduler
• FIFO Scheduler
  • Processes jobs in order of submission
• Capacity Scheduler (Default)
  • Queue shares as a percentage of cluster resources
  • FIFO scheduling within each queue
  • Supports preemption
• Fair Scheduler
  • Fair to all users
Example of Capacity Scheduler
Set Limits on Capacity
• Minimum capacity for the queue
• Maximum capacity (% of cluster resources) for a queue
• Resource elasticity: a queue can use capacity that other queues are not using
• Minimum User Limit – user sharing for a given queue
• User Limit Factor – maximum queue capacity that one user can take up
• Application Limit – maximum number of applications submitted to one queue
[Figure: guaranteed resources per queue: Queue 1 (Marketing) 30%, Queue 2 (Finance) 50%, Queue 3 (Sales) 20%]
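How the capacity limits interact can be illustrated with a simplified model: a single user can grow up to the queue's guaranteed capacity multiplied by the user limit factor, but never beyond the queue's maximum capacity. This is a sketch of the idea only, not the scheduler's actual allocation algorithm. The guaranteed shares come from the slide; the maximum capacities and user limit factors are made-up values.

```python
def single_user_cap(capacity, max_capacity, user_limit_factor):
    """Largest cluster share (percent) one user can hold in a queue under
    this simplified model."""
    return min(capacity * user_limit_factor, max_capacity)


def elastic_headroom(capacity, max_capacity):
    """Extra cluster share (percent) a queue may borrow when others are idle."""
    return max_capacity - capacity


# capacity: guaranteed share from the slide; other values are hypothetical.
queues = {
    "Marketing": {"capacity": 30.0, "max_capacity": 50.0, "user_limit_factor": 2.0},
    "Finance": {"capacity": 50.0, "max_capacity": 70.0, "user_limit_factor": 1.0},
    "Sales": {"capacity": 20.0, "max_capacity": 20.0, "user_limit_factor": 1.0},
}

# Guaranteed capacities of sibling queues should cover the whole cluster.
total_guaranteed = sum(q["capacity"] for q in queues.values())

for name, q in queues.items():
    cap = single_user_cap(q["capacity"], q["max_capacity"], q["user_limit_factor"])
    extra = elastic_headroom(q["capacity"], q["max_capacity"])
    print(f"{name}: one user up to {cap}%, elastic headroom {extra}%")
```

With these numbers a single Marketing user is capped by the queue maximum (50%) rather than by the user limit factor (60%), while Sales has no elasticity at all.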
Example of Predicting Workload Concurrency Change Impact
Example of Predicting Workload Priority Change Impact
CAPACITY PLANNING
Input and Output for Performance Prediction, Dynamic Capacity Management and Capacity Planning
Long Term Prediction: Predicting Workload Growth Impact
• Apply Predictive Analytics
  • Predict the impact of expected workload and volume of data growth
  • Predict how a new application will perform on the production system
  • Individual Hadoop Clusters vs Data Lake
  • Determine how a planned hardware upgrade will affect the performance of the individual workloads
• Apply Prescriptive Analytics
  • Evaluate options
  • Justify what should be done proactively to meet SLGs
  • Set realistic expectations
[Chart callouts: predict the impact of workload and volume of data growth; determine when workload SLGs will not be met]
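To show how growth translates into a missed SLG, the sketch below uses the simplest queueing model, a single-server M/M/1 queue with response time R = S / (1 - U) and utilization U = arrival rate x service time. The slides name queueing theory as a technique but not this specific model, so treat it as an assumption; the function names and all numbers are hypothetical.

```python
def response_time(arrival_rate, service_time):
    """M/M/1 mean response time; infinite once the server is saturated."""
    utilization = arrival_rate * service_time
    if utilization >= 1.0:
        return float("inf")
    return service_time / (1.0 - utilization)


def months_until_slg_missed(arrival_rate, service_time, monthly_growth, slg, horizon=60):
    """First month (0 = now) in which predicted response time exceeds the SLG,
    or None if the SLG holds for the whole horizon."""
    for month in range(horizon + 1):
        rate = arrival_rate * (1.0 + monthly_growth) ** month
        if response_time(rate, service_time) > slg:
            return month
    return None


# Hypothetical workload: 5 requests/s, 0.1 s service time, 5% growth per month,
# SLG of 0.5 s mean response time.
month = months_until_slg_missed(
    arrival_rate=5.0, service_time=0.1, monthly_growth=0.05, slg=0.5
)
print(month)  # 10
```

The response time doubles long before the server is fully utilized, which is why a linear extrapolation of utilization alone tends to predict the problem too late.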
Predicting New Application Implementation Impact
• Data Collection
• Workload Characterization
• Workload Forecasting
• Modeling Test and Production Systems
• Predicting New Application Implementation Impact
• Verification
[Figure: capacity demand grows with a new application, more data and more users. Labels: Test (GB), Production (PB); current applications: Sales, Mkt, HR, ERP, ETL; New]
Predicting Impact of Different Changes and Justification of Decision
• Technique:
  • Queueing theory, machine learning, data mining, analytic and simulation modelling, and game theory
• Example:
  • Predicting the impact of the expected increase in the number of users and volume of data, and of the implementation of new applications
• Comparison of different options
  • Hardware upgrades; server, data and application consolidation; virtualization; moving workloads between systems
• Justification of decision
[Chart callouts: predict how a new application will affect the performance of existing applications; predict the impact of a hardware upgrade]
PRESCRIPTIONS
Prescriptive Analytics - Advice
• Prescriptive Analytics
  • Uses the outputs of Descriptive, Diagnostic and Predictive Analytics to recommend what should be done, and when, to achieve business goals and meet SLAs most economically and effectively
• Technique
  • Machine learning, artificial intelligence, queueing theory and optimization algorithms compare the impact of different decision options
• Value
  • Better-informed decisions
  • Reduced risk
  • Realistic expectations
  • Enables verification and automation
• Evaluates prediction results to find how to satisfy the SLGs of each workload most effectively
• Recommends tuning, workload management and capacity planning actions to continuously meet SLGs
VERIFICATION AND
AUTOMATION
Verification
• Comparing Actual vs. Expected (A2E) results is the basis for feedback control
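An Actual vs. Expected check can be as small as a relative deviation per workload. The sketch below reports the workloads whose actual value deviates from the expected one by more than a tolerance, which would then trigger a new prescription. The 20% tolerance, the function name and the sample values are assumptions.

```python
def a2e_report(actual, expected, tolerance=0.2):
    """Return {workload: relative deviation} for workloads whose actual value
    differs from the expected value by more than `tolerance` (a fraction)."""
    deviations = {}
    for workload, exp in expected.items():
        rel = (actual[workload] - exp) / exp
        if abs(rel) > tolerance:
            deviations[workload] = rel
    return deviations


# Hypothetical mean response times in seconds after a change was applied.
expected = {"Sales": 2.0, "Finance": 4.0}
actual = {"Sales": 2.1, "Finance": 6.0}

report = a2e_report(actual, expected)
print(report)  # only Finance exceeds the tolerance
```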
Goal is Automation
• Data Collection 24x7
• Configuration Auto Discovery
• Workload Characterization
• Determining Seasonal Peaks
• Modeling and Performance Prediction
• Setting rules that automate Resource Allocation and Workload Management
• Dynamic change of rules based on short term predictions
• Verification – A2E
• New Prescription
[Figure callouts: Control; optimization of infrastructure and organization of the continuous proactive management process]
SUMMARY
Value of Performance Assurance
• Unification of Performance Assurance methodology and tools across multiple platforms
• Optimization of Design, Development and Testing
• Optimization of Dynamic Capacity Management - Performance Management and Workload Management
• Optimization of Capacity Planning
• Optimization of Big Data Infrastructure
• Setting Realistic Expectations
• Enables Verification
• Automation of Performance Assurance process
• Reduce uncertainty and risk of performance surprises
• Collaborative capacity management process
QUESTIONS? [email protected]
www.beznext.com
Collection and Aggregation of Performance Measurement Data
[Figure: architecture diagram with the following components]
• Data sources: Big Data Clusters; Teradata, Oracle, DB2 EDW; Other IT Platforms
• Agents: Linux, Kafka, Spark, Storm, Cassandra, YARN, Tez, Auto Discovery and other agents, coordinated by the Agent Manager
• Data path: Data Transformation, Workload Aggregation, Data Lake, Performance DW
• Advanced Analytics: Descriptive, Diagnostic, Predictive, Prescriptive and Control Analytics
• Performance Assurance: Workload Characterization, Workload Forecasting, Workload Management, Performance Management, Performance Prediction, Capacity Planning, Verification & Control