Operational Analytics “Data is the New Soil” –David McCandless
October 16th 2015
Darrell Westbury Director of Operational Analytics
About me…
§ I’m Darrell Westbury
§ I work in Global Technology Services for Credit Suisse
§ Credit Suisse is a Fortune 200, Swiss-based Investment Bank, with ~46,000 employees worldwide
§ I’ve worked at CS for ~7 years
§ I’ve held various roles, including:
• Head of Storage Ops for the Americas
• Head of Capacity and Inventory Services
• Director of Operational Analytics (current)
What is Operational Analytics?
“Operations Analytics (OA) is an approach or method of applying big data principles and data analytics to the IT Operations realm”
“OA centers on discovering trends and patterns in high volume, complex and noisy IT systems data and making predictions that will help avoid impact to key services where possible and ‘recover quickly *’ when issues do occur” (* reduced MTTR)
What Types of Data Does OA Target?
§ Machine Data: System and Application logs, System Events, Performance and Capacity Metrics…
§ Wire Data: Network Packet Captures that have been pre-decoded for ease of use
§ Agent Data: Intercepted system calls and Application Method Invocations
§ Synthetic Transactions: Simulated views of a customer’s experience while interacting with a service
§ Human Maintained: Asset Inventories, Lifecycle Status, Data classes, Apps Names and the people who manage them, etc.
So, What Does OA Actually Do? Phase I: Data Onboarding
§ Identify Golden Data Sources
§ Extract and Transform (sometimes)
§ Load & Ingest
§ Manage Data Quality
§ Reference Data, Maturity Scales
§ Accountable Data Owners
§ Track Progress with Score cards
§ Remove any Blockers
§ Update Data Documentation
(Chart: EFFORT axis, 0% to 100%. Phase shown: Data Onboarding)
So, What Does OA Actually Do? Phase II: Data Science
§ Identify trends & seasonal patterns in data
§ Look for baselines & outliers using statistics and linear algebra techniques
(Chart: EFFORT axis, 0% to 100%. Phases shown: Data Onboarding, Data Science)
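The baseline and outlier step in this phase can be sketched roughly as follows. This is a minimal illustration, not the method used at Credit Suisse: it assumes a simple trailing-window mean as the baseline and a z-score threshold for outliers, and it does not model seasonality. The function name `find_outliers`, the window size, the threshold, and the sample values are all hypothetical.

```python
from statistics import mean, stdev


def find_outliers(samples, window=12, threshold=3.0):
    """Return (index, value, z_score) for points far from a trailing baseline.

    The baseline is the mean of the previous `window` samples; a point is
    flagged when it sits at least `threshold` standard deviations away.
    """
    outliers = []
    for i in range(window, len(samples)):
        history = samples[i - window:i]
        baseline = mean(history)
        spread = stdev(history)
        if spread == 0:
            continue  # flat history: the z-score is undefined, so skip
        z = (samples[i] - baseline) / spread
        if abs(z) >= threshold:
            outliers.append((i, samples[i], round(z, 1)))
    return outliers


# Hypothetical response-time samples in milliseconds (invented for illustration)
latency_ms = [100, 102, 98, 101, 99, 103, 97, 100, 102, 98, 101, 99, 100, 250, 101]
flagged = find_outliers(latency_ms)
print(flagged)
```

With these invented samples only the 250 ms spike is flagged. One limitation worth noting: once a spike enters the trailing window it inflates the spread, so later anomalies may be masked until it ages out.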
So, What Does OA Actually Do? Phase III: Data Visualizations
(Chart: EFFORT axis, 0% to 100%. Phases shown: Data Onboarding, Data Science, Data Visualization)
How are we using ThousandEyes?
1. Basic DNS Testing: Alert on internet domain name resolution failures
2. Port Listener Health: Tickle a TCP port over the internet to ensure it’s listening and responding
3. Data Path Testing: Observe the end-to-end path through the internet to a target service and monitor for route changes, packet loss, latency & jitter
4. Page Load Testing: Ensure internet-facing web sites and services are responding correctly and consistently
5. Synthetic Transactions: Test authentication and site navigation; collect performance stats at the page object level
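To make the first two test types concrete, here is a minimal sketch of a DNS resolution check and a TCP port "tickle" using only the Python standard library. It is not the ThousandEyes API and says nothing about how ThousandEyes implements its tests; the function names `check_dns` and `check_tcp_port` and the hostname in the usage comment are hypothetical.

```python
import socket
import time


def check_dns(hostname):
    """Resolve a hostname; return sorted unique addresses, or [] on failure."""
    try:
        infos = socket.getaddrinfo(hostname, None)
    except socket.gaierror:
        return []  # resolution failed: this is the condition to alert on
    return sorted({info[4][0] for info in infos})


def check_tcp_port(host, port, timeout=3.0):
    """Attempt a TCP connect; return (is_listening, elapsed_seconds)."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            listening = True
    except OSError:
        listening = False  # refused, unreachable, or timed out
    return listening, time.monotonic() - start


# Example usage (hypothetical target, left commented out to avoid network calls):
# addresses = check_dns("www.example.com")
# listening, elapsed = check_tcp_port("www.example.com", 443)
```

A single local check like this only sees one vantage point. The value described in the deck comes from running such tests from many locations across the internet and comparing the results.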
Detecting Internet Client Access Issues
ThousandEyes probes began reporting a 50-100% drop in authentication responses from a public-facing Web Service
Evidence of the issue was provided to Web Operations, who were unaware of it because all of their monitoring tools depicted what they believed to be a healthy and stable infrastructure (CPU, Memory, Disk & Network I/O, Capacity, Performance, etc.)
Evidence of Improved Performance
Implemented a synthetic transaction to compare the relative End User Experience of using the infrastructure with and without acceleration.
Collected concrete evidence of a ~33% service time / latency performance improvement when accessing an accelerated URL
Also able to demonstrate that service times became relatively more consistent
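The comparison above comes down to simple arithmetic: relative improvement is (direct average − accelerated average) / direct average, and consistency can be summarized by the standard deviation of each set of samples. The sketch below assumes that framing. The function name `compare_service_times` is hypothetical, and the sample values are invented to show the calculation; they are not the measurements from this talk.

```python
from statistics import mean, stdev


def compare_service_times(direct_ms, accelerated_ms):
    """Summarize relative improvement and spread for two sets of samples."""
    direct_avg = mean(direct_ms)
    accel_avg = mean(accelerated_ms)
    return {
        # Percentage reduction in average service time
        "improvement_pct": 100.0 * (direct_avg - accel_avg) / direct_avg,
        # Lower standard deviation means more consistent service times
        "direct_stdev_ms": stdev(direct_ms),
        "accelerated_stdev_ms": stdev(accelerated_ms),
    }


# Invented samples in milliseconds, for illustration only
direct_ms = [900, 1200, 750, 1500, 1050, 900]
accelerated_ms = [700, 720, 690, 710, 700, 680]

summary = compare_service_times(direct_ms, accelerated_ms)
print(summary)
```

Reporting the spread alongside the average matters here because an accelerated path can be valuable for smoothing out slow outliers even when the average gain is modest.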
Insight into a Production Incident
(Screenshot timeline: 10:45 AM to 11:00 AM)
1. Pre-Incident, Paths Look OK
2. Alert Received - BGP Route Disruption
3. Internet Paths are Failing
4. No BGP Routes - SPoF Detected
5. Path Failover to DR Site Successful
Some Final Thoughts…
§ ThousandEyes lets us run multiple types of tests at various levels of sophistication and granularity from all over the world (DNS Test, TCP Port Tickle, Internet Path Test, Page Loads, Full Synthetic Transactions)
§ We see exactly how our clients are experiencing our services (Internet Path, BGP Route health, packet loss, jitter & latency)
§ We receive alerts on any significant variations from our established baselines (Paths, Routes, Performance, Service Quality, etc.)
§ We’re leveraging real, quantifiable data to take the guesswork and subjectivity off the table
§ We’re empowering important business decisions
Thank You
October 16th 2015
Darrell Westbury Director of Operational Analytics