Quality health plans & benefits · Healthier living · Financial well-being · Intelligent solutions
Alexander Norris [email protected]
How Aetna Uses Splunk and Prelert to Improve the Consumer Directed Healthcare Experience
Aetna Inc. 2
Our Values · Splunk · Prelert · Path · Cool
©2014 Aetna Inc.
Our values drive us to do better… for People
History
• Founded in 1853
• $47.2 billion in 2013 revenue
• 23.1 million medical members
Our healthy:
Health care built around people
“Our healthy as a company is to find solutions that help people live healthier
lives and to help them manage their health versus their health care.”
— Mark Bertolini, CEO, Aetna
Our Analytics Journey
The level of analytics maturity depends on how much of the decision process is automated and how much is left for human intervention.
Descriptive analytics (dashboards and query/drill-down tools): the intelligence is entirely left to the human.
Advanced analytics: more of the intelligence is automated.
Moving from descriptive analytics to more advanced analytics gets us closer to decisions and actions (i.e. direct business impact).
Our key to open machine data is Splunk
Splunk
Platforms: Websphere, IIS, IHS, Logviewer, BPM, DataPower, UDB, MQ, Pure, San, Omnibus, Windows, VDI, Esx, ICE, Proxy, Netscaler, SAN, Avaya, Cloud, Linux, F5, AIX, IRONMAIL, Mainframe (CICS, DB2, WAS)
Products: ASD, ATV, AVA, NAV, Docfind (FAST/DSE), Lifesuite, Workability, IQS, IUS, Incedo, QRS, APMCAS, AQC, Dynamo, NICE
Analytics Across Users and Silos, Creating a Single ‘Glass Pane’
• Data across Splunk made available in Prelert
• Let the machine learn what is ‘normal’
• Create actionable anomalies
• People intersect with other technologies as Prelert maps together the picture
• Dedicated Search Heads mean workloads outside of conventional Splunk usage
[Diagram: Prelert linked to WAS/JAVA, I.H.S, Compuware Vantage, z/OS CICS, COBOL, Datapower, MQ, DB2/UDB]
Our Prelert Splunk Application
Source: Prelert
Our Prelert Splunk Application
• Retrospective ‒ Auto-detect: a Prelert UI for writing searches against any log(s) that already exist in Splunk and identifying anomalies
• Real-time ‒ Prelert searches are predefined and set up to run at regular intervals to establish/adjust baselines on metrics
Transforms this to a handful of meaningful anomalies per day
Our Prelert Splunk Application Tactics
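The real-time tactic (searches run at regular intervals to establish and adjust a baseline, then surface only the deviations) can be illustrated with a minimal Python sketch. This is not Prelert's algorithm, which the slides do not describe; a simple rolling mean and standard deviation is swapped in as a stand-in. The function name, window size, threshold, and sample values are all assumptions made for illustration.

```python
from collections import deque
from statistics import mean, stdev


def detect_anomalies(values, window=5, threshold=3.0):
    """Return indices whose value deviates from the rolling baseline.

    Hypothetical helper: a rolling z-score stands in for the
    machine-learned baseline, which the slides do not specify.
    """
    baseline = deque(maxlen=window)  # most recent "normal" observations
    anomalies = []
    for i, value in enumerate(values):
        if len(baseline) == window:
            mu = mean(baseline)
            sigma = stdev(baseline)
            # Skip the check when the baseline is perfectly flat (sigma == 0).
            if sigma > 0 and abs(value - mu) > threshold * sigma:
                anomalies.append(i)
                continue  # keep anomalous points out of the baseline
        baseline.append(value)
    return anomalies


# Made-up per-interval response times in milliseconds.
samples = [100, 102, 98, 101, 99, 100, 450, 101, 97]
print(detect_anomalies(samples))
```

Excluding flagged points from the baseline keeps a single spike from inflating what counts as normal for the following intervals, which is one way a large event stream can be reduced to a handful of anomalies.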
Our Prelert Splunk Application Continuous Improvement Approach
Plan-Do-Check-Act cycle, centered on the Customer:
• Plan: Identify an opportunity and plan for predictive analytics (anomaly detection)
• Do: Implement the search on a small scale in auto-detect mode
• Check: Use data to analyze the results of the search and determine whether it identified anomalies
• Act: If the search was successful, implement it in real time on a wider scale for continuous benchmarking and alerting. If the search needs improvement (i.e. addition/exclusion of new metrics), begin the cycle again
Reactive Path Case Study
An outage hit the flagship self-service application with its source and impact unknown. Starting at 4:10PM, a response team was activated as Navigator transactions hung. Self-registration JVMs crashed, and an ODR failure followed. At 6:30PM the Navigator UDB instance was recycled, which ultimately allowed the return to stability.
Reactive Path Focal Point / Transaction Alert Event
The first wave of misbehavior began at 4:10PM when self registration failed
JVM Request : *
Method : N/A
SQL : N/A
Resource : Resident Time
Trap Condition : Application
Offending Content : /registration/MyAssist/ListResults.jsp
Threshold : 120,000 milliseconds
MaxMin : Maximum
Number of Hits : 7
Offending Value : 180,029 milliseconds
Severity : Low
Reactive Path Failure Log Output
[1/13/14 16:08:14:169 EST] 00000ae4 TCPChannel W TCPC0004W: TCP Channel TCP_4 has exceeded the maximum number of open connections 100.
[1/13/14 16:09:09:971 EST] 0000002e ThreadMonitor W WSVR0605W: Thread "WebContainer : 2" (00000040) has been active for 395389 milliseconds and may be hung. There is/are 1 thread(s) in total in the server that may be hung.
    at java.net.SocketInputStream.socketRead0(Native Method)
    at java.net.SocketInputStream.read(SocketInputStream.java:129)
    at com.ibm.db2.jcc.t4.z.b(z.java:199)
    at com.ibm.db2.jcc.t4.z.c(z.java:289)
    at com.ibm.db2.jcc.t4.z.c(z.java:402)
    at com.ibm.db2.jcc.t4.z.v(z.java:1170)
    at com.ibm.db2.jcc.t4.ab.c(ab.java:137)
    at com.ibm.db2.jcc.t4.b.Wc(b.java:1308)
    at com.ibm.db2.jcc.t4.b.b(b.java:1227)
    at com.ibm.db2.jcc.t4.b.a(b.java:5983)
    at com.ibm.db2.jcc.t4.b.c(b.java:792)
    at com.ibm.db2.jcc.t4.b.b(b.java:735)
    at com.ibm.db2.jcc.t4.b.a(b.java:402)
    at com.ibm.db2.jcc.t4.b.<init>(b.java:331)
    at com.ibm.db2.jcc.DB2PooledConnection.<init>(DB2PooledConnection.java:84)
Log file analysis in Splunk showed the threads being hung in db2 and connection limits reached
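As a rough illustration of the kind of extraction a log search performs on these lines, the Python sketch below pulls the thread name and active time out of WSVR0605W warnings. It is not the Splunk search used by the team, which the slides do not show. The regular expression is written against the excerpt above only, and the function name and `min_millis` parameter are hypothetical.

```python
import re

# Pattern for the WSVR0605W hung-thread warning as worded in the excerpt
# above; other WebSphere versions may phrase the message differently.
HUNG_THREAD = re.compile(
    r'WSVR0605W: Thread "(?P<thread>[^"]+)" \((?P<thread_id>[0-9a-f]+)\) '
    r'has been active for (?P<millis>\d+) milliseconds'
)


def find_hung_threads(lines, min_millis=0):
    """Yield (thread name, active milliseconds) for each hung-thread warning."""
    for line in lines:
        match = HUNG_THREAD.search(line)
        if match and int(match.group("millis")) >= min_millis:
            yield match.group("thread"), int(match.group("millis"))


log_lines = [
    '[1/13/14 16:08:14:169 EST] 00000ae4 TCPChannel W TCPC0004W: TCP Channel '
    'TCP_4 has exceeded the maximum number of open connections 100.',
    '[1/13/14 16:09:09:971 EST] 0000002e ThreadMonitor W WSVR0605W: Thread '
    '"WebContainer : 2" (00000040) has been active for 395389 milliseconds '
    'and may be hung.',
]
hung = list(find_hung_threads(log_lines))
print(hung)
```

Passing a larger `min_millis` narrows the result to threads that have been stuck longest, which is useful when many warnings arrive during an incident.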
Reactive Path Failure SQL Deficiency Isolation
These SQL integrity constraint violations communicated resource contention
Reactive Path Failure
We continued working to understand the connection interactions with this shared DB resource within Tivoli. This picture is the DB connection escalation pattern
Reactive Path Ethernet Traffic Sniff DB Isolation Compuware DCRUM
These are DB views from watching traffic to the database. They helped visualize DB interactions
Reactive Path Conflict Detection in Splunk
2014-01-21-12.54.05.315574-300 E14573A538 LEVEL: Warning
PID : 10158854 TID : 75911 PROC : db2sysc 0
INSTANCE: navudbp1 NODE : 000 DB : DSRGP000
APPHDL : 0-9855 APPID: IP.HERE.54686.140121173034
AUTHID : LRTUSER
EDUID : 75911 EDUNAME: db2agent (DSRGP000) 0
FUNCTION: DB2 UDB, data management, sqldEscalateLocks, probe:3
MESSAGE : ADM5502W The escalation of "700" locks on table "GSRGP00D.T007000" to lock intent "X" was successful.

2014-01-21-12.56.02.803084-300 I15112A544 LEVEL: Error
PID : 10158854 TID : 84770 PROC : db2sysc 0
INSTANCE: navudbp1 NODE : 000 DB : DSRGP000
APPHDL : 0-7410 APPID: IP2.HERE.44561.140121151200
AUTHID : LRTUSER
EDUID : 84770 EDUNAME: db2agent (DSRGP000) 0
FUNCTION: DB2 UDB, common communication, sqlcctcptest, probe:11
MESSAGE : Detected client termination
DATA #1 : Hexdump, 2 bytes
0x070000019BFF4D42 : 0036 .6

• DB log output communicated the conflict to SME
• Data showed contention caused by overlap of online and batch systems
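To show how lock escalation messages like the one above could be summarized per table, here is a minimal Python sketch. It assumes the ADM5502W wording shown in the excerpt and is not the search the team ran in Splunk. The function name and the idea of totaling escalated locks by table are illustrative assumptions.

```python
import re
from collections import Counter

# Matches the ADM5502W lock escalation message as worded in the excerpt above.
LOCK_ESCALATION = re.compile(
    r'ADM5502W\s+The escalation of "(?P<locks>\d+)" locks on table '
    r'"(?P<table>[^"]+)" to lock intent "(?P<intent>[^"]+)"'
)


def escalated_locks_by_table(diag_text):
    """Total the escalated lock counts per table (hypothetical helper)."""
    totals = Counter()
    for match in LOCK_ESCALATION.finditer(diag_text):
        totals[match.group("table")] += int(match.group("locks"))
    return totals


diag_text = (
    'FUNCTION: DB2 UDB, data management, sqldEscalateLocks, probe:3\n'
    'MESSAGE : ADM5502W The escalation of "700" locks on table '
    '"GSRGP00D.T007000" to lock intent "X" was successful.\n'
)
totals = escalated_locks_by_table(diag_text)
print(totals.most_common())
```

Ranking tables by escalated locks is one way to point a subject matter expert at the shared resource where online and batch workloads collide.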
Reactive Path Tivoli Enterprise Portal (TEP)
Memory and CPU usage doubled for the batch server, so jobs did not run within their timeframes
Reactive Path Summary
• No hardware or infrastructure software issues were found on any of the Websphere, Frameworks, or DB servers
• The team determined lock waits for DB2 came from a batch system which caused the crash of several online JVMs
• DBA Support implemented a monitoring change to capture database lock waits in excess of 20 seconds. This also captures lock timeouts and deadlocks
• DBA and AD are reviewing SQL, adding “WITH UR” to increase concurrency where needed
FROM COMMAND LINE TO SPLUNK = FASTER INFORMATION
Predictive Path Aetna Voice Advantage (AVA)
• Award-winning natural voice recognition system
• Customers opting out to “live” service reps due to delays in backend systems
• Defined the unknown: Applied Prelert machine learning to understand ‘normal’ events from many platforms and products
This was impossible before
Predictive Path: AVA Continuous Improvement
Timeline: December, January, February, March
Prelert Anomaly Recognition with AVA in the I.H.S/ODR/DataPower System Out Data

Manual isolation and identification of interdependencies beyond a single silo
• Direct Results:
  ‒ Coding defect in AVA
  ‒ Datapower firmware deficiency (IBM product request)
  ‒ Limitation of CICS job modified
• Indirect Results:
  ‒ Created a working team comprised of affected components
  ‒ Integrated into existing specialized alerting to enhance process

Explored auto-learn dynamic correlation between silos, which enhanced insight and fed continued research
• Direct Results:
  ‒ An association between CICS job failures and AVA OPT OUTS is recognized
• Indirect Results:
  ‒ Understanding the tool. A lot of cross-discipline teamwork.

Implemented enhanced correlated search for AVA
• Direct Results:
  ‒ A new live search will consume a larger set of machine data, enhancing anomaly detection.
• Indirect Results:
  ‒ The anomaly detection will initiate a deeper knowledge set, enhancing results.
Predictive Path: AVA Continuous Improvement Anomaly Detection Results
This is the aggregate dump time for CICS zone (z/OS) online transaction impacting
• April: 17.5 minutes
• May: 8 minutes
• June: 2.5 minutes
• July: 30 seconds
System Dump Counts by Month
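The monthly figures above can be turned into relative improvements with a few lines of Python. This is only a worked calculation on the four numbers from the slide; the helper name `percent_reduction` is an assumption.

```python
# Aggregate dump time per month from the slide, converted to seconds.
dump_seconds = {
    "April": 17.5 * 60,
    "May": 8 * 60,
    "June": 2.5 * 60,
    "July": 30,
}


def percent_reduction(before, after):
    """Percentage drop from `before` to `after` (hypothetical helper)."""
    return (before - after) / before * 100


months = list(dump_seconds)
# Month-over-month change in aggregate dump time.
for previous, current in zip(months, months[1:]):
    change = percent_reduction(dump_seconds[previous], dump_seconds[current])
    print(f"{previous} to {current}: {change:.1f}% less dump time")

overall = percent_reduction(dump_seconds["April"], dump_seconds["July"])
print(f"April to July: {overall:.1f}% less dump time")
```

From April to July the aggregate dump time falls from 1,050 seconds to 30 seconds, a reduction of roughly 97%.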
Weekly count of timeouts for the anomalous AVA queue Q151P.P001.NTWKCATINQ.ONLINE.AVA.REQ
Predictive Path: AVA Continuous Improvement Anomaly Detection Results
The number of critical events resulting from synthetic Hammer transactions for AVA has dropped from over 400 per month to single digits (3 and 8 in June and July)
Predictive Path: AVA Continuous Improvement Anomaly Detection Results
• Improved application resiliency for applications beyond the targeted systems (AVA, NAV, ACAS, OPT)
• Lowered ‘stop the world’ dump activity in the CICS region from over 17 minutes in April to 30 seconds in July
• CICS region stability resulted in fewer backend timeouts in the Datapower platform
• Prelert anomalies got people paying attention, added legitimacy to the engagement, and prompted teams to answer the question of ‘why?’
• The more transactions that are self-serviced, the lower the cost to service
Predictive Path: Results Summary
Two contests were held this year, igniting the spirit of innovation. All employees and any company that works with Aetna were invited to the challenge
• The first was a ‘Happy New Year Hackathon’; the challenge was to create an original solution to better people’s health
• The most recent challenge, ‘Big Datathon’, was to create a prediction algorithm for claims costs in our international business. Hadoop!
Cool: Hackathon & Datathon
Splunk’s Hunk is in our Lab
Possibilities are endless
Other Stuff:
• Top Products & Platform Resiliency Effort
• OOM/Coding Defects (Dev Ops feedback)
• Mainframe data (Syncsort)
Cool: Hunk & Other Stuff
Thank you