
Case Study: Increasing Produban's Critical Systems Availability and Performance


Page 1: Case Study: Increasing Produban's Critical Systems Availability and Performance

ca Opscenter

Case Study: Increasing Produban's Critical Systems Availability and Performance

Vitor Sousa

Director, Monitoring Tools and Processes, Produban

OCX15S #CAWorld

Page 2: Case Study: Increasing Produban's Critical Systems Availability and Performance

2 © 2014 CA. ALL RIGHTS RESERVED.

Abstract

The Santander Group is a Spanish banking group and the largest bank in the Eurozone by market value. It is also one of the largest banks in the world in terms of market capitalization. Produban is the Santander group company responsible for Santander's entire IT infrastructure. Produban's challenge was to monitor, proactively and in real time, all transactions running in certain critical systems and to be able to act before major problems occur. To meet this challenge, Produban adopted CA Core APM (Introscope) for alerts that let the technical team detect problems before they impact the business. Produban also uses CA Core APM to create dashboards that make it easier to see when thresholds are reached and that help the operations team act to normalize the situation. With these measures, Produban reduced its MTTR from days to hours while greatly increasing its visibility into critical IT services.

Vitor Sousa

Produban

The Santander Group

Page 3: Case Study: Increasing Produban's Critical Systems Availability and Performance


Agenda

1. ABOUT THE SPEAKER

2. COMPANY OVERVIEW

3. CHALLENGES FROM A NEW WAY OF THINKING ABOUT APPLICATION MONITORING

4. THE PROJECT

5. THE SOLUTION AND RESULTS WITH CA APM

Page 4: Case Study: Increasing Produban's Critical Systems Availability and Performance


Produban

Vitor Sousa Director Monitoring Tools and Processes

Produban Brazil – Santander Group

+5511 96192-5194

Background

BS in Economics, postgraduate degree in Systems Administration and MBA in Finance; almost 20 years in the IT market; experienced in several IT areas:

IT Solutions and Sales

IT Processes

Infrastructure Management

Software Development (focused on Infrastructure Monitoring)

Page 5: Case Study: Increasing Produban's Critical Systems Availability and Performance


Santander Group

Founded in 1857 in Santander, Spain

Strong presence in 10 major countries in Europe and the Americas, with businesses in over 40 countries

The largest bank in the Eurozone and one of the largest in the world

Commercial bank

Page 6: Case Study: Increasing Produban's Critical Systems Availability and Performance


Produban Company

Produban manages and controls the entire IT infrastructure of the Santander Group:

Retail banking

Global units

Corporate units

Established on May 1, 2005

100 percent owned by the Group

Page 7: Case Study: Increasing Produban's Critical Systems Availability and Performance


Mission

Perform unified and standardized production management for the financial entities of the Santander Group, and establish the Infrastructure Group.

Based on:

Production management excellence: efficiency, service quality, operational risk

Adding value to the business: time-to-market, flexibility

Page 8: Case Study: Increasing Produban's Critical Systems Availability and Performance


Produban – Subsidiaries and Branches

More than 5,000 professionals

Page 9: Case Study: Increasing Produban's Critical Systems Availability and Performance


Produban – Major Customers

Produban provides services to more than 120 financial institution groups.

Page 10: Case Study: Increasing Produban's Critical Systems Availability and Performance


Infrastructure Group – Data Center

UK: Carlton Park (3,000 m²), Shenley Wood (2,500 m²), Bletchley (1,950 m²)

ES: Boadilla (3,900 m², 2 × 1,950 m²), Cantabria (6,000 m², 2 × 3,000 m²)

BR: Campinas (3,600 m²)

MX: Querétaro (3,000 m²)

Page 11: Case Study: Increasing Produban's Critical Systems Availability and Performance


Infrastructure Group – Private Network

Page 12: Case Study: Increasing Produban's Critical Systems Availability and Performance


Infrastructure Group – Processing

More than 28,000 physical servers

More than 56,000 logical servers

More than 22,000 databases

Volumetric Processing Equipment

Page 13: Case Study: Increasing Produban's Critical Systems Availability and Performance


Volumetric Processing

106.6 million retail banking customers

11.6 million active Internet customers

2.6 million mobile banking customers

30 million credit cards

80 million debit cards

30 million contact center calls per month

5,000 million transactions per month

67 million card transactions on the peak day

9.6 million batch executions per month

16.7 million payments per day

Page 14: Case Study: Increasing Produban's Critical Systems Availability and Performance


Page 15: Case Study: Increasing Produban's Critical Systems Availability and Performance


The arrival of a new Executive Officer (Enrique Sanchez) brought new ideas and encouraged the team to think differently.

He gave us back the power to seek new solutions better suited to the needs of modern IT.

A mindset change in the way of monitoring:

Monitoring much more focused on automation and proactivity

Develop views of "service health"

Focus on improving team productivity and assertiveness

Page 16: Case Study: Increasing Produban's Critical Systems Availability and Performance


Challenges

1. Decrease the number of incidents caused by applications.

2. A new model of application monitoring with greater productivity and efficiency, using dashboards for simpler and easier monitoring.

3. Improve proactive and real-time monitoring, so that technical teams can detect problems before they impact services.

4. Improve threshold management, taking into account changes in application behavior and false positives.

Incidents by number of alerts (September 2013): not alarmed 75%, alarmed 25%

Incidents without alerts, by reason: application 64%, items not monitored 22%, business rules 14%

Page 17: Case Study: Increasing Produban's Critical Systems Availability and Performance


The Project Milestones and Time

Timeline: 12/12 to 8/13

Project kick-off

Environment stabilization

Improved performance

Script creation to optimize performance

Scope change: focus on Modules Generator automation

Requirements and process definition

Modules Generator development

Dynamic threshold definitions

Go to production

Gabriel Mochnacs Arruda, responsible for the monitoring team, Produban Brasil

Plinio Augusto Moreira, CA Technical

Page 18: Case Study: Increasing Produban's Critical Systems Availability and Performance


1 Challenge Decrease the number of incidents caused by applications.

Goal Decrease the development time of new "application monitoring plans."

Solution Automate the construction of new application monitoring services, based on CA APM.

Results This solution has been used in preproduction and production for the systems Portal CIC Cuentas, Portal CIC Cards and Norkom (Risk Manager) since September 2013. We reduced the number of application incidents for these systems by 66 percent, and troubleshooting time dropped roughly tenfold.

Page 19: Case Study: Increasing Produban's Critical Systems Availability and Performance


2 Challenge Create an automated process to identify new services and new applications within existing services.

Goal Keep the environment always up to date with new servers and applications, using automatic tools.

Solution Connect to the WebSphere® Deployment Manager (DMGR) to learn of new functions or new application servers in the environment.

Results After implementing this connection with the DMGR, we reduced to zero the number of new applications or servers deployed without being monitored, for the systems Portal CIC Cuentas, Portal CIC Cards (Cartões) and Norkom.

Page 20: Case Study: Increasing Produban's Critical Systems Availability and Performance


3 Challenge Establish a new model of application monitoring with greater productivity and efficiency, making more use of dashboards.

Goal Improve troubleshooting response time for application and infrastructure events, with greater assertiveness.

Solution Automate the construction of new CA APM dashboards so the support and monitoring teams can view them easily.

Prerequisites: Meet with the application architecture team to understand how the system works, the most important points to be monitored and the boundaries of the application (input and output flows). Create a new CA APM template if the monitored application does not fit the existing models in our library.

Results: 264 dashboards created in five minutes. Effort to create them without the Modules Generator: 270 hours, or 33 workdays.

Page 21: Case Study: Increasing Produban's Critical Systems Availability and Performance


Technical Details

Modules Generator

Creates automated dashboards

Shows the application's path through an application server

Presents the health of Java components, front-ends, back-ends and JVM resources

Developed flow

[Flow diagram labels: Dashboard, DMGR, App Server, APM, Process, Dash Template, Thresholds]

Create systematic connection.

Create an engine.

Template with information.

Create standard template images.

Return processed data to the APM.

Page 22: Case Study: Increasing Produban's Critical Systems Availability and Performance


Modules Generator – Diagram

[Diagram: the Modules Generator Java application, with the inputs XML DMGR and Template, an embedded HSQL database, and the APM server reached through a web service; output: dashboard created.]

Daily routine for storing thresholds

Direct connection between the application and the DMGR for reading XML

Page 23: Case Study: Increasing Produban's Critical Systems Availability and Performance


Modules Generator – Components

XML DMGR: Communication between the Modules Generator and the WebSphere Deployment Manager. The Modules Generator reads the serverindex.xml file, which describes how applications are distributed across AppServers. This file is the input for generating the first module, so the Modules Generator must be able to communicate with the DMGR to consume the XML.

Template: Pre-configured APM module with the list of Metric Groups, Alerts and Dashboards to be created. All items in this module contain variables that the Modules Generator fills in.

HyperSQL Database: Database embedded in the application; no installation is necessary. It stores the thresholds, supports their analysis and is used to update the threshold values in the APM module.

XML Verification Routine: Monitors serverindex.xml. Whenever the file changes, a new module must be generated to update the information in the dashboards.

Thresholds Recording Routine: Daily routine that records in the database the data calculated by the Modules Generator. Each run writes the data from the previous day.

[Diagram labels: XML DMGR, Template, HSQL, ZABBIX]
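The XML DMGR component and the XML Verification Routine can be illustrated with a small sketch. The Modules Generator itself is a Java application whose code is not shown in the slides; the Python below is only an illustration. It assumes a simplified serverindex.xml layout (serverEntries elements with serverName and serverType attributes and deployedApplications children), whereas real files carry XML namespaces and many more attributes. The sample data and both function names are hypothetical.

```python
import hashlib
import xml.etree.ElementTree as ET

# Simplified, hypothetical sample; a real serverindex.xml carries namespaces
# and many more attributes than shown here.
SAMPLE = """<ServerIndex hostName="host01">
  <serverEntries serverName="AppServer01" serverType="APPLICATION_SERVER">
    <deployedApplications>PortalA.ear/deployments/PortalA</deployedApplications>
    <deployedApplications>PortalB.ear/deployments/PortalB</deployedApplications>
  </serverEntries>
  <serverEntries serverName="dmgr" serverType="DEPLOYMENT_MANAGER"/>
</ServerIndex>"""


def applications_by_server(xml_text):
    """Map each application server name to the applications deployed on it."""
    root = ET.fromstring(xml_text)
    result = {}
    for entry in root.iter("serverEntries"):
        if entry.get("serverType") != "APPLICATION_SERVER":
            continue  # skip the deployment manager and other server types
        apps = [
            (node.text or "").strip().split("/")[-1]
            for node in entry.iter("deployedApplications")
        ]
        result[entry.get("serverName")] = apps
    return result


def fingerprint(xml_text):
    """Digest used to notice that the file changed since the last check."""
    return hashlib.sha256(xml_text.encode("utf-8")).hexdigest()


layout = applications_by_server(SAMPLE)
changed = fingerprint(SAMPLE) != fingerprint(SAMPLE.replace("PortalB", "PortalC"))
```

Comparing digests is one simple way to decide when a new module has to be generated; the slides do not say how the real routine detects changes.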

Page 24: Case Study: Increasing Produban's Critical Systems Availability and Performance


Main Flow Routine

Install the APM agent in the application that will be monitored.

Communication with the WebSphere DMGR: collect information about the running applications and App Servers through the serverindex.xml file.

Run the Modules Generator, phase 1: create Metric Groups and Alerts.

.jar deploy (.jar created by the Modules Generator for APM).

Daily data collection routine: needed to derive the thresholds and to identify the application's operating hours and possible deviations.

Run the Modules Generator with the application's thresholds.

Create the .jar file to deploy in APM.

Dashboards are created.

Mandatory parameters

APM hostname and communication port.

User with access to the tool.

Deployment Manager address.

serverindex.xml path on the server.

Ensure communication between the Modules Generator and the DMGR.

Schedule the .jar execution routine on the server; this process records historical data in the HSQL database.
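As a rough illustration of the mandatory parameters and of how template variables might be filled in during phase 1, here is a minimal Python sketch. The slides do not show the template syntax or the configuration format, so the field names, the `$app` and `$server` variables, the item names and all values below are assumptions made for the example.

```python
from dataclasses import dataclass, fields
from string import Template


@dataclass
class GeneratorConfig:
    """Mandatory parameters from the slide; the field names are hypothetical."""
    apm_host: str
    apm_port: int
    user: str
    dmgr_address: str
    serverindex_path: str

    def missing(self):
        """Names of mandatory parameters that were left empty."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]


# Hypothetical template: every item name carries variables to be filled in.
MODULE_TEMPLATE = {
    "metric_group": Template("$app on $server - Response Time"),
    "alert": Template("$app on $server - Response Time Alert"),
    "dashboard": Template("$app - Overview"),
}


def render_module(app, server):
    """Fill the template variables for one application on one AppServer."""
    return {
        item: template.substitute(app=app, server=server)
        for item, template in MODULE_TEMPLATE.items()
    }


# Example values only; the host, port, user and path are made up.
config = GeneratorConfig("apm.example.internal", 5001, "monitor", "", "/tmp/serverindex.xml")
module = render_module("PortalA", "AppServer01")
```

Checking `config.missing()` before a run gives an early, explicit failure when a parameter such as the Deployment Manager address has not been supplied.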

Page 25: Case Study: Increasing Produban's Critical Systems Availability and Performance


What is monitored?

JVM: CPU, garbage collector (Java memory manager)

Java: Servlets (XML/HTML translator), JSP (JavaServer Pages), EJB (Java engine), JMS (Java message processor)

AppServer: thread pool, connection pools

Backends: queries, connection count, MQ, web services

Frontends: URLs, application

Transaction time

Response time, stalls, number of calls and errors are monitored

Information from PMI and JMX

Metric groups

Application

Page 26: Case Study: Increasing Produban's Critical Systems Availability and Performance


Setting Alerts – Metrics Groups

Metric groups gather information from one or more applications, or from one or more Java components.

Metric groups are used to define alerts and to follow the health of the grouped application or component.

They are defined by regular expressions that "match" the information displayed.

Metrics Groups Example
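The original example was an image that did not survive extraction. As a stand-in, the sketch below shows the general idea of selecting metrics with regular expressions. The metric paths, the group names and the patterns are invented for illustration and are not taken from Produban's configuration or from CA APM itself.

```python
import re

# Hypothetical metric paths in a "segment|segment:metric" style.
METRICS = [
    "AppServer01|Servlets|LoginServlet:Average Response Time (ms)",
    "AppServer01|Servlets|LoginServlet:Stall Count",
    "AppServer01|JSP|home_jsp:Average Response Time (ms)",
    "AppServer02|Servlets|TransferServlet:Average Response Time (ms)",
]

# Each metric group is just a name plus the regular expression that selects
# its members; one group can span several servers or components.
METRIC_GROUPS = {
    "Servlet response time": re.compile(
        r"^[^|]+\|Servlets\|[^:]+:Average Response Time \(ms\)$"
    ),
    "AppServer01 stalls": re.compile(r"^AppServer01\|.*:Stall Count$"),
}


def members(group_name):
    """Return the metrics whose full path matches the group's expression."""
    pattern = METRIC_GROUPS[group_name]
    return [metric for metric in METRICS if pattern.match(metric)]
```

Here `members("Servlet response time")` picks up the servlet metrics from both servers while leaving the JSP metric out, which is the kind of cross-application grouping the slide describes.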

Page 27: Case Study: Increasing Produban's Critical Systems Availability and Performance


Setting Alerts – Metrics Signature

A metric signature is a combination of several types of Metric Groups that indicates an application's most common problems.

Example – Application Bottleneck

Increase in average response time

Increase in concurrent invocations

Increase in stall count

Fewer threads available

Possible bottleneck in application

When the conditions above are true, an alert is sent to the front-end with the following message:

Application 01 in AppServer_ServerName presenting bottleneck symptoms. Click the link http://XPTO.com.br to view the corresponding Dashboard.

Integration with other monitoring systems, such as Alert Modeling
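The bottleneck signature can be read as a conjunction of four symptoms. The sketch below is one possible interpretation, assuming each symptom is judged by comparing two consecutive snapshots; in practice the comparison would more likely be against thresholds. The `Snapshot` type and both function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    """One reading of the four metrics used by the bottleneck signature."""
    avg_response_ms: float
    concurrent_invocations: int
    stall_count: int
    threads_available: int


def bottleneck_suspected(previous, current):
    """True only when all four symptoms from the slide appear together."""
    return (
        current.avg_response_ms > previous.avg_response_ms
        and current.concurrent_invocations > previous.concurrent_invocations
        and current.stall_count > previous.stall_count
        and current.threads_available < previous.threads_available
    )


def alert_message(app, server, dashboard_url):
    """Build the alert text in the format quoted on the slide."""
    return (
        f"{app} in {server} presenting bottleneck symptoms. "
        f"Click the link {dashboard_url} to view the corresponding Dashboard."
    )


before = Snapshot(avg_response_ms=120.0, concurrent_invocations=10, stall_count=0, threads_available=40)
after = Snapshot(avg_response_ms=480.0, concurrent_invocations=35, stall_count=3, threads_available=5)
```

Requiring all four conditions at once is what distinguishes a signature from a single-metric alert: a slow response time alone would not fire it.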

Page 28: Case Study: Increasing Produban's Critical Systems Availability and Performance


Dashboard

Page 29: Case Study: Increasing Produban's Critical Systems Availability and Performance


4 Challenge Improve threshold management, taking into account changes in application behavior and false positives.

Goal Decrease or eliminate "false positives" in monitoring events caused by threshold deviations.

Solution Create the concept of dynamic thresholds that are based on historical occurrences and configured automatically.

Results Decrease in false positive application alerts by 77 percent.

Proactive monitoring:

Thresholds adjusted to alarm before a problem becomes an incident

Trend and deviation analysis of application behavior

Alert accuracy and automated threshold updating

Threshold validation mechanism based on application history

Input information for the application capacity process

Thresholds calculated for all active Metric Groups in CA APM's Module Manager

Page 30: Case Study: Increasing Produban's Critical Systems Availability and Performance


Technical Details – Dynamic Thresholds

[Flow diagram: APM metrics, threshold calculations, data stored, threshold checking, upgraded module; run by the Modules Generator Java application]

Create a new database to store historical indicator data.

Create an automatic extraction of observations to feed the database with item occurrences.

Develop logic to identify the "optimal point" for each threshold.

Implement a new loading process in CA APM that runs when the need for a new threshold is identified.

Create a new flow to validate thresholds and the rate of "false positives."
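The slides do not say how the "optimal point" is calculated. Purely as an illustration of the steps above, the sketch below assumes a common statistical rule (mean plus k standard deviations of the stored daily values), a 10 percent tolerance before a new value is loaded, and a simple breach rate as the validation measure. All three choices, and the function names, are assumptions rather than Produban's method.

```python
from statistics import mean, pstdev


def propose_threshold(daily_values, k=3.0):
    """Assumed rule: mean plus k population standard deviations of the history."""
    return mean(daily_values) + k * pstdev(daily_values)


def needs_update(current, proposed, tolerance=0.10):
    """Trigger a reload only when the proposal drifts beyond the tolerance."""
    return abs(proposed - current) > tolerance * current


def breach_rate(daily_values, threshold):
    """Share of historical days whose value would have exceeded the threshold."""
    breaches = sum(1 for value in daily_values if value > threshold)
    return breaches / len(daily_values)


history = [100, 110, 90, 105, 95]  # hypothetical daily averages, in ms
proposed = propose_threshold(history)
```

If the history contains no real incidents, a high breach rate for the current threshold suggests it would have produced false positives, which is the signal the validation flow is looking for.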

Page 31: Case Study: Increasing Produban's Critical Systems Availability and Performance


Database Thresholds Example

A monthly validation process is required to determine whether the registered thresholds remain appropriate or need an update.

Data for analysis always covers the last two months.

The Modules Generator provides statistical data to help the analysis.

Data can be exported to a .csv file.

Columns: system name, metric group, daily values, statistical data, current thresholds, updated thresholds
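A minimal sketch of the two-month window and the .csv export follows. The real tool keeps this data in its embedded HSQL database; here plain Python dictionaries stand in for the stored rows, "two months" is approximated as 60 days, and the column names and sample values are hypothetical.

```python
import csv
import io
from datetime import date, timedelta

COLUMNS = ["system", "metric_group", "day", "value", "current_threshold"]


def last_two_months(rows, today):
    """Keep rows from roughly the last two months (60 days, an assumption)."""
    cutoff = today - timedelta(days=60)
    return [row for row in rows if row["day"] >= cutoff]


def to_csv(rows):
    """Serialize the rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical stored rows.
rows = [
    {"system": "SystemA", "metric_group": "Servlet response time",
     "day": date(2014, 6, 1), "value": 200, "current_threshold": 400},
    {"system": "SystemA", "metric_group": "Servlet response time",
     "day": date(2014, 9, 1), "value": 220, "current_threshold": 400},
]
recent = last_two_months(rows, today=date(2014, 9, 30))
export = to_csv(recent)
```

Filtering before exporting keeps the monthly review focused on recent behavior, in line with the two-month rule on the slide.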

Page 32: Case Study: Increasing Produban's Critical Systems Availability and Performance


For More Information

To learn more about DevOps, please visit:

http://bit.ly/1wbjjqX


Page 33: Case Study: Increasing Produban's Critical Systems Availability and Performance


For Informational Purposes Only Terms of this Presentation

This presentation provided at CA World 2014 is intended for information purposes only and does not form any type of warranty. Content provided in this presentation has not been reviewed for accuracy and is based on information provided by CA Partners and Customers.