Case Study: Increasing Produban's Critical Systems Availability and Performance

Vitor Sousa
Director, Monitoring Tools and Processes, Produban

CA Opscenter
OCX15S #CAWorld

© 2014 CA. ALL RIGHTS RESERVED.

Abstract

The Santander Group is a Spanish banking group and the largest bank in the Eurozone by market value. It is also one of the largest banks in the world in terms of market capitalization. Produban is the Santander Group company responsible for the Group's entire IT infrastructure. Produban's challenge was to monitor, proactively and in real time, all transactions running in certain critical systems, and to be able to act before major problems occur. Given this scenario, Produban adopted CA Core APM (Introscope) to obtain alerts that let the technical team detect problems before they impact the business. Produban also uses CA Core APM to create dashboards that make it easier to identify when thresholds are reached and that help the operations team take action to normalize the situation. With these measures, Produban reduced its MTTR from days to hours while greatly increasing its visibility into critical IT services.

Vitor Sousa

Produban

The Santander Group


Agenda

1 ABOUT THE SPEAKER
2 COMPANY OVERVIEW
3 CHALLENGES FROM A NEW WAY OF THINKING ABOUT APPLICATION MONITORING
4 THE PROJECT
5 THE SOLUTION AND RESULTS WITH CA APM


Produban

Vitor Sousa Director Monitoring Tools and Processes

Produban Brazil – Santander Group

+5511 96192-5194

Background

BS in Economics, postgraduate degree in Systems Administration and MBA in Finance; almost 20 years in the IT market; experienced in several IT areas:

IT Solutions and Sales

IT Processes

Infrastructure Management

Software Development (focused on Infrastructure Monitoring)


Santander Group

Founded in 1857 in Santander, Spain

Strong presence in 10 major countries in Europe and the Americas, with businesses in over 40 countries

The largest bank in the Eurozone and one of the largest in the world

Commercial bank


Produban Company

Produban manages and controls the entire IT infrastructure of the Santander Group:
– Retail Banking
– Global Units
– Corporate Units

Established on May 1, 2005

100 percent owned by the Group


Mission

Provide unified, standardized production management for the Santander Group's financial entities and establish the Infrastructure Group.

Based on:

Production management excellence
– Efficiency
– Service quality
– Operational risk

Adding value to the business
– Time-to-market
– Flexibility


Produban – Subsidiaries and Branches

More than 5,000 professionals


Produban – Major Customers

Produban provides services to more than 120 financial institution groups.


Infrastructure Group – Data Center

UK: Carlton Park (3,000 m²), Shenley Wood (2,500 m²), Bletchley (1,950 m²)
ES: Boadilla (3,900 m², 2 × 1,950 m²), Cantabria (6,000 m², 2 × 3,000 m²)
BR: Campinas (3,600 m²)
MX: Querétaro (3,000 m²)


Infrastructure Group – Private Network


Infrastructure Group – Processing

Volumetric Processing Equipment

More than 28,000 physical servers
More than 56,000 logical servers
More than 22,000 databases


Volumetric Processing

106.6 million retail banking customers
11.6 million active Internet customers
2.6 million mobile banking customers
30 million credit cards
80 million debit cards
30 million contact center calls per month
5,000 million transactions per month
67 million card transactions on the peak day
9.6 million batch executions per month
16.7 million payments per day



The arrival of a new Executive Officer (Enrique Sanchez) brought new ideas and encouraged the team to think differently.

He gave us back the power to seek new solutions better suited to the needs of modern IT.

A change of mindset about monitoring:
– Monitoring focused much more on automation and proactivity
– Develop views related to "service health"
– Focus on improving team productivity and assertiveness


Challenges

1 Decrease the number of incidents caused by applications.

Incidents by number of alerts: Alarmed 25%, Not alarmed 75%
Incidents without alerts, by reason: Application 64%, Business rules 14%, Items not monitored 22%
(September 2013)

2 A new model of monitoring applications with greater productivity and efficiency, using dashboards for simpler and easier monitoring.

3 Improve proactive and real-time monitoring, so that technical teams can detect problems before they impact services.

4 Improve threshold management, considering changes in application behavior and false positives.


The Project Milestones and Time

Milestones, in order:
– Project kick-off
– Environment stabilization
– Improved performance
– Script creation to optimize performance
– Scope change: focus on module generator automation
– Requirements and process definition
– Module generator development
– Dynamic threshold definitions
– Go to production

Timeline axis: 12/12, 1/13, 2/13, 3/13, 4/13, 5/13, 6/13, 7/13, 8/13

Gabriel Mochnacs Arruda, responsible for the monitoring team, Produban Brasil

Plinio Augusto Moreira, CA Technical


1 Challenge: Decrease the number of incidents caused by applications.

Goal: Decrease the development time of new "application monitoring plans."

Solution: Automate the construction of new application monitoring services, based on CA APM.

Results: This solution has been used in preproduction and production for the systems Portal CIC Cuentas, Portal CIC Cards and Norkom (Risk Manager) since September 2013. The number of application incidents for these systems fell by 66 percent, and troubleshooting time dropped roughly tenfold.


2 Challenge: Create an automated process to identify new services and new applications within existing services.

Goal: Keep the environment always up to date with new servers and applications, using automated tools.

Solution: Connect to the WebSphere® Deployment Manager (DMGR) to learn of new functions or new application servers in the environment.

Results: After implementing this connection with the DMGR, the number of new applications or servers deployed without being monitored fell to zero for the systems Portal CIC Cuentas, Portal CIC Cards and Norkom.


3 Challenge: Establish a new application monitoring model with greater productivity and efficiency, making more use of dashboards.

Goal: Improve troubleshooting response time for application and infrastructure events, with greater assertiveness.

Solution: Automate the construction of new CA APM dashboards so that the support and monitoring teams can view them easily.

Prerequisites: Meet with the application architecture team to understand how the system works, the most important points to monitor and the boundaries of the application (input and output flows). Create a new CA APM template if the monitored application does not fit the existing models in our library.

Results: 264 dashboards created in five minutes. Effort to create them without the Module Generator: 270 hours, or 33 workdays.


Technical Details

Module generator
– Creates automated dashboards
– Shows the application's path through an application server
– Presents the health of Java components, front-ends, back-ends and JVM resources

Developed flow (diagram elements: DMGR, App Server, APM, Process, Template, Thresholds, Dashboard)
– Create a systematic connection.
– Create an engine.
– Template with information.
– Create standard template images.
– Return processed data to the APM.


Modules Generator – Diagram

Diagram components: XML DMGR, Template, Modules Generator (Java application), HSQL database, web service, APM server, dashboard created.

Diagram annotations: direct connection between the application and the DMGR for reading XML; daily routine for storing thresholds.


Modules Generator – Components

Slide icons: XML DMGR, Template, HSQL, ZABBIX

XML DMGR: Communication between the Modules Generator and the WebSphere Deployment Manager. The Modules Generator reads the serverindex.xml file, which describes how applications are distributed across AppServers. This file is the input for generating the first module, so the Modules Generator must be able to communicate with the DMGR to consume the XML.

Template: Preconfigured APM module with the list of Metric Groups, Alerts and Dashboards to be created. All items in this module contain variables that are used by the Modules Generator.

HyperSQL Database: Database embedded in the application; no installation is necessary. It stores the thresholds, supports their analysis and is used to update these values in the APM module.

XML Verification Routine: Monitors serverindex.xml. Whenever the file changes, a new module must be generated to update the information in the dashboard.

Thresholds Recording Routine: Daily routine that records the data calculated by the Modules Generator in the database. The routine writes the data from the previous day.
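The first two components can be illustrated with a minimal sketch. The Modules Generator itself is a Java application; Python is used here only for brevity. The sample XML is a simplified, assumed layout (real serverindex.xml files carry namespaces and many more attributes), and the template text, server names, application names and function names are all hypothetical.

```python
import xml.etree.ElementTree as ET
from string import Template

# Simplified, assumed sample of serverindex.xml content (illustrative only).
SAMPLE_SERVERINDEX = """\
<ServerIndex hostName="node01">
  <serverEntries serverName="AppServer_A" serverType="APPLICATION_SERVER">
    <deployedApplications>AppOne.ear/deployments/AppOne</deployedApplications>
  </serverEntries>
  <serverEntries serverName="AppServer_B" serverType="APPLICATION_SERVER">
    <deployedApplications>AppTwo.ear/deployments/AppTwo</deployedApplications>
    <deployedApplications>AppThree.ear/deployments/AppThree</deployedApplications>
  </serverEntries>
  <serverEntries serverName="nodeagent" serverType="NODE_AGENT"/>
</ServerIndex>
"""

# Hypothetical template line; ${app} and ${server} stand in for the
# variables that the generator replaces for each application.
METRIC_GROUP_TEMPLATE = Template(
    "metric_group=${app}_ResponseTime server=${server}"
)


def applications_by_server(xml_text):
    """Return {server name: [application names]} for application servers."""
    root = ET.fromstring(xml_text)
    distribution = {}
    for entry in root.iter("serverEntries"):
        # Skip node agents and other non-application entries.
        if entry.get("serverType") != "APPLICATION_SERVER":
            continue
        apps = [
            element.text.strip().rsplit("/", 1)[-1]
            for element in entry.findall("deployedApplications")
        ]
        distribution[entry.get("serverName")] = apps
    return distribution


def render_metric_groups(distribution):
    """Fill the template once per (server, application) pair."""
    return [
        METRIC_GROUP_TEMPLATE.substitute(app=app, server=server)
        for server, apps in sorted(distribution.items())
        for app in apps
    ]


if __name__ == "__main__":
    for line in render_metric_groups(applications_by_server(SAMPLE_SERVERINDEX)):
        print(line)
```

Keeping the parsing step separate from the rendering step means the same application distribution could feed several templates (Metric Groups, Alerts, Dashboards) without reading the XML again.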


Main Flow Routine

– Install the APM agent in the application that will be monitored.
– Communication with the WebSphere DMGR: collect information about the applications and App Servers that are running, through the Server Index XML (serverindex.xml).
– Run the Modules Generator, Phase 1: create Metric Groups and Alerts.
– Deploy the .jar (created by the APM Modules Generator).
– Daily data collection routine: needed to derive the thresholds and to identify the application's operating hours and possible deviations.
– Run the Modules Generator with the application's thresholds.
– Create the .jar file to deploy in APM.
– Dashboards are created.

Mandatory parameters
– APM hostname and communication port
– User with access to the tool
– Deployment Manager address
– serverindex.xml path on the server
– Ensure communication between the Modules Generator and the DMGR.
– Schedule the .jar execution routine on the server; this process records historical data in the HSQL database.
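The mandatory parameters listed above could be captured and checked before a run, as in the following minimal sketch. The class name, field names and example values are hypothetical; the source does not describe how the Modules Generator actually stores its configuration.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class GeneratorConfig:
    """Hypothetical container for the mandatory parameters on the slide."""
    apm_host: str          # APM hostname
    apm_port: int          # APM communication port
    apm_user: str          # user with access to the tool
    dmgr_address: str      # Deployment Manager address
    serverindex_path: str  # serverindex.xml path on the server

    def missing(self):
        """Return the names of mandatory parameters that are empty or unset."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]

    def validate(self):
        """Raise ValueError if any parameter is missing or the port is invalid."""
        missing = self.missing()
        if missing:
            raise ValueError("missing mandatory parameters: " + ", ".join(missing))
        if not 0 < self.apm_port < 65536:
            raise ValueError("apm_port must be between 1 and 65535")


if __name__ == "__main__":
    config = GeneratorConfig(
        apm_host="apm.example.com",        # example value, not from the source
        apm_port=5001,                     # example value, not from the source
        apm_user="monitor",
        dmgr_address="dmgr.example.com",
        serverindex_path="/path/to/serverindex.xml",
    )
    config.validate()
    print("configuration is complete")
```

Failing fast on an incomplete configuration avoids generating a module from a DMGR that cannot be reached or an XML path that was never set.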


What is monitored?

JVM: CPU; garbage collector (Java memory manager)
Java: Servlets (XML/HTML translator); JSP (JavaServer Pages); EJB (Java engine); JMS (Java message processor)
AppServer: thread pool; connection pools
Backends: queries; connection count; MQ; web services
Frontends: URLs; application

Transaction time: response time, freezing, number of calls and errors are monitored.

Information from PMI and JMX

Metric groups: Application


Setting Alerts – Metric Groups

Metric groups are grouped metrics that allow information to be gathered for one or more applications, or for one or more Java components.

Metric groups are used to define the alerts and to track the health of the grouped application or component.

They are defined by regular expressions that "match" the information displayed.

Metric Groups Example
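As an illustration of how a regular expression can define a metric group, the sketch below matches patterns against a handful of metric paths. The paths, group names and patterns are hypothetical and only imitate an Introscope-like "resource|...:metric" naming style; they are not taken from Produban's configuration.

```python
import re

# Hypothetical metric paths in a "resource|...:metric" style.
METRIC_PATHS = [
    "AppServer_A|Servlets|LoginServlet:Average Response Time (ms)",
    "AppServer_A|Servlets|LoginServlet:Stall Count",
    "AppServer_A|JSP|home_jsp:Average Response Time (ms)",
    "AppServer_B|Servlets|TransferServlet:Average Response Time (ms)",
    "AppServer_B|Backends|AccountsDB:Average Response Time (ms)",
]

# Hypothetical metric groups: one regular expression per group.
METRIC_GROUPS = {
    "Servlet response time (all servers)":
        r".*\|Servlets\|.*:Average Response Time \(ms\)",
    "AppServer_A stalls":
        r"AppServer_A\|.*:Stall Count",
}


def members(pattern, paths):
    """Return the metric paths that the group's pattern matches in full."""
    regex = re.compile(pattern)
    return [path for path in paths if regex.fullmatch(path)]


if __name__ == "__main__":
    for name, pattern in METRIC_GROUPS.items():
        print(name)
        for path in members(pattern, METRIC_PATHS):
            print("  " + path)
```

Because membership is decided by pattern rather than by an explicit list, a newly deployed servlet whose path fits the expression joins the group without any manual change.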


Setting Alerts – Metrics Signature

A metric signature is a combination of several Metric Group types that indicates an application's most common problems.

Example – Application Bottleneck
– Increase in average response time
– Increase in concurrent invocations
– Increase in stall count
– Fewer threads available
Result: possible bottleneck in the application

When the above condition is true, an alert is sent to the front-end with the following message:

Application 01 in AppServer_ServerName presenting bottleneck symptoms. Click the link http://XPTO.com.br to view the corresponding Dashboard.

Integration with other monitoring systems, such as Alert Modeling
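The bottleneck example can be sketched as a simple rule. This is an illustration only: it assumes the signature fires when all four symptoms occur together relative to a baseline sample, which the slide implies but does not spell out, and the class, field and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """Hypothetical snapshot of the four metrics used by the signature."""
    avg_response_ms: float
    concurrent_invocations: int
    stall_count: int
    threads_available: int


def bottleneck_signature(baseline, current):
    """Return True only when all four symptoms from the example occur together."""
    return (
        current.avg_response_ms > baseline.avg_response_ms
        and current.concurrent_invocations > baseline.concurrent_invocations
        and current.stall_count > baseline.stall_count
        and current.threads_available < baseline.threads_available
    )


def alert_message(application, server, dashboard_url):
    """Build the alert text in the format shown on the slide."""
    return (
        f"{application} in {server} presenting bottleneck symptoms. "
        f"Click the link {dashboard_url} to view the corresponding Dashboard."
    )


if __name__ == "__main__":
    baseline = Sample(200.0, 10, 0, 40)
    current = Sample(900.0, 35, 4, 5)
    if bottleneck_signature(baseline, current):
        print(alert_message("Application 01", "AppServer_ServerName", "http://XPTO.com.br"))
```

Requiring every symptom at once is what separates a signature from a single-metric alert: a slow response on its own does not fire it.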


Dashboard


4 Challenge: Improve threshold management, considering changes in application behavior and false positives.

Goal: Decrease or eliminate "false positives" in monitoring events caused by threshold deviations.

Solution: Create the concept of dynamic thresholds, based on historical occurrences and configured automatically.

Results: False-positive application alerts decreased by 77 percent. Proactive monitoring:
– Thresholds adjusted to alarm before a problem becomes an incident
– Trend analysis and detection of deviations in application behavior
– Alert accuracy and automated threshold updating
– Threshold validation mechanism based on application history
– Input information for the application capacity process
– Thresholds calculated for all active Metric Groups in CA APM's Module Manager


Technical Details – Dynamic Thresholds

Diagram elements (Modules Generator Java application): APM metrics, threshold calculations, data stored, threshold checking, upgraded module

– Create a new database to store the indicators' historical data.
– Create an automatic extraction of observations to feed the database with item occurrences.
– Develop logic to identify the thresholds' "optimal point."
– Implement a new loading process in CA APM for when the need for a new threshold is identified.
– Create a new flow to validate thresholds and the level of "false positive" rates.
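The source does not state how the "optimal point" is computed. Purely as an illustration, the sketch below assumes a common statistical rule, the mean plus k standard deviations of the historical daily values, and flags an update when the suggestion drifts from the current threshold by more than a tolerance. The function names, k and the tolerance are all assumptions.

```python
from statistics import mean, pstdev


def suggest_threshold(daily_values, k=3.0):
    """Suggest a threshold as mean + k standard deviations (assumed rule)."""
    if len(daily_values) < 2:
        raise ValueError("need at least two observations")
    return mean(daily_values) + k * pstdev(daily_values)


def needs_update(current_threshold, suggested, tolerance=0.10):
    """Return True when the suggestion differs by more than the tolerance."""
    return abs(suggested - current_threshold) > tolerance * current_threshold


if __name__ == "__main__":
    history = [90.0, 110.0, 95.0, 105.0]  # illustrative daily values
    suggested = suggest_threshold(history)
    print(f"suggested threshold: {suggested:.1f}")
    print("update needed:", needs_update(100.0, suggested))
```

A tolerance band keeps the loading process in CA APM from being triggered by small day-to-day fluctuations in the calculated value.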


Database Thresholds Example

– Requires a monthly validation process to determine whether the registered thresholds remain appropriate or need updating.
– Data for analysis always comes from the last two months.
– The Modules Generator provides statistical data to help the analysis.
– Data can be exported to a .csv file.

Columns shown: System name, Metric group, Daily values, Statistical data, Current thresholds, Updated thresholds
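The recording, two-month analysis window and .csv export can be sketched with an embedded database. The sketch uses Python's built-in sqlite3 as a stand-in for the embedded HyperSQL database, approximates "two months" as 60 days, and invents the table layout and function names; none of these details come from the source.

```python
import csv
import io
import sqlite3
from datetime import date, timedelta


def open_store():
    """Create an in-memory store (sqlite3 standing in for embedded HSQL)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE daily_values ("
        "system TEXT, metric_group TEXT, day TEXT, value REAL, "
        "PRIMARY KEY (system, metric_group, day))"
    )
    return conn


def record(conn, system, metric_group, day, value):
    """Store one daily value; rerunning the routine overwrites the same day."""
    conn.execute(
        "INSERT OR REPLACE INTO daily_values VALUES (?, ?, ?, ?)",
        (system, metric_group, day.isoformat(), value),
    )


def analysis_window(conn, system, metric_group, today, days=60):
    """Return (day, value) rows from the last `days` days, excluding today."""
    start = (today - timedelta(days=days)).isoformat()
    cursor = conn.execute(
        "SELECT day, value FROM daily_values "
        "WHERE system = ? AND metric_group = ? AND day >= ? AND day < ? "
        "ORDER BY day",
        (system, metric_group, start, today.isoformat()),
    )
    return cursor.fetchall()


def export_csv(rows):
    """Render the rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["day", "value"])
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    store = open_store()
    record(store, "ExampleSystem", "Servlet response time", date(2013, 8, 31), 130.0)
    rows = analysis_window(store, "ExampleSystem", "Servlet response time", date(2013, 9, 1))
    print(export_csv(rows), end="")
```

Keying each row on system, metric group and day means the daily routine can be rerun safely without duplicating the previous day's data.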


For More Information

To learn more about DevOps, please visit:

http://bit.ly/1wbjjqX







For Informational Purposes Only Terms of this Presentation

This presentation provided at CA World 2014 is intended for information purposes only and does not form any type of warranty. Content provided in this presentation has not been reviewed for accuracy and is based on information provided by CA Partners and Customers.
