Syoncloud Big Data for Retail Banking

Syoncloud offers a comprehensive Big Data / Data Science solution for retail banks. We cover areas such as:

Individualization of product offers to existing clients
Early fraud detection and fraud damage mitigation
Prediction of product cancellations and client defections
Optimal allocation of cash to ATMs and bank branches
Minimization of the use of expensive bank channels such as branch visits
Reliable assessment of clients for debt products

Common Datasets

Common datasets are used as a foundation for complex analysis.

Creation of Common Datasets for Analysis Related to Bank's Clients

We create a dataset of monthly expense and income categories for all clients, covering all their accounts and their complete history. This dataset is created from bank account movements, direct debits and standing orders. Each account movement is usually accompanied by a movement type code, such as electricity, phone bill or restaurant. We also use the merchant's name, description and comment fields to categorize each transaction. Direct debits and standing orders are also accompanied by type codes.

We recognize several categories of expenses such as housing expenses (rent or mortgage), energy expenses (gas and electricity), food and household related expenses, education (schools, books, courses), car expenses (fuel and repairs), restaurants, big ticket items (TV, furniture), taxes, recreation and hobby, credit card and loan payments, luxury items and so on.
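The categorization described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the type codes, the keyword lists and the function names `categorize` and `monthly_totals` are hypothetical and are not taken from any real bank system.

```python
# Hypothetical mapping from movement type codes to expense categories.
TYPE_CODE_CATEGORIES = {
    "ELEC": "energy",
    "GAS": "energy",
    "RENT": "housing",
    "FUEL": "car",
    "REST": "restaurants",
}

# Hypothetical keywords searched in merchant name, description and comment.
KEYWORD_CATEGORIES = {
    "mortgage": "housing",
    "school": "education",
    "bookstore": "education",
    "pizza": "restaurants",
    "furniture": "big ticket items",
}


def categorize(movement):
    """Return a category for one account movement (a dict)."""
    # Prefer the type code when the bank supplies a known one.
    code = movement.get("type_code")
    if code in TYPE_CODE_CATEGORIES:
        return TYPE_CODE_CATEGORIES[code]
    # Otherwise fall back to keyword matching on the free-text fields.
    text = " ".join(
        movement.get(field, "") for field in ("merchant", "description", "comment")
    ).lower()
    for keyword, category in KEYWORD_CATEGORIES.items():
        if keyword in text:
            return category
    return "uncategorized"


def monthly_totals(movements):
    """Sum amounts per (client, month, category); dates are YYYY-MM-DD strings."""
    totals = {}
    for m in movements:
        key = (m["client_id"], m["date"][:7], categorize(m))
        totals[key] = totals.get(key, 0.0) + m["amount"]
    return totals
```

In practice the keyword fallback would likely need a much larger dictionary and a manual review queue for movements that stay uncategorized.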

Income categories are salaries, dividends, tax refunds, social benefits, rental income, sales and so on. Simple regression analysis of this dataset gives us overall trends for total expenses, incomes and savings as well as detailed trends for each category of incomes and expenses for each client.
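As a minimal sketch of such a trend calculation, assuming one total per month for a single client and category, an ordinary least-squares line can be fitted against the month index. The function name `linear_trend` is hypothetical.

```python
def linear_trend(values):
    """Return (slope, intercept) of a least-squares line through values,
    using 0, 1, 2, ... as the month index."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two months of data")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x
```

A positive slope for an expense category means the client's spending in that category grows month over month; the slope is expressed in currency units per month.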

Machine Learning and Predictions

We use a full range of machine learning algorithms and models to make predictions. There are two broad categories: supervised and unsupervised algorithms.

Supervised learning algorithms use historical data to learn that certain combinations and values of inputs cause certain outputs. We create models that are trained and verified on samples of historical data. Sample data can be chosen randomly, but we have seen better results if we categorize our datasets first. In the case of the customer dataset we create categories such as age, income, location based on town size, education and savings. Each category is split into brackets. For example, the age category is split into 20 five-year age brackets. We know how many customers are in each age bracket, so we can sample a certain percentage of records from each age bracket. We sample the other categories in the same way. These samples are well suited to showing which category makes the largest contribution to the overall results. For example, we can see that education makes the largest contribution to the acceptance of a certain investment product.
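The bracket-based sampling described above is a form of stratified sampling. The following sketch shows it for the age category only; the record layout (a dict with an `age` key) and the function names are assumptions made for illustration.

```python
import random
from collections import defaultdict


def age_bracket(age, width=5):
    """Map an age to the lower bound of its five-year bracket, e.g. 37 -> 35."""
    return (age // width) * width


def stratified_sample(customers, fraction, seed=0):
    """Sample the same fraction of customers from every age bracket."""
    rng = random.Random(seed)
    by_bracket = defaultdict(list)
    for customer in customers:
        by_bracket[age_bracket(customer["age"])].append(customer)
    sample = []
    for bracket in sorted(by_bracket):
        members = by_bracket[bracket]
        # Keep at least one record so that small brackets stay represented.
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample
```

The same approach applies to the income, location, education and savings categories by swapping the bracket function.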

Unsupervised machine learning algorithms look for unknown patterns in available data. For example, we look for patterns of unusual client behaviour to find early signs of fraud. In the past we were limited to statistical analysis of behaviour that was common to all clients or to large groups of clients. With unsupervised learning models we can find patterns that surface in only a small number of records.

Individualization of Product Offers

Individualizing product offers to existing clients saves banks money on expensive broad marketing campaigns for bank products. Products are offered only to customers who need them and are likely to accept them, so customers see fewer irrelevant offers. This requires deep knowledge of who accepted given products in the past.

Syoncloud Big Data for Retail Banking | Syoncloud 14/10/2013

http://www.syoncloud.com/Syoncloud_Big_Data_for_Retail_Banking

As an input for our models we use a dataset of subscriptions to bank products and services for each client. This dataset includes previous subscriptions and cancellation dates. We also use the common dataset of income and expense categories for each client and CRM data about clients. We have created separate models for each product and subscription. In order to prepare suitable models we have to not only choose and verify the best learning algorithm but also find which categories and variables have the biggest influence.

Early fraud detection and fraud damage mitigation

This includes detection of identity fraud, credit card fraud, wire fraud, attacks on internet and mobile banking and money laundering. New types of fraud and new schemes require flexible and fast detection algorithms. In the past banks used only statistical and rule-based algorithms to find out whether suspicious activity was taking place on a customer's account. These algorithms were limited because they can only recognize known frauds, they require expensive maintenance, they do not work with the full history of each client and they have a high rate of false positives.

We utilized a dataset of known fraud cases. We have created several categories of these frauds such as overdraft fraud with stolen identity, stolen credit card, consumer loan fraud, credit card top-up with a fraudulent check, stolen checks, skimming with card duplication, attacks on online banking with stolen customer credentials and/or security devices, rogue online merchant frauds using credit cards and so on. We use neural networks with backpropagation, decision tree algorithms and classification to find patterns and unknown occurrences of these frauds in our existing data.

Prediction of Product Cancellations and Client Defections

Prediction of bank product cancellations and client defections is very time sensitive. The bank has just days to act before a client irreversibly decides to cancel a product or move to the competition. The bank needs to identify clients who are likely to defect, contact them and proactively offer alternative products or solve the client's issues. It is much cheaper to retain highly profitable clients than to win them back.

We have used account movements, debit and credit card movements, the client dataset from CRM, the product subscription dataset, call centre and branch visit transactions and log information as primary data sources for our analysis. We have also utilized the common datasets of incomes and expenses.

We have prepared time series of key events for clients who cancelled products or left the bank, such as direct debit cancellations, income to the account from salaries, dividends and rents, transfers to the client's accounts at other banks, call centre and branch contacts made by the client separated into categories, cancellations of credit cards and so on.

We have prepared another set of clients who match the first group in categories such as age, income, savings and location for the same time interval but who still remain clients. We have prepared matching time series for these clients as well.

Based on this data we were able to create models that predict the behaviour of clients before they irreversibly decide to move to competitors. We have used several supervised learning algorithms such as Support Vector Machines for binary classification and Neural Networks with Backpropagation for predictions. From unsupervised machine learning algorithms we have utilized K-Means and Mean Shift Clustering after Principal Component Analysis was applied to reduce the dimensions of the input data.

We have identified several hundred profitable clients in recent data who match the patterns of clients who moved their accounts to competitors. These clients should be contacted by their respective bank branches.

Optimal Allocation of Cash for ATMs and Bank Branches

Demand for cash is highly variable during the year at many ATMs and bank branch locations. The variability is caused by weather, local events, vacations, tourism and so on. It is important to predict the right amount of cash that needs to be deposited into ATMs as well as bank branches. It is costly to service ATMs too often, and it is also costly to have cash machines out of order due to a lack of cash. At the same time we want to limit the amount of unnecessary cash that is stored for long periods in ATMs and bank branches, because it leads to suboptimal cash allocation and attracts crime.

As the primary datasets we have used ATM service logs, geographic locations of ATMs and bank branches, the withdrawals dataset for each ATM, weather reports for ATM and bank branch locations, and schedules of sports, cultural or other events as well as holidays for all locations. We have utilized credit and debit card movements to assess demand for cash at various locations and during different times of the year. We have used the common datasets of incomes to see when salaries, social benefits and other incomes arrived in clients' accounts at different locations.

We have created a dataset of median amounts of cash withdrawals for each day of the year and hour of the day for all ATMs. This dataset is used to calculate the influence of weather, events, day of the week or holidays on demand for cash at a given location.
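One possible way to build such a median dataset is sketched below. It assumes that withdrawals are first summed per ATM and calendar hour and that the median is then taken across years for the same day of the year and hour; the text does not specify this, and the tuple layout and function name are hypothetical.

```python
from collections import defaultdict
from datetime import datetime
from statistics import median


def median_withdrawals(withdrawals):
    """Return {(atm_id, day_of_year, hour): median hourly cash total}.

    Each withdrawal is a tuple (atm_id, ISO timestamp, amount). Amounts are
    first summed per ATM and calendar hour, then the median is taken across
    years for the same day of year and hour of day.
    """
    hourly_totals = defaultdict(float)
    for atm_id, timestamp, amount in withdrawals:
        t = datetime.fromisoformat(timestamp)
        hourly_totals[(atm_id, t.year, t.timetuple().tm_yday, t.hour)] += amount

    per_slot = defaultdict(list)
    for (atm_id, _year, day, hour), total in hourly_totals.items():
        per_slot[(atm_id, day, hour)].append(total)
    return {slot: median(totals) for slot, totals in per_slot.items()}
```

Note that the day-of-year index shifts by one after February in leap years, so a production version would probably align days by calendar date or by weekday instead.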



We have prepared a dataset of significant cultural, sports and other events during the past 4 years with location coordinates. We have calculated the influence of each event on cash demand for all ATMs that are within a 300m radius of the given event. We were able to sort all events by their influence on cash demand. This dataset is used to predict the influence of similar events.
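The radius selection can be sketched with the haversine great-circle distance, as shown below. The record layout (dicts with `id`, `lat` and `lon`) and the simple relative-change measure in `event_uplift` are assumptions for illustration; the text does not state how the influence was actually quantified.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two (lat, lon) points."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlambda = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))


def atms_near_event(event, atms, radius_m=300):
    """Return ids of ATMs within radius_m of the event location."""
    return [
        atm["id"]
        for atm in atms
        if haversine_m(event["lat"], event["lon"], atm["lat"], atm["lon"]) <= radius_m
    ]


def event_uplift(event_day_demand, median_demand):
    """Relative change of cash demand on the event day versus the median."""
    return (event_day_demand - median_demand) / median_demand
```

Sorting events by the average uplift of their nearby ATMs would give a ranking of the kind described above.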

We have also calculated the correlation between local weather parameters, such as precipitation, temperature and wind at the location of each ATM, and cash demand.

We have created a correlation dataset between the days when clients receive incomes, such as salaries and social benefits, and cash demand at different locations.
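The correlations mentioned in the two paragraphs above could be computed, for example, with the Pearson coefficient. The text does not name the correlation measure used, so this choice is an assumption, as is the function name.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)
```

Here `xs` might hold daily precipitation at an ATM location and `ys` the deviation of that day's cash demand from the median; a value near -1 would suggest that rain suppresses withdrawals at that ATM.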

We have prepared models that can predict cash demand for each day of the year for each ATM and bank branch location. These models take into account results from the historical datasets as well as weather forecast data and schedules of events. We have utilized algorithms such as Restricted Boltzmann Machine, Perceptron and Gaussian Discriminant Analysis.

Minimize Use of Expensive Channels

We can minimize the use of expensive bank channels such as over-the-counter operations and other visits to bank branches as well as calls to call centres.

This can be achieved by optimizing online banking and mobile banking applications, help pages and wizards as well as the web pages on the bank's websites. Another way to encourage reluctant clients to switch to cheaper channels is targeted campaigns.

Our primary sources of data for analysis were web log files from the online banking application as well as mobile banking applications. We have also used bank account movements with codes of bank channels, the dataset of call centre transactions, the CRM dataset with information about customers and the dataset of transactions from bank branches.

An important dataset was complaints and enquiries from the call centre, emails, letters and branches. We have sorted this dataset by areas of interest and correlated it with help web pages. We were able to identify help pages that were unclear and caused confusion and unnecessary calls to the call centre. We have also identified several operations in online banking that were complex and generated a higher number of complaints. We have uncovered several areas related to exchange rates during credit card payments that were not covered by help pages but were often discussed over the phone or even during bank branch visits. Changes made to web pages related to bank products, self-help pages, search optimization, online banking operations and mobile banking applications can bring quick savings on outsourced call centres and bank branch visits.

We have analysed results from marketing campaigns to move reluctant clients to online and mobile banking or self-service kiosks. We have used correlation analysis and we have seen that broad marketing campaigns were not efficient. We have analysed the patterns of bank clients who recently moved most of their operations online. This gave us a tool to select the portion of clients who are more likely to move online. These customers should be targeted by personalized marketing campaigns or by a demonstration of the advantages at bank branches.

Assessment of Clients for Debt Products

In order to reliably assess risks and approve debt products for existing clients we need to take into account not just the current credit scores and current disposable income of the clients but also the complete history of each client as well as the social context. This decreases risk for the bank and increases income from valuable clients who would otherwise be rejected.

As a primary source of data we have used the common dataset of incomes and expenses, the complete history of payment morale for credit cards, consumer loans, mortgages, overdrafts and other debt products, and CRM information about clients.

We have used a Markov Chain stochastic process to assess the debt and payment morale related behaviour of clients. This model was tested on historical data of profitable and defaulted loans, credit cards and other debt products. We have observed improved reliability of credit scores and we were able to suggest suitable alternative debt products for rejected clients.
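As a rough illustration of the Markov chain idea, assuming each client's payment history is reduced to a monthly sequence of states, transition probabilities can be estimated by counting observed transitions. The state names (`on_time`, `late`, `default`) and both function names are hypothetical; the text does not describe the actual state space.

```python
from collections import Counter, defaultdict


def transition_matrix(histories):
    """Estimate P(next state | current state) from observed state sequences."""
    counts = defaultdict(Counter)
    for history in histories:
        for current, nxt in zip(history, history[1:]):
            counts[current][nxt] += 1
    return {
        state: {nxt: n / sum(following.values()) for nxt, n in following.items()}
        for state, following in counts.items()
    }


def step(distribution, matrix):
    """Advance a probability distribution over states by one month."""
    result = defaultdict(float)
    for state, p in distribution.items():
        # States never observed with a successor are treated as absorbing.
        for nxt, q in matrix.get(state, {state: 1.0}).items():
            result[nxt] += p * q
    return dict(result)
```

Applying `step` repeatedly to a client's current state gives an estimate of the probability of default within a chosen number of months.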

Overview of Primary Datasets and Sizing Example

These are examples of primary datasets and sizing calculations. Each project is specific and not all datasets are available, but data sizing calculations are likely to be similar.

Account movements for all active and former clients. This dataset includes the complete history of account movements for all current and savings accounts. It contains 6 million unique clients and 23 million active and closed accounts. The average size of movements per account is 1MB, which gives us 23TB of uncompressed, denormalized CSV files.



The dataset of debit and credit card movements contains 25 million unique card IDs. We have on average 3 thousand transactions per card number, so the total number of records is 75 billion. Each record in uncompressed CSV form takes 1kB. The total size of this dataset is 75TB.

Technical log files from internet and mobile banking applications take 50TB. These files include front-end Apache log files as well as application logs.

Bank transactions, requests for help and complaints from the call centre. This dataset contains bank transactions, requests for help and complaints from 1 million unique customers. The average number of interactions per customer is 35 and the typical size of an interaction is 10kB. The total size of the dataset is 350GB.

CRM information about clients with historical values includes personal information about customers such as employment, education, age and family status. The dataset includes current and historical information for about 6 million clients, with a typical size of 100kB per client. The total size is 600GB.

Direct debits and standing orders of bank clients with historical values. The typical number of standing orders and direct debits per client, including historical values, is 50. The size of a single record is 1kB. The total size of the dataset for 6 million clients and 50 records per client is 300GB.

Product subscription data for all clients with complete history. The typical number of current and historical subscriptions per client is 12. This includes accounts, mortgages, loans, credit cards and other bank products. 6 million clients multiplied by an average of 12 subscriptions per client and by 1kB per subscription gives 72GB.

Customer data from branch visits. This dataset includes over-the-counter bank transactions, help requests, product subscriptions and cancellations, and complaints. The typical number of interactions per client is 10, although there are large differences in the use of branch services among clients. With 3 million clients, 10 interactions per client and 10kB per interaction, the size is 300GB.

Dataset of debtors and dataset of failed applications for debt products. The total size of the 1 million records in these datasets is 1GB.

Help file usage from mobile and internet banking. 6 million users multiplied by an average of 1000 clicks on help files and by an average record size of 1kB gives 6TB.

The total size of all primary datasets is approximately 156TB. The result is calculated as a simple sum: 75TB + 50TB + 23TB + 6TB + 600GB + 350GB + 300GB + 300GB + 72GB + 1GB = 155.6TB, or roughly 156TB. We can reduce the overall size by using compression and by removing technical fields that do not carry any business meaning from the datasets. Log files are also reduced by removing lines with no business meaning.
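The sizing arithmetic above can be reproduced with a short script. The record counts and record sizes come from the dataset descriptions; the use of decimal units (1TB = 1000GB) and the dictionary keys are assumptions.

```python
# Decimal storage units expressed in kB (assumption: 1 TB = 1000 GB).
KB = 1
MB = 1000 * KB
GB = 1000 * MB
TB = 1000 * GB

sizes_kb = {
    "account movements": 23_000_000 * 1 * MB,          # accounts x 1 MB each
    "card movements": 25_000_000 * 3_000 * 1 * KB,     # cards x transactions x 1 kB
    "technical logs": 50 * TB,
    "call centre": 1_000_000 * 35 * 10 * KB,           # customers x interactions x 10 kB
    "CRM": 6_000_000 * 100 * KB,                       # clients x 100 kB
    "direct debits and standing orders": 6_000_000 * 50 * 1 * KB,
    "product subscriptions": 6_000_000 * 12 * 1 * KB,
    "branch visits": 3_000_000 * 10 * 10 * KB,         # clients x interactions x 10 kB
    "debtors and failed applications": 1 * GB,
    "help file usage": 6_000_000 * 1_000 * 1 * KB,     # users x clicks x 1 kB
}

total_kb = sum(sizes_kb.values())
print(f"{total_kb / TB:.3f} TB")  # prints "155.623 TB", which rounds to 156 TB
```

Keeping the per-dataset counts in one place makes it easy to re-run the estimate when a project has different client numbers or lacks some datasets.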

Implementation Steps

Isolation of sensitive data from Big Data analytics

In order to isolate Big Data analytics from sensitive data we remove clients' names, addresses, telephone numbers and emails during data export processes.

The next step is to create a process that replaces real credit and debit card numbers, account numbers and customer IDs with randomly generated numbers. These randomly generated numbers must be identical for the same entity across different datasets to enable analytics. This process stores pairs of matching real numbers and randomly generated numbers in tables. These tables are stored in a separate secure relational database that is continuously updated. This database is also used to match randomly generated numbers with real numbers after Big Data analyses are performed. This isolates data scientists and administrators from sensitive information, which is accessible only to authorized bank employees.
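The two isolation steps can be sketched as follows. This is a simplified illustration: the mapping is kept in memory here, whereas the text stores it in a separate secure relational database, and the field names, the 16-digit token format and the names `strip_sensitive` and `Pseudonymizer` are all assumptions.

```python
import secrets

# Hypothetical names of fields that are dropped during export.
SENSITIVE_FIELDS = ("name", "address", "phone", "email")


def strip_sensitive(record):
    """Return a copy of the record without directly identifying fields."""
    return {k: v for k, v in record.items() if k not in SENSITIVE_FIELDS}


class Pseudonymizer:
    """Keeps a two-way mapping between real identifiers and random numbers."""

    def __init__(self):
        self._real_to_token = {}
        self._token_to_real = {}

    def tokenize(self, real_id):
        """Return the random number for real_id, creating one on first use."""
        if real_id not in self._real_to_token:
            token = f"{secrets.randbelow(10**16):016d}"
            # Regenerate in the unlikely case of a collision.
            while token in self._token_to_real:
                token = f"{secrets.randbelow(10**16):016d}"
            self._real_to_token[real_id] = token
            self._token_to_real[token] = real_id
        return self._real_to_token[real_id]

    def resolve(self, token):
        """Map a random number back to the real identifier (authorized use only)."""
        return self._token_to_real[token]
```

Because the same `Pseudonymizer` instance is used for every dataset, a card number receives the same replacement everywhere, which keeps joins across datasets possible.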

Extraction, Transformation and Loading of Primary Datasets

We have an initial ETL (Extraction, Transformation and Loading) of data and continuous processes of daily or hourly updates and imports of recent data from the production systems of the bank.

Initial extraction was performed by the bank's production and backup systems. Data was extracted in denormalized text form in CSV or fixed-length field formats. This form is ideal for bulk uploads into Big Data systems. The denormalized form uses concrete values instead of reference IDs as in relational databases.

Continuous data exports are channeled via JMS, MQ Series, CSV files and via Sqoop. Exported data are picked up by Big Data scripts such as Pig or Hive. These scripts are triggered via Oozie processes.

Transformation of Input Data

Transformation rules and scripts are shared by the initial and continuous ETL processes. We have used Pig and Hive scripts and UDFs (User Defined Functions) written in Java to perform transformation steps. Oozie workflows were used to chain transformation steps.

We have used several practical rules for data transformations:

Various file formats are separated into their own directories inside HDFS (Hadoop file system).
Unprocessed and failed records are written into specific directories for manual investigation.
Intermediate result files are deleted only after all transformation steps have completed successfully. This saves HDFS space while still making it possible to investigate and re-run incomplete transformations.
Pig and Hive scripts are kept simple and single purpose. This enables easy debugging and re-use.
Java UDFs are used only if a given function is not available in the standard library or in the PiggyBank library.
Transformation scripts are reused for processing updates.
