
Crecemas: A transactional data-based big data solution to support a bank's corporate clients with their commercial decisions

Antonio Martín Cachuán Alipázaga, Willy Alexis Peña Vilca
Big Data Center of Excellence

Banco de Crédito del Perú, Lima, Perú

    {acahuan,wpena}@bcp.com.pe

    Abstract

Peru is recognized in the Latin American region as a country of entrepreneurs, so enterprises that have grown rapidly are commonly found in the market; however, a great number of them cease operations within a short time because they lack the capabilities to use data to better understand their clients and their competition. As a solution to help this kind of business remain sustainable over time, there are new ways to process and analyse the data that clients generate through the transactions they make with their credit or debit cards. This information is usually managed by banks and financial services companies, which, with the help of technologies such as Big Data and Cloud Computing that banks already use on a daily basis, can help their corporate clients achieve their goals by providing aggregated information through analytic indicators about their clients and competition.

In this article, we present the Big Data architecture of the web platform "Crecemas", which was developed in 16 weeks under agile methodologies with the Scrum framework and using Cloud Computing technologies. The platform shows KPIs computed from anonymized transactions about the company, its clients and its competitors, which helps support commercial decisions. Currently the platform handles 200 GB of information with 7 worker nodes and 3 master nodes and is used by almost 500 different companies in Peru.

    1 Introduction

The search for dynamism in the generation of employment in a country rests mainly on supporting new entrepreneurs, so that they contribute directly to innovation (Harman Andrea, 2012). All this improves the country's economy by bringing welfare to its inhabitants. The Latin America and Caribbean region is characterized by entrepreneurship, and Peru is no exception. It occupies eighth place in a group of 60 economies according to the Global Entrepreneurship Monitor, but it also has the highest failure rate (Donna Kelley Slavica Singer, 2016). One reason for this result is the low index of strategic alliances (Global Innovation Index, 2015), which means that large companies are not actively seeking to do business with smaller companies; all this has an impact on the competitiveness index of the country, where it is ranked 69th out of 140 countries (Donna Kelley Slavica Singer, 2016). In addition, most of these companies do not take advantage of the information they generate on a daily basis, because they do not have access to all the data they need or do not have the in-house capabilities required to extract strategic insights (Brynjolfsson et al., 2011).

Figure 1: Number of credit and debit cards in circulation (2015)

On the other hand, according to the Tecnocom report of 2016 (Tecnocom, 2016), the annual averages of transactions per debit and credit card are 15 and 5 respectively. This happens because the number of debit and credit cards in circulation in Peru has grown steadily over the last five years, as shown in Fig. 1, with a growth of 9.6% in debit cards. This translates into an increase in the use of payment cards. As can be seen in Fig. 2, Peru had a ratio of POS card transactions to ATM cash withdrawals of less than 0.3, which means that there is great potential for growth.

Figure 2: Ratio of value of POS card transactions to ATM withdrawals (2010-2015)

The digital transformation is a chance for organizations to get better at managing information as a strategic asset, and Big Data is a game changer that adds more sources to the mix. However, unless the lessons of the productivity paradox are applied (E. Brynjolfsson, 1994), these changes will only serve as distractions. Companies that anticipate the changing needs of the volatile marketplace and successfully implement new technologies place themselves in a great position to overcome their competitors (Earley, 2014).

Consequently, there are favorable circumstances for banks and financial services companies interested in digital transformation that are willing to use the information generated by credit and debit card users to help smaller and newer companies grow in the country's economy. To accomplish this, banks can provide aggregated information obtained from sales transactions so that these businesses can make better decisions.

    2 Literature Review

    2.1 Concepts

2.1.1 Apache Hadoop

Hadoop is open-source software for reliable, scalable, distributed computing. It provides a framework that allows the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. The project includes the following modules (The Apache Software Foundation, 2007a):

• Hadoop Common: The common utilities that support the other Hadoop modules.

• Hadoop Distributed File System (HDFS): A distributed file system that provides high-throughput access to application data.

• Hadoop YARN: A framework for job scheduling and cluster resource management.

• Hadoop MapReduce: A YARN-based system for parallel processing of large data sets (a minimal example follows this list).
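To make the MapReduce module concrete, the sketch below shows the canonical word-count job written against the Hadoop Java API. It is purely illustrative and is not part of the Crecemas codebase.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emits (word, 1) for every token in its input split.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reducer: sums the partial counts for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // local pre-aggregation
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input dir
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HDFS output dir
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}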

This distributed framework has been adopted by different vendors, such as Cloudera and Hortonworks, who have added important features, such as data governance and security compliance, which make this powerful technology attractive to enterprises. In addition, the community has played an important role; for example, some of the main tools included by the Apache Foundation are the following:

• Apache Hive: Data warehouse software that facilitates reading, writing and managing large datasets residing in distributed storage using SQL (The Apache Software Foundation, 2011).

• Apache HBase: The Hadoop database, which brings random access and real-time reading and writing to Big Data stores (The Apache Software Foundation, 2007b).

2.1.2 Cloud Services

Cloud services are applications or services offered by means of cloud computing. Nowadays, nearly all large software companies, such as Google, Microsoft and Amazon, provide this kind of service. Cloud computing has revolutionized the standard model of service provisioning, allowing the delivery over the Internet of virtualized services that can scale up and down in terms of processing power and storage. It also provides strong storage, computation and distribution capabilities to support Big Data processing. In order to achieve the full potential of Big Data, it is necessary to adopt both new data analytics algorithms and new approaches to handle the dramatic data growth. As a result, one of the underlying advantages of deploying services on the cloud is the economy of scale: by using cloud infrastructure, a service provider can offer better, cheaper and more reliable services. Cloud providers offer the following service models (Campbell et al., 2016):

• SaaS: Customers do not have control over the hardware- and software-level configurations of the consumed service.

• PaaS: The platform usually includes frameworks, development and testing tools, configuration management, and abstraction of hardware-level resources.

• IaaS: Customers can hire hardware-level resources.

• DBaaS: Database installation, maintenance and the accessibility interface are provided by a database service provider.

2.1.3 Agile Methods

Agile methods are contemporary software engineering approaches based on teamwork, customer collaboration, iterative development, and constantly changing people, processes and technology. This approach diverges from traditional methods, which are software engineering approaches based on highly structured project plans, exhaustive documentation, and extremely rigid processes designed to minimize change. Agile methods are a de-evolution of management thought predating the industrial revolution, using craft-industry principles like artisans creating made-to-order items for individual customers; traditional methods represent the amalgamation of management thought over the last century, using scientific management principles such as the efficient production of items for mass markets. Agile methods are new product development processes that can bring innovative products to market quickly and inexpensively on complex projects with ill-defined requirements, whereas traditional methods resemble manufacturing processes that can economically and efficiently produce high-quality products on projects with stable and well-defined requirements (Trovati et al., 2016).

    2.2 Related Work

Nowadays, data is being generated at an unprecedented scale. Decisions that were previously based on guesswork, or on handcrafted models of reality, can now be made using data-driven mathematical models. However, this increase in the amount of data and the variety of formats has put new challenges on the table; for example, it is more complicated to deal with this kind of data and keep processing performance high with the technology that many organizations use on a daily basis. For that reason, Big Data has the potential to revolutionize much more than just research or the batch processing of data; this technology enables analysis of every aspect of mobile services, society, retail, manufacturing, financial services, life sciences and more (Jagadish et al., 2014). In addition, to use this kind of technology successfully, organizations need an enterprise infrastructure that can support the initiative, with the goal of maintaining and running the transformation process efficiently. That said, purchasing and deploying equipment quickly matters for reducing the delivery time of the solution, and Cloud Computing is a revolutionary mechanism that is changing the way enterprises approach hardware and software design and procurement in an efficient and economical way. With this in mind, the possibility of enabling an infrastructure in more flexible environments, such as those of the cloud, makes the use of this type of technology much more attractive for providing end users with fast and useful results (Philip Chen and Zhang, 2014).

With all these benefits, Big Data technologies and Cloud Computing are a perfect combination to start this journey. Also, as mentioned in (Vurukonda and Rao, 2016), it is important to keep in mind that although the cloud is an attractive option, its biggest challenge concerns the security and regulatory issues around a company's customer data in such environments, so it also carries great challenges that are currently being worked on (Hashem et al., 2014).

    3 Proposed Solution

The solution is structured in three stages, ranging from obtaining the data directly from the internal sources, through loading it to the cloud and transforming it to calculate the indicators, to its visualization on the web (Fig. 4).

Figure 3: Dashboard from crecemas.com

Figure 4: Proposed solution diagram

The entire solution for Crecemas, http://www.crecemasbcp.com/ (Fig. 3), was developed in 16 weeks, led by a Scrum Master, with a total of 13 people dedicated exclusively to the project, where each member was assigned to one of three roles (Table 1):

• Business: Dedicated to engaging internal areas and avoiding possible business stoppers, in addition to designing the KPIs.

• Data: Responsible for the Big Data stage.

• Development: In charge of the visualization of the data and the web.

    3.1 Data Ingestion

For this stage, different information extraction processes were built to obtain the information from the Enterprise Data Warehouse and store it on a file server. These processes also perform some field and record filtering according to the requirements of the KPI construction. Regarding regulatory constraints, the main objective of these processes is to tokenize sensitive fields that cannot be stored in a cloud environment (e.g., clients' names, clients' addresses, card numbers).

In addition, each generated file has a control file used to validate the data upload process. This file contains the number of exported records and the date on which the file was processed. For this reason, data ingestion works with two files: one with the credit card transaction data (with extension .dat) and the other with the control data (with extension .ctrl).
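As an illustration of this check, the sketch below validates a .dat file against its .ctrl companion before the upload continues. The control-file layout assumed here (first line: record count; second line: processing date) and the file names are hypothetical, since the exact format is not specified in this paper.

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Stream;

public class ControlFileValidator {

  // Compares the record count declared in the .ctrl file with the
  // number of records actually present in the .dat file.
  public static boolean isConsistent(Path dataFile, Path controlFile)
      throws IOException {
    List<String> control = Files.readAllLines(controlFile, StandardCharsets.UTF_8);
    long expected = Long.parseLong(control.get(0).trim()); // record count (assumed layout)
    String processDate = control.get(1).trim();            // processing date (assumed layout)

    long actual;
    try (Stream<String> lines = Files.lines(dataFile, StandardCharsets.UTF_8)) {
      actual = lines.count();
    }

    System.out.printf("%s (%s): expected=%d actual=%d%n",
        dataFile.getFileName(), processDate, expected, actual);
    return expected == actual;
  }

  public static void main(String[] args) throws IOException {
    // Hypothetical file names, for illustration only.
    Path dat = Paths.get("transactions_20170101.dat");
    Path ctrl = Paths.get("transactions_20170101.ctrl");
    if (!isConsistent(dat, ctrl)) {
      throw new IllegalStateException("Record count mismatch; aborting upload");
    }
  }
}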

The files extracted by these processes are of the following types:

• Daily master tables, which are completely loaded.

• Daily incremental tables, which contain one day of information and are stored in order to keep a history of the data.

• Monthly master tables, which are incrementally loaded.


Business           Data                  Development
1 Product Owner    1 Big Data Architect  1 Back-end developer
1 Navigator        2 Data Engineers      2 Front-end developers
1 Researcher       2 Data Experts        1 UI Expert
                                         1 UX Expert

Table 1: Crecemas team

    Figure 5: Solution proposed for data ingestion

As can be seen in Fig. 5, the ingestion process is performed in the following way. All the orchestration of the processes in the on-premise environment is handled by the enterprise scheduling tool, which controls and executes the information extraction processes. Once these processes have finished running, a final job executes an AzCopy command (a multi-threaded tool for uploading data to the Microsoft Azure cloud environment), which is in charge of uploading the aforementioned files from the file server to the Linux servers (created to deploy the Big Data technology) in the cloud environment. Next, on these servers, another job is executed which invokes an ad-hoc client that uploads the data from Linux to HDFS and finally loads it into the Apache Hive tables used for the KPI construction. This client (Apache Hadoop and Apache Hive) was created using the Maven artifacts from Cloudera, the Big Data provider selected for this project.
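The Linux-to-HDFS step can be approximated with the standard Hadoop FileSystem API, as in the sketch below; the NameNode address and paths are hypothetical, and the real ad-hoc client built from Cloudera's Maven artifacts differs in its details.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsUploader {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Hypothetical NameNode endpoint; the real cluster address differs.
    conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020");

    try (FileSystem fs = FileSystem.get(conf)) {
      Path local = new Path("/data/staging/transactions_20170101.dat");
      Path target = new Path("/landing/transactions/transactions_20170101.dat");
      fs.copyFromLocalFile(local, target); // upload the validated .dat file to HDFS
    }
  }
}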

    3.2 Data Transformation

After the data ingestion process, the data is stored in the Hadoop ecosystem (HDFS) within a Cloudera cluster that uses 7 worker nodes and 3 name nodes (1 main and 2 for backup); its architecture can be observed in Fig. 6.

3.2.1 Landing Area

Figure 6: Data transformation process

This is the initial zone, where the data sits in HDFS exactly as it was loaded by the data ingestion process. The data can be accessed through a database composed of external tables created using HQL (Hive Query Language). First, inconsistencies in the volume of information processed by each table, and business-level inconsistencies in the values (birth date, sex, foreign characters and incongruent transactions), are reviewed. Then a data cleaning process takes place, which eliminates duplicates and replaces empty values with nulls. Finally, the data is moved to a new HDFS location and a Hive database called Tmp Transformation Area is created.
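A minimal sketch of this Landing Area logic, assuming a much-simplified schema and hypothetical table, host and user names, could look as follows: an external Hive table is declared over the ingested files, and a deduplicating load then materializes the cleaned data in the temporary transformation database.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class LandingAreaJob {
  public static void main(String[] args) throws Exception {
    Class.forName("org.apache.hive.jdbc.HiveDriver");
    try (Connection con = DriverManager.getConnection(
            "jdbc:hive2://edge.example.com:10000/default", "etl_user", "");
        Statement st = con.createStatement()) {

      // External table pointing at the files placed in HDFS by the ingestion jobs.
      st.execute("CREATE EXTERNAL TABLE IF NOT EXISTS landing_transactions ("
          + " card_token STRING, merchant_id STRING, amount DECIMAL(12,2),"
          + " tx_date STRING)"
          + " ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'"
          + " LOCATION '/landing/transactions'");

      // Cleaning step: drop duplicates and records with null keys, then
      // materialize the result in the temporary transformation database.
      st.execute("INSERT OVERWRITE TABLE tmp_transformation_area.transactions"
          + " SELECT DISTINCT card_token, merchant_id, amount, tx_date"
          + " FROM landing_transactions"
          + " WHERE card_token IS NOT NULL AND merchant_id IS NOT NULL");
    }
  }
}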

    3.2.2 Transformation Area

This area consists of two databases in Hive. The first one (Tmp Area Transformation) contains the tables generated by the previous component, with updates, additions and deletions applied where necessary. The second database is the result of a transformation process over the first one and consists of a 'Tablon' (a large table with more than 100 columns) that consolidates all the information at the transaction and commerce level, including tokenized customer information (such as sex, age, date, educational level and economic level). This large table is used by the KPI Calculations component.


3.2.3 KPI Calculations

In this area, the 'Tablon' is used to generate 7 tables, each linked to one or more of the KPIs detailed in the Data Visualization part. The tables are in a Hive database, which is connected to NoSQL tables in HBase responsible for serving the reports on the web.
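One standard way to wire a Hive KPI table to HBase is the HBaseStorageHandler, sketched below; the table name, column family and column mapping are assumptions for illustration, not the production definitions.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class KpiToHbase {
  public static void main(String[] args) throws Exception {
    Class.forName("org.apache.hive.jdbc.HiveDriver");
    try (Connection con = DriverManager.getConnection(
            "jdbc:hive2://edge.example.com:10000/default", "etl_user", "");
        Statement st = con.createStatement()) {

      // Hive table backed by an HBase table: the row key is the merchant id
      // and the KPI values live in the 'kpi' column family.
      st.execute("CREATE TABLE IF NOT EXISTS kpi_sales_hbase ("
          + " merchant_id STRING, daily_sales DECIMAL(14,2), tx_count BIGINT)"
          + " STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'"
          + " WITH SERDEPROPERTIES ("
          + "   'hbase.columns.mapping' = ':key,kpi:daily_sales,kpi:tx_count')"
          + " TBLPROPERTIES ('hbase.table.name' = 'kpi_sales')");

      // Publish the KPI: rows written here are upserted into HBase and
      // become readable by the web back-end.
      st.execute("INSERT OVERWRITE TABLE kpi_sales_hbase"
          + " SELECT merchant_id, SUM(amount), COUNT(*)"
          + " FROM transformation_area.tablon"
          + " GROUP BY merchant_id");
    }
  }
}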

    3.2.4 Consumer Tables

There are 7 consumer tables, which are consumed from Java by applying a Facade design pattern (details in Section 3.3).
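A minimal sketch of such a facade, using the standard HBase Java client and the hypothetical kpi_sales table (column family kpi) from the previous sketch, could look like this:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class KpiFacade implements AutoCloseable {
  private final Connection connection;

  public KpiFacade() throws IOException {
    Configuration conf = HBaseConfiguration.create(); // reads hbase-site.xml
    this.connection = ConnectionFactory.createConnection(conf);
  }

  // One simple method per KPI: callers never see the HBase client API.
  // Returns the daily sales KPI for one merchant, or -1 if absent.
  public double dailySales(String merchantId) throws IOException {
    try (Table table = connection.getTable(TableName.valueOf("kpi_sales"))) {
      Result row = table.get(new Get(Bytes.toBytes(merchantId)));
      byte[] value = row.getValue(Bytes.toBytes("kpi"), Bytes.toBytes("daily_sales"));
      return value == null ? -1 : Double.parseDouble(Bytes.toString(value));
    }
  }

  @Override
  public void close() throws IOException {
    connection.close();
  }
}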

Today we handle almost 200 GB of historical data, and the entire transformation process is executed daily in 1 hour for a volume of approximately 12 GB (11 GB for master tables, such as customers and businesses, and 1 GB for transactions), which accumulates during the month. In addition, every month a process runs through all the stages and takes about 1 hour with a volume of 10 GB of data (mainly other master tables related to business location and local geography).

At steady state, daily processing should be 20 GB and monthly processing 15 GB, which means a total of 600 GB of historical data¹ (Table 2).

            Daily   Monthly   Historic
Actual      12 GB   10 GB     200 GB
Projected   20 GB   15 GB     600 GB

Table 2: Data size processed

3.3 Data Visualization

The data visualization solution for this project is a web platform (Fig. 3). This stage is a real-time request processor that queries the Apache HBase database in order to obtain the data and show the results to the end user.

    The workflow for this stage is shown in Fig. 7.

¹ Historic data from the last 2 years.

    Figure 7: Data Visualization Diagram

This stage is composed of the following components:

• External Communication: Services that allow consumption of the information in the Big Data environment (a sketch follows this list).

• Client Communication: In charge of establishing the remote connection with the application back-end from the web pages.

• Graphics: Statistical graphs on the web pages.

• Web Page: Represents all the system's web pages.

• Reporting: Allows access to the system repositories in order to perform analysis and create new reports.
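As a sketch of the External Communication component, the resource below exposes one KPI over HTTP using JAX-RS. The paper does not name the web framework, so the annotations, the route and the reuse of the hypothetical KpiFacade from the sketch in Section 3.2.4 are assumptions; a production service would also share one facade instance instead of opening a connection per request.

import java.io.IOException;

import javax.ws.rs.GET;
import javax.ws.rs.Path;
import javax.ws.rs.PathParam;
import javax.ws.rs.Produces;
import javax.ws.rs.core.MediaType;

@Path("/kpi")
public class KpiResource {

  // GET /kpi/{merchantId}/daily-sales -> one KPI as a small JSON document.
  @GET
  @Path("/{merchantId}/daily-sales")
  @Produces(MediaType.APPLICATION_JSON)
  public String dailySales(@PathParam("merchantId") String merchantId)
      throws IOException {
    try (KpiFacade facade = new KpiFacade()) {
      return String.format("{\"merchantId\":\"%s\",\"dailySales\":%.2f}",
          merchantId, facade.dailySales(merchantId));
    }
  }
}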

4 Conclusion and Future Work

The proposed solution allows enterprises to enable Big Data capabilities inside the organization, making batch processes more efficient and reducing the solution's time to market. In addition, the use of cloud environments facilitates the adoption of technologies that require intensive infrastructure deployment. Likewise, the agile framework used by the project demonstrates that keeping the client at the center of all decisions allows a product to be created in a short time and with great value to the clients.

For future work, the process developed in Apache Hive could be migrated to technologies that improve processing speed. For example, Apache Impala (Kornacker et al., 2015) and Apache Spark (Zaharia et al., 2016) are great options, because these technologies offer a different execution engine that would bring new capabilities to the proposed solution. On the other hand, the information that users generate inside the web platform could help to make important improvements in how the company gets to know its clients, in order to offer solutions that help them accomplish their main goals.

    References

Erik Brynjolfsson, Lorin M. Hitt, and Heekyung Hellen Kim. 2011. Strength in Numbers: How does data-driven decision-making affect firm performance? ICIS 2011 Proceedings, page 18. https://doi.org/10.2139/ssrn.1819486.

Jennifer Campbell, Stan Kurkovsky, Chun Wai Liew, and Anya Tafliovich. 2016. Scrum and Agile Methods in Software Engineering Courses. In Proceedings of the 47th ACM Technical Symposium on Computing Science Education (SIGCSE '16). ACM, New York, NY, USA, pages 319-320. https://doi.org/10.1145/2839509.2844664.

Mike Herrington, Donna Kelley, and Slavica Singer. 2016. Global Entrepreneurship Monitor. http://www.gemconsortium.org/report/49480.

E. Brynjolfsson. 1994. The Productivity Paradox of Information Technology: Review and Assessment. Center for Coordination Science. http://ccs.mit.edu/papers/CCSWP130/ccswp130.html.

S. Earley. 2014. The digital transformation: Staying competitive. IT Professional 16(2):58-60. https://doi.org/10.1109/MITP.2014.24.

Harman Andrea. 2012. Un estudio de los factores de éxito y fracaso en emprendedores de un programa de incubación de empresas: Caso del proyecto RAMP Perú. Master's thesis, Pontificia Universidad Católica del Perú.

Ibrahim Abaker Targio Hashem, Ibrar Yaqoob, Nor Badrul Anuar, Salimah Mokhtar, Abdullah Gani, and Samee Ullah Khan. 2014. The rise of Big Data on cloud computing: Review and open research issues. Information Systems 47:98-115. https://doi.org/10.1016/j.is.2014.07.006.

H. V. Jagadish, Johannes Gehrke, Alexandros Labrinidis, Yannis Papakonstantinou, Jignesh M. Patel, Raghu Ramakrishnan, and Cyrus Shahabi. 2014. Big data and its technical challenges. Communications of the ACM 57(7):86-94. https://doi.org/10.1145/2611567.

Marcel Kornacker, Alexander Behm, Victor Bittorf, Taras Bobrovytsky, Casey Ching, Alan Choi, Justin Erickson, Martin Grund, Daniel Hecht, Matthew Jacobs, Ishaan Joshi, Lenni Kuff, Dileep Kumar, Alex Leblang, Nong Li, Ippokratis Pandis, Henry Robinson, David Rorke, Silvius Rus, John Russell, Dimitris Tsirogiannis, Skye Wanderman-Milne, and Michael Yoder. 2015. Impala: A Modern, Open-Source SQL Engine for Hadoop. In CIDR. http://impala.io/.

C. L. Philip Chen and Chun Yang Zhang. 2014. Data-intensive applications, challenges, techniques and technologies: A survey on Big Data. Information Sciences 275:314-347. https://doi.org/10.1016/j.ins.2014.01.015.

Tecnocom. 2016. Informe Tecnocom: Tendencias en Medios de Pago. https://goo.gl/95th2L.

The Apache Software Foundation. 2007a. Apache Hadoop. http://hadoop.apache.org.

The Apache Software Foundation. 2007b. Apache HBase. https://hbase.apache.org.

The Apache Software Foundation. 2011. Apache Hive. https://hive.apache.org.

Marcello Trovati, Richard Hill, Ashiq Anjum, Shao Ying Zhu, and Lu Liu. 2016. Big-Data Analytics and Cloud Computing: Theory, Algorithms and Applications. Springer.

Naresh Vurukonda and B. Thirumala Rao. 2016. A Study on Data Storage Security Issues in Cloud Computing. Procedia Computer Science 92:128-135. https://doi.org/10.1016/j.procs.2016.07.335.

Matei Zaharia, Michael J. Franklin, Ali Ghodsi, Joseph Gonzalez, Scott Shenker, Ion Stoica, Reynold S. Xin, Patrick Wendell, Tathagata Das, Michael Armbrust, Ankur Dave, Xiangrui Meng, Josh Rosen, and Shivaram Venkataraman. 2016. Apache Spark: a unified engine for big data processing. Communications of the ACM 59(11):56-65. https://doi.org/10.1145/2934664.
