
Big Data in Telecommunications

A Practical Roadmap for the Colombian CSP

David Callaghan, Senior Platform Architect/Data Scientist

david.callaghan@2cdata.com



Table of Contents

Executive Summary
Opportunities
    Customer Experience
    Network Management
Challenges
    Staffing and Skills
    Business Support
    Current Environment
Big Data SDLC
Roadmap
Definitions
    Big Data
    NoSQL
    Mobile
    Social
Appendix I: Industry Survey

Copyright Dos Chihuahuas, LLC 2013. DRAFT ONLY, not for distribution.


Executive Summary

Communication Service Providers (CSPs) who can implement Big Data and NoSQL solutions to take advantage of customer, network and location data in near-real and real time should be able to significantly reduce costs and increase revenue through location-based services, intelligent marketing campaigns, reliable and intelligent scalable networks, next best actions for sales and services, fraud detection and social media insights. However, there are few practical examples. In this paper, we address possible causes and solutions.

The majority of literature in the Big Data space shows that a few industries, including telecommunications, are spending far more and getting less value back than the average. The consensus among vendors is that these industries need to spend more in order to realize the ROI of those industries spending less. The same is true for departments within corporations: sales and marketing gets the most money and receives the least ROI. Again, the usual response is to spend more in those areas to realize additional gains. In this paper, a different assumption will be made. Industries and departments that are spending more and getting less should start spending less. This argument can be supported both mathematically and by example.

• CSPs need a Fast Data solution, not a Big Data solution.

• CSPs are spending too much money. Money decreases the effectiveness of a Big Data solution after a certain minimum point.

• A small team with a limited budget, tasked with providing clear and measurable analysis to the sales and marketing team and the network team, will provide the greatest benefit.

From a business perspective, these five key recommendations will help drive big data initiatives where benefits are most likely to be realized from a minimal cost outlay.

• Commit initial efforts to customer-centric outcomes

• Develop an enterprise-wide big data blueprint

• Start with existing data to achieve near-term results

• Build analytics based on business priorities

• Create a business case based on measurable outcomes

In this paper, Big Data and NoSQL will be stripped of their marketing hype and defined clearly, general areas of potential benefit and perceived concerns will be addressed as they pertain to CSPs, and a practical roadmap for establishing an operationally effective Center of Excellence for Analytics will be laid out. In conclusion, some practical case studies of CSPs who have faced challenges that parallel Colombia's will be identified. While the concepts in this paper are relevant to CSPs in general, this white paper will address specific strategies for the successful implementation of an operational Big Data/NoSQL environment for Communication Service Providers (CSPs) operating in Colombia, SA.



Opportunities

CSPs have a strange relationship with Big Data. The current industry climate involves an increasing demand for capital expenditures for building networks, with increased regulatory scrutiny and obstacles, against decreasing or flat revenue growth as competition from all sides grows. There are also a limited number of success stories from CSPs who have implemented an operational Big Data platform and are receiving economic and competitive advantages. Let's take a look at how CSPs compare to other industries with regard to paying for and implementing Big Data solutions.

The median big data spend across all industries is USD$10M, while CSPs are spending USD$25M. The mean expected return across all industries is 45%, while it's only 38% for CSPs. It's safe to assume that we have identified the reason behind the attitude of so many CSPs that Big Data isn't worth the investment. If you are spending substantially more and getting back quite a bit less than everyone else, it's a very reasonable initial response. But there must be some reason behind these numbers.

Let's take a look at the data that's being processed. On average, most industries have a data load that involves 52% structured data, 21% semi-structured and 27% unstructured data. In telecommunications, those numbers are 45% structured, 27% semi-structured and 28% unstructured. We did not have to look far to find the answer! CSPs have an ROI that is only 84% of the global average (38% versus 45%). CSPs have a workload where structured data accounts for only 86% of the global average (45% versus 52%). This is predictable from an architecture perspective when working at scale, since the vast majority (80%) of the work of a big data project is getting the data into a usable format. This can also account for the 250% spending differential: the off-the-shelf tools for this type of processing can be extremely (and unnecessarily) expensive. So that 14% shortfall in structured data adds work during the most time-consuming phase of the big data process and makes your big data dollar only 34% as effective as the average (84% of the return for 250% of the spend). As we will demonstrate mathematically later when discussing development of algorithms at scale and the virality equation for social networks, improving the time it takes to perform an operation often yields outsized results. The less structured your data is, the more processing time it takes.
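The ratios in the paragraph above can be checked with a few lines of arithmetic. The sketch below uses only the survey figures quoted in the text; the variable names are illustrative, and "dollar effectiveness" is assumed here to mean the relative return divided by the relative spend.

```python
# Survey figures quoted in the text (all industries vs. CSPs).
median_spend_all, median_spend_csp = 10.0, 25.0   # USD millions
roi_all, roi_csp = 0.45, 0.38                      # mean expected return
structured_all, structured_csp = 0.52, 0.45        # share of structured data

# How CSPs compare to the all-industry figures.
spend_ratio = median_spend_csp / median_spend_all        # 2.5, i.e. 250%
roi_ratio = roi_csp / roi_all                            # about 0.84
structured_ratio = structured_csp / structured_all       # about 0.865

# Assumed definition: relative return per relative dollar spent.
dollar_effectiveness = roi_ratio / spend_ratio           # about 0.34

print(f"Spend relative to average:      {spend_ratio:.0%}")
print(f"ROI relative to average:        {roi_ratio:.1%}")
print(f"Structured share vs. average:   {structured_ratio:.1%}")
print(f"Effectiveness per dollar spent: {dollar_effectiveness:.1%}")
```

The last line is where the 34% figure comes from: roughly 84% of the return for 250% of the spend.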

Let's look solely at the internal, structured and semi-structured data that is available and see whether it is amenable to the data mining activities that are typical of the telecommunications industry and whose application could result in a positive business outcome.

CSPs internally generate a tremendous amount of data including:

• Call Detail Data : CDRs and XDRs

• Network Data : Describes the state of the hardware and software components in the network

• Customer Data : Similar in structure and function to other industries.



The following data mining applications are typical of CSPs and can be accomplished with the data described above.

• Fraud Detection

• Subscription Fraud : Customer opens account with no intention of paying

• Superimposition Fraud : Legitimate account with legitimate activity with some illegitimate activity superimposed

• Customer Profiling

Managing customer churn represents a very profitable area for the application of predictive analysis. A significant cost is incurred when a customer leaves. For example, when competing companies offer incentives, such as a $50 bonus, people switch carriers repeatedly to earn incentives. Utilizing call detail, billing, subscription and customer information, it is possible to create an induced model to inform next best action.

In 1991, using graph analysis, MCI calculated that it would be cheaper to add entire calling circles to a plan rather than adding individuals. This resulted in the MCI Friends and Family plan, which was one of the most successful marketing plans in telecommunication history. It is interesting to note that MCI ultimately decided to have customers define their own circles rather than using the call detail data, because of privacy concerns.

• Network Fault Isolation

Most of the network elements are capable of at least limited self-diagnosis, and these elements may collectively generate millions of status and alarm messages each month. Because of the volume of the data, and because a single fault may cause many different, seemingly unrelated, alarms to be generated, the task of network fault isolation is quite difficult. Data mining has a role to play in generating rules for identifying faults.

Telecommunication Alarm Sequence Analyzer (TASA) automatically discovers recurrent patterns of alarms within the network data along with their statistical properties, using a specialized data mining algorithm.
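To make the idea of recurrent alarm patterns concrete, here is a minimal sketch that counts how often one alarm type is followed by another within a short time window. It is a simplified stand-in, not TASA's actual algorithm; the function name, the alarm type names and the sample data are all hypothetical.

```python
from collections import Counter

def frequent_alarm_pairs(alarms, window_seconds, min_count):
    """Count ordered pairs (a, b) where alarm b follows alarm a within the window.

    alarms: iterable of (timestamp_seconds, alarm_type) tuples.
    Returns only the pairs seen at least min_count times.
    """
    events = sorted(alarms)
    counts = Counter()
    for i, (t_a, type_a) in enumerate(events):
        seen = set()  # count each follower type once per triggering alarm
        for t_b, type_b in events[i + 1:]:
            if t_b - t_a > window_seconds:
                break  # events are sorted, so nothing later can be in the window
            if type_b != type_a and type_b not in seen:
                seen.add(type_b)
                counts[(type_a, type_b)] += 1
    return {pair: n for pair, n in counts.items() if n >= min_count}

# Hypothetical alarm log: (seconds since start, alarm type).
sample = [
    (0, "LINK_DOWN"), (5, "HIGH_BER"), (8, "ROUTE_FLAP"),
    (100, "LINK_DOWN"), (104, "HIGH_BER"),
    (300, "LINK_DOWN"), (303, "HIGH_BER"), (400, "FAN_FAIL"),
]

rules = frequent_alarm_pairs(sample, window_seconds=10, min_count=3)
print(rules)  # {('LINK_DOWN', 'HIGH_BER'): 3}
```

A pair that recurs this way is a candidate rule ("HIGH_BER shortly after LINK_DOWN probably shares one root cause"), which an engineer would still need to validate.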

Each of these activities can be accomplished using structured and semi-structured internal data and fits into the two categories that we will identify as key business drivers for CSPs in Colombia today:

• Customer Experience

• Network Management



Customer Experience

The telecommunication sector in Colombia accounts for 3% of GDP. As of June 2012, there are 103 mobile telephone subscribers per 100 inhabitants, representing a growth rate higher than the world average but lower than Brazil and Peru. There are 0.98M 2G users, 2.07M 3G users and 0.02M 4G users in the mobile internet space. The number of fixed internet subscribers per 100 inhabitants is 8, or 28 per 100 households. The fixed internet subscriber rate has been growing at the same rate as the world average, but the overall number is lower than both the world and the Latin American average. The percentage of fixed internet subscribers using broadband as opposed to narrowband has steadily increased. The 4G spectrum auction in 2012 will likely accelerate this adoption dramatically.

Using the Herfindahl-Hirschman Index (HHI) to measure concentration and quantify competitiveness (a higher number means more concentration and less competitiveness) shows that the mobile telephone market has a higher HHI than other CSP services, with 65% of revenues going to one of the five operators. The impact of the 4G spectrum has yet to be calculated, but its impact will almost certainly be significant. 4G enables social media interaction from a mobile device and, as we will calculate later, the impact can be dramatic. The HHI for fixed internet is low, likely due to the homogeneous distribution of subscriptions among the four bigger operators. For mobile internet, the concentration of income has been decreasing since 2010, while the concentration of users, as a sum of the prepaid and subscriber base, is higher.
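For readers unfamiliar with the index, the HHI is the sum of the squared market shares of every firm in the market, with shares expressed in percent. The sketch below shows the calculation. Only the 65% share of the leading operator comes from the text; the other shares are hypothetical values chosen so the totals add up to 100.

```python
def hhi(shares_percent):
    """Herfindahl-Hirschman Index: sum of squared market shares (in percent).

    Ranges from near 0 (many tiny firms) up to 10,000 (a single monopolist).
    """
    total = sum(shares_percent)
    if abs(total - 100.0) > 1e-6:
        raise ValueError(f"shares must sum to 100, got {total}")
    return sum(share ** 2 for share in shares_percent)

# Mobile: one operator with 65% of revenue (from the text);
# the split among the other four operators is hypothetical.
mobile_revenue_shares = [65, 15, 10, 6, 4]

# Fixed internet: an even split among four operators (hypothetical).
fixed_internet_shares = [25, 25, 25, 25]

print(hhi(mobile_revenue_shares))   # 4602
print(hhi(fixed_internet_shares))   # 2500
```

Because the shares are squared, a single dominant operator drives the index up far more than several mid-sized ones, which is why the mobile market scores so much higher than an evenly split one.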

The following three basic goals of Customer Experience would seem to be appropriate:

• Increase ARPU

• Decrease churn

• Increase market share

Increased ARPU

Transactional Behavioral Analysis algorithms, which group and track customer characteristics and behaviors, can be used to target customers with appropriate offers. Consider the following two examples:

• Occasional User

These users spend a small amount of money quickly on recharging and then stay inactive for a long period of time. The most effective next best action would be to send small upsell offers while the phone is in use. Sending regular offers is ineffective.

• Topping up/Bipping

These users consume credits and recharge at predictable intervals (e.g., every 30 days), which is called topping up. When they run low before topping up, they receive an increase in incoming calls as they ask people to call them (bipping). A top-up campaign can give a limited-time bonus and move the top-up time to be more in line with behavior, increasing revenue and customer satisfaction.

An Eastern European operator changed its traditional targeting approach and measured the recharge and spending characteristics of its prepaid base to monitor for irregular transactions on shorter intervals. It experienced an increased ROI of a few million and increased customer satisfaction.

Decrease Churn

XO Communications, a wholesale and enterprise-focused CSP in the United States, has been using predictive analytics with its monthly account-management cycle to predict churn. Within the first year, a 60 percent improvement in revenue retention rates was realized. Although XO account managers have been surprised from time to time at the likely candidates to churn, the use of analytics in this way has enabled an improvement in customer experience to the degree that one particular XO service line has swung from loss to profit.
IBM, Analytics: Real World Use of Data in Telecommunications

Increase Market Share

The following methods of driving mobile market share can be made more effective through the use of predictive analysis:

• Try & Buy Campaigns

Link free trials with sales conversion to obtain 'first mover' advantage.

Vodafone's 'Data Test Drive' allowed people to experiment with 3G mobile data and, at the end of the month, suggested the best plan based on usage patterns.

• Staggered Data Plan

Most useful when the majority of subscribers are spending small amounts and do not want to make long-term commitments.

NTT Docomo offers postpaid pay-as-you-go users the option to switch to a volume-based plan if they exceed a certain data limit.

• Dynamic Data Pricing

Operators offer specific discounts at the right time to reduce traffic congestion and modify behavior to consistently use mobile data. In emerging countries, subscribers are price-sensitive and promotion-driven.

AXIS Indonesia offers 'pay-to-boost' speed, where users pay a small fee to increase the speed of their mobile data for a limited time once they have reached their data limits and experienced throttling.



MTN Africa offers 1GB at a 50% discount to Night Owls.

Targeting the right customer at the right time with a relevant value proposition in an understandable way is almost the exclusive province of the telco. Combining customers' needs and behavior with handset capability and geography offers limitless potential.

Network Management

“...network management today is the make it or break it part of the business, certainly from our bottom line perspective”
top executive from Grupo Salinas' Colombian telecoms business

Brazil's telecommunication regulator, Anatel, announced yesterday that it has suspended sales in some states by three mobile phone companies: TIM, the Brazilian unit of Telecom Italia; America Movil's Claro; and the national provider Oi, due to the volume of customer complaints. Anatel said the worst wireless carriers in each state are prohibited from selling new lines.
RCR Wireless Americas, 19 July 2012



Challenges

The media has written extensively about the challenges companies face in collecting, processing, analyzing and using Big Data in their businesses. Much public discussion has focused on the 3Vs: handling the volume, variety and velocity of the data. Finding people who know how to analyze data, the data scientists, is also a concern. Driving business decision-makers to actually use data rather than intuition is another challenge. Finally, the existing IT silo infrastructure and reliance on current vendors is an issue. But what is the gap between perception and reality?

Staffing and Skills

• Building high levels of trust between the data scientists who present insights on Big Data and the functional managers.

• Finding and hiring data scientists who can manage large amounts of structured and unstructured data and create insights

• Reskilling the IT function to be able to use the new tools and technologies of Big Data

• Getting the IT function to recognize that Big Data requires new technologies and new skills

Business Support

• Getting business units to share information across organizational silos

• Determining what data (both structured and unstructured, and internal and external) to use for different business decisions

• Getting top management in the company to approve investments in Big Data and its related investments (e.g., training, etc.)

• Finding the optimal way to organize Big Data activities in our company

• Determining what to do with the insights that are created from Big Data

Current Environment

• Being able to handle the large volume, velocity and variety of Big Data

• Determining which Big Data technologies to use

• Keeping the data in Big Data initiatives secure from internal parties



• Keeping the data in Big Data initiatives secure from external parties

• Getting functional managers to make decisions based on Big Data, rather than on intuition

• Reskilling the IT function to be able to use the new tools and technologies of Big Data

The best way to address these challenges is to build them into the DNA of a new organizational unit explicitly charged with evangelizing and delivering Big Data solutions. This is discussed next in the Roadmap.

Big Data SDLC

Big Data is a platform rather than a prepackaged solution. Specifically, Big Data is a platform upon which you can build an entire ecosystem of products for the enterprise. However, it is not enterprise development. Algorithms that are effective on GB of data are untenable at TB scale. The same is true of error rates of 1%. Waterfall approaches were acceptable, although far from optimal, at the enterprise level because the data was small enough to allow interdepartmental politics to trump effective algorithmic design. Projects are different at scale, and it's a mistake to take enterprise-level thinking to big-data scale.
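A short calculation illustrates why both points bite at scale. The sketch below is only an illustration: it uses an all-pairs comparison as a stand-in for any algorithm whose cost grows with the square of the input, and the record counts are arbitrary round numbers rather than figures from a real system.

```python
ERROR_RATE = 0.01  # the 1% error rate mentioned in the text

def bad_records(n_records, error_rate=ERROR_RATE):
    """Number of records expected to be wrong at a given error rate."""
    return round(n_records * error_rate)

def pairwise_comparisons(n_records):
    """Work done by a naive all-pairs algorithm (quadratic growth)."""
    return n_records * (n_records - 1) // 2

for label, n in [("thousand", 10**3), ("million", 10**6), ("billion", 10**9)]:
    print(f"1 {label} records: {bad_records(n):,} bad records, "
          f"{pairwise_comparisons(n):,} pairwise comparisons")

# Growing the input 1,000-fold grows the quadratic workload about a million-fold.
growth = pairwise_comparisons(10**6) / pairwise_comparisons(10**3)
print(f"Workload growth for 1,000x more data: {growth:,.0f}x")
```

Ten bad records in a thousand can be fixed by hand. Ten million in a billion cannot, and an algorithm that was merely slow on the small set never finishes on the large one.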

Software Development Lifecycle at Scale

• Start Simply

• Prototype Perpetually

• Optimize Obsessively

• Be Opportunistic For Wins

The 2Cdata Knowledge Cycle: a Scalable, Repeatable, Flexible Approach

1. Define an Objective

Define a clear and measurable outcome. Using a framework when defining objectives can help.

1. Answer existing questions in existing businesses, with a focus on improved efficiency

2. Answer new questions in existing businesses, with a focus on opportunities for growth

3. Answer new questions in new businesses, with the goal of reshaping the competitive landscape

This is an ordered list for a reason; these could also be interpreted as phases or degrees of difficulty. You may not want to reshape the competitive landscape before you parse a clickstream log, for example.

2. Identify Controls



Identify what inputs can be controlled, like a recommendation for a product or a delivery time for a service or the cost of a product. These controls should obviously be relevant to your objective.

3. Identify Data

Identify the data that can be collected. Consider both the data that you have (ideal) and the data that would need to be collected. It's estimated that 80% of the work will happen in this data munging period. There are two tips that will serve you well when it comes to identifying data.

More data beats smart algorithms every time

Problems that are intractable with MBs of data can be trivial with GBs or TBs of data.

The preparation of data takes about 80% of the time

A schema implies that you know what the data does and what it is for. Most of the data that you will use to answer an interesting question won't fall into this structured format by definition.

4. Model Solutions

Build a model solution assembly line composed of a modeler, a simulator and an optimizer that takes raw data and converts it into slightly more refined predicted data.

Modeler

If you have a lot of data that has been well scrubbed, you can start with trivial algorithms in MapReduce. Look for "the simplest thing that could possibly work".

Simulator

Run the model over a wide range of inputs to ask "what if" questions when certain levers are moved based on the model.

Optimizer

Take the surface of possible outcomes and identify the highest point. Now you have a defined metric upon which to continue iterating, or you can stop once you have what could be considered a "minimum viable product".
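The modeler, simulator and optimizer described above can be sketched end to end in a few lines. This is a toy illustration under stated assumptions, not the 2Cdata implementation: the objective is revenue per offer sent, the control is the discount level, the model is a straight line fitted by least squares, and the observations and the price are hypothetical.

```python
def fit_line(xs, ys):
    """Modeler: least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def simulate(model, price, discounts):
    """Simulator: expected revenue per offer for each candidate discount."""
    intercept, slope = model
    outcomes = {}
    for d in discounts:
        uptake = min(1.0, max(0.0, intercept + slope * d))  # keep a valid rate
        outcomes[d] = uptake * price * (1 - d)
    return outcomes

def optimize(outcomes):
    """Optimizer: the control setting with the highest simulated outcome."""
    return max(outcomes, key=outcomes.get)

# Hypothetical past campaigns: discount offered vs. fraction who accepted.
observed_discounts = [0.0, 0.1, 0.2, 0.3]
observed_uptake = [0.02, 0.05, 0.08, 0.11]

model = fit_line(observed_discounts, observed_uptake)
candidates = [i / 10 for i in range(10)]        # 0% to 90% discount
outcomes = simulate(model, price=10.0, discounts=candidates)
best = optimize(outcomes)
print(f"Best discount: {best:.0%}, revenue per offer: {outcomes[best]:.3f}")
```

Note that the best candidate here lies outside the range of discounts actually observed, so the model is extrapolating. In practice that is the signal to run a small test campaign at that level and feed the result back into the next iteration.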



Roadmap

The following roadmap uses Bouygues Telecom as an example. Bouygues Telecom is the third largest telecom company in France, with 10 million mobile customers, a market share of 20%, and annual revenues of 5.3 billion euros. Their implementation of a Big Data platform is near enough in every particular to the standard model that we endorse that it made sense to hold them up as a real-world example. They had to convert mountains of data into usable intelligence and make it available to internal as well as external parties, including providing near-real-time CDRs to the police. This was no small feat, as the company had over 300 data marts. The project took two years and involved the creation of a governance program to ensure a single enterprise view of the data, with access to analytics throughout the organization.

The ultimate goal for the restructuring effort was threefold:

1. To consolidate their analytic data into a single data warehouse,

2. To shorten the data latency to make the data closer to real time,

3. To support the creation of temporary sandboxes within the data warehouse (the Data Lab) to provide fast solutions and quick answers to time-to-market questions.

The implementation process:

Set expectations

The first step required that the company recognize that the conversion could not happen through one big project. This meant that the team had to arbitrate and prioritize the business requirements into workable projects.

Develop a Proof of Concept

They brought in Teradata to develop a Proof of Concept. Using a third party in the initial stages can often overcome some of the resistance to new tools that arises when the ideas come from within the organization. It helped that Teradata in particular works with many of the Oracle, SAS and Microsoft solutions.

Develop a governance process for business prioritization across the enterprise

Here Bouygues Telecom introduced the innovative idea of giving every business division one hundred "chips" they could use to purchase analytics services from the central BI division. Many units requested the same services, so they received a "volume discount". Senior management was involved in prioritizing the work by ranking the requests in order of importance. This "chip and rank" system apparently overcame a substantial number of the organizational hurdles that cause so many of the Business Support issues mentioned above.



Create an Enterprise Data Model

Starting with the Teradata Communications Logical Data Model and working from business unit to business unit, they were able to develop a comprehensive model of the enterprise and identify which data marts could be decommissioned and when.

Migrate

Once the EDW was built out and the data marts were decommissioned, they connected their EDW to their Siebel CRM system. Now all frontline personnel had access to the same analytics as the departments responsible for what used to be isolated data marts. There was one true view of data.

At the end of the process, there was a single EDW made up of Big Data, OLAP and OLTP data, all run by a single department. Requests were given equal weight across the enterprise and prioritized by senior management. There was a single true view of all of the data, and this data was available to the front and back office. The consolidation and decommissioning effort realized a 33% savings, and future data growth is now linear. The following analytic achievements were given particular mention:

Customer Churn

Using data mining, the new customer churn system takes four hours rather than one week.

Operational BI

Teradata provides the data source for Oracle Real Time Decisions so that the Marketing group can make actionable decisions in near real time based on Big Data analytics.

Extreme Analytics on Big Data

The growth of CDRs and xDRs is not an issue with Hadoop, so the analysis of larger and larger data sets is not measurably impacting the performance and response times of their analytics.



Definitions

The purpose of this definition section is to eliminate common marketing buzzwords from the discussion as much as possible. The goal of this paper is to be accurate and clear, and marketing phrases and buzzwords are vague and cloudy. A more complete definition of "mobile" is also provided, since multiple arguments will be made regarding the future of mobile as it relates to CSPs and the term may be conflated with mobile phones when it is, in fact, much more significant. The same is true of "social".

Big Data

Big Data is a new term (2010) and its use can be fairly confusing because it seems so simple: “Big Data” must mean “A Large Amount of Data”. Not necessarily, as we shall see. Also, technical people seem to conflate the term “Big Data” with a technology called “Hadoop”. However, they soon find out that Hadoop is synonymous with both a single product and an ecosystem of products.

We will provide two definitions of Big Data, Big Data – Gartner Perspective and Big Data – Apache Perspective, to differentiate between the business and technical perspectives. Additionally, we will provide the Big Data – 2cdata Perspective.

Big Data – Gartner Perspective

Big Data problems can be defined using one or more of the following metrics, originally developed by Doug Laney (now of Gartner), commonly called the Three V's of Big Data:

Volume : Volume = rows / objects / bytes

Volume refers to the size of the data to be processed as well as the complexity of the data.

Big Data is any data that is expensive to manage and hard to extract value from.
Michael Franklin, Director of Algorithms, Machines and People Lab, University of California, Berkeley

Velocity : Velocity = number of rows / bytes per unit time

Velocity refers to the latency of data processing relative to the growing demand for interactivity.

Real-time big data isn't just a process for storing petabytes or exabytes of data in a data warehouse. It's about combining and analyzing data so you can take the right action, at the right time, in the right place.
Michael Minelli, Big Data, Big Analytics

Variety : Variety = number of columns / dimensions / sources

Variety refers to the diversity of sources, formats, quality and structures of data.

...no greater barrier to effective data management will exist than the variety of incompatible data formats, non-aligned data structures, and inconsistent data semantics
Doug Laney, 3D Data Management: Controlling Data Volume, Velocity and Variety, Gartner 2001

These general definitions can be applied to the CSP industry.

Volume

CSPs are used to dealing with large amounts of data in the form of CDRs, network data, call center interactions and customer data. Smartphones have added a new category of transaction records, xDRs, which capture media purchases and downloads, prepaid phone recharges, mobile payments and other transactions. Location data from GPS-enabled devices is also stored. Combine this internal information with external data from social media and other online sources and this is a substantial amount of growth. Volume can also be a function of time: regulatory requirements may require carriers to keep more information for an extended period of time. Finally, large volume means smart answers. It is axiomatic in the data science field that a lot of data will beat a smart algorithm every time. It may be more cost-effective to store more data for a longer period than to hire a department full of statistics PhDs.

All things considered, though, this additional volume of data represents a linear growth factor and is therefore not particularly interesting in and of itself. If you are processing a terabyte of data and you find out you need to process two terabytes, you just double your storage capacity. That's somewhat oversimplified (and I will address this later), but it still hardly represents a substantial change to how business is done. And the additional volume, in and of itself, does not give a competitive advantage. This has led many analysts to insist that Big Data is not relevant to telecommunications: you already do it. I agree. Volume plus velocity, however, is a different story.

Velocity

The competitive advantage that CSPs can have over every other type of business is that they can know a person's demographic profile and their location at a specific point in time and communicate with that person immediately and directly. This type of service can transform telecommunication providers from PSTN operators to information providers and content distributors. People are inherently mobile and reactive, and CSPs are in a unique position to profit handsomely.

There is one small catch, though. Fast is harder than big.

Variety

Corporate data used to be exclusively structured and internal, but now the dimensions of data structure and source have changed. Data structure is divided into structured, semi-structured and unstructured data. Structured data resides in fixed fields like relational database tables and tab-delimited network device logs. Semi-structured data refers to tagged data such as XML, JSON or tagged audio/video and spreadsheets. Unstructured data refers to data sources like call center text logs and audio, security video surveillance feeds and emails. Data sources are internal (sales, service, operational and employee records, network devices) and external (third-party data providers, social media sites). It is interesting to note that most companies would consider location-based information from a CSP to be an extremely valuable third-party external data source.

To summarize then, CSPs don't need a Big Data solution. CSPs have already solved the basic problem of having a lot of data. CSPs need a Fast Data solution. And they need it done cheaply.



Big Data – Apache Perspective

An engineer named Jeff Dean took a job at a new company called Google. Two guys there wanted to try out a new algorithm they had developed called PageRank, which they thought could make money from searching the Internet. Before they could make money they needed to run the algorithm, and the algorithm needed data. So they asked the new guy to store the Internet for them. Let us know when you're done; there's some used servers in the corner and coffee in the breakroom.

Jeff took a close look at the computers. Moore's Law states that processor power will double every eighteen months, so fast processing power was cheap, but hard drives had increased in size and not in speed. Since 1990, the transfer rate of hard drives has only increased from 5MB/sec to 100MB/sec. Disks were still organized into 64KB blocks, which was the POSIX standard. It takes 10^-3 seconds to read data from a seek operation on a disk while it takes 10^-15 seconds to read data from RAM. At the current transfer rate and block size, it would take twenty-two minutes to read 100GB into RAM. So he wrote a file system where the block size was 64MB rather than 64KB, a block 10^3 times larger, which cuts the number of seek operations needed to read data into RAM by the same factor.
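The effect of block size on read time can be sketched with a rough model. Everything in it is an assumption made for illustration: one seek per block (a worst case with no contiguous blocks), a seek cost of one millisecond, a 100MB/sec transfer rate and decimal units. The absolute totals depend on these assumptions and will not reproduce any particular measured figure; the point is the relative change in seek overhead.

```python
SEEK_SECONDS = 1e-3               # assumed cost of one disk seek
TRANSFER_BYTES_PER_SEC = 100e6    # assumed sequential transfer rate (100MB/sec)
DATA_BYTES = 100e9                # 100GB to read

def seek_overhead_seconds(block_bytes):
    """Time spent seeking, assuming one seek per block (worst case)."""
    return (DATA_BYTES / block_bytes) * SEEK_SECONDS

def read_time_seconds(block_bytes):
    """Total time: seek overhead plus raw transfer time."""
    return seek_overhead_seconds(block_bytes) + DATA_BYTES / TRANSFER_BYTES_PER_SEC

small_block, large_block = 64e3, 64e6   # 64KB vs. 64MB

for label, size in [("64KB", small_block), ("64MB", large_block)]:
    print(f"{label} blocks: {seek_overhead_seconds(size):8.1f}s seeking, "
          f"{read_time_seconds(size) / 60:5.1f} min total")

ratio = seek_overhead_seconds(small_block) / seek_overhead_seconds(large_block)
print(f"Seek overhead shrinks by a factor of {ratio:,.0f}")
```

Under these assumptions the larger block makes seek time negligible, so the read becomes limited by the transfer rate alone. That is the motivation for very large blocks in a file system built for scanning whole data sets.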

The machines that he had were commodity grade: not cheap, but nothing special. It made sense to assume they would fail regularly, so he built the file system to replicate each block onto two other machines so that there were always three copies. This, essentially, is the Google File System (GFS). Next, Jeff needed a system that let developers use GFS effectively. He wrote that too, and called it MapReduce. According to Jeff:

"Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. The runtime system takes care of the details of partitioning the input data, scheduling the program's execution across a set of machines, handling machine failures, and managing the required inter-machine communication. This allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system." (Jeffrey Dean and Sanjay Ghemawat, MapReduce: Simplified Data Processing on Large Clusters)

Engineers at the Apache Software Foundation used the papers on GFS and MapReduce to create Hadoop. Hadoop is an affordable, commercially available, parallel supercomputing platform. HDFS, the Hadoop Distributed File System, provides reliable shared storage, and MapReduce is a programming model for distributed data processing.
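To make the programming model concrete, here is a single-process sketch of the map, shuffle and reduce steps. It is not the Hadoop API and does no distribution at all; Hadoop's contribution is running these same phases across many machines. The log lines, device names and function names are hypothetical.

```python
from collections import defaultdict

def map_phase(records, mapper):
    """Apply the mapper to every record and yield its (key, value) pairs."""
    for record in records:
        yield from mapper(record)

def shuffle(pairs):
    """Group all values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups, reducer):
    """Apply the reducer once per key."""
    return {key: reducer(key, values) for key, values in groups.items()}

# Hypothetical tab-delimited device log lines: device id, event type.
log_lines = [
    "router-01\tLINK_DOWN",
    "router-02\tLINK_UP",
    "router-01\tLINK_UP",
    "router-01\tLINK_DOWN",
]

def mapper(line):
    device, _event = line.split("\t")
    yield device, 1

def reducer(_device, counts):
    return sum(counts)

events_per_device = reduce_phase(shuffle(map_phase(log_lines, mapper)), reducer)
print(events_per_device)
```

The developer writes only the mapper and the reducer; partitioning, scheduling and failure handling are the runtime's job, which is the point of the quotation above.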

Hadoop enabled Big Data, but Big Data is not Hadoop. Big Data is about identifying common problems that can be improved exponentially. If Jeff Dean had had money, he would have bought more servers, increasing the processing power P by the number of servers N, so he would get PN for his money. Instead, he increased the block size by a factor of 10^3, so now he gets PN^2. This is why people say that Hadoop is thousands of times faster and hundreds of times cheaper than comparable solutions. 10^3 is 1,000. Compare the price of a standard Dell server with an Oracle Exadata or IBM mainframe solution plus the associated licensing costs and you can quickly see where order-of-magnitude savings come from. It is not an exaggeration to say that this difference is essentially the reason Google exists today. If Jeff Dean had been given money, he would have bought a system from a vendor. Google would have pushed money into that system to feed PageRank. There wouldn't have been the in-house platform upon which Google built Google Maps, Gmail, Picasa and the many other products that go beyond search. Big Data is fundamentally an engineering creativity issue. And money kills creativity.

It was stated earlier that Hadoop is both an application (two applications, technically) and an ecosystem of products. This is a source of great confusion among technical staff and is repeatedly cited as a barrier to implementation. The following picture should be sufficient to clear up any confusion.

Big Data – 2cdata Perspective

At 2cdata, we are firm believers that smart algorithms running on open-source software are the key to successful Big Data and that big vendor price tags are the main reason for failure. Vendor packages provide knobs and dials to tweak pre-written algorithms. However, creativity plus domain knowledge can identify key algorithms that can deliver outsized benefits to the organization by fine-tuning around the exponents. We believe that engineers run on coffee, not cash.

NoSQL

NoSQL refers to a diverse set of storage technologies. The phrase NoSQL stands for Not Only SQL, which is a reference to one of Dr. Codd's 12 Rules for defining a relational database, a datastore based on relational algebra. Not only is a definition predicated on a negation fairly unhelpful, but many NoSQL databases provide a SQL-like interface, so now it's just confusing. Let's talk instead about what problems need to be solved and then identify technologies that are an appropriate solution.

One of the most helpful ways to identify which Big Data solution you need is to identify which properties are most important to your business for a particular set of data. CAP in the CAP Theorem stands for Consistency, Availability and Partition Tolerance. The basic idea is that while all three are great to have, you can only have two out of three for any database solution.

NoSQL Taxonomy

• Key-Value

• GraphDB

• Column Family

• Document Stores

Key-Value

Based on Amazon's Dynamo database, this is a large, distributed hashmap data structure that stores and retrieves opaque values by key. The hashmap is spread over multiple buckets across the network, and each bucket is replicated for fault tolerance using the formula R=2F+1, where the number of replicas R is a function of the number of failures F that can be tolerated. Following the Intelligent Key Design principle results in a system where keys are easily used by applications and hotspots are avoided. Key-value stores are aggregate stores and cannot provide joins. They are functionally similar in many ways to document stores, except that the application developer is shielded from the opaque value. Since these databases are descendants of Amazon's shopping cart service, they are optimized for high availability and scale. Dynamo, Riak, Redis, Cache and Voldemort are examples of applications in this space.
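The replication formula and the idea of hashing a key to a bucket can be sketched in a few lines. This is a minimal illustration under stated assumptions, not how Dynamo or any particular product places data: the hash-and-walk placement, the bucket count and the sample key are all hypothetical.

```python
import hashlib

def replicas_needed(failures_tolerated):
    """R = 2F + 1, the formula quoted in the text."""
    return 2 * failures_tolerated + 1

def buckets_for_key(key, bucket_count, replica_count):
    """Pick replica_count distinct buckets for a key.

    Uses a simple hash-and-walk placement; real stores use more
    elaborate schemes.
    """
    if replica_count > bucket_count:
        raise ValueError("not enough buckets for the requested replicas")
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    first = int.from_bytes(digest[:8], "big") % bucket_count
    return [(first + i) % bucket_count for i in range(replica_count)]

r = replicas_needed(1)  # tolerate one failure
placement = buckets_for_key("subscriber:573001234567", bucket_count=8, replica_count=r)
print(r, placement)
```

Because placement depends only on the hash of the key, a key that starts with a well-distributed value (rather than, say, a timestamp prefix shared by every write) spreads load across buckets, which is the hotspot concern mentioned above.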

GraphDB

Based on Google's Pregel whitepaper, graphs use edges and nodes to describe relationships. Graph models are developed that represent connected data in a complex domain. Graph databases manipulate graph models. Unlike relational databases and the other NoSQL solutions, which are aggregate stores, graph databases do not have to find relationships within data by brute force or O(n) algorithms. Graph databases can identify relationships in O(log n) time, which means that the time savings on large datasets is the difference between possible and impossible. Also, a graph database typically has a relational database providing OLTP data in the backend while it processes data in memory. Neo4j, AllegroGraph and Virtuoso are examples in this space.
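The difference between following edges and scanning every record can be illustrated with a plain adjacency list. This sketch is not a graph database and makes no claim about the complexity figures above; it only shows that a relationship query touches the neighbours of a node rather than the whole dataset. The call graph and the names in it are hypothetical.

```python
from collections import deque

# Hypothetical call graph: each subscriber maps to the subscribers they call.
calls = {
    "ana": {"luis", "maria"},
    "luis": {"ana", "pedro"},
    "maria": {"ana"},
    "pedro": {"luis", "sofia"},
    "sofia": {"pedro"},
}

def within_hops(graph, start, max_hops):
    """Return {node: hops} for nodes reachable from start in max_hops or fewer."""
    seen = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if seen[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbour in graph.get(node, ()):
            if neighbour not in seen:
                seen[neighbour] = seen[node] + 1
                queue.append(neighbour)
    del seen[start]
    return seen

print(within_hops(calls, "ana", 2))
```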

Column Family

Based on Google's BigTable, this data model is a sparsely populated table whose rows can contain arbitrary columns, the keys for which provide natural indexing. Logically, it can be considered a map of maps. This provides for a more expressive data model than key-value or document stores. HBase and Cassandra are examples in this space. It may be of interest to note that HBase is optimized for reads while Cassandra is optimized for writes.
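The "map of maps" description translates directly into code. The sketch below is an in-memory illustration only, not the HBase or Cassandra API; the row keys, column names and values are hypothetical usage records chosen to show that rows need not share the same columns.

```python
from collections import defaultdict

# row key -> column name -> value; rows need not share the same columns.
table = defaultdict(dict)

def put(row_key, column, value):
    table[row_key][column] = value

def get(row_key, column, default=None):
    # .get avoids creating an empty row when the key is missing
    return table.get(row_key, {}).get(column, default)

# Hypothetical records: a voice record and a data record with different fields.
put("573001234567#20130601T101500", "voice:duration_s", 184)
put("573001234567#20130601T101500", "voice:called", "573009876543")
put("573001234567#20130601T102000", "data:bytes_up", 52311)
put("573001234567#20130601T102000", "data:apn", "internet.example")

print(get("573001234567#20130601T101500", "voice:duration_s"))
print(get("573001234567#20130601T101500", "data:bytes_up"))
```

A missing column simply is not stored, so records with different shapes can live in one table without schema changes.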

Document Stores

Document databases store and retrieve hierarchically structured documents, typically in JSON or XML. Since these are disconnected entities, they should scale horizontally and linearly (but most require sharding), so they can grow very large with no performance degradation. Indexing is available at the cost of write performance. Transactionality is limited to an individual record and locking is left to the application level. They are aggregate stores and cannot provide joins. CouchDB, MongoDB and RavenDB are examples in this space.

Key-value stores are very applicable to managing data from network devices. Using Apache Flume to read streaming logs into Hadoop, MapReduce to perform analytics and HBase or Redis to provide querying seems a reasonable first step. Document stores are ready-made for external data. MongoDB, since it uses JavaScript, can be seamlessly integrated into an n-tiered web application offering social-network-augmented interactions, such as an online call center application, using a single, simple, well-understood programming language. CDR and xDR data can be well served by a Column Family store since the differences in data structure are trivially managed. The possibilities of graph data on both customer relationships and network interconnectivity in real time are very exciting. Customer and Billing data, of course, is best left to a relational database.

Business value should not depend on a strict data model. Unstructured data and semistructured data, used properly, can yield valuable and actionable insights.

Mobile

Why is mobile the future and not just a fad? Because human beings are mobile. More specifically, human beings are mobile and social.

In June of 2007, Apple launched the iPhone. Six years later, 1.5 billion people have bought a smartphone of some kind, which works out to (1.5*10^9)/(6*365.25) ~= 6.84*10^5 smartphones purchased per day on average. Mobile phones have heavy turnover as well. The mobile app stores provide a distribution platform comparable only to the Web itself. Facebook, with its one billion monthly active users after about eight and a half years, has an average daily user acquisition rate of (1*10^9)/(8.5*365.25) ~= 3.22*10^5. Signing up for Facebook is nearly frictionless (it's free and you just click a link), but its extraordinary growth is still exceeded by a technology that requires you to go to a store to make a purchase and pay recurring fees.
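The two daily rates are simple divisions and are easy to recompute. The sketch below uses the totals and durations given in the text and assumes 365.25 days per year; the function name is illustrative.

```python
DAYS_PER_YEAR = 365.25  # assumption: average year length including leap years

def per_day(total, years):
    """Average daily rate for a total accumulated over a number of years."""
    return total / (years * DAYS_PER_YEAR)

smartphones_per_day = per_day(1.5e9, 6)   # smartphone buyers over six years
facebook_per_day = per_day(1e9, 8.5)      # Facebook users over 8.5 years
print(f"{smartphones_per_day:.2e}", f"{facebook_per_day:.2e}")
```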

Today, people are concentrating on the tremendous potential growth of mobile phones and how profitable it will be to send time- and location-sensitive coupons to people based on their demographic information. This is and will continue to be, of course, very profitable. But it is also fairly uninspired. Dan Hesse, CEO of Sprint, said that the mobile phone was the most personal device ever created. It makes sense to look one step further, towards the 'Internet of Things'. When two forces converge, the ubiquity of handheld computers (smartphones) and wireless broadband as a basic assumption, there will be an increase in unattended devices that are meant to be controlled by a touchscreen or a mobile phone. Imagine when the most personal device ever created is paired with a world of connected objects. There are more opportunities here than just coupons.



Social

The future of CSPs is social.

It's become apparent that the social network connections between people are digital roads, the mobile phone is our vehicle of preference and virality is the mechanism to increase traffic on these roads. The virality equation is U(x) = (Np)^(x/t), where U is the number of users, x is the elapsed time, N is the number of people invited to share content, p is the probability that a given user will decide to share and t is the time interval between shares. Let's see what adjusting the relative values gets us.

If we assume that p = .1, N = 50 and t = 1 day, in three days we would have approximately U(3) = (.1 * 50)^(3/1) = 125 users. If we double p to .2 we'd get U(3) = (.2 * 50)^(3/1) = 1000 users. If we double N to 100 we'd get U(3) = (.1 * 100)^(3/1) = 1000 users. But if we halved t to .5 we'd get U(3) = (.1 * 50)^(3/.5) = 15625 users! The time interval between shares has the ability to increase the number of users nonlinearly because it affects the exponent of the virality equation. Once again, we see that concentrating on the exponent provides nonlinear positive results. Shorter incubation times mean rapid viral spread. YouTube made the design choice to use Flash rather than force the end user to install QuickTime, reducing t and improving YouTube's viral loop. Since busy roads mean more tolls, it is not unreasonable that CSPs could take an interest in seeing apps go viral. TeliaSonera and Spotify have shown the difference between being a "dumb pipe" and cooperating with a potentially viral app to reduce churn rate, increase ARPU and acquire market share.
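The four scenarios can be recomputed with a short function. This is a sketch of the equation as the worked figures use it, treating x as elapsed time in the same unit as t; the function and parameter names are illustrative.

```python
def viral_users(n_invited, p_share, elapsed, interval):
    """U(x) = (N * p) ** (x / t), with x and t in the same time unit."""
    return (n_invited * p_share) ** (elapsed / interval)

baseline = viral_users(50, 0.1, 3, 1)
double_p = viral_users(50, 0.2, 3, 1)
double_n = viral_users(100, 0.1, 3, 1)
half_t = viral_users(50, 0.1, 3, 0.5)

# round() guards against floating-point noise in the powers
print(round(baseline), round(double_p), round(double_n), round(half_t))
```

Doubling p or N multiplies the base, while halving t doubles the exponent, which is why the last scenario dwarfs the others.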

Appendix I: Industry Survey

Which of the following benefits would you receive from big data analytics?

• Better targeted social media influencer marketing 61%

• More numerous and accurate business insights 45%

• Segmentation of customer base 41%

• Recognition of sales and marketing opportunities 38%



• Automated decisions for real-time processes 37%

• Definitions of churn and other customer behaviors 35%

• Detection of fraud 33%

• Greater leverage and ROI for big data 30%

• Quantification of risks 30%

• Trending for market sentiments 30%

• Understanding of business change 29%

• Better planning and forecasting 29%

• Identification of root causes of cost 29%

• Understanding consumer behavior from clickstreams 27%

• Manufacturing yield improvements 6%

• Other 4%

The benefits were then broken down as follows:

Customer Experience

• Better targeted social media influencer marketing 61%

• Recognition of sales and marketing opportunities 38%

• Definitions of churn and other customer behaviors 35%

• Understanding consumer behavior from clickstreams 27%

BI in general can benefit

• More numerous and accurate business insights 45%

• Understanding of business change 29%

• Better planning and forecasting 29%

• Identification of root causes of cost 29%

Specific applications

• Automated decisions for real-time processes 37%

• Detection of fraud 33%



• Quantification of risks 30%

• Trending for market sentiments 30%

In your organization, what are the top potential barriers to implementing big data analytics?

• Inadequate staffing or skills for big data analytics 46%

• Cost, overall 42%

• Lack of business sponsorship 38%

• Difficulty of architecting big data systems 33%

• Current database software lacks in-database analytics 32%

• Lack of compelling business case 28%

• Scalability problems with big data 23%

• Cannot make big data usable for end users 22%

• Database software cannot process analytic queries fast enough 22%

• Current data warehouse modeled for reports and OLAP only 22%

• Current database software cannot load data fast enough 21%

• Can't find Hadoop experts to hire 11%

• Can't fund Hadoop's high operational expenses 7%

• Other 6%

The challenges were then broken down as follows:

Inadequate staffing and skills are the leading barrier

• Inadequate staffing or skills for big data analytics 46%

• Difficulty of architecting big data systems 33%

• Cannot make big data usable for end users 22%

• Can't find Hadoop experts to hire 11%

Lack of business support

• Cost, overall 42%

• Lack of business sponsorship 38%



• Lack of compelling business case 28%

• Can't fund Hadoop's high operational expenses 7%

Problems with database software

• Current database software lacks in-database analytics 32%

• Scalability problems with big data 23%

• Database software cannot process analytic queries fast enough 22%

• Current data warehouse modeled for reports and OLAP only 22%

• Current database software cannot load data fast enough 21%

