
From Hype to Action: Getting What’s Needed from Big Data Analytics Geeta Deodhar, VP Product Management

ABSTRACT For every technology revolution, someone pays. Most often, it’s the “early adopter” who seeks strategic advantage but lacks the ability to manage the changes in process and protocol within their organization. Balancing the urgency of delivering results against the hype surrounding innovations, the realities of market adoption, and the time frame to mainstream adoption is as difficult for struggling startups as moving from early adoption to deployment is for innovators in larger companies.

Now that big data has passed the peak of Gartner’s Hype Cycle, and given the amount of attention big data technologies are receiving, one might think that adoption and deployment are already pervasive. The truth is, most companies are still trying to get a handle on big data and are still struggling to find the products to manage it effectively in order to realize tangible business benefits.

There is, however, a significant shift happening. Big data, and the analytics applications built on it, are establishing a new norm for using data not only to report what has happened but also to predict what may happen with far greater accuracy. The new generation of advanced, predictive analytics technology is now at the forefront of a paradigm shift that will firmly establish ROI for those businesses able to integrate the developing technology into their networks. Companies that curate a data pipeline workflow will be more effective than those that are slow to adopt.


CONTENTS

Abstract
Disruption
    Where Buzz Meets Business
    Hype and Reality
    Expectations
    Examples of Success
    Challenges for Most
    Early Stages
The IT Challenge
    Delivering Cost-Effective, Innovative Solutions in an Evolving Technology Landscape
    Delivering Data Insights to Business Users – On Their Terms
The Ideal Solution
    What’s Available
    Ability to Easily Manage End-to-End Data Pipeline Management Workflow
    The Data Management Workflow
    Open Big Data Analytics Platform That Delivers
    Useful Templates for Speed, Usability, and Visualization
    A Real-Time Distributed Learning Framework
    Learning Machines
Leading The Shift
Conclusion


DISRUPTION Everything around us is being disrupted. The Internet is changing the way entire industries operate. Mobile and cloud-based technology is now creating new business models for companies and their customers. From industrial applications to sensors, to personal health data, to social media, it all runs through the Cloud. Social networks are mainstream and connected smart devices are ubiquitous. And there is data everywhere.

The growth in data has become a tidal wave. According to industry analyst group IDC, the volume of digital records alone will be 44 times larger in 2020 than it was in 2009. When machine-generated data is included, that number grows even faster. Data is becoming the basis of the biggest decisions businesses make.

IDC projects that this digital universe will double in size every two years into the next decade, driven by four key areas: sensors and devices, social media, voice-over-Internet communication, and enterprise data. By 2020, this digital universe of data will reach 44 zettabytes, or 44 trillion gigabytes. This incredible volume of data is generating a new wave of opportunities for the businesses that are able to take advantage of it. (Source: IDC, The Digital Universe of Opportunities: Rich Data and the Increasing Value of the Internet of Things, April 2014.)


But, just as the connected world is creating overwhelming amounts of data, technology advances are enabling storage and analytics to make sense of it all.

Where Buzz Meets Business

Capturing and collecting data to better understand the needs of customers and to provide better customer experiences is something businesses have always strived for. Successful businesses do their best to understand customer needs by capturing data from all of their business processes. By analyzing demographic data along with transactional histories, companies can predict what customers might purchase next. This type of data is mostly stored in relational databases and, for analysis, consolidated in data warehouses. The volume of this type of data is usually proportional to the size of the business and is well within the capabilities of typical data warehouse infrastructures. The data is typically analyzed and reported on using traditional business intelligence (BI) tools that generate descriptive analysis based on simple correlations and aggregations. This type of reporting is useful for understanding what happened in the past. The exact same data can be leveraged to predict what may happen in the future by identifying patterns. The accuracy of these predictions is only as good as the quality, volume, and relevance of the underlying data stored in the data warehouse.
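To make the distinction concrete, here is a minimal sketch of how the same warehouse extract can serve both descriptive reporting and simple prediction. The file and column names are hypothetical, and it assumes pandas and scikit-learn are available; a real deployment would draw on properly governed warehouse tables.

```python
# Minimal sketch: one warehouse extract, two uses.
# File and column names are hypothetical; assumes pandas and scikit-learn.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

orders = pd.read_csv("orders.csv")  # hypothetical data warehouse extract

# Descriptive (classic BI): aggregate what has already happened.
print(orders.groupby("customer_segment")["order_total"].sum())

# Predictive: learn patterns in the same rows to estimate what may happen.
X = orders[["age", "tenure_months", "orders_last_90d", "avg_order_total"]]
y = orders["purchased_next_quarter"]  # 1 if a repeat purchase followed

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
model = RandomForestClassifier(n_estimators=100).fit(X_train, y_train)
print("holdout accuracy:", model.score(X_test, y_test))
```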

Now, in the era of big data, businesses find themselves grappling with additional, mostly heterogeneous, and complex data that is being generated rapidly, and from multiple touch-points. The term “big data” is actually an all-encompassing term for any collection of data sets so large and complex that it becomes difficult to process using traditional data processing applications. It requires new types of databases to store the data; new data lakes that can scale; new analytics technology that can help identify useful data points and predict outcomes with great accuracy; and new intuitive data visualization techniques to help end users actually take advantage of big data analysis in their day-to-day activities.

Businesses are now able to look at social conversations, machine-to-machine (M2M) communications, and sensor data to dig deeper and learn more. Today’s businesses are well aware that those who can leverage all of this data to drive better business decisions will gain significant competitive advantage, because they can stay attuned to dynamic markets and customer behaviors and adjust accordingly. The primary sources of this new data growth are:

Social conversations can occur both externally and internally to a business. These online conversations take place between customers, partners, and employees. Customers often tweet about products or discuss their merits on social networking sites. People might post blogs about a specific solution they need, or provide product reviews. If businesses can tap into these social conversations, they can gain a unique perspective on the strength of their brand in the market.

Machine-to-machine (M2M) communications provide another valuable data stream. The Internet is not just about people connecting to one another; it also enables machines to share data with other machines and humans. This exchange of information between devices and humans is commonly known as the “Internet of Things” (IoT), and the data generated by M2M communications is changing how the world understands and interacts with machines – and data. Automobiles, for example, have been gathering diagnostic and geospatial position data for many years. By 2025, every new car is expected to be connected to the Internet, collecting and sending data about driving habits, regular commute distances, speed, and other granular diagnostics about the car and its driver.

Every mechanical device of the past is getting connected as well. From home heating and alarm systems to kitchen appliances and airplanes, these devices will be able to sense a variety of data points about their own functioning and their surrounding environments and communicate with other machines for analysis and correction. Today, a Boeing 787 creates half a terabyte of data per flight. Combine that with data already stored in the cloud, and it is easy to see how quickly petabytes of data become available for businesses to analyze and benefit from.

The Internet of Things (IoT) refers to a new generation of smart hardware devices equipped with sensors capable of recognizing temperature, motion, acceleration, light, sight, sound, and even smell, all connected over networks of all kinds. In addition to sensor data, these devices can send a great deal of data about their internal workings for diagnosis and analysis. As this digital ecosystem develops and sustains itself, it will continue to generate data for many useful applications. These smart devices can detect the presence of other devices as well as humans, track their movements and conditions, and generate sensor data about both devices and people every second. Many applications for these kinds of devices have already been identified in the healthcare, diagnostics, and retail industries, and many more are bound to appear. This sensorial information will create a torrent of data measured in yottabytes – and beyond.1

Mining these data streams will enable businesses to gain significant competitive advantage, as they can proactively detect and address problems efficiently and leverage opportunities effectively. But that will all depend on the ability of businesses to properly manage the data pipeline, for quality and timeliness, and then to make sense of it through proper modeling and analytics.

1 http://gigaom.com/2012/10/30/as-data-gets-bigger-what-comes-after-a-yottabyte/


Hype and Reality

Big data is just beyond the hype phase. It is real and it is here to stay. Organizations will need to face and overcome challenges before it becomes mainstream.

Abundant data from the web, sensors, smartphones, and corporate databases can be mined for hidden efficiencies and insights, enabling smarter, data-driven decision-making in every field. But at this nascent stage of development, mining big data requires the highly developed skills of a special group of people: data scientists. In a recent article in the New York Times, writer Steve Lohr reported that these hybrid IT/business specialists refer to much of their work – in their own words – as “data wrangling,” “data munging,” and “data janitor work.” Data scientists can spend 50 to 80 percent of their time collecting and preparing data before it can even be reviewed for potential insight. This brings on what used to be referred to as the “Cold Fusion Effect”: expending more energy to produce a result than the result returns – a net loss.
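As a small illustration of why preparation dominates this work, the following sketch shows the kind of janitorial steps that must happen before any analysis. The file and column names are hypothetical; pandas is assumed.

```python
# A taste of "data janitor work": typical cleanup before any analysis.
# File and column names are hypothetical; assumes pandas.
import pandas as pd

raw = pd.read_csv("web_events.csv")  # hypothetical export from one touch point

# Reconcile inconsistent spellings of the same value.
raw["country"] = (raw["country"].str.strip().str.upper()
                  .replace({"USA": "US", "U.S.": "US"}))

# Parse timestamps that arrive in mixed formats; discard unparseable rows.
raw["ts"] = pd.to_datetime(raw["ts"], errors="coerce")
raw = raw.dropna(subset=["ts"])

# Drop duplicate events and make missing values explicit.
clean = raw.drop_duplicates(subset=["user_id", "ts", "event"]).copy()
clean["revenue"] = clean["revenue"].fillna(0.0)
```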

Just like many of the technological trends that preceded big data analytics, however, the technology is adapting to meet the challenges, and businesses are beginning to find the economic advantages and incorporate new practices to exploit them.

According to Gartner’s 2014 Hype Cycle for Advanced Analytics and Data Science, it would appear that big data, and its accompanying advanced analytics, have moved through the phase where the data load outguns the ability to manage and make sense of it.


The light at the end of the tunnel is the excitement around the early business improvements already being realized.

The reality – and the success – of big data hinges on the challenges of managing massive data sets: analysis, capture, curation, search, sharing, storage, transfer, visualization, privacy, and regulatory requirements. In a nutshell, managing the end-to-end data pipeline workflow. The differences between data as it was known and the big data of today are its sheer volume, its growth rate, and its heterogeneity and complexity. But with the technological advances in machine learning, relationships between seemingly unrelated data sets are now being found that make it possible to derive meaningful and useful information.

Big data, when finally harnessed effectively, will be able to enhance the quality of life for everyone on the planet. Whether it’s reviewing social conversations, properly tuning M2M communications, or optimizing home energy consumption, the key to harnessing this power revolves around a few important aspects:

• The ease with which the data can be accessed and managed
• The availability of top-notch analysts who can make sense of it
• A process within an organization through which it can be actualized into something truly transformational

Expectations

In business, digital foresight can drive significant growth and competitive advantage. With adequate systems, businesses can capture all possible data about every aspect of their processes and products more quickly and radically improve market agility.

Business leaders want to be able to analyze all the data, regardless of source, volume, or structure – or the lack of it. Businesses expect all this data to be leveraged in real time to optimize every business transaction, and they have to capture and analyze it quickly, especially when the importance and relevance of the data is transient, as with social media. If businesses cannot capture negative sentiment from consumers and act on it in a reasonable amount of time, that data becomes irrelevant. Sensor data is even more demanding: it must be watched, analyzed, and often acted upon in near real time.

Despite the challenges of deployment and analysis, IDC found that big data’s early successes have convinced a high percentage of companies that it is here to stay. In fact, 74 percent see it being adopted in at least one department within the next three years.


According to the IDC survey, the top four areas driving investment in big data initiatives are:

• Improving the quality of decision-making (59%)
• Increasing the speed of decision-making (53%)
• Improving planning and forecasting (47%)
• Developing new products, services, and revenue streams (47%)

Businesses also expect these data insights to be delivered to end users in a way they can easily understand and explore in order to improve business processes. IT needs to provide self-service data discovery tools so that business users can gain both valuable insights and actionable intelligence to apply in their daily jobs.

Examples of Success

The information found in big data is now being harnessed by some of the world’s most innovative companies to track, understand and even influence consumer behavior. For example:

• Netflix makes recommendations based more on what viewers watch than on the ratings they give. Almost 75 percent of what viewers watch comes from these data-based recommendations – and, in the end, that keeps customers coming back for more.

• Starbucks uses big data and analytics to create the secret of its success – mass customization. Who knew that the Chicken Santa Fe Panini with a Caramel Brulée Latte would be a hit? They did!


• Tesla exceeds the customer care delivered by every other car company through its use of advanced analytics. It has modeled consumer buying patterns and applied geographic algorithms to create an unparalleled service system. From service locations to actual capabilities, Tesla has earned the highest consumer loyalty rating Consumer Reports has ever recorded.

The Harvard Business Review recently reported that companies in the top third of their industry in the use of data-driven decision making are, on average, five percent more productive and six percent more profitable than their competitors. Based on these findings, it would seem that big data analytics adoption should become mainstream rapidly.

Challenges for Most

As depicted in the Gartner Hype Cycle, full adoption of big data analytics is far from mainstream. For most companies and their IT groups, the crux is managing the fire-hose flow of data and applying the necessary modeling and analytics. Creating a source of opportunity from those tools, and having the corporate processes that allow a business to take advantage of them, poses a further challenge. As a result, this high-value information is still a department-by-department effort, not a corporate initiative. According to a recent Gartner report, by 2017, 90 percent of the information assets from big data analytics efforts will be siloed and impossible to leverage across multiple business processes, even though demand for a usable cross-organization data analytics platform may be high. (Source: Gartner, Focus on the 'Three Vs' of Big Data Analytics: Variability, Veracity and Value, November 24, 2014. Analyst: Alan D. Duncan.)

One contributing factor to these siloed big data analytics efforts is the limited availability of data scientists able to manage a data pipeline workflow. IDC research shows a clear talent gap between the supply of data analysts and the expanding need.

Early Stages

Enterprises struggle to meet this challenge head-on because of limitations in infrastructure; unknown or dated privacy and data-access policies; poor data governance; still-evolving data management and analytics technologies; and a simple lack of understanding of how these findings can affect an organization. Put simply, they lack the right talent across all technology and business layers to manage and make sense of the data.

In 2013, IDC surveyed 700 large North American enterprises and found that less than one percent had achieved a high level of big data usage where it was operationalized and used to provide continuous process improvement. And, Gartner has stated that, “Through 2018, 90 percent of deployed data lakes will be useless as they are overwhelmed with information assets captured for uncertain use cases.” (Source: Gartner Predicts 2015: Big Data Challenges Move From Technology to the Organization, November 28, 2014. Analysts: Nick Heudecker, Lakshmi Randall, et al.)


Advanced analytics is about problem solving and predicting. Traditionally, analytics has been about reporting what happened (descriptive analytics). Advanced analytics helps predict future outcomes and behavior. As a result, advanced analytics requires a very different set of skills. It is not the realm of the typical business analyst.

Gartner predicts that, by 2015, 4.4 million jobs globally will be needed to cope with big data, but that only a third of those jobs will be filled. Gartner expects the data scientist role to be in especially high demand because it is a multidisciplinary practice encompassing the creation and rationalization of advanced analytics. It typically includes the following skills:

• Advanced statistics
• Machine learning
• Computer science
• Operations research
• Programming
• Data management

(Source: Gartner, Major Myths About Big Data's Impact on Analytics, September 15, 2014. Analysts: Alexander Linden, Mark A. Beyer, et al.)

With a capable IT team in place that includes this type of expertise, it is certainly possible to deploy advanced analytics to help the business. But beyond the impact of cost and capability on adoption, one of the overriding factors in implementation is the prevailing attitude among IT and business executives toward being early adopters. A number of technology trends have taken much longer to prove ROI than expected. Having seen many technology boom-and-bust cycles over the past few years, few IT audiences or business decision makers are as receptive as one might think to the next “game changer” and its accompanying price tag.

According to the Harvard Business Review, “…our experience reveals that most companies are unsure how to proceed. Leaders are understandably leery of making substantial investments in big data and advanced analytics. They’re convinced that their organizations simply aren’t ready. After all, companies may not fully understand the data they already have, or perhaps they’ve lost piles of money on data-warehousing programs that never meshed with business processes, or maybe their current analytics programs are too complicated or don’t yield insights that can be put to use. Or all of the above. No wonder skepticism abounds.” (Source: Harvard Business Review, Making Advanced Analytics Work for You, Dominic Barton and David Court, October 2012.)

IT must be reinvented to address the challenges posed by big data and to establish a data-driven decision-making culture across the organization.


THE IT CHALLENGE Delivering Cost-Effective, Innovative Solutions in an Evolving Technology Landscape

While current business intelligence technology is good for analyzing structured, transactional, and demographic data to gain historical perspective, it cannot handle heterogeneous, complex data. IT’s traditional method of delivering static reports on a periodic basis via email is of little value when end users are trying to access real-time, predictive data insights.

IT departments have to figure out how to use new database technologies for data storage, understand machine learning frameworks, and deliver sophisticated, interactive applications that visualize big data so end users can interact with it and discover insights on their own. In a still-evolving technology landscape, the choices are slim: organizations must either retrain their engineers or hire data scientists and engineers with the right skills for building analytics solutions that address changing business needs. Data science and machine learning, however, are highly specialized areas, and the scarcity of talent in these disciplines drives costs higher.

Delivering Data Insights to Business Users – On Their Terms

Because end user expectations have changed, IT leaders are addressing these challenges by embracing new tools and techniques. Just as spreadsheets made financial calculations and simple modeling accessible to millions of non-experts in business, IT organizations are working to identify the right tools to cover the entire data pipeline management workflow, and are beginning to integrate them to bring advanced data analysis to the everyday business end user.

IT is adapting to a new business model in which the value chain includes end-to-end processes for sophisticated data capture and preparation techniques, advanced analytics with machine learning, and clever interactive visualizations that put fresh data into the hands of those who need it in a cohesive, simple interface. By addressing every step in the data pipeline management workflow with the right solution, IT can give business users a comprehensive understanding of the data through an intuitive, interactive, and enriched user experience. Those organizations that do not enable IT to provide these advanced data insights, despite the high costs, will quickly lose to the businesses that do.

THE IDEAL SOLUTION The ideal solution calls for a methodology to guide data analysts through the end-to-end data pipeline management workflow for each business problem. The process should start with identifying the question that needs to be answered for a given business process; then identifying the data; then leveraging the right tools to capture the data at the source; and then using the appropriate tools and technology to store and analyze that data. Lastly, leveraging the most appropriate data visualization techniques to build data products allows end users to intuitively explore data and discover additional insight in a self-serve manner.

Additionally, the data products need to be embedded seamlessly in the applications users already rely on for their day-to-day activities to create enterprise-wide, pervasive analytics. The process does not end there. Analysis, by nature, is iterative. Analysts should be able to identify better ways to analyze and represent the data, gaining incremental value in an ongoing process.

Only then will businesses be able to use the data to make informed decisions.

Part of this methodology calls for subject matter expertise. Domain and data experts who understand both structured and unstructured, complex data in the context of the problem are able to recognize the relevant patterns and build the right solutions. Another part of this methodology requires knowledge of, and skills in, advanced analytics frameworks that are constantly evolving. For example, in some cases, distributed, real-time machine learning frameworks such as Jubatus enable IT organizations to deploy machine learning at the source in a distributed fashion, where learning occurs at each deployment and knowledge is collated and even applied at each location. This type of technology takes knowledge and decision-making to the next level: data is captured, processed, and analyzed at the source in near real time for truly proactive decision-making.
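The sketch below illustrates the general pattern behind such frameworks, sometimes called model mixing: local learners update in real time on their own streams and periodically collate knowledge by averaging model weights. This is a simplified illustration with invented names and a toy learning rule, not Jubatus’s actual API.

```python
# Simplified sketch of distributed online learning: each node learns on its
# own local stream, and nodes periodically "mix" (average) their weights so
# knowledge learned anywhere is applied everywhere. Illustrative only.
import numpy as np

class LocalLearner:
    """Online linear classifier trained at one deployment location."""
    def __init__(self, n_features: int):
        self.w = np.zeros(n_features)

    def update(self, x: np.ndarray, y: int, lr: float = 0.1) -> None:
        # One perceptron-style step per arriving data instance.
        if y * np.dot(self.w, x) <= 0:
            self.w += lr * y * x

def mix(learners):
    """Collate knowledge: average weights, then push the result back out."""
    avg = np.mean([l.w for l in learners], axis=0)
    for l in learners:
        l.w = avg.copy()

# Usage: three edge locations learn from local streams, mixing periodically.
nodes = [LocalLearner(n_features=4) for _ in range(3)]
rng = np.random.default_rng(0)
for step in range(1000):
    for node in nodes:
        x = rng.normal(size=4)
        y = 1 if x[0] + x[1] > 0 else -1   # hypothetical shared concept
        node.update(x, y)
    if step % 100 == 99:
        mix(nodes)  # every node now benefits from every stream
```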

An ideal solution will allow data scientists to iterate on large data sets to identify valuable information, then continue to experiment with additional useful data sets to refine the analysis and package the insights for immediate and continuous value. And this solution should enable data scientists to deploy the knowledge where it matters most. The current set of technologies enables data scientists to build analytical models comparatively easily; however, analysts still need to rewrite each model to scale, both for the data volume it must process and for the time in which the analysis must be completed. Ideally, analysts need an easy way to deploy analytical models, then monitor and evaluate them over time in order to make the adjustments necessary to maintain accuracy and performance. In short, they need a platform that standardizes and automates data pipeline workflow management.

As the underlying technology matures, smarter, more sophisticated tools will replace the ones already in use. IT organizations, however, will need standardized platforms to take advantage of the new technology without disrupting the rest of the solution footprint.

What’s Available

Many tools in the marketplace cater to parts of the data pipeline management workflow. IT departments must find skilled engineers and data scientists to create a cohesive solution to manage the data pipeline, and then update and maintain that integrated solution as new and improved tools come to market. This is one of the reasons big data analytics adoption is not yet commonplace.


Ability to Easily Manage End-to-End Data Pipeline Management Workflow

What’s needed is a big data analytics platform, for data scientists, software engineers, and even business users, that supports business agility. It needs to allow users to work with data, experiment with it iteratively, and explore it to discover valuable insights. This platform should be designed for ease of use, simplicity, and clarity.

The Data Management Workflow

The key steps in the end-to-end data pipeline management process are listed below (a minimal code sketch of the full workflow follows the list):

• Data Capture

Capturing and securing access to data across all business touch points, domains, and processes presents a modern challenge in security, privacy, governance, and data quality. Once a data lifecycle process has been developed, business-critical information should take priority.

• Data Preparation

Adequate data wrangling tools are needed to prepare data from diverse sources and formats and turn it into useful, structured datasets. Managing a SQL-based data warehouse, developing data models for accessing heterogeneous information, and then having that data ready for advanced analytics is a challenge. But focusing on data quality itself is always important – even in the big data world.

• Feature Engineering

With properly prepared, modeled, and structured data sets, users can begin to understand data distribution and relevance in order to identify significant features for machine learning. Proper modeling and data integrity tools are necessary.

• Machine Learning


Through machine learning, users can finally experience the true value of big data. In a world where more data is better, machine learning can take over and begin the deep analysis of complex data through its own logic assessments.

• Data Visualization and Data Products

Through the logic of machine learning, users can create data products, including predictive models, along with interactive data visualizations that enable self-service data discovery.
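The sketch below maps the five steps above onto code. The sources, column names, and deliberately tiny model are hypothetical (pandas, scikit-learn, and matplotlib assumed); a real platform would wrap each stage in managed, reusable components.

```python
# Hypothetical end-to-end sketch of the five workflow steps above.
# Sources, columns, and model choice are illustrative only.
import pandas as pd
from sklearn.linear_model import LogisticRegression

def capture() -> pd.DataFrame:                      # 1. Data Capture
    return pd.read_json("sensor_feed.json")         # hypothetical source

def prepare(df: pd.DataFrame) -> pd.DataFrame:      # 2. Data Preparation
    return df.dropna().drop_duplicates()

def engineer(df: pd.DataFrame) -> pd.DataFrame:     # 3. Feature Engineering
    df = df.copy()
    df["temp_delta"] = df["temp"].diff().fillna(0)  # derived signal
    return df

def learn(df: pd.DataFrame) -> LogisticRegression:  # 4. Machine Learning
    return LogisticRegression().fit(df[["temp", "temp_delta"]], df["failed"])

def visualize(df, model) -> None:                   # 5. Visualization / Data Product
    df = df.copy()
    df["failure_risk"] = model.predict_proba(df[["temp", "temp_delta"]])[:, 1]
    df.plot(x="timestamp", y="failure_risk")        # feeds a self-service view

data = engineer(prepare(capture()))
visualize(data, learn(data))
```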

Open Big Data Analytics Platform That Delivers

Today’s cloud services offer a new option for storing and using rapidly scaling data, though proper controls and service level agreements must be in place for companies to trust cloud service providers with sensitive or valuable data. Companies might consider starting with a private cloud solution (good for use-case development) and moving toward a secure hybrid version. By starting small and experimenting, organizations can also limit their outlay while determining the investment’s full value. Starting with a cloud-based option will help IT grow the solution and justify further investment.

To take full advantage of big data analytics systems, however, the platform IT uses needs to be able to “plug and play” with better performing or lower cost variants of the technology currently in use. This reduces the friction of transitions and changing templates.

Useful Templates for Speed, Usability, and Visualization

A business benefits greatly from a set of pre-packaged components offering familiar tools and techniques for data capture, data preparation, data visualization, and data analysis. To get meaningful insight through data discovery, teams in specific business domains, such as security, cloud orchestration, and cloud service brokerage, must be able to quickly build an end-to-end process to mine domain-specific data by stringing together pre-packaged components at every step. From capturing the data, to preparing it with usable models, to exploring it with available visualization templates, users should be able to deploy the components where needed through familiar user interfaces, as the sketch below illustrates.
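As one concrete analogy, scikit-learn’s Pipeline shows what stringing pre-packaged components behind a single interface can look like in code; the particular components chosen here are illustrative, not a prescription.

```python
# Illustrative only: chaining pre-packaged components behind one interface,
# in the spirit of the templated platform described above.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression

domain_pipeline = Pipeline([
    ("scale", StandardScaler()),               # packaged preparation step
    ("select", SelectKBest(f_classif, k=10)),  # packaged feature step
    ("model", LogisticRegression()),           # packaged analysis step
])

# domain_pipeline.fit(X_train, y_train)  # one call runs the whole chain;
# swapping any component changes one line, not the surrounding solution.
```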

When advanced data analytics are handled in the cloud, IT can focus on other important tasks such as collaborating with business users to better understand domain-specific analytics requirements and implement solutions in a reasonable period to provide value to all areas of the business.

A Real-Time Distributed Learning Framework

There are many machine learning software packages available today. These can be roughly categorized into four major groups, based on whether they support the batch or online learning paradigm, and whether or not the learning can be distributed. Most of the classic and well-known machine learning packages, such as R and MATLAB, are non-distributed and batch-oriented. With the increasing adoption of Hadoop, there are now many distributed batch-learning frameworks, such as Apache Mahout, that are typically built on the Hadoop technology stack and designed to elevate big data analysis to the next level. All of these packages, however, support only batch learning.

Learning Machines

There is another category of machine learning called online machine learning, also known as incremental or adaptive learning. Online learning algorithms are designed to learn from data streams in real time, as opposed to learning offline from stored data as a batch. Several emerging software packages implement this type of learning; the most notable are Vowpal Wabbit, MOA, scikit-learn, and the RapidMiner data stream plugin. All of these, however, are non-distributed.

Online learning is known to be difficult to scale and distribute. Jubatus is the first software package to combine online and distributed learning; SAMOA, developed by Yahoo Research in Europe, also tackles online, distributed learning.

In order to appreciate the need for online and distributed machine learning software packages, it is essential to consider the challenges of learning on big data streams.

When machine learning is applied to big data streams, there are two main challenges.

First, each analytical model needs to be refreshed continuously to handle a problem known as concept drift. Concept drift happens when there is a significant change in the data distribution, enough to make existing models obsolete. This situation can be handled with batch learning; however, the model is then always somewhat out of sync with the actual data distribution until the next offline batch run is performed. With online learning, since the model is updated in real time as each new data instance arrives, the model is always up to date.

Second, as data streams grow in volume, the machine learning software has to scale to keep up with them. There are two approaches to handling these challenges. One is to use conventional batch machine learning and re-train the model periodically. The other, used by Jubatus, is online learning, where learning happens while the model is also applying its knowledge to the incoming data.
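A minimal sketch of the online approach follows, using scikit-learn’s SGDClassifier, one of the online-capable packages named above; the drifting stream here is synthetic and invented for illustration.

```python
# Minimal online-learning sketch: the model is updated with each arriving
# mini-batch via partial_fit, so it tracks a drifting stream instead of
# waiting for a periodic offline re-train. The stream is synthetic.
import numpy as np
from sklearn.linear_model import SGDClassifier

model = SGDClassifier()
classes = np.array([0, 1])          # all labels must be declared up front

rng = np.random.default_rng(1)
for t in range(200):                # each iteration = a new stream batch
    X = rng.normal(size=(32, 5))
    drift = 0.01 * t                # the underlying concept slowly shifts
    y = (X[:, 0] + drift * X[:, 1] > 0).astype(int)
    model.partial_fit(X, y, classes=classes)  # model stays up to date
```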

Jubatus is a leading framework, and one of the best suited to applying machine learning to the large data sets that originate from big data streams.


LEADING THE SHIFT "By visualizing information, we turn it into a landscape that you can explore with your eyes, a sort of information map. And when you're lost in information, an information map is kind of useful." – David McCandless

What IT and business leaders are learning is that the technology to help manage the data pipeline workflow is advancing rapidly. With it, businesses will be able to leverage big data and advanced analytics to optimize:

Customer Analytics, to:

• Increase acquisition
• Reduce churn
• Increase revenue per customer
• Improve existing products

Operational Analytics, for:

• Industrial monitoring and optimization
• Supply chain efficiency
• IT operations analytics
• Network planning and optimization

As shown, there is a shortage of the data scientists needed to make sense of the data, there is a developing ecosystem of new products, and the demand is there. Utilizing a large-scale yet open platform that can readily integrate with a range of databases, analytics engines, and templates will help reduce overall cost, improve ROI, and help justify a critical investment.

With predictive analytics, companies will have a much greater chance of making better decisions that keep their customers engaged. And in becoming more data-driven organizations, companies will see overall improvement in new product development and service innovation, as Netflix, Starbucks, and Tesla have.

CONCLUSION The megatrends described here bring enterprises many opportunities to raise their productivity and grow with innovative applications. To capture these opportunities, enterprises must overcome certain technological hurdles that go along with these developing trends. With big data all around us, the number of exabytes created is growing exponentially every day. Over 90 percent of the world’s data was generated in the last two years alone.

Many technology vendors are developing products to solve specific infrastructure shortcomings, but the bigger challenge for enterprises is to integrate those products into their IT infrastructures, operate them, and update them while maintaining leading-edge capabilities. As the technology lifecycle accelerates, rebuilding infrastructure under the traditional IT-as-an-asset model will prove a financial burden that only the most affluent enterprises can bear. Over time, companies will increasingly look to, and need, the Cloud for the business agility required for a successful organization, now and in the future.