
INTRODUCTION TO HADOOP

What is Big Data?

Every day, we create 2.5 quintillion bytes of data, so much that 90% of the data in the world today has been created in the last two years alone.

Gartner defines Big Data as high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.

According to IBM, 80% of data captured today is unstructured, from sensors used to gather climate information, posts to social media sites, digital pictures and videos, purchase transaction records, and cell phone GPS signals, to name a few. All of this unstructured data is Big Data.

Big data spans three dimensions: Volume, Velocity and Variety.

Volume: Enterprises are awash with ever-growing data of all types, easily amassing terabytes, even petabytes, of information.

Turn 12 terabytes of Tweets created each day into improved product sentiment analysis.

Convert 350 billion annual meter readings to better predict power consumption.

Velocity: Sometimes 2 minutes is too late. For time-sensitive processes such as catching fraud, big data must be used as it streams into your enterprise in order to maximize its value.

Scrutinize 5 million trade events created each day to identify potential fraud.

Analyze 500 million daily call detail records in real time to predict customer churn faster.

Variety: Big data is any type of data, structured and unstructured, such as text, sensor data, audio, video, clickstreams, log files and more. New insights are found when these data types are analyzed together.

Monitor hundreds of live video feeds from surveillance cameras to target points of interest.

Exploit the 80% data growth in images, video and documents to improve customer satisfaction.

What does Hadoop solve?

Organizations are discovering that important predictions can be made by sorting through and analyzing Big Data. However, since 80% of this data is "unstructured", it must be formatted (or structured) in a way that makes it suitable for data mining and subsequent analysis.

Hadoop is the core platform for structuring Big Data, and it solves the problem of making that data useful for analytics purposes.
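
To give a feel for how that structuring works in practice, the sketch below is the classic Hadoop MapReduce "word count" job: mappers turn raw, unstructured text into (word, 1) key-value pairs, and reducers aggregate them into per-word totals that analytics tools can consume. It is a minimal illustration written against the standard org.apache.hadoop.mapreduce API; the class names and the input/output paths passed on the command line are only placeholders.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: turn each line of raw text into (word, 1) pairs.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: sum the counts emitted for each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);  // local pre-aggregation to cut shuffle traffic
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input directory (placeholder)
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HDFS output directory (placeholder)
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The same map-shuffle-reduce pattern is what lets Hadoop spread this work across a cluster: each mapper parses a slice of the raw data in parallel, and the framework routes all values for a given key to one reducer.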

The Importance of Big Data and What You Can Accomplish

The real issue is not that you are acquiring large amounts of data. It's what you do with the data that counts. The hopeful vision is that organizations will be able to take data from any source, harness relevant data and analyze it to find answers that enable 1) cost reductions, 2) time reductions, 3) new product development and optimized offerings, and 4) smarter business decision making. For instance, by combining big data and high-powered analytics, it is possible to:

Determine root causes of failures, issues and defects in near-real time, potentially saving billions of dollars annually.

Optimize routes for many thousands of package delivery vehicles while they are on the road.

Analyze millions of SKUs to determine prices that maximize profit and clear inventory.

Generate retail coupons at the point of sale based on the customer's current and past purchases.

Send tailored recommendations to mobile devices while customers are in the right area to take advantage of offers.

Recalculate entire risk portfolios in minutes.

Quickly identify customers who matter the most.

Use clickstream analysis and data mining to detect fraudulent behavior.

Challenges

Many organizations are concerned that the amount of amassed data is becoming so large that it is difficult to find the most valuable pieces of information.

What if your data volume gets so large and varied you don't know how to deal with it?

Do you store all your data?

Do you analyze it all?


How can you find out which data points are really important?

How can you use it to your best advantage?

Until recently, organizations have been limited to using subsets of their data, or they were constrained to simplistic analyses because the sheer volumes of data overwhelmed their processing platforms. But what is the point of collecting and storing terabytes of data if you can't analyze it in full context, or if you have to wait hours or days to get results? On the other hand, not all business questions are better answered by bigger data. You now have two choices:

Incorporate massive data volumes in analysis. If the answers you're seeking will be better provided by analyzing all of your data, go for it. High-performance technologies that extract value from massive amounts of data are here today. One approach is to apply high-performance analytics to the massive amounts of data using technologies such as grid computing, in-database processing and in-memory analytics.

Determine upfront which data is relevant. Traditionally, the trend has been to store everything (some call it data hoarding) and only when you query the data do you discover what is relevant. We now have the ability to apply analytics on the front end to determine relevance based on context. This type of analysis determines which data should be included in analytical processes and what can be placed in low-cost storage for later use if needed.

Technologies

A number of recent technology advancements enable organizations to make the most of big data and big data analytics:

Cheap, abundant storage.

Faster processors.

Affordable open source, distributed big data platforms, such as Hadoop.

Parallel processing, clustering, MPP, virtualization, large grid environments, high connectivity and high throughput.

Cloud computing and other flexible resource allocation arrangements.

The goal of all organizations with access to large data collections should be to harness the most relevant data and use it for better decision making.

Three Enormous Problems Big Data Tech Solves

What’s less commonly talked about is why Big Data is such a problem beyond size and computing power. The reasons behind the conversation are the truly interesting part and need to be understood. Here you go… there are three trends driving the discussion, and they should be made painfully clear instead of lost in all the hype:

We’re digitizing everything. This is big data’s volume, and it comes from unlocking hidden data in the common things all around us, things that were known before but weren’t quantified, stored, compared and correlated. Suddenly, there’s enormous value in the patterns of what was recently hidden from our view. Patterns offer understanding and a chance to predict what will happen next. Each of these is important, and together they are remarkably powerful.

There’s no time to intervene. This is big data’s velocity. All of that digital data creates massive historical records, but also rich streams of information that are flowing constantly. When we take the patterns discovered in historical information and compare them to everything happening right now, we can either make better things happen or prevent the worst. This is revenue generating and life saving and all of the other wonderful things we hear about, but only if we have the systems in place to see it happening in the moment and do something about it. We can’t afford enough human watchers to do this, so the development of big data systems is the only way to get to better outcomes when the data gives humans insufficient time to intervene.

Variation creates instability. This is big data’s variety. Data was once defined by what we could store and relate in tables of columns and rows. A world that’s digitized ignores those boundaries and is instead full of both structured and unstructured data. That creates a very big problem for systems that were built upon the old definition, which comprise just about everything around us. Suddenly, there’s data available that can’t be consumed or generated by a database. We either ignore that information or it ends up in places and formats that are unreadable to older systems. Gone is the ability to correlate unstructured information with that vast historical (but highly structured) data. When we can’t analyze and correlate well, we introduce instability into our world. We’re missing the big picture unless we build systems that are flexible and don’t require reprogramming the logic for every unexpected change, and there will be many.

There you have it… the underlying reasons that big data matters and isn’t just hype (though there’s plenty of that). The digitization, the lack of time for intervention and the instability that big data creates lead us to develop whole new ways of managing information that go well beyond Hadoop and distributed computing. It’s why big data presents such an enormous challenge and opportunity for software vendors and their customers, but only if these three challenges are the drivers and not opportunism.

BI vs. Big Data vs. Data Analytics By Example

Business Intelligence (BI) encompasses a variety of tools and methods that can help organizations make better decisions by analyzing “their” data. Therefore, Data Analytics falls under BI. Big Data, if used for the purpose of Analytics, falls under BI as well.


Let’s say I work for the Center for Disease Control and my job is to analyze the data gathered from around the country to improve our response time during flu season. Suppose we want to know about the geographical spread of flu for the last winter (2012). We run some BI reports, and they tell us that the state of New York had the most outbreaks. Knowing that, we might want to better prepare the state for the next winter. These types of queries examine past events, are the most widely used, and fall under the Descriptive Analytics category.
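
As a rough sketch of what such a descriptive report boils down to, the hypothetical snippet below groups outbreak reports by state and picks the state with the highest count; the state codes and counts are invented purely for illustration.

import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class OutbreaksByState {
    public static void main(String[] args) {
        // Hypothetical outbreak reports for last winter; each entry is the state that filed one report.
        List<String> reportedOutbreaks = List.of("NY", "NY", "CA", "NY", "TX", "CA", "NY");

        // Descriptive analytics: count past outbreaks per state.
        Map<String, Long> outbreaksByState = reportedOutbreaks.stream()
                .collect(Collectors.groupingBy(state -> state, Collectors.counting()));

        // Report the state with the most outbreaks.
        outbreaksByState.entrySet().stream()
                .max(Map.Entry.comparingByValue())
                .ifPresent(top -> System.out.println(
                        "Most outbreaks: " + top.getKey() + " (" + top.getValue() + " reports)"));
    }
}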

Now, we just purchased an interactive visualization tool, and I am looking at a map of the United States depicting the concentration of flu in different states for the last winter. I click on a button to display the vaccine distribution, and there it is: I visually detect a direct correlation between the intensity of the flu outbreak and the late shipment of vaccines. I notice that the shipments of vaccine for the state of New York were delayed last year. This gives me a clue to investigate further and determine whether the correlation is causal. This type of analysis falls under Diagnostic Analytics (discovery).
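
One simple way to quantify that visual finding is a correlation coefficient. The sketch below computes the Pearson correlation between vaccine shipment delay and outbreak intensity for a handful of states; the figures are made up for illustration, and a strong positive value would only flag the relationship for the further causal investigation described above.

public class FluShipmentCorrelation {

    // Pearson correlation coefficient between two equal-length series.
    static double pearson(double[] x, double[] y) {
        int n = x.length;
        double sumX = 0, sumY = 0, sumXY = 0, sumX2 = 0, sumY2 = 0;
        for (int i = 0; i < n; i++) {
            sumX += x[i];
            sumY += y[i];
            sumXY += x[i] * y[i];
            sumX2 += x[i] * x[i];
            sumY2 += y[i] * y[i];
        }
        double numerator = n * sumXY - sumX * sumY;
        double denominator = Math.sqrt(n * sumX2 - sumX * sumX) * Math.sqrt(n * sumY2 - sumY * sumY);
        return numerator / denominator;
    }

    public static void main(String[] args) {
        // Hypothetical per-state data: vaccine shipment delay (days) and flu outbreak intensity (cases per 100k).
        double[] shipmentDelayDays = { 2, 5, 14, 3, 9, 21, 1, 7 };
        double[] outbreaksPer100k  = { 11, 18, 52, 13, 30, 75, 9, 24 };

        double r = pearson(shipmentDelayDays, outbreaksPer100k);
        // A value near +1 supports the visual finding; causality still needs further investigation.
        System.out.printf("Correlation between shipment delay and outbreak intensity: %.2f%n", r);
    }
}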

We go to the next phase, which is Predictive Analytics. PA is what most people in the industry refer to as Data Analytics. It gives us the probability of different outcomes, and it is future-oriented. US banks have been using it for things like fraud detection. The process of distilling intelligence is more complex, and it requires techniques like Statistical Modeling.

Back to our example: I hire a Data Scientist to help me create a model and apply the data to that model in order to identify causal relationships and correlations as they relate to the spread of flu for the winter of 2013. Note that we are now talking about the future. I can use my visualization tool to play around with variables such as demand, vaccine production rate, quantity… to weigh the pluses and minuses of different decisions insofar as how to prepare for and tackle the potential problems in the coming months.

The last phase is Prescriptive Analytics, which is to integrate our tried-and-true predictive models into our repeatable processes to yield desired outcomes. An automated risk reduction system based on real-time data received from the sensors in a factory would be a good example of its use.

Finally, here is an example of Big Data.

Suppose it’s December 2013 and it happens to be a bad year for the flu epidemic. A new strain of the virus is wreaking havoc, and a drug company has produced a vaccine that is effective in combating it. The problem is that the company can’t produce doses fast enough to meet the demand, so the Government has to prioritize its shipments. Currently, the Government has to wait a considerable amount of time to gather the data from around the country, analyze it, and take action. The process is slow and inefficient. The contributing factors include: not having computer systems fast enough to gather and store the data as it streams in (velocity), not having computer systems that can accommodate the volume of data pouring in from all of the medical centers in the country (volume), and not having computer systems that can process images such as X-rays (variety).

Big Data technology changed all of that. It solved the velocity-volume-variety problem. We now have computer systems that can handle “Big Data”. The Center for Disease Control may receive the data from hospitals and doctors' offices in real time, and Data Analytics software that sits on top of the Big Data system could generate actionable items that give the Government the agility it needs in times of crisis.