Hadoop Programming - 2. Hadoop Architecture for Data Science

Some Questions

The foundation of any Big Data implementation is data storage design, but...

Why is a Big Data approach appropriate for an organization and its business?

How can Hadoop help us develop Big Data applications?

Big Data Systems

A “Big Data System” is a distributed data storage architecture that relies on massive parallelization.

Let's see the differences among the possible storage solutions!

Key-Store Systems

Key-store systems don't use schemas!

Key stores are a good choice when you have no idea what the structure of the data is, or when you have to implement your own low-level queries (e.g., image processing and anything not easily expressed in SQL). Even if the data has structure, the definition of the relationships within the recorded data, and of any schema, is left to the user!
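As a rough illustration of "schema left to the user", the sketch below uses a plain Python dictionary as a stand-in for a key store. The names `put`, `get`, `parse_log`, and `parse_user`, as well as the sample records, are hypothetical and do not belong to any real key-store product.

```python
import json

# Toy in-memory key store: keys map to opaque bytes, with no schema at all.
store = {}

def put(key, value):
    """Store raw bytes under a key; the store never inspects the value."""
    store[key] = value

def get(key):
    """Return the raw bytes stored under a key."""
    return store[key]

# Heterogeneous records (a log line and a JSON document) live side by side.
put("log:1", b"2014-01-01 12:00:00 ERROR disk full")
put("user:42", json.dumps({"name": "Ada", "city": "Rome"}).encode("utf-8"))

def parse_log(raw):
    """Interpretation chosen by the user: split a log line into fields."""
    date, time, level, message = raw.decode("utf-8").split(" ", 3)
    return {"date": date, "time": time, "level": level, "message": message}

def parse_user(raw):
    """Interpretation chosen by the user: decode a JSON document."""
    return json.loads(raw.decode("utf-8"))

log_record = parse_log(get("log:1"))
user_record = parse_user(get("user:42"))
```

The store itself only moves bytes around; every notion of "field" or "relationship" lives in the user's parsing functions.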

Columnar Databases

Columnar databases split each record across multiple column files with the same index.

In a columnar database, rows are decomposed into their individual fields, which are then stored in individual column files, one field per file!
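The decomposition described above can be sketched in a few lines of Python. This is only an in-memory illustration: Python lists stand in for the column files, the sample rows are made up, and the helpers `to_columns` and `read_record` are hypothetical names. It also assumes every record has the same fields.

```python
# Sample records, as a row-oriented system would see them.
rows = [
    {"id": 1, "name": "Ada", "amount": 30},
    {"id": 2, "name": "Bob", "amount": 12},
    {"id": 3, "name": "Eve", "amount": 58},
]

def to_columns(records):
    """Decompose rows into one list per field; position i belongs to record i."""
    columns = {}
    for record in records:
        for field, value in record.items():
            columns.setdefault(field, []).append(value)
    return columns

def read_record(columns, index):
    """Rebuild a full record by reading the same index from every column."""
    return {field: values[index] for field, values in columns.items()}

columns = to_columns(rows)

# An aggregate over one field only needs to touch that field's column.
total_amount = sum(columns["amount"])

# The shared index ties the separate columns back into one record.
second_record = read_record(columns, 1)
```

Summing `amount` never reads the `name` values, which is the kind of saving a column layout aims for.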

RDBMS Databases

RDBMS databases store complete records as individually distinguishable rows!

In an RDBMS, each row is a unique and distinguishable entity.

The schema defines the contents of each row, and rows are stored sequentially in a file.
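To make the role of the schema concrete, here is a minimal sketch using SQLite through Python's standard `sqlite3` module. The `accounts` table and its contents are invented for the example, and SQLite is used only because it is easy to run, not because the slides refer to it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# The schema states exactly what every row contains.
conn.execute(
    "CREATE TABLE accounts ("
    "id INTEGER PRIMARY KEY, owner TEXT NOT NULL, balance INTEGER NOT NULL)"
)
conn.executemany(
    "INSERT INTO accounts (id, owner, balance) VALUES (?, ?, ?)",
    [(1, "Ada", 100), (2, "Bob", 50)],
)
conn.commit()

# The primary key keeps rows distinguishable: a second row with id 1 is rejected.
try:
    conn.execute("INSERT INTO accounts (id, owner, balance) VALUES (1, 'Eve', 10)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

# A single row can be addressed by its key and updated inside a transaction.
with conn:
    conn.execute("UPDATE accounts SET balance = balance - 30 WHERE id = ?", (1,))

balance = conn.execute("SELECT balance FROM accounts WHERE id = 1").fetchone()[0]
row_count = conn.execute("SELECT COUNT(*) FROM accounts").fetchone()[0]
```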

How to decide which storage is the best fit?

Key stores will work well with schema-free, heterogeneous datasets (web pages, system logs, images,...) where the individual records are large; the structure of the data fields, and the interpretation of their relationships, depend on the user/implementor.

Columnar databases are preferable when the data is easily divided into individual records that don’t need to cross-reference each other, and when the contents are relatively small; they can optimize queries by picking out and processing data from a subset of the columns in each record.

RDBMS databases work best with data that can be subdivided across multiple tables.

RDBMSes are good at maintaining integrity and concurrency; if you need to update a row, they’re the default choice!

BUT...

RDBMSes are not the best choice if:

- your data doesn’t change after it is created;

- individual records don’t have cross-references;

- your schemas store large blobs.

Most importantly:

RDBMSes don't scale well with Big Data, and SQL queries may take a very long time to respond!

Being Quick and Data Driven

Rapid decision making is mission critical in dynamic environments: twelve-month product-release cycles are a relic of the past!

Organizations need to move to a cycle of continuous delivery and improvement, adopting Agile Development, supported by Big Data Analytics, in order to increase their pace of innovation!

Continuous improvement requires continuous experimentation, along with a process for quickly responding to bits of information!

Integrating different data sources into a single system that is accessible to everyone in the organization will support the overall drive for innovation!

Data Science VS Business Analytics

The term Data Science, originally introduced by Peter Naur in the 1960s, relates to: “an emerging area of work concerned with the collection, preparation, analysis, visualization, management, and preservation of large collections of information.”

Unfortunately, the term is often confused with traditional Business Analytics.

In reality, the two disciplines are quite different!

Business Analytics searches for patterns in existing business operations, to improve them.

The goal of Data Science, instead, is to extract meaning from data!

Data Science follows a multidisciplinary approach, based on math, statistical analysis, pattern recognition, machine learning, high-performance computing, data warehousing, and more. Information is analyzed to look for trends in emerging business possibilities!

Data Science fueled by Hadoop Map Reduce

By using Hadoop’s MapReduce programming model, not only is it possible to solve data science problems, but the creation and deployment of enterprise applications are also significantly simplified!

Corporations are using Hadoop for strategic decision making, and they are starting to use their data more wisely than before!

As a result, data science has entered the business world!
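For readers who have not met the MapReduce programming model yet, the classic word count gives the flavor. The sketch below is plain, single-process Python, not Hadoop's Java API: the function names `map_phase`, `shuffle`, and `reduce_phase` and the two input lines are assumptions made for illustration, and a real job would run the same three steps distributed across a cluster.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1

def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(word, counts):
    """Reduce: collapse the list of values for one key into a single result."""
    return word, sum(counts)

lines = ["big data needs hadoop", "hadoop stores big data"]

# Each line can be mapped independently, which is what makes the model parallel.
mapped = [pair for line in lines for pair in map_phase(line)]

word_counts = dict(
    reduce_phase(word, counts) for word, counts in shuffle(mapped).items()
)
```

The programmer writes only the map and reduce functions; grouping by key, distribution, and fault handling are the framework's job.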

Business problems solved with Hadoop

Examples of business problems solved using Hadoop:

- Credit fraud detection;

- Social media marketing analysis;

- Shopping pattern analysis for retail product placement;

- Traffic pattern recognition for urban development;

- Content optimization and engagement;

- Network analytics and mediation;

- Large data transformation: The New York Times was able to convert 4 TB of scanned articles to 1.5 TB of PDF documents in just 24 hours!

Developing Enterprise Apps with Hadoop

Meeting Big Data challenges requires rethinking the way applications are built with Hadoop; the traditional approach, based on storing data in relational databases, will not work with Hadoop, because:

- transactional database access is not supported;

- real-time access is feasible only for part of the data stored on the cluster;

- the massive data storage capabilities of Hadoop make it possible to store versions of data sets, as opposed to the traditional approach of overwriting data.

The Hadoop Distributed File System (HDFS), implemented by Hadoop, is a “write-once” filesystem.

This means that:

- in HDFS, new data does not overwrite existing data;

- instead, updated data is written as a new version of the data set, alongside the earlier ones!
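The write-once idea can be pictured with a small sketch. It uses the local filesystem as a stand-in and does not call any HDFS API; the `v0001.txt` naming scheme, the helpers `write_version` and `read_latest`, and the sample data are all hypothetical choices made for the example.

```python
import os
import tempfile

def write_version(dataset_dir, payload):
    """Write payload as the next version file; earlier versions stay untouched."""
    os.makedirs(dataset_dir, exist_ok=True)
    existing = [name for name in os.listdir(dataset_dir) if name.startswith("v")]
    path = os.path.join(dataset_dir, "v%04d.txt" % (len(existing) + 1))
    # Mode "x" refuses to open an existing file, mimicking write-once behavior.
    with open(path, "x", encoding="utf-8") as handle:
        handle.write(payload)
    return path

def read_latest(dataset_dir):
    """Read the highest-numbered version of the data set."""
    latest = sorted(os.listdir(dataset_dir))[-1]
    with open(os.path.join(dataset_dir, latest), encoding="utf-8") as handle:
        return handle.read()

# Temporary directory standing in for a data set location on the cluster.
dataset = os.path.join(tempfile.mkdtemp(), "customers")

first = write_version(dataset, "ada,rome\n")
second = write_version(dataset, "ada,milan\n")  # an "update" becomes a new file
```

Readers that want the current state call `read_latest`, while the first version remains available for auditing or reprocessing.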

A typical Hadoop Enterprise Application

source: “Professional: Hadoop Solutions”, by B. Lublinsky, Wrox

Hadoop Ecosystem

- HDFS, the Hadoop Distributed File System, also the foundation for other tools (HBase,...);

- HBase, a column-oriented NoSQL database;

- MapReduce Framework;

- Zookeeper, Hadoop’s distributed coordination service;

- Oozie, a scalable workflow system for M/R jobs;

- Pig, an abstraction over the complexity of MapReduce programming;

- Hive, an SQL-like, high-level language used to run queries on data stored in Hadoop.

Hadoop Enterprise Integration Frameworks

- Sqoop, a connectivity tool for moving data between relational databases, data warehouses, and Hadoop;

- Flume, a distributed, reliable, and highly available service for efficiently collecting, aggregating, and moving large amounts of data from individual machines to HDFS;

- Mahout, a machine-learning and data-mining library that provides MapReduce implementations for popular algorithms used for clustering, regression, and statistical modeling.

The Hadoop Ecosystem

source: “Professional: Hadoop Solutions”, by B. Lublinsky, Wrox

Hadoop Programming References:

- https://leanpub.com/the-big-data-approach-to-innovation

- “Hadoop: The Definitive Guide”, by T. White, O'Reilly

- “Professional: Hadoop Solutions”, by B. Lublinsky, Wrox

Hadoop Programming Course

For more information about Hadoop Programming, please browse the link below:

http://www.startithub.com/blog/category/hadoop-programming
