Workshop on Parallel, Cluster and
Cloud Computing on Multi-core & GPU
(PCCCMG - 2015)
Workshop conducted by Computer Society of India in
association with Dept. of CSE, VNIT and
Persistent Systems Ltd, Nagpur, 4th – 6th Sep '15
Big-Data Cluster Computing
Advanced tools & technologies
Jagadeesan A S, Software Engineer
Persistent Systems Limited
www.github.com/jagadeesanas2
www.linkedin.com/in/jagadeesanas2
Content
• Overview of Big Data
• Data clustering concepts
• Clustering vs Classification
• Data Journey
• Advanced tools and technologies
• Apache Hadoop
• Apache Spark
• Future of analytics
• Demo - Spark RDD in IntelliJ IDEA
Big-Data is similar to Small-Data, but bigger in size and complexity.
What is Big-Data ?
Definition from Wikipedia:
Big data is the term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications.
Characterization of Big Data: the 4 V's
Volume, Velocity, Variety, Veracity
Now the big question: Why do we need Big Data?
What do we do with all that data?
And the answer is very clear…!
What is a Cluster? A group of the same or similar elements gathered or occurring closely together.
Clustering is the key to the Big Data problem
• Not feasible to "label" a large collection of objects
• No prior knowledge of the number and nature of groups (clusters) in the data
• Clusters may evolve over time
• Clustering provides efficient browsing, search, recommendation and organization of data
Difference between Clustering & Classification
Clustering data on
Clustering videos on
Clustering Algorithms
Hundreds of clustering algorithms are available, for example:
• K-Means
• Kernel K-Means
• Nearest neighbour
• Gaussian mixture
• Fuzzy clustering
• OPTICS algorithm
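Of these, K-Means is the easiest to sketch. A minimal plain-Python version, assuming made-up sample points and hand-picked initial centroids (real uses would pick initial centroids more carefully, e.g. k-means++):

```python
def kmeans(points, centroids, iterations=10):
    """Minimal K-Means: assign each point to its nearest centroid,
    then recompute each centroid as the mean of its assigned points."""
    centroids = list(centroids)
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            # index of the centroid with the smallest squared distance to p
            nearest = min(range(len(centroids)),
                          key=lambda i: sum((a - b) ** 2
                                            for a, b in zip(p, centroids[i])))
            clusters[nearest].append(p)
        for i, members in enumerate(clusters):
            if members:  # keep an empty cluster's centroid unchanged
                centroids[i] = tuple(sum(dim) / len(members)
                                     for dim in zip(*members))
    return centroids

# Two obvious groups of points, and two deliberately bad starting centroids
points = [(1.0, 1.0), (1.2, 0.8), (8.0, 8.0), (8.2, 7.9)]
centroids = kmeans(points, [(0.0, 0.0), (10.0, 10.0)])
print(sorted(centroids))  # converges to the two group means
```

Even from poor starting positions the centroids settle on the two group means, (1.1, 0.9) and (8.1, 7.95), after one pass here.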
Data Journey
Advanced tools &
technologies
Large-Scale Data Analytics
MapReduce computing paradigm vs. Traditional database systems
Many enterprises have turned to Hadoop, especially for applications generating big data: Web applications, social networks, scientific applications.
APACHE HADOOP (Disk-Based Computing): an open-source software framework written in Java for distributed storage and distributed processing
Design Principles of Hadoop
• Need to process big data
• Need to parallelize computation across thousands of nodes
• Commodity hardware: a large number of low-end, cheap machines working in parallel to solve a computing problem, rather than a small number of high-end, expensive machines
Hadoop cluster architecture
A Hadoop cluster can be divided into two abstract entities:
MapReduce engine + distributed file system (HDFS) = Hadoop
What is Spark?
Why Spark? How to configure Spark?
APACHE SPARK: an open-source cluster computing framework
APACHE SPARK (Memory-Based Computing): an open-source software framework, written primarily in Scala, for large-scale distributed data processing
• Fast cluster computing system for large-scale data processing compatible with Apache Hadoop
• Improves efficiency through:
• In-memory computing primitives
• General computation graphs
• Improves usability through:
• Rich APIs in Java, Scala, Python
• Interactive shell
Up to 100× faster
Often 2-10× less code
Spark Overview
Spark shell
• Interactive shell for learning or data exploration
• Python or Scala
• Provides a preconfigured Spark context called sc
Spark applications
• For large-scale data processing
• Python, Java, Scala and R
• Every Spark application requires a SparkContext; it is the main entry point to the Spark API
Scala interactive shell | Python interactive shell
Spark Overview
Resilient distributed datasets (RDDs)
• Immutable collections of objects spread across a cluster
• Built through parallel transformations (map, filter, etc.)
• Automatically rebuilt on failure
• Controllable persistence (e.g. caching in RAM) for reuse
• Shared variables that can be used in parallel operations
Work with distributed collections as we would with local ones
Resilient Distributed Datasets (RDDs)
Two types of RDD operation
• Transformation – defines a new RDD based on the current one. Examples: filter, map, flatMap
• Action – returns a value to the driver. Examples: count, take(n), reduce
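The key point is that transformations are lazy: Spark records the lineage but computes nothing until an action asks for a result. A plain-Python analogy using generators (this illustrates the lazy/eager split only, not the Spark API itself):

```python
nums = [1, 2, 3, 4, 5]

# "Transformations": build a lazy pipeline; no work happens yet
squares = (n * n for n in nums)             # analogous to rdd.map(...)
evens = (n for n in squares if n % 2 == 0)  # analogous to rdd.filter(...)

# "Action": consuming the pipeline finally executes it
total = sum(evens)  # analogous to rdd.reduce(...)
print(total)        # 4 + 16 = 20
```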
Resilient Distributed Datasets (RDDs)
File: movie.txt

I have never seen the horror movies.
I never hope to see one;
But I can tell you, anyhow,
I had rather see than be one.

RDD: mydata

I have never seen the horror movies.
I never hope to see one;
But I can tell you, anyhow,
I had rather see than be one.
Resilient Distributed Datasets (RDDs): map and filter transformations

I have never seen the horror movies.
I never hope to see one;
But I can tell you, anyhow,
I had rather see than be one.

map(lambda line: line.upper())

I HAVE NEVER SEEN THE HORROR MOVIES.
I NEVER HOPE TO SEE ONE;
BUT I CAN TELL YOU, ANYHOW,
I HAD RATHER SEE THAN BE ONE.

filter(lambda line: line.startswith('I'))

I HAVE NEVER SEEN THE HORROR MOVIES.
I NEVER HOPE TO SEE ONE;
I HAD RATHER SEE THAN BE ONE.

Scala equivalents: map(line => line.toUpperCase()), filter(line => line.startsWith("I"))
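The same lambdas can be tried without a cluster, since map and filter on an RDD have the same per-element semantics as their Python counterparts. A local stand-in with movie.txt's lines inlined (this shows the semantics only, not distributed execution):

```python
# Contents of movie.txt, inlined so the example is self-contained
lines = [
    "I have never seen the horror movies.",
    "I never hope to see one;",
    "But I can tell you, anyhow,",
    "I had rather see than be one.",
]

# Same pipeline as the RDD example, applied to a local list
upper = [line.upper() for line in lines]                     # map step
starts_i = [line for line in upper if line.startswith("I")]  # filter step

for line in starts_i:
    print(line)  # three lines survive; "BUT I CAN..." is filtered out
```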
Spark Stack
• Spark SQL: SQL and structured data processing
• Spark Streaming: stream processing of live data streams
• MLlib: machine learning algorithms
• GraphX: graph processing
Why Spark?
• Core engine with SQL, streaming, machine learning and graph processing modules
• Can run today's most advanced algorithms
• Alternative to MapReduce for certain applications
• APIs in Java, Scala and Python
• Interactive shells in Scala and Python
• Runs on YARN, Mesos and standalone
Spark's major use cases over Hadoop
• Iterative algorithms in machine learning
• Interactive data mining and data processing
• Spark is a fully Apache Hive-compatible data warehousing system that can run up to 100x faster than Hive
• Stream processing: log processing and fraud detection in live streams for alerts, aggregates and analysis
• Sensor data processing: where data is fetched and joined from multiple sources, an in-memory dataset is really helpful, as it is easy and fast to process
MapReduce Example: Word Count
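The word-count pattern can be sketched locally: the map phase emits a (word, 1) pair per word, and the reduce phase sums the counts per key (a plain Python dict stands in for the shuffle here):

```python
from collections import defaultdict

text = ["to be or not to be"]  # made-up input line

# Map phase: emit (word, 1) for every word in every line
pairs = [(word, 1) for line in text for word in line.split()]

# Shuffle + reduce phase: sum the counts for each word
counts = defaultdict(int)
for word, n in pairs:
    counts[word] += n

print(dict(counts))  # → {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```

In Spark itself the same pipeline would be roughly `sc.textFile(path).flatMap(lambda l: l.split()).map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)`.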
Example: PageRank
A way of analyzing websites based on their link relationships
• Good example of a more complex algorithm• Multiple stages of map & reduce• Benefits from Spark’s in-memory caching• Multiple iterations over the same data
Basic idea: give pages ranks (scores) based on the links to them
• Links from many pages → high rank
• A link from a high-rank page → high rank
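That idea fits in a few lines. A minimal sketch over a hypothetical three-page link graph, with the conventional damping factor of 0.85 (the graph and page names are made up for illustration):

```python
def pagerank(links, iterations=10, d=0.85):
    """links maps each page to the pages it links to; returns a rank per page.
    Each iteration, a page splits its rank evenly among its outgoing links."""
    ranks = {page: 1.0 for page in links}
    for _ in range(iterations):
        contribs = {page: 0.0 for page in links}
        for page, outs in links.items():
            for out in outs:
                contribs[out] += ranks[page] / len(outs)
        ranks = {page: (1 - d) + d * contribs[page] for page in links}
    return ranks

# Hypothetical graph: "c" is linked to by both "a" and "b"
links = {"a": ["c"], "b": ["c"], "c": ["a", "b"]}
ranks = pagerank(links)
print(max(ranks, key=ranks.get))  # "c": linked from many pages, so highest rank
```

Each iteration is a map (emit contributions along links) followed by a reduce (sum contributions per page), which is exactly why the algorithm benefits from Spark's in-memory caching across iterations.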
PageRank Performance
[Chart: iteration time (s), Hadoop vs Spark. With 30 machines: Hadoop 171 s, Spark 23 s. With 60 machines: Hadoop 80 s, Spark 14 s.]
NOTE: lower iteration time denotes higher performance
Other Iterative Algorithms
[Chart: time per iteration (s). Logistic Regression: Hadoop 110 s, Spark 0.96 s. K-Means Clustering: Hadoop 155 s, Spark 4.1 s.]
NOTE: lower iteration time denotes higher performance
Spark Installation (for the end-user side): download a Spark distribution from https://spark.apache.org/downloads.html, pre-built for Hadoop 2.4 or later.
Spark Installation (for the developer side)
Clone Spark from the Apache GitHub repository: https://github.com/apache/spark
Spark Installation (continued)
Build the source code using Maven with a Hadoop profile:
<SPARK_HOME># build/mvn -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0 -DskipTests clean package
How to run Spark? (Standalone mode)
Once the build has completed, go to the bin directory inside the Spark home directory in a terminal and invoke the Spark shell:
<SPARK_HOME>/bin#./spark-shell
To start all of Spark's master and slave nodes, run the following from the sbin directory inside the Spark home directory:
<SPARK_HOME>/sbin#./start-all.sh
Spark Master web UI (browser view):
localhost:8080
To stop all of Spark's master and slave nodes, run the following from the sbin directory inside the Spark home directory:
<SPARK_HOME>/sbin#./stop-all.sh
Future of analytics
Analytics in the Cloud
https://www.youtube.com/watch?v=JfqJTQnVZvA
• IBM is making Spark available as a cloud service on its Bluemix cloud platform.
• IBM has committed 3,500 researchers and developers to work on Spark-related projects at more than a dozen labs worldwide.
Demo - Spark RDD in IntelliJ IDEA