Big Data Analysis using Hadoop on a Eucalyptus Cloud: How secure is our cloud?

PRESENTED BY: ABHISHEK DE, STUDENT, CSE 2ND YEAR, BPPIMT

Big Data Analysis on a Cloud Ecosystem-PATW 2013


DESCRIPTION

My presentation on Big Data for PATW 2013, IET(UK)


PREPARED BY: ABHISHEK DE

Contents:

The Big Data Crisis

Let’s embrace Cloud Computing

Benefits of cloud

Establishing an IaaS using Eucalyptus

A word on Virtualization

Hadoop as a Platform

MapReduce and HDFS

Typical algorithms

Benefits we achieve

How secure is the system?

06-Apr-13


The drifting era: BIG DATA and crisis

• YouTube users upload 48 hours of new video every minute of the day.

• 100 terabytes of data uploaded daily to Facebook.

• Twitter sees roughly 175 million tweets every day, and has more than 465 million accounts.

• Walmart handles more than 1 million customer transactions every hour, and stores more than 2.5 petabytes of data in its databases.


DATA is precious, too precious..

We need Infrastructure, which comes easily as a Service


Solution: Cloud Computing

Conventional computing: Your data gets processed on your own computer.

Cloud computing: You send your data to some other computer. It gets processed there and the result comes back to you.

“Cloud Computing is the use of computing resources (hardware and software) that are delivered as a service over a network (typically the Internet)” (Wikipedia)


Benefits of Cloud Computing:

High reliability.

Highly scalable and fault tolerant.

Reduced Cost: Only pay for what you need.

Efficient management of resources.

Improved Security.

Achieved out of commodity hardware.


Why Eucalyptus?

“Elastic Utility Computing Architecture for Linking Your Programs To Useful Systems”

Eucalyptus is the world's most widely deployed software platform for on-premise (private) Infrastructure as a Service (IaaS) clouds.

It uses existing infrastructure to create a scalable, secure web services layer that abstracts compute, network and storage to offer IaaS.

Eucalyptus can be dynamically scaled up or down depending on application workloads.


Architecture of Eucalyptus:

FRONT END:

• Users log in to the cloud using credentials.

• The user is redirected to the back end of the cloud, i.e., the storage and the resource pool.

BACK END:

• Runs the Node Controller.

• Mounts images as virtual machines (instances) using Xen or KVM.

• Hosts the resource pool.

[Diagram: user1 logs in at the FRONT END and is handed a session on a node in the BACK END (user1@nc1:)]


XEN: Virtualize your resources

Xen is the underlying technology used by Eucalyptus. The Xen hypervisor allows several guest operating systems to be executed on the same computer hardware concurrently.

Xen partitions a single physical machine into multiple virtual machines, to provide server consolidation and utility computing. Existing applications and binaries run unmodified.

The hypervisor controls the MMU, CPU scheduling, and interrupt controller, presenting a virtual machine to guests.


HADOOP: Solution to BIG DATA

How long does it take to read 1 TB from a commodity hard disk?

Roughly 4 hours.

With Hadoop, a large cluster working in parallel takes around:

62 seconds…
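The two figures above can be sanity-checked with a little arithmetic. The sketch below is a back-of-the-envelope estimate only: the 70 MB/s sustained disk throughput is an assumed value (it is not stated on the slide), reads are assumed to parallelize perfectly, and network, sorting and coordination overhead are ignored. The function names are made up for illustration.

```python
import math

TERABYTE = 10**12  # bytes (decimal terabyte)

def read_seconds(total_bytes, mb_per_second, disks=1):
    """Time to read total_bytes when `disks` disks stream in parallel."""
    return total_bytes / (mb_per_second * 10**6 * disks)

def disks_needed(total_bytes, mb_per_second, target_seconds):
    """Smallest number of parallel disks that meets the target time."""
    return math.ceil(read_seconds(total_bytes, mb_per_second) / target_seconds)

# Assumed throughput: 70 MB/s per commodity disk (illustrative value).
single_disk_hours = read_seconds(TERABYTE, 70) / 3600
parallel_disks = disks_needed(TERABYTE, 70, 62)

print(f"one disk: about {single_disk_hours:.1f} hours")
print(f"disks needed for 62 s: {parallel_disks}")
```

Under these assumptions a single disk needs about 4 hours, and a couple of hundred disks reading in parallel are enough to reach the one-minute range, which is why spreading data over many cheap machines pays off.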


Birth of HADOOP: Open-source alternative to GFS

Pre-2004: Cutting and Cafarella develop open-source projects for web-scale indexing, crawling and search.

2004: Jeffrey Dean and Sanjay Ghemawat introduce the MapReduce model used internally at Google.

2006: Hadoop becomes an official Apache project; Cutting joins Yahoo!, and Yahoo! adopts Hadoop.


HDFS: Hadoop Distributed File System

Files split into 128MB (or 64MB) blocks

Blocks replicated across several datanodes (usually 3)

Single namenode stores metadata (file names, block locations, etc.)

Optimized for large files, sequential reads

Clients read from the closest replica available (note: locality of reference).

If the replication for a block drops below target, it is automatically re-replicated.

[Diagram: the Namenode tracks a file made of blocks 1 to 4; four Datanodes hold blocks {1, 2, 4}, {2, 1, 3}, {1, 4, 3} and {3, 2, 4}, so each block has three replicas]
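The block and replica bookkeeping described on this slide can be sketched in a few lines. This is a simplified illustration, not how HDFS is implemented: real HDFS placement is rack-aware, whereas the round-robin placement and all function names below are assumptions chosen to keep the example short.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB blocks, as on the slide
REPLICATION = 3                 # usual replication factor

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the length of each block a file of file_size bytes occupies."""
    full, rest = divmod(file_size, block_size)
    return [block_size] * full + ([rest] if rest else [])

def place_replicas(num_blocks, datanodes, replication=REPLICATION):
    """Toy namenode metadata: block index -> datanodes holding a replica.

    Round-robin placement is an assumption made for illustration only.
    """
    if replication > len(datanodes):
        raise ValueError("not enough datanodes for the replication factor")
    return {
        block: [datanodes[(block + r) % len(datanodes)] for r in range(replication)]
        for block in range(num_blocks)
    }

def under_replicated(placement, live_nodes, replication=REPLICATION):
    """Blocks whose live replica count has dropped below the target."""
    return [
        block for block, nodes in placement.items()
        if sum(1 for n in nodes if n in live_nodes) < replication
    ]

blocks = split_into_blocks(300 * 1024 * 1024)          # a 300 MB file
placement = place_replicas(len(blocks), ["dn0", "dn1", "dn2", "dn3", "dn4"])
needs_repair = under_replicated(placement, {"dn1", "dn2", "dn3", "dn4"})  # dn0 is down
print(len(blocks), placement, needs_repair)
```

A 300 MB file occupies two full blocks and one partial block. When a datanode disappears, the namenode only has to consult its metadata to find the blocks that need another copy.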


Data Flow

[Diagram components: Web Servers, Scribe Servers, Network Storage, Hadoop Cluster, Oracle RAC, MySQL]


HADOOP and MapReduce:

Input

Map

Shuffle/Sort

Reduce

Output


Word Count: A typical Example
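The original slide showed word count as a picture. As a stand-in, here is a minimal single-process sketch of the Input, Map, Shuffle/Sort, Reduce, Output pipeline. It only imitates the data flow in plain Python and does not use the Hadoop API; the function names are illustrative assumptions.

```python
from collections import defaultdict

def map_words(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1

def shuffle(pairs):
    """Shuffle/Sort: group values by key and order the keys."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))

def reduce_counts(word, counts):
    """Reduce: sum the ones emitted for a single word."""
    return word, sum(counts)

def word_count(lines):
    """Run the three phases in sequence over an iterable of lines."""
    pairs = [pair for line in lines for pair in map_words(line)]
    return dict(reduce_counts(word, counts) for word, counts in shuffle(pairs).items())

print(word_count(["the cat sat", "the cat"]))
```

On a real cluster each mapper would process one input split and each reducer one partition of the keys; the per-key logic stays the same.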


Implementation: Hardware

Move code to data (local computation)

Allow programs to scale transparently w.r.t. size of input

Abstract away fault tolerance, synchronization, etc.


HADOOP in action!

SOCIAL NETWORKING ANALYSIS

PAGE RANKING ANALYSIS

ANALYTICS ENGINE WITH MAP/REDUCE

IMAGE PROCESSING


Social Networking Analysis:

Problem: recommend new friends (friend-of-a-friend, FOAF)

Map task:

– U (target user) is fixed and its friends list copied to all cluster nodes (“copy join”); each cluster node stores part of the social graph

– In: (X, <friendsX>), i.e. the local data for the cluster node

– Out:

if (U, X) are friends => (U, <friendsX\friendsU>), i.e. the users who are friends of X but not already friends of U

nil otherwise

Reduce task:

– In: (U, <<friendsA\friendsU>,<friendsB\friendsU>, … >), i.e. the FOAF lists for all users A, B, etc. who are friends with U

– Out: (U, <(X1, N1), (X2, N2), …>), where each X is a FOAF for U, and N is its total number of occurrences in all FOAF lists (sort/rank the result!)
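The map and reduce tasks above can be sketched as plain Python functions. This is a single-machine illustration under stated assumptions: the social graph is a small in-memory dictionary rather than data partitioned across cluster nodes, the function names are hypothetical, and the map step also drops U itself from the candidates (a detail the slide leaves implicit).

```python
from collections import Counter

def map_foaf(target, target_friends, x, friends_of_x):
    """Map: for a friend X of the target U, emit friendsX minus friendsU."""
    if x in target_friends:
        # Also exclude the target itself, which always appears in friendsX.
        return target, friends_of_x - target_friends - {target}
    return None  # "nil otherwise"

def reduce_foaf(target, foaf_lists):
    """Reduce: count each candidate across all FOAF lists and rank them."""
    counts = Counter()
    for candidates in foaf_lists:
        counts.update(candidates)
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return target, ranked

def recommend(target, graph):
    """Run map over every (X, friendsX) record, then a single reduce for U."""
    target_friends = graph[target]
    mapped = (map_foaf(target, target_friends, x, fx) for x, fx in graph.items())
    foaf_lists = [out[1] for out in mapped if out is not None]
    return reduce_foaf(target, foaf_lists)

graph = {
    "u": {"a", "b"},
    "a": {"u", "c", "d"},
    "b": {"u", "c"},
    "c": {"a", "b"},
    "d": {"a"},
}
print(recommend("u", graph))
```

In this small graph, c is reachable through both of u's friends and d through only one, so c ranks first.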


Pros and Cons

Batch, offline jobs

Write-once, read-many across full data set

Usually, though not always, simple computations

I/O bound by disk/network bandwidth

What it’s not:

High-performance parallel computing, e.g. MPI

Low-latency random access relational database

Always the right solution


Cloud Security: Threats unveiled

XML SIGNATURE ATTACK:

The original SOAP body element is moved to a newly added bogus wrapper element in the SOAP security header. Note that the moved body is still referenced by the signature using its identifier attribute Id="body". The signature is still cryptographically valid, as the body element in question has not been modified (but simply relocated). Subsequently, in order to make the SOAP message XML schema compliant, the attacker changes the identifier of the SOAP body that now sits in the original position (in this example he uses Id="attack"). The filling of the empty SOAP body with bogus content can now begin, as any of the operations defined by the attacker can be effectively executed due to the successful signature verification.
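The mismatch that makes this attack work, and the check that closes it, can be shown with a toy model. This is a minimal sketch and not real WS-Security: a plain SHA-256 digest stands in for the cryptographic signature, the message has no namespaces, and the element names (Envelope, Security, Wrapper, Op) and all function names are assumptions made for illustration.

```python
import hashlib
import xml.etree.ElementTree as ET

def digest(elem):
    """Toy stand-in for a signature over one element (not real XML-DSig)."""
    return hashlib.sha256(ET.tostring(elem)).hexdigest()

def find_by_id(root, ident):
    """Locate an element anywhere in the message by its Id attribute."""
    return next((e for e in root.iter() if e.get("Id") == ident), None)

def sign(root):
    """Record which Id was signed and the digest of that element."""
    body = root.find("Body")
    sig = root.find("Header/Security/Signature")
    sig.set("ref", body.get("Id"))
    sig.set("digest", digest(body))

def naive_verify(root):
    """Flawed check: verifies whatever element carries the signed Id."""
    sig = root.find("Header/Security/Signature")
    referenced = find_by_id(root, sig.get("ref"))
    return referenced is not None and digest(referenced) == sig.get("digest")

def strict_verify(root):
    """Also require that the verified element is the Body that gets executed."""
    sig = root.find("Header/Security/Signature")
    referenced = find_by_id(root, sig.get("ref"))
    return naive_verify(root) and referenced is root.find("Body")

def wrap(root, new_operation):
    """Relocate the signed body into the header and add a bogus body."""
    original = root.find("Body")
    root.remove(original)
    ET.SubElement(root.find("Header/Security"), "Wrapper").append(original)
    bogus = ET.SubElement(root, "Body", {"Id": "attack"})
    ET.SubElement(bogus, "Op").text = new_operation
    return root

msg = ET.fromstring(
    "<Envelope><Header><Security><Signature/></Security></Header>"
    '<Body Id="body"><Op>DescribeInstances</Op></Body></Envelope>'
)
sign(msg)
wrap(msg, "TerminateInstances")
print(naive_verify(msg), strict_verify(msg), msg.find("Body/Op").text)
```

The naive verifier still accepts the wrapped message because the element it checks is untouched, while the application would act on the new body. Tying the verified element to the element that is actually processed is the essence of the countermeasure.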


Script Injection Attack

• Targets only users of the AWS management console.

• Exploits the credentials shared between the Amazon shop interface and AWS.

The first attack exploits the GET parameters in the download link that users follow to download their X.509 certificates issued by Amazon. However, the preconditions for the attack are rather demanding: the injected script must use UTF-7 encoding to bypass the server logic that encodes standard HTML characters, and the attack relies on features of specific IE versions.

The second script injection attack is a persistent cross-site scripting attack that exploits the login session initiated with AWS the first time a user logs into the Amazon shop interface.


Who uses it? Applications and Innovations

Projects under Hadoop:

HBase

ZooKeeper

Pig

Zombie

Hive

Sqoop




That’s the end... but the beginning of a new horizon.

Special thanks to the entire team that helped me in this endeavor.

ALL QUERIES, PLEASE CONTACT ME AT: [email protected]

QUESTIONS?