My presentation on Big Data for PATW 2013, IET(UK)
Big Data Analysis using Hadoop on a Eucalyptus Cloud
How secure is our cloud?
PRESENTED BY: ABHISHEK DE
STUDENT, CSE 2ND YEAR, BPPIMT
PREPARED BY: ABHISHEK DE
Contents:
The Big Data Crisis
Let’s embrace Cloud Computing
Benefits of cloud
Establishing an IaaS using Eucalyptus
A word on Virtualization
Hadoop as a Platform
MapReduce and HDFS
Typical algorithms
Benefits we achieve
How secure is the system?
06-Apr-13
The drifting era: BIG DATA and crisis
• YouTube users upload 48 hours of new video every minute of the day.
• 100 terabytes of data are uploaded daily to Facebook.
• Twitter sees roughly 175 million tweets every day, and has more than 465 million accounts.
• Walmart handles more than 1 million customer transactions every hour, and maintains databases holding more than 2.5 petabytes of data.
DATA is precious, too precious..
We need Infrastructure, which comes easily as a Service
Solution: Cloud Computing
Conventional computing: Your data gets processed on your own computer.
Cloud computing: You send your data to some other computer. It gets processed there and the result comes back to you.
“Cloud Computing is the use of computing resources (hardware and software) that are delivered as a service over a network (typically the Internet)” -- WIKIPEDIA
Benefits of Cloud Computing:
High reliability.
Highly scalable and fault tolerant.
Reduced Cost: Only pay for what you need.
Efficient management of resources.
Improved Security.
Achieved out of commodity hardware.
Why Eucalyptus?
“Elastic Utility Computing Architecture for Linking Your Programs To Useful Systems”
Eucalyptus is the world's most widely deployed software platform for on-premise (private) Infrastructure as a Service (IaaS) clouds.
It uses existing infrastructure to create a scalable, secure web services layer that abstracts compute, network and storage to offer IaaS.
Eucalyptus can be dynamically scaled up or down depending on application workloads.
Architecture of Eucalyptus:
FRONT END:
• Users log in to the cloud using credentials.
• The user is redirected to the back end of the cloud, i.e., the storage and the resource pool.
BACK END:
• Runs the Node Controller.
• Mounts images as virtual machines (instances) using Xen or KVM.
• Hosts the resource pool.
[Diagram: user1 logs in at the FRONT END and is connected to the BACK END as user1@nc1]
XEN: Virtualize your resources
Xen is the underlying virtualization technology used by Eucalyptus. The Xen hypervisor allows several guest operating systems to run concurrently on the same computer hardware.
Xen partitions a single physical machine into multiple virtual machines, to provide server consolidation and utility computing. Existing applications and binaries run unmodified.
The hypervisor controls the MMU, CPU scheduling, and interrupt controller, presenting a virtual machine to guests.
HADOOP: Solution to BIG DATA
Roughly how long does it take to read 1 TB from a commodity hard disk?
Roughly 4 hours.
With HADOOP, a large cluster has sorted 1 TB in around:
62 seconds…
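The arithmetic behind these figures can be sketched in a few lines. The sustained throughput of 70 MB/s per disk and the count of 1000 disks are assumptions chosen for illustration, not numbers from the slide. The sketch only models raw reading, so it is a lower bound rather than a model of a full sort job.

```python
TB = 10**12  # bytes in one terabyte (decimal)
ASSUMED_THROUGHPUT = 70 * 10**6  # assumed sustained read rate per disk, in bytes/s


def read_time_seconds(total_bytes, throughput_bytes_per_s, disks=1):
    """Time to scan total_bytes spread evenly over `disks` drives read in parallel."""
    return total_bytes / (throughput_bytes_per_s * disks)


single = read_time_seconds(TB, ASSUMED_THROUGHPUT)
parallel = read_time_seconds(TB, ASSUMED_THROUGHPUT, disks=1000)  # hypothetical cluster size

print(f"1 disk:     {single / 3600:.1f} hours")
print(f"1000 disks: {parallel:.1f} seconds")
```

The point is that the speed-up comes from reading many disks at once, which is why Hadoop spreads data across a cluster instead of relying on one fast machine.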
Birth of HADOOP: Open-source alternative to GFS
Pre-2004: Cutting and Cafarella develop open-source projects for web-scale indexing, crawling and search.
2004: Jeffrey Dean and Sanjay Ghemawat introduce the MapReduce model used internally at Google.
2006: Hadoop becomes an official Apache project; Cutting joins Yahoo!, and Yahoo! adopts Hadoop.
HDFS: Hadoop Distributed File System
Files split into 128MB (or 64MB) blocks
Blocks replicated across several datanodes (usually 3)
Single namenode stores metadata (file names, block locations, etc.)
Optimized for large files and sequential reads
Clients read from the closest replica available (note: locality of reference).
If the replication for a block drops below target, it is automatically re-replicated.
[Figure: one Namenode and several Datanodes, with blocks 1 to 4 each replicated across the Datanodes]
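The block and replica bookkeeping described above can be illustrated with a small simulation. This is a sketch under simplifying assumptions: placement is plain round-robin (real HDFS placement is rack-aware), and the function names and datanode labels are hypothetical, not part of any Hadoop API.

```python
import math

BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB blocks, as on the slide
REPLICATION = 3                 # target number of replicas per block


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Number of blocks needed to store a file of file_size bytes."""
    return math.ceil(file_size / block_size)


def place_replicas(num_blocks, datanodes, replication=REPLICATION):
    """Namenode-style metadata: block id -> list of datanodes holding a replica.

    Round-robin placement, assumed here for simplicity.
    """
    return {
        block: [datanodes[(block + r) % len(datanodes)] for r in range(replication)]
        for block in range(num_blocks)
    }


def re_replicate(placement, failed, datanodes, replication=REPLICATION):
    """Drop replicas on the failed node and copy blocks to live nodes until the target is met."""
    live = [node for node in datanodes if node != failed]
    for holders in placement.values():
        holders[:] = [node for node in holders if node != failed]
        for node in live:
            if len(holders) >= replication:
                break
            if node not in holders:
                holders.append(node)
    return placement


nodes = ["dn1", "dn2", "dn3", "dn4"]  # hypothetical datanode names
blocks = split_into_blocks(500 * 1024 * 1024)  # a 500 MB file
metadata = place_replicas(blocks, nodes)
metadata = re_replicate(metadata, failed="dn2", datanodes=nodes)
```

After the simulated failure of dn2, every block is back at three replicas on the remaining nodes, which mirrors the automatic re-replication behaviour listed above.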
Data Flow
[Diagram: Web Servers, Scribe Servers, Network Storage, Hadoop Cluster, Oracle RAC, MySQL]
HADOOP and MapReduce:
Input -> Map -> Shuffle/Sort -> Reduce -> Output
Word Count: A typical Example
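The slide's illustration is not reproduced here, so the following is a minimal single-process sketch of word count in the MapReduce style. It imitates the Input, Map, Shuffle/Sort, Reduce and Output stages with plain Python functions and does not use the Hadoop API; all function names are chosen for this example.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle/Sort: group all values by key and order the keys."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))


def reduce_phase(word, counts):
    """Reduce: sum the counts collected for one word."""
    return word, sum(counts)


def word_count(lines):
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(word, counts) for word, counts in shuffle(pairs).items())


result = word_count(["big data big cloud", "cloud data"])
print(result)
```

On a real cluster the map calls run on the nodes that hold the input blocks, and the shuffle moves the intermediate pairs across the network to the reducers.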
Implementation: Hardware
Move code to data (local computation)
Allow programs to scale transparently w.r.t. size of input
Abstract away fault tolerance, synchronization, etc.
HADOOP in action!
SOCIAL NETWORKING ANALYSIS
PAGE RANKING ANALYSIS
ANALYTICS ENGINE WITH MAP/REDUCE
IMAGE PROCESSING
Social Networking Analysis:
Problem: recommend new friends (friend-of-a-friend, FOAF)
Map task:
– U (target user) is fixed and its friends list is copied to all cluster nodes (“copy join”); each cluster node stores part of the social graph
– In: (X, <friendsX>), i.e. the local data for the cluster node
– Out:
    if (U, X) are friends => (U, <friendsX\friendsU>), i.e. the users who are friends of X but not already friends of U
    nil otherwise
Reduce task:
– In: (U, <<friendsA\friendsU>, <friendsB\friendsU>, …>), i.e. the FOAF lists for all users A, B, etc. who are friends with U
– Out: (U, <(X1, N1), (X2, N2), …>), where each X is a FOAF for U, and N is its total number of occurrences in all FOAF lists (sort/rank the result!)
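The map and reduce tasks above can be sketched as two small functions. This is an in-memory illustration, not Hadoop code: the graph is a hypothetical dictionary from user to a set of friends, and U itself is removed from each FOAF list, which is an assumption added here because U always appears in the friends list of its own friends.

```python
from collections import Counter


def foaf_map(u, friends_u, x, friends_x):
    """Map: if X is a friend of U, emit the friends of X that U does not have yet."""
    if x in friends_u:
        return u, friends_x - friends_u - {u}
    return None  # "nil otherwise"


def foaf_reduce(u, foaf_lists):
    """Reduce: count how often each candidate occurs and rank by that count."""
    counts = Counter()
    for candidates in foaf_lists:
        counts.update(candidates)
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return u, ranked


def recommend(u, graph):
    friends_u = graph[u]
    emitted = [foaf_map(u, friends_u, x, friends_x) for x, friends_x in graph.items()]
    foaf_lists = [out[1] for out in emitted if out is not None]
    return foaf_reduce(u, foaf_lists)


# Hypothetical social graph: user -> set of friends
graph = {
    "U": {"A", "B"},
    "A": {"U", "C", "D"},
    "B": {"U", "C"},
    "C": {"A", "B"},
    "D": {"A"},
}
recommendation = recommend("U", graph)
print(recommendation)
```

Here C is ranked first because it is reachable through two of U's friends, while D is reachable through only one.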
Pros and Cons
Batch, offline jobs
Write-once, read-many across full data set
Usually, though not always, simple computations
I/O bound by disk/network bandwidth
What it’s not:
High-performance parallel computing, e.g. MPI
Low-latency random access relational database
Always the right solution
Cloud Security: Threats unveiled
XML SIGNATURE ATTACK:
In this attack, known as XML signature wrapping, the original SOAP body element is moved into a newly added bogus wrapper element in the SOAP security header. The moved body is still referenced by the signature through its identifier attribute Id="body", and the signature remains cryptographically valid because the body element has been relocated, not modified. To keep the SOAP message XML schema compliant, the attacker then changes the identifier of the newly placed, empty SOAP body (in this example to Id="attack"). The attacker can now fill this body with bogus content, and any operation the attacker defines there is executed because signature verification succeeds.
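The mismatch that makes this attack work can be reproduced with a toy verifier. The sketch below makes strong simplifications that are assumptions of this example: a SHA-256 digest stands in for the real XML signature, namespaces are omitted, and the operation names ListInstances and TerminateInstances are hypothetical. It is not the code of any real cloud service.

```python
import hashlib
import xml.etree.ElementTree as ET

SIGNED_ID = "body"  # identifier that the (simulated) signature refers to


def digest(element):
    """Stand-in for an XML signature: a hash over the serialized element."""
    return hashlib.sha256(ET.tostring(element)).hexdigest()


def find_by_id(root, wanted):
    """Return the first element anywhere in the document whose Id matches."""
    for element in root.iter():
        if element.get("Id") == wanted:
            return element
    return None


def naive_service(message, trusted_digest):
    """Verifies the element referenced by Id, but executes the top-level Body."""
    root = ET.fromstring(message)
    signed = find_by_id(root, SIGNED_ID)
    if signed is None or digest(signed) != trusted_digest:
        return "rejected"
    executed = root.find("Body")
    return "executed: " + executed.findtext("Operation")


def strict_service(message, trusted_digest):
    """Additionally requires that the verified element is the one being executed."""
    root = ET.fromstring(message)
    signed = find_by_id(root, SIGNED_ID)
    executed = root.find("Body")
    if signed is None or executed is not signed or digest(signed) != trusted_digest:
        return "rejected"
    return "executed: " + executed.findtext("Operation")


original_msg = (
    '<Envelope><Header><Security/></Header>'
    '<Body Id="body"><Operation>ListInstances</Operation></Body></Envelope>'
)
trusted = digest(ET.fromstring(original_msg).find("Body"))

# The signed body is relocated into a wrapper; a new body with Id="attack" takes its place.
attack_msg = (
    '<Envelope><Header><Security><Wrapper>'
    '<Body Id="body"><Operation>ListInstances</Operation></Body>'
    '</Wrapper></Security></Header>'
    '<Body Id="attack"><Operation>TerminateInstances</Operation></Body></Envelope>'
)

print(naive_service(attack_msg, trusted))   # executed: TerminateInstances
print(strict_service(attack_msg, trusted))  # rejected
```

The naive service accepts the attack because signature verification and application logic look at different elements. The strict variant closes that gap by insisting that both refer to the same element.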
Script Injection Attack
Targets only the AWS management console users.
Exploits the credentials shared between the Amazon shop interface and AWS.
The first vulnerability exploits the GET parameters in the download link that users follow to download their X.509 certificates issued by Amazon. However, the preconditions for the attack are rather high: the injected script must be UTF-7 encoded to bypass the server logic that encodes standard HTML characters, and the attack relies on features of specific IE versions.
The second script injection attack is a persistent cross-site scripting attack that exploits the login session initiated with AWS the first time a user logs into the Amazon shop interface.
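Why UTF-7 helps an attacker get past HTML encoding can be shown with the Python standard library alone. The payload below is a harmless, hypothetical example and not the one used against AWS. The sketch assumes a server that escapes the usual HTML special characters and a browser that later interprets the response as UTF-7.

```python
import html

# UTF-7 form of "<script>alert(1)</script>": "<" is "+ADw-" and ">" is "+AD4-".
payload = "+ADw-script+AD4-alert(1)+ADw-/script+AD4-"

# Server-side escaping only rewrites &, <, >, " and ', none of which occur in the payload.
escaped = html.escape(payload)

# A client that decodes the response as UTF-7 recovers the script tag.
decoded = escaped.encode("ascii").decode("utf-7")

print(escaped == payload)  # True
print(decoded)             # <script>alert(1)</script>
```

Declaring an explicit character set in the response is the usual countermeasure, since it stops the browser from guessing UTF-7.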
Who uses it? Applications and Innovations
Projects under Hadoop:
HBase
ZooKeeper
Pig
Zombie
Hive
Sqoop
That’s the end..But the beginning of a new horizon..
Special thanks to the entire team that helped me in this endeavor.
ALL QUERIES, PLEASE CONTACT ME AT: [email protected]
QUESTIONS?