Big Data Analytics
Lucas Rego Drumond
Information Systems and Machine Learning Lab (ISMLL), Institute of Computer Science
University of Hildesheim, Germany
Distributed File Systems and NoSQL Database
Lucas Rego Drumond, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany
Distributed File Systems and NoSQL Database 1 / 31
Outline
1. Distributed File Systems
2. NoSQL Databases
Big Data Analytics 1. Distributed File Systems
Why do we need a Distributed File System?
Read?
- Whole file?
- Specific part?
Write?
- Append to the end of the file?
- Insert content in the middle?
We want to:
- Perform multiple parallel reads and writes
- Have the files available even if one computer crashes (replication)
- Hide parallelization and distribution details
What is a Distributed File System?
File Namespace
/
/home
/home/lucas
/home/lucas/big_file
File Namespace
/
/home
/home/john
/home/john/big_file
Examples
- GFS (Google Inc.)
- HDFS (Apache Software Foundation)
- Ceph (Inktank, Red Hat)
- MooseFS (Core Technology / Gemius)
- Windows Distributed File System (DFS) (Microsoft)
- FhGFS (Fraunhofer)
- GlusterFS (Red Hat)
- Lustre
- Ibrix
Components
A typical distributed file system contains the following components:
- Clients: provide the interface to the user
- Chunk nodes: store chunks of files
- Master node: stores which parts of each file are on which chunk node
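To make the division of labour concrete, here is a minimal Python sketch of the three components. It assumes a master that keeps only metadata and chunk nodes that keep only data. The names ChunkNode, MasterNode, register and locate are hypothetical and not taken from GFS, HDFS or any other real system.

```python
from dataclasses import dataclass, field


@dataclass
class ChunkNode:
    """Holds the actual chunk data, keyed by (file path, chunk index)."""
    name: str
    chunks: dict = field(default_factory=dict)

    def store(self, path, index, data):
        self.chunks[(path, index)] = data


@dataclass
class MasterNode:
    """Holds only metadata: which chunk nodes store which chunk of which file."""
    locations: dict = field(default_factory=dict)

    def register(self, path, index, node_names):
        self.locations.setdefault(path, {})[index] = list(node_names)

    def locate(self, path, index):
        return self.locations[path][index]


master = MasterNode()
nodes = {name: ChunkNode(name) for name in ("C1", "C7")}

# Chunk 1 is stored on two chunk nodes; the master only records where it is.
for node in nodes.values():
    node.store("/home/john/big_file", 1, b"first chunk")
master.register("/home/john/big_file", 1, ["C1", "C7"])

# A client would ask the master for locations and then talk to a chunk node.
print(master.locate("/home/john/big_file", 1))
```

The point of the split is that file contents never pass through the master, so it does not become a bandwidth bottleneck.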
Distributed File Systems
[Figure: The Google File System architecture]
Distributed File Systems - Storing files
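Chunk nodes: C1 C2 C3 C4 C5 C6 C7 C8
Master node namespace:
/
/home
/home/john
/home/john/big_file
The file /home/john/big_file is split into Chunk 1, Chunk 2, Chunk 3 and Chunk 4.
Chunk locations for /home/john/big_file, as stored on the master node:
Chunk     Chunk nodes
Chunk 1   C1, C7
Chunk 2   C3, C5
Chunk 3   C4, C6
Chunk 4   C2, C8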
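The storing step can be sketched in Python as follows: a file is cut into fixed-size chunks and each chunk is assigned to two distinct chunk nodes. The round-robin placement used here is an assumption made only for illustration. The slide does not say how the master chooses nodes, and real systems use more elaborate policies. The function names are hypothetical.

```python
def split_into_chunks(data, chunk_size):
    """Cut a byte string into consecutive chunks of at most chunk_size bytes."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def place_replicas(num_chunks, node_names, replicas=2):
    """Assign each chunk to `replicas` distinct nodes (assumed round-robin policy)."""
    table = {}
    for chunk in range(num_chunks):
        start = (chunk * replicas) % len(node_names)
        table[chunk + 1] = [
            node_names[(start + r) % len(node_names)] for r in range(replicas)
        ]
    return table


node_names = ["C1", "C2", "C3", "C4", "C5", "C6", "C7", "C8"]
data = b"x" * 10                       # stand-in for the content of big_file
chunks = split_into_chunks(data, chunk_size=3)
table = place_replicas(len(chunks), node_names)

for chunk_number, holders in table.items():
    print(f"Chunk {chunk_number}: {holders}")
```

The resulting table plays the role of the chunk location table kept by the master node.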
Read Example
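Chunk nodes: C1 C2 C3 C4 C5 C6 C7 C8
Master node namespace:
/
/home
/home/john
/home/john/big_file
Chunk locations for /home/john/big_file:
Chunk 1: C1, C7
Chunk 2: C3, C5
Chunk 3: C4, C6
Chunk 4: C2, C8
Read steps:
1. Client application to master node: read(/home/john/big_file, chunk 1)
2. Master node to client application: (Chunk 1 handle, {C1, C7})
3. Client application to chunk node: (Chunk 1 handle, byte range)
4. Chunk node to client application: Chunk 1 data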
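The four read steps can be traced in a small Python sketch. It is a simplification under stated assumptions: the handle format, the class names and the choice of always reading from the first listed replica are hypothetical.

```python
CHUNK_TABLE = {
    "/home/john/big_file": {
        1: ["C1", "C7"], 2: ["C3", "C5"], 3: ["C4", "C6"], 4: ["C2", "C8"],
    }
}


class Master:
    def lookup(self, path, chunk_index):
        """Steps 1 and 2: return a chunk handle and the nodes holding the chunk."""
        handle = f"{path}#chunk{chunk_index}"   # hypothetical handle format
        return handle, CHUNK_TABLE[path][chunk_index]


class ChunkNode:
    def __init__(self):
        self.store = {}                          # handle -> bytes

    def read(self, handle, start, end):
        """Steps 3 and 4: return the requested byte range of the chunk."""
        return self.store[handle][start:end]


def client_read(master, nodes, path, chunk_index, start, end):
    handle, replicas = master.lookup(path, chunk_index)
    node = nodes[replicas[0]]                    # any replica would do
    return node.read(handle, start, end)


master = Master()
nodes = {"C1": ChunkNode(), "C7": ChunkNode()}
for node in nodes.values():
    node.store["/home/john/big_file#chunk1"] = b"hello distributed world"

print(client_read(master, nodes, "/home/john/big_file", 1, 0, 5))
```

Only the small lookup goes through the master. The data itself is transferred directly between the chunk node and the client.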
Write Example
- Make sure each replica contains the same data at all times
- One replica is designated to be the primary replica
- The master pings the nodes to make sure they are alive
Chunk nodes: C1 C2 C3 C4 C5 C6 C7 C8
Master node namespace:
/
/home
/home/john
/home/john/big_file
Chunk locations for /home/john/big_file:
Chunk 1: C1, C7
Chunk 2: C3, C5
Chunk 3: C4, C6
Chunk 4: C2, C8
Client Application
1. write(/home/john/big_file, chunk 1)
2. (Chunk 1 handle, {C1, C7})
3. (Chunk 1 handle, data)
4. (Chunk 1 handle, offset)
5. Return status (success or failure)
6. done
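One plausible reading of steps 3 to 6 is that the client hands the data to the primary replica, the primary chooses the offset and forwards it to the other replicas, these report success or failure, and the primary then answers the client. The Python sketch below follows that reading. It is an assumption about the diagram, not a description of the actual GFS protocol, and all names are hypothetical.

```python
class Replica:
    def __init__(self, name):
        self.name = name
        self.chunks = {}                      # handle -> bytearray

    def apply(self, handle, offset, data):
        """Write data at offset; return a status (step 5)."""
        buf = self.chunks.setdefault(handle, bytearray())
        if offset > len(buf):
            return False                      # would leave a hole; rejected in this sketch
        buf[offset:offset + len(data)] = data
        return True


class Primary(Replica):
    def __init__(self, name, secondaries):
        super().__init__(name)
        self.secondaries = secondaries

    def write(self, handle, data):
        """Step 3: receive the data. Step 4: fix the offset and forward it."""
        offset = len(self.chunks.get(handle, b""))   # append at the current end
        ok = self.apply(handle, offset, data)
        for secondary in self.secondaries:
            ok = secondary.apply(handle, offset, data) and ok
        return ok                              # step 6: report back to the client


c7 = Replica("C7")
c1 = Primary("C1", [c7])

# Two writes to the same chunk are applied in the order chosen by the primary.
print(c1.write("chunk-1", b"abc"), c1.write("chunk-1", b"def"))
print(bytes(c7.chunks["chunk-1"]))
```

Because a single node picks the offset for every write, all replicas apply the writes in the same order.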
Considerations
- Reads are very efficient operations
- Writes are efficient if they are appends to the end of the file
- Writes in the middle of a file can be problematic
- The primary replica decides the order in which writes are applied:
  - Data is always consistent in all replicas
GFS vs. HDFS
                   HDFS                                   GFS
Chunk size         128 MB                                 64 MB
Default replicas   2 files (data and generation stamp)    3 chunk nodes
Master             NameNode                               GFS Master
Chunk nodes        DataNode                               Chunk Server
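The chunk size determines how many chunks, and therefore how many metadata entries on the master, a file needs. The short Python calculation below uses the two chunk sizes from the table. The 1 GB file size is just an example value.

```python
MB = 1024 * 1024


def num_chunks(file_size_bytes, chunk_size_bytes):
    """Number of chunks needed; the last chunk may be only partially filled."""
    return (file_size_bytes + chunk_size_bytes - 1) // chunk_size_bytes


file_size = 1024 * MB                  # example: a 1 GB file

print(num_chunks(file_size, 64 * MB))   # GFS chunk size
print(num_chunks(file_size, 128 * MB))  # HDFS chunk size
```

With these sizes the file needs 16 chunks at 64 MB and 8 chunks at 128 MB, so doubling the chunk size halves the number of entries the master has to track for it.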
Google File System
Hadoop Distributed File System
Big Data Analytics 2. NoSQL Databases
Databases for Big Data: NoSQL
NoSQL: “Not only SQL”
A wide variety of database technologies characterized by:
- Non-relational data models
- Distributed storage and processing
- Dynamic schemas
- Horizontal scalability
Relational vs NoSQL Databases
Relational Databases
- Structured data: relational tables
- Vertical scaling
- ACID
- Atomic transactions
- More functionality, less scalability
NoSQL Databases
- Structured and unstructured data: collections
- Horizontal scaling
- BASE
- Eventual consistency
- Less functionality, more scalability
Types of NoSQL Databases
Graph databases
A graph database is a database that uses graph structures with nodes, edges, and properties to represent and store data.
- Compared with relational databases, graph databases are often faster for associative data sets.
- They map more directly to the structure of object-oriented applications.
- As they depend less on a rigid schema, they are more suitable for managing ad hoc and changing data with evolving schemas.
- Graph databases are a powerful tool for graph-like queries.
Graph queries
- Reachability queries
- Shortest path queries
- Pattern queries
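The first two query types can be illustrated with a breadth-first search over a plain adjacency list. This is a minimal Python sketch on a made-up graph and says nothing about how a real graph database stores or indexes its data. Pattern queries are not covered here.

```python
from collections import deque


def shortest_path(graph, source, target):
    """Return a path with the fewest edges from source to target, or None."""
    parents = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            path = []
            while node is not None:          # walk back to the source
                path.append(node)
                node = parents[node]
            return path[::-1]
        for neighbor in graph.get(node, []):
            if neighbor not in parents:      # visit each node only once
                parents[neighbor] = node
                queue.append(neighbor)
    return None


def reachable(graph, source, target):
    """Reachability query: is there any path from source to target?"""
    return shortest_path(graph, source, target) is not None


# Hypothetical directed graph; an edge could mean "follows" or "knows".
graph = {
    "alice": ["bob", "carol"],
    "bob": ["dave"],
    "carol": ["dave"],
    "dave": [],
    "erin": ["alice"],
}

print(shortest_path(graph, "alice", "dave"))
print(reachable(graph, "alice", "erin"))
```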
Column Databases
Column databases store tuples consisting of three elements:
- Unique name: used to reference the column.
- Value: the content of the column.
- Timestamp: the system timestamp used to determine the valid content.
Main advantage: new information about existing entities can be added efficiently.
Example
{
  street: {name: "street", value: "1234 x street", timestamp: 123456789},
  city: {name: "city", value: "san francisco", timestamp: 123456789},
  zip: {name: "zip", value: "94107", timestamp: 123456789}
}
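The role of the timestamp can be shown with a small Python sketch in which a row is a dictionary of columns and a write only takes effect if it is at least as recent as the stored one. The put function, the newer zip value and the country column are hypothetical additions for illustration. Real column stores resolve versions in their own ways.

```python
def put(row, name, value, timestamp):
    """Store a (name, value, timestamp) column; the newest timestamp wins."""
    current = row.get(name)
    if current is None or timestamp >= current["timestamp"]:
        row[name] = {"name": name, "value": value, "timestamp": timestamp}


row = {}
put(row, "street", "1234 x street", 123456789)
put(row, "city", "san francisco", 123456789)
put(row, "zip", "94107", 123456789)

put(row, "zip", "94110", 123456790)    # newer write replaces the old value
put(row, "zip", "00000", 100)          # stale write is ignored
put(row, "country", "USA", 123456790)  # new information about an existing entity

print(row["zip"]["value"], sorted(row))
```

Adding the country column did not require changing a schema or touching the other columns, which is the main advantage named above.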
Document Databases
- Designed for storing, retrieving, and managing document-oriented information.
- In contrast to relational databases and their notions of "Relations" (or "Tables"), these systems are designed around an abstract notion of a "Document".
- Documents inside a document-oriented database are not required to have all the same sections, slots, parts, or keys.
- Documents are addressed in the database via a unique key that represents that document.
Example 1
{
  FirstName: "Bob",
  Address: "5 Oak St.",
  Hobby: "sailing"
}
Example 2
{
  FirstName: "Jonathan",
  Address: "15 Wanamassa Point Road",
  Children: [
    {Name: "Michael", Age: 10},
    {Name: "Jennifer", Age: 8},
    {Name: "Samantha", Age: 5},
    {Name: "Elena", Age: 2}
  ]
}
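The two example documents can live in the same collection even though their keys differ. The Python sketch below models a collection as a dictionary from a unique key to a document. The insert and find helpers and the document keys "bob" and "jonathan" are hypothetical and do not correspond to the API of any particular document database.

```python
documents = {}


def insert(doc_id, document):
    """Store a document under its unique key."""
    documents[doc_id] = document


def find(predicate):
    """Return the keys of all documents for which predicate(document) is true."""
    return [doc_id for doc_id, doc in documents.items() if predicate(doc)]


insert("bob", {"FirstName": "Bob", "Address": "5 Oak St.", "Hobby": "sailing"})
insert("jonathan", {
    "FirstName": "Jonathan",
    "Address": "15 Wanamassa Point Road",
    "Children": [
        {"Name": "Michael", "Age": 10},
        {"Name": "Jennifer", "Age": 8},
        {"Name": "Samantha", "Age": 5},
        {"Name": "Elena", "Age": 2},
    ],
})

# Queries must tolerate missing keys, since documents need not share a schema.
print(find(lambda d: "Hobby" in d))
print(find(lambda d: any(c["Age"] < 5 for c in d.get("Children", []))))
```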
Key–Value stores
- Key–Value stores use the associative array as their fundamental data model.
- In this model, data is represented as a collection of key–value pairs.
- The key–value model is one of the simplest non-trivial data models.
Example:
{
  "Great Expectations": "John",
  "Pride and Prejudice": "Alice",
  "Wuthering Heights": "Alice"
}
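In Python the associative array is the built-in dictionary, so the model can be sketched in a few lines. The KeyValueStore class and its put, get and delete methods are hypothetical names chosen for this illustration. The data is the example above.

```python
class KeyValueStore:
    """A minimal in-memory key-value store backed by a dictionary."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)

    def delete(self, key):
        self._data.pop(key, None)


store = KeyValueStore()
store.put("Great Expectations", "John")
store.put("Pride and Prejudice", "Alice")
store.put("Wuthering Heights", "Alice")

print(store.get("Pride and Prejudice"))
store.delete("Great Expectations")
print(store.get("Great Expectations"))
```

The store can only look values up by key. Finding every entry whose value is "Alice" would require scanning all pairs, which is the price of the simple model.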