27
Big Data Analytics Big Data Analytics Rasoul Karimi Information Systems and Machine Learning Lab (ISMLL) Institute of Computer Science University of Hildesheim, Germany Big Data Analytics Rasoul Karimi, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany Big Data Analytics 1/1

Big Data Analytics - Universität Hildesheim · Big Data Analytics NoSQL Databases I A NoSQL or Not Only SQL database provides a mechanism for storage and retrieval of data that is

  • Upload
    lamque

  • View
    222

  • Download
    0

Embed Size (px)

Citation preview

Big Data Analytics

Big Data Analytics

Rasoul Karimi

Information Systems and Machine Learning Lab (ISMLL)Institute of Computer Science

University of Hildesheim, Germany

Big Data Analytics

Rasoul Karimi, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany

Big Data Analytics 1 / 1

Big Data Analytics

Introduction

I Big Data: Storing and accessing large amounts of (unstructured) data

I Distributed File Systems: allow files to be accessed using the sameinterfaces and semantics as local files.

I Distributed Databases: store data in multiple computers.

Rasoul Karimi, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany

Big Data Analytics 0 / 1

Big Data Analytics

Traditional Database Applications

Rasoul Karimi, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany

Big Data Analytics 1 / 1

Big Data Analytics

Distributed Applications

Rasoul Karimi, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany

Big Data Analytics 2 / 1

Big Data Analytics

Relational vs Big-Data Technologies

I Structure: In relational data bases, the data is structured but in bigdata technologies the data is raw.

I Process: Relational data bases were initially created for transactionalprocessing, but big data technologies are for data analysis.

I Entities: A relational data base is big if it has many entities andmany rows per each entity. But in big data technologies, the numberof entities is not necessary high. The amount of information thatexist for entities is big.

I horizontal scaling: Adding new nodes in big data technologies canbe done with low cost while in relational data base, either it fixed or itis expensive.

Rasoul Karimi, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany

Big Data Analytics 3 / 1

Big Data Analytics

NoSQL Databases

I A NoSQL or Not Only SQL database provides a mechanism forstorage and retrieval of data that is modeled in means other thanthe tabular relations used in relational databases.

I Motivations for this approach include simplicity of design andhorizontal scaling.

I The data structure differs from the RDBMS, and therefore someoperations are faster in NoSQL and some in RDBMS.

I Most NoSQL stores lack true ACID transactions.

Rasoul Karimi, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany

Big Data Analytics 4 / 1

Big Data Analytics

NoSQL Databases

I Graph: Allegro, Neo4J, OrientDB, Virtuoso

I Column: Accumulo, Cassandra, HBase

I Document: Clusterpoint, Couchbase, MarkLogic, MongoDB

I Key-value: Dynamo, FoundationDB, MemcacheDB, Redis, Riak

Rasoul Karimi, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany

Big Data Analytics 5 / 1

Big Data Analytics

Graph

A graph is a representation of a set of objects where some pairs of objectsare connected by links.Example applications of graph in computer science:

I Automata: self-operating virtual machines to help in logicalunderstanding of input and output process

I Markov Decision Process: provide a mathematical framework formodeling decision making

I XML Path Language (XPATH): is a query language for selectingnodes from an XML document.

In big data, graphs are used to model entities and their relationships thatare distributed in all nodes.

Rasoul Karimi, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany

Big Data Analytics 6 / 1

Big Data Analytics

Graph databasesA graph database is a database that uses graph structures with nodes,edges, and properties to represent and store data.

Rasoul Karimi, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany

Big Data Analytics 7 / 1

Big Data Analytics

Graph databases

I Nodes represent entities such as people, businesses, accounts, or anyother item you might want to keep track of.

I Properties are relevant information that relate to nodes.

I Edges are the lines that connect nodes to nodes

I Most of the important information is really stored in the edges.

I Meaningful patterns emerge when one examines the connections andinterconnections of nodes, properties, and edges.

Rasoul Karimi, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany

Big Data Analytics 8 / 1

Big Data Analytics

Graph databases

I Compared with relational databases, graph databases are often fasterfor associative data sets

I They map more directly to the structure of object-orientedapplications.

I As they depend less on a rigid schema, they are more suitable tomanage ad hoc and changing data with evolving schemas.

I Graph databases are a powerful tool for graph-like queries.

Rasoul Karimi, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany

Big Data Analytics 9 / 1

Big Data Analytics

Graph queries

I Reachability queries

I shortest path queries

I Pattern queries

Rasoul Karimi, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany

Big Data Analytics 10 / 1

Big Data Analytics

Resource Description Framework (RDF)

I RDF is W3C standard format for Linked Data.

I It is similar to classic conceptual modeling approaches likeentity-relationship and class diagram.

I In RDF, data are subject-predicate-object expressions (triples). Forexample, ”The Shirt has the color blue” in RDF is as the triple: asubject denoting ”the shirt”, a predicate denoting ”has”, and anobject denoting ”the color blue”.

I A collection of RDF statements represents a labeled, directedmulti-graph.

Rasoul Karimi, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany

Big Data Analytics 11 / 1

Big Data Analytics

Example

There is a Person identified by:http : //www .w3.org/People/EM/contact#mewhose name is Eric Miller, whose email address is [email protected], and whosetitle is Dr.< http : //www .w3.org/People/EM/contact#me >< http ://www .w3.org/2000/10/swap/pim/contact#fullName > ”EricMiller”

< http : //www .w3.org/People/EM/contact#me >< http ://www .w3.org/2000/10/swap/pim/contact#mailbox >< mailto : [email protected] >

< http : //www .w3.org/People/EM/contact#me >< http ://www .w3.org/2000/10/swap/pim/contact#personalTitle > ”Dr .”

< http : //www .w3.org/People/EM/contact#me >< http : //www .w3.org/1999/02/22−rdf − syntax − ns#type >< http : //www .w3.org/2000/10/swap/pim/contact#Person >

Rasoul Karimi, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany

Big Data Analytics 12 / 1

Big Data Analytics

Column Databases

I Column databases are a tuple (a key-value pair) consisting of threeelements:

I Unique name: Used to reference the columnI Value: The content of the column.I Timestamp: The system timestamp used to determine the valid

content.

I Differences to a relational database

Rasoul Karimi, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany

Big Data Analytics 13 / 1

Big Data Analytics

Example

{street: name: ”street”, value: ”1234 x street”, timestamp: 123456789,city: name: ”city”, value: ”san francisco”, timestamp: 123456789,zip: name: ”zip”, value: ”94107”, timestamp: 123456789,

}

Rasoul Karimi, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany

Big Data Analytics 14 / 1

Big Data Analytics

Document Databases

I A document-oriented database is a computer program designed forstoring, retrieving, and managing document-oriented information.

I In contrast to relational databases and their notions of ”Relations”(or ”Tables”), these systems are designed around an abstract notionof a ”Document”.

I Documents inside a document-oriented database are not required tohave all the same sections, slots, parts, or keys.

Rasoul Karimi, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany

Big Data Analytics 15 / 1

Big Data Analytics

Example 1

{FirstName: ”Bob”,Address: ”5 Oak St.”,Hobby: ”sailing”

}

Rasoul Karimi, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany

Big Data Analytics 16 / 1

Big Data Analytics

Example 2

{FirstName: ”Jonathan”,Address: ”15 Wanamassa Point Road”,Children: {

Name: ”Michael”, Age: 10,Name: ”Jennifer”, Age: 8,Name: ”Samantha”, Age: 5,Name: ”Elena”, Age: 2

}}

Rasoul Karimi, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany

Big Data Analytics 17 / 1

Big Data Analytics

Document Databases

I Documents are addressed in the database via a unique key thatrepresents that document.

I Documents can be retrieved by theirI keyI content

I Documents are organized throughI CollectionsI TagsI Non-visible MetadataI Directory hierarchiesI Buckets

Rasoul Karimi, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany

Big Data Analytics 18 / 1

Big Data Analytics

XML Databases

I XML databases are in the category of document-oriented databases.

I An XML database is a data persistence software system that allowsdata to be stored in XML format.

I Two major classes of XML database exist:I XML-enabled: The input and output are XML, but they are stored in

relational tables.I Native XML: In addition to the input and output, the structure of

data base is also XML.

Rasoul Karimi, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany

Big Data Analytics 19 / 1

Big Data Analytics

XML-enabled Databases

I XML enabled databases typically offer one or more of the followingapproaches to storing XML within the traditional relational structure:

I XML is stored into a CLOB (Character large object)I XML is ‘cut‘ into a series of Tables based on a SchemaI XML is stored into a native XML Type as defined by the ISO

I Typically an XML enabled database is best suited where the majorityof data are non-XML.

Rasoul Karimi, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany

Big Data Analytics 20 / 1

Big Data Analytics

Native XML databases

I Defines a (logical) model for an XML document and stores andretrieves documents according to that model.

I Has an XML document as its fundamental unit of (logical) storage.

I Need not have any particular underlying physical storage model.

Additionally, many XML databases provide a logical model of groupingdocuments, called ”collections”.

Rasoul Karimi, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany

Big Data Analytics 21 / 1

Big Data Analytics

querying Languages in XML databases

All XML databases now support at least one form of querying syntax:

I XPath

I XSLT

I XQuery

Rasoul Karimi, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany

Big Data Analytics 22 / 1

Big Data Analytics

Key–Value stores

I Key–Value stores use the associative array as their fundamental datamodel.

I In this model, data is represented as a collection of key–value pairs.

I The key–value model is one of the simplest non-trivial data models.

Example:{”Great Expectations”: ”John”,”Pride and Prejudice”: ”Alice”,”Wuthering Heights”: ”Alice”}

Rasoul Karimi, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany

Big Data Analytics 23 / 1

Big Data Analytics

Object Database

I An object database is a database management system in whichinformation is represented in the form of objects as used inobject-oriented programming.

I Most object databases also offer some kind of query language,allowing objects to be found using a declarative programmingapproach (OQL)

I Access to data can be faster because joins are often not needed.

I Many object databases offer support for versioning.

I They are specially suitable in applications with complex data.

Rasoul Karimi, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany

Big Data Analytics 24 / 1

Big Data Analytics

NoSQL databases on the cloud

I NoSQL databases can be run cloud platforms like Amazon WebServices.

I There are three common deployment models for NoSQL on the cloud:

I Virtual machine image: cloud platforms allow users to rent virtualmachine instances for a limited time and run a NoSQL database onthese virtual machines.

I Database as a service: some cloud platforms offer options for usingfamiliar NoSQL database products as a service, such as MongoDB,Redis and Cassandra, without physically launching a virtual machineinstance for the database.

Rasoul Karimi, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany

Big Data Analytics 25 / 1