Hadoop and IoT Sinergija 2014

Page 1: Hadoop and IoT Sinergija 2014
Page 2: Hadoop and IoT Sinergija 2014

Hadoop and IoT

Darko Marjanović

Đorđe Stepanić

Miloš Milovanović

Page 3: Hadoop and IoT Sinergija 2014

AGENDA

- BIG DATA
- HADOOP AND IOT MODEL
- HADOOP
- IOT
- HADOOP DATA PROCESSING
- HIVE
- STINGER INITIATIVE
- Q&A

Page 4: Hadoop and IoT Sinergija 2014

BIG DATA

Big Data describes collections of data sets so large and complex that they are difficult to capture, process, store, search and analyze using conventional database systems.

Anything that Won't Fit in Excel.

*Definition taken from (www.bigdata-startups.com)

Page 5: Hadoop and IoT Sinergija 2014

BIG DATA DIMENSIONS

1992 100GB/Day

2002 100GB/Second

2013 28,000GB/Second

2018 50,000GB/Second
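
To compare these figures on one scale, the short sketch below converts them all to GB per second and computes growth factors between years. It uses only the numbers from the slide; the names `rates_gb_per_s` and `growth_factor` are illustrative and not part of the original talk.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds

# Figures from the slide, normalized to GB per second.
rates_gb_per_s = {
    1992: 100 / SECONDS_PER_DAY,  # 100 GB/day
    2002: 100.0,                  # 100 GB/second
    2013: 28_000.0,
    2018: 50_000.0,
}


def growth_factor(from_year, to_year):
    """Return how many times larger the rate is in to_year than in from_year."""
    return rates_gb_per_s[to_year] / rates_gb_per_s[from_year]


for year in sorted(rates_gb_per_s):
    factor = growth_factor(1992, year)
    print(f"{year}: {rates_gb_per_s[year]:.4f} GB/s, x{factor:,.0f} vs 1992")
```

The step from 100 GB/day to 100 GB/second alone is a factor of 86,400, which is simply the number of seconds in a day.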

Page 6: Hadoop and IoT Sinergija 2014

HADOOP AND IOT

Page 7: Hadoop and IoT Sinergija 2014

HADOOP

Apache Hadoop is an open-source software framework for storage and large-scale processing of data-sets on clusters of commodity hardware.

Hadoop was created by Doug Cutting and Mike Cafarella in 2005.

All the modules in Hadoop are designed with a fundamental assumption that hardware failures are common and thus should be automatically handled in software by the framework.

Page 8: Hadoop and IoT Sinergija 2014

HADOOP COMPONENTS

Hadoop Common

HDFS

MapReduce

YARN (Starting with Hadoop 2.x.x)

Page 9: Hadoop and IoT Sinergija 2014

HADOOP HDFS

The Hadoop distributed file system (HDFS) is a distributed, scalable, and portable file-system written in Java for the Hadoop framework.

Page 10: Hadoop and IoT Sinergija 2014

HADOOP MAP REDUCE

MapReduce is a programming model and an associated implementation for processing and generating large data sets with a parallel, distributed algorithm on a cluster.
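
The programming model can be illustrated with the classic word count. The sketch below runs in a single Python process and only mimics the three stages (map, shuffle, reduce). It is not Hadoop's Java API, and the function names are chosen for illustration.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence over an iterable of lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data", "big clusters"]))
```

On a real cluster the map and reduce functions run in parallel on different nodes, and the framework performs the shuffle between them. Only the two user-defined functions change from job to job.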

Page 11: Hadoop and IoT Sinergija 2014

HADOOP YARN

Apache Hadoop YARN (Yet Another Resource Negotiator) is a cluster management technology. YARN is now characterized as a large-scale, distributed operating system for big data applications.

Page 12: Hadoop and IoT Sinergija 2014

HADOOP ECOSYSTEM

The main groups of tools in the Hadoop ecosystem:
- Data Ingestion (Flume, Sqoop …)
- Data Processing (Pig, Hive, Storm …)
- Cluster Management (Ambari)
- Security (Knox)

Page 13: Hadoop and IoT Sinergija 2014

DATA INGESTION

Flume

Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of streaming event data.

Sqoop

Apache Sqoop is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases.

WebHDFS REST API

Page 14: Hadoop and IoT Sinergija 2014

FLUME EXAMPLE

Page 15: Hadoop and IoT Sinergija 2014
Page 16: Hadoop and IoT Sinergija 2014

SQOOP AND WEB HDFS API EXAMPLE
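
The example shown on this slide is not part of the transcript. As a stand-in for the WebHDFS part only, the sketch below builds request URLs of the form `http://<host>:<port>/webhdfs/v1/<path>?op=<OPERATION>`. The host name, port and file path are placeholders, and `webhdfs_url` is a hypothetical helper, not part of any Hadoop client library.

```python
from urllib.parse import quote, urlencode


def webhdfs_url(host, port, hdfs_path, op, **params):
    """Build a WebHDFS URL: http://<host>:<port>/webhdfs/v1/<path>?op=<OP>&..."""
    if not hdfs_path.startswith("/"):
        raise ValueError("hdfs_path must be an absolute HDFS path")
    query = urlencode({"op": op, **params})
    return f"http://{host}:{port}/webhdfs/v1{quote(hdfs_path)}?{query}"


# Placeholder NameNode address and path, assumed for illustration only.
create_url = webhdfs_url(
    "namenode.example.com", 50070, "/iot/raw/readings.csv",
    "CREATE", overwrite="true",
)
list_url = webhdfs_url("namenode.example.com", 50070, "/iot/raw", "LISTSTATUS")

print(create_url)
print(list_url)
```

`CREATE` is sent as an HTTP PUT and `LISTSTATUS` as an HTTP GET. The sketch stops at building the URLs, so it can be tried without a running cluster.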

Page 17: Hadoop and IoT Sinergija 2014

IOT

Page 18: Hadoop and IoT Sinergija 2014

UBIQUITOUS COMPUTING & INTERNET OF THINGS

Ubiquitous computing - a trend (wave) in computing in which computers are spread throughout our everyday environment. Concept: one person - many computers.

Internet of Things - the network of physical objects accessed through the Internet that contain embedded technology to interact (sense and communicate) with their internal states or the external environment (Cisco definition).

Page 19: Hadoop and IoT Sinergija 2014

INTERNET OF THINGS COMPONENTS

Page 20: Hadoop and IoT Sinergija 2014

INTERNET OF THINGS AND BIG DATA

Page 21: Hadoop and IoT Sinergija 2014

REAL-TIME DATA, STRUCTURED AND UNSTRUCTURED DATA GENERATED FROM INTERNET OF THINGS

Page 22: Hadoop and IoT Sinergija 2014

INTERNET OF THINGS - FIELDS OF APPLICATION

Production - energy savings, lower maintenance costs, prediction of machine failure, quality control etc.

Logistics - efficient supply control, optimization of transport, environmental controls in the warehouse, JIT, lean logistics, better capacity utilization etc.

Smart cities & environment - smart parking, traffic congestion, smart lighting, waste management, urban noise maps, air pollution etc.

Smart agriculture

eHealth

and everything you can imagine...

Page 23: Hadoop and IoT Sinergija 2014

HADOOP DATA PROCESSING

Input:
- Raw data files
- No metadata
- No schema

Objective:
- Perform analysis, run interactive queries
- Explore, structure and analyze the data
- Real-time processing (Apache Storm)
- Visualization

Page 24: Hadoop and IoT Sinergija 2014

HIVE

Apache Hive is a data warehousing software that facilitates querying and managing large datasets residing in distributed storage.

Hive provides:
- Tools for ETL processes
- A mechanism for imposing a structure on a variety of data formats
- Access to files stored in HDFS or other storage systems
- Query execution via MapReduce
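
Imposing a structure on raw files means the schema is applied when the data is read, not when it is written. The sketch below shows this idea in plain Python. It is not Hive code, and the column names, sample rows and the choice to skip malformed lines are all assumptions made for illustration.

```python
import csv
from io import StringIO

# Hypothetical schema: column names and converters applied at read time.
SCHEMA = [("device_id", int), ("ts", str), ("temperature", float)]


def read_with_schema(raw_text, schema=SCHEMA):
    """Yield one dict per well-formed line; the raw text itself stays untyped."""
    for row in csv.reader(StringIO(raw_text)):
        if len(row) != len(schema):
            continue  # simplification: skip lines with the wrong column count
        try:
            yield {name: cast(value) for (name, cast), value in zip(schema, row)}
        except ValueError:
            continue  # simplification: skip rows with unconvertible fields


# Made-up sample data standing in for a raw sensor file.
raw = (
    "1037,2014-10-28T10:00:00,21.5\n"
    "broken line\n"
    "1038,2014-10-28T10:00:05,22.0\n"
)
records = list(read_with_schema(raw))
print(records)
```

The same raw file could be read again later with a different schema, which is what makes this approach convenient for data that arrives without metadata.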

Page 25: Hadoop and IoT Sinergija 2014

HIVE ARCHITECTURE

Data Model:
- Tables
- Partitions
- Buckets
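
How the three levels of the data model relate to stored files can be sketched as follows: a table is a directory, each partition is a subdirectory named `<column>=<value>`, and rows within a partition are assigned to a bucket by hashing a column modulo the number of buckets. The table path, column names and the integer hash used here are illustrative assumptions. Hive's actual hash function depends on the column type.

```python
def partition_dir(table_dir, **partition_values):
    """Return the directory for one partition: <table_dir>/<col>=<value>/..."""
    parts = [f"{col}={val}" for col, val in partition_values.items()]
    return "/".join([table_dir, *parts])


def bucket_for(key, num_buckets):
    """Return the bucket index for an integer key: hash(key) mod num_buckets.

    Assumption: for an integer column the hash is the value itself.
    """
    return key % num_buckets


# Hypothetical table of sensor readings, partitioned by date and
# bucketed by device id into 4 buckets.
path = partition_dir("/warehouse/sensor_readings", dt="2014-10-28")
bucket = bucket_for(1037, 4)

print(path)
print(bucket)
```

Partitioning lets a query skip whole directories when it filters on the partition column, while bucketing spreads the rows of one partition over a fixed number of files.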

SERDEs

Datatypes:
- Common primitive data types (int, boolean, float, double, string, char, date, timestamp, …)
- Complex data types (structs, maps, arrays)

UI

Driver

Compiler

Metastore

Execution engine

Page 26: Hadoop and IoT Sinergija 2014

HIVE.NOW

Hive defines a simple SQL-like query language, called HQL, that enables users familiar with SQL to query the data.

Scalable and extensible.

Most commonly used for:
- Log analysis
- Statistical analysis
- Document indexing

Page 27: Hadoop and IoT Sinergija 2014

HIVE SCRIPT EXAMPLE

Page 28: Hadoop and IoT Sinergija 2014

STINGER INITIATIVE

Stinger is the initiative to improve query execution time and increase SQL functionality for Apache Hive. Microsoft and Hortonworks worked actively in the Apache community towards completing Stinger.

Announced in February 2013

44 companies, 145 developers, 392,000 lines of Java code

Hive 0.13
- Speed: Hive on Tez, vectorized query engine & cost-based optimizer
- Scale: dynamic partition loads and smaller hash tables
- SQL: CHAR & DECIMAL datatypes, subqueries for IN / NOT IN

Improved Hive performance up to 100x.

Page 29: Hadoop and IoT Sinergija 2014

STINGER.NEXT

Stinger.next is a continuation of the Stinger initiative to further improve speed, scale and SQL in Hive in the open Apache Hive community.

Main goals:
- transactions with ACID semantics
- sub-second queries
- SQL:2011 Analytics
- usability improvements

To be delivered in the next 18 months.

Page 30: Hadoop and IoT Sinergija 2014

STINGER.NEXT

*Photo taken from the official Hortonworks website (www.hortonworks.com)

Page 31: Hadoop and IoT Sinergija 2014

HIVE ON SPARK

Apache Spark is a fast and general engine for large-scale data processing.

Spark powers a stack of high-level tools including Spark SQL, MLlib for machine learning, GraphX, and Spark Streaming.

Hive-Spark Machine Learning Integration will allow Hive users to run machine learning models via Hive.
