Introduction into Big Data analytics
Lecture 2 – Big data platforms

Janusz Szwabiński

Outline of today’s talk

● Available Big Data Sets

● Project suggestions

● Big data platforms

Available Big Data Sets

● Pointers to data sets

– How To Get Experience Working With Large Datasets

– Quora

– KDNuggets – Datasets for Data Mining and Data Science

– Research Pipeline

– Google Public Data Directory

– StackExchange Data Explorer

– Kaggle

Available Big Data Sets

● Generic repositories

– AWS Public Datasets

– Comprehensive Knowledge Archive Network

– Stanford Large Network Dataset Collection

– Open Flights

– ASA Flight data

– Wikipedia

Available Big Data Sets

● Geo data

– OpenStreetMap

– Natural Earth Data

– GeoNames

– Libre Map Project

– Landsat

Available Big Data Sets

● Web data

– Google Books n-gram

– Public Terabyte Dataset Project

– Common Crawl

– Freebase Data

– StackOverflow

– UCI KDD Data

Available Big Data Sets

● Government data

– European Parliament proceedings

– US government data

– UK government data

– US Patent and Trademark Office

– World Bank data

– Public health data sets

– Aid information

– UN data

– Polish Statistical Office

Suggestions for projects

● Trend prediction in fashion

● Quote search engine

● Real-time analysis of Twitter’s public stream with Storm

● Correlating price/volume of low volume stocks with social media

– search information related to future price and volume movements

– find indicators to predict abnormal price or volume changes

Suggestions for projects

● Stock signal generation using real time Twitter analysis

– develop a scoring mechanism that summarizes Twitter news

– generate a real-time signal that could be used to make trading decisions

● Music recommendation system with geospatial information

– MMTD - Million Musical Tweets Dataset

● Answer classifier based on StackOverflow data

Suggestions for projects

● How to name your new-born baby?

– prediction of trends in baby names around the world

● Impact of popular culture on baby names

● Error correction in OCR datasets

● Movie exploration/recommendation system

● Best transport choice

● Fake reviews detection

● Food identification in photos

– see e.g. https://www.yelp.com/dataset/challenge

● Oscar/Golden Globe award analysis

Suggestions for projects

● Interesting ideas for trendy writers

● Image-based geolocalization

● Animal identification in photos

● Plant identification in photos

● Currency trend analyzer

– data source: http://www.histdata.com/

Big data platforms

● one stop solution for Big Data needs

– integrated IT solution for developing, deploying and managing Big Data

– combines several software systems, tools and hardware to provide enterprises with an easy-to-use system

● important features:

– able to accommodate new tools based on the business requirement

– supports linear scale-out

– has capability for rapid deployment

– supports a variety of data formats

– provides data analysis and reporting tools

– provides real-time data analysis software

– has tools for searching through large data sets

Hadoop

● http://hadoop.apache.org/

● an open-source software framework for storing data and running applications on clusters of commodity hardware

● why is it so important?

– ability to store and process huge amounts of any kind of data, quickly

– computing power - Hadoop's distributed computing model processes big data fast

● the more computing nodes you use, the more processing power you have

– fault tolerance - data and application processing are protected against hardware failure

● if a node goes down, jobs are automatically redirected to other nodes to make sure the distributed computing does not fail

● multiple copies of all data are stored automatically.

– flexibility - unlike traditional relational databases, you don’t have to preprocess data before storing it

– low cost - the open-source framework is free and uses commodity hardware to store large quantities of data

– scalability - you can easily grow your system to handle more data simply by adding nodes with little administration effort
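The fault-tolerance point above (multiple copies of all data, so that losing a node does not lose data) can be illustrated with a toy simulation. This is only a sketch under stated assumptions: the round-robin placement, the node names and the helper functions `place_blocks` and `readable_blocks` are hypothetical and are not Hadoop's API. The replication factor of 3 matches the HDFS default, but real HDFS placement is rack-aware and more involved.

```python
def place_blocks(num_blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes (simple round-robin).

    Assumption: round-robin placement is a simplification for illustration.
    """
    placement = {}
    for block in range(num_blocks):
        placement[block] = {
            nodes[(block + i) % len(nodes)] for i in range(replication)
        }
    return placement


def readable_blocks(placement, failed_nodes):
    """Return the blocks that still have at least one replica on a live node."""
    failed = set(failed_nodes)
    return [block for block, replicas in placement.items() if replicas - failed]


# Hypothetical five-node cluster storing ten blocks, three copies each.
nodes = ["node1", "node2", "node3", "node4", "node5"]
placement = place_blocks(num_blocks=10, nodes=nodes, replication=3)

# Two nodes fail: every block still has at least one surviving replica.
after_two_failures = readable_blocks(placement, ["node2", "node4"])

# Three adjacent nodes fail: blocks stored only on those nodes are lost.
after_three_failures = readable_blocks(placement, ["node1", "node2", "node3"])

print(len(after_two_failures), "of", len(placement), "blocks readable")
print(len(after_three_failures), "of", len(placement), "blocks readable")
```

With three copies per block, any two simultaneous node failures leave all data readable in this toy model; data is lost only when every node holding a given block fails at once.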

Hadoop

Source: https://www.sas.com/en_us/insights/big-data/hadoop.html

Hadoop

● challenges:

– MapReduce programming is not a good match for all problems

● good for simple information requests and problems that can be divided into independent units

● not efficient for iterative and interactive analytic tasks

– a widely acknowledged talent gap - it can be difficult to find entry-level programmers who have sufficient Java skills to be productive with MapReduce

● distribution providers are racing to put relational (SQL) technology on top of Hadoop

● Hadoop administration seems part art and part science, requiring low-level knowledge of operating systems, hardware and Hadoop kernel settings

– data security issues

● Kerberos authentication protocol is a great step toward making Hadoop environments secure

– lacking tools for data quality and standardization

Hadoop

● important application domains:

– Digital Marketing Optimization

– Data exploration and discovery (Product and sales data for online shopping portal and stores)

– Fraud detection and prevention

– Social network and relationship analysis

– Fraud detection in banking

– Fraud detection for telecom industry

– Data retention (for retaining the long term data and for archiving purposes)

– Insurance

– Healthcare

Hadoop based commercial platforms

● Cloudera

● Amazon EMR

● Hortonworks

● MapR

● IBM Open Platform

● Microsoft HDInsight

● Intel Distribution for Apache Hadoop

● Datastax Enterprise Analytics

● Teradata’s Hadoop for Enterprise

● Pivotal HD

Cloudera

● https://www.cloudera.com/

● one of the first commercial Hadoop based platforms

● interesting (and free) download:

– QuickStarts for CDH 5.12

● https://www.cloudera.com/downloads/quickstart_vms/5-12.html

● virtualized clusters for easy installation on your desktop

● single-node cluster that makes it easy to quickly get hands-on with CDH for testing, demo, and self-learning purposes

● includes Cloudera Manager for managing the cluster

● tutorial, sample data, and scripts for getting started included

● deployed via Docker containers or VMs

Amazon EMR

● https://aws.amazon.com/emr/

● a managed Hadoop framework that makes it easy, fast, and cost-effective to process vast amounts of data across dynamically scalable Amazon EC2 instances

● can also run other popular distributed frameworks such as Apache Spark, HBase, Presto, and Flink

● interaction with data in other AWS data stores such as Amazon S3 and Amazon DynamoDB

● secure and reliable handling of a broad set of big data use cases, including log analysis, web indexing, data transformations (ETL), machine learning, financial analysis, scientific simulation, and bioinformatics

● simple and predictable pricing:

– per-second rate for every second used, with a one-minute minimum charge

– a 10-node Hadoop cluster can cost as little as $0.15 per hour
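The per-second billing rule with a one-minute minimum can be written as a short calculation. The function below is a hypothetical helper, not part of any AWS SDK. It assumes that the $0.15 hourly figure quoted above is the rate for the whole cluster and ignores any other charges.

```python
def emr_charge(seconds_used, hourly_rate_usd):
    """Estimate the charge for one cluster run.

    Billing is per second, but every run is billed for at least 60 seconds.
    Assumption: a run that never started (0 seconds) costs nothing.
    """
    if seconds_used <= 0:
        return 0.0
    billed_seconds = max(seconds_used, 60)
    return billed_seconds * hourly_rate_usd / 3600


rate = 0.15  # USD per hour, the 10-node cluster figure from the slide

short_run = emr_charge(20, rate)     # billed as a full minute
long_run = emr_charge(5400, rate)    # 1.5 hours, billed per second

print(f"20 s run:   ${short_run:.4f}")
print(f"90 min run: ${long_run:.4f}")
```

A 20-second test run is therefore billed exactly like a 60-second one, while longer runs scale linearly with the number of seconds used.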

Amazon EMR

● good to know: AWS Free Tier (12 Month Introductory Period)

– https://aws.amazon.com/free/

Hortonworks

● https://hortonworks.com/

● a leading innovator in the data industry, creating, distributing and supporting enterprise-ready open data platforms and modern data applications

● 100% open-source software without any proprietary software

● the Hortonworks Hadoop distribution is enterprise-ready, with the following features:

– centralized management and configuration of clusters

– built-in security and data governance

– centralized security administration

● Hortonworks Sandbox

– a virtual machine with Hadoop preconfigured

– a set of hands-on tutorials to get you started with Hadoop.

– an environment to help you explore related projects in the Hadoop ecosystem like Apache Pig, Apache Hive, Apache HCatalog and Apache HBase

MapR

● https://mapr.com/

● MapR provides access to a variety of data sources from a single computer cluster, including:

– big data workloads such as Apache Hadoop and Apache Spark

– a distributed file system

– a multi-model database management system

– event stream processing, combining analytics in real-time with operational applications

● its technology runs on both commodity hardware and public cloud computing services

IBM Open Platform

● https://www-03.ibm.com/software/products/en/ibm-open-platform-with-apache-Hadoop

● native support for rolling upgrades for Hadoop services

● support for long-running applications within YARN for enhanced reliability & security

● heterogeneous storage in HDFS for in-memory, SSD in addition to HDD

● Spark in-memory distributed compute engine

● Java, Python & Scala languages

● Apache Hadoop projects included: HDFS, YARN, MapReduce, Ambari, HBase, Hive, Oozie, Parquet, Parquet Format, Pig, Snappy, Solr, Spark, Sqoop, ZooKeeper, OpenJDK, Knox, Slider

● free IOP Quick Start Edition for non-production use: https://www.ibm.com/support/knowledgecenter/en/SSPT3X_4.2.0/com.ibm.swg.im.infosphere.biginsights.install.doc/doc/qse_main.html

Microsoft HDInsight

● https://azure.microsoft.com/en-in/services/hdinsight/

● a fully-managed cloud service for easy, fast, and cost-effective processing of massive amounts of data

● uses popular open-source frameworks such as Hadoop, Spark, Hive, LLAP, Kafka, Storm, R & more

● enables a broad range of scenarios such as ETL, Data Warehousing, Machine Learning, IoT and more

Datastax Enterprise Analytics

● https://www.datastax.com/

● Big Data analytics platform based on the Apache Cassandra database management system, which runs on top of an Apache Hadoop installation

– includes a proprietary solution for security management, searching data, data monitoring and visualization

● it comes with a powerful integrated analytics system

● multiple models supported: key-value, tabular, JSON/Document and graph data formats

● real-time processing possible

Teradata’s Hadoop for Enterprise

● https://www.teradata.pl/

● pre-configured hardware, software and services to accelerate time to Hadoop production

● deep integration of tools and services in the Hadoop ecosystem, specifically in the areas of data access, data movement manageability, supportability and serviceability

● extends the enterprise-ready Hadoop ecosystem with advanced professional services

Pivotal HD

● https://pivotal.io/

● an enterprise-capable, commercially supported distribution of Apache Hadoop 2.0 packages targeted to traditional Hadoop deployments

● patches assuring the interoperability of the components

● advantage of big data analytics without the overhead and complexity of a project built from scratch

● automatic parallelization of MapReduce jobs to handle data at scale, thereby eliminating the need for developers to write scalable and parallel algorithms

● Pivotal HD Single Node VM available for free

Open source platforms and tools

● MapReduce

● GridGain

● HPCC Systems

● Apache Spark

● Apache Storm

● SAMOA

MapReduce

● a programming model and an associated implementation for processing and generating big data sets with a parallel, distributed algorithm on a cluster

● a MapReduce program is composed of:

– a Map() procedure (method) that performs filtering and sorting (such as sorting students by first name into queues, one queue for each name)

– a Reduce() method that performs a summary operation (such as counting the number of students in each queue, yielding name frequencies)

● the "MapReduce System" (also called "infrastructure" or "framework") orchestrates the processing by marshalling the distributed servers, running the various tasks in parallel, managing all communications and data transfers between the various parts of the system, and providing for redundancy and fault tolerance

● the model is a specialization of the split-apply-combine strategy for data analysis

● inspired by the map and reduce functions commonly used in functional programming, although their purpose in the MapReduce framework is not the same as in their original forms
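The student example above can be sketched in a few lines of plain Python. This is a single-process illustration of the programming model only, not Hadoop code: the function names `map_phase`, `shuffle_phase` and `reduce_phase` and the sample student list are made up for this sketch, and a real framework would run the map and reduce steps in parallel on many machines.

```python
from collections import defaultdict


def map_phase(students):
    """Map: emit one (first_name, 1) pair per student record."""
    for full_name in students:
        first_name = full_name.split()[0]
        yield first_name, 1


def shuffle_phase(pairs):
    """Shuffle: group values by key, i.e. build one queue per first name."""
    queues = defaultdict(list)
    for key, value in pairs:
        queues[key].append(value)
    return queues


def reduce_phase(queues):
    """Reduce: summarise each queue, here by counting its entries."""
    return {name: sum(values) for name, values in queues.items()}


# Hypothetical input data.
students = [
    "Anna Nowak", "Jan Kowalski", "Anna Lis",
    "Piotr Zielinski", "Jan Mazur", "Anna Wolska",
]

name_counts = reduce_phase(shuffle_phase(map_phase(students)))
print(name_counts)  # {'Anna': 3, 'Jan': 2, 'Piotr': 1}
```

Because each student record is mapped independently and each queue is reduced independently, both phases can be split across nodes; the grouping step in between is what the framework's shuffle takes care of.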

MapReduce

● the key contributions of the MapReduce framework are not the actual map and reduce functions (which, for example, resemble the 1995 MPI reduce and scatter operations), but the scalability and fault-tolerance achieved for a variety of applications by optimizing the execution engine

● a single-threaded implementation of MapReduce will usually not be faster than a traditional (non-MapReduce) implementation

● any gains are usually only seen with multi-threaded implementations

● optimizing the communication cost is essential to a good MapReduce algorithm

● MapReduce libraries have been written in many programming languages, with different levels of optimization

● a popular open-source implementation that has support for distributed shuffles is part of Apache Hadoop

● the name MapReduce originally referred to the proprietary Google technology, but has since been genericized
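One common way to reduce the communication cost mentioned above is to pre-aggregate map output locally before it is sent over the network, a step usually called a combiner. The sketch below is a plain-Python word count over two made-up input splits. The function names are hypothetical, and counting list entries is only a stand-in for the number of records a real cluster would shuffle between nodes.

```python
from collections import Counter


def map_words(line):
    """Map: emit a (word, 1) pair for every word in the input split."""
    return [(word, 1) for word in line.split()]


def combine(pairs):
    """Combiner: sum the counts per word on the mapper's side."""
    local = Counter()
    for word, count in pairs:
        local[word] += count
    return list(local.items())


def reduce_counts(pairs):
    """Reduce: sum all partial counts per word."""
    total = Counter()
    for word, count in pairs:
        total[word] += count
    return dict(total)


# Two hypothetical input splits, each handled by its own mapper.
splits = ["big data big data big", "data platforms big data"]

# Records that would cross the network without and with a combiner.
shuffled_plain = [pair for s in splits for pair in map_words(s)]
shuffled_combined = [pair for s in splits for pair in combine(map_words(s))]

print(len(shuffled_plain), "records shuffled without a combiner")
print(len(shuffled_combined), "records shuffled with a combiner")
print(reduce_counts(shuffled_combined))
```

Both variants yield the same final counts; only the amount of intermediate data differs. This works here because addition is associative and commutative, so partial sums can be merged in any order.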

GridGain

● https://www.gridgain.com/

● in-memory computing platform built on Apache Ignite

– can function as an in-memory data grid

– or it can be deployed as an in-memory transactional SQL database

● combines the speed of in-memory computing with the durability of disk-based storage

● used in financial services, fintech, software, healthcare, telecom, ecommerce, online services, retail, and more

● free 30-day trial available

HPCC Systems

● https://hpccsystems.com/

● an alternative to Hadoop and other Big Data platforms

● an open source, data-intensive computing system platform developed by LexisNexis Risk Solutions

● incorporates a software architecture implemented on commodity computing clusters to provide high-performance, data-parallel processing for applications utilizing big data

● includes system configurations to support both parallel batch data processing (Thor) and high-performance online query applications using indexed data files (Roxie)

● includes a data-centric declarative programming language for parallel data processing called ECL

● virtual image with a pre-configured HPCC available for download

Apache Spark

● https://spark.apache.org/

● an open-source cluster-computing framework

– an interface for programming entire clusters with implicit data parallelism and fault tolerance

● runs on Hadoop, Mesos, standalone, or in the cloud

● it can access diverse data sources including HDFS, Cassandra, HBase, and S3

● originally developed at the University of California, Berkeley's AMPLab

● the codebase was later donated to the Apache Software Foundation, which has maintained it since

● application programming interfaces for Java, Python, Scala, and R

● DataFrames with support for structured and semi-structured data

● Spark SQL - a component that provides a domain-specific language (DSL) to manipulate DataFrames in Scala, Java, or Python

Apache Spark

● “Run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk”

Apache Storm

● http://storm.apache.org/

● a free and open source distributed realtime computation system

● processing of unbounded streams of data

● it is doing for realtime processing what Hadoop did for batch processing

● simple, can be used with any programming language

● many use cases: realtime analytics, online machine learning, continuous computation, distributed RPC, ETL

● fast: a benchmark clocked it at over a million tuples processed per second per node

● scalable and fault-tolerant

● easy to set up and operate

● integrates with the queueing and database technologies you already use
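Processing an unbounded stream means emitting updated results as each tuple arrives, rather than waiting for the input to end. The sketch below imitates that idea with Python generators. It does not use the Storm API; the source and the two processing steps only loosely correspond to the roles Storm calls spouts and bolts, and all names and sample sentences are assumptions made for this illustration.

```python
import itertools
from collections import Counter


def sentence_source():
    """Stand-in for a stream source: yields sentences forever."""
    sentences = ["storm processes streams", "streams never end"]
    yield from itertools.cycle(sentences)


def split_words(stream):
    """First processing step: split each sentence into words."""
    for sentence in stream:
        for word in sentence.split():
            yield word


def running_counts(stream):
    """Second step: emit the updated count after every incoming word."""
    counts = Counter()
    for word in stream:
        counts[word] += 1
        yield word, counts[word]


# The pipeline is lazy: nothing runs until results are requested.
pipeline = running_counts(split_words(sentence_source()))

# The stream never ends, so only a finite prefix is consumed here.
first_updates = list(itertools.islice(pipeline, 7))
for word, count in first_updates:
    print(word, count)
```

In a real deployment each step would run as many parallel tasks across the cluster, with the framework routing tuples between them and replaying them on failure.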

SAMOA

● https://samoa.incubator.apache.org

● Scalable Advanced Massive Online Analysis

● an open source platform for mining big data streams

● a collection of distributed streaming algorithms for the most common data mining and machine learning tasks (classification, clustering, regression)

● programming abstractions to develop new algorithms

● a pluggable architecture that allows it to run on different distributed stream processing engines (Storm, S4, Samza)

● written in Java