IBM BigInsights: Smart Analytics for Big Data

© 2016 IBM Corporation

IBM’s BigInsights: Smart Analytics for Big Data

Created by C. M. Saracco, IBM Silicon Valley Lab January 2016

© 2016 IBM Corporation 2

IBM Disclaimer Information regarding potential future products is intended to outline our general product direction and it should not be relied on in making a purchasing decision. The information mentioned regarding potential future products is not a commitment, promise, or legal obligation to deliver any material, code or functionality. Information about potential future products may not be incorporated into any contract. The development, release, and timing of any future features or functionality described for our products remains at our sole discretion.


Agenda

§ The Big Data challenge

§  IBM’s approach -  Portfolio overview -  BigInsights

•  Open source core platform with Apache Hadoop •  IBM technologies for data analysts, data scientists, and administrators •  How BigInsights fits within a broader IT infrastructure

§ How IBM can help you get off to a quick start


The Big Data Challenge


Business leaders frequently make decisions based on information they don’t trust, or don’t have 1 in 3

83% of CIOs cited “Business intelligence and analytics” as part of their visionary plans to enhance competitiveness

Business leaders say they don’t have access to the information they need to do their jobs 1 in 2

of CEOs need to do a better job capturing and understanding information rapidly in order to make swift business decisions

60%

… and organizations need deeper insights

Information is at the center of a new wave of opportunity…

2.5 million items per minute

300,000 tweets per minute

200 million emails per minute 220,000 photos

per minute

5 TB per flight

> 1 PB per day gas turbines


Extract insight from a high volume, variety and velocity of data in a timely and cost-effective manner

Big Data presents big opportunities

Manage and benefit from diverse data types and data structures Analyze streaming data and large volumes of persistent data Scale from terabytes to zettabytes

Variety: Velocity: Volume:


What we hear from customers . . . .

§ Lots of potentially valuable data is dormant or discarded due to size/performance considerations

§ Large volume of unstructured or semi-structured data is not worth integrating fully (e.g. Tweets, logs, . . .)

§ Not clear what should be analyzed (exploratory, iterative)

§  Information distributed across multiple

systems and/or Internet § Some data has a short useful lifespan

§ Volumes can be extremely high

§ Analysis needed in the context of existing information (not stand alone)


Merging the traditional and Big Data approaches

IT Structures the data to answer that question

IT Delivers a platform to enable creative discovery

Business Explores what questions could be asked

Business Users Determine what question to ask

Monthly sales reports Profitability analysis Customer surveys

Brand sentiment Product strategy Maximum asset utilization

Big Data Approach Iterative & Exploratory

Traditional Approach Structured & Repeatable


Big Data scenarios span many industries

Identify criminals and threats from disparate video, audio, and data feeds

Make risk decisions based on real-time transactional data

Predict weather patterns to plan optimal wind turbine usage, and optimize capital expenditure on asset placement

Detect life-threatening conditions at hospitals in time to intervene

Multi-channel customer sentiment and experience a analysis


Information Ingestion and Operational Information

Landing and Archive Zone

Real-time Analytics

Zone

Enterprise Warehouse

and Mart Zone

Information Governance, Security and Business Continuity

Analytic Appliances

Big Data Platform Capabilities

Streaming Data

Text Data

Applications Data

Time Series

Geo Spatial

Relational

•  Information Ingest •  Real Time Analytics • Warehouse & Data Marts •  Analytic Appliances

Social Network

Video & Image

All Data Sources

Advanced Analytics /

New Insights New / Enhanced

Applications

Automated Process

Case Management

Analytic Applications

Cognitive Learn Dynamically?

Prescriptive Best Outcomes?

Predictive What Could Happen?

Descriptive What Has Happened?

Exploration and Discovery

What Do You Have?

Watson

Cloud Services

ISV Solutions

Alerts

IBM Big Data and analytics sample architecture


Big Data in practice


Big Data adoption (Hadoop emphasis)


IBM’s approach


IBM analytics platform strategy for Big Data

•  Integrate and manage the full variety, velocity and volume of Big Data

•  Apply advanced analytics

•  Visualize all available data for ad-hoc analysis

•  Support workload optimization and scheduling

•  Provide for security and governance

•  Integrate with enterprise software

Discovery& Exploration

Prescriptive Analytics

Predictive Analytics

Content Analytics

Business Intelligence

DataMgmt

Hadoop & NoSQL

Content Mgmt

DataWarehouse

Information Integration & Governance

IBM ANALYTICS PLATFORM Built on Spark. Hybrid. Trusted.

Spark Analytics Operating SystemMachine Learning On premises On cloud

Data at rest & In-motion. Inside & outside the firewall. Structured & unstructured.


IBM BigInsights for Apache Hadoop

Discovery& Exploration

Prescriptive Analytics


Content Analytics

Business Intelligence

DataMgmt

Hadoop & NoSQL

Content Mgmt

DataWarehouse

Information Integration & Governance

IBM ANALYTICS PLATFORM Built on Spark. Hybrid. Trusted.

Spark Analytics Operating SystemMachine Learning On premises On cloud

Data at rest & In-motion. Inside & outside the firewall. Structured & unstructured.

§  Analytical platform for persistent Big Data

–  100% open source core with IBM add-ons for analysts, data scientists, and admins

–  On premise or cloud §  Distinguishing

characteristics –  Built-in analytics . . . .

Enhances business knowledge

–  Enterprise software integration . . . . Complements and extends existing capabilities

–  Production-ready . . . . Speeds time-to-value

§  IBM advantage –  Combination of

software, hardware, services and research


Text Analytics

POSIX Distributed Filesystem

Multi-workload, multi-tenant scheduling

IBM BigInsights Enterprise Management

Machine Learning on Big R

Big R (R support)

IBM Open Platform with Apache Hadoop (HDFS,YARN,MapReduce,Ambari,Flume,HBase,Hive,Ka;a,Knox,Oozie,Pig,

Slider,Solr,Spark,Sqoop,Zookeeper)

IBM BigInsights Data Scientist

IBM BigInsights Analyst

Big SQL

BigSheets

Industry standard SQL (Big SQL)

Spreadsheet-style tool (BigSheets)

Overview of BigInsights

Free Quick Start (non production): •  IBM Open Platform •  BigInsights Analyst, Data Scientist features •  Community support

. . .


Warehousing Zone

Enterprise Warehouse

Data Marts

Ingestion and Real-time Analytic Zone Streams

Connectors

BI & Reporting


Analytics and Reporting Zone

Visualization & Discovery

Landing and Analytics Sandbox Zone

Hive/HBaseColStores

Documents in variety of formats

MapReduce

Hadoop

Metadata and Governance Zone

ETL, MDM, Data Governance

Hadoop and the enterprise


BigInsights ISV Partner Ecosystem

lHelium SW


A Closer Look at IBM BigInsights . . . .


Text Analytics





Big R (R support)





Big SQL

BigSheets





. . .



About the IBM Open Platform for Apache Hadoop

§ Flexible platform for processing large volumes of data -  Includes Apache Hadoop and many popular open source projects in the Hadoop

ecosystem -  Supports wide variety of data -  Supports variety of popular APIs

§ Enables applications to work with thousands of nodes and petabytes of data in a highly parallel, cost effective manner -  CPU + disks = “node” -  Nodes can be combined into clusters -  New nodes can be added as needed without changing

•  Data formats •  How data is loaded •  How jobs are written


About Spark § Part of the IBM Open Platform for Apache Hadoop § Fast, general purpose, cluster computing system for large-scale

data processing § Speed (100x faster than MapReduce) -  In-memory processing -  Distributed computing

§ Flexibility -  Wide range of workloads -  Built-in libraries -  Scala, Java, Python APIs -  Works with Hadoop or stand alone

Logistic regression in Hadoop and Spark

from http://spark.apache.org


Open source currency §  Timely updates as new open source versions released §  Install only those components you want / need §  Component levels as of December 2015 (Apache licenses):

Ambari 2.1 Ambari Metrics 0.1.0 Hadoop 2.7.1 Flume 1.5.2 HDFS 2.7.1 HBase 1.1.1 Hive 1.2.1 Kafka 0.8.2.1 Knox 0.6.0 Oozie 4.2.0 Pig 0.15.0 Slider 0.80.0 Solr 5.1.0 Spark 1.5.1 Sqoop 1.4.6 YARN + MapReduce 2.7.1 ZooKeeper 3.4.6


Initiative for an open big data platform (ODPi)


Understanding ODPi . . . . Why is IBM involved? §  Strong history of leadership in open source & standards §  Supports our commitment to open source currency in all

future releases §  Accelerates our innovation within Hadoop &

surrounding applications

ODPi and Apache Software Foundation (ASF) §  ODPi supports the ASF mission §  ASF provides a governance model around individual

projects without looking at ecosystem §  ODPi aims to provide a vendor-led consistent

packaging model for core Apache components as an ecosystem

All Standard Apache Open Source Components

HDFS

YARN

MapReduce

Ambari HBase

Spark

Flume

Hive Pig

Sqoop

HCatalog

Solr/Lucene

ODPi initial spec


Text Analytics





Big R (R support)





Big SQL

BigSheets





. . .



Spreadsheet-style analysis (BigSheets) §  Web-based analysis and

visualization

§  Spreadsheet-like interface

-  Explore, manipulate data

without writing code -  Invoke pre-built functions

-  Generate charts

-  Export results of analysis

-  Create custom plug-ins

-  . . .


SQL for Hadoop (Big SQL)

SQL-based Application

Big SQL Engine

Data Storage

IBM data server client

SQL MPP Run-time

DFS

28

§  Comprehensive, standard SQL –  SELECT: joins, unions, aggregates, subqueries . . . –  GRANT/REVOKE, INSERT … INTO –  Procedural logic in SQL –  Stored procs, user-defined functions –  IBM data server JDBC and ODBC drivers

§  Optimization and performance

–  IBM MPP engine (C++) replaces Java MapReduce layer –  Continuous running daemons (no start up latency) –  Message passing allow data to flow between nodes

without persisting intermediate results –  In-memory operations with ability to spill to disk (useful

for aggregations, sorts that exceed available RAM) –  Cost-based query optimization with 140+ rewrite rules

§  Various storage formats supported –  Data persisted in DFS, Hive, HBase –  No IBM proprietary format required

§  Integration with RDBMSs via LOAD, query federation

BigInsights


About Big SQL performance: Hadoop-DS benchmark

§  TPC (Transaction Processing Performance Council) -  Formed August 1988 -  Widely recognized as most credible, vendor-independent SQL benchmarks -  TPC-H and TPC-DS are the most relevant to SQL over Hadoop

•  R/W nature of workload not suitable for HDFS

§  Hadoop-DS benchmark: BigInsights, Hive, Cloudera -  Run by IBM & reviewed by TPC certified auditor -  Based on TPC-DS. Key deviations

•  No data maintenance or persistence phases (not supported across all vendors) -  Common set of queries across all solutions

•  Subset that all vendors can successfully execute at scale factor •  Queries are not cherry picked

-  Most complete TPC-DS like benchmark executed so far -  Analogous to porting a relational workload to SQL on Hadoop

Visit Hadoop Dev (https://developer.ibm.com/hadoop) and search on “hadoop-ds”


IBM Big SQL runs 100% of the queries

Key points §  With Impala and Hive, many

queries needed to be re-written, some significantly

§  Owing to various restrictions, some queries could not be re-written or failed at run-time

§  Re-writing queries in a benchmark scenario where results are known is one thing – doing this against real databases in production is another

Other environments require significant effort at scale

Results for 10TB scale shown here


Hadoop-DS benchmark – single user @ 10TB

Big SQL 3.6x faster than Impala; 5.4x faster than Hive 0.13 for single query stream using 46 common queries

Based on IBM internal tests comparing BigInsights Big SQL, Cloudera Impala and Hortonworks Hive (current versions available as of 9/01/2014) running on identical hardware. The test workload was based on the latest revision of the TPC-DS benchmark specification at 10TB data size. Successful executions measure the ability to execute queries a) directly from the specification without modification, b) after simple modifications, c) after extensive query rewrites. All minor modifications are either permitted by the TPC-DS benchmark specification or are of a similar nature. All queries were reviewed and attested by a TPC certified auditor. Development effort measured time required by a skilled SQL developer familiar with each system to modify queries so they will execute correctly. Performance test measured scaled query throughput per hour of 4 concurrent users executing a common subset of 46 queries across all 3 systems at 10TB data size. Results may not be typical and will vary based on actual workload, configuration, applications, queries and other variables in a production environment. Cloudera, the Cloudera logo, Cloudera Impala are trademarks of Cloudera. Hortonworks, the Hortonworks logo and other Hortonworks trademarks are trademarks of Hortonworks Inc. in the United States and other countries.


Hadoop-DS benchmark – multi-user @ 10TB

Big SQL 2.1x faster than Impala; 8.5x faster than Hive 0.13 for 4 query streams using 46 common queries

Based on IBM internal tests comparing BigInsights Big SQL, Cloudera Impala and Hortonworks Hive (current versions available as of 9/01/2014) running on identical hardware. The test workload was based on the latest revision of the TPC-DS benchmark specification at 10TB data size. Successful executions measure the ability to execute queries a) directly from the specification without modification, b) after simple modifications, c) after extensive query rewrites. All minor modifications are either permitted by the TPC-DS benchmark specification or are of a similar nature. All queries were reviewed and attested by a TPC certified auditor. Development effort measured time required by a skilled SQL developer familiar with each system to modify queries so they will execute correctly. Performance test measured scaled query throughput per hour of 4 concurrent users executing a common subset of 46 queries across all 3 systems at 10TB data size. Results may not be typical and will vary based on actual workload, configuration, applications, queries and other variables in a production environment. Cloudera, the Cloudera logo, Cloudera Impala are trademarks of Cloudera. Hortonworks, the Hortonworks logo and other Hortonworks trademarks are trademarks of Hortonworks Inc. in the United States and other countries.


§ Define business terms

§  Manage data lineage

§  Create information policies §  Search and browse assets from across data stores

§ Enhance existing meta data with descriptions and associations

§ Provide a solid foundation for information integration and governance projects

Limited use license: InfoSphere Governance Catalog

Demo: https://www.youtube.com/watch?v=w1xFdhHyEJg


§ Eclipse-based tool

§  SQL development, testing §  Database exploration

§  Supports Big SQL, select RDBMSs

License: Data Studio


Text Analytics





Big R (R support)

IBM Open Platform with Apache Hadoop* (HDFS,YARN,MapReduce,Ambari,Flume,HBase,Hive,Ka;a,Knox,Oozie,Pig,




Big SQL

BigSheets





…



What is Big R?

R Clients

Scalable Statistics Engine

Data Sources

Embedded R Execution

R Packages

R Packages

1

2

3

1.  Explore, visualize, transform, and model big data using familiar R syntax and paradigm (no MapReduce code)

2.  Scale out R •  Partitioning of large data (“divide”) •  Parallel cluster execution of

pushed down R code (“conquer”) •  All of this from within the R

environment (Jaql, Map/Reduce are hidden from you

•  Almost any R package can run in this environment

3.  Scalable machine learning •  A scalable statistics engine that

provides canned algorithms, and an ability to author new ones, all via R

“End-to-end integration of R-Project with BigInsights”

Pull data (summaries) to

R client

Or, push R functions

right on the data


Text analytics • Distills structured info from unstructured text - Sentiment analysis - Consumer behavior - Illegal or suspicious activities - …

• Parses text and detects meaning with annotators

• Understands the context in which the text is analyzed

• Features pre-built extractors for names, addresses, phone numbers, etc.

I had an iphone, but it's dead @JoaoVianaa.

(I've no idea where it's) !Want a Galaxy now !!!

@rakonturmiami im moving to miami in 3 months.

i look foward to the new lifestyle

I'm at Mickey's Irish Pub Downtown (206 3rd St, Court Ave, Des Moines) w/ 2 others http://4sq.com/gbsaYR


Extracting information from text

Entity Analytics

Preventative Maintenance

Customer Segmentation

Sentiment Affinity

…

Analyze

Text Single column or

document

•  sentencesegmentaJon

•  tokenizaJon•  part-of-speech

tagging

•  languagedetecJon

Recognize

EnJtyRecogniJon

MachineDataPrimiJves

SenJment

…

Describe via extractors

Information Extraction (IE)

Tagged syntax

Classified words /

attributes

Classified words /

attributes Text

preparation

•  extracJonoperaJons

via lexical analysis via deep linguistic analysis

•  spanoperaJons•  joinoperaJons•  consolidaJons•  ……

•  verb-centricabstracJon•  noun-centricabstracJon•  shallowparsing•  …


Web-based tool to define rules to extract data and derive information from unstructured text Graphical interface to describe structure of various textual formats – from log file data to natural language

Text analytics tooling


Pre-built text extractors

§ The extractor library contains a rich set of pre-built extractors -  Finance actions -  Named Entities -  Generic -  Machine Data -  Sentiment Analysis

§ You can control output properties -  Output columns and names -  Row filters

§ Some pre-built extractors can be customized -  Add / remove dictionary terms


Text Analytics





Big R (R support)





Big SQL

BigSheets



Overview of BigInsights Free Quick Start (non production): •  IBM Open Platform •  BigInsights Analyst, Data Scientist features •  Community support

. . .



IBM GPFS: HDFS alternative § Drop-in replacement for HDFS § No need for dedicated analytics

infrastructure -  Cost savings -  No need to move data in and out of an

analytics dedicated silo •  Software defined infrastructure for multi-

tenancy §  Fast parallel file system §  POSIX & HDFS semantics §  Hadoop & non-Hadoop apps §  Built-in high availability

Compute Cluster

GPFS-FPO

HDFS


Platform Symphony

Resource Orchestration

Workload Manager

C C C C C C

C C C C C C

D

D

D

D

D

D

D

D

D

D

D

D

C C C C C C

A A A A

A A A A

A A A A

A A A A

B

B

B

B

B

B

B

B

B

B

B

BB B B B B B

Third Party Trading & Risk Platforms In-house Applications BigInsights

A B C D

Multiple users, applications and lines of business on a shared, heterogeneous, multi-tenant grid


Text Analytics





Big R (R support)





Big SQL

BigSheets





. . .



Text Analytics





Big R (R support)





Big SQL

BigSheets





. . .



IBMBigInsightsBigInsightsQuickStartEdiJon

IBMOpenPlaUormwith

ApacheHadoop

EliteSupportforIBMOpenPlaUormwith

ApacheHadoop

BigInsightsAnalystModule

BigInsightsData

ScienJstModule

BigInsightsEnterprise

ManagementModule

BigInsightsforApacheHadoop

ApacheHadoopStack:HDFS,YARN,MapReduce,Ambari,Hbase,Hive,Oozie,Parquet,ParquetFormat,Pig,Snappy,Solr,Spark,Sqoop,Zookeeper,OpenJDK,Knox,Slider

✔ ✔ ✔ * * * ✔

BigSQL–100%ANSIcompliant,highperformant,secureSQLengine

✔ ✔ ✔ ✔

BigSheets–spreadsheet-likeinterfacefordiscovery&visualizaJon

✔ ✔ ✔ ✔

BigR–advancedstaJsJcal&datamining

✔ ✔ ✔

MachineLearningwithBigR–machinelearningalgorithmsapplytoHadoopdataset

✔ ✔ ✔

AdvancedTextAnaly?cs–visualtoolingtoannotateautomatedtextextracJon

✔ ✔ ✔

EnterpriseMgmt–Enhancedcluster&resourcemgmt&POSIX-compliantfilesystems(GPFS,PlaUormSymphony)

✔ ✔

GovernanceCatalog,DataStudio ✔ ✔ ✔

CognosBI,InfoSphereStreams,WatsonExplorer,DataClick

✔

IBM BigInsights for Apache Hadoop offerings

* Paid support for IBM Open Platform with Apache Hadoop required for BigInsights modules


Pricing & Licensing

Products BigInsightsQuickStart

IBMOpenPlaUormwith

ApacheHadoop

EliteSupportforIBMOpen

PlaUormwithApacheHadoop

BigInsightsAnalystModule

BigInsightsData

ScienJstModule

BigInsightsEnterprise

ManagementModule

BIgInsightsforApacheHadoop

PricingTerms Free Free Yearly

Subscription Only

Perpetual or Monthly License

Supportprovided Community Community IBM 24x7 support

UsageLicenseNon-

production, five node cap

Production Usage

PricingModel Free Free Node based pricing

Accessvia ibm.com/hadoop Passport Advantage

•  Community support via Hadoop Dev •  Over 100,000 visitors since inception •  Modeled after StackOverflow, most popular developer Q&A site on web


http://bluemix.net

§  Prototype, demo, trial in the cloud

§  Empowers developers to rapidly drive insight from all data

§  Adds Hadoop-based analytics to your application

§  Enterprise features - BigSheets, Big SQL, Text analytics, HiveQL, HttpFS

§  Delivered via IBM BlueMix

§  For Production, deployments at scale in the cloud

§  Delivers flexibility and efficiency with subscription pricing

§  Scales to meet spikes in demand without on-premise infrastructure

§  Drives enterprise-class, complex analytics on Big Data sets

§  Available via the IBM Cloud Marketplace and Bluemix

Cloud deployment options

Developer sandbox

Analytics for Hadoop

http://www.ibm.com/cloud http://www.bluemix.net

Enterprise Hadoop as a Service

Production environment


Summary and Fast Start


IBM Analytic Platform Capabilities

IBM integrates and extends Hadoop and Spark

Data Warehousing PureData for Analytics, Operational Analytics

Entity Extraction and Matching Big Match

Security and Compliance Optim, Guardium Audit and Encryption

Data Integration and Governance Information Server

Enterprise Search Watson Explorer

Real-time Analytics Streams

Predictive Modeling and Descriptive Statistics SPSS, Big R and Scalable Algorithms

Analysis, Reporting, and Exploration Watson Analytics, Cognos, BigSheets

Fast, ANSI SQL 2011, and Secure SQL Big SQL

Enterprise File System GPFS-FPO

Cluster Resource and Workload Management Platform Symphony

Large Scale Text Extraction Big Text

IBM Open Platform with Apache Hadoop


IBM investing heavily in Big Data and analytics

$24B Investment

in both organic development

and 30+ acquisitions

$100M Announced investmentin IBM Interactive Experience, creating 10 new labs worldwide

9 Analytics Solution Centers

1,000 universities

Developing curriculumand training for analytics with

$1B To bring

cognitive services and applications

to market


Industry-leading Strategy and Analytics practice - 15,000 staff

Diverse portfolio of analytical and cognitive computing capabilities

More than 100,000 people trained on analytics

20+ Big Data & Analytics solutions on Cloud Marketplace

Leading infrastructure platforms

Comprehensive enterprise-grade Hadoop

Expertise and technology set IBM apart


Spark investments: community, core, and consumption

Core Accelerating Spark

capabilities

Community Growing Spark

knowledge & expertise

Consumption Using Spark within IBM

& partner products

Spark Technology Center Big Data University

SystemML open source contribution

Spark stand-alone Hadoop distribution IBM portfolio 30+ research initiatives

3500+ IBM developers and researchers


The bottom line about IBM and Big Data

§ Big Data is a strategic initiative for IBM -  Significant investments across software, hardware and services.

§ BigInsights -  Enables firms to exploit growing variety, velocity, and volume of data -  Delivers diverse range of analytics -  Leverages and extends open source -  Provides enterprise-class features and supporting services -  Complement existing software investments and commercial offerings

§  IBM advantage -  Full solution spanning software, hardware & services -  Rapid technology advances through partnerships with IBM Research -  Global reach


Want to learn more? §  Download Quick Start offering §  Follow tutorials, videos, and more §  Links all available from HadoopDev

–  https://developer.ibm.com/hadoop/


IBM big data • IBM big data • IBM big data

IBM big data • IBM big data • IBM big data

IBM

big

dat

a

• IB

M b

ig d

ata

IBM

big data • IBM

big data

THINK