Big Data and Big Insights: Smart Analytics in Internet-scale
Jukka Ruponen, IT Architect
© 2011 IBM Corporation



2 October 13, 2011 © 2011 IBM Corporation

Big Data, Big Insights

Information Growing at a Phenomenal Rate . . .

• 2009: 800 exabytes (800 000 petabytes)
• 2020: 35 zettabytes (35 000 000 petabytes), 44x as much data and content as in 2009
• 80% of the world's data is unstructured
• 50% of data transferred in mobile networks is audio/video
• 1 in 3 business leaders frequently make decisions based on information they don't trust, or don't have
• 1 in 2 business leaders say they don't have access to the information they need to do their jobs
• 83% of CIOs cited "Business intelligence and analytics" as part of their visionary plans to enhance competitiveness
• 60% of CEOs need to do a better job capturing and understanding information rapidly in order to make swift business decisions

Sources:
• IBM: CxO Studies 2009-2010
• IDC: The Digital Universe Decade – Are You Ready? May 2010
• Cisco Visual Networking Index: Global Mobile Data Traffic Forecast Update, 2010–2015

Volume, Variety, Velocity
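
The 44x growth multiple quoted on this slide can be checked against the two volume estimates. The following is a minimal sketch, assuming decimal units (1 zettabyte = 1 000 exabytes = 1 000 000 petabytes), which matches the petabyte figures on the slide; the variable names are illustrative only.

```python
# Decimal (SI) unit factors; this convention is an assumption that
# agrees with the petabyte figures quoted on the slide.
PETABYTES_PER_EXABYTE = 1_000
PETABYTES_PER_ZETTABYTE = 1_000_000

volume_2009_pb = 800 * PETABYTES_PER_EXABYTE    # 800 exabytes
volume_2020_pb = 35 * PETABYTES_PER_ZETTABYTE   # 35 zettabytes

growth_factor = volume_2020_pb / volume_2009_pb

print(f"2009: {volume_2009_pb:,} PB")
print(f"2020: {volume_2020_pb:,} PB")
print(f"Growth: {growth_factor:.2f}x")  # 43.75, which the slide rounds to 44x
```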


Streams and Oceans of information

Information streams
High speed information, flowing in real-time, often transient.
• Sensors, instruments...
• Real-time logs, activity monitors...
• Streaming content like video/audio...
• High speed transactions like tickers, trades or traffic systems...

Information oceans
Information is stored outside conventional systems. Data may originate from different internal systems or from the Web.
• Collections of what has already streamed...
• Social media, click streams, stored logs, emails, etc.
• Unstructured or mixed schema documents, like claims, forms, desktop applications...
• Structured data from disparate systems...


Big Data presents a Big Challenge

How do we extract Insight and Value from a High Volume, Variety and Velocity of data, in a Timely and Cost-effective manner?

• Variety: Manage and benefit from diverse data types and data structures
• Velocity: Analyze increasingly accelerating streams of data, new and changed
• Volume: Scale from terabytes to zettabytes


Bringing together a Large Volume, Variety and Velocity of Data to Find New Insights?

Multi-channel customer sentiment and experience analysis

Detect life-threatening conditions at hospitals in time to intervene

Predict weather patterns to plan optimal wind turbine usage, and optimize capital expenditure on asset placement

Make risk decisions based on real-time transactional data

Identify criminals and threats from disparate video, audio, and data feeds


Growing demand for Big Data Analytics

Public safety

Finance Smarter Healthcare Multi-channel sales

. . .

Telecom

Manufacturing

Traffic Control

Trading Analytics


Model the weather to optimize placement of turbines, maximizing power generation and longevity

Build models to cover forecasting and real-time operation of power generation units

Incorporate 6 PB of structured and semi-structured information flows

Optimize capital investments based on 6 petabytes of information


Identify unauthorized content streaming in digital media (piracy issues)

Quantify annual revenue loss and analyse trends

Incorporate high variety of unstructured and semi-structured data (future plans for video content analysis)

Protect your intellectual property based on 1 year of social media data around the Internet


Businesses are well prepared for Structured Data

[Diagram: Enterprise Data Architecture]
• Business Processes and Applications: ERP, CRM, ECM, SCM, other RDBMS
• Information Integration: ETL, Data Quality, Common Meta Data, Industry Models, Change Data Capture, Information Services Director, MDM, Master Data, Reference Data, Identity Insight, Global Name Recognition
• Data Warehousing: Data Warehouse, OLAP Data Cubes
• Business Analytics and Optimization: Business Intelligence, Financial Performance, Strategy Management, Sales Management, Workforce Management, Predictive Analytics, Governance, Risk and Compliance, Financial Risk Management, Marketing and Campaign Management, Online Web Analytics, Advanced Analytics

Data types: financial data, customer data, product data, process data, transactional data...
Characteristics: relational, structured, numeric, cleaned, normalized, reconciled


Businesses are Not well prepared for Big Data

[Diagram: the Enterprise Data Architecture (business processes and applications, information integration, data warehousing, business analytics and optimization), with Data Warehousing marked "Terabytes" and question marks where Big Data would have to fit in]

• Data types: text, rich text, audio, video, click streams, log files, sensor data, raw data, web feeds, data streams...
• Volume, Variety, Velocity
• Unstructured, non-relational
• Petabytes... Exabytes...


The Traditional Approach: Business Requirements Drive Solution Design

• Business defines requirements and the questions they need answers for
• IT designs a solution with a set structure and functionality
• Business executes queries to answer questions over and over
• New requirements require redesign and rebuild

Well-Suited To:
• High value, structured data
• Repeated operations and processes (e.g. transactions, reports, BI, etc.)
• Relatively stable sources
• Well-understood requirements

Stretched By:
• Highly variable data and content
• Iterative, exploratory analysis (e.g. scientific research, behavioral modeling, etc.)
• Volatile sources
• Ill-defined questions and changing requirements


The Big Data Approach: Information Sources Drive Creative Discovery

• Business and IT identify the information sources available
• IT delivers a platform that enables creative exploration of all available data and content
• Business determines what questions they could ask by exploring the data and relationships
• New insights drive integration to traditional technology


How to merge the Traditional and Big Data approach?

Traditional Approach: Structured & Repeatable Analysis
• Business users determine what question to ask
• IT structures the data to answer that question
• Examples: monthly sales reports, profitability analysis, customer surveys

Big Data Approach: Iterative & Exploratory Analysis
• IT delivers a platform to enable creative discovery
• Business users explore what questions could be asked
• Examples: Brand/product sentiment? Product strategy? Maximum asset utilization?

The question is NOT whether we need either Left or Right. The question IS When and How we can Balance between Both!


Big Data Shouldn’t Be a Silo
Must be an integrated part of your enterprise information architecture

[Diagram labels: “Big Data Platform”, Data Warehouse Platform, Enterprise Integration, Business Systems, New Sources]


Potential solution: A Big Data Platform
Bring together any data source, at any velocity and variety to generate insight

Big Data Platform
• Variety: Analyzing a variety of data at enormous volumes
• Velocity: Insights on streaming data
• Volume: Large volume structured data analysis


Big Data related Open Source technologies and concepts

• Apache Hadoop (including the Hadoop Distributed File System (HDFS), MapReduce framework, and common utilities), a software framework for data-intensive applications that exploit distributed computing environments

• Pig, a high-level programming language and runtime environment for Hadoop

• Jaql, a high-level query language based on JavaScript Object Notation (JSON), which also supports SQL.

• Hive, a data warehouse infrastructure designed to support batch queries and analysis of files managed by Hadoop

• HBase, a column-oriented data storage environment designed to support large, sparsely populated tables in Hadoop

• Flume, a facility for collecting and loading data into Hadoop

• Lucene, text search and indexing technology

• Avro, data serialization technology

• ZooKeeper, a coordination service for distributed applications

• Oozie, workflow/job orchestration technology

• UIMA, Unstructured Information Management Architecture, for creating, integrating and deploying unstructured information management solutions from a combination of semantic analysis and search components


MapReduce

• MapReduce is a programming model for running parallel, data-intensive functions against the data in the Hadoop file system

• The Map function processes key-value pairs, resulting in an intermediate set of key-value pairs

• The Reduce function then processes those intermediate key-value pairs, merging the values for associated keys

• Common tasks for MapReduce are word counting, sorting and indexing

Source: http://www.techspot.co.in/2011/07/mapreduce-for-dummies.html

[Diagram: the traditional way (serial) compared with the MapReduce way (parallel), in which several Map tasks feed a Reduce task]
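
The map and reduce steps described on this slide can be illustrated with word counting, one of the common tasks it mentions. The following is a minimal single-process sketch in plain Python, not Hadoop's API; the function names (`map_words`, `shuffle`, `reduce_counts`, `word_count`) are hypothetical, and the grouping step stands in for the work a framework would do between the two phases.

```python
from collections import defaultdict
from itertools import chain


def map_words(document):
    """Map: emit an intermediate (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group intermediate values by key, between the map and reduce phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(key, values):
    """Reduce: merge all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    # Each map call is independent of the others, so a cluster could run
    # them in parallel; here they simply run one after another.
    intermediate = chain.from_iterable(map_words(doc) for doc in documents)
    grouped = shuffle(intermediate)
    return dict(reduce_counts(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    docs = ["big data big insights", "big data platform"]
    print(word_count(docs))
```

Because no map call depends on another, the documents could be split across machines, which is the property the parallel half of the diagram relies on.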


Our vision and Big Data platform

Source: https://www.ibm.com/developerworks/data/library/techarticle/dm-1110biginsightsintro


Example business applications for Big Data platform

Utilities
• Weather impact analysis on power generation and supply
• Smart meter data analysis

eCommerce
• Analyze consumer behavior and buying patterns
• Digital asset piracy

Multi-channel Integration
• Integrated customer behavior modeling

Transportation
• Traffic and weather impact on logistics, fuel consumption, time

Call Centers
• Recognize patterns, predict trends
• Voice-to-text mining for customer behavior understanding

Financial Services
• Improved risk decisions
• Customer sentiment analysis
• AML

IT
• Transition log analysis for multiple transactional systems

Telecommunications
• Operations, data traffic and failure analysis from devices, sensors and GPS inputs


Who's this “Guy”?

A Breakthrough in Internet-scale analytics and innovation. However, the success is dependent on the quality of the information we work on.


Watson and Big Data

Watson’s Memory
• Approx. 200M pages of text (to compete on Jeopardy!)
• Watson uses the Apache Hadoop open framework to distribute the workload for loading information into memory.
• Big Data technology was used to build Watson’s knowledge base
• Similar technology can now be used for Advanced Business Analytics

[Diagram: POS data, CRM/ERP data and consumer generated data; IBM BigInsights; distilled insight (spending habits, social relationships, buying trends); advanced search and analysis capabilities; YOU!]


InfoSphere BigInsights

BigInsights is a software platform designed to help firms discover and analyze business insights hidden in large volumes of a diverse range of data.

This data is often ignored or discarded because it's too impractical or difficult to process using traditional means. Examples of such data include log records, click streams, social media data, news feeds, electronic sensor output and even some transactional data.

Visualization
– Uses IBM Many-Eyes technology (http://many-eyes.com)
– Part of BigSheets UI

BigSheets
– “Table-like” UI for BigInsights
– Data Discovery and Manipulation
– Jobs & Simulations

BigInsights
– Hadoop / MapReduce-based framework
– Extended with IBM capabilities, such as Agents, GPFS, Indexing, Analytics, Enterprise Integration, Administration and more


BigInsights demo

BigSheets demo.mp4 (file)
BigSheets demo (youtube) [length 3:56 when started from 1:15]


How BigInsights fits into an enterprise data architecture

Example 1: Using BigInsights to filter and summarize big data for the warehouse

• BigInsights can sift through large volumes of unstructured or semi-structured data, capturing relevant information that can augment existing corporate data in a warehouse.

• Once in the warehouse, traditional business intelligence and query/report writing tools can work with the extracted, aggregated and transformed portions of raw data in BigInsights.

Example 2: BigInsights serving as a query-ready archive for a data warehouse

• This potential deployment approach involves using BigInsights as a query-ready archive for a data warehouse.

• With this approach, frequently accessed data can be maintained in the warehouse while “cold” or outdated information can be offloaded to BigInsights.

• This allows firms to manage the size of their existing data management platforms while servicing the well-established needs of their existing applications.
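
The two deployment patterns above can be sketched in a few lines. This is an illustration in plain Python under stated assumptions, not BigInsights code: the record fields (`type`, `customer_id`, `last_accessed`) and both function names are hypothetical, and a real deployment would run the equivalent logic as distributed jobs.

```python
from collections import Counter
from datetime import date


def summarize_for_warehouse(raw_events, wanted_type):
    """Example 1: keep only relevant events and aggregate them per customer."""
    relevant = (event for event in raw_events if event["type"] == wanted_type)
    counts = Counter(event["customer_id"] for event in relevant)
    # Each summary row is compact enough to load into a warehouse table.
    return [{"customer_id": cid, "event_count": n} for cid, n in sorted(counts.items())]


def split_hot_and_cold(rows, cutoff):
    """Example 2: rows last accessed before the cutoff are archive candidates."""
    hot = [row for row in rows if row["last_accessed"] >= cutoff]
    cold = [row for row in rows if row["last_accessed"] < cutoff]
    return hot, cold


if __name__ == "__main__":
    events = [
        {"type": "complaint", "customer_id": "c1"},
        {"type": "page_view", "customer_id": "c1"},
        {"type": "complaint", "customer_id": "c2"},
        {"type": "complaint", "customer_id": "c1"},
    ]
    print(summarize_for_warehouse(events, "complaint"))

    rows = [
        {"id": 1, "last_accessed": date(2011, 9, 30)},
        {"id": 2, "last_accessed": date(2008, 1, 15)},
    ]
    hot, cold = split_hot_and_cold(rows, cutoff=date(2011, 1, 1))
    print(len(hot), "hot,", len(cold), "cold")
```

In both cases the warehouse only ever holds the smaller, structured result, while the bulk of the raw or cold data stays on the big data side.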


BigInsights architecture

[Diagram labels]
• Users: Analyst, DBA, Programmer
• BigInsights Enterprise Console: Data Explorer, Application Flows, Dashboards/Reports, Administration
• BigInsights Enterprise Engine: Language (Jaql, Pig, Hive, HBase); Workflow orchestration; Workload Prioritization; Map-reduce (Hadoop + Adaptive Map-Reduce); File system (GPFS, HDFS); Analytics; Indexing (parallel, partitioned, real-time)
• Cross-cutting labels: Manageability, Integration, Consumability, Performance
• Surrounding products: Unica, DB2, Coremetrics, Streams, Netezza, DataStage, SPSS, Cognos
• Sources: DBs, JMS, HTTP, Web & App logs, Crawlers, Streams


IBM InfoSphere Streams

• Continuously analyze massive volumes of data (PBs per day)

• Perform complex analytics of heterogeneous data types including text, images, audio, voice, VoIP, video, police scanners, web traffic, email, GPS data, financial transaction data, satellite data, sensors, and any other type of digital information

• Leverage sub-millisecond latencies to react to events and trends

• Adapt to rapidly changing data forms and types

Development environment
– Streams Processing Language (SPL)
– Eclipse-based IDE

Runtime environment
– SPADE (Stream Processing Application Declarative Engine)

Toolkits & Adapters
– Connectors to data sources
– Math & text functions
– Operator library
– Mining and Financial Services toolkit (SPSS)

[Diagram labels: Development (Streams Studio), Runtime (SPADE), Toolkits; Adapters (input), Operators (process), Sinks (output); Streams Live Graph]
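
The input, process and output stages named on this slide can be sketched with Python generators. This is plain Python, not SPL and not the Streams runtime; all function names and the sample readings are hypothetical. It only illustrates the idea that each tuple flows through the operators as it arrives, instead of being stored first and queried later.

```python
from collections import deque


def source(readings):
    """Input adapter: yields values one at a time, as they would arrive."""
    for reading in readings:
        yield reading


def moving_average(stream, window_size):
    """Operator: for each value, emit the mean of the last `window_size` values."""
    window = deque(maxlen=window_size)
    for value in stream:
        window.append(value)
        yield sum(window) / len(window)


def threshold_filter(stream, limit):
    """Operator: pass on only the values above `limit`."""
    for value in stream:
        if value > limit:
            yield value


def sink(stream):
    """Output sink: collects results; a real sink might write to a file or socket."""
    return list(stream)


if __name__ == "__main__":
    readings = [10, 12, 11, 30, 35, 12]
    alerts = sink(threshold_filter(moving_average(source(readings), 3), 15))
    print([round(a, 2) for a in alerts])
```

Only a fixed-size window is held in memory at any time, which is what lets this style of processing keep up with data that is too fast or too large to store first.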


Streams

InfoSphere Streams.mov (file)
InfoSphere Streams (youtube, length 1:28)


Streams for Video Contour Detection

Original Picture

Contour Detection


Streams for Telephony


Streams for Real-time Geomapping


Predictive Analytics in a Neonatal ICU

• Real-time analytics and correlations on physiological data streams
– Blood pressure, temperature, EKG, blood oxygen saturation, etc.

• Early detection of the onset of potentially life-threatening conditions
– Up to 24 hours earlier than current medical practices
– Early intervention leads to lower patient morbidity and better long-term outcomes

• Technology also enables physicians to verify new clinical hypotheses


Stream Computing for Healthcare

Stream Computing for Healthcare.mov (file)
Stream Computing for Healthcare (youtube, length 1:28)


Conventional vs “Big Data” analytics together

[Diagram labels]
• Traditional / Relational Data Sources: Database & Warehouse (in-memory or disk-based database); at-rest data analytics; conventional, batch oriented analytics; Traditional (OLTP/OLAP); Business Intelligence, Predictive Analytics, Web and Marketing Analytics; Results
• Non-Traditional / Non-Relational Data Sources: varied data formats (semi-structured, unstructured...); massive data; InfoSphere BigInsights; Massive Scale Analytics; New Insights
• In-Motion Analytics: InfoSphere Streams; streaming data analytics; Real-time (RTAP); Ultra Low Latency Results


Businesses are Not well prepared for Big Data

[Diagram: the Enterprise Data Architecture (business processes and applications, information integration, data warehousing in terabytes, business analytics and optimization), now with BigInsights and Streams placed alongside it for the Volume, Variety and Velocity of unstructured, non-relational data in petabytes and exabytes]


Analyse unstructured content, like emails, call center logs, documents, knowledge base content, web content, SharePoint sites, etc.

Based on UIMA open standard architecture

Transform raw information into business insight quickly without building models or deploying complex systems

Achieve insights in hours or days, not weeks or months

Reporting through corporate Business Intelligence

Easy for contact center agents, knowledge workers or management, for example, to search and explore content

Flexible and extensible for deeper insights

Insight to Unstructured Content with IBM Content Analytics


Content Analytics demo

Content Analytics demo.mp4 (file)
Content Analytics demo (youtube, length 9:33)


ICA Architecture

Crawler Framework
• Supported Crawlers: Web (HTTP), Windows File System, Unix File System, FileNet P8, DB2 Content Manager, Content Integrator, DB2, JDBC, NNTP, Lotus Notes, QuickPlace, SharePoint, Microsoft Exchange, WebSphere Portal, Web Content Mgmt, Domino Doc Mgmt
• Custom Crawler (crawler plug-in)

[Diagram labels]
• Document Cache
• Document Processor: Parser, Document Generator, UIMA, Indexer
• IBM Extended Lucene Indexer: Search Index, Taxonomy Index, Thumbnail Index, Facet Count Sub Index
• Text Miner, Search and Text Analytics Runtime, Applications, Discovery
• Common Infrastructure: Control, Monitor, Configuration, Security, Scheduler, Logging
• Users: Administrator, Analyst
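
The indexing and facet-count pieces of this architecture can be illustrated with a small inverted index. This is a minimal sketch in plain Python and does not use Lucene, UIMA or any IBM Content Analytics API; the function names, the document fields (`text`, `source`) and the sample documents are hypothetical.

```python
from collections import Counter, defaultdict


def build_index(documents):
    """Indexer: map each token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, doc in documents.items():
        for token in doc["text"].lower().split():
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every term in the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    hits = set(index.get(terms[0], set()))
    for term in terms[1:]:
        hits &= index.get(term, set())
    return hits


def facet_counts(documents, doc_ids, facet):
    """Count how many matching documents fall under each value of one facet."""
    return Counter(documents[doc_id][facet] for doc_id in doc_ids)


if __name__ == "__main__":
    docs = {
        1: {"text": "Billing error on invoice", "source": "email"},
        2: {"text": "Invoice sent twice", "source": "call log"},
        3: {"text": "Password reset request", "source": "email"},
    }
    index = build_index(docs)
    hits = search(index, "invoice")
    print(sorted(hits))
    print(facet_counts(docs, hits, "source"))
```

The facet counts are what allow an analyst to explore results by source or category without reading every matching document.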


Summary

Big Data
– Massive scale data, usually outside of conventional business systems
– Has huge Volume, Variety or Velocity

BigInsights
– Analytical platform for Persistent Big Data to bring new business insights
– Based on Open Source + IBM technology + technology expertise
– Designed to Integrate with existing Enterprise Management Information Systems and Business Analytics

Streams
– Solution to capture and analyze high velocity In-motion Big Data with ultra-low latency
– Designed to Integrate with existing Enterprise Management Information Systems and Business Analytics

Big Data platform
– Combination of Software, Hardware, Services and Advanced Research
– IBM's Big Data platform is based on Open Source, BigInsights and Streams technologies
– Has to provide enterprise integration, scalability, manageability and security

Apache Hadoop
– Open Source framework for data-intensive distributed work (applications)