© 2011 IBM Corporation
Big Data and Big Insights: Smart Analytics in Internet-scale
Jukka Ruponen, IT Architect
2 October 13, 2011 © 2011 IBM Corporation
Big Data, Big Insights
Information Growing at a Phenomenal Rate . . .
• 2009: 800 exabytes (800 000 petabytes)
• By 2020: 35 zettabytes (35 000 000 petabytes), 44x as much data and content
• 80% of world’s data is unstructured
• 50% of data transferred in mobile networks is audio/video
• 1 in 3 business leaders frequently make decisions based on information they don’t trust, or don’t have
• 1 in 2 business leaders say they don’t have access to the information they need to do their jobs
• 83% of CIOs cited “Business intelligence and analytics” as part of their visionary plans to enhance competitiveness
• 60% of CEOs need to do a better job capturing and understanding information rapidly in order to make swift business decisions
Sources: IBM: CxO Studies 2009-2010; IDC: The Digital Universe Decade – Are You Ready? May 2010; Cisco Visual Networking Index: Global Mobile Data Traffic Forecast Update, 2010–2015
Volume, Variety, Velocity
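The 44x growth figure quoted on this slide can be checked with a short calculation. The sketch below assumes decimal (SI) prefixes, so 1 zettabyte = 1000 exabytes; the variable names are illustrative only.

```python
# Assumption: decimal (SI) units, 1 ZB = 1000 EB and 1 EB = 1000 PB.
EB_PER_ZB = 1000
PB_PER_EB = 1000

data_2009_eb = 800   # 800 exabytes in 2009 (from the slide)
data_2020_zb = 35    # 35 zettabytes by 2020 (from the slide)

data_2020_eb = data_2020_zb * EB_PER_ZB
growth = data_2020_eb / data_2009_eb

print(f"2009: {data_2009_eb * PB_PER_EB} PB")
print(f"2020: {data_2020_eb * PB_PER_EB} PB")
print(f"growth factor: {growth} (about {round(growth)}x)")
```

The exact ratio is 43.75, which the slide rounds to 44x.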
Streams and Oceans of information
Information streams: High speed information, flowing in real-time, often transient.
• Sensors, instruments...
• Real-time logs, activity monitors...
• Streaming content like video/audio...
• High speed transactions like tickers, trades or traffic systems...
Information oceans: Information is stored outside conventional systems. Data may originate from different internal systems or from the Web.
• Collections of what has already streamed...
• Social media, click streams, stored logs, emails, etc.
• Unstructured or mixed schema documents, like claims, forms, desktop applications...
• Structured data from disparate systems...
Big Data presents a Big Challenge
How do we extract Insight and Value from a High Volume, Variety and Velocity of data, in a Timely and Cost-effective manner?
• Variety: Manage and benefit from diverse data types and data structures
• Velocity: Analyze increasingly accelerating streams of data, new and changed
• Volume: Scale from terabytes to zettabytes
Bringing together a Large Volume, Variety and Velocity of Data to Find New Insights?
Multi-channel customer sentiment and experience analysis
Detect life-threatening conditions at hospitals in time to intervene
Predict weather patterns to plan optimal wind turbine usage, and optimize capital expenditure on asset placement
Make risk decisions based on real-time transactional data
Identify criminals and threats from disparate video, audio, and data feeds
Growing demand for Big Data Analytics
• Public safety
• Finance
• Smarter Healthcare
• Multi-channel sales
• Telecom
• Manufacturing
• Traffic Control
• Trading Analytics
• . . .
Optimize capital investments based on 6 Petabytes of information
• Model the weather to optimize placement of turbines, maximizing power generation and longevity
• Build models to cover forecasting and real-time operation of power generation units
• Incorporate 6 PB of structured and semi-structured information flows
Protect your intellectual property based on 1 Year of social media data around the Internet
• Identify unauthorized content streaming in digital media (piracy issues)
• Quantify annual revenue loss and analyse trends
• Incorporate high variety of unstructured and semi-structured data (future plans for video content analysis)
Businesses are well prepared for Structured Data
• Financial data, Customer data, Product data, Process data, Transactional data...
• Relational, Structured, Numeric, Cleaned, Normalized, Reconciled
[Diagram: Enterprise Data Architecture. Labels: Business Processes and Applications (ERP, CRM, ECM, SCM); Information Integration (ETL, Data Quality, Common Meta Data, Industry Models, Change Data Capture, Information Services Director); MDM (Master Data, Reference Data); Identity Insight; Global Name Recognition; Data Warehousing (Data Warehouse, Other RDBMS, OLAP Data Cubes); Business Analytics and Optimization (Business Intelligence, Financial Performance, Strategy Management, Sales Management, Workforce Management, Predictive Analytics, Governance, Risk and Compliance, Financial Risk Management, Marketing and Campaign Management, Online Web Analytics, Advanced Analytics)]
Businesses are Not well prepared for Big Data
• Text, rich text, audio, video, click streams, log files, sensor data, raw data, web feeds, data streams...
• Volume - Variety - Velocity
• Unstructured, Non-relational
• Petabytes... Exabytes...
[Diagram: the same Enterprise Data Architecture as on the previous slide, with Data Warehousing annotated "Terabytes" and the Big Data sources marked with question marks]
The Traditional Approach: Business Requirements Drive Solution Design
• Business Defines Requirements and the Questions they Need Answers for
• IT Designs a Solution with a set structure and functionality
• Business executes queries to answer questions over and over
• New requirements require redesign and rebuild
Well-Suited To:
• High value, structured data
• Repeated operations and processes (e.g. transactions, reports, BI, etc.)
• Relatively stable sources
• Well-understood requirements
Stretched By:
• Highly variable data and content
• Iterative, exploratory analysis (e.g. scientific research, behavioral modeling, etc.)
• Volatile sources
• Ill-defined questions and changing requirements
The Big Data Approach: Information Sources Drive Creative Discovery
• Business and IT Identify Information Sources Available
• IT Delivers a Platform that enables creative exploration of all available data and content
• Business determines What Questions they Could Ask by exploring the data and relationships
• New insights drive integration to traditional technology
How to merge the Traditional and Big Data approach?
Traditional Approach: Structured & Repeatable Analysis
• Business Users determine what question to ask
• IT structures the data to answer that question
• Examples: Monthly sales reports, Profitability analysis, Customer surveys
Big Data Approach: Iterative & Exploratory Analysis
• IT delivers a platform to enable creative discovery
• Business Users explore what questions could be asked
• Examples: Brand/product sentiment? Product strategy? Maximum asset utilization?
The question is NOT whether we need either Left or Right. The question IS When and How we can Balance between Both!
Big Data Shouldn’t Be a Silo: Must be an integrated part of your enterprise information architecture
[Diagram labels: “Big Data Platform”, Data Warehouse Platform, Enterprise Integration, Business Systems, New Sources]
Potential solution: A Big Data Platform
Bring together any data source, at any velocity and variety to generate insight
• Variety: Analyzing a variety of data at enormous volumes
• Velocity: Insights on streaming data
• Volume: Large volume structured data analysis
Big Data related Open Source technologies and concepts
• Apache Hadoop (including the Hadoop Distributed File System (HDFS), MapReduce framework, and common utilities), a software framework for data-intensive applications that exploit distributed computing environments
• Pig, a high-level programming language and runtime environment for Hadoop
• Jaql, a high-level query language based on JavaScript Object Notation (JSON), which also supports SQL.
• Hive, a data warehouse infrastructure designed to support batch queries and analysis of files managed by Hadoop
• HBase, a column-oriented data storage environment designed to support large, sparsely populated tables in Hadoop
• Flume, a facility for collecting and loading data into Hadoop
• Lucene, text search and indexing technology
• Avro, data serialization technology
• ZooKeeper, a coordination service for distributed applications
• Oozie, workflow/job orchestration technology
• UIMA, Unstructured Information Management Architecture, for creating, integrating and deploying unstructured information management solutions from a combination of semantic analysis and search components.
MapReduce
• MapReduce is a programming model for running parallel, data-intensive functions against the data in the Hadoop file system
• The Map function processes key-value pairs, resulting in an intermediate set of key-value pairs
• The Reduce function then processes those intermediate key-value pairs, merging the values for associated keys
• Common tasks for MapReduce are word counting, sorting and indexing
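As a rough illustration of the Map and Reduce steps described in these bullets, here is a minimal single-process word count in plain Python. It is only a sketch of the programming model, not Hadoop code: the function names (`map_words`, `shuffle`, `reduce_counts`, `word_count`) are hypothetical, and a real framework would run the map and reduce calls in parallel across many machines.

```python
from collections import defaultdict

def map_words(line):
    """Map step: emit an intermediate (word, 1) pair for every word in a line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Group intermediate values by key (done by the framework between the two steps)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_counts(key, values):
    """Reduce step: merge all values associated with one key into a single count."""
    return key, sum(values)

def word_count(lines):
    # Here the map calls run one after another; they are independent of each
    # other, which is what allows a framework to distribute them.
    intermediate = [pair for line in lines for pair in map_words(line)]
    return dict(reduce_counts(k, v) for k, v in shuffle(intermediate).items())

counts = word_count(["big data big insights", "big challenge"])
print(counts)
```

Because each line is mapped independently and each key is reduced independently, both steps can be split across workers without changing the result.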
Source: http://www.techspot.co.in/2011/07/mapreduce-for-dummies.html
[Figure: the traditional way (serial) compared with the MapReduce way (parallel), in which several Map tasks feed a Reduce task]
Our vision and Big Data platform
Source: https://www.ibm.com/developerworks/data/library/techarticle/dm-1110biginsightsintro
Example business applications for Big Data platform
• Utilities: Weather impact analysis on power generation and supply; Smart meter data analysis
• eCommerce: Analyze consumer behavior and buying patterns; Digital asset piracy
• Multi-channel Integration: Integrated customer behavior modeling
• Transportation: Traffic and weather impact on logistics, fuel consumption, time
• Call Centers: Recognize patterns, predict trends; Voice-to-text mining for customer behavior understanding
• Financial Services: Improved risk decisions; Customer sentiment analysis; AML
• IT: Transition log analysis for multiple transactional systems
• Telecommunications: Operations, data traffic and failure analysis from devices, sensors and GPS inputs
Who's this “Guy”?
A Breakthrough in Internet-scale analytics and innovation. However, the success is dependent on the quality of the information we work on.
Watson and Big Data
• Watson’s Memory: approx. 200M pages of text (to compete on Jeopardy!)
• Watson uses the Apache Hadoop open framework to distribute the workload for loading information into memory.
• Big Data technology was used to build Watson’s knowledge base
• Similar technology can now be used for Advanced Business Analytics
[Diagram labels: POS Data, CRM/ERP Data, Consumer Generated Data; IBM BigInsights; Advanced Search and Analysis capabilities; Distilled Insight (Spending habits, Social relationships, Buying trends); YOU!]
InfoSphere BigInsights
• BigInsights is a software platform designed to help firms discover and analyze business insights hidden in large volumes of a diverse range of data.
• This data is often ignored or discarded because it's too impractical or difficult to process using traditional means. Examples of such data include log records, click streams, social media data, news feeds, electronic sensor output and even some transactional data.
• Visualization
– Uses IBM Many-Eyes technology (http://many-eyes.com)
– Part of BigSheets UI
• BigSheets
– “Table-like” UI for BigInsights
– Data Discovery and Manipulation
– Jobs & Simulations
• BigInsights
– Hadoop / MapReduce-based framework
– Extended with IBM capabilities, such as Agents, GPFS, Indexing, Analytics, Enterprise Integration, Administration and more
BigInsights demo
BigSheets demo.mp4 (file); BigSheets demo (youtube) [length 3:56 when started from 1:15]
How BigInsights fits into an enterprise data architecture
Example 1: Using BigInsights to filter and summarize big data for the warehouse
• BigInsights can sift through large volumes of unstructured or semi-structured data, capturing relevant information that can augment existing corporate data in a warehouse.
• Once in the warehouse, traditional business intelligence and query/report writing tools can work with the extracted, aggregated and transformed portions of raw data in BigInsights.
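The filter-and-summarize pattern in Example 1 can be sketched in a few lines of Python. This is an illustration of the idea only, not BigInsights code: the log format (date, service, level, message), the `summarize_errors` function and the sample data are all assumptions made for the example.

```python
from collections import Counter

def summarize_errors(log_lines):
    """Filter raw log lines down to errors and count them per (date, service)."""
    summary = Counter()
    for line in log_lines:
        parts = line.split(maxsplit=3)
        if len(parts) < 4:
            continue  # skip malformed lines instead of failing the whole batch
        date, service, level, _message = parts
        if level == "ERROR":
            summary[(date, service)] += 1
    # Sorted (date, service, error_count) rows, a shape that suits a bulk load
    # into a warehouse table.
    return [(date, service, n) for (date, service), n in sorted(summary.items())]

# Hypothetical raw input: semi-structured application log lines.
raw = [
    "2011-10-13 checkout ERROR payment gateway timeout",
    "2011-10-13 checkout INFO order placed",
    "2011-10-13 search ERROR index unavailable",
    "2011-10-13 checkout ERROR card declined",
    "garbage",
]
rows = summarize_errors(raw)
print(rows)
```

Only the small aggregated result moves to the warehouse; the bulky raw lines stay on the big data platform.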
Example 2: BigInsights serving as a query-ready archive for a data warehouse
• This potential deployment approach involves using BigInsights as a query-ready archive for a data warehouse.
• With this approach, frequently accessed data can be maintained in the warehouse while “cold” or outdated information can be offloaded to BigInsights.
• This allows firms to manage the size of their existing data management platforms while servicing the well-established needs of their existing applications.
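The hot/cold split behind Example 2 might look like the sketch below. The one-year threshold, the `last_accessed` field and the `split_hot_cold` helper are assumptions chosen for illustration; the slides do not specify how "cold" data is identified.

```python
from datetime import date, timedelta

def split_hot_cold(records, today, hot_days=365):
    """Partition records into 'hot' (stay in the warehouse) and 'cold' (offload to the archive)."""
    cutoff = today - timedelta(days=hot_days)
    hot = [r for r in records if r["last_accessed"] >= cutoff]
    cold = [r for r in records if r["last_accessed"] < cutoff]
    return hot, cold

# Hypothetical warehouse rows with an access timestamp.
records = [
    {"id": 1, "last_accessed": date(2011, 9, 1)},
    {"id": 2, "last_accessed": date(2009, 1, 15)},
    {"id": 3, "last_accessed": date(2010, 10, 13)},
]
hot, cold = split_hot_cold(records, today=date(2011, 10, 13))
print([r["id"] for r in hot], [r["id"] for r in cold])
```

Every record lands in exactly one of the two lists, so nothing is lost when the cold set is moved to the archive.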
BigInsights architecture
[Diagram labels: BigInsights Enterprise Console (Data Explorer, Application Flows, Dashboards/Reports, Administration); BigInsights Enterprise Engine (Language: Jaql, Pig, Hive, HBase; Workflow orchestration; Workload Prioritization; Map-reduce: Hadoop + Adaptive Map-Reduce; File system: GPFS, HDFS; Analytics; Indexing: parallel, partitioned, real-time); data sources (DBs, JMS, HTTP, Web & App logs, Crawlers, Streams); connected products (Unica, DB2, Coremetrics, Streams, Netezza, DataStage, SPSS, Cognos); users (DBA, Analyst, Programmer); themes (Manageability, Integration, Consumability, Performance)]
IBM InfoSphere Streams
• Continuously analyze massive volumes of data (PBs per day)
• Perform complex analytics of heterogeneous data types including text, images, audio, voice, VoIP, video, police scanners, web traffic, email, GPS data, financial transaction data, satellite data, sensors, and any other type of digital information
• Leverage sub-millisecond latencies to react to events and trends
• Adapt to rapidly changing data forms and types.
• Development environment
– Streams Processing Language (SPL)
– Eclipse-based IDE
• Runtime environment
– SPADE
– Stream Processing Application Declarative Engine
• Toolkits & Adapters
– Connectors to data sources
– Math & text functions
– Operator library
– Mining and Financial Services toolkit (SPSS)
[Diagram labels: Development (Streams Studio), Runtime (SPADE), Toolkits; Adapters (Input), Operators (Process), Sinks (Output); Streams Live Graph]
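The input, process and output stages named on this slide can be illustrated with Python generators. This is not SPL and does not use any InfoSphere Streams API; the `source`, `moving_average` and `sink` functions are hypothetical stand-ins for an input adapter, an operator and an output sink.

```python
from collections import deque

def source(readings):
    """Input adapter: yield tuples one at a time, as they would arrive from a feed."""
    for reading in readings:
        yield reading

def moving_average(stream, window_size):
    """Operator: for each new value, emit the mean of the last `window_size` values."""
    window = deque(maxlen=window_size)  # old values fall out automatically
    for value in stream:
        window.append(value)
        yield sum(window) / len(window)

def sink(stream):
    """Output sink: collect results (a real sink might write to a socket or database)."""
    return list(stream)

averages = sink(moving_average(source([10, 20, 30, 40]), window_size=2))
print(averages)
```

Each value is processed as it arrives and only a small window is held in memory, which is the core difference from analysing data at rest.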
Streams
InfoSphere Streams.mov (file); InfoSphere Streams (youtube, length 1:28)
Streams for Video Contour Detection
[Figure: Original Picture and its Contour Detection result]
Predictive Analytics in a Neonatal ICU
• Real-time analytics and correlations on physiological data streams
– Blood pressure, Temperature, EKG, Blood oxygen saturation etc.
• Early detection of the onset of potentially life-threatening conditions
– Up to 24 hours earlier than current medical practices
– Early intervention leads to lower patient morbidity and better long term outcomes
• Technology also enables physicians to verify new clinical hypotheses
Stream Computing for Healthcare
Stream Computing for Healthcare.mov (file); Stream Computing for Healthcare (youtube, length 1:28)
Conventional vs “Big Data” analytics together
[Diagram labels: Traditional / Relational Data Sources; Non-Traditional / Non-Relational Data Sources (varied data formats, semi-structured, unstructured...); Conventional Analytics: Database & Warehouse, In-Memory or Disk-based Database, At-rest data analytics, Traditional (OLTP/OLAP), Results; InfoSphere BigInsights: Massive Data, Massive Scale Analytics, Batch oriented analytics, New Insights; InfoSphere Streams: In-Motion Analytics, Streaming data analytics, Real-time (RTAP), Ultra Low Latency Results; Predictive Analytics, Business Intelligence, Web and Marketing Analytics]
Businesses are Not well prepared for Big Data
• Volume - Variety - Velocity
• Unstructured, Non-relational
• Petabytes... Exabytes...
[Diagram: the Enterprise Data Architecture from the earlier slides, with Data Warehousing annotated "Terabytes" and with BigInsights and Streams added for Big Data]
Insight to Unstructured Content with IBM Content Analytics
• Analyse unstructured content, like emails, call center logs, documents, knowledge base content, web content, SharePoint sites etc.
• Based on UIMA open standard architecture
• Transform raw information into business insight quickly without building models or deploying complex systems
• Achieve insights in hours or days, not weeks or months
• Reporting through corporate Business Intelligence
• Easy to use for, e.g., contact center agents, knowledge workers or management to search and explore content
• Flexible and extensible for deeper insights
Content Analytics demo
Content Analytics demo.mp4 (file); Content Analytics demo (youtube, length 9:33)
ICA Architecture
[Diagram labels, grouped by component:]
• Crawler Framework
– Supported Crawlers: Web (HTTP), Windows File System, Unix File System, FileNet P8, DB2 Content Manager, Content Integrator, DB2, JDBC, NNTP, Lotus Notes, QuickPlace, SharePoint, Microsoft Exchange, WebSphere Portal, Web Content Mgmt, Domino Doc Mgmt
– Custom Crawler, Crawler Plug-in
• Document Cache
• Document Processor: Parser, Document Generator, UIMA
• Indexer: IBM Extended Lucene Indexer (Thumbnail Index, Facet Count Sub Index, Taxonomy Index, Search Index)
• Search and Text Analytics Runtime, Text Analytics Runtime
• Applications: Text Miner
• Discovery
• Common Infrastructure: Control, Monitor, Configuration, Security, Scheduler, Logging
• Users: Administrator, Analyst
Summary
• Big Data
– Massive scale data, usually outside of conventional business systems
– Has huge Volume, Variety or Velocity
• BigInsights
– Analytical platform for Persistent Big Data to bring new business insights
– Based on Open Source + IBM technology + technology expertise
– Designed to Integrate with existing Enterprise Management Information Systems and Business Analytics
• Streams
– Solution to capture and analyze high velocity In-motion Big Data with ultra-low latency
– Designed to Integrate with existing Enterprise Management Information Systems and Business Analytics
• Big Data platform
– Combination of Software, Hardware, Services and Advanced Research
– IBM's Big Data platform is based on Open Source, BigInsights and Streams technologies
– Has to provide enterprise integration, scalability, manageability and security
• Apache Hadoop
– Open Source framework for data-intensive distributed work (applications)