Presentation from the Persistent Customer Summit about Big Data
BIG DATA Defined:
Data Stack 3.0
Persistent Systems
June 2012
1 24 July 2012
The Data Revolution is Happening Now
The growing need for large-volume, multi-structured “Big Data” analytics, as well as … “Fast Data”, have positioned the industry at the cusp of the most radical revolution in database architectures in 20 years.
We believe that the economics of data will
increasingly drive competitive advantage.
Source: Credit Suisse Research, Sept 2011
Enterprise Value is Shifting to Data
[Figure: timeline of where enterprise value resides. Labels: Mainframe, Operating Systems, Database, ERP Apps, Data. Years: 1975, 1985, 1995, 2006, 2013.]
Organizational leaders want analytics to exploit their growing data and computational power to get smart, and get innovative, in ways they never could before.
Source: MIT Sloan Management Review, The New Intelligent Enterprise - Big Data, Analytics and the Path From Insights to Value, by Steve LaValle, Eric Lesser, Rebecca Shockley, Michael S. Hopkins and Nina Kruschwitz, December 21, 2010
What Data Can Do For You
Source: New York Times, September 2, 2009. Tesco, British Grocer, Uses Weather to Predict Sales By Julia Werdigier
http://www.nytimes.com/2009/09/02/business/global/02weather.html
Britain often conjures images of unpredictable weather, with downpours sometimes followed
by sunshine within the same hour — several times a day.
Such randomness has prompted Tesco, the country’s largest grocery chain, to create…its own
software that calculates how shopping patterns change “for every degree of temperature and
every hour of sunshine.”
Determining Shopping Patterns
British Grocer, Tesco Uses Big Data
by Applying Weather Results to Predict
Demand and Increase Sales
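The idea quoted above, that sales shift "for every degree of temperature and every hour of sunshine", is the shape of a linear model. The following is a minimal sketch of such a fit using ordinary least squares. It is not Tesco's software: the observations are synthetic, and `fit_linear`, `predict`, and every coefficient are hypothetical names and values chosen for illustration.

```python
def fit_linear(rows, targets):
    """Ordinary least squares via the normal equations (X^T X) b = X^T y."""
    # Prepend a constant 1.0 so the first coefficient is the intercept.
    X = [[1.0] + list(r) for r in rows]
    k = len(X[0])
    # Build the augmented matrix [X^T X | X^T y].
    A = [
        [sum(x[i] * x[j] for x in X) for j in range(k)]
        + [sum(x[i] * y for x, y in zip(X, targets))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]


def predict(coefficients, temperature_c, sunshine_hours):
    """Apply the fitted model to one weather forecast."""
    base, per_degree, per_sun_hour = coefficients
    return base + per_degree * temperature_c + per_sun_hour * sunshine_hours


# Synthetic observations: (temperature in degrees C, hours of sunshine).
weather = [(10, 2), (15, 5), (20, 3), (25, 8), (18, 0), (12, 6)]
# Synthetic unit sales generated from made-up "true" effects:
# 100 units baseline, +5 per degree, +8 per hour of sunshine.
units_sold = [100 + 5 * t + 8 * s for t, s in weather]

coefficients = fit_linear(weather, units_sold)
forecast = predict(coefficients, temperature_c=22, sunshine_hours=4)
```

Because the synthetic data contains no noise, the fit recovers the made-up effects; real sales data would need many more observations and additional factors.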
GlaxoSmithKline is aiming to build direct relationships with 1 million consumers in a year using
social media as a base for research and multichannel marketing. Targeted offers and
promotions will drive people to particular brand websites where external data is integrated
with information already held by the marketing teams.
Source: Big data: Embracing the elephant in the room By Steve Hemsley
http://www.marketingweek.co.uk/big-data-embracing-the-elephant-in-the-room/3030939.article
Tracking Customers in Social Media
GlaxoSmithKline Uses Big Data
to Efficiently Target Customers
What does India Think?
Persistent enables Aamir Khan Productions and Star Plus to use
Big Data to learn how people react to some of the most
excruciating social issues.
http://www.satyamevjayate.in/
Satyamev Jayate - Aamir Khan’s pioneering, interactive socio-cultural TV show - has caught the
interest of the entire nation. It has already generated ~7.5M responses in 4 weeks over SMS,
Facebook, Twitter, Phone Calls and Discussion Forums from viewers the world over. This
data is being analyzed and delivered in real time to allow the producers to understand the
pulse of the viewers, to gauge the appreciation for the show and, most importantly, to spread
the message. Harnessing the truth from all this data is a key component of the show’s success.
WE ALREADY HAVE DATABASES.
WHY DO WE NEED TO DO ANYTHING
DIFFERENT?
● Transaction processing capabilities ideally suited for transaction-oriented operational stores.
● Data types – numbers, text, etc.
● SQL as the Query language
● De-facto standard as the operational store for ERP and mission critical systems.
● Interface through application programs and query tools
Relational Database Systems for
Operational Store
● Operational data stores store on-line transactions – Many writes, some reads.
● Large fact table, multiple dimension tables
● Schema has a specific pattern – star schema
● Joins are also very standard and create cubes
● Queries focus on aggregates.
● Users access data through tools such as Cognos, Business Objects, Hyperion etc.
Enterprise Data Warehouse for Decision
Support
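The star-schema pattern listed above (one large fact table, several dimension tables, standard joins, aggregate queries) can be sketched with Python's built-in `sqlite3` module. This is a toy illustration under stated assumptions: the table names `fact_sales`, `dim_store` and `dim_product` and all rows are hypothetical, and a real warehouse engine would differ in scale and tooling.

```python
import sqlite3

# In-memory database standing in for a warehouse engine.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_store   (store_id   INTEGER PRIMARY KEY, region   TEXT);
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT);
-- The fact table holds one row per sale and points at the dimensions.
CREATE TABLE fact_sales  (store_id INTEGER, product_id INTEGER, amount REAL);

INSERT INTO dim_store   VALUES (1, 'North'), (2, 'South');
INSERT INTO dim_product VALUES (10, 'Grocery'), (20, 'Apparel');
INSERT INTO fact_sales  VALUES (1, 10, 100.0), (1, 20, 50.0),
                               (2, 10, 70.0),  (2, 10, 30.0);
""")

# Typical decision-support query: join the fact table to its dimensions
# and aggregate, here total sales per region and product category.
totals = conn.execute("""
SELECT s.region, p.category, SUM(f.amount)
FROM fact_sales f
JOIN dim_store   s ON s.store_id   = f.store_id
JOIN dim_product p ON p.product_id = f.product_id
GROUP BY s.region, p.category
ORDER BY s.region, p.category
""").fetchall()

for region, category, total in totals:
    print(region, category, total)
```

Each combination of dimension values in the result corresponds to one cell of the cube that reporting tools present to end users.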
Standard Enterprise Data Architecture
Data Stack 2.0: Enterprise Data Warehouse Systems
● Analyze, Query
● Data Warehouse Engine, Metadata Repository
● Optimized Loader: Extraction, Cleansing (ETL)
● Relational Databases, Legacy Data, Purchased Data, ERP Systems
Data Stack 1.0: Operational Data Systems
● Presentation Layer
● Application Logic
● Relational Databases
One in two business executives believe that they do not have sufficient information across their organization to do their job
Source: IBM Institute for Business Value
Despite the two data stacks ...
Data has Variety
Less than 40% of the Enterprise Data is stored in Data Stack 1.0 or Data Stack 2.0.
Beyond the Operational Systems, data required for decision making is scattered within and beyond the enterprise
● Structured Data Sources: ERP Systems, CRM Systems, Enterprise Data Warehouse
● Unstructured Data Sources: Email Systems, Collaboration/Wiki Sites, Document Repositories, Project artifacts, Employee Surveys, Customer Call Center Records
● Cloud Data Sources: CRM Systems, Expense Management System, Vendor Collaboration Systems, Supply Chain Systems
● Public Data Sources: Weather forecasts, Demographic Data, Maps, Economic Data, Social Networking Data Feeds
● Other sources shown: Organizational Workflow, Sensor Data, Location and Presence Data
5 Exabytes of information was created between the dawn of civilization through 2003, but that much information is now created every 2 days, and the pace is increasing
Eric Schmidt at the Techonomy Conference, August 4, 2010
(1 exabyte = 10^18 bytes)
Data Volumes are Growing
The Continued Explosion of Data in the Enterprise and Beyond
80% of new information growth is unstructured content; 90% of that is currently unmanaged
2009: 800,000 petabytes
2020: 35 zettabytes
44x as much Data and Content Over Coming Decade
Source: IDC, The Digital Universe Decade – Are You Ready?, May 2010
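The "44x" figure follows directly from the two IDC data points. A short worked check, assuming decimal units (1 petabyte = 10^15 bytes, 1 zettabyte = 10^21 bytes); the implied annual growth rate is a derived number, not one stated on the slide:

```python
# Decimal units: 1 ZB = 10**21 bytes and 1 PB = 10**15 bytes,
# so one zettabyte is one million petabytes.
PB_PER_ZB = 10**6

size_2009_pb = 800_000          # IDC estimate for 2009
size_2020_pb = 35 * PB_PER_ZB   # IDC projection for 2020

growth_factor = size_2020_pb / size_2009_pb   # 43.75, rounded to 44x on the slide

# Compound annual growth rate implied by that factor over 11 years.
years = 2020 - 2009
annual_growth = growth_factor ** (1 / years) - 1

print(f"growth factor: {growth_factor:.2f}x")
print(f"implied annual growth: {annual_growth:.0%}")
```

In other words, the projection implies data volumes growing by roughly 41% per year, compounding.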
What comes first -- Structure or data?
[Figure labels: Schema/Structure, Data]
Structure First is Constraining
Time to create a new data stack for unstructured data. Data Stack 3.0.
The Path to Data Stack 3.0:
Must support Variety, Volume and Velocity
Data Stack 3.0: Dynamic Data Platform
● Uncovering Key Insights
● Schema-less Approach
● PBs of Data
● End User Direct Access
● Structured + Semi Structured
Data Stack 2.0: Enterprise Data Warehouse
● Support for Decision Making
● Un-normalized Dimensional Model
● TBs of Data
● End User Access Through Reports
● Structured
Data Stack 1.0: Relational Database Systems
● Recording Business Events
● Highly Normalized Data
● GBs of Data
● End User Access through Ent Apps
● Structured
Can Data Stack 3.0 Address Real Problems?
● Large Data Volume at Low Price
● Diverse Data beyond Structured Data
● Queries that Are Difficult to Answer
● Answer Queries that No One Dare Ask
Time-out!
Internet companies have already addressed the same problems.
● Twitter has 140 million active users and more than 400 million tweets per day.
● Facebook has over 900 million active users, who generate an average of 3.2 billion Likes and Comments per day.
● There were 3.1 billion email accounts in 2011, expected to rise to over 4 billion by 2015.
● There were 2.3 billion internet users (2,279,709,629) worldwide in the first quarter of 2012, according to Internet World Stats data updated 31st March 2012.
Internet Companies have to deal with large
volumes of unstructured real-time data.
● Hosted service
● Large cluster (1000s of nodes) of low-cost commodity servers.
● Very large amounts of data: indexing billions of documents, videos, images, etc.
● Batch updates.
● Fault tolerance.
● Hundreds of millions of users.
● Billions of queries every day.
Their data loads and pricing requirements
do not fit traditional relational systems
● It is the platform that distinguishes them from everyone else.
● They required:
– high reliability across data centers
– scalability to thousands of network nodes
– huge read/write bandwidth
– support for large blocks of data which are gigabytes in size
– efficient distribution of operations across nodes to reduce bottlenecks
Relational databases were not suitable and would have been cost prohibitive.
They built their own systems
Companies have created business models to support and enhance this software.
Internet Companies have open-sourced the
source code they created for their own use.
Open Source Rules!
Hadoop
Infrastructure
What about support?
Allows for analysis of massive volumes of information
• Structured and Unstructured
• External and Internal
Thousands of users, millions of files, terabytes of data need to be handled
Commoditized hardware can be used to reduce costs
Big Data can and should integrate with existing enterprise information architecture
Only Big Data makes it possible!
Enterprises Always had Data.
Now there is a way to handle it!
PERSISTENT SYSTEMS AND BIG DATA
Persistent Systems has an experienced team of Big Data Experts that has created the technology building blocks to help you implement a Big Data Solution that offers a direct path to unlock the value in your data.
Big Data Expertise at Persistent
● 10+ projects executed with Leading ISVs and Enterprise Customers
● Dedicated group for MapReduce, Hadoop and Big Data Ecosystem (formed 3 years ago)
● Engaged with the Big Data Ecosystem, including leading ISVs and experts
● Preferred Big Data Services Partner of IBM and Microsoft
Big Data Leadership and Contributions
● Code Contributions to Big Data Open Source Projects, including:
– Hadoop, Hive, and SciDB
● Dedicated Hadoop cluster in Persistent
● Created PeBAL – Persistent Big Data Analytics Library
● Created Visual Programming Environment for Hadoop
● Created Data Connectors for Moving Data
● Pre-built Solutions to Accelerate Big Data Projects
Persistent’s Big Data Offerings
1. Setting up and Maintaining Big Data Platform
2. Data Analytics on Big Data Platform
3. Building Applications on Big Data
Technology Assets
● Foundational Infrastructure and Platform (Built Upon Selected 3rd Party Big Data Platforms and Technologies; Cluster of Commodity Hardware)
● Persistent Platform Enhancement IP (PeBAL Analytics Library, Data Connectors)
● Persistent Pre-built Horizontal Solutions (Email, Text, IT Analytics, … )
● Persistent Pre-built Industry Solutions: Retail, Banking, Telco
● Visual Programming Tools
People Assets and Methodology
● Big Data Custom Services
● Extension of Your Team
● Discovery Workshop
● Training for Your Team
● Team Formation Process
● Cluster Sizing/Config
Persistent Next Generation Data Architecture
[Figure: layered architecture diagram. Legend: Commercial/Open Source Product, Persistent IP, External Data source.]
● Email Server, IBM Tivoli, BBCA, Web Proxy
● Twitter, Facebook (Social Media Connector)
● Connector Framework
● DW, NoSQL, RDBMS, Data Warehouse
● MapReduce and HDFS, Cluster Monitoring, Admin App, Workflow Integration
● PIG/JAQL, Text Analytics/GATE/SystemT, Hive
● Persistent Analytics Library (PeBAL): Graph Fn, Set Fn, …, Text Analytics Fn
● Solutions
● BI Tools, Reports & Alerts
Persistent Big Data Analytics Library
WHY PEBAL
• Lots of common problems – not all of them are solved in MapReduce
• PigLatin, Hive, JAQL are languages and not libraries – something is needed to run on top that is not tied to SQL-like interfaces
BENEFITS OF A READY MADE SOLUTION
• Proven – well written and tested
• Reuse across multiple applications
• Quicker implementation of MapReduce applications
• High performance
FEATURES
• Organized as JAQL functions, PeBAL implements several graph, set, text extraction, indexing and correlation algorithms.
• PeBAL functions are schema agnostic.
• All PeBAL functions are tried and tested against well defined use cases.
[Figure: PeBAL function categories: Graph, Set, Text Analytics, Inverted Lists, Web Analytics, Statistics]
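To make one of these categories concrete, the sketch below builds an inverted list (word to document IDs) in the map/reduce style. It is plain Python for illustration only, not PeBAL's API: the slides say PeBAL is organized as JAQL functions, and the names `map_phase`, `reduce_phase` and the sample documents are hypothetical.

```python
from collections import defaultdict


def map_phase(doc_id, text):
    """Map step: emit one (word, doc_id) pair per distinct word in a document."""
    for word in set(text.lower().split()):
        yield word, doc_id


def reduce_phase(pairs):
    """Reduce step: collect all document IDs emitted for each word."""
    index = defaultdict(list)
    for word, doc_id in pairs:
        index[word].append(doc_id)
    return {word: sorted(ids) for word, ids in index.items()}


# Hypothetical sample documents keyed by document ID.
docs = {
    1: "big data platform",
    2: "data warehouse",
    3: "big cluster",
}

# On a cluster the map calls would run in parallel across nodes;
# here they run sequentially in one process.
pairs = [pair for doc_id, text in docs.items() for pair in map_phase(doc_id, text)]
inverted = reduce_phase(pairs)

print(inverted["data"])
print(inverted["big"])
```

A reusable library saves each project from rewriting this kind of routine building block, which is the reuse argument made for PeBAL above.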
Visual Programming Environment
ADOPTION BARRIERS
• Steep Learning Curve
• Difficult to Code
• Ad-hoc reporting can’t always be done by writing programs
• Limited tooling available
VISUAL PROGRAMMING ENVIRONMENT
• Use a standard ETL tool as the UI environment for generating PIG scripts
BENEFITS
• ETL Tools are widely used in Enterprises
• Can leverage large pool of skilled people who are experts in ETL and BI tools
• UI helps in iterative and rapid data analysis
• More people will start using it
Visual Programming Environment for Hadoop
[Figure components: Data Sources; ETL Tool (Data Flow UI, Metadata); Persistent IP (PIG Convertor, PIG UDF Library); PIG code; Big Data Platform (HDFS, Hive)]
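The idea of converting a visually designed data flow into PIG code can be sketched as a small generator. This is an assumption-laden illustration, not Persistent's PIG Convertor: the step dictionary format, the `to_pig` function and the sample paths are all hypothetical, and only a handful of Pig Latin operators are covered.

```python
def to_pig(flow):
    """Translate a list of data-flow steps into Pig Latin statements."""
    lines = []
    for step in flow:
        op = step["op"]
        if op == "load":
            schema = ", ".join(f"{name}:{ptype}" for name, ptype in step["schema"])
            lines.append(
                f"{step['alias']} = LOAD '{step['path']}' "
                f"USING PigStorage(',') AS ({schema});"
            )
        elif op == "filter":
            lines.append(
                f"{step['alias']} = FILTER {step['source']} BY {step['condition']};"
            )
        elif op == "group":
            lines.append(f"{step['alias']} = GROUP {step['source']} BY {step['key']};")
        elif op == "sum":
            # After GROUP, each row holds a bag named after the grouped relation.
            lines.append(
                f"{step['alias']} = FOREACH {step['source']} "
                f"GENERATE group, SUM({step['bag']}.{step['field']});"
            )
        elif op == "store":
            lines.append(f"STORE {step['source']} INTO '{step['path']}';")
        else:
            raise ValueError(f"unsupported step: {op}")
    return "\n".join(lines)


# Hypothetical flow, as an ETL tool's UI might export it.
flow = [
    {"op": "load", "alias": "raw", "path": "input/sales",
     "schema": [("region", "chararray"), ("amount", "double")]},
    {"op": "filter", "alias": "big", "source": "raw", "condition": "amount > 100.0"},
    {"op": "group", "alias": "grp", "source": "big", "key": "region"},
    {"op": "sum", "alias": "totals", "source": "grp", "bag": "big", "field": "amount"},
    {"op": "store", "source": "totals", "path": "output/sales_by_region"},
]

script = to_pig(flow)
print(script)
```

The point of the design is that analysts describe the flow in a familiar tool while the generated script runs on the cluster.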
Persistent Connector Framework
OUT OF THE BOX
• Database, Data Warehouse
• Microsoft Exchange
• Web proxy
• IBM Tivoli
• BBCA
• Generic Push connector for *any* content
FEATURES
• Bi-directional connector (as applicable)
• Supports Push/Pull mechanism
• Stores data on HDFS in an optimized format
• Supports masking of data
WHY CONNECTOR FRAMEWORK
• Pluggable Architecture
20+ Years
Persistent Data Connectors
Persistent’s Breadth of Big Data Capabilities
Persistent IP for Big Data Solutions
● Horizontal and Vertical Pre-built Solutions
● Big Data Platform (PeBAL) analytics libraries and Connectors
Big Data Platform Components
Tooling
• RDBMS/DWH to import/export data
• Text Analytics libraries
• Data Visualization using Web 2.0 and reporting tools - Cognos, Microstrategy
• Ecosystem tools like Nutch, Katta, Lucene
IT Management
• Job configuration, management and monitoring with BigInsights’ job scheduler (MetaTracker)
• Job failure and recovery management
Big Data Application Programming
• Deep JAQL expertise - JAQL Programming, Extending JAQL using UDFs, Integration of third party tools/libraries, Performance tuning, ETL using JAQL
• Expertise in MR programming - PIG, Hive, Java MR
• Deep expertise in analytics - Text Analytics - IBM’s text extraction solution (AQL + SystemT)
• Statistical Analytics - R, SPSS, BigInsights Integration with R
Distributed File Systems
• HDFS
• IBM GPFS
Cluster Layer
• Platform Setup on multi-node clusters, monitoring, VM based setup
• Product Deployment
Persistent Roadmap to Big Data
1. Learn: Discover and Define Use Cases
2. Initiate: Validate with a POC
3. Scale: Upgrade to Production if Successful
4. Measure: Measure Effectiveness and Business Value
5. Manage: Improve Knowledge Base and Shared Big Data Platform
Customer Analytics
Identifying your most influential customers?
1. Build a social graph of all customers
2. Overlay sales data on the graph
3. Identify influential customers using network analysis
4. Target these customers for promotions.
70 million customers, > 1 billion transactions over twenty years
Few thousand influential customers
Targeting influential customers is the best way to improve campaign ROI!
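The four steps above can be sketched on a toy graph. The slides do not say which network-analysis measure is used, so this sketch assumes PageRank computed by power iteration; the customer names, edges, sales figures and the helper names `build_graph` and `pagerank` are all hypothetical.

```python
from collections import defaultdict


def build_graph(edges):
    """Step 1: build an undirected social graph as an adjacency map."""
    neighbors = defaultdict(set)
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    return neighbors


def pagerank(neighbors, damping=0.85, iterations=50):
    """Step 3: score each customer with PageRank via power iteration."""
    nodes = sorted(neighbors)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        updated = {}
        for n in nodes:
            # Each neighbor passes on an equal share of its own rank.
            incoming = sum(rank[m] / len(neighbors[m]) for m in neighbors[n])
            updated[n] = (1 - damping) / len(nodes) + damping * incoming
        rank = updated
    return rank


# Hypothetical customers: who is connected to whom.
edges = [("ann", "bob"), ("ann", "cara"), ("ann", "dev"), ("dev", "eve")]
# Hypothetical sales per customer.
sales = {"ann": 120.0, "bob": 900.0, "cara": 450.0, "dev": 300.0, "eve": 80.0}

graph = build_graph(edges)
rank = pagerank(graph)

# Step 2: overlay sales, here as the total sales of each customer's direct contacts.
neighbor_sales = {c: sum(sales[m] for m in graph[c]) for c in graph}

# Step 4: pick the top-ranked customers as promotion targets.
influencers = sorted(rank, key=rank.get, reverse=True)[:1]
print(influencers, neighbor_sales[influencers[0]])
```

Note that the most influential customer here is not the biggest spender: the ranking depends on position in the graph, and the sales overlay shows how much spending sits one step away.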
Overview of Email Analytics
● Key Business Needs
– Ensure compliance with respect to a variety of business and IT communications and information sharing guidelines.
– Provide an ongoing analysis of customer sentiment through email communications.
● Use Cases
– Quickly identify if there has been an information breach or if information is being shared in ways that are not in compliance with organizational guidelines.
– Identify if a particular customer is not being appropriately managed.
● Benefits
– Ability to proactively manage email analytics and communications across the organization in a cost-effective way.
– Reduce the response time to manage a breach and proactively address issues that emerge through ongoing analysis of email.
Using Email to Analyze Customer Sentiment
Sense the mood of your customers through their emails
Carry out detailed analysis on customer team interactions and response times
Analyzing Prescription Data
1.5 million patients are harmed by medication errors every year
Identifying erroneous prescriptions can save lives!
Source: Center for Medication Safety & Clinical Improvement
Overview of IT Analytics
● Key Business Needs
– Troubleshooting issues in the world of advanced and cloud based systems is highly complex, requiring analysis of data from various systems.
– Information may be in different formats, locations, granularity, data stores.
– System outages have a negative impact on short-term revenue, as well as long-term credibility and reliability.
– The ability to quickly identify if a particular system is unstable and take corrective action is imperative.
● Use Cases
– Identify security threats and isolate the corresponding external factors quickly.
– Identify if an email server is unstable, determine the priority and take preventative action before a complete failure occurs.
● Benefits
– Reduced maintenance cost
– Higher reliability and SLA compliance
Consumer Insight from Social Media
Find out what customers are saying about your organization or product on social media
Insights for Satyamev Jayate – Variety of sources
1. Structured Analysis: responses to Pledge, multiple choice questions
2. Unstructured Analysis: responses to the following questions
• Share your story
• Ask a question to Aamir
• Send a message of hope
• Share your solution
Content Filtering Rating Tagging System (CFRTS): L0, L1, L2 phased analytics
3. Impact Analysis: crawling the general internet to measure the before & after scenario on a particular topic
Web/TV Viewer inputs: Response to Pledge, multiple choice questions; Web, emails, IVR/Calls; Individual blogs; Social widgets; Videos; …
Channels: IVR; SMS; Web, Social Media (Structured); Social Media (unstructured)
Rigorous Weekly Operation Cycle producing instant analytics
Killer combo of Human+Software to analyze the data efficiently
● Topic opens on Sunday
● Live Analytics report is sent during the show
● Data capture from SMS, phone calls, social media, website
● System runs L0 Analysis; L1, L2 Analysts continue
● JSONs are created for the external and internal dashboards
● Featured content is delivered thrice a day throughout the week
● Episode Tags are refined and messages are re-ingested for another pass
Thank you
Anand Deshpande (anand@persistent.co.in)
http://in.linkedin.com/in/ananddeshpande
Persistent Systems Limited
www.persistentsys.com
Next Generation Sequencing
Sequencing machines are getting affordable
Running cost of sequencing is going down
NGS machines generate TBs of data per week.
Need to analyze this data in time
Analysis results are critical for human life, personalized medicines