Presented on April 17th for InnoTech Dallas.
BIG DATA = BIG DECISIONS
Bob Zurek | SVP Products | Epsilon | www.epsilon.com
Consider the following:
• New model for data
• Accessible over TCP/IP and a variety of languages
• Initially difficult to understand
• Capable of processing thousands of ops/sec
• Very different from old model
• Threatening, as much was invested in old model
• Changing course seems ridiculous
Source: Eben Hewitt
Source: IBM
IBM IMS
“IMS is IBM's premier transaction and hierarchical database management system, virtually unsurpassed in database and transaction processing availability and speed” – IBM 2013
“Mission-critical processing that requires unparalleled performance is best served by a hierarchical model. Analytics and business intelligence are best served by a relational model. Most Fortune 100 companies use both.”
A New Model Is Invented
A Disruptive Model
A Threatening Model
A Competitive Model
Data evolution
Source: Eben Hewitt
Source: McKinsey
Big data – a growing torrent
$600 to buy a disk drive that can store all of the world’s music
5 billion mobile phones in use in 2010
30 billion pieces of content shared on Facebook every month
40% projected growth in global data generated per year vs. 5% growth in global IT spending
235 terabytes of data collected by the U.S. Library of Congress by April 2011
15 out of 17 sectors in the United States have more data stored per company than the U.S. Library of Congress
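The gap between 40% data growth and 5% IT spending growth widens quickly. A minimal sketch of that arithmetic follows, assuming both figures are annual rates that compound year over year (the slide does not say so explicitly). The function name `compound` and the chosen time horizons are illustrative.

```python
def compound(rate, years):
    """Growth multiple after `years` of compounding at `rate` per year."""
    return (1 + rate) ** years


# Rates taken from the slide; treating them as compounding is an assumption.
DATA_GROWTH = 0.40
IT_SPEND_GROWTH = 0.05

for years in (1, 5, 10):
    data = compound(DATA_GROWTH, years)
    spend = compound(IT_SPEND_GROWTH, years)
    # The "gap" is how much faster data has grown than the budget to handle it.
    print(f"{years:>2} yr: data x{data:.2f}, IT spending x{spend:.2f}, "
          f"gap x{data / spend:.2f}")
```

Under this assumption, data volume outpaces spending by a growing multiple each year, which is the pressure the rest of the deck responds to.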
Big data confusion?
Source: IBM
What do business executives think “big data” is?
A greater scope of information: 18%
New kinds of data and analysis: 16%
Real-time information: 15%
Data influx from new technologies: 13%
Non-traditional forms of media: 13%
Large volumes of data: 10%
The latest buzzword: 8%
Social media data: 7%
Source: McKinsey
Big data is…
Large pools of data that can be captured, communicated, aggregated, stored, and analyzed
How are we solving (historically)?
• Vertical scaling = throw hardware at it
• Optimize the application = SQL, indexes, access
• Employ caching layers = Memcached, Coherence
• Denormalization = reduce joins
• Sharding/Shared Nothing = split the data up
• Innovation = columnar
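Of the historical approaches listed on this slide, sharding is the easiest to show in a few lines. The sketch below is a toy illustration, not any particular product's design: `shard_for` and `ShardedStore` are hypothetical names, and each shard is just an in-memory dict standing in for an independent database node.

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Map a record key to a shard index using a stable hash."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


class ShardedStore:
    """Toy shared-nothing store: every shard owns its own slice of the data."""

    def __init__(self, num_shards: int = 4) -> None:
        self.shards = [dict() for _ in range(num_shards)]

    def put(self, key: str, value) -> None:
        # Route the write to exactly one shard.
        self.shards[shard_for(key, len(self.shards))][key] = value

    def get(self, key: str):
        # Reads use the same routing, so no other shard is consulted.
        return self.shards[shard_for(key, len(self.shards))].get(key)


store = ShardedStore(num_shards=4)
for user_id in ("alice", "bob", "carol", "dave"):
    store.put(user_id, {"name": user_id})
```

Simple modulo routing like this means changing the shard count moves most keys, which is one reason manual sharding is painful to operate.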
Doug Cutting = Nutch
Google = GFS and GMR
A search engine project at Yahoo
Big data innovation incubated
“Hadoop is an amazing technology stack. We now depend on it to run eBay.”
Bob Page, Vice President of Analytics, eBay
Source: http://www.wired.com/wiredenterprise/2011/10/how-yahoo-spawned-hadoop/
eBay erected a Hadoop cluster spanning 530 servers – now five times the size!
It can get complex and confusing
“It replaced our need for ETL”
“It is great for batch processing in parallel”
“A beautiful platform for all of problems”
What it’s not good for
• High volume transactional data
• Structured data with low latency
“Note that Hadoop is not an Extract-Transform-Load (ETL) tool. It is a platform that supports running ETL processes in parallel. The data integration vendors do not compete with Hadoop; rather, Hadoop is another channel for use of their data transformation modules.”
Teradata/Cloudera Presentation
What it’s really good for
• Index building
• Pattern recognition
• Sentiment analysis
• Machine generated data
• Log processing
• Web scale = Google, Twitter, YouTube
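Log processing, from the list above, fits the map/reduce pattern well. The following is a single-process sketch of that pattern in plain Python, not Hadoop's actual API: the log format, the sample lines, and the function names are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical access-log lines in "METHOD PATH STATUS" form.
LOG_LINES = [
    "GET /index.html 200",
    "GET /missing 404",
    "POST /login 200",
    "GET /index.html 200",
]


def map_phase(line):
    """Emit one (status_code, 1) pair per log line."""
    _method, _path, status = line.split()
    yield status, 1


def shuffle(pairs):
    """Group emitted values by key, as the framework would between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Collapse all values for one key into a single count."""
    return key, sum(values)


def run_job(lines):
    pairs = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle(pairs)
    return dict(reduce_phase(k, v) for k, v in grouped.items())


status_counts = run_job(LOG_LINES)
```

Because each line is mapped independently and each key is reduced independently, both phases can be spread across many machines, which is what makes the pattern suit batch work over large logs.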
Use Cases
Online Travel Reservations
Mobile Data
E-Commerce
Energy Discovery
Energy Savings
Infrastructure Management
Image Processing
Fraud Detection
IT Security
HealthCare
Analyze machine generated data
Semantic analysis for relevance
Suggest ways customers save money
Spot fraud anomalies
Process mobile data
Large marketplaces
Sort and process seismic data
Detecting patterns in satellite imagery
Travel booking
Collecting device logs
Relational is still in play
Some innovations worth a look
Dynamically Scaling OLTP = “No Need To Shard”
The NoSQL generation
• Document Storage Model
• Allows MTV to store hierarchical data
• Flexible schema to model structure/data by brand
• Needed to have ability to query nested content
• No need for a shared disk storage
• Released by NSA to open source
• Apache Accumulo
• Based on Google Big Table
• Built on top of Hadoop
• Fine-grained access control
• Cell level security
• Server side programming
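Cell level security means each stored value carries its own access labels, and a reader only sees the cells their authorizations cover. The sketch below illustrates that idea in plain Python and does not use Accumulo's API. It is deliberately simplified to AND-only labels, whereas Accumulo's visibility expressions also allow OR and grouping. `Cell`, `visible_cells`, and the sample rows are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cell:
    row: str
    column: str
    value: str
    required_labels: frozenset  # reader must hold ALL of these (simplification)


def visible_cells(cells, authorizations):
    """Return only the cells whose labels are covered by the reader's auths."""
    auths = frozenset(authorizations)
    return [c for c in cells if c.required_labels <= auths]


# Illustrative data: one row whose columns have different sensitivity.
CELLS = [
    Cell("record1", "name", "example-name", frozenset()),
    Cell("record1", "diagnosis", "example-diagnosis", frozenset({"medical"})),
    Cell("record1", "billing", "example-billing", frozenset({"medical", "admin"})),
]

nurse_view = visible_cells(CELLS, {"medical"})
```

Two readers scanning the same row get different results, so access control does not require splitting the data into separate tables per audience.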
• Schemaless model = Easy to add fields
• Document oriented = JSON format (think objects)
• Built from the ground up to be distributed
• Auto sharding
• Distributed querying capabilities
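A schemaless, document-oriented model can be pictured with ordinary JSON-style objects. This sketch uses plain Python dicts rather than any specific document database, and the documents, field names, and the helpers `get_path` and `find` are invented for illustration.

```python
import json

# Two documents in the same collection with different shapes.
documents = [
    {"id": 1, "brand": "brand-a",
     "show": {"title": "Show One", "seasons": [{"number": 1}]}},
    {"id": 2, "brand": "brand-b",
     "show": {"title": "Show Two"}, "tags": ["music"]},
]


def get_path(doc, dotted_path, default=None):
    """Walk a dotted path such as 'show.title' through nested dicts."""
    current = doc
    for part in dotted_path.split("."):
        if not isinstance(current, dict) or part not in current:
            return default
        current = current[part]
    return current


def find(docs, dotted_path, expected):
    """Return documents whose nested field equals the expected value."""
    return [d for d in docs if get_path(d, dotted_path) == expected]


# Adding a field to one document needs no migration of the others.
documents[0]["rating"] = "example-rating"

matches = find(documents, "show.title", "Show Two")
as_json = json.dumps(documents[0], sort_keys=True)
```

The trade-off is that the application, not the database, has to tolerate documents that lack a field, as `get_path` does with its default.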
Why NoSQL?
NoSQL Use Case
1. Click/Event into Hadoop
2. Data Analyzed via Map Reduce jobs; generates 100M profiles based on campaigns running
3. Selected profiles loaded into Couch
4. Ad targeting logic queries Couch with sub-second latency to optimize decision and real-time ad placement
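The four steps above can be sketched end to end in one process. This is a toy stand-in under stated assumptions: a list of dicts plays the role of events landed in Hadoop, `build_profiles` stands in for the Map Reduce jobs, and a plain dict stands in for Couch. All names, the sample events, and the `min_clicks` selection rule are hypothetical.

```python
from collections import Counter, defaultdict

# Step 1: click/event records, as they might land in the batch store.
EVENTS = [
    {"user": "u1", "campaign": "shoes"},
    {"user": "u1", "campaign": "shoes"},
    {"user": "u1", "campaign": "travel"},
    {"user": "u2", "campaign": "travel"},
]


def build_profiles(events):
    """Step 2: batch analysis, counting clicks per campaign for each user."""
    profiles = defaultdict(Counter)
    for event in events:
        profiles[event["user"]][event["campaign"]] += 1
    return profiles


def load_serving_store(profiles, min_clicks=2):
    """Step 3: load only selected profiles into the low-latency store."""
    store = {}
    for user, counts in profiles.items():
        if sum(counts.values()) >= min_clicks:
            store[user] = {"top_campaign": counts.most_common(1)[0][0]}
    return store


def choose_ad(store, user, default="generic"):
    """Step 4: a key lookup at request time drives the ad decision."""
    profile = store.get(user)
    return profile["top_campaign"] if profile else default


serving_store = load_serving_store(build_profiles(EVENTS))
```

The split matters because the expensive analysis runs in batch, while the request path is reduced to a single key lookup, which is what makes sub-second decisions feasible.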
Source: Couchbase
Hadoop Augmentation
• Side-by-Side will be commonplace
• ETL solutions support Hadoop
• Relational Databases
  • Provide ETL interfaces to Hadoop
  • Execute map/reduce jobs inside DBMS
• NoSQL supports ETL
Example Hybrid DBMS Systems
Oracle Endeca Server
• Hybrid Search/Analytic Database
• Supports structured, unstructured, semi-structured
• No schema required. Records stacked.
• Columnar
Trends
• SQL On Hadoop – Hadapt, Cloudera Impala, EMC
• Unified Support of Structured, Unstructured, Semi-Structured
• Embedding Search
• Expanded ETL/ELT Support
• Big Data In Motion Takes Hold
• Added Data Mining and Analytic Functions In NoSQL
• Embedding R Language = gain in popularity
• Data Scientists instrumental in business success