DESCRIPTION
Watch this presentation by Andrei Yurkevich, Altoros's President and CTO, to learn about the main challenges that cause big data projects to fail. It presents a strategy that can help you mitigate risks when planning a large-scale, long-term project, with vivid examples of the mistakes Altoros made and of how the issues were overcome with a prototype. See more at http://blog.altoros.com/big-data-analytics-2013-in-london.html
© ALTOROS Systems | CONFIDENTIAL
Big Data, Big Projects, Big Mistakes: How to Jumpstart and Deliver with Success
Andrei Yurkevich, Chief Technology Officer
andrei.yurkevich@altoros.com
About Altoros
• Hadoop/NoSQL performance engineering
• Cluster automation and server templates on Joyent, AWS, SoftLayer, Rackspace, CloudStack, and OpenStack using Chef/Puppet, RightScale, and SCALR
• 300+ employees globally (UK, USA, Denmark, Switzerland, Norway, Belarus, Argentina)
It's a Mad Mad Big Data World
5^6 = 15,625 combinations
Big Data Traps
• No clear business goals
• Big amounts of data from many sources
• Architecture design
• The variety of tools
• Compatibility of technologies/platforms
• Lack of professionals
• All features in one release
• Budget
Project Requirements
1 million sensors generate 2.5 TB of data daily
Functional requirements | Value
Amount of data added daily | 2.5 TB
Data type | raw data, processed data
Data storage time, raw data | at least a week
Data storage time, processed data | at least a year
Response time for building reports based on a pre-set template | < 30 sec
Response time for building reports for a custom period of time | < 6 hours
Uptime | 99%
Fault-tolerance | required
Deployment cost per day | < $1,000
Non-functional requirements
• Infrastructure-independent architecture
• Scalability
• Open-source tools
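The stated volumes translate into a few numbers that are useful when sizing storage and ingest. The following is a minimal sketch, assuming decimal units (1 TB = 10^12 bytes) and a load spread evenly over the day; neither assumption comes from the slides, and the variable names are illustrative only.

```python
# Back-of-the-envelope sizing from the stated requirements.
# Assumptions (not from the slides): decimal units and evenly spread load.
TB = 10**12
MB = 10**6

sensors = 1_000_000
daily_raw_bytes = 2.5 * TB
raw_retention_days = 7  # raw data must be kept for at least a week

per_sensor_mb_per_day = daily_raw_bytes / sensors / MB
avg_ingest_mb_per_sec = daily_raw_bytes / (24 * 60 * 60) / MB
raw_storage_tb = daily_raw_bytes * raw_retention_days / TB

print(f"per sensor: {per_sensor_mb_per_day:.1f} MB/day")
print(f"average ingest: {avg_ingest_mb_per_sec:.1f} MB/s")
print(f"raw data kept for a week: {raw_storage_tb:.1f} TB before replication")
```

Replication and peak load would push the real figures higher, so these values are best treated as lower bounds.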
Infrastructure
Feature | Amazon AWS | Joyent | Rackspace
Types of contract | On Demand, Reserved, Spot | On Demand, Reserved | On Demand
Types of instances (classified by compute units) | General Purpose, Compute optimized, Memory optimized, Storage optimized | Standard, High Memory, High CPU, High Storage, High I/O | General Purpose
Storage options | EBS, S3, Low-cost storage | Network storage based on ZFS | Cloud Block Storage, Cloud Files
Operating systems | Linux, Windows | SmartOS, Linux, Windows | Linux, Windows
Management console | AWS Console | Joyent SmartDataCenter | Cloud Control Panel
Cloud API | Command line interface; Java, .NET, Ruby SDK and API | Command line interface (CLI); Node.js SDK; REST API | REST API
Regions | America, Europe, Asia, Australia | North America, Europe | America, Europe, Asia, Australia
Estimated cost per month | $18,300 | $17,500 | $21,350
Option 2 Option 1
Score 1.5 3.5
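The monthly estimates can be compared with the daily deployment budget from the project requirements. This is a rough sketch, assuming a 30-day month and assuming the monthly figures cover the same scope as the "deployment cost per day" requirement; the slides confirm neither, and `daily_cost` is a hypothetical helper.

```python
# Monthly estimates from the comparison table, in USD.
monthly_cost = {"Amazon AWS": 18_300, "Joyent": 17_500, "Rackspace": 21_350}

DAILY_BUDGET = 1_000  # "Deployment cost per day: < $1,000"
DAYS_PER_MONTH = 30   # assumption, not stated in the slides


def daily_cost(monthly):
    """Convert a monthly estimate into an average daily cost."""
    return monthly / DAYS_PER_MONTH


for provider, monthly in monthly_cost.items():
    per_day = daily_cost(monthly)
    verdict = "within budget" if per_day < DAILY_BUDGET else "over budget"
    print(f"{provider}: ${per_day:,.0f}/day ({verdict})")
```

Under these assumptions all three providers stay below the daily limit, so cost alone would not decide the choice.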
Choosing a Database
Features | HBase | Cassandra | MongoDB | MySQL Cluster
License | Apache | Apache | AGPL | GPL
Protocol | HTTP/REST (also Thrift) | Thrift and custom binary CQL3 | Custom, binary (BSON) | JDBC, ODBC
Data model | Column family | Column family | JSON documents | Tables
Queries / Query language | JRuby-based (JIRB) shell | Cassandra Query Language | JavaScript expressions | SQL
Partitioning strategy | Ordered partitioning | Random partitioning | Sharding by key | Partition by key
Replication between nodes | yes | yes | yes | yes
Replication between data centers | no | yes | no | yes
Capability to store 2.5 TB daily | yes | yes | yes | yes
Implementation experience | 1+ | 1+ | 2+ | 5+
Score (before deployment cost) | 2 | 3 | 2 | 5
Deployment cost per day | $450 | $400 | $500 | $1,500
Score (with deployment cost) | 2.5 | 4 | 2.5 | 0
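The slides do not explain how the scores were calculated. One plausible reading of MySQL Cluster's score falling from 5 to 0 once deployment cost appears is that the daily budget acts as a hard constraint. The sketch below illustrates that reading only; `within_budget` is a hypothetical helper, not the authors' scoring method.

```python
DAILY_BUDGET = 1_000  # USD, from the project requirements

# Deployment cost per day (USD) as listed in the comparison table.
deployment_cost = {
    "HBase": 450,
    "Cassandra": 400,
    "MongoDB": 500,
    "MySQL Cluster": 1_500,
}


def within_budget(costs, budget):
    """Return the candidates whose daily cost is strictly below the budget."""
    return {name: cost for name, cost in costs.items() if cost < budget}


shortlist = within_budget(deployment_cost, DAILY_BUDGET)
rejected = sorted(set(deployment_cost) - set(shortlist))

print("shortlist:", ", ".join(shortlist))
print("rejected:", ", ".join(rejected))
```

Applying the hard constraint before any weighted scoring keeps an otherwise strong candidate from winning on familiarity alone.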
Choosing a database: Cassandra, MongoDB, HBase
Storing Raw Data
Choosing a Database
Feature | HBase | Cassandra | MongoDB
Replication between data centers | Asynchronous, needs testing | Replicas can span data centers with synchronous replication | Not supported
Cluster admin node | NameNode | Any node | mongos process
Implementation experience | 1+ | 1+ | 2+
Time spent on inserting 30 MB of data | 7 sec | 9 sec | 20 sec
Deployment cost per day | $450 | $400 | $500
Score | 2 | 2.5 | 0
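The insert timings are easier to compare as throughput. A minimal sketch, assuming decimal megabytes and a single test run per database; the slides do not describe the test setup, and `throughput_mb_per_sec` is an illustrative helper.

```python
PAYLOAD_MB = 30

# Seconds spent inserting the 30 MB payload, as listed in the table.
insert_seconds = {"HBase": 7, "Cassandra": 9, "MongoDB": 20}


def throughput_mb_per_sec(seconds, payload_mb=PAYLOAD_MB):
    """Average write throughput for one timed insert run."""
    return payload_mb / seconds


for db, seconds in insert_seconds.items():
    print(f"{db}: {throughput_mb_per_sec(seconds):.2f} MB/s")
```

For scale, 2.5 TB per day averages roughly 29 MB/s in decimal units, so these figures are best read as a relative comparison between the candidates rather than as a capacity estimate.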
Architecture of the System
Examples of reports
Storing Processed Data
Prototype's Correspondence to the Initial Requirements
Requirement | Prototype
Storing 2.5 TB of raw data daily for a week | Capable
Storing 1.5 TB of processed data for a year | Capable
Response time for building reports based on a pre-set template | ~25 sec
Response time of less than 6 hours for building a custom report | ~7 hours
Scalability | Good
Infrastructure independence | Yes
Using open-source tools | For all components
Fault-tolerance | Yes
Deployment cost per day < $1,000 | ~$600
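The measurable rows can be checked mechanically against their limits. This is a small sketch: the limits come from the Project Requirements slide, the measured values are the approximate prototype figures above, and the `Check` class and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Check:
    name: str
    limit: float     # requirement: the measured value must be below this
    measured: float  # approximate value reported for the prototype
    unit: str

    @property
    def passed(self) -> bool:
        return self.measured < self.limit


checks = [
    Check("Pre-set template report", 30, 25, "sec"),
    Check("Custom-period report", 6, 7, "hours"),
    Check("Deployment cost per day", 1_000, 600, "USD"),
]

for c in checks:
    status = "met" if c.passed else "NOT met"
    print(f"{c.name}: {c.measured} {c.unit} vs < {c.limit} {c.unit} -> {status}")

# Names of the requirements the prototype does not yet satisfy.
failed = [c.name for c in checks if not c.passed]
```

With these figures only the custom-period report misses its target, which is the kind of bottleneck a prototype is meant to expose before the real system is built.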
How to Make a Big Data Project Work
• Properly visualize and test the functionality
• Detect bottlenecks and change a technology, tool, or database before it is implemented in the real system
• Get a realistic vision of the final solution
• Make sure you stick to the budget
Andrei Yurkevich, President/CTO
andrei.yurkevich@altoros.com