Large dataset processing in the Cloud
Kevin Glenny and GridwiseTech team
Simplified data-oriented system
Internal or external
data sources
applications working on data
IT systems are constantly growing
Increased number of users
Increased number of applications
Increased amount of data
Infrastructure bottleneck
Example
Electronics manufacturer
24/7 production
Report computation takes too long for decision making
2.5 million transactions daily
4 TB of data to manage
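To put the example workload in perspective, the daily transaction count can be turned into an average rate. This is a minimal sketch that assumes the load is spread evenly over the 24/7 production day; real peaks would be higher, and the function name is illustrative only.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 24/7 production: the full day counts


def average_rate(transactions_per_day: float) -> float:
    """Average transactions per second, assuming an even spread over the day."""
    return transactions_per_day / SECONDS_PER_DAY


if __name__ == "__main__":
    rate = average_rate(2_500_000)  # figure from the example above
    print(f"Average load: {rate:.1f} transactions/second")
```

The average comes to roughly 29 transactions per second around the clock, which is why reports computed over the accumulated data struggle to keep up.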
What is Cloud computing?
“Transparent access to capabilities using a pay-per-use business model”
Benefits:
– Dynamic scaling
– Pay-for-use
– Off-shored administration
What are the delivery models?
SaaS (Software as a Service)
– SalesForce.com, 63,00 clients
PaaS (Platform as a Service)
– Google App Engine (2008), Microsoft Azure (2008)
IaaS (Infrastructure as a Service)
– Amazon Elastic Compute Cloud, 8.2 million instances launched since 2006
Application data processing
Database sharding (MySQL, PostgreSQL etc.)
NoSQL (Google's BigTable, Amazon's Dynamo etc.)
Data-grid (GigaSpaces XAP, Oracle Coherence, Infinispan etc.)
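As an illustration of the sharding idea listed above, the sketch below routes each record key to one of several database shards by hashing the key. The shard names and the helper functions are hypothetical, and simple modulo hashing is only one possible routing scheme; it is not taken from any of the products named here.

```python
import hashlib

# Hypothetical shard identifiers; in practice these would map to database hosts.
SHARDS = ["shard-0", "shard-1", "shard-2", "shard-3"]


def shard_for(key: str, shards=SHARDS) -> str:
    """Pick a shard deterministically from a stable hash of the key."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(shards)
    return shards[index]


def group_by_shard(keys, shards=SHARDS) -> dict:
    """Group keys by target shard so each shard can be queried once."""
    groups = {}
    for key in keys:
        groups.setdefault(shard_for(key, shards), []).append(key)
    return groups


if __name__ == "__main__":
    orders = [f"order-{n}" for n in range(10)]
    for shard, keys in sorted(group_by_shard(orders).items()):
        print(shard, keys)
```

One caveat of this scheme: changing the number of shards remaps most keys, so setups that scale dynamically often use consistent hashing or a lookup directory instead.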
Data-grid and sharding in the Cloud
All data processing and persistence in the Cloud
Achievements:
• Near real-time
• Dynamic scaling (application and resources)
• Pay-per-use
• Reduced administration
• High availability (HA)
Remaining issues
Getting large datasets in and out of the Cloud
– Bandwidth limited client side
– Resort to mailing hard drives!
Performance - 2% to 50% slowdown
Data security/privacy - trust
SLAs – plan for the worst
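The bandwidth issue above can be made concrete with a rough estimate of how long an upload would take. This is a back-of-the-envelope sketch: the 4 TB figure comes from the earlier example, while the link speeds and the `efficiency` parameter are assumptions chosen for illustration.

```python
SECONDS_PER_DAY = 86_400


def transfer_days(dataset_bytes: float, link_bits_per_second: float,
                  efficiency: float = 1.0) -> float:
    """Days needed to move a dataset over a link.

    efficiency is the assumed fraction of the nominal link speed that is
    actually usable (1.0 means the full nominal rate).
    """
    seconds = dataset_bytes * 8 / (link_bits_per_second * efficiency)
    return seconds / SECONDS_PER_DAY


if __name__ == "__main__":
    four_tb = 4 * 10**12  # 4 TB in decimal units
    # Assumed client-side uplink speeds, in megabits per second.
    for mbps in (10, 100, 1000):
        days = transfer_days(four_tb, mbps * 10**6)
        print(f"{mbps:>5} Mbit/s: {days:6.2f} days")
```

At an assumed 10 Mbit/s uplink, 4 TB takes more than a month of continuous transfer, which is why shipping physical drives can be the faster option.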
Conclusions
Datasets in data-oriented systems grow, causing bottlenecks
Datasets in the Cloud can be processed using scalable technologies
Challenges remain
Main challenge – how to get the data into the Cloud?