Dealing with datasets that grow so large that they become awkward to work with using on-hand database management tools.

Handling BigData On the Public Cloud

Based on InterOp 2011 presentation by Liran Zelkha ([email protected])

Co-founder of ScaleBase
Before that, led Aluna, a database and architecture consulting company
Over 15 years of hands-on technology experience

Agenda

What is Big Data
Big Data On Public Clouds
Some solutions

What is Big Data?

Big Data (from Wikipedia)

…are datasets that grow so large that they become awkward to work with using on-hand database management tools. Difficulties include capture, storage, search, sharing, analytics, and visualizing. This trend continues because of the benefits of working with larger and larger datasets allowing analysts to "spot business trends, prevent diseases, combat crime." Though a moving target, current limits are on the order of terabytes, exabytes and zettabytes of data.

Top 3 ways to know you have big data:

Number 3:

... you get a call from the utility company asking you not to run 'that brownout query' again. (@aristippus303 at Datawatch)

Number 2:

... it piles up so high that it disappears into the clouds (@evertlammerts - I assume pun was intended?)

Number 1:

... the SAN undergoes gravitational collapse and you get cited by OSHA for an unlicensed singularity. (@datamartist)

But seriously

It's not a single number
It is a set of parameters

Big Data Parameters

[Diagram: Big Data described by three parameters: Volume of Data, Velocity of Data, Complexity of Analysis]

http://www2.neilmcgovern.com/main.html

Where do we see big data?

Everywhere
Data Warehouse
OLTP
Web 2.0
SaaS
Billing
Fraud detection
CMS
…Family history
…Social networking

Volume of data

How much data do you have?
The more, the merrier
– Better analysis
Used to be measured in 100s of GB, then TB, now PB
But even a 300GB DB can still have Big Data problems
“If you have over 1TB of data – you have a Big Data problem”, IDC

Velocity of data

How many users access the data?
How many writes occur on your data?
How many transactions does your database have?
Measured in TPS, counted by the thousands
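
As a rough illustration of how velocity might be measured, the sketch below counts transactions per second from a list of commit timestamps. The function names `tps_per_second` and `peak_tps` and the sample timestamps are assumptions made for this example; they do not come from the presentation.

```python
from collections import Counter


def tps_per_second(commit_times):
    """Count transactions in each whole-second bucket.

    commit_times: iterable of non-negative commit timestamps in seconds.
    Returns a dict mapping second -> number of transactions.
    """
    return dict(Counter(int(t) for t in commit_times))


def peak_tps(commit_times):
    """Return the highest per-second transaction count (0 if empty)."""
    buckets = tps_per_second(commit_times)
    return max(buckets.values(), default=0)


# Hypothetical sample: three commits in second 0, one in second 1.
sample = [0.10, 0.45, 0.90, 1.20]
print(peak_tps(sample))
```

Peak rather than average TPS is used here because a short burst is usually what a database has to be sized for; an average over a long window would hide it.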

Complexity of Analysis

How complex are your queries?
An example:

SELECT *
FROM (
    -- Number the ordered rows and cut off at the end of the page
    SELECT w.*, ROWNUM rnum
    FROM (
        -- Distinct watchers with feed entries newer than their last
        -- processed entry (or none processed) and email not disabled
        SELECT DISTINCT w.watcher_id
        FROM watch w
        LEFT OUTER JOIN Profile p ON p.watcher_id = w.watcher_id
        JOIN atom_feed af ON af.resource_id_hash = w.resource_id_hash
        JOIN atom_feed_entry afe ON afe.atom_feed_id = af.atom_feed_id
        WHERE (p.LAST_ENTRY_PROCESSED_DATE IS NULL
               OR p.LAST_ENTRY_PROCESSED_DATE < afe.create_date)
          AND (p.email_enabled_flag IS NULL OR p.email_enabled_flag != 'F')
          AND af.resource_id = w.resource_id
          AND afe.create_date <= sysdate - ?
        ORDER BY w.watcher_id ASC
    ) w
    WHERE ROWNUM <= ?
)
-- Skip the rows that precede the requested page
WHERE rnum > ?;

Big Data on Public Clouds

Again from Wikipedia

– Public cloud or external cloud describes cloud computing in the traditional mainstream sense, whereby resources are dynamically provisioned on a fine-grained, self-service basis over the Internet, via web applications/web services, from an off-site third-party provider who bills on a fine-grained utility computing basis.

Public Cloud Implications

Pros:
Elastic
Unlimited storage
Unlimited capacity

Cons:
Performance
Standard hardware (no appliances...)

Some Solutions

Column Store Database

New databases that internally store the data in columns, and not rows.

Very good for OLAP
Excellent for BigData
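
To make the row-versus-column distinction concrete, here is a minimal sketch in plain Python. It is only an in-memory analogy, not how any particular column store is implemented; the sample records and the helper names (`to_columns`, `total_from_rows`, `total_from_columns`) are invented for this example.

```python
# Row layout: each record is stored together.
rows = [
    {"id": 1, "region": "EU", "amount": 120.0},
    {"id": 2, "region": "US", "amount": 80.0},
    {"id": 3, "region": "EU", "amount": 50.0},
]


def to_columns(records):
    """Pivot a list of row dicts into a dict of column lists."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columns(rows)


def total_from_rows(records, field):
    # Walks every record to pick out one field.
    return sum(record[field] for record in records)


def total_from_columns(cols, field):
    # Reads only the single column the aggregate needs.
    return sum(cols[field])


print(total_from_rows(rows, "amount"))
print(total_from_columns(columns, "amount"))
```

Both functions return the same total. The difference is in how much data is touched: an OLAP-style aggregate over one field only needs one contiguous column in the columnar layout, which is the property that makes the layout attractive for analytics.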

NoSQL Database

Again, from Wikipedia:

– NoSQL is the term used to designate database management systems that differ from classic relational database management systems (RDBMSes) in some way. These data stores may not require fixed table schemas, and usually avoid join operations and typically scale horizontally. Academics and papers typically refer to these databases as structured storage, a term that would include classic relational databases as a subset.

NoSQL Database

Non-relational databases
Usually store data in memory, replicated across multiple machines
Low latency

Unstructured Schema

Since SQL is not used, ERD can be dynamic
Some solutions store data as objects of any kind
Some use binary serialization of the object
Others use Map API (put, get)
Players include: Cassandra, HiveDB, MemBase, MongoDB
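
The Map API and binary serialization points above can be sketched as a toy in-memory store. This is an illustration only and does not mirror the API of any product named on the slide; the class name `MapStore`, the keys, and the sample objects are assumptions, and `pickle` stands in for whatever serialization a real store would use.

```python
import pickle


class MapStore:
    """Toy in-memory key-value store with a put/get interface."""

    def __init__(self):
        self._data = {}  # key -> serialized bytes

    def put(self, key, obj):
        # Store the object as an opaque binary blob; no schema is declared.
        self._data[key] = pickle.dumps(obj)

    def get(self, key, default=None):
        # Deserialize on read; unknown keys return the default.
        blob = self._data.get(key)
        return default if blob is None else pickle.loads(blob)


store = MapStore()
store.put("user:1", {"name": "Ada", "tags": ["admin"]})
# A second record with different fields: nothing enforces a common shape.
store.put("user:2", {"name": "Bob", "last_login": "2011-05-10"})
print(store.get("user:1")["name"])
```

Because values are opaque blobs, the store cannot join or filter on their fields; lookups go through the key. Note that `pickle` should only be used on data you trust.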

NewSQL

Dubbed by 451 analyst Matthew Aslett:

"NewSQL" is our shorthand for the various new scalable/high performance SQL database vendors. We have previously referred to these products as 'ScalableSQL' to differentiate them from the incumbent relational database products. Since this implies horizontal scalability, which is not necessarily a feature of all the products, we adopted the term 'NewSQL' in the new report. And to clarify, like NoSQL, NewSQL is not to be taken too literally: the new thing about the NewSQL vendors is the vendor, not the SQL.

New Databases

New database engines
Usually scale very well, can store a lot of data, and are targeted at virtual environments
Players include
– NimbusDB
– VoltDB

New MySQL Storage Engines

New databases that look like MySQL from the outside
– MySQL network protocol
– MySQL SQL flavor

Players include
– Akiban
– ScaleDB

ScaleBase

ScaleBase offers Database Load Balancers
Scalability and high availability for your database, totally transparent to your application
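
The slide does not say how ScaleBase works internally. Purely to illustrate what a database load balancer can do, the sketch below sends reads to replicas in round-robin order and everything else to a primary. All names here (`LoadBalancer`, `route`, the backend labels) are hypothetical; this is not ScaleBase's API or algorithm.

```python
from itertools import cycle


class LoadBalancer:
    """Toy router: SELECTs rotate over replicas, all else goes to the primary."""

    def __init__(self, primary, replicas):
        self.primary = primary
        # Fall back to the primary when no replicas are configured.
        self._readers = cycle(replicas or [primary])

    def route(self, sql):
        """Return the name of the backend that should run this statement."""
        parts = sql.split(None, 1)
        verb = parts[0].upper() if parts else ""
        if verb == "SELECT":
            return next(self._readers)
        return self.primary


lb = LoadBalancer("primary", ["replica-1", "replica-2"])
for statement in ["SELECT 1", "INSERT INTO t VALUES (1)", "SELECT 2", "SELECT 3"]:
    print(statement, "->", lb.route(statement))
```

The application only ever talks to `route`, so backends can be added or removed behind it without application changes, which is the sense of "transparent" this sketch tries to capture. A real balancer would also have to deal with transactions, replication lag, and failover, none of which are modeled here.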

Summary

There are many ways to handle BigData on cloud environments
Understand your data requirements well, and use the right tool for the job

No one tool fits them all!