NoSQL Databases - UMD

Preview:

Citation preview

NoSQL DatabasesAn efficient way to store and query heterogeneous

astronomical data in DACE Nicolas Buchschacher - University of Geneva - ADASS 2018

• Data and Analysis Center for Exoplanets.• Facility to store, exchange and analyse data related to exoplanets

(observations and synthetic populations).• Web front end and python API to query the database (in dev).• 3D visualisation (come to the demo booth !)

Nicolas Buchschacher - University of Geneva - ADASS 2018

DACEhttps://dace.unige.ch

Nicolas Buchschacher - University of Geneva - ADASS 2018

• Store heterogeneous observational data produced by different instruments.

• Regularly adapt the data model (we are at the end of the chain).• Store synthetic population simulations data points (~500M / Pop).• Ensure high availability (load balancing) and persistence (replications).

The Requirements

Nicolas Buchschacher - University of Geneva - ADASS 2018

CASSANDRA• Fully distributed database developed by Apache.• Column oriented storage => can have thousand columns in a table.• High availability (no Master-Slave). Every node is equivalent.• Asynchronous and automatic replication across the nodes based on

a replication factor.• Consistency level can be controlled for each query.

Nicolas Buchschacher - University of Geneva - ADASS 2018

CAP (Brewer) TheoremThis is impossible for a distributed database to simultaneously provide more than 2 of the 3 following requirements: • Consistency: Every read receives the most recent value, or an error.• Availability: Every read or write request receives a non-error reply.• Partition tolerance: The system continues to work properly if

there are some missing nodes.

=> Cassandra is considered as an AP database. But consistency can be managed with the “consistency level” in each query.

Nicolas Buchschacher - University of Geneva - ADASS 2018

CASSANDRA (Storage)

ID Param1 Param2 Param31 v 11 v 21 NULL2 v 12 NULL NULL3 v 13 NULL v 334 v 14 v 24 NULL

ID Param11 v 112 v 123 v 134 v 14

ID Param21 v 214 v 24

ID Param33 v 33

Row Oriented (SQL) Column Oriented

Nicolas Buchschacher - University of Geneva - ADASS 2018

CASSANDRA (Storage)

ID Param1 Param2 Param3 Param41 v 11 v 21 NULL NULL2 v 12 NULL NULL NULL3 v 13 NULL v 33 NULL4 v 14 v 24 NULL NULL5 v 15 NULL v 35 v 45

ID Param11 v 112 v 123 v 134 v 145 v 51

ID Param21 v 214 v 24

ID Param33 v 335 v 35

Column Oriented

ID Param45 v 45

Row Oriented (SQL)

Nicolas Buchschacher - University of Geneva - ADASS 2018

CASSANDRA Partitions & Replication

ID Param1 Param2 Param3 Param4

1 v 11 v 21 NULL v 41

2 v 12 NULL NULL NULL

3 v 13 NULL v 33 v 43

4 v 14 v 24 NULL v 44

R R

RR

Rep.

DC 1 (rep = 2)

Nicolas Buchschacher - University of Geneva - ADASS 2018

CASSANDRA Partitions & Replication

ID Param1 Param2 Param3 Param4

1 v 11 v 21 NULL v 41

2 v 12 NULL NULL NULL

3 v 13 NULL v 33 v 43

4 v 14 v 24 NULL v 44

R R

RR

Rep.

query ID = 2 or 4

DC 1 (rep = 2)

Nicolas Buchschacher - University of Geneva - ADASS 2018

CASSANDRA Partitions & Replication

DC 1 (rep = 2) DC 2 (rep = 3)

DC 3 (rep = 2)

Nicolas Buchschacher - University of Geneva - ADASS 2018

CASSANDRA Pros & Cons+ Store a lot of heterogenous columns without performance impact.+ Fully distributed (no Master-Slave): all nodes are equivalent.+ Easily scalable by adding new nodes.+ Open source, compatible with major operating systems.+ Easily expandable to the PB scale (Apple ex: 75’000 nodes).

- No relation between tables => No JOIN operations.- Not transactional (not important in our case).- Poor set of search and filter operations.

Nicolas Buchschacher - University of Geneva - ADASS 2018

Solr / SolrCloud• Open source enterprise search platform.• Powerful indexer (full-text search, spatial, filtering, sort …) based on

Apache Lucene.• REST API with a lot of supported formats (JSON, CSV, Python,…).• SolrCloud is the distributed version of Solr.

DSE combine both in a single software … but it’s not free !!!

+ =>

Nicolas Buchschacher - University of Geneva - ADASS 2018

SQL NoSQLFixed schema, avoid column / table manipulations

Flexible schema, easily add new columns / parameters

Vertical scalability. Increase CPU, RAM and disks Horizontal scalability (add nodes)

JOIN operations No JOIN, no subqueriesTransactional operations Not transactional

Centralised approach. Load balance using Master-Slave

De-centralised approach. Naturally load balanced

Long history and experience (1970) New generation (~2000)

SQL vs NoSQL comparisons

Nicolas Buchschacher - University of Geneva - ADASS 2018

Conclusion

• Is NoSQL better ? No !!! It’s different …• Consider NoSQL databases if you need to store heterogeneous data

and want a flexible data model.• NoSQL is Big Data oriented.• Consider NoSQL databases if availability and scalability is a strong

requirement.• SQL is still on the stage and very powerful (we still use Postgres)• Are VO standards ready for NoSQL ?

Nicolas Buchschacher - University of Geneva - ADASS 2018

Thank you

Recommended