Data analytics with NOSQL

Mukundan Agaram, Chris Weiss

Some initial thoughts about data...

Continual issues with large scale web apps
– Data growth + query response time
  ● Data growth => performance degradation
  ● Explosion of big data “analytics” use cases
– Increase in unstructured data
  ● More interconnectivity, more formats, lack of structure...
  ● Document oriented data (XML/JSON) is difficult to manage and search
– Distributed server configurations
  ● Large systems, more distribution and HA

Cloud services have aggravated these issues

Agenda for the night

● What is NOSQL?
● Varieties of NOSQL
● Key Industry Use Cases
● Applications for Data Analytics
● Landscape
● Demos/Walkthroughs
● Closing Discussions

What is NOSQL?

● “...mechanism for storage and retrieval of data that is modeled in means other than tabular relations used in relational databases.” (Wikipedia)
● Non SQL or Non-relational
● Not Only SQL
● Technically since the late 1960s...
  – E.g. IDMS, IMS, MUMPS, Cache, BerkeleyDB

What is NOSQL?

● Drivers for modern day NOSQL
  – Web 2.0
  – Big Data
  – Facebook, Google, Amazon, Expedia etc.
  – Horizontal scaling to clusters of computers
    ● Achilles heel for RDBMS
  – Cost
  – Provide
    ● HA
    ● Partition tolerance and data partitioning (a.k.a. sharding)
    ● Speed
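A minimal sketch of how sharding spreads keys across a cluster, assuming simple hash-modulo placement. The node names and the `shard_for` and `place` helpers are hypothetical and do not reflect any specific product's algorithm.

```python
import hashlib

# Hypothetical node names; a real cluster would discover its members dynamically.
NODES = ["node-a", "node-b", "node-c"]


def shard_for(key: str, nodes=NODES) -> str:
    """Map a key to a node using a stable hash and modulo placement."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return nodes[int(digest, 16) % len(nodes)]


def place(keys, nodes=NODES):
    """Group keys by the node that would own them."""
    placement = {node: [] for node in nodes}
    for key in keys:
        placement[shard_for(key, nodes)].append(key)
    return placement


if __name__ == "__main__":
    for node, keys in place(f"user:{i}" for i in range(10)).items():
        print(node, keys)
```

Modulo placement moves most keys whenever the node count changes, which is why many stores use consistent hashing or similar schemes instead.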

NOSQL - Drawbacks and Barriers

● Compromise on consistency (CAP Theorem)
● Custom query languages vs. SQL
● Lack of standardized interfaces
● Existing investments in RDBMS
● Most lack true ACID transactions
  – Use an “eventually” consistent model
  – Data is replicated with a conflict resolution algorithm
  – Methods for conflict resolution and distribution vary significantly
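One of the simplest conflict resolution strategies is last-write-wins. The sketch below assumes every write carries a timestamp and a node id as tie-breaker; the `Versioned`, `resolve` and `merge` names are made up for illustration. Real systems vary widely (vector clocks, CRDTs, application-level merges).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Versioned:
    value: str
    timestamp: float  # writer-supplied clock; assumed comparable across nodes
    node_id: str      # tie-breaker when two timestamps are equal


def resolve(a: Versioned, b: Versioned) -> Versioned:
    """Last-write-wins: keep the newer write, break ties by node id."""
    return max(a, b, key=lambda v: (v.timestamp, v.node_id))


def merge(replica_a: dict, replica_b: dict) -> dict:
    """Merge two replicas key by key so both sides converge to one state."""
    merged = dict(replica_a)
    for key, incoming in replica_b.items():
        merged[key] = resolve(merged[key], incoming) if key in merged else incoming
    return merged


if __name__ == "__main__":
    a = {"cart:42": Versioned("2 items", 100.0, "node-a")}
    b = {"cart:42": Versioned("3 items", 105.0, "node-b")}
    # Merging in either direction should yield the same result.
    print(merge(a, b) == merge(b, a))
```

Last-write-wins silently discards the older write, so it only suits data where losing a concurrent update is acceptable.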

CAP Theorem

● a.k.a. Brewer's theorem
● Impossible for a distributed computer system to simultaneously provide all three of:
  – Consistency
    ● All nodes see the same data at the same time
  – Availability
    ● Every request receives a response
  – Partition Tolerance
    ● The system keeps operating despite partitioning caused by network failures
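A toy illustration of the trade-off, assuming a two-replica store that either refuses writes during a partition ("CP") or accepts them locally ("AP"). The class names and the mode labels are assumptions for this sketch only; it does not model any real database.

```python
class Replica:
    def __init__(self):
        self.data = {}


class TwoNodeStore:
    """Toy two-replica store; mode is 'CP' or 'AP' (labels assumed here)."""

    def __init__(self, mode: str):
        self.mode = mode
        self.left, self.right = Replica(), Replica()
        self.partitioned = False

    def write(self, replica: Replica, key, value) -> bool:
        if self.partitioned:
            if self.mode == "CP":
                return False           # refuse: stay consistent, give up availability
            replica.data[key] = value  # accept locally: stay available, may diverge
            return True
        # Healthy network: replicate synchronously to both nodes.
        self.left.data[key] = value
        self.right.data[key] = value
        return True

    def consistent(self) -> bool:
        return self.left.data == self.right.data


if __name__ == "__main__":
    for mode in ("CP", "AP"):
        store = TwoNodeStore(mode)
        store.write(store.left, "x", 1)
        store.partitioned = True
        accepted = store.write(store.left, "x", 2)
        print(mode, "accepted:", accepted, "consistent:", store.consistent())
```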

CAP alignment for NOSQL

Source: http://blog.nahurst.com/visual-guide-to-nosql-systems

NOSQL direction

The landscape is morphing...
● Current NOSQL industry focus
  – Address large distributed systems in reaction to the CAP theorem
● The newer breed of NOSQL addresses important aspects such as ACID
● There is a new buzzword …
  – NewSQL

Database Evolution

NOSQL Model Classification

Key Value Stores & Caches: Data is represented as a collection of (K,V) pairs. In-memory, persistent or eventually persistent.

Document Databases: Data is stored in JSON document structures.

RDF, OWL & Triple Stores: Meaningful way to connect information. Supports inference over triples (S,P,O). Can be represented graphically. Queried with SPARQL.

Wide Column Databases: Extensible record set. Stores data tables as sections of columns. Great for EDW.

Graph Databases: Stores data as a graph G(V,E). Great for correlation analysis, recommendation engines and fraud detection.

Multi-model Databases: Combination of two or more varieties of the above.
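To make the classification concrete, here is a rough sketch of one invented fact ("alice follows bob") laid out in the shape each model favors. Plain Python structures stand in for the stores; the layouts are simplified assumptions, not any product's storage format.

```python
import json

# Key-value: an opaque value looked up by key.
key_value = {"user:alice": '{"name": "Alice", "follows": ["bob"]}'}

# Document: a nested, queryable JSON-like structure.
document = {"_id": "alice", "name": "Alice", "follows": ["bob"]}

# Triple store: (subject, predicate, object) statements.
triples = {("alice", "name", "Alice"), ("alice", "follows", "bob")}

# Wide column: row key -> column family -> sparse columns.
wide_column = {"alice": {"profile": {"name": "Alice"}, "follows": {"bob": True}}}

# Graph: vertices V and labeled edges E, as in G(V,E).
graph = {"V": {"alice", "bob"}, "E": {("alice", "follows", "bob")}}


def who_does_alice_follow():
    """Answer the same question against each representation."""
    return {
        "key_value": json.loads(key_value["user:alice"])["follows"],
        "document": document["follows"],
        "triples": sorted(o for s, p, o in triples if s == "alice" and p == "follows"),
        "wide_column": sorted(wide_column["alice"]["follows"]),
        "graph": sorted(dst for src, label, dst in graph["E"]
                        if src == "alice" and label == "follows"),
    }


if __name__ == "__main__":
    for model, answer in who_does_alice_follow().items():
        print(model, answer)
```

The key-value form has to fetch and decode the whole value, while the other forms expose structure that a store can index or traverse.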

NOSQL Models

● Key-Value
  – Cache (EHCache, BigMemory, Coherence, Memcached)
  – Store (Redis, Riak, AeroSpike, Oracle NoSQL)
● Document (MongoDB, CouchDB, Amazon DynamoDB)
● Wide Column (Cassandra, HBase, Vertica)
● Graph (Neo4j, Titan, Giraph)
● Multi-model (OrientDB, ArangoDB, Sqrrl)

Source: www.db-engines.com

Consider NOSQL for...

● Enabling “big data” and “web” scale
  – Massive distribution through horizontal scaling
● Performant queries (alternatives to RDBMS)
  – Denormalization and large horizontal scalability
● Massive write volumes (Facebook, Twitter)
● Fast and dynamic access to key data
● Flexible schemas and data types
● Data/Schema Migration
● Developer centric environments

Consider NOSQL for...

● Diverse data organization options
  – Hierarchical correlation
  – Graph correlation
  – Semantic relationships
  – Set based analytics
● Caching in end usage format
● Data Archival
● Big Data Analytics
  – Cumulative metrics and insights
  – Correlation

Where RDBMS/SQL is better...

● OLTP
● Data Integrity
● SQL centricity
● Complex relationships
  – Exception of graph NOSQL
● Maturity, stability and standardization

Use Cases

● Log management (unstructured data)
● Data synchronization (online vs. offline sources)
  – Shopping cart, Field sales/services, PoS, Gaming, Transportation/telemetry
● User profile management
● Customer 360 degree view
● Fraud detection
● Medical/Healthcare diagnosis
● Data Archival
● Recommendation Engines

Applications for Data Analytics

● Complements (part of) Hadoop and Big Data
● Acts as the persistence infrastructure for larger machine learning use cases
  – Predictive Analytics
  – Fraud/Anomaly/Outlier Detection
  – Recommendation engines
● Provides a backdrop for interesting data visualization initiatives
  – Integrate with visualization packages such as Tableau

Interesting links

● Redis in Practice: Who's online?
  www.lukemelia.com/blog/archives/2010/01/17/redis-in-practice-whos-online/
● Inventory list of NOSQL systems
  www.nosql-database.org
● Database Engine ranking and analytics
  www.db-engines.com
● Visual guide to NOSQL systems
  www.blog.nahurst.com/visual-guide-to-nosql-systems

Case Studies / Demos

● Retail fraud detection
  – Neo4j
  – Contrasting with OrientDB
  – Tinkerpop/Gremlin/Blueprints
● 360 degree single view of voter information
  – MongoDB
● Schema on read
  – Hadoop
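The idea behind graph-based fraud detection can be sketched without a graph database: link accounts through shared attributes and look for connected groups. The sketch below uses plain Python rather than Neo4j, OrientDB or Gremlin. The edge data and the `linked_accounts` and `rings` helpers are invented for illustration and are not the demo shown in the talk.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical edges: (account, relationship, shared attribute node).
EDGES = [
    ("acct:1", "USES_CARD", "card:9001"),
    ("acct:2", "USES_CARD", "card:9001"),
    ("acct:2", "SHIPS_TO", "addr:12-main-st"),
    ("acct:3", "SHIPS_TO", "addr:12-main-st"),
    ("acct:4", "USES_CARD", "card:7777"),
]


def linked_accounts(edges):
    """Return account pairs that share at least one attribute node."""
    by_attribute = defaultdict(set)
    for account, _relationship, attribute in edges:
        by_attribute[attribute].add(account)
    pairs = defaultdict(set)
    for attribute, accounts in by_attribute.items():
        for a, b in combinations(sorted(accounts), 2):
            pairs[(a, b)].add(attribute)
    return dict(pairs)


def rings(edges):
    """Group accounts into connected components via shared attributes."""
    neighbours = defaultdict(set)
    for a, b in linked_accounts(edges):
        neighbours[a].add(b)
        neighbours[b].add(a)
    seen, result = set(), []
    for start in sorted(neighbours):
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:  # depth-first traversal of one component
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbours[node] - component)
        seen |= component
        result.append(sorted(component))
    return result


if __name__ == "__main__":
    print(linked_accounts(EDGES))
    print(rings(EDGES))
```

A graph database expresses the same traversal as a query and keeps it efficient as the number of hops grows, which is where join-based SQL tends to struggle.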

Gremlin Blueprints Architecture

Neo4j OrientDB TitanGraph ArangoDB

Qualified Voter – Use Case

● Tracks registration information for all voters in Michigan
● Uses a tabular geography model
● Highly normalized schema
  – Data partitioned into subsets
    ● Enable local application instances and row level security
● Expensive queries when doing reporting
● Expensive queries for performing “single view” of voter
● Several tables with tens of millions of records

Voter Schema

Find the first 100 voters in Ingham county with status and school district

SELECT V.VOTER_IDENTIFICATION_NUMBER, V.FIRST_NAME, V.LAST_NAME, G.CODE AS GENDER,
       IDS.NAME AS ID_STATUS, UST.NAME AS UOCAVA_STATUS,
       VA.ADDRESS_LINE_ONE, VA.CITY, VA.ZIP_CODE,
       DIS.NAME AS SCHOOL_DISTRICT
FROM   VOTER V, VOTER_ADDRESS VA, GENDER G,
       IDENTIFICATION_STATUS IDS, UOCAVA_STATUS UST, VOTER_STATUS_TYPE VST,
       STREET_RANGE SI, DISTINCT_POLITICAL_AREA DPA, DISTINCT_POLITICAL_AREA_DIS DPAD,
       DISTRICT DIS, DISTRICT_TYPE DT, COUNTY CO
WHERE  V.ID = VA.VOTER_ID
  AND  V.GENDER_ID = G.ID
  AND  V.IDENTIFICATION_STATUS_ID = IDS.ID
  AND  V.UOCAVA_STATUS_ID = UST.ID
  AND  V.VOTER_STATUS_TYPE_ID = VST.ID
  AND  VST.NAME = 'Active'
  AND  VA.STREET_RANGE_ID = SI.ID
  AND  SI.DISTINCT_POLITICAL_AREA_ID = DPA.ID
  AND  VA.IS_ACTIVE = 'Y'
  AND  DPA.COUNTY_ID = CO.ID
  AND  CO.NAME = 'Ingham'
  AND  DPA.ID = DPAD.DISTINCT_POLITICAL_AREA_ID
  AND  DPAD.DISTRICT_ID = DIS.ID
  AND  DIS.DISTRICT_TYPE_ID = DT.ID
  AND  DT.NAME = 'School'
  AND  ROWNUM <= 100;

Expensive in terms of IO

● Multiple objects read
● Two stage IO:
  – Read index
  – Read entire table row
● Selected and WHERE clause columns assembled and then filtered
● Resources for larger volume query would be high – memory, CPU, fast disk
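For contrast, a document store can keep the "single view" of a voter as one denormalized record, so the same question is answered from a single collection. The sketch below uses plain Python dicts in place of stored documents. The field names echo the SQL columns, but the nesting and sample data are invented, and this is not the presenters' MongoDB schema.

```python
# Invented sample documents; one record holds what the SQL query joins together.
VOTERS = [
    {
        "voter_identification_number": "V-0001",
        "first_name": "Pat", "last_name": "Example", "gender": "F",
        "id_status": "Verified", "uocava_status": "None", "status": "Active",
        "address": {"line_one": "1 Main St", "city": "Lansing", "zip_code": "48933",
                    "is_active": True, "county": "Ingham"},
        "districts": [{"type": "School", "name": "Example School District"}],
    },
    {
        "voter_identification_number": "V-0002",
        "first_name": "Sam", "last_name": "Sample", "gender": "M",
        "id_status": "Verified", "uocava_status": "None", "status": "Active",
        "address": {"line_one": "2 Oak Ave", "city": "Grand Rapids", "zip_code": "49503",
                    "is_active": True, "county": "Kent"},
        "districts": [{"type": "School", "name": "Another School District"}],
    },
]


def first_voters(voters, county, limit=100):
    """Return up to `limit` active voters in `county` with their school district."""
    results = []
    for v in voters:
        if len(results) >= limit:
            break
        addr = v["address"]
        if v["status"] != "Active" or not addr["is_active"] or addr["county"] != county:
            continue
        schools = [d["name"] for d in v["districts"] if d["type"] == "School"]
        if not schools:
            continue  # mirrors the inner join on the school district
        results.append({
            "voter_identification_number": v["voter_identification_number"],
            "name": f'{v["first_name"]} {v["last_name"]}',
            "id_status": v["id_status"],
            "school_district": schools[0],
        })
    return results


if __name__ == "__main__":
    print(first_voters(VOTERS, "Ingham"))
```

The trade-off is that county or district names are copied into every voter document, so updates to reference data must be propagated to many records.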

Parting conclusions

● NOSQL is a mixed bag of fruit
● This space is growing
● There are hundreds of products
● Best value is realized from identifying the correct use case
  – Functional requirements
  – Non-functional requirements

Finally you can use NOSQL for...

Thank You!!

Questions?
