Solr The Search First NoSQL Database



DESCRIPTION

Presented by Mark Miller, Software Engineer, Cloudera. As the NoSQL ecosystem looks to integrate great search, great search is naturally beginning to expose many NoSQL features. Will these Goliaths collide? Or will they remain specialized while intermingling, two sides of the same coin? Come learn about where SolrCloud fits into the NoSQL landscape. What can it do? What will it do? And how will the big data, NoSQL, and search ecosystem evolve? If you are interested in Big Data, NoSQL, distributed systems, the CAP theorem, and other hype-filled terms, then this talk may be for you.


Page 1: Solr cloud the 'search first' nosql database   extended deep dive

Solr

The Search First NoSQL Database

Page 2: Solr cloud the 'search first' nosql database   extended deep dive

• Mark Miller: Cloudera employee, Lucene PMC member, Apache member

• Started playing with Lucene in 2006

• Lucene committer since 2008

• Solr committer since 2009

Who Am I?

Page 3: Solr cloud the 'search first' nosql database   extended deep dive

My Dog

Page 4: Solr cloud the 'search first' nosql database   extended deep dive

Big Data is getting Bigger

• The total Big Data market reached $11.4 billion in 2012

• The Big Data market is projected to reach $18.1 billion in 2013, an annual growth of 61%

• On pace to exceed $47 billion by 2017.

Page 5: Solr cloud the 'search first' nosql database   extended deep dive

3 basic needs

• Storage

• Processing

• Search

Page 6: Solr cloud the 'search first' nosql database   extended deep dive

Two Standouts in the Big Data Market

•Hadoop

•NoSQL

Page 7: Solr cloud the 'search first' nosql database   extended deep dive

Ultimately, the NoSQL market is largely up for grabs. Each NoSQL database has its related strengths and weaknesses, and no one NoSQL database currently “does it all.” Big Data practitioners must take a number of factors into consideration when selecting a NoSQL database to facilitate large-scale transactional workloads, including scalability, performance, security, and ease-of-development.

Big Data Vendor Revenue and Market Forecast (Wikibon)

Page 8: Solr cloud the 'search first' nosql database   extended deep dive

RDBMS

• The classic way to store your data.

• ACID is great, transactions are cool, SQL is well known and understood.

• Scaling is *hard*, but possible (see Facebook’s MySQL cluster)

• ‘impedance mismatch’ sucks

Page 9: Solr cloud the 'search first' nosql database   extended deep dive

Search

• Search has been moving from an expensive, complicated option to an affordable, easier-to-use necessity.

• Lots of data begs for the ability to process it, store it, and search it.

Page 10: Solr cloud the 'search first' nosql database   extended deep dive

Enterprise Search Engines

• Verity - acquired by Autonomy in 2005

• FAST - acquired by Microsoft in 2008

• Endeca - acquired by Oracle in 2011

• Autonomy - acquired by HP in 2011

• Vivisimo - acquired by IBM in 2012

Page 11: Solr cloud the 'search first' nosql database   extended deep dive

NoSQL

• Not Only SQL rather than ‘No SQL’

• Except that makes little sense...

• “when ‘NoSQL’ is applied to a database, it refers to an ill-defined set of mostly open-source databases, mostly developed in the early 21st century, and mostly not using SQL.” - NoSQL Distilled

Page 12: Solr cloud the 'search first' nosql database   extended deep dive

NoSQL

• Key-Value

• Columnar

• Document

• Graph

Page 13: Solr cloud the 'search first' nosql database   extended deep dive

In the beginning..

• BerkeleyDB (1991?)

• Lotus Notes (1989?)

• Bayou (1996?)

Page 14: Solr cloud the 'search first' nosql database   extended deep dive

In the beginning of the modern era...

• BigTable (Google) (started in 2004, paper in 2006)

• Dynamo (Amazon) (paper in 2007)

Page 15: Solr cloud the 'search first' nosql database   extended deep dive

Derivatives

• Dynamo: Cassandra, CouchDB, Voldemort, Riak

• BigTable: Cassandra, HBase, Redis, HyperTable, Accumulo

Page 16: Solr cloud the 'search first' nosql database   extended deep dive

Also...

• AppEngine storage built on BigTable

• DynamoDB - based on the principles of Dynamo

Page 17: Solr cloud the 'search first' nosql database   extended deep dive

When it comes to NoSQL, Open Source rules the roost.

• I won’t be talking about any solution that is not based on Open Source - only because those solutions are not popular.

• “there’s a notion that NoSQL is an open-source phenomenon.” - NoSQL Distilled

Page 18: Solr cloud the 'search first' nosql database   extended deep dive

The 2013 Future of Open Source Survey Results

Black Duck and North Bridge

Page 19: Solr cloud the 'search first' nosql database   extended deep dive

What’s Popular?

• NoSQL database proliferation - NoSQL databases are a dime a dozen. Why?

• Which solutions should we look at?

Page 20: Solr cloud the 'search first' nosql database   extended deep dive

indeed.com

• Indeed.com is an employment-related metasearch engine for job listings

• Indeed is the #1 job site worldwide, with over 100 million unique visitors per month. Indeed is available in more than 50 countries and 26 languages, covering 94% of global GDP.

Page 21: Solr cloud the 'search first' nosql database   extended deep dive

http://db-engines.com

• DB-Engines is an initiative to collect and present information on database management systems (DBMS). In addition to established relational DBMS, systems and concepts of the growing NoSQL area are emphasized.

• The DB-Engines Ranking is a list of DBMS ranked by their current popularity. The list is updated monthly.

Page 22: Solr cloud the 'search first' nosql database   extended deep dive

Popular Search Job Trends

Page 23: Solr cloud the 'search first' nosql database   extended deep dive

Popular Search Solutions (DB-Engines)

Page 24: Solr cloud the 'search first' nosql database   extended deep dive

Popular NoSQL Job Trends

Page 25: Solr cloud the 'search first' nosql database   extended deep dive

Let’s get some context

Page 26: Solr cloud the 'search first' nosql database   extended deep dive

Compare to Java

Page 27: Solr cloud the 'search first' nosql database   extended deep dive

Add in Oracle...

Page 28: Solr cloud the 'search first' nosql database   extended deep dive

NoSQL Database Types

• Key-Value

• Column Family

• Document

• Graph

Page 29: Solr cloud the 'search first' nosql database   extended deep dive

I’m going to ignore Graph...everyone else seems to...

Page 30: Solr cloud the 'search first' nosql database   extended deep dive

Popular NoSQL Document Stores

(DB-Engines)

Page 31: Solr cloud the 'search first' nosql database   extended deep dive

Key-Value Stores

Page 32: Solr cloud the 'search first' nosql database   extended deep dive

Columnar Stores

Page 33: Solr cloud the 'search first' nosql database   extended deep dive

The Full Popularity Contest

Page 34: Solr cloud the 'search first' nosql database   extended deep dive
Page 35: Solr cloud the 'search first' nosql database   extended deep dive

In case you forgot, Oracle is in the NoSQL game...

• Oracle NoSQL

Page 36: Solr cloud the 'search first' nosql database   extended deep dive

CAP Theorem

The CAP theorem, also known as Brewer's theorem, states that it is impossible for a distributed computer system to simultaneously provide all three of the following guarantees:

• Consistency (all nodes see the same data at the same time)

• Availability (a guarantee that every request receives a response about whether it was successful or failed)

• Partition tolerance (the system continues to operate despite arbitrary message loss or failure of part of the system)

Page 37: Solr cloud the 'search first' nosql database   extended deep dive

CAP

Page 38: Solr cloud the 'search first' nosql database   extended deep dive

Architectures

• For NoSQL, generally boils down to AP or CP. CA does not support partition tolerance.

• You have to trade off consistency versus availability.

• AP favors availability over consistency - this is the eventually consistent architecture.

• CP favors consistency over availability.

• Of course, there is a continuum between AP and CP.
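One common way this continuum shows up in Dynamo-style stores is tunable quorums: with N replicas, a write acknowledged by W replicas and a read that consults R replicas must share at least one replica whenever R + W > N. The sketch below only checks that arithmetic. It is a general illustration rather than something the slides claim for any particular system, and the function name is made up for this example.

```python
def read_sees_latest_write(n: int, w: int, r: int) -> bool:
    """Return True when every read quorum must overlap every write quorum.

    n: number of replicas, w: replicas that acknowledge a write,
    r: replicas consulted on a read. Hypothetical helper for illustration.
    """
    if not (1 <= w <= n and 1 <= r <= n):
        raise ValueError("w and r must be between 1 and n")
    return r + w > n
```

With three replicas, W=2 and R=2 leans toward consistency, while W=1 and R=1 leans toward availability and allows stale reads.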

Page 39: Solr cloud the 'search first' nosql database   extended deep dive

Key Design Decisions

• Data Model - how is the data stored/accessed

• Distribution Model - how is the data distributed

• Conflict Resolution - how do we ensure that the same update ‘wins’ on each node?

Page 40: Solr cloud the 'search first' nosql database   extended deep dive

Data Model

• key -> value (opaque)

• key -> document

• column oriented
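As a rough illustration of the three data models above, the sketch below uses plain Python dictionaries. The record contents and the names `kv_store`, `doc_store`, `column_store`, `find_docs_by_field` and `read_column` are invented for this example and do not correspond to any product's API.

```python
import json

# key -> value (opaque): the store sees only bytes and cannot look inside.
kv_store = {
    "user:1": json.dumps({"name": "Ada", "city": "London"}).encode("utf-8"),
}

# key -> document: the store understands the structure, so it can query fields.
doc_store = {
    "user:1": {"name": "Ada", "city": "London"},
    "user:2": {"name": "Linus", "city": "Helsinki"},
}

# column oriented: values for one column are stored together, keyed by row.
column_store = {
    "name": {"user:1": "Ada", "user:2": "Linus"},
    "city": {"user:1": "London", "user:2": "Helsinki"},
}


def find_docs_by_field(store, field, value):
    """Return the sorted keys of documents whose field equals value."""
    return sorted(key for key, doc in store.items() if doc.get(field) == value)


def read_column(store, column):
    """Read one column for every row without touching the other columns."""
    return dict(store[column])
```

The opaque store can only answer lookups by key, the document store can filter on fields, and the column layout makes scanning one attribute across all rows cheap.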

Page 41: Solr cloud the 'search first' nosql database   extended deep dive

Distributed Model

• Roughly, how is data distributed across the cluster?

• Sharding, replication, etc

Page 42: Solr cloud the 'search first' nosql database   extended deep dive

Data Versioning and Consistency

• Essentially, how is data kept consistent across nodes?

• Sequential consistency—ensuring that all nodes apply operations in the same order.

• Update consistency and read consistency.

Page 43: Solr cloud the 'search first' nosql database   extended deep dive

MongoDB

• Data Model - BSON, a binary JSON format

• Distributed Model - sharded asynchronous master/slave replication.

• Data Versioning and Consistency - Master / Slave, per table write lock

Page 44: Solr cloud the 'search first' nosql database   extended deep dive

MongoDB Search

• Built-in text search. I think of it like RDBMS built-in full text search - major feature gaps compared with dedicated full text search engines, and likely major performance gaps.

• Common to sit a search engine next to MongoDB

Page 45: Solr cloud the 'search first' nosql database   extended deep dive

Cassandra

• Data Model - column based, like BigTable

• Distributed Updates - similar to Dynamo, consistent hashing, master-master

• Data Versioning and Consistency - timestamps
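The consistent hashing mentioned above can be sketched roughly as follows. This is a toy ring, not Cassandra's actual partitioner: the class name `HashRing`, the number of virtual nodes, and the use of MD5 to spread keys are all assumptions made for illustration.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    """Map a string to a position on the ring (MD5 used only for spreading)."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Toy consistent-hash ring with virtual nodes (hypothetical example)."""

    def __init__(self, nodes, vnodes=64):
        self._vnodes = vnodes
        self._ring = []  # sorted list of (position, node)
        for node in nodes:
            self.add(node)

    def add(self, node):
        """Place vnodes virtual positions for this node on the ring."""
        for i in range(self._vnodes):
            bisect.insort(self._ring, (_hash(f"{node}#{i}"), node))

    def remove(self, node):
        """Drop every virtual position that belongs to this node."""
        self._ring = [(pos, n) for pos, n in self._ring if n != node]

    def owner(self, key):
        """Walk clockwise from the key's position to the first virtual node."""
        if not self._ring:
            raise LookupError("ring is empty")
        idx = bisect.bisect_right(self._ring, (_hash(key), ""))
        return self._ring[idx % len(self._ring)][1]
```

The point of the design is that removing a node only reassigns the keys that node owned; keys owned by the remaining nodes stay where they are.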

Page 46: Solr cloud the 'search first' nosql database   extended deep dive

Cassandra Search

• Lucandra

• Solandra

• DataStax Enterprise Search (Solr fields must be strings)

Page 47: Solr cloud the 'search first' nosql database   extended deep dive

HBase

• Data Model - Column Store

• Distribution Model - regions served by region servers.

• Versioning and Consistency - strongly consistent

Page 48: Solr cloud the 'search first' nosql database   extended deep dive

HBase Search

• HBasene (dead?)

• HBASE-SEARCH, HBASE-3529 (dead?)

• Solbase

• Lily

Page 49: Solr cloud the 'search first' nosql database   extended deep dive

• Riak is a NoSQL database implementing the principles from Amazon's Dynamo paper

• Data Model - stores key/value pairs in a high level namespace called a bucket.

• Data Versioning and Consistency - Riak uses a data structure called a vector clock to reason about causality and staleness of stored values. (Can also use timestamps). Last write wins, or client resolves conflict.
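A minimal sketch of how a vector clock lets a store reason about causality and staleness. This is not Riak's implementation: clocks are modeled as plain dictionaries from node name to counter, and the function names `increment`, `descends` and `compare` are chosen for this example.

```python
def increment(clock: dict, node: str) -> dict:
    """Return a copy of clock with the given node's counter advanced by one."""
    updated = dict(clock)
    updated[node] = updated.get(node, 0) + 1
    return updated


def descends(a: dict, b: dict) -> bool:
    """True if clock a has seen everything that clock b has seen."""
    return all(a.get(node, 0) >= count for node, count in b.items())


def compare(a: dict, b: dict) -> str:
    """Classify two versions as 'equal', 'a_newer', 'b_newer' or 'conflict'."""
    a_ge, b_ge = descends(a, b), descends(b, a)
    if a_ge and b_ge:
        return "equal"
    if a_ge:
        return "a_newer"
    if b_ge:
        return "b_newer"
    return "conflict"  # concurrent writes that something must resolve
```

When `compare` reports a conflict, neither value is stale relative to the other, which is the case where last write wins or client-side resolution comes into play.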

Page 50: Solr cloud the 'search first' nosql database   extended deep dive

Riak Search

• Riak Search - custom search engine, Solr-like API

• Yokozuna

Page 51: Solr cloud the 'search first' nosql database   extended deep dive

Yokozuna Author Enumerates Common Reasons Custom Search has Failed

• Pretends to be lucene/solr

• Lack of analyzer/language/features

• Bad performance/resource usage for certain queries

• Basho is not in the business of search

Page 52: Solr cloud the 'search first' nosql database   extended deep dive

• CouchDB’s data format is JSON stored as documents (self-contained records with no intrinsic relationships), grouped into “database” namespaces.

• Conflicts are left to the application to resolve at write time. CouchDB arbitrarily, but deterministically, determines a winner and tracks a conflict. The client must then resolve the conflict.

Page 55: Solr cloud the 'search first' nosql database   extended deep dive

Adding Search to NoSQL

• Hard to do without a lot of compromise

• Build your own, or use Lucene or Lucene based solution

• Nothing has yet set the world on fire...

Page 56: Solr cloud the 'search first' nosql database   extended deep dive

Adding NoSQL to Search

• Search solutions are generally already a Document based NoSQL solution.

• Seems a lot easier to do than the reverse

• Nothing has yet set the world on fire...

Page 57: Solr cloud the 'search first' nosql database   extended deep dive

Solr NoSQL Features

• Realtime-Get

• Update Durability

• Atomic Compare and Set

• Versioning and optimistic locking
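To make compare-and-set and optimistic locking concrete, here is a toy in-memory store. It does not talk to Solr. The version rules are modeled on the conventions documented for Solr's `_version_` field (a value greater than 1 must match the stored version, 1 means the document must exist, a negative value means it must not exist, 0 disables the check). The class and exception names are hypothetical.

```python
import itertools


class VersionConflict(Exception):
    """Raised when the supplied version does not match the stored state."""


class ToyVersionedStore:
    """In-memory stand-in for a versioned document store (illustration only)."""

    def __init__(self):
        self._docs = {}  # id -> (version, fields)
        # Any strictly increasing source of versions works for this sketch.
        self._next_version = itertools.count(100)

    def get(self, doc_id):
        """Return (version, fields) or None; stands in for a real-time get."""
        return self._docs.get(doc_id)

    def put(self, doc_id, fields, version=0):
        """Write the document if the version constraint holds."""
        current = self._docs.get(doc_id)
        if version > 1 and (current is None or current[0] != version):
            raise VersionConflict("stored version differs from expected version")
        if version == 1 and current is None:
            raise VersionConflict("document must already exist")
        if version < 0 and current is not None:
            raise VersionConflict("document must not exist yet")
        new_version = next(self._next_version)
        self._docs[doc_id] = (new_version, dict(fields))
        return new_version
```

A client reads a document with its version, modifies it, and writes it back with that version. If another writer got there first, the write is rejected and the client retries from a fresh read.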

Page 58: Solr cloud the 'search first' nosql database   extended deep dive

Schemaless?

• NoSQL databases are generally ‘schemaless’

• In some ways convenient, in other ways not.

• Implicit schema moves to application code.

• Can’t optimize based on types.

• Note: some are calling ‘guessed’ schemas schemaless.

Page 59: Solr cloud the 'search first' nosql database   extended deep dive

SolrCloud

• Most similar to the MongoDB architecture

• A CP system, though currently, eventually consistent.

• The architecture supports adding strong consistency options.

Page 60: Solr cloud the 'search first' nosql database   extended deep dive

SolrCloud

• The length of time an inconsistency is present is called the inconsistency window.

• SolrCloud has a very small inconsistency window.

Page 61: Solr cloud the 'search first' nosql database   extended deep dive

Data Model

• key -> document

• Optionally, column oriented

Page 62: Solr cloud the 'search first' nosql database   extended deep dive
Page 63: Solr cloud the 'search first' nosql database   extended deep dive

Contact Info

• @heismark
