NoSQL: Apache SOLR, Apache Hadoop. By Dmitry Kan for NerdCamp, April 23, 2011

NoSQL, Apache SOLR and Apache Hadoop


DESCRIPTION

NoSQL (Not Only SQL) is sometimes described as a superset of relational SQL databases and sometimes as a set that merely intersects with them. The concept is still taking shape, but one thing is already clear: NoSQL addresses the task of storing and retrieving large volumes of data in systems under high load. There is another important angle: NoSQL systems allow storing and efficiently searching unstructured or semi-structured data, such as completely raw or preprocessed documents. Using one world-class document retrieval system, Apache SOLR (a performant HTTP wrapper around Apache Lucene), as a reference, we will look at its use cases, horizontal and vertical scalability, faceted search, distribution and load balancing, crawling, extendability, linguistic support, integration with relational databases and much more. Dmitry Kan will briefly touch upon the *hot* topic of cloud computing using the well-known Apache Hadoop project and will help the audience see whether SOLR shines through the cloud.


Page 1: NoSQL, Apache SOLR and Apache Hadoop

NoSQL: Apache SOLR, Apache Hadoop

By Dmitry Kan for NerdCamp, April 23 2011

Page 2: NoSQL, Apache SOLR and Apache Hadoop

Dilbert: expert in NoSQL

Page 3: NoSQL, Apache SOLR and Apache Hadoop

•The name NoSQL was first used in 1998 by Carlo Strozzi. In his view the current NoSQL movement "departs from the relational model altogether; it should therefore have been called more appropriately 'NoREL', or something to that effect." (wikipedia)
•NoSQL = Not Only SQL
•Companies: Facebook, Twitter, Digg, Amazon, LinkedIn and Google

•Data storage: a billion gigabytes (GB) of data
•Interconnected data: hyperlinks, blog pingbacks, social networks
•Complex data structures: hierarchical, nested data structures are represented easily (they would need multiple relational tables in SQL)
•Performance: the more data an SQL database holds, the more likely its performance is to degrade

•NoSQL is not:
• … SQL, and not relational
• … a replacement for SQL, but a complement to it
• … bound to a fixed schema, and it has no joins
• … designed to ”scale up” (RDBMS, vertical scaling), but rather to ”scale out” (spreading the load over many commodity systems), that is, horizontal scaling

Page 4: NoSQL, Apache SOLR and Apache Hadoop

NoSQL Categories

•Key-value Stores: a big hash table with caching mechanisms
•Column Family Stores: keys point to multiple columns (Google’s BigTable)
•Document Databases: documents are collections of other key-value collections
•Graph Databases: nodes, relationships between nodes, and node properties
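A rough sketch of how one and the same record might look in each of the four categories, using nothing but plain Python structures. All keys, names and values are made up for illustration; real stores add persistence, distribution and caching on top of these shapes.

```python
# Toy illustration of the four NoSQL data models with plain Python objects.
# All keys, names and values are hypothetical.

# Key-value store: one opaque value per key.
key_value = {"user:42": '{"name": "Jon Foobar"}'}

# Column family store: a row key points to several named columns.
column_family = {
    "user:42": {"name": "Jon Foobar", "city": "Helsinki", "weblog": "http://example.org/blog"},
}

# Document database: a document is a nested collection of key-value collections.
document = {
    "_id": "user:42",
    "name": "Jon Foobar",
    "posts": [{"title": "Hello NoSQL", "tags": ["nosql", "solr"]}],
}

# Graph database: nodes, relationships between nodes, and node properties.
nodes = {1: {"name": "Jon Foobar"}, 2: {"name": "Jane Doe"}}
relationships = [(1, "KNOWS", 2)]


def friends_of(node_id):
    """Return the names of nodes reachable over an outgoing KNOWS relationship."""
    return [nodes[dst]["name"] for src, rel, dst in relationships
            if src == node_id and rel == "KNOWS"]
```

The nested document needs no join to reach a post's tags, while the graph model makes the relationship itself the thing you query.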

Major NoSQL players
•Dynamo: Amazon.com, key-value, used in Amazon S3 (Simple Storage Service)
•Cassandra: open-sourced by Facebook, column-oriented NoSQL DB
•BigTable: Google’s proprietary column-oriented DB (App Engine)
•CouchDB: open-source document-oriented NoSQL DB (as is MongoDB)
•Neo4j: open-source graph DB

Querying a NoSQL DB:
•Data model specific
•RESTful interfaces or query APIs
•SPARQL: declarative query specification for graph DBs

Page 5: NoSQL, Apache SOLR and Apache Hadoop

Simple Protocol And RDF Query Language (courtesy of about.com and IBM)

Example of retrieving the URL of a blogger

PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?url
FROM <bloggers.rdf>
WHERE {
  ?contributor foaf:name "Jon Foobar" .
  ?contributor foaf:weblog ?url .
}
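To see what the two triple patterns in the WHERE clause do, here is a minimal sketch in plain Python that evaluates them against a tiny in-memory triple list. It is not a SPARQL engine: the data is invented and stands in for bloggers.rdf, and `match` and `select` are hypothetical helpers written for this illustration.

```python
# Toy triple store; the data is made up and stands in for bloggers.rdf.
FOAF = "http://xmlns.com/foaf/0.1/"
TRIPLES = [
    ("_:b1", FOAF + "name", "Jon Foobar"),
    ("_:b1", FOAF + "weblog", "http://foobar.example.org/blog"),
    ("_:b2", FOAF + "name", "Jane Doe"),
    ("_:b2", FOAF + "weblog", "http://jane.example.org/"),
]


def match(pattern, triple, bindings):
    """Try to unify one triple pattern with one triple, extending the bindings."""
    result = dict(bindings)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):                      # variable
            if result.setdefault(term, value) != value:
                return None                           # conflicts with an earlier binding
        elif term != value:                           # constant that does not match
            return None
    return result


def select(variable, patterns, triples):
    """Evaluate the patterns left to right and project a single variable."""
    solutions = [{}]
    for pattern in patterns:
        solutions = [extended
                     for bindings in solutions
                     for triple in triples
                     for extended in [match(pattern, triple, bindings)]
                     if extended is not None]
    return [s[variable] for s in solutions]


urls = select("?url", [
    ("?contributor", FOAF + "name", "Jon Foobar"),
    ("?contributor", FOAF + "weblog", "?url"),
], TRIPLES)
```

The shared variable ?contributor is what joins the two patterns: only the weblog of the node whose name is "Jon Foobar" survives.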

stats!

Page 6: NoSQL, Apache SOLR and Apache Hadoop

Some stats from Information Week via about.com (2010):
•44% of business IT professionals haven’t heard of NoSQL
•1% see NoSQL as a strategic direction

Some stats from NerdCamp (April 2011):
•10% have heard of and used NoSQL
•Many more people know about the cloud, which may increasingly become a driving platform behind NoSQL

Does the world of NoSQL have enough mass to appeal to IT now?

Page 7: NoSQL, Apache SOLR and Apache Hadoop

“Solr is the popular, blazing fast open source enterprise search platform from the Apache Lucene project.”

Created by Yonik Seeley at CNET

Features:
•Full-text search
•Hit highlighting
•Faceted search (dynamic clustering)
•DB integration
•Rich document handling
•Geospatial search
•Distributed search
•Replication
•REST-like HTTP/XML & JSON APIs
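Since every feature is reachable over plain HTTP, a client mostly needs to assemble a URL. The sketch below builds one with the Python standard library and does not contact any server. The base URL assumes the stock example server on localhost, and the field names "cat" and "name" are hypothetical; `q`, `wt`, `rows`, `facet`, `facet.field`, `hl` and `hl.fl` are ordinary Solr request parameters.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Assumed base URL of a local Solr example server; adjust to your installation.
SOLR_SELECT = "http://localhost:8983/solr/select"


def build_query_url(text, facet_field=None, highlight_field=None, rows=10):
    """Build a select URL asking for JSON output, with optional faceting and highlighting."""
    params = [("q", text), ("wt", "json"), ("rows", str(rows))]
    if facet_field:
        params += [("facet", "true"), ("facet.field", facet_field)]
    if highlight_field:
        params += [("hl", "true"), ("hl.fl", highlight_field)]
    return SOLR_SELECT + "?" + urlencode(params)


# "cat" and "name" are hypothetical field names.
url = build_query_url("dell ultrasharp", facet_field="cat", highlight_field="name")

# Round trip: decode the query string again to inspect what would be sent.
sent_params = parse_qs(urlsplit(url).query)
```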

http://lucene.apache.org/solr/
http://lucene.apache.org/solr/tutorial.html
http://lucene.apache.org/java/docs/index.html

Books

Page 8: NoSQL, Apache SOLR and Apache Hadoop

Companies using SOLR

drupal

Page 9: NoSQL, Apache SOLR and Apache Hadoop
Page 10: NoSQL, Apache SOLR and Apache Hadoop

April 2011

Overview of current state

Current version: Apache Solr 3.1 (March 31, 2011)
License: ASL 2.0
Features:
•Faceted navigation
•Hit highlighting
•GEO search: filter and sort by distance
•Spellcheck and auto suggest
•Advanced ranking and sorting
•Distributed and replicated search
•Structured / unstructured search
•Rich plugin architecture, extensible

Operating system support
All with a Java VM, including: Linux (all versions), Windows (all versions), MacOS (all versions), Unix variants

App-server support
Apache Tomcat, Jetty, Resin, WebLogic™, WebSphere™, GlassFish, dmServer™, JBoss™ and many more

Java version requirement
Java JDK 1.5 or later

Client API support
Java, .NET, PHP, Python, Ruby (on Rails), C++, XML/HTTP, JSON/HTTP and more

Page 11: NoSQL, Apache SOLR and Apache Hadoop

Faceted search

•A technique for refining search results

•Concept composition:

• Article + in English + about nerdcamp

• Finnish rap + < 1 minute + released in 2001

•Types:

• Standard facets (list of facets with values)

• Hierarchical facet values (taxonomy of facet values)

• Range / query facets: by date, by price, by alphabet, by interval
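The idea of concept composition can be sketched in a few lines of plain Python: each selected facet value narrows the result set, and the counts per remaining value are what a search page shows next to the results. This imitates the concept only and is not how Solr computes facets internally; the documents and field names are invented.

```python
from collections import Counter

# Hypothetical mini-index; the field names are made up for illustration.
DOCS = [
    {"id": 1, "type": "article", "lang": "en", "topic": "nerdcamp"},
    {"id": 2, "type": "article", "lang": "fi", "topic": "nerdcamp"},
    {"id": 3, "type": "article", "lang": "en", "topic": "solr"},
    {"id": 4, "type": "song", "lang": "fi", "topic": "rap"},
]


def refine(docs, **filters):
    """Keep only documents matching every selected facet value."""
    return [d for d in docs if all(d.get(f) == v for f, v in filters.items())]


def facet_counts(docs, field):
    """Count how many documents carry each value of a field."""
    return Counter(d[field] for d in docs if field in d)


articles = refine(DOCS, type="article")
lang_facet = facet_counts(articles, "lang")   # counts shown next to the results
hits = refine(DOCS, type="article", lang="en", topic="nerdcamp")
```

Here `hits` corresponds to "Article + in English + about nerdcamp".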

Page 12: NoSQL, Apache SOLR and Apache Hadoop

Spatial Search

Combines location data with text data

•Represent spatial data in the index

•Filter by some spatial concept such as a bounding box or other shape

•Sort by distance

•Score/boost by distance

<field name="store">45.17614,-93.87341</field> <!-- Buffalo store -->
<field name="store">40.7143,-74.006</field> <!-- NYC store -->
<field name="store">37.7752,-122.4232</field> <!-- San Francisco store -->

•bbox: bounding box filter (bbox is a range of latitudes and longitudes that encompasses the circle of radius d)

•geodist: the distance function
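The two operations can be illustrated outside Solr with a short Python sketch. It assumes a haversine great-circle distance on a spherical Earth and a simple latitude/longitude box around the query point; the box is an approximation that ignores the poles and the date line. The store coordinates come from the slide, while the query point and the helper names are hypothetical.

```python
from math import asin, cos, degrees, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumption of this sketch

STORES = {  # coordinates taken from the slide
    "Buffalo": (45.17614, -93.87341),
    "NYC": (40.7143, -74.006),
    "San Francisco": (37.7752, -122.4232),
}


def haversine_km(a, b):
    """Great-circle distance between two (lat, lon) points in kilometres."""
    lat1, lon1, lat2, lon2 = map(radians, (*a, *b))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(h))


def bounding_box(center, d_km):
    """Approximate lat/lon range around the circle of radius d_km (no pole or date-line handling)."""
    lat, lon = center
    dlat = degrees(d_km / EARTH_RADIUS_KM)
    dlon = degrees(d_km / (EARTH_RADIUS_KM * cos(radians(lat))))
    return (lat - dlat, lat + dlat), (lon - dlon, lon + dlon)


def in_box(point, box):
    """True if the (lat, lon) point lies inside the box."""
    (lat_min, lat_max), (lon_min, lon_max) = box
    return lat_min <= point[0] <= lat_max and lon_min <= point[1] <= lon_max


point = (45.15, -93.85)  # hypothetical query location near the Buffalo store
by_distance = sorted(STORES, key=lambda name: haversine_km(point, STORES[name]))
nearby = [name for name in STORES if in_box(STORES[name], bounding_box(point, 5))]
```

The box test is cheap range filtering, which is why it is useful as a first filter before computing exact distances for sorting or boosting.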

Page 13: NoSQL, Apache SOLR and Apache Hadoop

Hit highlighting

Example from solr admin

Page 14: NoSQL, Apache SOLR and Apache Hadoop

Spellcheck and autosuggest

Spellcheck:

•Query suggestion for a misspelled query term

http://localhost:8983/solr/spell?q=hell+ultrashar&spellcheck=true&spellcheck.collate=true&spellcheck.build=true

<lst name="spellcheck">
  <lst name="suggestions">
    <lst name="hell">
      <int name="numFound">1</int>
      <int name="startOffset">0</int>
      <int name="endOffset">4</int>
      <arr name="suggestion">
        <str>dell</str>
      </arr>
    </lst>
    <lst name="ultrashar">
      <int name="numFound">1</int>
      <int name="startOffset">5</int>
      <int name="endOffset">14</int>
      <arr name="suggestion">
        <str>ultrasharp</str>
      </arr>
    </lst>
    <str name="collation">dell ultrasharp</str>
  </lst>
</lst>
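On the client side, the interesting parts of this response are the per-term suggestions and the collated query. A minimal sketch of extracting them with the Python standard library follows; `RESPONSE` is a cleaned-up copy of the spellcheck section from the slide, and `parse_spellcheck` is a hypothetical helper, not part of any Solr client.

```python
import xml.etree.ElementTree as ET

# Cleaned-up copy of the spellcheck section shown on the slide.
RESPONSE = """
<lst name="spellcheck">
  <lst name="suggestions">
    <lst name="hell">
      <int name="numFound">1</int>
      <int name="startOffset">0</int>
      <int name="endOffset">4</int>
      <arr name="suggestion"><str>dell</str></arr>
    </lst>
    <lst name="ultrashar">
      <int name="numFound">1</int>
      <int name="startOffset">5</int>
      <int name="endOffset">14</int>
      <arr name="suggestion"><str>ultrasharp</str></arr>
    </lst>
    <str name="collation">dell ultrasharp</str>
  </lst>
</lst>
"""


def parse_spellcheck(xml_text):
    """Return ({misspelled term: [suggestions]}, collated query or None)."""
    root = ET.fromstring(xml_text.strip())
    suggestions = root.find("lst[@name='suggestions']")
    per_term = {
        term.get("name"): [s.text for s in term.findall("arr[@name='suggestion']/str")]
        for term in suggestions.findall("lst")
    }
    collation = suggestions.findtext("str[@name='collation']")
    return per_term, collation


per_term, collation = parse_spellcheck(RESPONSE)
```

The collation is what a "Did you mean" link would typically re-submit as the new query.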

Autosuggest:

Example with solr and jquery

Page 15: NoSQL, Apache SOLR and Apache Hadoop

Advanced sorting, ranking and searching

•sort=score+asc

•sort=Author+desc,score+desc

•boosting single documents

•Term Frequency—tf

•Inverse Document Frequency – idf

•Co-ordination Factor – coord (the more of the query terms match, the greater the score)

•Field Length – fieldNorm (the shorter the matching field is in number of indexed terms, the greater the document’s score)

•AND, OR, NOT, NEAR, fuzzy search

•Smashing~0.7 yields more results than just Smashing
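How the four ranking factors combine can be shown with a simplified scorer in Python. It follows the shape of Lucene's classic tf-idf similarity (tf as the square root of the term count, idf as 1 + ln(numDocs / (docFreq + 1)), fieldNorm as 1 / sqrt(field length)), but leaves out query normalization and boosts, so the absolute numbers will not match Solr's. The three-document index is invented.

```python
from math import log, sqrt

# Hypothetical three-document index; each document is one tokenized field.
INDEX = {
    "d1": ["smashing", "pumpkins"],
    "d2": ["smashing", "pumpkins", "live", "in", "helsinki", "smashing"],
    "d3": ["solr", "in", "action"],
}


def score(query_terms, doc_id, index=INDEX):
    """Simplified tf-idf score: coord * sum(tf * idf^2 * fieldNorm)."""
    terms = index[doc_id]
    field_norm = 1 / sqrt(len(terms))               # shorter field, higher score
    matched, total = 0, 0.0
    for term in query_terms:
        freq = terms.count(term)
        if freq == 0:
            continue
        matched += 1
        doc_freq = sum(term in t for t in index.values())
        tf = sqrt(freq)                              # term frequency
        idf = 1 + log(len(index) / (doc_freq + 1))   # inverse document frequency
        total += tf * idf ** 2 * field_norm
    coord = matched / len(query_terms)              # share of query terms that matched
    return coord * total


ranked = sorted(INDEX, key=lambda d: score(["smashing", "pumpkins"], d), reverse=True)
```

In this toy index d1 ranks above d2 even though d2 mentions "smashing" twice, because the field length norm favours the shorter field.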

Page 16: NoSQL, Apache SOLR and Apache Hadoop

Distributed and replicated search

Before doing this:
•Consider vertical scaling (a faster and better machine)
•Rethink the data model (what data goes to which solr index)
•Remove logging on updates (and / or searches)
•Redesign your index: make as many fields as possible non-indexed and non-stored (depending on your use cases)
•Check your Internet connection

Page 17: NoSQL, Apache SOLR and Apache Hadoop

Extendability

Plugins:

•Query parser: extend LuceneQParserPlugin

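import org.apache.solr.common.params.SolrParams;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.search.LuceneQParserPlugin;
import org.apache.solr.search.QParser;

public class NerdCampQParserPlugin extends LuceneQParserPlugin {
  public QParser createParser(String qstr, SolrParams localParams, SolrParams params, SolrQueryRequest req) {
    // Custom parsing would go here; delegating to the default Lucene parser keeps the skeleton compilable.
    return super.createParser(qstr, localParams, params, req);
  }
}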

Page 18: NoSQL, Apache SOLR and Apache Hadoop

SOLR I/O

•Nutch (crawler)

•CSV, XML, DataImportHandlers, DB import, Apache Tika (rich document import, like pdf), your format

•Output: xml, json, python, javabin, csv…, your format

Page 19: NoSQL, Apache SOLR and Apache Hadoop

SOLR Processing Pipeline
•At each step, a document gets transformed
•Stop word removal
•Stemming
•(Smart) tokenization
•Ngrams (letter level and word level)
•Regular expressions
•Lowercasing
•Reversed wildcard
•Duplicate removal
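The "each step transforms the document" idea can be imitated with a chain of small functions in Python. This is a toy analysis chain and not Solr's tokenizers and filters: the stop word list is tiny, the stemmer just strips a few English suffixes, and duplicate removal here simply drops repeated tokens. All function names are hypothetical.

```python
import re

STOP_WORDS = {"a", "an", "and", "the", "of", "in"}  # tiny illustrative list


def tokenize(text):
    """Split text into word tokens."""
    return re.findall(r"\w+", text)


def lowercase(tokens):
    return [t.lower() for t in tokens]


def remove_stop_words(tokens):
    return [t for t in tokens if t not in STOP_WORDS]


def naive_stem(tokens):
    """Strip a few English suffixes; a crude stand-in for a real stemmer."""
    return [re.sub(r"(ing|es|s)$", "", t) if len(t) > 4 else t for t in tokens]


def remove_duplicates(tokens):
    return list(dict.fromkeys(tokens))  # keeps the first occurrence, preserves order


def char_ngrams(token, n=3):
    """Letter-level n-grams of a single token."""
    return [token[i:i + n] for i in range(len(token) - n + 1)]


PIPELINE = [tokenize, lowercase, remove_stop_words, naive_stem, remove_duplicates]


def analyze(text):
    """Run the text through every step of the pipeline in order."""
    for step in PIPELINE:
        text = step(text)
    return text


terms = analyze("The Searching of the indexes and Searches in Solr")
```

Because the same chain is applied at index time and at query time, "Searching" and "searches" end up as the same indexed term.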

Page 20: NoSQL, Apache SOLR and Apache Hadoop

Solr on the cloud
Hadoop: MapReduce
ZooKeeper: at least 3 ZooKeepers, to have 1-2 managing your zoo
Batch indexing, no realtime search yet

Hadoop vital components: Core and API

MapReduce -- computation model
HDFS
I/O
ZooKeeper
Pig (adds a level of abstraction for processing large datasets)
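As an illustration of the MapReduce computation model applied to batch indexing, here is a single-process Python sketch that builds a small inverted index in three phases. It only mimics the model and does not use Hadoop; the documents and function names are invented, and in a real job the shuffle step is performed by the framework across machines.

```python
from collections import defaultdict

# Hypothetical input documents: (doc_id, text).
DOCUMENTS = [("d1", "solr on hadoop"), ("d2", "hadoop mapreduce"), ("d3", "solr search")]


def map_phase(doc_id, text):
    """Emit one (term, doc_id) pair per distinct term of a document."""
    for term in sorted(set(text.split())):
        yield term, doc_id


def shuffle(pairs):
    """Group emitted values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(term, doc_ids):
    """Produce the posting list for one term."""
    return term, sorted(doc_ids)


pairs = [pair for doc in DOCUMENTS for pair in map_phase(*doc)]
inverted_index = dict(reduce_phase(t, ids) for t, ids in shuffle(pairs).items())
```

Each map call looks at one document only and each reduce call at one term only, which is what lets the work be spread over many commodity machines.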

Page 21: NoSQL, Apache SOLR and Apache Hadoop

Solr on the cloud
Does it shine? Yes, but not fully

Page 22: NoSQL, Apache SOLR and Apache Hadoop

References
[1] Tim Perdue: NoSQL: An Overview of NoSQL Databases, About.com Guide. http://bit.ly/fFQOYI
[2] Sarah Pidcock (2011-01-31). "Dynamo: Amazon’s Highly Available Key-value Store". http://www.cs.uwaterloo.ca/: WATERLOO. p. 2/22. Retrieved 2011-04-05. "Dynamo: a highly available and scalable distributed data store"
[3] http://cassandra.apache.org/
[4] http://labs.google.com/papers/bigtable.html
[5] http://aws.amazon.com/ (look for SimpleDB)
[6] http://couchdb.apache.org/
[7] http://neo4j.org/
[8] Information Week: Surprise: 44% Of Business IT Pros Never Heard Of NoSQL. http://bit.ly/go5ios
[9] http://drupal.org/
[10] Mark Miller: Scaling Lucene and Solr // Lucid Imagination
[11] http://wiki.apache.org/solr/SpatialSearch
[12] http://dmitrykan.blogspot.com/2011/01/solr-speed-up-batch-posting.html
[13] http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters

Page 23: NoSQL, Apache SOLR and Apache Hadoop

References
[14] Using Nutch with SOLR, http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/
[15] http://tika.apache.org/
[16] http://lucene.apache.org/solr/