Scaling the Content Repository with Elasticsearch

  • Published on
    11-Apr-2017

Transcript

  • SCALING THE DOCUMENT REPOSITORY WITH ELASTICSEARCH

  • SOME CONTEXT

    What We Do and What Problems We Try to Solve

  • NUXEO

    we provide a Platform that developers can use to build highly customized Content Applications

    we provide components, and the tools to assemble them

    everything we do is open source (for real)

    various customers - various use cases:
    track game builds, Electronic Flight Bags, central repository for Models, food industry PLM

    me: developer & CTO - joined the Nuxeo project 10+ years ago

    https://github.com/nuxeo

  • DOCUMENT REPOSITORY

    Store Documents / Assets / Objects

    Blob objects

    Complex data structures

    Hierarchy, references and links

    Audit trail & Versioning

    Data level security & encryption

    Lifecycle, workflows ...

    API (REST, CMIS, Java, JS...)

    CRUD

    Search

    Service API

    Heavily configurable: all data structures are flexible / customizable

    Used by developers to build Content Applications on top of the Nuxeo Repository

  • OUR CHALLENGES

    CRUD on a large repository works

    inject at 6,000 docs/s up to 1 Billion

    not so many companies have that many documents anyway

    Queries are the main scalability issue

    impact of C_UD (create, update, delete) vs search

    multi-criteria queries + full-text

    security filtering

    configurable data structures

    user defined queries

    UI heavily depends on search

    Search API is the most used: search is the main scalability challenge

  • HISTORY: NUXEO & LUCENE

    2006: Nuxeo CPS 3.6

    (Python / Zope based)

    Replace built-in index with Lucene + XML-RPC server

    PyLucene (GCJ build + Python bindings!)

    Complex setup

    2007: Nuxeo Platform 5.1

    JCR : queries (and backup) issues

    Integrate Compass Core

    transactional & storage abstraction

    Missing sync & concurrency issues

    2009: Nuxeo 5.2

    VCS : Homebrew SQL based repository

    Search in database but some real limitations

    2013 / 2014: Nuxeo 5.9.3

    Reintroduce Lucene in the stack via elasticsearch

    Learn from our past mistakes

    Leverage elasticsearch architecture

    easy deployment

    safe indexing

    powerful search

    ... we are now happy with Elasticsearch

    Lucene and Nuxeo have a long story ...

  • REPOSITORY & SEARCH

    Understanding the Issue

  • REPOSITORY & SEARCH

    Search API is the most used: search is the main scalability challenge

  • COMPLEX SQL QUERIES

    Configurable Data Structure
    + User defined multi-criteria searches
    => multiple & complex SQL queries

    SELECT "hierarchy"."id" AS "_C1"
    FROM "hierarchy"
      JOIN "fulltext" ON "fulltext"."id" = "hierarchy"."id"
      LEFT JOIN "misc" "_F1" ON "hierarchy"."id" = "_F1"."id"
      LEFT JOIN "dublincore" "_F2" ON "hierarchy"."id" = "_F2"."id"
    WHERE ("hierarchy"."primarytype" IN ('Video', 'Picture', 'File', 'Audio'))
      AND ((TO_TSQUERY('english', 'sydney') @@ NX_TO_TSVECTOR("fulltext"."fulltext")))
      AND ("hierarchy"."isversion" IS NULL)
      AND ("_F1"."lifecyclestate" <> 'deleted')
      AND ("_F2"."created" IS NOT NULL)
    ORDER BY "_F2"."created" DESC
    LIMIT 201 OFFSET 0;

  • ABOUT SQL LIMITATIONS

    Scaling queries is complex

    depends on indexes, I/O speed and available memory

    cannot satisfy all types of queries

    poor performance on unselective multi-criteria queries

    some types of queries simply cannot be fast in SQL

    Scalability

    Scale up is expensive

    Scale out is complex at best (XA & MVCC)

    Sharding requires a global index

    Fulltext support is usually poor

    limitations on features & impact on performance

    SQL technology is not the solution

  • IS NOSQL THE SOLUTION!?

  • USING NOSQL FOR THE REPOSITORY

  • ABOUT THE NOSQL OPTION

    (sadly) NoSQL is no magic

    it does work very well for CRUD and it scales easily, but

    query options are limited and performance is not that good

    multi-document transactions are usually not safe

    better suited to DBs with billions of entries and simple queries

    SQL has some real advantages

    ACID (and MVCC) is good

    Workflows and bulk updates are a typical use case

    (even transient) lack of consistency is complex to explain to users

    lots of existing tools (BI & reporting), lots of existing skills (DBA)

    PGSQL (or AWS RDS) can be very cost effective

    Changing the repository, SQL or NoSQL, is not the solution

  • KEEP THE REPOSITORY SQL OR NOSQL

    BUT FIND A SUPER FAST INDEX ENGINE

  • REPOSITORY & ELASTICSEARCH

    Toward a Hybrid Storage

  • HYBRID STORAGE

    Use each storage solution for what it does best

    SQL DB

    store content in an ACID way

    store & retrieve

    queries needing ACID and MVCC

    elasticsearch

    provide powerful and scalable queries

    do the heavy lifting that the RDBMS can not do

    scoring, native full-text, aggregates

    distributed search

    Route the query to the correct index depending on requirements
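The routing rule on this slide can be sketched in a few lines of Python. This is only an illustration of the idea, not the Nuxeo implementation: the `QueryRequirements` fields, the backend names and the `default` parameter are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class QueryRequirements:
    """Hypothetical description of what a caller needs from a query."""
    in_transaction: bool = False  # must see uncommitted changes (ACID / MVCC)
    fulltext: bool = False        # needs scoring or native full-text
    aggregates: bool = False      # needs facets / aggregates


def choose_backend(req: QueryRequirements, default: str = "elasticsearch") -> str:
    """Pick the storage that is best at what the query needs."""
    if req.in_transaction:
        # Only the repository can answer with transactional consistency.
        return "repository"
    if req.fulltext or req.aggregates:
        # Heavy lifting the RDBMS cannot do efficiently.
        return "elasticsearch"
    # Plain queries go wherever the deployment prefers (assumed configurable).
    return default
```

For instance, `choose_backend(QueryRequirements(in_transaction=True))` selects the repository, while a faceted listing goes to elasticsearch.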

  • ELASTICSEARCH & REPOSITORY

    One query

    Several possible backends

  • PERFORMANCE RESULTS

    Fast indexing

    No ACID constraints / No impedance issue

    3,500 documents/s when using SQL backend

    10,000 documents/s when using MongoDB

    Super query performance

    query on term using inverted index

    very efficient caching

    native full text support & distributed architecture

    3,000 queries/s with 1 elasticsearch node

    6,000 queries/s with 2 elasticsearch nodes

  • SOME REAL LIFE FEEDBACK

    Customer: We are now testing the Nuxeo 6 stack in AWS. DB is Postgres SQL db.r3.8xlarge which is a 32 cpus. Between 350 and 400 tps the DB cpu is maxed out.

    Nuxeo support: Please activate nuxeo-elasticsearch !

    Customer: We are now able to do about 1200 tps with almost 0 DB activity. Question though, Nuxeo and ES do not seem to be maxed out ?

    Nuxeo support: It looks like you have some network congestion between your client and the servers.

    Customer: ...right... we have pushed past 1900 tps ... I think we are close to declaring success for this configuration ...

  • SQL VS ELASTICSEARCH

    Scalability is simply of another order of magnitude

  • SCALE OUT

  • UNIFIED INDEX ON SHARDED REPOSITORY

    Tested with 10 PgSQL databases

    10 x 100 Million documents => 1 Billion documents

    1 elasticsearch cluster

  • IS THIS MAGIC?

    For users

    it really looks like magic

    For sales guys & solution architects

    it is magic: it unleashes a lot of possibilities

    performance is just one aspect

    For Nuxeo Core Dev team

    it was almost magic: some integration work was needed

  • INTEGRATING ELASTICSEARCH

    Inside nuxeo-elasticsearch Plugin

  • CHALLENGES TO ADDRESS

    Keep index in sync with the repository

    No transaction management

    Do not lose anything

    Without support for update

    Mitigate eventually consistent effect

    Avoid displaying transient inconsistent state

    Handle security filtering

    Without join

    Without post-filtering

  • SECURITY FILTERING

    Constraints

    Filtering must be done at index level: no post-filtering

    Join is not an option

    cannot join with the DB or within Lucene (previously tested without success)

    Solution

    index the ReadACL as part of the JSON Document

    list of groups / users who can read the resource

    automatically add a filter clause on ACL

    Consequences

    Recursive indexing is needed

    More pressure to maintain re-indexing processing

    as a last resort: the Document security is checked by the repository anyway
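As an illustration of "automatically add a filter clause on ACL", the sketch below wraps any query in a `bool` query whose `filter` is a `terms` clause on the indexed read ACL. It is a minimal sketch under assumptions: the field name `ecm:acl` and the helper name are chosen for the example, and the `bool`/`filter` form is the current Elasticsearch query DSL, which may differ from what the plugin generated at the time.

```python
from typing import Dict, List


def with_read_acl_filter(query: Dict, principals: List[str],
                         acl_field: str = "ecm:acl") -> Dict:
    """Restrict `query` to documents readable by one of `principals`.

    `principals` is the user name plus the groups the user belongs to.
    The ACL field name is an assumption made for this example.
    """
    return {
        "bool": {
            "must": [query],
            # A filter clause does not affect scoring and can be cached.
            "filter": [{"terms": {acl_field: sorted(set(principals))}}],
        }
    }


user_query = {"match": {"dc:title": "sydney"}}
secured = with_read_acl_filter(user_query, ["alice", "members", "alice"])
```

Because the ACL travels with each indexed document, no join or post-filtering is needed at query time; the cost moves to indexing, since a permission change on a folder has to be re-indexed recursively.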

  • SAFE INDEXING FLOW

    Do not try to make it Transactional

    Collect and de-duplicate Repository Events during Transaction

    Wait for commit to be done at the repository level

    then call elasticsearch

    Do not lose any update

    run Indexing Tasks in a distributed Job infrastructure

    Jobs should be persisted

    Jobs should be retried

    Jobs should be monitored
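The flow above can be sketched as follows. This is a simplified, in-memory illustration, not the plugin's code: `IndexingCollector`, `run_with_retry` and the event names are hypothetical, and a real job infrastructure would also persist and monitor the jobs.

```python
class IndexingCollector:
    """Collects repository events during one transaction."""

    def __init__(self, submit_job):
        self._submit_job = submit_job  # callable(doc_id, command)
        self._pending = {}             # doc_id -> last command (de-duplicated)

    def on_event(self, doc_id, event):
        # Several events on the same document collapse into one command.
        self._pending[doc_id] = "delete" if event == "deleted" else "index"

    def after_commit(self):
        """Called once the repository commit is done: only now submit jobs."""
        commands = list(self._pending.items())
        self._pending.clear()
        for doc_id, command in commands:
            self._submit_job(doc_id, command)
        return len(commands)

    def after_rollback(self):
        """Nothing was committed, so nothing must reach the index."""
        self._pending.clear()


def run_with_retry(job, max_attempts=3):
    """Run `job` until it succeeds or `max_attempts` is reached."""
    last_error = None
    for _ in range(max_attempts):
        try:
            return job()
        except Exception as error:  # a real worker would log and back off
            last_error = error
    raise last_error
```

The index is never part of the transaction: it is only called after a successful commit, and a failed indexing job is retried instead of rolling anything back.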

  • ASYNC INDEXING FLOW

  • MITIGATE EVENTUALLY CONSISTENT

    In the code:

    use case: need to see results from within the transaction

    query directly on the repository

    leverage ACID and MVCC of the SQL repository

    full-text search and facets are usually not needed by the code

    For the users:

    use case: see changes in listings in "real time"

    use pseudo-real time indexing

    indexing actions triggered by UI threads are flagged

    run as afterCompletion listener

    refresh elasticsearch index
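The pseudo-real time path can be illustrated with a toy model. The sketch below assumes an index whose writes only become searchable after a refresh, which is how elasticsearch behaves; the class and function names are hypothetical and the index is a plain in-memory stand-in.

```python
class NearRealTimeIndex:
    """In-memory stand-in: writes become searchable only after refresh()."""

    def __init__(self):
        self._buffer = {}
        self._searchable = {}

    def index(self, doc_id, doc):
        self._buffer[doc_id] = doc

    def refresh(self):
        self._searchable.update(self._buffer)
        self._buffer.clear()

    def visible_ids(self):
        return sorted(self._searchable)


def after_completion(index, commands, async_queue):
    """Run after the transaction completes.

    `commands` holds (doc_id, doc, from_ui_thread) tuples. Flagged commands
    are indexed right away and followed by a refresh; the others are queued
    for the asynchronous flow.
    """
    indexed_now = False
    for doc_id, doc, from_ui_thread in commands:
        if from_ui_thread:
            index.index(doc_id, doc)
            indexed_now = True
        else:
            async_queue.append((doc_id, doc))
    if indexed_now:
        index.refresh()  # make the user's own change visible in listings
```

Only the changes a user is waiting to see pay the cost of a synchronous call and a refresh; bulk or background updates stay on the asynchronous path.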

  • PSEUDO-SYNC INDEXING FLOW

  • DOES THIS WORK?

    Live for about 18 months now

    No missing sync issue

    some customers asked for verification tools

    but no problem was found

    re-index in bulk mode is very fast anyway

    No consistency issues

    good usage of hybrid query engines

    elasticsearch helped address several scaling challenges

    but elasticsearch brings us much more than just scalability

  • BONUS FROM ELASTICSEARCH

    More than Raw Speed

  • LEVERAGE AGGREGATES

    Leverage elasticsearch aggregates

    integrate with the Query system (PageProvider)

    integrate with the Listing / UI model (ContentView)

    Makes it easy to build and configure faceted search
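A faceted search boils down to a query plus one `terms` aggregation per facet. The sketch below builds such a request body and reads the bucket counts back. The helper names and the field names are assumptions for the example, and the response is a hand-written sample in the shape elasticsearch returns, not real data.

```python
def faceted_search_body(query, facets, page_size=20):
    """Build a search body with one terms aggregation per facet.

    `facets` maps a facet name to the indexed field it is computed on.
    """
    return {
        "query": query,
        "size": page_size,
        "aggs": {name: {"terms": {"field": field}}
                 for name, field in facets.items()},
    }


def facet_counts(response, name):
    """Extract {value: count} for one facet from a search response."""
    buckets = response["aggregations"][name]["buckets"]
    return {bucket["key"]: bucket["doc_count"] for bucket in buckets}


body = faceted_search_body({"match_all": {}}, {"by_type": "ecm:primaryType"})

# Hand-written sample with the structure of a terms aggregation response.
sample_response = {
    "aggregations": {
        "by_type": {"buckets": [{"key": "File", "doc_count": 3},
                                {"key": "Picture", "doc_count": 1}]}
    }
}
counts = facet_counts(sample_response, "by_type")
```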

  • ADVANCED INDEXING

    Fine tuning of elasticsearch indexing

    multi language support using multiple analyzers and copy_to

    compound fields created using groovy scripts

    Introduce elasticsearch hints into NXQL

    select a specific elasticsearch index / analyzer

    leverage elasticsearch operators

    do geolocation search

    -- Use an explicit Elasticsearch field
    SELECT * FROM Document WHERE /*+ES: INDEX(dc:title.ngram) */ dc:title = 'foo'

    -- Use ES operators not present in NXQL
    SELECT * FROM Document WHERE /*+ES: OPERATOR(regex) */ dc:title = 's.*y'
    SELECT * FROM Document WHERE /*+ES: OPERATOR(fuzzy) */ dc:title = 'zorkspaces'

    -- Use ES for GeoQuery based on geo_hash_cell: location near a point using geohash
    SELECT * FROM Document WHERE /*+ES: OPERATOR(geo_hash_cell)*/ osm:location IN ('40','-74','5')

    leverage what comes for free with elasticsearch
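The "multiple analyzers and copy_to" item can be illustrated with a mapping fragment. This is a sketch under assumptions, not the mapping shipped with the plugin: the `all_field` name and the helper are hypothetical, and the `text` type is the current elasticsearch type name (older releases used `string`).

```python
def multilingual_title_mapping(languages):
    """Mapping fragment: one sub-field per language analyzer, plus copy_to.

    Each sub-field (for example dc:title.french) is analyzed with the
    built-in analyzer of that language; the raw value is also copied into
    a catch-all field used for generic full-text search.
    """
    sub_fields = {lang: {"type": "text", "analyzer": lang} for lang in languages}
    return {
        "properties": {
            "dc:title": {
                "type": "text",
                "fields": sub_fields,
                "copy_to": "all_field",  # hypothetical catch-all field
            },
            "all_field": {"type": "text"},
        }
    }


mapping = multilingual_title_mapping(["english", "french"])
```

A query can then target `dc:title.french` explicitly, which is the kind of field selection the `INDEX(...)` hint in the NXQL examples above expresses.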

  • INDEX AUDIT TRAIL WITH ELASTICSEARCH

    Use elasticsearch to store & index Audit trail

    all events are serialized in JSON and stored inside elasticsearch

    Unleash Audit system power

    can store a lot of events

    can store and query arbitrary JSON structure

  • ELASTICSEARCH PASS-THROUGH

    Expose an HTTP pass-through API on top of Nuxeo integration

    Integrate Authentication & Authorization

    not all users can access workflow index

    Integrate Security Filtering

    activate data level security filtering

    Expose "virtual index" via http

    index + filter

    Use elasticsearch API related components on Nuxeo data

    Documents + Audit log

    With embedded security

    Easy real time data analytics on business data

  • DATA ANALYTICS WITH ELASTICSEARCH

    Queries on Documents + Audit: flexible reporting on workflows

  • READ DOCUMENTS FROM ELASTICSEARCH

    Full JSON Document is stored in elasticsearch (_source)

    required to be able to do fast re-indexing

    We can retrieve Documents from elasticsearch

    execute full search & retrieve without touching the DB

    By controlling indexing we can use the elasticsearch index

    as a persistent cache on top of the repository

    as a staging area for queries
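Since the full JSON document is kept in `_source`, a result page can be built from the search response alone. The sketch below shows the idea on a hand-written sample response with the standard `hits.hits[]._source` layout; the helper name and the document fields are assumptions for the example.

```python
def documents_from_hits(response):
    """Return (id, document) pairs straight from a search response.

    No repository lookup is made: the document body comes from `_source`.
    """
    return [(hit["_id"], hit["_source"]) for hit in response["hits"]["hits"]]


# Hand-written sample in the shape of an elasticsearch search response.
sample_response = {
    "hits": {
        "hits": [
            {"_id": "doc-1", "_source": {"dc:title": "Sydney harbour"}},
            {"_id": "doc-2", "_source": {"dc:title": "Sydney opera"}},
        ]
    }
}
documents = documents_from_hits(sample_response)
```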

  • NEXT STEPS

    Leveraging Even More elasticsearch

  • NEXT STEPS

    Leverage elasticsearch percolator

    push update on the nuxeo-drive clients

    notify users about saved search

    automatic categorization

    Search result highlighting

    not sure why it is still not there ...

    Plug automatic denormalization

  • ANY QUESTIONS?

    Thank You!

    https://github.com/nuxeo

    http://www.nuxeo.com/careers/
