Learning Lessons: Building a CMS on top of NoSQL technologies

Preview:

Citation preview

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org

Learning LessonsBuilding a content repositoryon top of NoSQL Technologies

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org 2

hello,I’m @stevenn from @outerthought

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org 3

This story is about

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org

Complexity

4

complexity

age

1.0

2.0

3.0

software architecture

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org

Complexity

5

complexity

age

1.0

2.0

3.0

user interest

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org

We Prefer Sophistication

6

» the challenge for us was to scale ...without dropping features

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org

The typical CMS ‘architecture’

7

database (+opt. filesystem) (+ opt. full-text indexes)

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org

The typical CMS ‘architecture’

8

application

database (+opt. filesystem) (+ opt. full-text indexes)

cache

cacheapplication

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org

The typical CMS ‘architecture’

9

more cache

database (+opt. filesystem) (+ opt. full-text indexes)

application cache

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org

The typical CMS ‘architecture’

10

even more cache

more cache

database (+opt. filesystem) (+ opt. full-text indexes)

application cache

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org

The typical CMS ‘architecture’

11

client

even more cache

more cache

database (+opt. filesystem) (+ opt. full-text indexes)

application cache

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org

The typical CMS ‘architecture’

12

client (+cache)

even more cache

more cache

database (+opt. filesystem) (+ opt. full-text indexes)

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org

What we found hard to scale

» access control

» facet browsing

» all the nifty stuff people were using our software for

» ... anything that required random accessto in-memory-cache data for computations

13

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org

Beyond the ‘scaling’ problem

» three-prong data layer

» result set merging (between MySQL & Lucene)» happened in appcode/memory

» ‘transactions’, set operations = hard

14

fs

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org

Beyond the three-prong problem

» errrr..... “Failover” ..... ?

15

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org

If we would be able to add more nodes ...

»True Distribution

16

scalability

availability

performance

... in the line of fire

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org

Solution 1

» do MORE inside the database

17

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org

Functional

18

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org

Functional

19

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org

Infrastructural

2020

more database !

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org 21

even more database !

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org 22

let’s add message busses !

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org 23

RMI! JMS over JDBC! stuff!

w00t !

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org 24

http://bigdatamatters.com/bigdatamatters/2010/04/high-availability-with-oracle.html

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org

Business Development 101

25

budget

user interest

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org

Solution II

26

sophistication

nosql?

1.0

2.0

3.0

ability to cope

mysql

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org

Enter The Cambrian Explosion

27

NoSQL

Cassandra

neo4j

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org

Requirements, phase I

28

» automatic scaling to large data sets

» fault-tolerance: replication, automatic handling of failing nodes

» a flexible data model supporting sparse data

» runs on commodity hardware

» efficient random access to data

» open source, ability to participate in the development thus drive the direction of the project

» some preference for a Java-based solution

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org

Requirements, phase II

»After careful consideration, we realized the important choices were also:

» consistency: no chance of having two conflicting versions of a row

» atomic updates of a single row, single-row transactions

» bonus points for MapReduce integration» e.g. full-text index rebuilding

29

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org

That brought us to HBase, which bought us:

» a datamodel where you can have column families which keep all versions and others which do not, which fits very well on our CMS document model

» ordered tables with the ability to do range scans on them, which allows to build scalable indexes on top of it

» HDFS, a convenient place to store large blobs

» Apache license and community, a familiar environment for us

30

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org 31

»OK, so now we had a data store !

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org 32

»However, content repository =store + search

ouch!

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org 33

That was

easy !

(however ...)

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org 34

Search ponderings

»CMS = two types of search

» structured search» numbers, strings» based on logic (SQL, anyone?)

» information retrieval (or: full-text search)» text» based on statistics

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org

Search ponderings

»All of that, at scale

35

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org

Structured Search

»HBase Indexing Library

» idea from Google App Engine datastore indexes

» http://code.google.com/appengine/articles/index_building.html

36

rowkey

A

B

col

val3

val2

col

foo6

foo7

content table index table A

rowkey

val2-B

val3-A

col

order

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org

Full-text / IR search

» Lucene?

» no sharding (for scale)

» no replication (for availability)

» batched index updates (not real-time)

37

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org

Beyond Lucene» Katta

» scalable architecture, however only search, no indexing

» Elastic Search

» very young (sorry)

» hbasene et al.

» stores inverted index in HBase, might not scale all features

» SOLR

» widely used, schema, facets, query syntax, cloud branch

More info: http://lilycms.org/lily/prerelease/technology.html

38

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org 39

+?

=Easy ! O

r ?

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org 40

Remember distribution ?Remember secondary indexes ?

➙ Need for reliable queuing

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org 41

Connecting things

»we needed a reliable bridge between our main storage (HBase) and our index/search server(s) (SOLR)

» indexing, reindexing, mass reindexing (M/R)

»we need a reliable method of updating HBase secondary indexes

» all of that eventually to run distributed

» distribution means coping with failure

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org

Solution

»ACMEMessageQueue ? Bzzzzzt.We wanted fault-safe HBase persistence for the queues.Also for ease of administration.

»➙ WAL & Queue implemented on top of HBase tables

42

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org

WAL / Queue

» WAL» guaranteed execution

of synchronous actions

» call doesn’t return before secondary action finishes

» e.g. update secondary actions

» if all goes well, size = #concurrent ops

» will be useful/made available outside of Lily context as well!

» Queue» triggering of async

actions

» e.g. (re)index (updated) record with SOLR back-end

» size depends on speed of back-end process

43

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org

The Sum» Lily model (records & fields)

» mapped onto HBase (=storage)

» indexed and searchable through SOLR

» using a WAL/Queue mechanismimplemented in HBase

» runtime based on Kauri

» with client/server comms via Avro

44

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org 45

Architecture

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org 46

Architecture

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org

Roadmap

»Today = release of learning material (architecture, model, API, Javadoc)➥ www.lilycms.org➥ bit.ly/lilyprerelease

»Mid July = ‘proof of architecture’ release

» from there on, ca. 3-monthly releasesleading up to Lily 1.0

47

Nearly there!

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org 48

bit.ly/lilyprerelease

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org

License

»Apache

49

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org

Business model

»Consulting, mentoring, turn-key projects

» Strong focus on partner relations

» targeting vertical markets

» geographic coverage

» SaaS offerings

»Markets: media, finance, insurance, govt, heritage ... LOTS of semi-structured data

»Not: OLAP

50

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org

More ?

» @outerthought

»www.lilycms.org/lily/prerelease.html

51

Recommended