51
IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org Learning Lessons Building a content repository on top of NoSQL Technologies

Learning Lessons: Building a CMS on top of NoSQL technologies

  • Upload
    ngdata

  • View
    18.212

  • Download
    2

Embed Size (px)

Citation preview

Page 1: Learning Lessons: Building a CMS on top of NoSQL technologies

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org

Learning LessonsBuilding a content repositoryon top of NoSQL Technologies

Page 2: Learning Lessons: Building a CMS on top of NoSQL technologies

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org 2

hello,I’m @stevenn from @outerthought

Page 3: Learning Lessons: Building a CMS on top of NoSQL technologies

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org 3

This story is about

Page 4: Learning Lessons: Building a CMS on top of NoSQL technologies

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org

Complexity

4

complexity

age

1.0

2.0

3.0

software architecture

Page 5: Learning Lessons: Building a CMS on top of NoSQL technologies

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org

Complexity

5

complexity

age

1.0

2.0

3.0

user interest

Page 6: Learning Lessons: Building a CMS on top of NoSQL technologies

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org

We Prefer Sophistication

6

» the challenge for us was to scale ...without dropping features

Page 7: Learning Lessons: Building a CMS on top of NoSQL technologies

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org

The typical CMS ‘architecture’

7

database (+opt. filesystem) (+ opt. full-text indexes)

Page 8: Learning Lessons: Building a CMS on top of NoSQL technologies

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org

The typical CMS ‘architecture’

8

application

database (+opt. filesystem) (+ opt. full-text indexes)

cache

Page 9: Learning Lessons: Building a CMS on top of NoSQL technologies

cacheapplication

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org

The typical CMS ‘architecture’

9

more cache

database (+opt. filesystem) (+ opt. full-text indexes)

Page 10: Learning Lessons: Building a CMS on top of NoSQL technologies

application cache

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org

The typical CMS ‘architecture’

10

even more cache

more cache

database (+opt. filesystem) (+ opt. full-text indexes)

Page 11: Learning Lessons: Building a CMS on top of NoSQL technologies

application cache

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org

The typical CMS ‘architecture’

11

client

even more cache

more cache

database (+opt. filesystem) (+ opt. full-text indexes)

Page 12: Learning Lessons: Building a CMS on top of NoSQL technologies

application cache

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org

The typical CMS ‘architecture’

12

client (+cache)

even more cache

more cache

database (+opt. filesystem) (+ opt. full-text indexes)

Page 13: Learning Lessons: Building a CMS on top of NoSQL technologies

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org

What we found hard to scale

» access control

» facet browsing

» all the nifty stuff people were using our software for

» ... anything that required random accessto in-memory-cache data for computations

13

Page 14: Learning Lessons: Building a CMS on top of NoSQL technologies

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org

Beyond the ‘scaling’ problem

» three-prong data layer

» result set merging (between MySQL & Lucene)» happened in appcode/memory

» ‘transactions’, set operations = hard

14

fs

Page 15: Learning Lessons: Building a CMS on top of NoSQL technologies

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org

Beyond the three-prong problem

» errrr..... “Failover” ..... ?

15

Page 16: Learning Lessons: Building a CMS on top of NoSQL technologies

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org

If we would be able to add more nodes ...

»True Distribution

16

scalability

availability

performance

... in the line of fire

Page 17: Learning Lessons: Building a CMS on top of NoSQL technologies

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org

Solution 1

» do MORE inside the database

17

Page 18: Learning Lessons: Building a CMS on top of NoSQL technologies

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org

Functional

18

Page 19: Learning Lessons: Building a CMS on top of NoSQL technologies

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org

Functional

19

Page 20: Learning Lessons: Building a CMS on top of NoSQL technologies

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org

Infrastructural

2020

more database !

Page 21: Learning Lessons: Building a CMS on top of NoSQL technologies

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org 21

even more database !

Page 22: Learning Lessons: Building a CMS on top of NoSQL technologies

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org 22

let’s add message busses !

Page 23: Learning Lessons: Building a CMS on top of NoSQL technologies

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org 23

RMI! JMS over JDBC! stuff!

w00t !

Page 24: Learning Lessons: Building a CMS on top of NoSQL technologies

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org 24

http://bigdatamatters.com/bigdatamatters/2010/04/high-availability-with-oracle.html

Page 25: Learning Lessons: Building a CMS on top of NoSQL technologies

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org

Business Development 101

25

budget

user interest

Page 26: Learning Lessons: Building a CMS on top of NoSQL technologies

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org

Solution II

26

sophistication

nosql?

1.0

2.0

3.0

ability to cope

mysql

Page 27: Learning Lessons: Building a CMS on top of NoSQL technologies

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org

Enter The Cambrian Explosion

27

NoSQL

Cassandra

neo4j

Page 28: Learning Lessons: Building a CMS on top of NoSQL technologies

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org

Requirements, phase I

28

» automatic scaling to large data sets

» fault-tolerance: replication, automatic handling of failing nodes

» a flexible data model supporting sparse data

» runs on commodity hardware

» efficient random access to data

» open source, ability to participate in the development thus drive the direction of the project

» some preference for a Java-based solution

Page 29: Learning Lessons: Building a CMS on top of NoSQL technologies

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org

Requirements, phase II

»After careful consideration, we realized the important choices were also:

» consistency: no chance of having two conflicting versions of a row

» atomic updates of a single row, single-row transactions

» bonus points for MapReduce integration» e.g. full-text index rebuilding

29

Page 30: Learning Lessons: Building a CMS on top of NoSQL technologies

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org

That brought us to HBase, which bought us:

» a datamodel where you can have column families which keep all versions and others which do not, which fits very well on our CMS document model

» ordered tables with the ability to do range scans on them, which allows to build scalable indexes on top of it

» HDFS, a convenient place to store large blobs

» Apache license and community, a familiar environment for us

30

Page 31: Learning Lessons: Building a CMS on top of NoSQL technologies

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org 31

»OK, so now we had a data store !

Page 32: Learning Lessons: Building a CMS on top of NoSQL technologies

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org 32

»However, content repository =store + search

ouch!

Page 33: Learning Lessons: Building a CMS on top of NoSQL technologies

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org 33

That was

easy !

(however ...)

Page 34: Learning Lessons: Building a CMS on top of NoSQL technologies

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org 34

Search ponderings

»CMS = two types of search

» structured search» numbers, strings» based on logic (SQL, anyone?)

» information retrieval (or: full-text search)» text» based on statistics

Page 35: Learning Lessons: Building a CMS on top of NoSQL technologies

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org

Search ponderings

»All of that, at scale

35

Page 36: Learning Lessons: Building a CMS on top of NoSQL technologies

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org

Structured Search

»HBase Indexing Library

» idea from Google App Engine datastore indexes

» http://code.google.com/appengine/articles/index_building.html

36

rowkey

A

B

col

val3

val2

col

foo6

foo7

content table index table A

rowkey

val2-B

val3-A

col

order

Page 37: Learning Lessons: Building a CMS on top of NoSQL technologies

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org

Full-text / IR search

» Lucene?

» no sharding (for scale)

» no replication (for availability)

» batched index updates (not real-time)

37

Page 38: Learning Lessons: Building a CMS on top of NoSQL technologies

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org

Beyond Lucene» Katta

» scalable architecture, however only search, no indexing

» Elastic Search

» very young (sorry)

» hbasene et al.

» stores inverted index in HBase, might not scale all features

» SOLR

» widely used, schema, facets, query syntax, cloud branch

More info: http://lilycms.org/lily/prerelease/technology.html

38

Page 39: Learning Lessons: Building a CMS on top of NoSQL technologies

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org 39

+?

=Easy ! O

r ?

Page 40: Learning Lessons: Building a CMS on top of NoSQL technologies

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org 40

Remember distribution ?Remember secondary indexes ?

➙ Need for reliable queuing

Page 41: Learning Lessons: Building a CMS on top of NoSQL technologies

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org 41

Connecting things

»we needed a reliable bridge between our main storage (HBase) and our index/search server(s) (SOLR)

» indexing, reindexing, mass reindexing (M/R)

»we need a reliable method of updating HBase secondary indexes

» all of that eventually to run distributed

» distribution means coping with failure

Page 42: Learning Lessons: Building a CMS on top of NoSQL technologies

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org

Solution

»ACMEMessageQueue ? Bzzzzzt.We wanted fault-safe HBase persistence for the queues.Also for ease of administration.

»➙ WAL & Queue implemented on top of HBase tables

42

Page 43: Learning Lessons: Building a CMS on top of NoSQL technologies

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org

WAL / Queue

» WAL» guaranteed execution

of synchronous actions

» call doesn’t return before secondary action finishes

» e.g. update secondary actions

» if all goes well, size = #concurrent ops

» will be useful/made available outside of Lily context as well!

» Queue» triggering of async

actions

» e.g. (re)index (updated) record with SOLR back-end

» size depends on speed of back-end process

43

Page 44: Learning Lessons: Building a CMS on top of NoSQL technologies

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org

The Sum» Lily model (records & fields)

» mapped onto HBase (=storage)

» indexed and searchable through SOLR

» using a WAL/Queue mechanismimplemented in HBase

» runtime based on Kauri

» with client/server comms via Avro

44

Page 45: Learning Lessons: Building a CMS on top of NoSQL technologies

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org 45

Architecture

Page 46: Learning Lessons: Building a CMS on top of NoSQL technologies

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org 46

Architecture

Page 47: Learning Lessons: Building a CMS on top of NoSQL technologies

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org

Roadmap

»Today = release of learning material (architecture, model, API, Javadoc)➥ www.lilycms.org➥ bit.ly/lilyprerelease

»Mid July = ‘proof of architecture’ release

» from there on, ca. 3-monthly releasesleading up to Lily 1.0

47

Nearly there!

Page 48: Learning Lessons: Building a CMS on top of NoSQL technologies

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org 48

bit.ly/lilyprerelease

Page 49: Learning Lessons: Building a CMS on top of NoSQL technologies

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org

License

»Apache

49

Page 50: Learning Lessons: Building a CMS on top of NoSQL technologies

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org

Business model

»Consulting, mentoring, turn-key projects

» Strong focus on partner relations

» targeting vertical markets

» geographic coverage

» SaaS offerings

»Markets: media, finance, insurance, govt, heritage ... LOTS of semi-structured data

»Not: OLAP

50

Page 51: Learning Lessons: Building a CMS on top of NoSQL technologies

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org

More ?

» @outerthought

»www.lilycms.org/lily/prerelease.html

51