
A Data Integration Case Study - Avoid Creating a “Franken-Beast”


© COPYRIGHT 2015 MARKLOGIC CORPORATION. ALL RIGHTS RESERVED.

Data Integration and Polyglot Persistence

Damon Feldman, Ph.D.

Solutions Director – MarkLogic

Twitter: @damonfeldman

Integration Done Right – Avoiding the Franken-Beast

SLIDE: 2

Agenda

Review a specific data integration project

– The names have been changed to protect the innocent

Why did it become complex?

How does this inform integration generally?

MarkLogic’s vision and features to address this problem.

SLIDE: 3

Our Data Integration Project

Simple need to allow people to apply for mortgages

– Accept binary Excel submissions containing structured data, review and approve.

Became complex

We’ll walk through the various issues and considerations

Finally, we’ll talk about how to simplify these systems.

SLIDE: 4

Poly-what? Polyglot Persistence

Polyglot - “someone who speaks or writes several languages”

“The term polyglot is redefined for big data as [using] several core database technologies [needed] no matter how narrow your approach to big data.”

– Hurwitz et al: Big Data for Dummies

Rows & columns; documents; binaries; RDF triples; text

Note that MarkLogic handles multiple data forms, within one technology, via universal indexing

SLIDE: 5

The Requirement

Mortgage application system

– Input: Excel worksheet submissions

– Business Entities are extracted

– Workflow and approval

– Binaries and XML documents are both persisted

– This is a NoSQL system, because it is focused on Business Entities

Our customer chose to bifurcate their data

– MarkLogic for Documents (Business Entities)

– Alfresco for binaries

– Input Excel, PDF notices, some metadata

SLIDE: 6

Polyglot, or “Poly-Not?”

“MarkLogic is the best XML/JSON document store in the world – we get that!

But binaries should go into a ‘content system’… a CMS or DAM.

Let’s use Alfresco to store the Excel and some generated PDF notices, and put the XML in MarkLogic.

That way we use a best-of-breed system for each type of data!”

SLIDE: 7

Envisioned Architecture

– Store the input

– Extract structured data

– Store the Business Entity XML

SLIDE: 8

Coordinating The Parts

Export super-jumbo loans from last week in 1GB chunks

– Include binaries and XML Business Entities

Data is bifurcated

– MarkLogic knows dates and super-jumbo thresholds per zip code

– Alfresco has the binaries

• Now What?

• Who controls paging to hit 1GB per file?

• What knows how to get a record and then make a REST call to Alfresco?

SLIDE: 9

We Went with Two Passes

Export all for the week

Chunk it in a second pass with Python

Two phases, so two operations

Bonus Questions: How do you monitor the Python output for errors? What if it fails? What if the consumer finds data issues? Is there traceability from the Business Entity query to the temp data, through the Python script to the bundled output?
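The second pass can be pictured with a minimal Python sketch. It is not the project's actual script: `ExportItem`, `Chunk`, `chunk_export` and `manifest` are hypothetical names, and the greedy packing rule and the "keep an entity with its binaries" rule are assumptions made for illustration. The manifest is one possible answer to the traceability question, since it records which output file each Business Entity landed in.

```python
from dataclasses import dataclass, field

ONE_GB = 1024 ** 3  # target chunk size taken from the export requirement


@dataclass
class ExportItem:
    entity_id: str  # Business Entity the item belongs to (hypothetical field)
    name: str       # file name of the XML entity or binary attachment
    size: int       # size in bytes


@dataclass
class Chunk:
    items: list = field(default_factory=list)
    total: int = 0


def chunk_export(items, limit=ONE_GB):
    """Greedily pack items into chunks of at most `limit` bytes.

    Items sharing an entity_id stay together, so an entity and its binaries
    never straddle two files. An entity larger than `limit` gets a chunk of
    its own instead of being split.
    """
    # Group items by entity, preserving first-seen order.
    groups = {}
    for item in items:
        groups.setdefault(item.entity_id, []).append(item)

    chunks = [Chunk()]
    for group in groups.values():
        group_size = sum(i.size for i in group)
        current = chunks[-1]
        # Start a new chunk only if the current one already holds something
        # and adding this entity would exceed the limit.
        if current.items and current.total + group_size > limit:
            current = Chunk()
            chunks.append(current)
        current.items.extend(group)
        current.total += group_size
    return [c for c in chunks if c.items]


def manifest(chunks):
    """Map each entity_id to the index of the chunk that holds it."""
    return {
        item.entity_id: index
        for index, chunk in enumerate(chunks)
        for item in chunk.items
    }
```

Writing the manifest next to the chunks gives operations a file to check when a consumer reports a data issue, but it is one more artifact to monitor, which is the point the slide is making.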

SLIDE: 10

The Franken-Beast

The Franken-Beast

OPERATIONAL

SLIDE: 12

Think DevOps

#DevOps => Simplify, Monitor, Think of the impacts

SLIDE: 13

Systemic Simplicity

Your architecture diagram is a Chimera

– And we want it that way

– A couple more boxes on your architecture diagram may mean a couple dozen boxes in your deployment diagram

Humans create simplified views like architecture diagrams exactly because we are not well suited to deal with this level of complexity

SLIDE: 14

Building and Development is One Aspect

Multiple stores required coordination and extra processing

Architecture and development time were affected

Other aspects of the program were also slowed down

SLIDE: 15

Operational Components

Alfresco ships as a unit

… but deploys as a set of technologies

…and needs reliable storage

Beneath the Hood

SLIDE: 16

HA/DR

HA means copies of all persisted data

Many stores, many copies

Many copies, many configs

SLIDE: 17

HA/DR

DR means copies of entire systems

All with replication

SLIDE: 18

Is HA/DR Possible?

Consistent data requires transactional control

Having two (or more!) persistent components makes this difficult or impossible

Synchronizing data, restoring data, recovering to a point in time? All require a notion of transactional consistency.

This was a huge time- and brain-drain

With MarkLogic it is transactional, fast, correct, and fully tested under load.
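Why two persistent components make consistency hard can be seen in a toy Python model. This is a sketch under stated assumptions, not MarkLogic or Alfresco code: `Store`, `save_bifurcated`, `save_single` and `orphans` are hypothetical names, the stores are in-memory dictionaries, and an outage is simulated with a flag.

```python
class Store:
    """Toy in-memory store; `fail_next` simulates an outage on the next write."""

    def __init__(self):
        self.data = {}
        self.fail_next = False

    def put(self, key, value):
        if self.fail_next:
            self.fail_next = False
            raise IOError("store unavailable")
        self.data[key] = value


def save_bifurcated(binaries, entities, key, blob, xml):
    """Two independent writes with no shared transaction."""
    binaries.put(key, blob)
    # If this second write raises, the binary above is already persisted.
    entities.put(key, xml)


def save_single(store, key, blob, xml):
    """One store: stage both writes, then commit them together or not at all."""
    staged = dict(store.data)
    staged[key + "/binary"] = blob
    staged[key + "/entity"] = xml
    if store.fail_next:
        store.fail_next = False
        raise IOError("store unavailable")  # nothing was committed
    store.data = staged


def orphans(binaries, entities):
    """Keys that have a binary but no matching Business Entity."""
    return set(binaries.data) - set(entities.data)
```

In the bifurcated case a failure between the two writes leaves an orphaned binary that some other process has to find and repair; backups and point-in-time restores of the two stores have the same problem at a larger scale.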

SLIDE: 19

Monitoring

SLIDE: 20

CM, CI/CD

Production is reflected in Dev, QA, Stage, etc.

Entire process should be automated, repeatable and constant

Clustered FS Setup

MarkLogic Code

Alfresco Config

Oracle DDL

Batch Process Code

Python Script

Master Config

Create Directories

Set up initial data or config

Production

QA

DEV

SLIDE: 21

How does this story end?

We are now working to remove much of the complexity

Design is for binaries inside MarkLogic

To reduce outages, operational complexity

And improve performance

SLIDE: 22

MarkLogic Approach

WHAT ABOUT OTHER DATA SOURCES?

SLIDE: 24

What about other data types?

This use case is not specific to binaries and XML documents

Load and index data “as is” from varied sources

Binary

RDF

RDB

Deliver Data in Unified Form

SLIDE: 25

Same complexity applies to other data types

Structured data + Semantic data

Structured data + text data

Semantic + text

[ . . . ]

Structured + Semantic + Binary, with mixed text

What would our Mortgage example look like with RDF Triples?

SLIDE: 26

What is Semantic Triple Data?

• AKA RDF. AKA Linked Open Data.

dbr:Kevin_Bacon foaf:knows dbr:Harvey_Keitel

dbr:Kevin_Bacon dbo:spouse dbr:Kyra_Sedgwick

dbo:spouse rdfs:subPropertyOf dbo:knows
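The third triple on this slide is a rule rather than a fact: anything related by dbo:spouse is also related by dbo:knows. A minimal Python sketch of that RDFS sub-property rule follows. It uses plain tuples instead of an RDF library, and `infer_sub_properties` is a hypothetical helper name; a real triple store would apply this rule through its own inference engine.

```python
SUB_PROPERTY_OF = "rdfs:subPropertyOf"

# The three triples from the slide, as (subject, predicate, object) tuples.
triples = {
    ("dbr:Kevin_Bacon", "foaf:knows", "dbr:Harvey_Keitel"),
    ("dbr:Kevin_Bacon", "dbo:spouse", "dbr:Kyra_Sedgwick"),
    ("dbo:spouse", SUB_PROPERTY_OF, "dbo:knows"),
}


def infer_sub_properties(facts):
    """Apply the rule: (s p o) and (p subPropertyOf q) implies (s q o).

    Repeats until no new triples appear, so chains of sub-properties
    are followed as well.
    """
    closed = set(facts)
    while True:
        supers = {(p, q) for (p, rel, q) in closed if rel == SUB_PROPERTY_OF}
        new = {
            (s, q, o)
            for (s, p, o) in closed
            for (p2, q) in supers
            if p == p2
        } - closed
        if not new:
            return closed
        closed |= new


inferred = infer_sub_properties(triples) - triples
```

Here `inferred` holds the one triple that was never stated explicitly: dbr:Kevin_Bacon dbo:knows dbr:Kyra_Sedgwick.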

SLIDE: 27

Back to Polyglot Persistence

Documents: Natural Business Entities stored as documents

Triples: Relationships among Business Entities as RDF Triples

Applicant bob-jones-03

Application MTG-0042

CreditHistory EQFX-9928

Property MTG-0042

Loan MTG-0042

bob-jones-03 :appliesOn MTG-0042

bob-jones-03 hasCredit EQFX-9928

… includesDebt…

… hasCollateral…

+Inference: What real-estate exposures does this applicant have?
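The exposure question amounts to a graph walk over the relationship triples. The Python sketch below is illustrative only: the first two triples come from the slide, while every other triple, and identifiers such as `property-A` and `debt-1`, are hypothetical, because the slide elides the real subjects and objects. `real_estate_exposures` is likewise a hypothetical helper name.

```python
from collections import deque

triples = [
    ("bob-jones-03", "appliesOn", "MTG-0042"),   # from the slide
    ("bob-jones-03", "hasCredit", "EQFX-9928"),  # from the slide
    # Hypothetical triples: the slide does not show the real ones.
    ("MTG-0042", "hasCollateral", "property-A"),
    ("EQFX-9928", "includesDebt", "debt-1"),
    ("debt-1", "hasCollateral", "property-B"),
    ("someone-else", "appliesOn", "MTG-9999"),
    ("MTG-9999", "hasCollateral", "property-C"),
]


def real_estate_exposures(facts, applicant):
    """Breadth-first walk from the applicant, collecting hasCollateral objects."""
    edges = {}
    for s, p, o in facts:
        edges.setdefault(s, []).append((p, o))

    seen = {applicant}
    queue = deque([applicant])
    exposures = set()
    while queue:
        node = queue.popleft()
        for predicate, target in edges.get(node, []):
            if predicate == "hasCollateral":
                exposures.add(target)
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return exposures
```

With this sample data the walk from bob-jones-03 reaches property-A through the application and property-B through the credit history, and never touches the other applicant's property-C.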

SLIDE: 28

Naïve Polyglot Architecture

What’s wrong with this picture?

Triple Store

Extract Triples Ingest Process Store JSON

SLIDE: 29

Customer Use Case: Documents + Triples (RDF)

BPS

Gloss

Impact

Vendors

Client Event-Based Feeds

Faceted Search

Rest APIs

Common Services

Integration Layer (Camel)

SLIDE: 30

MarkLogic Vision

Polyglot Persistence

– Many types of data, not many sub-systems for data

– One simplified component

XML, JSON, SQL views, unstructured (full-text search), Semantic data (RDF Triples, SPARQL), Binary data (large binaries, streaming)

Enterprise NoSQL

– All transactional. All HA. All with DR. All query-able with one API. All scalable. All in one backup. All monitored together.

Ingest as-is

Data Services out of the box

SLIDE: 31

In Summary

SLIDE: 32

Additional Resources

Narrative form of this content:

http://www.marklogic.com/blog/polyglot-persistence-done-right/

Fowler’s early Polyglot Persistence note: http://martinfowler.com/bliki/PolyglotPersistence.html

Structured Document data + Triple/RDF Data presentation: http://www.marklogic.com/resources/data-modeling-in-practice-documents-and-triples/


@damonfeldman


Questions?

END

SLIDE: 34

Deliver the right content, to the right user, in the right format, in real time

Load and index data “as is” from ever-changing sources

MarkLogic

PDF

RDF

RDB

SLIDE: 35

It didn’t have to be that way

Workflow

Persistence (Business Entities + Binaries!)

(Highly-Available)

DR

Monitoring

MarkLogic