© COPYRIGHT 2015 MARKLOGIC CORPORATION. ALL RIGHTS RESERVED.
Data Integration and Polyglot Persistence
Damon Feldman, Ph.D.
Solutions Director – MarkLogic
Twitter: @damonfeldman
Integration Done Right – Avoiding the Franken-Beast
Agenda
Review a specific data integration project
– The names have been changed to protect the innocent
Why did it become complex?
How does this inform integration generally?
MarkLogic’s vision and features to address this problem.
Our Data Integration Project
Simple need to allow people to apply for mortgages
– Accept binary Excel submissions containing structured data, review and approve.
Became complex
We’ll walk through the various issues and considerations
Finally, we’ll talk about how to simplify these systems.
Poly-what?
Polyglot Persistence
Polyglot - “someone who speaks or writes several languages”
“The term polyglot is redefined for big data as [using] several core database technologies [needed] no matter how narrow your approach to big data.”
– Hurwitz et al: Big Data for Dummies
Rows & columns; documents; binaries; RDF triples; text
Note that MarkLogic handles multiple data forms, within one technology, via universal indexing
The Requirement
Mortgage application system
– Input: Excel worksheet submissions
– Business Entities are extracted
– Workflow and approval
– Binaries and XML documents are both persisted
– This is a NoSQL system, because it is focused on Business Entities
Our customer chose to bifurcate their data
– MarkLogic for Documents (Business Entities)
– Alfresco for binaries
  – Input Excel, PDF notices, some metadata
Polyglot, or “Poly-Not?”
“MarkLogic is the best XML/JSON document store in the world – we get that!
But binaries should go into a ‘content system’: a CMS or DAM.
Let’s use Alfresco to store the Excel and some generated PDF notices, and put the XML in MarkLogic.
That way we use a best-of-breed system for each type of data!”
Envisioned Architecture
– Store the input
– Extract structured data
– Store the Business Entity XML
Coordinating The Parts
Export super-jumbo loans from last week in 1GB chunks
– Include binaries and XML Business Entities
Data is bifurcated
– MarkLogic knows dates and super-jumbo thresholds per zip code
– Alfresco has the binaries
Now What?
• Who controls paging to hit 1GB per file?
• What knows how to get a record and then make a REST call to Alfresco?
We Went with Two Passes
Export all for the week
Chunk it in a second pass with Python
Two phases, so two operations
Bonus Questions:
– How do you monitor the Python output for errors?
– What if it fails?
– What if the consumer finds data issues?
– Is there traceability from the Business Entity query to the temp data, through the Python script to the bundled output?
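To make the second pass concrete, here is a minimal sketch of what such a chunking step could look like. It is not the project's actual script: the `ExportItem` type, its field names, and the greedy grouping strategy are all assumptions for illustration. Only the 1 GB target comes from the slides.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List

# 1 GB target per output file (the only figure taken from the slides).
CHUNK_LIMIT_BYTES = 1024 ** 3


@dataclass
class ExportItem:
    """Hypothetical record produced by the first (export) pass."""
    loan_id: str      # assumed identifier for the Business Entity
    size_bytes: int   # assumed combined size of the XML plus its binaries


def chunk_items(items: Iterable[ExportItem],
                limit: int = CHUNK_LIMIT_BYTES) -> Iterator[List[ExportItem]]:
    """Greedily group items, in order, into chunks of at most `limit` bytes.

    An item larger than `limit` still gets a chunk of its own, so that
    chunk alone may exceed the limit.
    """
    chunk: List[ExportItem] = []
    used = 0
    for item in items:
        # Close the current chunk if adding this item would overflow it.
        if chunk and used + item.size_bytes > limit:
            yield chunk
            chunk, used = [], 0
        chunk.append(item)
        used += item.size_bytes
    if chunk:
        yield chunk


if __name__ == "__main__":
    mb = 1024 ** 2
    week = [ExportItem("loan-a", 600 * mb),
            ExportItem("loan-b", 500 * mb),
            ExportItem("loan-c", 300 * mb)]
    for number, group in enumerate(chunk_items(week), start=1):
        print(number, [item.loan_id for item in group])
```

Even in this toy form, the sketch has no monitoring, retry, or traceability, which is exactly what the bonus questions are pointing at.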
The Franken-Beast
Think DevOps
#DevOps => Simplify, Monitor, Think of the impacts
Systemic Simplicity
Your architecture diagram is a Chimera
– And we want it that way
– A couple more boxes on your architecture diagram may mean a couple dozen boxes in your deployment diagram
Humans create simplified views like architecture diagrams exactly because we are not well suited to deal with this level of complexity
Building and Development is One Aspect
Multiple stores required coordination and extra processing
Architecture and development time were affected
Other aspects of the program were also slowed down
Operational Components
Alfresco ships as a unit
… but deploys as a set of technologies
…and needs reliable storage
Beneath the Hood
HA/DR
HA means copies of all persisted data
Many stores, many copies
Many copies, many configs
HA/DR
DR means copies of entire systems
All with replication
Is HA/DR Possible?
Consistent data requires transactional control
Having two (or more!) persistent components makes this difficult or impossible
Synchronizing data, restoring data, recovering to a point in time? All require a notion of transactional consistency.
This was a huge time- and brain-drain
With MarkLogic it is transactional, fast, correct, and fully tested under load.
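The consistency problem with two persistent components can be illustrated with a small sketch. The `Store` class below is an in-memory stand-in, not the MarkLogic or Alfresco API, and the function and key names are hypothetical. It shows that without a shared transaction the application must write its own compensating logic, which in a real system can itself fail.

```python
class Store:
    """In-memory stand-in for one persistent component (an assumption,
    not a real MarkLogic or Alfresco client)."""

    def __init__(self, name, fail_on=None):
        self.name = name
        self.data = {}
        self.fail_on = set(fail_on or [])  # keys that simulate a write failure

    def put(self, key, value):
        if key in self.fail_on:
            raise IOError(f"{self.name}: write failed for {key}")
        self.data[key] = value

    def delete(self, key):
        self.data.pop(key, None)


def save_application(doc_store, binary_store, key, entity_xml, excel_bytes):
    """Write to both stores with no shared transaction.

    If the second write fails, undo the first by hand. In a real system
    this compensating delete could fail too, leaving an orphaned document.
    """
    doc_store.put(key, entity_xml)
    try:
        binary_store.put(key, excel_bytes)
    except IOError:
        doc_store.delete(key)  # hand-written compensation
        return False
    return True


if __name__ == "__main__":
    docs = Store("documents")
    binaries = Store("binaries", fail_on={"MTG-0042"})
    ok = save_application(docs, binaries, "MTG-0042", "<application/>", b"xlsx")
    print(ok, docs.data, binaries.data)
```

Backup, restore, and point-in-time recovery face the same issue at a larger scale: each store has its own notion of "now", and nothing guarantees the two agree.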
CM, CI/CD
Production is reflected in Dev, QA, Stage, etc.
Entire process should be automated, repeatable and constant
[Diagram labels: MarkLogic Code; Alfresco Config; Oracle DDL; Batch Process Code; Python Script; Master Config; Clustered FS Setup; Create Directories; Set up initial data or config. Environments: Production; QA; DEV]
How does this story end?
We are now working to remove much of the complexity
Design is for binaries inside MarkLogic
To reduce outages, operational complexity
And improve performance
What about other data types?
This use case is not specific to binaries and XML documents
Load and index data “as is” from varied sources
[Diagram labels: Binary; RDF; RDB]
Deliver Data in Unified Form
Same complexity applies to other data types
Structured data + Semantic data
Structured data + text data
Semantic + text
[ . . . ]
Structured + Semantic + Binary, with mixed text
What would our Mortgage example look like with RDF Triples?
What is Semantic Triple Data?
• AKA RDF. AKA Linked Open Data.
dbr:Kevin_Bacon foaf:knows dbr:Harvey_Keitel
dbr:Kevin_Bacon dbo:spouse dbr:Kyra_Sedgwick
dbo:spouse rdfs:subPropertyOf dbo:knows
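The third triple is what makes the data "semantic": it lets a reasoner derive facts that were never stated. As a rough sketch, the triples on this slide can be held as plain Python tuples and the RDFS subproperty rule applied by hand. A real triple store does this internally; the function name here is made up for illustration.

```python
# The three triples from the slide, as (subject, predicate, object) tuples.
TRIPLES = {
    ("dbr:Kevin_Bacon", "foaf:knows", "dbr:Harvey_Keitel"),
    ("dbr:Kevin_Bacon", "dbo:spouse", "dbr:Kyra_Sedgwick"),
    ("dbo:spouse", "rdfs:subPropertyOf", "dbo:knows"),
}


def infer_subproperties(triples):
    """Apply the RDFS subproperty rule until nothing new is derived:
    if (s, p, o) holds and (p, rdfs:subPropertyOf, q) holds, then (s, q, o) holds.
    """
    inferred = set(triples)
    changed = True
    while changed:
        changed = False
        sub_props = {(s, o) for s, p, o in inferred
                     if p == "rdfs:subPropertyOf"}
        for s, p, o in list(inferred):
            for sub, sup in sub_props:
                if p == sub and (s, sup, o) not in inferred:
                    inferred.add((s, sup, o))
                    changed = True
    return inferred


if __name__ == "__main__":
    for triple in sorted(infer_subproperties(TRIPLES) - TRIPLES):
        print("inferred:", triple)
```

Because `dbo:spouse` is declared a subproperty of `dbo:knows`, the spouse triple yields one additional `dbo:knows` triple between the same two people.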
Back to Polyglot Persistence
Documents: Natural Business Entities stored as documents
Triples: Relationships among Business Entities as RDF Triples
Applicant: bob-jones-03
Application: MTG-0042
CreditHistory: EQFX-9928
Property: MTG-0042
Loan: MTG-0042
bob-jones-03 :appliesOn MTG-0042
bob-jones-03 hasCredit EQFX-9928
… includesDebt …
… hasCollateral …
+ Inference: What real-estate exposures does this applicant have?
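A question like the one above is answered by walking relationships between entities. The sketch below is only an illustration: the slide elides the endpoints of `hasCollateral`, so the triple linking the application to the property is an assumption, as are the `applicant:` / `application:` / `property:` prefixes and the function names.

```python
TRIPLES = [
    # From the slide:
    ("applicant:bob-jones-03", "appliesOn", "application:MTG-0042"),
    ("applicant:bob-jones-03", "hasCredit", "credit:EQFX-9928"),
    # ASSUMPTION: the slide shows "... hasCollateral ..." with both ends
    # elided; this subject and object are invented for illustration only.
    ("application:MTG-0042", "hasCollateral", "property:MTG-0042"),
]


def objects(subject, predicate):
    """Return every object linked to `subject` by `predicate`."""
    return [o for s, p, o in TRIPLES if s == subject and p == predicate]


def real_estate_exposures(applicant):
    """Follow applicant -> application -> collateral property."""
    exposures = []
    for application in objects(applicant, "appliesOn"):
        exposures.extend(objects(application, "hasCollateral"))
    return exposures


if __name__ == "__main__":
    print(real_estate_exposures("applicant:bob-jones-03"))
```

The identifiers returned by the traversal are the keys of the Business Entity documents, which is why keeping documents and triples in one store avoids a cross-system join.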
Naïve Polyglot Architecture
What’s wrong with this picture?
[Diagram labels: Ingest Process; Extract Triples; Triple Store; Store JSON]
Customer Use Case: Documents + Triples (RDF)
[Diagram labels: BPS; Gloss; Impact; Vendors; Client Event-Based Feeds; Faceted Search; REST APIs; Common Services; Integration Layer (Camel)]
MarkLogic Vision
Polyglot Persistence
– Many types of data, not many sub-systems for data
– One simplified component
XML, JSON, SQL views, unstructured (full-text search), Semantic data (RDF Triples, SPARQL), Binary data (large binaries, streaming)
Enterprise NoSQL
– All transactional. All HA. All with DR. All query-able with one API. All scalable. All in one backup. All monitored together.
Ingest as-is
Data Services out of the box
Additional Resources
Narrative form of this content: http://www.marklogic.com/blog/polyglot-persistence-done-right/
Fowler’s early Polyglot Persistence note: http://martinfowler.com/bliki/PolyglotPersistence.html
Structured Document data + Triple/RDF Data presentation: http://www.marklogic.com/resources/data-modeling-in-practice-documents-and-triples/
@damonfeldman
Questions?
Deliver the right content, to the right user, in the right format, in real time
Load and index data “as is” from ever-changing sources
[Diagram labels: MarkLogic; RDF; RDB]