Building Next Generation Data Warehouses
All Things Open 2016
Alex Meadows
Principal Consultant (Data and Analytics), CSpring Inc.
Business Analytics Adjunct Professor, Wake Tech
MS in Business Intelligence
Passionate about developing BI solutions that give end users easy access to the data they need to find the answers they demand (even the ones they don't know yet!)
Twitter: @OpenDataAlex
LinkedIn: alexmeadows
GitHub: OpenDataAlex
Email: [email protected]

So here's a bit about me. There are three things I'm going to ask of you, the first being: please feel free to reach out! I love talking and learning about what folks are using out in the wild, and sharing. If you want to know more or chat about any topic within data science/business intelligence, just message me via one of the above methods.
The second thing I'll ask is to be aware that some of these solutions may fix your particular problems; you'll iterate on them, find them super-awesome, and maybe you'll be able to give back and talk about your experiences at a conference or in a trade paper. Note that the business side might not realize the undertaking or the super-awesome things being done; these solutions are designed to be seamless and make users' lives easier.
The final ask before we get fully started: please don't be the pointy-haired boss! We're covering a lot of topics at a very high level, and a lot of nuances aren't being discussed (it's only a 40-minute presentation, after all). Please dig further and ask plenty of questions.
Agenda
(Brief) history of data warehousing
The challenges
Three paths
Traditional
NoSQL
Hybrid
Q&A
Please feel free to ask questions throughout the presentation!
By the end of this presentation, you will know where traditional data warehousing is failing and have a basic understanding of the technologies and methodologies that are helping address the needs of more data-savvy customer bases.
Why Data Warehouses?
Started being discussed in the 1970s
While databases existed, they were not relational/normalized
Network/hierarchical in nature
Design for query, not for data model
Reporting was hard
System/application queries were not the same as management reporting queries
The concept of data warehouses started in the 1970s and fully came into its own during the late 80s and well into the 90s. Before relational databases, data was stored based on query usage and not necessarily based on the data itself. As a result, reporting was hard. Data would either have to be merged out piecemeal or stored again based on the specific query requirements.
Bill Inmon
Data warehouses: a "subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision making process."

Into that mess, a gentleman named Bill Inmon created the initial concept of separating reporting and analysis needs away from the OLTP layer.
Bill Inmon
Top-down design
Integration of source systems
Third Normal Form
His approach, now considered a top-down design, integrates data from various OLTP systems, ties them together in a 3NF data model, and makes those data sets available for reporting. This ties in with the other approach that came along a bit after Inmon: the star schema.
Ralph Kimball
Make data accessible
Bottom-up approach
Dimensional models (star schema)
Another gentleman named Ralph Kimball took the data warehousing concept a step further. Considered a bottom-up approach, the data from the data warehouse is transformed (or conformed) to match the reporting and analysis needs of the business. While many arguments were had and many organizations went pure conformed dimensional model for their data warehouse, the approach advocated here is to use both: a 3NF back end with a star schema on top.
Traditional Model
With that said, here is a typical model/workflow. From OLTP systems, Excel files, etc., the data is moved into a 3NF model. From the 3NF model, star schemas are built on top to handle all the reporting/analytics requirements. This model has worked very well, but several problems have come out of it. While I don't have an exact number, a high number of data warehouse projects are considered failures due to these issues. What are they? Glad you asked!
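To make the star schema side of that workflow concrete, here is a minimal sketch using Python's built-in sqlite3. The table and column names are purely illustrative, not from any real warehouse: a fact table of enrollments surrounded by student and class dimensions, with a reporting query joining them.

```python
import sqlite3

# A minimal, hypothetical star schema: one fact table surrounded by
# dimension tables, as in the typical model described above.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

cur.executescript("""
CREATE TABLE dim_student (student_key INTEGER PRIMARY KEY, first_name TEXT);
CREATE TABLE dim_class   (class_key   INTEGER PRIMARY KEY, course_name TEXT);
CREATE TABLE fact_enrollment (
    student_key INTEGER REFERENCES dim_student(student_key),
    class_key   INTEGER REFERENCES dim_class(class_key),
    enrolled_on TEXT
);
""")

cur.execute("INSERT INTO dim_student VALUES (1, 'Arnold'), (2, 'Phoebe')")
cur.execute("INSERT INTO dim_class VALUES (10, 'Third Grade')")
cur.execute("INSERT INTO fact_enrollment VALUES (1, 10, '2016-08-01'), (2, 10, '2016-08-01')")

# Reporting queries join the central fact table out to its dimensions.
cur.execute("""
SELECT d.first_name
FROM fact_enrollment f
JOIN dim_student d ON d.student_key = f.student_key
JOIN dim_class   c ON c.class_key   = f.class_key
WHERE c.course_name = 'Third Grade'
ORDER BY d.first_name
""")
names = [row[0] for row in cur.fetchall()]
print(names)  # → ['Arnold', 'Phoebe']
```

The shape of the query is the point: every analytic question becomes a join from the fact table out to whichever dimensions carry the filters and labels.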
Traditional Model Challenges
How can I get my data integrated faster?

Speed of integrating data is a huge problem. Working to cleanse, conform, and process all those different sources into a single warehouse is one thing. Getting the business to agree on the logic and formulas that populate the star schema is another challenge, triggering many iterations on the integration layer.
How long to get new data sources online?
How to handle business logic changes?

It's also a challenge to bring new sources online. Because of the nature of an Inmon data warehouse, it's typical to bring the entire source over so that history is tracked across the whole source. In addition, how will logical changes be managed, both from source to warehouse and from warehouse to star schema? Without the 3NF layer, the star schema can't be reloaded without losing all the history that was collected.
What about all that unstructured data?

Then of course the other question: how do we handle the big yellow elephant (Hadoop)?
What about data scientists?

On top of all those other problems, we also have to address a whole new customer base: data scientists! They need access to data faster and more broadly than any customer base before. Yet they can't just pull data from the data warehouse, because that data is too clean to be of real use.
A New Use Case
Traditional DW doesn't meet the demands of the data science workforce
Only gets to the what happened and why.
There are distinct groups of requirements that business intelligence tries to answer. Traditional data warehousing can answer the first two: what happened and why it happened. Where it starts to fail is in the predictive analytics space, where again data scientists want data that is not cleansed and conformed but is still easy to access. Then there is prescriptive analytics: applying the predictions found and making automated decisions based on them.
Graph Source: http://www.odoscope.com/technology/prescriptive-analysis/
Traditional
Iterations On Existing Architecture
Data Vault
Hybrid between 3NF and star schema
Created by Dan Linstedt
Persistent data layer: keep everything
Bring data over as needed
Once touching an object, bring it all over
Can be hybrid between relational databases and Hadoop
Massive parallel loading, eventual consistency (with Hadoop)
Of the newer architectures, Data Vault is one of the easier ones to implement because it is a combination of both the Kimball and Inmon methods. Data is only brought over from source systems as needed, as opposed to bringing everything from the source all at once. The other really cool thing about Data Vault is that data can be offloaded into Hadoop as it ages and becomes non-volatile.
Image Source: https://pixabay.com/en/vault-strongbox-security-container-154023/
Here is our basic example that we'll be using through the rest of this presentation. It's a simple student/teacher/class model that, while not modeled 100% correctly, will provide a good example going forward.
Here is that same model in data vault form. Business entities become hub tables. Relationships between hubs get stored in many to many relationship tables called links. Off both hubs and links are dimension-like tables called satellites that store all relative information of their related hub or link. Satellites version data as changes occur.
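As a rough sketch of that hub/link/satellite shape, here is a toy version in Python's sqlite3. All table and key names are hypothetical; real Data Vault implementations typically use hashed business keys and record sources, which are omitted here for brevity.

```python
import sqlite3

# Hypothetical Data Vault shape for the student/class example:
# hubs hold business keys, a link holds the many-to-many relationship,
# and satellites hold versioned descriptive attributes.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE hub_student (student_hk INTEGER PRIMARY KEY, student_id TEXT, load_date TEXT);
CREATE TABLE hub_class   (class_hk   INTEGER PRIMARY KEY, class_id   TEXT, load_date TEXT);
CREATE TABLE link_enrollment (
    student_hk INTEGER REFERENCES hub_student(student_hk),
    class_hk   INTEGER REFERENCES hub_class(class_hk),
    load_date  TEXT
);
CREATE TABLE sat_student (
    student_hk INTEGER REFERENCES hub_student(student_hk),
    first_name TEXT,
    load_date  TEXT  -- each change gets a new row: history is kept
);
""")
cur.execute("INSERT INTO hub_student VALUES (1, 'S-493', '2016-01-01')")
cur.execute("INSERT INTO hub_class   VALUES (10, 'C-3G', '2016-01-01')")
cur.execute("INSERT INTO link_enrollment VALUES (1, 10, '2016-01-01')")
# The satellite versions the name over time instead of overwriting it.
cur.execute("INSERT INTO sat_student VALUES (1, 'Arnie',  '2016-01-01')")
cur.execute("INSERT INTO sat_student VALUES (1, 'Arnold', '2016-06-01')")

# Current view of a student: the latest satellite row for that hub key.
cur.execute("""
SELECT first_name FROM sat_student
WHERE student_hk = 1
ORDER BY load_date DESC LIMIT 1
""")
current = cur.fetchone()[0]
print(current)  # → 'Arnold'
```

The satellite insert-only pattern is what makes the layer persistent: nothing is updated in place, so every prior value remains queryable.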
Pros and Cons
PRO
Easily leverage existing infrastructure
Faster iterations between source and solution
Especially as objects are brought over
Can offload historical data into Hadoop
Learning curve: simple to pick up
CON
Table joins
Inter-dependencies between objects
Documentation not widely available (outside of commercial website and book)
1.0 documentation found at:
TDAN Article
2.0 documentation ->
Certification/training:
http://learndatavault.com/
There's not a large amount of information publicly available outside the book shown above. The original series of articles can be found on TDAN. There is also certification through the Learn Data Vault website.
Anchor Modeling
Store data how it is and how it was
Structural changes and content changes
Created by Lars Rönnbäck
Persistent data layer: keep everything, including how the data was structured
Highly normalized (6NF)
Another of the new modeling techniques is anchor modeling. In this model, data is stored in a highly normalized format that focuses both on the actual data but also the context and model over time. As the data model changes, new structures can be made in the anchor model.
There is only one open modeling tool that supports anchor modeling, and that's on the Anchor Modeling website directly. Some commercial tools do provide support, but there aren't many. That said, this is one of the many examples from the website. Each entity becomes an anchor, and data about the entity is tied to it. This model also removes duplication of data. For instance, if a teacher and a student both had the name Mary, it would only be stored once and be referenced from both anchors.
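A minimal sketch of the anchor/attribute split, again in sqlite3 with made-up names: one anchor table per entity, one table per attribute (the 6NF part), and each attribute row timestamped so "as of" queries work against history.

```python
import sqlite3

# Hypothetical anchor-model fragment: one anchor per entity and one
# table per attribute (6NF), each attribute row timestamped so both
# current and historical values can be queried.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE anchor_person (person_id INTEGER PRIMARY KEY);
CREATE TABLE attr_person_name (      -- one table per attribute
    person_id  INTEGER REFERENCES anchor_person(person_id),
    name       TEXT,
    valid_from TEXT
);
""")
cur.execute("INSERT INTO anchor_person VALUES (1)")
cur.execute("INSERT INTO attr_person_name VALUES (1, 'Mary A.', '2015-01-01')")
cur.execute("INSERT INTO attr_person_name VALUES (1, 'Mary B.', '2016-01-01')")  # name change

# An "as of" query picks the attribute version in effect at a point in time.
cur.execute("""
SELECT name FROM attr_person_name
WHERE person_id = 1 AND valid_from <= '2015-06-01'
ORDER BY valid_from DESC LIMIT 1
""")
name_2015 = cur.fetchone()[0]
print(name_2015)  # → 'Mary A.'
```

This also shows where the "joins" con comes from: reassembling a whole entity means one join per attribute table, which is why views are usually layered on top.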
Documentation:
Anchor Modeling Website
Quite a few presentations, no formal texts outside academic papers
There's not a lot of documentation out in the wild, outside of the website and a number of presentations and white papers.
Pros and Cons
PRO
Stores data and data structure temporally
Designed to be agile
Reduces storage
CON
Joins
High normalization makes for difficult usage
Views mask this complexity
Some data stores aren't able to handle this normalization level
BI tools aren't designed for this type of modeling
NoSQL
Volume, Velocity, Variety, Veracity
Linked Data Stores (Triple Stores)
Store data with semantic information
Created by Tim Berners-Lee
Removes/eliminates ambiguity in data
Standardizes data querying (SPARQL)
Can interface with all other linked data sources
Public sources referenced and integrated by calling them
Private sources work the same way, provided permissions allow
Graph data stores are a specialized type of triple store
Store data on edges
Linked data (also known as triple stores) was created by Tim Berners-Lee as an extension of his work on the web. Linked Data removes the ambiguity of typical data stores by translating the data model into a clear vocabulary. The other bonus is that there is a single, unified querying language. When it comes to other linked data sources, it's easy to join data sets together by adding a new prefix to a query. Graph data stores are a subtype of triple store in which data is stored as a network graph; think six degrees of Kevin Bacon.
Again, using the example from before.
[Diagram: example triples from the school model. Valerie and Arnold are Students (hasFirstName), a Teacher teaches the Third Grade Class, students are enrolledIn it, and Student and Teacher are each isSubClassOf Person.]

Using that model, here we have an example of triples. A triple is made up of three parts: subject, predicate, and object. For instance, a student has a first name of Arnold. Another would be that Arnold is a Student, and that Student is a subclass of Person.
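The subject/predicate/object idea can be shown with a toy in-memory triple store in plain Python. The triples and predicate names below are illustrative only; real stores index each position, but the pattern-matching idea is the same.

```python
# A toy in-memory triple store: each fact is a (subject, predicate, object)
# tuple, and queries are patterns where None acts as a wildcard (like ?x).
triples = {
    ("Arnold", "isA", "Student"),
    ("Valerie", "isA", "Student"),
    ("Student", "isSubClassOf", "Person"),
    ("Arnold", "enrolledIn", "ThirdGrade"),
    ("Valerie", "enrolledIn", "ThirdGrade"),
}

def match(s=None, p=None, o=None):
    """Return all triples matching the pattern; None matches anything."""
    return {(ts, tp, to) for ts, tp, to in triples
            if s in (None, ts) and p in (None, tp) and o in (None, to)}

# "Which subjects are enrolled in ThirdGrade?"
students = sorted(ts for ts, _, _ in match(p="enrolledIn", o="ThirdGrade"))
print(students)  # → ['Arnold', 'Valerie']
```

A SPARQL engine is essentially this wildcard matching, generalized across multiple patterns whose shared variables are joined together.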
RDF/XML
RDF/XML is one way of serializing triples. There are other formats, but RDF/XML is one of the more common transport mechanisms since most tools can read XML. The same kinds of triples mentioned on the previous slide can be seen here.
SPARQL

PREFIX school: <...>
SELECT ?s ?name
WHERE {
  ?s school:isEnrolledIn ?class .
  ?s school:hasFirstName ?name .
  ?class school:hasCourseName "Third Grade" .
}

?s                  ?name
school:Student#493  Arnold
school:Student#494  Carlos
school:Student#495  Phoebe
school:Student#496  Ralphie
school:Student#497  Wanda
SPARQL is similar enough to SQL to be familiar, but different enough to require some tutorials ;) Here we are looking at our school data (as noted by the school prefix) and retrieving the first names of all students in Third Grade. The WHERE clause has three triple patterns that bring the result set back; each pattern is terminated by a period.
There are a few books on linked data, but these are two of the better ones. The Manning publication is a great overview of Linked Data, while the Semantic Web book focuses on building web ontologies (the vocabularies we discussed earlier).
Pros and Cons
PRO
Clearly defined business logic
Fast iterations on ontology
Single, unified querying language
Can join datasets via PREFIX with no additional work
CON
BI tools still playing catch-up
Tool ecosystem is small
But Awesome!
Few organizations have adopted (but this is changing)
Other NoSQL
Columnar
Designed with queries in mind
Some are tuned for star schema performance
Document Stores
Designed with data/queries in mind
Key-value stores
Object Stores
Data stored as objects
Merger of database and programming
Others
New types are still being created
Watch out for flavors of the month
There are many other types of NoSQL databases, but not enough time to cover them here. They can still be useful in augmenting traditional data warehouses.
Hybrid
Data Virtualization
Integration is logical, not physical
Doesn't matter what type of data is being integrated*
NoSQL
Relational
Allows for more traditionally designed tools to access more modern data stores
Allows for easier, more iterative work flows
Business logic lives in the integration layer
Data virtualization is a great way to bridge the gap between NoSQL and SQL based tools. It allows traditional business intelligence tools to access data stores that they wouldn't normally be able to reach. The cool thing about virtualization tools is that business logic lives in this integration layer, allowing faster changes to the process that builds the data endpoint.
Logical Layering
With all the various sources, the virtualization tool will have one or many translation layers. These translation layers interpret the data between the source system and SQL. Between the initial translation layer and the final virtual data marts are any number of rules layers. These rules layers act in a similar manner to ETL (data integration), but they live inside the virtualization tool. From there, data marts can be created virtually as well. Changes can be made quickly at any layer and immediately propagate to the layers above it.
With data virtualization, traditional tools can continue to access data marts, both virtual and real. In addition, tools that can access the source systems can go either into the virtualized layers or access the systems directly, depending on the use case/need.
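The layering idea can be sketched in a few lines of Python: a "virtual data mart" is just a function that unifies heterogeneous sources at query time rather than load time. The sources and names below are invented for illustration; a relational table stands in for the warehouse, and a list of dicts stands in for a document store.

```python
import sqlite3

# Sketch of the virtualization idea: a logical layer presenting one
# SQL-style result set over two very different sources (a relational
# table and a document-like store). All names here are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE warehouse_students (name TEXT, grade TEXT)")
conn.execute("INSERT INTO warehouse_students VALUES ('Arnold', 'Third Grade')")

nosql_students = [  # pretend this came from a document store
    {"name": "Phoebe", "grade": "Third Grade"},
]

def virtual_students():
    """A 'virtual data mart': unify both sources at query time, not load time."""
    rows = [(r[0], r[1]) for r in conn.execute("SELECT name, grade FROM warehouse_students")]
    rows += [(d["name"], d["grade"]) for d in nosql_students]
    return sorted(rows)

rows = virtual_students()
print(rows)  # → [('Arnold', 'Third Grade'), ('Phoebe', 'Third Grade')]
```

Because the unification logic lives in this one function (the "rules layer"), changing it takes effect immediately for every consumer, with no reload of physical data, which is the appeal and also the source of the compute cost noted below.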
Going through my slide deck, I realized that I forgot to mention Mr. van der Lans's book, really the only book in the space that's tool-agnostic and discusses the concept of a logical data warehouse. Definitely check it out!
Pros and Cons
PRO
Easily leverage existing infrastructure
Faster iterations between source and solution
Integration between NoSQL and RDBMS simplified
Can keep data warehouse and augment as needed
Uses SQL
Self-documenting
CON
Joining can be intensive
Large memory, compute requirements
Heavy loads on source systems
Can offload to virtualization shards
Textual Disambiguation
Take unstructured data and interpret context
Store disambiguated data in RDBMS (9th normal form)
Augment traditional data warehouse with new unstructured data.
The final method we'll be discussing is some of Inmon's latest work: textual disambiguation. At its core, the methodology takes unstructured data, interprets its language components, and defines its textual context. From there, the data can be stored in an even more normalized form than we discussed with anchor modeling, augmenting the traditional warehouse with a veritable cornucopia of new information that can be queried using SQL.
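A toy sketch of the core idea, with an invented vocabulary and sample text: raw text is tokenized, terms are resolved against a context-specific vocabulary, and the result comes out as structured rows ready for a relational store. Real textual ETL does far more (taxonomies, proximity, multiple languages), but the in/out shape is the point.

```python
# Toy textual-disambiguation sketch: tokenize raw text, resolve acronyms
# against a context-specific vocabulary, and emit structured rows.
# The vocabulary and sample text are made up for illustration.
VOCAB = {"hr": "heart rate", "bp": "blood pressure"}

def disambiguate(doc_id, text):
    rows = []
    for position, raw in enumerate(text.lower().split()):
        token = raw.strip(".,")
        if token in VOCAB:  # keep only tokens the vocabulary can resolve
            rows.append((doc_id, position, token, VOCAB[token]))
    return rows

rows = disambiguate(1, "Patient HR elevated, BP normal.")
print(rows)  # → [(1, 1, 'hr', 'heart rate'), (1, 3, 'bp', 'blood pressure')]
```

This also makes the cons above concrete: everything hinges on how complete and context-correct the vocabulary is, which is exactly where slang, acronyms, and multiple languages cause trouble.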
Image Source: http://www.datatransformed.com.au/textual%20etl.htm#.WArA3RIrLRZ
Pros and Cons
PRO
Easily leverage existing infrastructure
Closes the gap between unstructured data and traditional data
Clear understanding and interpretation of unstructured data
CON
Full language context required
Slang, acronyms, etc. can be a problem
Time to delivery varies
Multiple language barrier
Defining context
Non-agile: hard to break data down into smaller components
Conclusion
Business Intelligence has to move forward
Remove legacy tools that haven't evolved past reporting
Tweak platform to support agile, incremental change
Businesses are already demanding more:
Faster turnaround
More access
Deeper insights
Is your team ready to make the move?
Image Source: https://pixabay.com/p-1014060/?no_redirect