Solving Data Discovery in the Enterprise: Building an Enterprise Data Catalog


Contents

Overview
The Business Challenges of Data Discovery in the Enterprise
Why MDM is not the Answer
Introducing the Enterprise Data Catalog
Getting Technical: Building an Enterprise Data Catalog
    Catalog Portal
    Catalog Mobile
    Catalog Store
    Data Source Publishing API
    Data Source Discovery API
    Data Source Notifications API
    Data Source Search API
    Data Governance API
    Metadata Connectors
    Data Collaboration System and APIs
Putting all Together
Summary


Overview

Data discovery, understanding and governance are becoming key elements of data architectures in the enterprise. The explosion in the volume of data produced and consumed by organizations has greatly increased the complexity of discovering and understanding data efficiently. Despite their relevance, data discovery and governance tend to be overlooked in enterprise big data solutions, which focus more on attention-grabbing areas such as analytics and machine learning. However, more and more organizations are realizing that data discovery is essential to enabling analytics, visualizations and general data consumption in the enterprise. The road to enabling data discovery in the enterprise is full of challenges, as we will explore in the next section.

The Business Challenges of Data Discovery in the Enterprise

As data grows in the enterprise, so do the initiatives to gather intelligence about that data. Efforts around big data, analytics and visualizations have increased sharply during the last few years, and data discovery has become a foundational block of any enterprise data initiative. However, in order to enable efficient data discovery models, enterprises need to address some of the following challenges:

Increasing Data Volume: The increasing volume of data produced in the enterprise has drastically degraded the ability of information workers to quickly find and consume different data sources from enterprise applications.

Lack of Metadata Management: Even when data can be found, information workers struggle to understand the specific semantics of enterprise data sources. This is due to the lack of metadata management solutions implemented in enterprise environments.

Different Data Access Interfaces: One of the biggest challenges for accessing data in the enterprise is the proliferation of heterogeneous data access protocols and APIs introduced by new line of business solutions. Organizations struggle with the lack of consistent protocols and models to access data from different business applications.

Lack of Established Data Stewardship: Complementing the previous point, the lack of mainstream data stewardship models makes it challenging for applications to access enterprise data sources.

Limited Collaboration Interfaces: Top-down data stewardship is just one mechanism for establishing contextual information about enterprise data sources. A lot of the knowledge about business data lives with the business users who actively interact with it. However, enterprises rarely implement the collaboration interfaces that would capture the knowledge of those domain experts and add contextual information to corporate data sources.

Why MDM is not the Answer

Master data management (MDM) platforms have traditionally been seen as a mechanism to keep a record of data sources in an enterprise environment. However, over the years MDM solutions have become heavy, complicated and too limited to address some of the mainstream scenarios of data discovery in the enterprise. Additionally, MDM solutions struggle to integrate quickly with modern SaaS, cloud and mobile platforms, which are becoming a significant source of data in the enterprise.

As a result of the limitations of MDM platforms, organizations have started to adopt lighter, simpler and more modern data discovery models that are optimized for the modern technology ecosystem. Among the different models used to enable data discovery in the enterprise, there is one we have seen succeed in organizations of all sizes: the enterprise data catalog.

Introducing the Enterprise Data Catalog

A data catalog is a simple but effective and robust model to enable data discovery in the enterprise. From a functional standpoint, an enterprise data catalog should provide a global repository that registers data sources from different line of business systems, as well as the metadata and contextual information associated with them.

Conceptually, a data catalog borrows elements from popular repositories such as mobile app stores or ecommerce marketplaces. In that sense, an enterprise data catalog goes beyond the classification and organization of enterprise data sources and enables capabilities such as search, collaboration, alerting and other features that can be combined to provide a fresh, modern experience for discovering data sources in an enterprise environment.

From the functional standpoint, an enterprise data catalog should enable some of the following capabilities:

Data Source Discovery: An enterprise data catalog should allow information workers to browse and discover the different business data sources linked to line of business systems. Additionally, the catalog should allow simple registration of new data sources.

Data Source Publishing: Complementing the previous point, an enterprise data catalog should allow information workers to register new data sources using simple interfaces, both visually and programmatically.

Metadata Management: Enterprise data catalog solutions should allow data stewards to provide adequate metadata related to business data sources. Simple metadata such as field descriptions or other contextual information can be highly relevant to correctly understanding business data sources.

Tagging and Classification: An enterprise data catalog should allow users to classify the different data sources using tags or simple hierarchical categories.

Search: Finding data using simple keyword and facet search should be one of the key capabilities of an enterprise data catalog solution.

Testability: An enterprise data catalog should allow users to test and validate the different data sources exposed in the catalog.

Collaboration: An enterprise data catalog should facilitate collaboration between information workers working on specific data sources.

Governance: Access control, SLAs and exception management are just some of the key governance and data stewardship capabilities that should be enabled by enterprise data catalogs.

Alerts: Throughout the lifetime of a data source, information workers might want to receive alerts about relevant events such as schema changes or performance degradations. An enterprise data catalog should provide a simple interface for power users to configure alert conditions on specific data sources.

Getting Technical: Building an Enterprise Data Catalog

As explained in the previous sections, enterprise data catalogs have become one of the most popular solutions to enable data discovery in the enterprise. In the last couple of years, we have implemented enterprise data catalogs for dozens of organizations. As a result, there are a few reference architectures that you can implement with today's technology. The following diagram illustrates a reference architecture model for an enterprise data catalog solution.


The previous diagram highlights the following functional components:

Catalog Portal

The catalog portal is the main user interface to register, browse and discover data sources in an enterprise environment. From the architecture standpoint, the catalog portal will interact with the different APIs of the solution to perform operations on data sources. The portal could be implemented using any web development platform such as Node.js with Express, ASP.NET or Django (Python).


Catalog Mobile

Similar to the portal interface, the catalog mobile interface lets users interact with data sources from smartphones or tablets. This component of the platform provides simple, mobile-first functionality to enable data discovery from mobile devices.

Catalog Store

The catalog store is the main data repository for maintaining the metadata associated with different data sources. Considering the arbitrary nature of the information related to business data sources, we have typically preferred to leverage NoSQL databases such as MongoDB or Couchbase when implementing this type of solution.
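
As a rough illustration of why a schema-flexible store fits here, the sketch below models two catalog entries as plain Python dictionaries, the shape a document database would persist. All field names (name, source_system, tags, fields, extra) are assumptions made for this example, not a schema prescribed by the reference architecture.

```python
import json

# Hypothetical catalog entries; every key name is an illustrative assumption.
crm_accounts = {
    "name": "crm_accounts",
    "source_system": "CRM",
    "tags": ["sales", "customers"],
    "fields": [
        {"name": "account_id", "description": "Unique account identifier"},
        {"name": "region", "description": "Sales region code"},
    ],
}

# The second entry carries attributes the first one lacks. A document store
# accepts both shapes side by side without a schema migration.
web_logs = {
    "name": "web_logs",
    "source_system": "Web analytics",
    "tags": ["clickstream"],
    "fields": [{"name": "url", "description": "Requested page"}],
    "extra": {"retention_days": 90, "format": "json-lines"},
}


def to_document(entry: dict) -> str:
    """Serialize an entry the way it might be stored as a JSON document."""
    return json.dumps(entry, sort_keys=True)


documents = [to_document(entry) for entry in (crm_accounts, web_logs)]
```

The point of the sketch is only that entries for different data sources can carry different attributes; a relational design would need nullable columns or side tables to express the same thing.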

Data Source Publishing API

The data source publishing API provides the interfaces for publishing and managing business data sources from different applications, including the catalog portal. This API should handle all aspects of data source management such as categorization, tagging and metadata management.
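
The responsibilities listed above can be sketched as a small in-memory service. This is a minimal illustration, not the interface of any particular product: the class name PublishingService and its methods (publish, add_tags, set_metadata, get) are hypothetical, and a real implementation would sit behind an HTTP API and persist to the catalog store.

```python
from dataclasses import dataclass, field


@dataclass
class DataSource:
    name: str
    category: str
    tags: set = field(default_factory=set)
    metadata: dict = field(default_factory=dict)


class PublishingService:
    """In-memory stand-in for the publishing API (hypothetical interface)."""

    def __init__(self):
        self._sources = {}

    def publish(self, name, category, tags=(), metadata=None):
        # Reject duplicates so each data source is registered exactly once.
        if name in self._sources:
            raise ValueError(f"data source already registered: {name}")
        source = DataSource(name, category, set(tags), dict(metadata or {}))
        self._sources[name] = source
        return source

    def add_tags(self, name, *tags):
        self._sources[name].tags.update(tags)

    def set_metadata(self, name, key, value):
        self._sources[name].metadata[key] = value

    def get(self, name):
        return self._sources[name]


service = PublishingService()
service.publish("crm_accounts", category="Sales", tags=["customers"])
service.add_tags("crm_accounts", "emea")
service.set_metadata("crm_accounts", "owner", "sales-ops")
```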

Data Source Discovery API

The data source discovery API provides the interfaces required to dynamically query and discover data sources registered on the platform. Typically, we have leveraged industry standards such as OData or GraphQL as the main protocol for these interfaces.
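
To make the OData option concrete, the sketch below builds a query URL using standard OData system query options ($filter, $top, $orderby). The service root, the DataSources entity set and the category and tags properties are all assumptions for illustration; the actual entity model depends on how the catalog is implemented.

```python
from urllib.parse import quote, urlencode

# Hypothetical service root and entity set name.
SERVICE_ROOT = "https://catalog.example.com/odata"


def _literal(value: str) -> str:
    """Quote an OData string literal; embedded single quotes are doubled."""
    return "'" + value.replace("'", "''") + "'"


def discovery_url(category=None, tag=None, top=20):
    """Build an OData query URL that filters registered data sources."""
    filters = []
    if category:
        filters.append(f"category eq {_literal(category)}")
    if tag:
        # OData v4 lambda operator: matches if any tag equals the value.
        filters.append(f"tags/any(t: t eq {_literal(tag)})")
    params = {"$top": str(top), "$orderby": "name"}
    if filters:
        params["$filter"] = " and ".join(filters)
    # quote_via=quote encodes spaces as %20; safe="$" keeps option names readable.
    query = urlencode(params, quote_via=quote, safe="$")
    return f"{SERVICE_ROOT}/DataSources?{query}"


url = discovery_url(category="Sales", tag="finance", top=5)
```

A GraphQL interface would express the same filter as fields and arguments in a query document instead of URL options; the choice mostly depends on what the consuming applications already support.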

Data Source Notifications API

The data source notifications API provides the mechanisms for third-party applications to dynamically subscribe to changes on specific data sources. The API should be able to deliver notifications via traditional channels such as email, SMS or push notifications, as well as via programmatic interfaces.
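
The subscription model described above can be sketched as a simple publish/subscribe hub. The NotificationHub class and its method names are hypothetical; in this sketch a delivery channel is just a Python callable, where a real system would plug in email, SMS, push or webhook senders.

```python
from collections import defaultdict
from typing import Callable

Handler = Callable[[str, str], None]  # (data_source, event) -> None


class NotificationHub:
    """Hypothetical in-memory hub mapping data sources to subscriber handlers."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, data_source: str, handler: Handler) -> None:
        self._subscribers[data_source].append(handler)

    def notify(self, data_source: str, event: str) -> int:
        """Deliver an event to every subscriber; return the delivery count."""
        handlers = self._subscribers.get(data_source, [])
        for handler in handlers:
            handler(data_source, event)
        return len(handlers)


inbox = []  # stands in for an email or webhook channel
hub = NotificationHub()
hub.subscribe("crm_accounts", lambda src, evt: inbox.append(f"{src}: {evt}"))
delivered = hub.notify("crm_accounts", "schema changed")
```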

Data Source Search API

The data source search API is responsible for providing traditional search capabilities over the enterprise data sources registered in the catalog. The search capabilities should focus on the data source metadata and not on the data itself. Search techniques like facet searching and proximity algorithms are very relevant for this API. Typically, we rely on search platforms like Elastic to implement this capability.
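
As one possible shape for a metadata search request, the sketch below builds a request body in Elasticsearch query DSL form: a keyword match over metadata fields, an optional tag filter, and a terms aggregation that returns the facet counts. The indexed field names (name, description, fields.description, tags) are assumptions, and the body is only constructed here, not sent to a cluster.

```python
def build_search_body(keywords, tag=None, facet_field="tags", size=10):
    """Return a search request body in Elasticsearch query DSL form."""
    query = {
        "bool": {
            "must": [
                {
                    "multi_match": {
                        "query": keywords,
                        # Metadata fields only; these names are assumed.
                        # "name^2" boosts matches on the data source name.
                        "fields": ["name^2", "description", "fields.description"],
                    }
                }
            ]
        }
    }
    if tag:
        # Filters narrow the results without affecting relevance scoring.
        query["bool"]["filter"] = [{"term": {"tags": tag}}]
    return {
        "size": size,
        "query": query,
        # A terms aggregation returns per-value counts, i.e. the facets.
        "aggs": {"facets": {"terms": {"field": facet_field}}},
    }


body = build_search_body("customer accounts", tag="sales")
```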

Data Governance API

The data governance API is responsible for enabling data governance and stewardship capabilities such as access control, data privacy, data ownership and SLA monitoring. These APIs can be integrated with existing security and access control platforms in the enterprise.
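
A minimal sketch of the access control and ownership checks mentioned above follows. The Policy fields and the can_read rules are assumptions chosen for illustration; in practice these decisions would usually be delegated to the enterprise's existing identity and access control platform.

```python
from dataclasses import dataclass, field


@dataclass
class Policy:
    """Hypothetical governance policy attached to one data source."""
    owner: str
    reader_roles: set = field(default_factory=set)
    contains_pii: bool = False


def can_read(policy: Policy, user: str, roles: set, pii_cleared: bool = False) -> bool:
    """Decide whether a user may read the data source governed by `policy`."""
    if user == policy.owner:
        return True  # the owner always has access
    if policy.contains_pii and not pii_cleared:
        return False  # privacy rule overrides role-based access
    return bool(policy.reader_roles & roles)


policy = Policy(owner="alice", reader_roles={"analyst"}, contains_pii=True)
owner_ok = can_read(policy, "alice", set())
analyst_blocked = can_read(policy, "bob", {"analyst"})
analyst_cleared = can_read(policy, "bob", {"analyst"}, pii_cleared=True)
```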

Metadata Connectors

The connectors are responsible for abstracting the integration with the different line of business systems hosting the data sources that will be discovered via the catalog. From the functional standpoint, the connectors should provide the authentication and data querying capabilities required to register a data source in the enterprise data catalog.
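
One way to express this abstraction is a small connector interface with one implementation per system. In the sketch below, the MetadataConnector interface and its connect and describe methods are hypothetical; SQLite stands in for a line of business system only because it ships with Python and needs no credentials.

```python
import sqlite3
from abc import ABC, abstractmethod


class MetadataConnector(ABC):
    """Hypothetical interface each line of business connector implements."""

    @abstractmethod
    def connect(self) -> None:
        """Authenticate and open a session with the source system."""

    @abstractmethod
    def describe(self, table: str) -> list:
        """Return field-level metadata for one data source."""


class SQLiteConnector(MetadataConnector):
    def __init__(self, path: str):
        self.path = path
        self.conn = None

    def connect(self) -> None:
        # SQLite needs no credentials; a real connector would authenticate here.
        self.conn = sqlite3.connect(self.path)

    def describe(self, table: str) -> list:
        # PRAGMA statements cannot take bound parameters, so `table` must
        # come from a trusted source. Each row is
        # (cid, name, type, notnull, dflt_value, pk).
        rows = self.conn.execute(f"PRAGMA table_info({table})").fetchall()
        return [{"name": row[1], "type": row[2]} for row in rows]


connector = SQLiteConnector(":memory:")
connector.connect()
connector.conn.execute("CREATE TABLE accounts (account_id INTEGER, region TEXT)")
schema = connector.describe("accounts")
```

The list returned by describe is the kind of field-level metadata that a publishing call could then store in the catalog.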

Data Collaboration System and APIs

The data collaboration system and APIs provide the interfaces for teams to collaborate around specific data sources stored in the data catalog. This interface can be the main gateway for capturing contextual information related to data sources, such as comments and documents.

Putting all Together

As simple as the previous architecture model seems, it contains the fundamental building blocks to enable robust data discovery scenarios in enterprise environments. This architecture model is based on our experience implementing dozens of similar solutions and can be easily extended with other relevant aspects such as data quality rules and data access optimization.

Summary

Data discovery is one of the most important elements of enterprise data solutions and one that is frequently ignored. This paper has provided a reference architecture to enable data discovery in enterprise environments. The reference architecture covers relevant aspects of data discovery solutions such as metadata management, governance, alerting and discovery. The reference architecture described in this paper has been implemented dozens of times using commodity technology stacks available to any organization in the world.