Integration of Cloudera Navigator Enables Data Governance with StreamAnalytix WHITE PAPER

Enabling Data Governance with StreamAnalytix v2 · such as security teams, compliance groups, business users, etc., establishing data integrity in the system. Integrating Cloudera


Summary

As organizations increasingly rely on data as a core asset for decision making, they face pressure to manage growing volumes of data, monitor access to sensitive assets, and enforce policies consistently across the enterprise. Data governance therefore plays a key role in formulating effective data strategies.

Data governance covers everything that ensures your data is well managed, from securing it to making it accessible. A fundamental requirement of any data governance strategy is the ability to trace data back to its origin. However, with complex operations taking place across multiple batch and real-time data flows, tracing data origins is increasingly difficult.

Understanding data lifecycles and being able to visualize complete data flows is therefore critical. It enables self-service access and gives full visibility to multiple users, such as security teams, compliance groups, and business users, which helps establish data integrity in the system.

Integrating Cloudera Navigator with StreamAnalytix provides a unified view of all incoming data and allows access to sensitive assets to be monitored, including the ability to identify the data entity lineage of all complex streaming and batch data pipelines.

Current data governance challenges

Disparate data
Petabytes of data in various formats and structures are flowing into the Hadoop ecosystem. This data can be structured, such as CSV, or semi-structured, such as XML or JSON (in the form of logs, image files, and more). The data is also often transformed and stored in other formats along the way, and multiple data flows may each transform the data into their own custom format before storing it.


StreamAnalytix Cloudera Navigator Lineage integration offers a solution

Integrating StreamAnalytix with Cloudera Navigator Lineage addresses these data governance challenges. The integration combines a visual UI and end-to-end big data analytics functionality with a complete view of the entire data lifecycle. It allows the following:

• Extraction of technical, managed and custom metadata for all the data pipelines running in the cluster on top of Cloudera Navigator features

• Data cataloging and tagging of data entities in the pipelines with relevant metadata properties

• A global view of the cluster data flow, aligned with the data pipelines implemented in StreamAnalytix
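The tagging idea in the list above can be illustrated with a small in-memory model. This is only a sketch under stated assumptions, not the StreamAnalytix or Cloudera Navigator implementation: the `DataEntity` class, the `tag_pipeline` and `find_by_tag` helpers, the operator names, and the `data_classification` property are all hypothetical. Only the tag names alerttest and hive_lineage come from this paper.

```python
from dataclasses import dataclass, field


@dataclass
class DataEntity:
    """Hypothetical stand-in for one pipeline operator in a metadata catalog."""
    name: str
    technical: dict = field(default_factory=dict)  # system-derived properties
    managed: dict = field(default_factory=dict)    # centrally defined properties
    custom: dict = field(default_factory=dict)     # free-form key/value pairs
    tags: set = field(default_factory=set)


def tag_pipeline(pipeline_name, operator_names):
    """Create one entity per operator and tag each with the pipeline name."""
    return [DataEntity(name=op, tags={pipeline_name}) for op in operator_names]


def find_by_tag(catalog, tag):
    """Return the names of all entities carrying the tag, in catalog order."""
    return [entity.name for entity in catalog if tag in entity.tags]


# Operator names below are invented for illustration only.
catalog = (
    tag_pipeline("alerttest", ["rabbitmq_channel", "alert_processor"])
    + tag_pipeline("hive_lineage", ["enricher", "hive_emitter"])
)

# Attach a custom metadata property to a single entity.
catalog[2].custom["data_classification"] = "sensitive"

print(find_by_tag(catalog, "hive_lineage"))
```

Keeping the three metadata categories in separate dictionaries mirrors the distinction the list draws between technical, managed, and custom metadata. A real catalog would persist these properties rather than hold them in memory.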

From the Cloudera Navigator console, the system gives you an at-a-glance view of all the cluster entities, with drill-down options into details such as multiple metadata views. Figure 1 provides an architectural overview of the integration.

Data provenance
Controlling access to sensitive data as it flows through workflows is essential. For example, a sensitive data field such as a social security number requires stringent access control.

Visibility
As data flows into the system, visibility is required into its movement, frequency, quality thresholds, applicable rules, and more. Similarly, after the data flows have been processed, the enterprise needs to be able to review and analyze the effect of those data flows on the system, as well as the overall life cycle of the data.

Figure 1: Architectural overview of StreamAnalytix and Cloudera integration
[Figure labels: Visual big data analytics platform (Channels, Processors, Analytics, Emitters; QL, ML, MLlib, H2O); Hadoop distribution; Fetching metadata from multiple StreamAnalytix data pipelines; Navigator API; Cloudera Navigator Metadata Server; Cloudera Navigator Console]

Steps to enable the Cloudera Data Navigator within StreamAnalytix

To enable the data navigator for a data pipeline, do the following:

1. Start with a blank canvas and build a pipeline. StreamAnalytix offers a visual pipeline designer that includes a blank canvas, a wide range of operators, and drag-and-drop functionality to stitch the pipeline together visually. Every data pipeline running in the system is associated with complex data operations on the input data, including data ingestion, transformations, analytics, machine learning, actions and alerts, visualization, and data persistence.


Figure 2: StreamAnalytix visual pipeline designer and operators

Save the pipeline. This is where you can enable the navigator option.

Figure 3: Configure pipelines and operators


2. After you submit the pipeline, you can go to the Cloudera Navigator UI to search all the pipelines under the tags category. All the operators in the data pipelines, which we will refer to as ‘data entities,’ have been tagged with the pipeline name and can be searched from the Navigator UI.

Figure 4: Cloudera Navigator UI

The screenshot displays all the data entities for the data pipelines with the tags alerttest and hive_lineage.

3. The next step is viewing the lineage associated with the pipelines and operators. If you click any of the operations, you can view its lineage and the complete life cycle of the data flow associated with it.
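Tracing an entity back to its origin, as a lineage view does, amounts to walking a directed graph upstream. The following is a minimal sketch of that idea, not how Cloudera Navigator computes lineage. The `UPSTREAM` mapping and the `trace_to_sources` function are hypothetical, and the wiring between RabbitMQ, enricher, cricket_input, parq, and filter is assumed purely for illustration.

```python
from collections import deque

# Hypothetical lineage edges: each key is an entity and each value lists the
# entities it reads from directly. This wiring is an assumption.
UPSTREAM = {
    "enricher": ["RabbitMQ"],
    "cricket_input": ["enricher"],
    "parq": ["enricher"],
    "filter": ["cricket_input"],
}


def trace_to_sources(entity, upstream):
    """Walk upstream breadth-first from an entity.

    Returns a pair: the ancestors in the order they were reached, and the
    origin entities (those with no recorded upstream entity).
    """
    visited = {entity}
    ancestors = []
    sources = []
    queue = deque([entity])
    while queue:
        current = queue.popleft()
        parents = upstream.get(current, [])
        if not parents:
            # Nothing feeds this entity, so it is an origin of the data.
            sources.append(current)
        for parent in parents:
            if parent not in visited:  # guards against cycles and repeats
                visited.add(parent)
                ancestors.append(parent)
                queue.append(parent)
    return ancestors, sources


ancestors, sources = trace_to_sources("filter", UPSTREAM)
print("ancestors:", ancestors)
print("sources:", sources)
```

The visited set keeps the walk finite even if the recorded lineage contains a loop, for example when a pipeline writes back to a file it also reads.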

Lineage of enricher data entity

The following screenshot is an example of the lineage of the enricher data entity and the complete life cycle of the data flow associated with it. In this example, data is emitted to multiple data files on HDFS (cricket_input and parq in this case), and other data pipelines consume that data and apply filter transformations to it.


Figure 5: Lineage of enricher data entity and the data flow lifecycle

The entity lineage view allows you to trace the entity back to the source and precisely evaluate any transformations. You can also view metadata information for each entity. For example, if you click the Details option tab, you can view the following details about RabbitMQ:

Figure 6: View of the RabbitMQ metadata


StreamAnalytix is an enterprise grade, visual, big data analytics platform for unified streaming and batch data processing based on best-of-breed open source technologies. It supports the end-to-end functionality of data ingestion, enrichment, machine learning, action triggers, and visualization. StreamAnalytix offers an intuitive drag-and-drop visual interface to build and operationalize big data applications five to ten times faster, across industries, data formats, and use cases.

Visit www.streamanalytix.com or write to us at [email protected]

© 2018 Impetus Technologies, Inc. All rights reserved. Product and company names mentioned herein may be trademarks of their respective companies. July 2018

About StreamAnalytix

StreamAnalytix is an enterprise-grade visual platform for all your batch and stream processing and analytics needs.

Ingest, blend, and process high-velocity big data streams as they arrive, run machine learning models, visualize results on real-time dashboards, and train and refresh models in real time or in batch mode.

Build and operationalize big data applications five to ten times faster using a visual drag-and-drop interface, an exhaustive set of pre-built operators, full application lifecycle support, and one-click options for on-premises and cloud deployments.

With support for multiple big data engines and built-in extensibility, StreamAnalytix gives you full flexibility and control to work with the technology stack of your choice.