Big Data Integration
Marcelo Litovsky, National Solutions Architect – Information Builders
Why are people buying Apache Hadoop?
• Load, Transform, Syndicate – Use the power of Apache Hadoop to pre-process large amounts of data at low cost, then transform it into what is needed in the warehouse
• Archive/Offload – Do not discard any data. Use Apache Hadoop to archive or offload useful data. Whether driven by government regulations or by business value, the information remains readily available in Apache Hadoop.
Data Warehousing – Paradigm Shift from ETL to ELT
• Load data from external sources (social media, machine data…)
• Conform datasets to enterprise standards
• Integrate the disparate data sources to extract value from the incoming data
• Relate streaming, unstructured, and social data with transactional and traditional operational data sources
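The ELT shift described above can be sketched with plain SQL: land the raw external data first, then conform and relate it inside the data store. Below is a minimal sketch using Python's built-in sqlite3 as a stand-in for the warehouse; all table and column names are invented for illustration.

```python
import sqlite3

# ELT sketch: load raw external data first, transform inside the store.
# Table and column names are illustrative, not from the deck.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Extract + Load: land raw social-media events with no upfront conformance.
cur.execute("CREATE TABLE raw_events (user_id TEXT, ts TEXT, payload TEXT)")
cur.executemany(
    "INSERT INTO raw_events VALUES (?, ?, ?)",
    [("u1", "2016-01-02", "clicked"), ("u1", "2016-01-03", "shared"),
     ("u2", "2016-01-02", "clicked")],
)

# Transform: conform to an enterprise standard *after* loading,
# relating event data to a traditional operational table.
cur.execute("CREATE TABLE customers (user_id TEXT, region TEXT)")
cur.executemany("INSERT INTO customers VALUES (?, ?)",
                [("u1", "EMEA"), ("u2", "AMER")])
cur.execute("""
    CREATE TABLE events_conformed AS
    SELECT c.region, r.user_id, COUNT(*) AS event_count
    FROM raw_events r JOIN customers c ON r.user_id = c.user_id
    GROUP BY c.region, r.user_id
""")
for row in cur.execute("SELECT * FROM events_conformed ORDER BY user_id"):
    print(row)
```

The point of the pattern is that nothing is discarded at load time; the conformance logic runs where the data already lives.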
External Data Integration
The Evolution of Integration
Hand-Coded Integration → ETL → Messaging Bus → EAI/ESB → Apache Hadoop-Based Integration
Traditional in Transition to Modern
Traditional (fewer use cases): OLTP, OLAP, data warehouses, data marts, point-to-point integration, EII
Modern (more use cases): Apache Hadoop, IoT, streaming, virtual DW, data lake
We Have Some Pretty Simple Problems…
According to a May 2015 Gartner survey:
• 26% are deploying Apache Hadoop, 11% plan to within 12 months, and 7% within 24 months
• 49% cite finding business value as their biggest problem
• 57% cite the Apache Hadoop skills gap as their biggest problem
To summarize:
• Companies are investing in Apache Hadoop, but are not sure why
• Companies are investing in Apache Hadoop, but do not know how to use it
Information Builders Big Data Architecture – Use Case for Apache Hadoop
Sqoop, Flume…
Avro, JSON…
Traditional applications and data stores
iWay Big Data Integrator – Simplified, modern, native Apache Hadoop integration
Big Data on Apache Hadoop – Any distribution, any data
BI & Analytics – WebFOCUS BI and analytics platform
Self-Service for Everyone – WebFOCUS access, ETL, metadata
Data Ingestion – Enterprise Data Hub
ETL / ELT
Predictive Analytics – RStat
Business Intelligence – WebFOCUS
Low-cost storage of large data volumes
iWay Big Data Integrator – 100% Run "in" Apache Hadoop Architecture
• Simplified, easy-to-use interface to integrate in Apache Hadoop
• Native Apache Hadoop script generation
• Marshals Apache Hadoop resources and standards
• Takes advantage of performance and resource negotiation
• Includes sophisticated process management and governance
iWay Big Data Integrator – Key Features
Eclipse-based User Friendly Interface
Data ingestion using abstraction above Sqoop®, Flume®, Spark®, and proprietary streaming channel content
Transformation & Mapping
Publish to non-Apache Hadoop data sources
Auto-generated scripts/jobs based on configuration
iWay Big Data Integrator
Notable Features in 2016
• Data profiling, data preparation, master data management
• Analyze patterns, data types, sparsity, and cardinality of Apache Hadoop datasets
• Generation of data cleansing rules based on pattern analysis
• Auto-generation of remediation tickets for non-cleansable records
• Ability to transpose (wide to deep, deep to wide) data in parallel
• Missing-value imputation, data scaling, data categorization
• Streaming and in-process predictive model scoring (PMML and native code)
• "Natively" match and merge
Data Governance
iWay Big Data Integrator
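The profiling checks listed above (value patterns, sparsity, cardinality) can be sketched in a few lines. This is a plain-Python illustration of the idea, not BDI's actual implementation; the function name and sample data are invented.

```python
import re
from collections import Counter

def profile_column(values):
    """Profile one column: sparsity, cardinality, and value patterns."""
    total = len(values)
    missing = sum(1 for v in values if v in (None, ""))
    present = [v for v in values if v not in (None, "")]
    # Generalize each value to a pattern: digits -> 9, letters -> A.
    # Rare patterns are candidates for cleansing rules or remediation tickets.
    patterns = Counter(
        re.sub(r"[A-Za-z]", "A", re.sub(r"\d", "9", str(v))) for v in present
    )
    return {
        "sparsity": missing / total,          # share of missing values
        "cardinality": len(set(present)),     # distinct non-missing values
        "patterns": dict(patterns),           # pattern -> frequency
    }

zips = ["10001", "94107", None, "9410", "10001"]
print(profile_column(zips))
```

Here the lone "9999" pattern flags a likely truncated ZIP code, which is exactly the kind of record a generated cleansing rule or remediation ticket would target.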
Notable Features in 2016
• Full capture of data lineage for BDI ingestion, transformation, data preparation, and cleansing
• Integration with Cloudera Navigator to give a holistic data-lineage view across non-BDI sources
• User interface to interactively display information
Data Lineage
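Lineage capture for an ingest → transform → publish pipeline boils down to recording an edge per operation and walking the graph backwards. The toy model below illustrates the concept; it is not Cloudera Navigator's API, and all dataset names are invented.

```python
# Minimal lineage sketch: each edge records (source, operation, target).
lineage = []

def record(source, operation, target):
    """Log one pipeline step and return its output dataset name."""
    lineage.append((source, operation, target))
    return target

raw = record("mysql.orders", "sqoop-ingest", "hdfs:/raw/orders.avro")
clean = record(raw, "cleanse", "hive.orders_clean")
record(clean, "publish", "dw.orders")

def upstream(target):
    """Walk the graph backwards to show where a dataset came from."""
    for src, op, tgt in lineage:
        if tgt == target:
            yield from upstream(src)
            yield (src, op, tgt)

for edge in upstream("dw.orders"):
    print(edge)
```

An interactive lineage view like the one described above is essentially this traversal rendered graphically.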
iWay Big Data Integrator – Data Ingestion
Graphical Sqoop and Flume configuration
Sqoop
• Replace
• Change data capture
• Native "roll your own"
Flume
• Flume editor with validation
• Graphical wizard in the works
• Templates
Proprietary "channel" ingestion – iWay Service Manager
• Legacy formats (streaming channel, MUMPS, etc.)
Structured data standardized on the Avro format
Late-binding data "wrangler" for unstructured content
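The "graphical configuration → native script" flow above amounts to generating a standard Sqoop command line from a settings dictionary. The sketch below uses real Sqoop flags (`--connect`, `--as-avrodatafile`, `--incremental lastmodified`), but the config dict, function name, and connection details are invented for illustration.

```python
def generate_sqoop_import(cfg):
    """Render a Sqoop import command from a configuration dictionary."""
    parts = [
        "sqoop import",
        f"--connect {cfg['jdbc_url']}",
        f"--table {cfg['table']}",
        f"--target-dir {cfg['target_dir']}",
        "--as-avrodatafile",          # standardize structured data on Avro
    ]
    if cfg.get("incremental_col"):    # change-data-capture style import
        parts += [
            "--incremental lastmodified",
            f"--check-column {cfg['incremental_col']}",
        ]
    return " \\\n  ".join(parts)

print(generate_sqoop_import({
    "jdbc_url": "jdbc:mysql://dbhost/sales",
    "table": "orders",
    "target_dir": "/data/raw/orders",
    "incremental_col": "updated_at",
}))
```

Dropping `incremental_col` from the config would produce a full "replace" import instead, mirroring the Sqoop options listed above.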
iWay Big Data Integrator – Data Ingestion
Graphical Sqoop and Flume configuration
iWay Big Data Integrator – Transformation
Drag-and-drop data transformation designer
• Join (inner, left, right, full outer)
• Group by
• Aggregate functions as defined by the cluster
Any data on the cluster can be transformed, provided it is described in the Hive metastore
Logic preview
Transformations performed 100% in Apache Hadoop
Kerberos compliant
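The designer's join, group-by, and aggregate transforms boil down to SQL of the kind shown below, which in BDI would execute as HiveQL on the cluster. Here Python's built-in sqlite3 stands in for Hive, and the tables are invented for illustration.

```python
import sqlite3

# Hive-style join + group-by transform, run against an in-memory stand-in.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INT, cust_id INT, amount REAL);
    CREATE TABLE customers (cust_id INT, name TEXT);
    INSERT INTO orders VALUES (1, 10, 99.0), (2, 10, 1.0), (3, 11, 50.0);
    INSERT INTO customers VALUES (10, 'Acme'), (11, 'Zenith');
""")
rows = conn.execute("""
    SELECT c.name, COUNT(*) AS n_orders, SUM(o.amount) AS total
    FROM orders o
    LEFT JOIN customers c ON o.cust_id = c.cust_id   -- join type is selectable
    GROUP BY c.name                                   -- group by
    ORDER BY total DESC
""").fetchall()
for r in rows:
    print(r)
```

The "described in the Hive metastore" requirement maps to the CREATE TABLE statements here: the transform engine can only join what has a registered schema.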
iWay Big Data Integrator
Mapping
• Relational targets on a remote RDBMS
• XML definitions
• Custom-defined on the design canvas
Publish
• Publish to any JDBC-compliant MPP or RDBMS
• Staging table or direct-to-target load
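The "staging table or direct-to-target" choice above is a classic publish pattern: a staged load writes to a side table and swaps it in atomically, so readers never see a half-loaded target. The sketch below uses sqlite3 as a stand-in for a JDBC-style relational target; all names are invented.

```python
import sqlite3

# Publish via a staging table, then swap in one transaction.
target = sqlite3.connect(":memory:")
target.execute("CREATE TABLE dim_customer (cust_id INT, name TEXT)")

new_rows = [(10, "Acme"), (11, "Zenith")]

# Stage: load outside the live table, so long loads don't block readers.
target.execute("CREATE TABLE dim_customer_stg (cust_id INT, name TEXT)")
target.executemany("INSERT INTO dim_customer_stg VALUES (?, ?)", new_rows)

with target:  # atomic publish: either all rows appear, or none do
    target.execute("DELETE FROM dim_customer")
    target.execute("INSERT INTO dim_customer SELECT * FROM dim_customer_stg")
    target.execute("DROP TABLE dim_customer_stg")

print(target.execute("SELECT COUNT(*) FROM dim_customer").fetchone()[0])
```

A direct-to-target load skips the staging step entirely, which is simpler but exposes readers to partially loaded data if the job fails mid-way.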
iWay Big Data Integrator – Transformation
Drag and drop data transformation designer
iWay Big Data Integrator – Transformation: Underlying Script Generation View
iWay Big Data Integrator – Job Execution
Multiple job executions in a defined order
Real-World Strategies for Deploying Big Data
Data Quality and MDM – iWay Big Data Integrator: Edge-Node Deployment of DQ Services
Real-World Strategies for Deploying Big Data
Data Quality and MDM – iWay Big Data Integrator
Native Spark Interface to DQ
Real-World Strategies for Deploying Big Data
Spark Integration – iWay Big Data Integrator
Full integration of the Apache® Spark stack:
• Spark Streaming
• Spark SQL
• SparkR
• MLlib
Fully automated project setup, dependency management, and Scala version detection
Code, build, test, deploy – all from within Big Data Integrator
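The streaming-with-scoring combination above (Spark Streaming feeding an MLlib or PMML model) follows the micro-batch pattern: collect events into small batches and score each batch as it arrives. The plain-Python illustration below shows only the shape of that pattern; it does not use the actual Spark APIs, and the toy model and thresholds are invented.

```python
def score(event):
    """Stand-in predictive model: flag large transactions."""
    return 1.0 if event["amount"] > 100 else 0.0

def micro_batches(stream, batch_size=2):
    """Group a stream of events into fixed-size micro-batches."""
    batch = []
    for event in stream:
        batch.append(event)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # flush the final, possibly short, batch

events = [{"amount": 50}, {"amount": 500}, {"amount": 120}]
for batch in micro_batches(events):
    print([(e["amount"], score(e)) for e in batch])
```

In the real integration, the batching is handled by Spark Streaming on the cluster and the scoring function is a trained model rather than a hand-written rule.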
Real-World Strategies for Deploying Big Data
Spark Integration – iWay Big Data Integrator
Predictive Model Development and Deployment
iWay Big Data Integrator – Cloudera Certified
• Easy to use interface for deploying and integrating data on Apache Hadoop distributions of all flavors, ensuring portability.
• Ingests, transforms, and cleanses traditional RDBMS, mobile, social media, sensor, and other data in batch or streams, using native Apache Hadoop facilities.
• 100% YARN compliant, taking advantage of native Apache Hadoop performance and resource negotiation.
• Simplifies the use of Apache Hadoop ecosystem technologies such as MapReduce, Sqoop, Flume, Hive®, and Spark®.
iWay Big Data Integrator is Cloudera Certified!