Big Data Integration
Marcelo Litovsky, National Solutions Architect – Information Builders
Why are people buying Apache Hadoop?
• Load, Transform, Syndicate – Use the power of Apache Hadoop to pre-process large amounts of data at low cost, then transform it into what is needed in the warehouse
• Archive/Offload – Do not discard any data. Use Apache Hadoop to archive or offload useful data. Whether driven by government regulations or by business value, the information remains readily available in Apache Hadoop.
Data Warehousing – Paradigm Shift from ETL to ELT
• Load data from external sources (social media, machine data…)
• Conform datasets to enterprise standards
• Integrate the disparate data sources to extract value from the incoming data
• Relate streaming, unstructured, and social data with transactional and traditional operational data sources
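The ELT shift described above can be sketched with plain SQL: land the raw external data first, then conform and relate it inside the data store. Below is a minimal sketch using Python's built-in sqlite3 as a stand-in for the warehouse; all table and column names are invented for illustration.

```python
import sqlite3

# ELT sketch: load raw external data first, transform inside the store.
# Table and column names are illustrative, not from the deck.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Extract + Load: land raw social-media events with no upfront conformance.
cur.execute("CREATE TABLE raw_events (user_id TEXT, ts TEXT, payload TEXT)")
cur.executemany(
    "INSERT INTO raw_events VALUES (?, ?, ?)",
    [("u1", "2016-01-02", "clicked"), ("u1", "2016-01-03", "shared"),
     ("u2", "2016-01-02", "clicked")],
)

# Transform: conform to an enterprise standard *after* loading,
# relating event data to a traditional operational table.
cur.execute("CREATE TABLE customers (user_id TEXT, region TEXT)")
cur.executemany("INSERT INTO customers VALUES (?, ?)",
                [("u1", "EMEA"), ("u2", "AMER")])
cur.execute("""
    CREATE TABLE events_conformed AS
    SELECT c.region, r.user_id, COUNT(*) AS event_count
    FROM raw_events r JOIN customers c ON r.user_id = c.user_id
    GROUP BY c.region, r.user_id
""")
for row in cur.execute("SELECT * FROM events_conformed ORDER BY user_id"):
    print(row)
```

The point of the pattern is that nothing is discarded at load time; the conformance logic runs where the data already lives.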
External Data Integration
The Evolution of Integration
Hand-Coded Integration → ETL → Messaging Bus → EAI/ESB → Apache Hadoop-Based Integration
Traditional in Transition to Modern
Traditional (fewer use cases): OLTP, OLAP, data warehouses, data marts, point-to-point integration, EII
Modern (more use cases): Apache Hadoop, IoT, streaming, virtual DW, data lake
We Have Some Pretty Simple Problems…
According to a May 2015 Gartner survey:
• 26% are deploying Apache Hadoop, 11% plan to within 12 months, and 7% within 24 months
• 49% cite finding business value as their biggest problem
• 57% cite the Apache Hadoop skills gap as their biggest problem
To summarize:
• Companies are investing in Apache Hadoop, but are not sure why
• Companies are investing in Apache Hadoop, but do not know how to use it
Information Builders Big Data Architecture – Use Case for Apache Hadoop
Sqoop, Flume…
Avro, JSON…
Traditional applications and data stores
iWay Big Data Integrator – Simplified, modern, native Apache Hadoop integration
Big Data on Apache Hadoop – Any distribution, any data
BI & Analytics – WebFOCUS BI and analytics platform
Self-Service for Everyone – WebFOCUS access, ETL, metadata
Data Ingestion – Enterprise Data Hub
ETL / ELT
Predictive Analytics – RStat
Business Intelligence – WebFOCUS
Low-cost storage of large data volumes
iWay Big Data Integrator – 100% Run "in" Apache Hadoop Architecture
• Simplified, easy-to-use interface to integrate in Apache Hadoop
• Native Apache Hadoop script generation
• Marshals Apache Hadoop resources and standards
• Takes advantage of performance and resource negotiation
• Includes sophisticated process management and governance
iWay Big Data Integrator – Key Features
Eclipse-based User Friendly Interface
Data ingestion using abstraction above Sqoop®, Flume®, Spark®, and proprietary streaming channel content
Transformation & Mapping
Publish to non-Apache Hadoop data sources
Auto-generated scripts/jobs based on configuration
iWay Big Data Integrator
Notable Features in 2016
• Data profiling, data preparation, master data management
• Analyze patterns, data types, sparsity, and cardinality of Apache Hadoop datasets
• Generation of data cleansing rules based on pattern analysis
• Auto-generation of remediation tickets for non-cleansable records
• Ability to transpose (wide to deep, deep to wide) data in parallel
• Missing-value imputation, data scaling, data categorization
• Streaming and in-process predictive model scoring (PMML and native code)
• "Natively" match and merge
Data Governance
iWay Big Data Integrator
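The profiling checks listed above (value patterns, sparsity, cardinality) can be sketched in a few lines. This is a plain-Python illustration of the idea, not BDI's actual implementation; the function name and sample data are invented.

```python
import re
from collections import Counter

def profile_column(values):
    """Profile one column: sparsity, cardinality, and value patterns."""
    total = len(values)
    missing = sum(1 for v in values if v in (None, ""))
    present = [v for v in values if v not in (None, "")]
    # Generalize each value to a pattern: digits -> 9, letters -> A.
    # Rare patterns are candidates for cleansing rules or remediation tickets.
    patterns = Counter(
        re.sub(r"[A-Za-z]", "A", re.sub(r"\d", "9", str(v))) for v in present
    )
    return {
        "sparsity": missing / total,          # share of missing values
        "cardinality": len(set(present)),     # distinct non-missing values
        "patterns": dict(patterns),           # pattern -> frequency
    }

zips = ["10001", "94107", None, "9410", "10001"]
print(profile_column(zips))
```

Here the lone "9999" pattern flags a likely truncated ZIP code, which is exactly the kind of record a generated cleansing rule or remediation ticket would target.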
Notable Features in 2016
• Full capture of data lineage for BDI ingestion, transformation, data preparation, and cleansing
• Integration with Cloudera Navigator to give a holistic data-lineage view across non-BDI sources
• User interface to interactively display information
Data Lineage
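Lineage capture for an ingest → transform → publish pipeline boils down to recording an edge per operation and walking the graph backwards. The toy model below illustrates the concept; it is not Cloudera Navigator's API, and all dataset names are invented.

```python
# Minimal lineage sketch: each edge records (source, operation, target).
lineage = []

def record(source, operation, target):
    """Log one pipeline step and return its output dataset name."""
    lineage.append((source, operation, target))
    return target

raw = record("mysql.orders", "sqoop-ingest", "hdfs:/raw/orders.avro")
clean = record(raw, "cleanse", "hive.orders_clean")
record(clean, "publish", "dw.orders")

def upstream(target):
    """Walk the graph backwards to show where a dataset came from."""
    for src, op, tgt in lineage:
        if tgt == target:
            yield from upstream(src)
            yield (src, op, tgt)

for edge in upstream("dw.orders"):
    print(edge)
```

An interactive lineage view like the one described above is essentially this traversal rendered graphically.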
iWay Big Data Integrator – Data Ingestion
Graphical Sqoop and Flume configuration
Sqoop
• Replace
• Change data capture
• Native "roll your own"
Flume
• Flume editor with validation
• Graphical wizard in the works
• Templates
Proprietary "channel" ingestion – iWay Service Manager
• Legacy formats (streaming channel, MUMPS, etc.)
Structured data standardized on the Avro format
Late-binding data "wrangler" for unstructured content
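The "graphical configuration → native script" flow above amounts to generating a standard Sqoop command line from a settings dictionary. The sketch below uses real Sqoop flags (`--connect`, `--as-avrodatafile`, `--incremental lastmodified`), but the config dict, function name, and connection details are invented for illustration.

```python
def generate_sqoop_import(cfg):
    """Render a Sqoop import command from a configuration dictionary."""
    parts = [
        "sqoop import",
        f"--connect {cfg['jdbc_url']}",
        f"--table {cfg['table']}",
        f"--target-dir {cfg['target_dir']}",
        "--as-avrodatafile",          # standardize structured data on Avro
    ]
    if cfg.get("incremental_col"):    # change-data-capture style import
        parts += [
            "--incremental lastmodified",
            f"--check-column {cfg['incremental_col']}",
        ]
    return " \\\n  ".join(parts)

print(generate_sqoop_import({
    "jdbc_url": "jdbc:mysql://dbhost/sales",
    "table": "orders",
    "target_dir": "/data/raw/orders",
    "incremental_col": "updated_at",
}))
```

Dropping `incremental_col` from the config would produce a full "replace" import instead, mirroring the Sqoop options listed above.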
iWay Big Data Integrator – Data Ingestion
Graphical Sqoop and Flume configuration
iWay Big Data Integrator – Transformation
Drag-and-drop data transformation designer
• Join (inner, left, right, full outer)
• Group by
• Aggregate functions as defined by the cluster
Any data on the cluster can be transformed, provided it is described in the Hive metastore
Logic preview
Transformations performed 100% in Apache Hadoop
Kerberos compliant
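The designer's join, group-by, and aggregate transforms boil down to SQL of the kind shown below, which in BDI would execute as HiveQL on the cluster. Here Python's built-in sqlite3 stands in for Hive, and the tables are invented for illustration.

```python
import sqlite3

# Hive-style join + group-by transform, run against an in-memory stand-in.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INT, cust_id INT, amount REAL);
    CREATE TABLE customers (cust_id INT, name TEXT);
    INSERT INTO orders VALUES (1, 10, 99.0), (2, 10, 1.0), (3, 11, 50.0);
    INSERT INTO customers VALUES (10, 'Acme'), (11, 'Zenith');
""")
rows = conn.execute("""
    SELECT c.name, COUNT(*) AS n_orders, SUM(o.amount) AS total
    FROM orders o
    LEFT JOIN customers c ON o.cust_id = c.cust_id   -- join type is selectable
    GROUP BY c.name                                   -- group by
    ORDER BY total DESC
""").fetchall()
for r in rows:
    print(r)
```

The "described in the Hive metastore" requirement maps to the CREATE TABLE statements here: the transform engine can only join what has a registered schema.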
iWay Big Data Integrator
Mapping
• Relational targets on a remote RDBMS
• XML definitions
• Custom-defined on the design canvas
Publish
• Publish to any JDBC-compliant MPP or RDBMS
• Staging table or direct-to-target load
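The "staging table or direct-to-target" choice above is a classic publish pattern: a staged load writes to a side table and swaps it in atomically, so readers never see a half-loaded target. The sketch below uses sqlite3 as a stand-in for a JDBC-style relational target; all names are invented.

```python
import sqlite3

# Publish via a staging table, then swap in one transaction.
target = sqlite3.connect(":memory:")
target.execute("CREATE TABLE dim_customer (cust_id INT, name TEXT)")

new_rows = [(10, "Acme"), (11, "Zenith")]

# Stage: load outside the live table, so long loads don't block readers.
target.execute("CREATE TABLE dim_customer_stg (cust_id INT, name TEXT)")
target.executemany("INSERT INTO dim_customer_stg VALUES (?, ?)", new_rows)

with target:  # atomic publish: either all rows appear, or none do
    target.execute("DELETE FROM dim_customer")
    target.execute("INSERT INTO dim_customer SELECT * FROM dim_customer_stg")
    target.execute("DROP TABLE dim_customer_stg")

print(target.execute("SELECT COUNT(*) FROM dim_customer").fetchone()[0])
```

A direct-to-target load skips the staging step entirely, which is simpler but exposes readers to partially loaded data if the job fails mid-way.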
iWay Big Data Integrator – Transformation
Drag and drop data transformation designer
iWay Big Data Integrator – Transformation: Underlying Script Generation View
iWay Big Data Integrator – Job Execution
Multiple job executions in a defined order
Real-World Strategies for Deploying Big Data
Data Quality and MDM – iWay Big Data Integrator: Edge-Node Deployment of DQ Services
Real-World Strategies for Deploying Big Data
Data Quality and MDM – iWay Big Data Integrator
Native Spark Interface to DQ
Real-World Strategies for Deploying Big Data
Spark Integration – iWay Big Data Integrator
Full integration of the Apache® Spark stack:
• Spark Streaming
• Spark SQL
• SparkR
• MLlib
Fully automated project setup, dependency management, and Scala version detection
Code, build, test, deploy – all from within Big Data Integrator
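The streaming-with-scoring combination above (Spark Streaming feeding an MLlib or PMML model) follows the micro-batch pattern: collect events into small batches and score each batch as it arrives. The plain-Python illustration below shows only the shape of that pattern; it does not use the actual Spark APIs, and the toy model and thresholds are invented.

```python
def score(event):
    """Stand-in predictive model: flag large transactions."""
    return 1.0 if event["amount"] > 100 else 0.0

def micro_batches(stream, batch_size=2):
    """Group a stream of events into fixed-size micro-batches."""
    batch = []
    for event in stream:
        batch.append(event)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # flush the final, possibly short, batch

events = [{"amount": 50}, {"amount": 500}, {"amount": 120}]
for batch in micro_batches(events):
    print([(e["amount"], score(e)) for e in batch])
```

In the real integration, the batching is handled by Spark Streaming on the cluster and the scoring function is a trained model rather than a hand-written rule.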
Real-World Strategies for Deploying Big Data
Spark Integration – iWay Big Data Integrator
Predictive Model Development and Deployment
iWay Big Data Integrator – Cloudera Certified
• Easy to use interface for deploying and integrating data on Apache Hadoop distributions of all flavors, ensuring portability.
• Ingests, transforms, and cleanses traditional RDBMS, mobile, social media, sensor, and other data in batch or streams, using native Apache Hadoop facilities.
• 100% YARN compliant, taking advantage of native Apache Hadoop performance and resource negotiation.
• Simplifies the use of Apache Hadoop ecosystem technologies such as MapReduce, Sqoop, Flume, Hive®, and Spark®.
iWay Big Data Integrator is Cloudera Certified!