Data Ingestion Dedup-Enrich-ETL
Chinmay Kolhatkar ([email protected])
Committer, Apache Apex; Engineer, DataTorrent
July 21, 2016
© 2016 DataTorrent

Ingesting Data from Kafka to HDFS with Dedupper & Enrichment using JDBC


Agenda

• About Apache Apex
• Apex Platform Overview
• Apex - Native Hadoop Integration
• Apex Malhar Library
• What is Data Ingestion?
• Data Ingestion - Use Cases
• Dedup-Enrich ETL Demo

About Apache Apex

• Platform and runtime engine that enables development of scalable and fault-tolerant distributed applications
• Hadoop native (Hadoop >= 2.2)
  • No separate service to manage stream processing
  • Streaming engine built into Application Master and Containers
• Process streaming or batch big data
• High throughput and low latency
• Library of commonly needed business logic
• Write any custom business logic in your application

Apex Platform Overview

Apex - Native Hadoop Integration

• YARN is the resource manager

• HDFS used for storing any persistent state

Apex Malhar Library

• RDBMS: Vertica, MySQL, Oracle, JDBC
• NoSQL: Cassandra, HBase, Aerospike, Accumulo, Couchbase/CouchDB, Redis, MongoDB, Geode
• Messaging: Kafka, Solace, Flume, ActiveMQ, Kinesis, NiFi
• File Systems: HDFS/Hive, NFS, S3
• Parsers: XML, JSON, CSV, Avro, Parquet
• Transformations: Filters, Rules, Expression, Dedup, Enrich
• Analytics: Dimensional Aggregations (with state management for historical data + query)
• Protocols: HTTP, FTP, WebSocket, MQTT, SMTP
• Other: Elasticsearch, Script (JavaScript, Python, R), Solr, Twitter

What is Data Ingestion?

• Data Ingestion
  • A process of obtaining, importing, and analyzing data for later use or storage in a database
• Big Data Ingestion
  • Reading from data sources
  • Importing the data
  • Processing data to produce intermediate data
  • Sending data out to durable data stores
• ETL + Big Data => Data Ingestion

Data Ingestion - Use cases

• Data Sync
  • Read data from source
  • Write to destination
  • Keep syncing data as per rules
• Real-time IoT Data Processing
  • Read sensor data from sources
  • Do some processing over the received data
  • Store/publish the results to the destination

Dedup-Enrich-ETL Application

• KafkaInput - Reads data from Kafka
• CSVParser - Parses CSV data and converts it to POJOs
• Dedup - Deduplicates the data
• Enrich - Enriches the data using an external source
• HDFSOut - Writes the data out to HDFS
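
The data flow through these five stages can be sketched in plain Python. This is only an illustration of the flow and does not use the Apex or Malhar APIs. The record schema (`order_id`, `product_id`, `amount`), the `PRODUCT_NAMES` lookup table standing in for the JDBC source, and all function names are assumptions made for this sketch; Kafka and HDFS are replaced by in-memory lists.

```python
import csv

# Assumed record schema; the slides do not specify one.
FIELDS = ["order_id", "product_id", "amount"]

# Hypothetical stand-in for the external (JDBC) lookup used by Enrich.
PRODUCT_NAMES = {"p1": "Widget", "p2": "Gadget"}


def kafka_input(messages):
    """Stand-in for KafkaInput: yields raw CSV lines from an in-memory list."""
    yield from messages


def csv_parser(lines):
    """Stand-in for CSVParser: turns each CSV line into a dict (the 'POJO')."""
    for line in lines:
        row = next(csv.reader([line]), None)
        if row is None or len(row) != len(FIELDS):
            continue  # skip malformed records
        yield dict(zip(FIELDS, row))


def dedup(records, key="order_id"):
    """Stand-in for Dedup: drops records whose key was already seen."""
    seen = set()
    for record in records:
        if record[key] in seen:
            continue
        seen.add(record[key])
        yield record


def enrich(records, lookup):
    """Stand-in for Enrich: adds a field looked up from an external source."""
    for record in records:
        name = lookup.get(record["product_id"], "UNKNOWN")
        yield {**record, "product_name": name}


def hdfs_out(records, sink):
    """Stand-in for HDFSOut: appends one CSV line per record to a list."""
    for record in records:
        sink.append(",".join(record.values()))
    return sink


def run_pipeline(messages, lookup=PRODUCT_NAMES):
    """Chains the five stages in the order listed above."""
    parsed = csv_parser(kafka_input(messages))
    return hdfs_out(enrich(dedup(parsed), lookup), [])


if __name__ == "__main__":
    for line in run_pipeline(["1,p1,9.99", "2,p2,5.00", "1,p1,9.99"]):
        print(line)
```

Each stage is a generator, so records flow through one at a time, loosely mirroring how tuples stream between operators. The `seen` set in this sketch grows without bound; a dedup stage on an unbounded stream would need some expiry policy for old keys, which is omitted here.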

Dedup-Enrich-ETL Live Demo

Resources

• Apache Apex website - http://apex.apache.org/
• Subscribe - http://apex.apache.org/community.html
• Download - http://apex.apache.org/downloads.html
• Twitter - @ApacheApex; Follow - https://twitter.com/apacheapex
• Facebook - https://www.facebook.com/ApacheApex/
• Meetup - http://www.meetup.com/topics/apache-apex
• SlideShare - http://www.slideshare.net/ApacheApex/presentations
• More Examples - https://github.com/DataTorrent/examples
• Startup Program (Free Enterprise License for Startups, Educational Institutions, Non-Profits) - https://www.datatorrent.com/startups/
• Cloud Trial - https://www.datatorrent.com/download/cloud-trial/

We Are Hiring

• Back-End Engineers
• QA Automation Engineers
• Solutions Engineers
• Apply at: https://www.datatorrent.com/careers/