
Building a data pipeline to ingest data into Hadoop in minutes using Streamsets Data Collector
Guglielmo Iozzia, Big Data Infrastructure Engineer @ IBM Ireland

Data Ingestion for Analytics: a real scenario

In our business area (cloud applications) there were many questions to be answered. They related to:

● Defect analysis
● Outage analysis
● Cyber-Security

“Data is the second most important thing in analytics”

Data Ingestion: multiple sources...

● Legacy systems
● DB2
● Lotus Domino
● MongoDB
● Application logs
● System logs
● New Relic
● Jenkins pipelines
● Testing tools output
● RESTful Services

… and so many tools available to get the data

What are we going to do with all that data?

Issues

● The need to collect data from multiple sources introduces redundancy, which costs additional disk space and increases query times.

● A small team.
● Lack of skills and experience across the team (and the business area in general) in managing Big Data tools.
● Low budget.

Alternatives

#1 Panic

Alternatives

#2 Cloning team members

Alternatives

#3 Find a smart way to simplify the data ingestion process

A single tool needed...

● Design complex data flows with minimal coding and maximum flexibility.
● Provide real-time data flow statistics and metrics for each flow stage.
● Automated error handling and alerting.
● Easy to use by everyone.
● Zero downtime when upgrading the infrastructure, thanks to the logical isolation of each flow stage.
● Open Source.

… something like this

Streamsets Data Collector

Streamsets Data Collector

Streamsets Data Collector: supported origins

Streamsets Data Collector: available destinations

Streamsets Data Collector: available processors

● Base64 Field Decoder
● Base64 Field Encoder
● Expression Evaluator
● Field Converter
● JavaScript Evaluator
● JSON Parser
● Jython Evaluator (see the sketch after this list)
● Log Parser
● Stream Selector
● XML Parser

...and many others
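
The scripting processors (JavaScript, Jython) let you transform records with a few lines of code. Below is a minimal Jython Evaluator sketch: the 'message' field name and the enrichment logic are illustrative assumptions, not taken from the talk. The 'records', 'output' and 'error' objects are provided by the Data Collector scripting runtime.

    # Jython Evaluator sketch (field names are assumptions).
    for record in records:
        try:
            # Hypothetical field holding a raw log line.
            message = record.value['message']

            # Derive a simple severity flag and tag the record.
            record.value['is_error'] = 'ERROR' in message
            record.value['processed_by'] = 'jython-evaluator'

            # Pass the enriched record to the next stage of the flow.
            output.write(record)
        except Exception as e:
            # Route problem records to the stage's error handling.
            error.write(record, str(e))

Records written to the error stream show up in the flow statistics, so they can feed the alerting described later.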

Streamsets Data Collector

Demo

Streamsets DC: performance and reliability

● Two available execution modes: standalone or cluster.
● Implemented in Java, so any performance best practice/recommendation for Java applications applies here.
● REST services for performance monitoring are available (see the sketch below).
● Rules and alerts (both metric and data rules).
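
As a rough illustration of the monitoring REST services, the sketch below polls a local Data Collector for pipeline metrics using Python's requests library. The port (18630), the admin/admin credentials and the /rest/v1/... paths are assumptions based on a default installation and may differ in your deployment or Data Collector version.

    # Monitoring sketch (assumed defaults: port 18630, admin/admin).
    import requests

    SDC_URL = 'http://localhost:18630'
    AUTH = ('admin', 'admin')

    # List the pipelines known to this Data Collector instance.
    pipelines = requests.get(SDC_URL + '/rest/v1/pipelines', auth=AUTH).json()

    for pipeline in pipelines:
        # The identifier field name depends on the Data Collector version.
        pipeline_id = pipeline.get('pipelineId') or pipeline.get('name')

        # Fetch runtime metrics (record counts, batch times, per-stage meters).
        resp = requests.get(
            SDC_URL + '/rest/v1/pipeline/' + pipeline_id + '/metrics',
            auth=AUTH)

        if resp.ok and resp.text.strip():
            print(pipeline_id, sorted(resp.json().get('counters', {})))
        else:
            print(pipeline_id, 'no metrics (pipeline probably not running)')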

Streamsets Data Collector: security

● You can authenticate user accounts based on LDAP.
● Authorization: the Data Collector provides several roles (admin, manager, creator, guest).
● You can use Kerberos authentication to connect to origin and destination systems.
● Follow the usual security best practices (iptables, networking, etc.) for Java web applications running on Linux machines.

Useful Links

Streamsets Data Collector:

https://streamsets.com/product/

Thanks!

My contacts:

Linkedin: https://ie.linkedin.com/in/giozzia

Blog: http://googlielmo.blogspot.ie/

Twitter: https://twitter.com/guglielmoiozzia