Transforming data processing at Penton Raj Nair Director, Data Platform@Penton

Transforming data processing at Penton - Meetupfiles.meetup.com/2824692/Transforming data processing at Penton - Final.pdf · Transforming data processing at Penton Raj Nair Director,

Transforming data processing at Penton

Raj Nair Director, Data Platform@Penton

About Penton

• Professional information services company

• Provide actionable information to five core markets

Agriculture, Transportation, Natural Products, Infrastructure, Industrial Design & Manufacturing

EquipmentWatch.com - Prices, Specs, Costs, Rental

Govalytics.com - Analytics around Gov’t capital spending down to county level

SourceESB - Vertical Directory, electronic parts

NextTrend.com - Identify new product trends in the natural products industry

Practical Hadoop: Hadoop at Penton

What got us thinking?

• Business units process data in silos

• Heavy ETL – Hours to process, in some cases days

• Not even using all the data we want

• Not logging what we needed to

• Can’t scale for future requirements

Data Processing Pipeline

Assembly-line processing that produces biz value: new features, new insights, new products

Penton Examples

• Daily Inventory data, ingested throughout the day (tens of thousands of parts)

• Auction and survey data gathered daily

• Aviation Fleet data, varying frequency

Pipeline stages: Ingest, store; Clean, validate; Apply business rules; Map; Analyze; Report; Distribute

Slow Extract, Transform and Load = frustration + missed business SLAs

Won’t scale for the future

Various data formats, mostly unstructured

Two use cases

• Daily model data – upload and map

– Ingest data, build buckets

– Map data (batch and interactive)

– Build Aggregates (dynamic)

• Inventory data for electronics parts

– Hundreds of thousands of parts daily

– Ingest, map, apply biz rules, distribute

Issues (daily model data):

– Mapping time

Issues (inventory data):

– Biz rules processing

– Indexing time

– Little insight into data quality

– Little insight into failures

Up until today…

• Ingest raw CSVs as tables in RDBMS

• Run stored procedures over batches of data

• Build new tables for website queries

• Build new tables for loading Solr/Search

• Set retention dates to reduce database “clog”

Challenges for Models

- Mapping, batch and interactive

- On-the-fly aggregations

- Post-mapping distribution of data

Challenges for Inventory

- Windows-based systems

- A good number of small files daily

- File names contain metadata

What are the options? And keep in mind …

Where did we land?

Adopt Hadoop Ecosystem

- M/R: Ideal for batch processing

- Flexible for storage

- NoSQL: scale, usability and flexibility

Expand RDBMS options

- Expensive

- Complex

Technologies shown: HBase, Oracle, SQL Server, Drools

Models

   Type                           Number of files  Total number of records  Projected time to map
1  Auction                        5,000            1,400,000                3 days
2  Rental Rate                    5,000            3,700,000                8 days
3  Resale                         5,000            6,535,000                2 days
4  Serial Number - Manufacturer   5,000            13,700,000               4 days
5  Serial Number - Web            5,000            8,220,000                12 days
   Totals                         25,000           33,555,000

Mapping Operations:

1. By file: Run map operation by single file

2. By type: Run map operations for all files of a specific type

4-Node Hadoop/HBase cluster

Entire 25,000 files map in 52 minutes

On-the-fly aggregations were materialized views in Oracle; in HBase they required some complicated coprocessor coding for performance
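To put the 52-minute figure next to the projections in the table, here is a rough back-of-the-envelope calculation. The record counts, projected days and cluster time are copied from the slide; treating the five projections as running one after another is an assumption made only for this sketch, since the slide does not say whether they would overlap.

```python
# Figures copied from the Models table and the cluster result on the slide.
records = {
    "Auction": 1_400_000,
    "Rental Rate": 3_700_000,
    "Resale": 6_535_000,
    "Serial Number - Manufacturer": 13_700_000,
    "Serial Number - Web": 8_220_000,
}
projected_days = {
    "Auction": 3,
    "Rental Rate": 8,
    "Resale": 2,
    "Serial Number - Manufacturer": 4,
    "Serial Number - Web": 12,
}
CLUSTER_MINUTES = 52  # all 25,000 files on the 4-node Hadoop/HBase cluster

total_records = sum(records.values())

# Assumption: the per-type projections run back to back (not stated on the slide).
total_projected_minutes = sum(projected_days.values()) * 24 * 60

# Average mapping throughput of the cluster run.
records_per_second = total_records / (CLUSTER_MINUTES * 60)

# Ratio of projected time to measured cluster time under that assumption.
speedup = total_projected_minutes / CLUSTER_MINUTES

print(f"total records:       {total_records:,}")
print(f"cluster throughput:  {records_per_second:,.0f} records/s")
print(f"projected / actual:  {speedup:,.0f}x")
```

Under the sequential assumption the projections add up to 29 days, so the ratio is only indicative; it would shrink if the original jobs could have run in parallel.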

Architecture (Models)

[Architecture diagram. Components: Data Upload UI; REST API (CSV and rule management endpoints); Hadoop HDFS; HBase; MR jobs; CSV files; master database of products/parts; current Oracle schema; existing business applications. Flow labels: API calls; launch MR jobs; insert accepted data; push updates.]

But we are not done …

• Our vision is a data platform

– More on that later

• First, the practical aspects of this journey

Allocate Time for Detailed Research

• Know, know, know your source details

- What’s the source of your source?

- What are the different formats?

- What’s the frequency?

- What are their vectors (web, ftp, e-mail, streaming) ?

- What metadata do you need or have currently?

- Where’s your metadata?

- What lookup data do you need? What format are they in?

- What data “sinks” do you distribute to (post processing) ?

For instance:

- In Inventory processing, some metadata was part of the filename, which had a big influence

- We had lookup data in SQL Server

- We had to distribute data out to SQL Server, Solr and a data mart

- We had a very large number of small files

- We get files via e-mail, web and ftp; for simplicity we converge all vectors to ftp
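When metadata lives in the filename, it has to be pulled out before the file contents are processed. The sketch below shows one way to do that in Python. The naming scheme `<supplier>_<YYYYMMDD>_<feed>.csv`, the `FileMetadata` type and the `parse_filename` helper are all hypothetical, chosen only for illustration; the slides do not describe the actual file names.

```python
import re
from datetime import date, datetime
from typing import NamedTuple, Optional

# Hypothetical naming scheme (an assumption): <supplier>_<YYYYMMDD>_<feed>.csv
FILENAME_PATTERN = re.compile(
    r"^(?P<supplier>[A-Za-z0-9-]+)_(?P<day>\d{8})_(?P<feed>[a-z]+)\.csv$"
)


class FileMetadata(NamedTuple):
    supplier: str
    received: date
    feed: str


def parse_filename(name: str) -> Optional[FileMetadata]:
    """Return the metadata encoded in a file name, or None if it does not fit."""
    match = FILENAME_PATTERN.match(name)
    if match is None:
        return None
    try:
        received = datetime.strptime(match["day"], "%Y%m%d").date()
    except ValueError:
        # Eight digits that do not form a real calendar date.
        return None
    return FileMetadata(match["supplier"], received, match["feed"])


if __name__ == "__main__":
    print(parse_filename("acme_20140115_inventory.csv"))
    print(parse_filename("notes.txt"))
```

Returning `None` for names that do not fit, instead of raising, lets the caller route such files to a rejection log rather than stopping the whole batch.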

Allocate Time for Detailed Research

• Understand your processing patterns – In detail

- What portions are batch-processing vs interactive?

- Do you need to deal with joins, merges and updates?

- Do you need to process in “near” real-time?

- Are you going to reuse any existing processing workflows ?

- How much logging do you want to capture?

- How much of the processing do you want analytics on?

- Revisit with business owners – very important

For Inventory:

- Logging of inventory rejection by business rule

- Operational tracking of processing performance

For Models:

- Business team needed to interactively perform mapping functions

- Aggregations had to be built in real time after upload
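Logging rejections by business rule can be sketched as follows. This is plain Python for illustration, not Drools and not the team's implementation; the two rules, the field names `part_number` and `quantity`, and the `apply_rules` helper are assumptions.

```python
from collections import Counter
from typing import Callable, Dict, Iterable, List, Tuple

Record = Dict[str, str]
# A rule is a (name, predicate) pair; the predicate returns True when the record passes.
Rule = Tuple[str, Callable[[Record], bool]]

# Hypothetical rules (assumptions); real ones would come from the business owners.
RULES: List[Rule] = [
    ("missing_part_number", lambda r: bool(r.get("part_number", "").strip())),
    (
        "non_positive_quantity",
        lambda r: r.get("quantity", "").isdigit() and int(r["quantity"]) > 0,
    ),
]


def apply_rules(records: Iterable[Record]) -> Tuple[List[Record], Counter]:
    """Split records into accepted ones and a count of rejections per rule."""
    accepted: List[Record] = []
    rejections: Counter = Counter()
    for record in records:
        # The first failing rule is the one logged for this record.
        failed = next((name for name, passes in RULES if not passes(record)), None)
        if failed is None:
            accepted.append(record)
        else:
            rejections[failed] += 1
    return accepted, rejections


if __name__ == "__main__":
    sample = [
        {"part_number": "A1", "quantity": "5"},
        {"part_number": "", "quantity": "5"},
        {"part_number": "B2", "quantity": "0"},
    ]
    accepted, rejections = apply_rules(sample)
    print(len(accepted), dict(rejections))
```

Keeping a per-rule counter alongside the accepted records is what gives the operational view the slide asks for: which rule rejects the most inventory, and how that changes from day to day.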

Allocate Time for Detailed Research

• Investigate the different workloads

- What’s the volume of your transactional workloads?

- What’s the nature of the workloads (read, write, read-heavy, write-heavy, ...)?

- Is there a requirement for Exploratory BI / DW?

- Is there a requirement for high performance BI ?

- What’s the expected data growth rate?

Skills and Expertise

HADOOP: Invest in learning

Get used to:

- File-based processing

- Key-value pairs

- Distributed computing

Pay special attention to:

- InputFormats

- InputSplits

- OutputFormats

- Small files problem

- Controlling output

- “Append only”

NoSQL: Keep an open mind

Get acquainted with:

- Flexible schemas

- Fewer joins

- CAP theorem

- Constraints re: RDBMS

- Sharding, clustering

Make sure you understand:

- When you reap benefits

- Indexing or the lack of it

- Performance benchmarks from unbiased studies

And good luck finding skills in the market

Be prepared to do POCs – the dangers of not using your own data set to test are many
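For readers coming from an RDBMS background, thinking in key-value pairs is the main shift mentioned above. The sketch below imitates the map, shuffle and reduce steps in plain Python on a handful of lines. It does not use Hadoop, and the `model,price` CSV layout and all function names are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_record(line: str) -> Iterator[Tuple[str, float]]:
    """Map step: turn one CSV line into (key, value) pairs.

    Hypothetical layout (an assumption): model,price
    """
    model, price = line.strip().split(",")
    yield model, float(price)


def shuffle(pairs: Iterable[Tuple[str, float]]) -> Dict[str, List[float]]:
    """Shuffle step: group all values that share a key."""
    grouped: Dict[str, List[float]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def reduce_average(values: List[float]) -> float:
    """Reduce step: collapse the values of one key into an aggregate."""
    return sum(values) / len(values)


def run_job(lines: Iterable[str]) -> Dict[str, float]:
    """Run map, shuffle and reduce in sequence over the input lines."""
    pairs = (pair for line in lines for pair in map_record(line))
    return {key: reduce_average(vals) for key, vals in shuffle(pairs).items()}


if __name__ == "__main__":
    print(run_job(["D6,100.0", "D6,300.0", "320D,50.0"]))
```

On a real cluster the map and reduce functions run on different nodes and the framework performs the shuffle, but the contract between the steps is the same: everything flows as key-value pairs.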

Enough already.. No more MapReduce

Watch the Trends !!

Hadoop is becoming the OS of Big Data

Production use cases are still MR for batch

Abstractions will rise to save the day

Some interesting challenges lie ahead

• Infusing more context in the processing pipeline

Data Enrichment

• Content recommendations

Machine Learning

• Pipeline moves from batch to more near real-time

Real-time Ingestion and processing

Real-time Solr indexing

Incubate, Innovate … But keep application integration seamless

[Platform diagram. Layers: Data Infrastructure (Hadoop ecosystem, data marts, OLTP store, NoSQL); Data Flow (ingestion, ETL, meta, logging); Data as a Platform.]

We are building a data platform

And so…

We are hiring !!!

Java, Hadoop, ETL, Data Warehousing, NoSQL, Machine Learning, Python, PHP, Spark, Drools

Comp Sci, Comp Eng.

[email protected]