
Transforming data processing at Penton

Raj Nair, Director, Data Platform @ Penton

About Penton

• Professional information services company

• Provide actionable information to five core markets

Agriculture, Transportation, Natural Products, Infrastructure, Industrial Design & Manufacturing

EquipmentWatch.com - Prices, Specs, Costs, Rental

Govalytics.com - Analytics around Gov’t capital spending down to county level

SourceESB - Vertical Directory, electronic parts

NextTrend.com - Identify new product trends in the natural products industry

Practical Hadoop: Hadoop at Penton

What got us thinking?

• Business units process data in silos

• Heavy ETL – Hours to process, in some cases days

• Not even using all the data we want

• Not logging what we needed to

• Can’t scale for future requirements

Data Processing Pipeline

[diagram] Assembly-line processing for business value: the data processing pipeline drives new features, new insights and new products.

Data Processing Pipeline

Penton Examples

• Daily Inventory data, ingested throughout the day (tens of thousands of parts)

• Auction and survey data gathered daily

• Aviation Fleet data, varying frequency

Pipeline stages: Ingest, store → Clean, validate → Apply business rules → Map → Analyze → Report → Distribute
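As a minimal sketch of the assembly-line idea (illustrative Java, not Penton's actual code), each stage is a function over a batch of records, and stages compose in order:

    import java.util.Arrays;
    import java.util.List;
    import java.util.function.UnaryOperator;
    import java.util.stream.Collectors;

    public class AssemblyLine {
        // Run a batch through the stages in order, mirroring
        // ingest -> clean/validate -> rules -> map -> analyze -> distribute.
        static List<String> run(List<String> batch, List<UnaryOperator<List<String>>> stages) {
            for (UnaryOperator<List<String>> stage : stages) {
                batch = stage.apply(batch);
            }
            return batch;
        }

        public static void main(String[] args) {
            List<UnaryOperator<List<String>>> stages = Arrays.asList(
                b -> b.stream().map(String::trim).collect(Collectors.toList()),          // clean
                b -> b.stream().filter(r -> !r.isEmpty()).collect(Collectors.toList())); // validate
            System.out.println(run(Arrays.asList(" part-a ", "part-b", ""), stages));    // [part-a, part-b]
        }
    }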

Slow Extract, Transform and Load = frustration + missed business SLAs, and it won’t scale for the future

Various data formats, mostly unstructured

Two use cases

• Daily model data – upload and map

– Ingest data, build buckets

– Map data (batch and interactive)

– Build Aggregates (dynamic)

• Inventory data for electronics parts

– Hundreds of thousands of parts daily

– Ingest, map, apply biz rules, distribute

Issues (models):
- Mapping time

Issues (inventory):
- Business rules processing
- Indexing time
- Little insight into data quality
- Little insight into failures

Up until today…
• Ingest raw CSVs as tables in an RDBMS
• Run stored procedures over batches of data
• Build new tables for website queries
• Build new tables for loading Solr/Search
• Set retention dates to reduce database “clog”

Challenges for Models
- Mapping, batch and interactive
- On-the-fly aggregations
- Post-mapping distribution of data

Challenges for Inventory
- Windows-based systems
- A good number of small files daily
- File names contain metadata

What are the options? And keep in mind…

Where did we land?

Adopt the Hadoop ecosystem
- M/R: ideal for batch processing
- Flexible storage
- NoSQL: scale, usability and flexibility

Expand RDBMS options
- Expensive
- Complex

[diagram] Chosen stack: HBase, Oracle, SQL Server, Drools

Models

   Type                          Number of files  Total number of records  Projected time to map
1  Auction                       5,000            1,400,000                3 days
2  Rental Rate                   5,000            3,700,000                8 days
3  Resale                        5,000            6,535,000                2 days
4  Serial Number - Manufacturer  5,000            13,700,000               4 days
5  Serial Number - Web           5,000            8,220,000                12 days
   Totals                        25,000           33,555,000

Mapping Operations:
1. By file: run the map operation on a single file
2. By type: run map operations for all files of a specific type
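A hedged sketch of how the two scopes can differ only in input path; the class, job and path names here are illustrative, not the actual Penton job:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class ModelMapDriver {
        // Illustrative mapper: tag each record with its source offset.
        public static class ModelMapper extends Mapper<LongWritable, Text, Text, Text> {
            @Override
            protected void map(LongWritable key, Text value, Context ctx)
                    throws IOException, InterruptedException {
                ctx.write(new Text(key.toString()), value);
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "model-mapping");
            job.setJarByClass(ModelMapDriver.class);
            // "by file": args[0] is a single CSV; "by type": args[0] is the
            // directory holding every file of that type (e.g. .../auction/)
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            job.setMapperClass(ModelMapper.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(Text.class);
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }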

On a 4-node Hadoop/HBase cluster, the entire 25,000 files map in 52 minutes

On-the-fly aggregations had been materialized views in Oracle; replacing them required some complicated coprocessor coding in HBase for performance
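For flavor, a hedged sketch of server-side aggregation using HBase's bundled AggregateImplementation coprocessor (it must be enabled on the table); the table and column names are invented, and the actual Penton coprocessors were custom:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.client.coprocessor.AggregationClient;
    import org.apache.hadoop.hbase.client.coprocessor.LongColumnInterpreter;
    import org.apache.hadoop.hbase.util.Bytes;

    public class RentalRateSum {
        public static void main(String[] args) throws Throwable {
            Configuration conf = HBaseConfiguration.create();
            Scan scan = new Scan();
            scan.addColumn(Bytes.toBytes("d"), Bytes.toBytes("rate"));
            // Sum is computed region-side by the coprocessor, not on the client
            AggregationClient agg = new AggregationClient(conf);
            long sum = agg.sum(TableName.valueOf("rental_rates"),
                               new LongColumnInterpreter(), scan);
            System.out.println("sum = " + sum);
        }
    }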

Architecture (Models)

[diagram] A Data Upload UI makes REST API calls to the CSV and rule management endpoints, which launch MR jobs over HBase and Hadoop HDFS (where the CSV files land). Accepted data is inserted into the current Oracle schema, the master database of products/parts, and updates are pushed out to existing business applications.
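The pattern in the diagram, sketched with JAX-RS (an assumption; the slides do not say which REST framework was used). The endpoint submits a job asynchronously and hands back the job id:

    import javax.ws.rs.POST;
    import javax.ws.rs.Path;
    import javax.ws.rs.QueryParam;
    import javax.ws.rs.core.Response;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    @Path("/mapping")
    public class MappingResource {
        @POST
        public Response launch(@QueryParam("input") String input,
                               @QueryParam("output") String output) throws Exception {
            Job job = Job.getInstance(new Configuration(), "interactive-mapping");
            FileInputFormat.addInputPath(job, new org.apache.hadoop.fs.Path(input));
            FileOutputFormat.setOutputPath(job, new org.apache.hadoop.fs.Path(output));
            job.submit();  // non-blocking; the UI can poll for completion
            return Response.accepted(job.getJobID().toString()).build();
        }
    }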

But we are not done …

• Our vision is a data platform

– More on that later

• First, the practical aspects of this journey

Allocate Time for Detailed Research

• Know, know, know your source details

- What’s the source of your source?

- What are the different formats?

- What’s the frequency?

- What are their vectors (web, ftp, e-mail, streaming)?

- What metadata do you need or have currently?

- Where’s your metadata?

- What lookup data do you need? What format is it in?

- What data “sinks” do you distribute to (post-processing)?

For instance:
- In inventory processing, some metadata was part of the filename, which had a big influence on the design
- We had lookup data in SQL Server
- We had to distribute data out to SQL Server, Solr and a data mart
- We had a very large number of small files (see the sketch below)
- We get files via e-mail, web and ftp; for simplicity we converge all vectors to ftp
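One common way to blunt the small-files problem while preserving the metadata-bearing filename is to pack each day's files into a SequenceFile keyed by filename. A minimal sketch; the landing and output paths are assumptions:

    import java.io.File;
    import java.nio.file.Files;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class PackSmallFiles {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                    SequenceFile.Writer.file(new Path("/data/inventory/packed.seq")),
                    SequenceFile.Writer.keyClass(Text.class),
                    SequenceFile.Writer.valueClass(BytesWritable.class))) {
                for (File f : new File("/landing/ftp/inventory").listFiles()) {
                    byte[] body = Files.readAllBytes(f.toPath());
                    // the filename carries metadata (e.g. supplier + date), so keep it as the key
                    writer.append(new Text(f.getName()), new BytesWritable(body));
                }
            }
        }
    }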

Allocate Time for Detailed Research

• Understand your processing patterns – In detail

- What portions are batch-processing vs interactive?

- Do you need to deal with joins, merges and updates?

- Do you need to process in “near” real-time?

- Are you going to reuse any existing processing workflows?

- How much logging do you want to capture?

- How much of the processing do you want analytics on?

- Revisit with business owners – very important

For Inventory:
- Logging of inventory rejections by business rule (sketched below)
- Operational tracking of processing performance
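Hadoop counters are one lightweight way to get per-rule rejection logging and operational tracking out of the same job; the rule and check below are invented for illustration:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class InventoryMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context ctx)
                throws IOException, InterruptedException {
            String record = value.toString();
            if (record.split(",").length < 5) {
                // counter group "Rejections", one counter per business rule;
                // totals show up in the job history for operational tracking
                ctx.getCounter("Rejections", "MissingRequiredFields").increment(1);
                return;
            }
            ctx.write(value, NullWritable.get());
        }
    }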

For Models:
- The business team needed to perform mapping functions interactively
- Aggregates had to be built in real time after upload

Allocate Time for Detailed Research

• Investigate the different workloads

- What’s the volume of your transactional workloads?

- What’s the nature of the workloads (read, write, read-heavy, write-heavy, …)?

- Is there a requirement for Exploratory BI / DW?

- Is there a requirement for high-performance BI?

- What’s the expected data growth rate?

Skills and Expertise

Invest in learning

HADOOP

Get used to:
- File-based processing
- Key-value pairs
- Distributed computing

Pay special attention to:
- InputFormats
- InputSplits
- OutputFormats
- The small-files problem (see the sketch below)
- Controlling output
- “Append only”
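For example, the small-files problem has a stock mitigation in MapReduce: CombineTextInputFormat packs many small files into each split so the job does not pay one map task per file. The 128 MB ceiling below is an example value, not a recommendation from the talk:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;

    public class CombineSetup {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "combine-small-files");
            // one split can now cover many small files, up to the size cap
            job.setInputFormatClass(CombineTextInputFormat.class);
            CombineTextInputFormat.setMaxInputSplitSize(job, 128 * 1024 * 1024L);
        }
    }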


NoSQL

Keep an open mind. Get acquainted with:
- Flexible schemas
- Fewer joins
- The CAP theorem
- Constraints relative to an RDBMS
- Sharding, clustering
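“Flexible schemas” in practice: HBase rows in the same table can carry different column qualifiers, with only the column family declared up front. A sketch with invented table and column names:

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class FlexibleRows {
        public static void main(String[] args) throws Exception {
            try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
                 Table table = conn.getTable(TableName.valueOf("parts"))) {
                Put a = new Put(Bytes.toBytes("part-001"));
                a.addColumn(Bytes.toBytes("d"), Bytes.toBytes("price"), Bytes.toBytes("12.50"));
                Put b = new Put(Bytes.toBytes("part-002"));  // same table, different columns
                b.addColumn(Bytes.toBytes("d"), Bytes.toBytes("rental_rate"), Bytes.toBytes("99"));
                table.put(a);
                table.put(b);
            }
        }
    }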


Make sure you understand:
- When you reap the benefits
- Indexing, or the lack of it
- Performance benchmarks from unbiased studies

And good luck finding skills in the market

Be prepared to do POCs: the dangers of not testing with your own data set are many

Enough already… no more MapReduce

Watch the Trends!!

Hadoop is becoming the OS of Big Data

Production use cases are still MR for batch

Abstractions will rise to save the day

Some interesting challenges lie ahead

• Data enrichment: infusing more context into the processing pipeline

• Machine learning: content recommendations

• Real-time ingestion and processing, and real-time Solr indexing: the pipeline moves from batch to near real-time
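A sketch of the real-time Solr indexing direction with SolrJ: index a document as soon as it clears the pipeline instead of rebuilding search tables in batch. The URL, collection and field names are examples, not the actual setup:

    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    public class IndexOnIngest {
        public static void main(String[] args) throws Exception {
            try (HttpSolrClient solr =
                     new HttpSolrClient.Builder("http://localhost:8983/solr/parts").build()) {
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", "part-001");
                doc.addField("description_t", "hydraulic pump");
                solr.add(doc);
                solr.commit();  // or rely on autoSoftCommit for near-real-time visibility
            }
        }
    }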

Incubate, Innovate … But keep application integration seamless

Data as a Platform

[diagram] A data-infrastructure layer (ingestion, ETL, metadata, logging) manages the data flow across the Hadoop ecosystem, data marts, the OLTP store and NoSQL.

We are building a data platform

And so…

We are hiring!!!

Java, Hadoop, ETL, Data Warehousing, NoSQL, Machine Learning, Python, PHP, Spark, Drools

Comp Sci, Comp Eng.

rajesh.nair@penton.com
