#GeodeSummit - Large Scale Fraud Detection using GemFire Integrated with Greenplum

© 2016 Pivotal Software, Inc. All rights reserved.

Large Scale Fraud Analytics: GemFire Greenplum Connector (G2C)

Background

• Government fraud revenue retention program

• Detecting & retaining ~$5B annually
  – Primary focus on identity theft
  – Processes up to 8 million cases per day
  – Current & historic data size ~60 TB (compressed)

• Modifying architecture to integrate GemFire for scalable Java-based business logic, web service integration, and event-driven design

Fraud Systems Simplified

Prepare
  • Ingest
  • Restructure (ETL)

Score
  • Model Evaluation

Disposition
  • Business Logic
  • Prioritization

Respond
  • Investigation
  • Stop Payments

Business Logic Engine

ETL

Reporting

In-db Analytics

Application Services

Case Study Architecture – Scaling Up

GemFire

Greenplum

Spring Boot App Services

Informatica w/ PWX (ETL)

Business Objects (Reporting)

Legacy Logic Implementation

Logic Engine

In-db Analytics

Greenplum

Prepare
  • Ingest
  • Restructure (ETL)

Score
  • Model Evaluation

Disposition
  • Business Logic
  • Prioritization

Respond
  • Investigation
  • Stop Payments

Pivotal Greenplum (GPDB)

• Postgres Community OSS
  – Original fork of 8.2.15
  – Massively parallel processing database

• Master coordinates queries across segment databases

• Supports in-database model evaluation
  – MADlib, PL/R, SAS

[Diagram: GPDB logical, physical, and software views, showing the master and segments]

Initial Implementation

• Fraud model results evaluated by business logic engine

• Flat file data extraction
  – Significant custom code to construct required object model
  – Table → CSV → POJO

• Shared element in an otherwise distributed system
  – Performance considerations
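To give a feel for the hand-written conversion code that the Table → CSV → POJO path implies, here is a minimal sketch. The `CaseRecord` class, its three columns, and the `CsvToPojo` helper are hypothetical and exist only for illustration; the real object model has far more fields, and a real extract would also need quoting and escaping rules.

```java
import java.math.BigDecimal;
import java.util.ArrayList;
import java.util.List;

// Hypothetical domain object; the real object model is not shown in the slides.
final class CaseRecord {
    final long caseId;
    final String name;
    final BigDecimal amount;

    CaseRecord(long caseId, String name, BigDecimal amount) {
        this.caseId = caseId;
        this.name = name;
        this.amount = amount;
    }
}

// Hypothetical helper showing the per-column parsing a flat file extract needs.
final class CsvToPojo {
    // Parses one line of a simple comma-separated extract (no quoting or escaping).
    static CaseRecord parseLine(String line) {
        String[] cols = line.split(",", -1);
        if (cols.length != 3) {
            throw new IllegalArgumentException("expected 3 columns, got " + cols.length);
        }
        return new CaseRecord(
                Long.parseLong(cols[0].trim()),
                cols[1].trim(),
                new BigDecimal(cols[2].trim()));
    }

    // Parses every non-empty line into a record.
    static List<CaseRecord> parseAll(List<String> lines) {
        List<CaseRecord> out = new ArrayList<>();
        for (String line : lines) {
            if (!line.trim().isEmpty()) {
                out.add(parseLine(line));
            }
        }
        return out;
    }
}
```

Every additional column means another positional index and another type conversion to maintain, which is the sort of code a row-to-object mapping layer can remove.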

GPDB

Legacy Logic Implementation

Architecture Adjustments

• New requirements introduced external integrations
  – Drives desire for web services

• Desire to improve performance & simplify codebase

• Expanding business logic
  – Logic engine run as a GemFire function

GemFire

GPDB

Legacy Logic Implementation

Spring Boot (App Services)

GemFire Greenplum Connector

Context

[Diagram: Greenplum (analytical; ANSI SQL; data science, analytics & ML) alongside GemFire (transactional; custom apps via native API and REST / HTTP). Links between them are labeled "parallel configurable data load" and "transactional data write-behind".]

GemFire Greenplum Connector (G2C)

• Extension package for GemFire

• Provides simple import and export of data between GemFire regions & Greenplum tables
  – Parallel data motion leveraging Greenplum's external table interface

• Simple mapping between table rows and PdxInstance
  – Flat object-relational mapping
  – Set of predefined type conversions
  – Configurable GemFire data collocation

G2C Data Interfaces

[Diagram: Greenplum (master, segments) and GemFire (data nodes); labels: JDBC / ODBC, control logic]

GpdbService is the primary entry point for explicitly invoked data motion

1.  Import - loads the full table contents from Greenplum

2.  Export - sends region contents to Greenplum

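Sample Data Import / Export

    // 'region' is a GemFire Region that already has a gpdb:store mapping configured
    Cache cache = CacheFactory.getAnyInstance();
    GpdbService gpdb = GpdbService.getInstance(cache);
    long count;
    count = gpdb.importRegion(region); // load the table contents into the region
    count = gpdb.exportRegion(region); // send the region contents to the table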

Basic Cache Configuration

Configured via GemFire extension framework
•  1) Each region maps to a JNDI data source backed by Greenplum
•  2) Link an entity type and table
•  3) Declare a field to be used as the key
   •  Compound keys supported
•  4) Define a mapping between the table columns and object fields
   •  Default auto-configuration
   •  Optional name and column attributes for naming convention changes
   •  Class used to control type conversion
   •  Set of built-in types

    <region name="Parent">
      <region-attributes refid="PARTITION">
        <partition-attributes/>
      </region-attributes>
      <gpdb:store datasource="datasource">
        <gpdb:types>
          <gpdb:pdx name="io.pivotal...entity.Parent" table="parent">
            <gpdb:id field="id" />
            <gpdb:fields>
              <gpdb:field name="name" />
              <gpdb:field name="id" column="id" />
              <gpdb:field name="income" class="java.math.BigDecimal" />
            </gpdb:fields>
          </gpdb:pdx>
        </gpdb:types>
      </gpdb:store>
    </region>
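The name, column, and class attributes above describe a flat mapping from table columns to object fields. The sketch below illustrates that idea in plain Java, without any GemFire or connector classes. `FieldMapping` and `FlatRowMapper` are hypothetical names, and the rule that a missing column defaults to the field name is an assumption about what the default auto-configuration might do, not documented connector behavior.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical stand-in for one field entry: field name, column name, and a type converter.
final class FieldMapping {
    final String field;
    final String column;
    final Function<Object, Object> converter;

    FieldMapping(String field, String column, Function<Object, Object> converter) {
        this.field = field;
        // Assumption: when no column is given, the column shares the field's name.
        this.column = (column != null) ? column : field;
        this.converter = converter;
    }
}

// Hypothetical mapper: turns one table row (column -> value) into flat fields (field -> value).
final class FlatRowMapper {
    static Map<String, Object> toFields(Map<String, Object> row, List<FieldMapping> mappings) {
        Map<String, Object> fields = new LinkedHashMap<>();
        for (FieldMapping m : mappings) {
            Object raw = row.get(m.column);
            // Null column values pass through without conversion.
            fields.put(m.field, raw == null ? null : m.converter.apply(raw));
        }
        return fields;
    }
}
```

A mapping for the `income` field could pass `v -> new java.math.BigDecimal(v.toString())` as the converter, mirroring the `class="java.math.BigDecimal"` attribute in the configuration.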

Configuring Collocation

Parent-child foreign key relationships supported through collocation
1.  Compound key configurations result in a HashMap-based key in GemFire
2.  Provided partition resolver works with compound keys

    <region name="Child">
      <...>
        <partition-resolver>
          <class-name>io.pivotal.gemfire.gpdb.IdPartitionResolver</class-name>
          <parameter name="field">
            <string>parentId</string>
          </parameter>
      </...>
      <gpdb:id>
        <gpdb:field ref="parentId" />
        <gpdb:field ref="id" />
      </gpdb:id>
      <gpdb:fields>
        <gpdb:field name="parentId"/>
        <gpdb:field name="id" />
      </...>
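The idea behind routing a compound key on one of its fields can be sketched without GemFire: if the bucket is chosen from `parentId` alone, every child of a parent lands in the same bucket as the parent. The `CompoundKeyRouting` class, the hash-modulo routing, and the bucket count used in the usage note are illustrative assumptions; this is not the `IdPartitionResolver` implementation.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration of routing a map-based compound key on a single field.
final class CompoundKeyRouting {
    // Builds a map-based compound key with the two fields named in the configuration.
    static Map<String, Object> childKey(long parentId, long id) {
        Map<String, Object> key = new LinkedHashMap<>();
        key.put("parentId", parentId);
        key.put("id", id);
        return key;
    }

    // Chooses a bucket from one field of the key, ignoring the rest of the key.
    static int bucketFor(Map<String, Object> key, String routingField, int bucketCount) {
        Object routingValue = key.get(routingField);
        if (routingValue == null) {
            throw new IllegalArgumentException("key has no field " + routingField);
        }
        return Math.floorMod(routingValue.hashCode(), bucketCount);
    }

    // Chooses a bucket for a parent keyed by its simple id, using the same hash.
    static int bucketForParent(long parentId, int bucketCount) {
        return Math.floorMod(Long.hashCode(parentId), bucketCount);
    }
}
```

With an arbitrary bucket count of 113, `bucketFor(childKey(42L, 1L), "parentId", 113)` and `bucketFor(childKey(42L, 2L), "parentId", 113)` return the same bucket as `bucketForParent(42L, 113)`, because only `parentId` takes part in the routing.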

Configuring Automatic Synchronization

●  Data exported to Greenplum via asynchronous eventing
   ○  Time and batch size triggers available

●  Causes each GemFire member to independently interact with Greenplum
   ○  Configure GPDB resource queues accordingly

    <region name="Child">
      <...>
      <gpdb:store datasource="datasource">
        <gpdb:synchronize mode="automatic" time-interval="3000" persistent="false" />
        <gpdb:types>
          <...>
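As a rough illustration of how time and batch size triggers can interact, the sketch below buffers events and flushes when either the buffer reaches a batch size or a time interval has elapsed since the last flush. `BatchingExporter` is a hypothetical class written for this example; it is not how the connector or GemFire's asynchronous event queues are implemented, and the caller passes in the clock value so the behavior is easy to test.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical buffer that flushes on a batch-size trigger or a time-interval trigger.
final class BatchingExporter<T> {
    private final int batchSize;
    private final long intervalMillis;
    private final Consumer<List<T>> sink;
    private final List<T> buffer = new ArrayList<>();
    private long lastFlushMillis;

    BatchingExporter(int batchSize, long intervalMillis, long startMillis, Consumer<List<T>> sink) {
        if (batchSize < 1 || intervalMillis < 0) {
            throw new IllegalArgumentException("batchSize must be >= 1 and intervalMillis >= 0");
        }
        this.batchSize = batchSize;
        this.intervalMillis = intervalMillis;
        this.lastFlushMillis = startMillis;
        this.sink = sink;
    }

    // Buffers one event, then flushes if either trigger has fired.
    synchronized void add(T event, long nowMillis) {
        buffer.add(event);
        maybeFlush(nowMillis);
    }

    // Called periodically so the time trigger fires even when no new events arrive.
    synchronized void tick(long nowMillis) {
        maybeFlush(nowMillis);
    }

    private void maybeFlush(long nowMillis) {
        boolean sizeTrigger = buffer.size() >= batchSize;
        boolean timeTrigger = !buffer.isEmpty() && nowMillis - lastFlushMillis >= intervalMillis;
        if (sizeTrigger || timeTrigger) {
            sink.accept(new ArrayList<>(buffer)); // hand the sink its own copy of the batch
            buffer.clear();
            lastFlushMillis = nowMillis;
        }
    }
}
```

Because each member would run its own exporter, the number of concurrent writers on the database side grows with cluster size, which is why the slide pairs this feature with resource queue configuration.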

17 © 2016 Pivotal Software, Inc. All rights reserved.

Case Study G2C Configuration Details

• Existing required domain objects
  – Multiple many-to-one groupings

• Wide tables / objects (500+ fields)

• Data collocation configured on caseId

• Source tables wrapped in views

[Class diagram: CaseWrapper, ModelScores, Documents, PriorHistory, OtherData…, and LogicResults, each carrying caseId and further fields (…); the associations carry four "*" multiplicity labels and one "1" label]

Simple Loading – Single Table per Object

[Sequence diagram. Participants: :LoadTrigger, :GPDBService, :Region, :AsyncEventListener, :LogicEngine, results:Region. Calls shown: Import(), put(), processEvents(), process(), put()]

Complex Loading – Multiple Tables per Object

[Sequence diagram. Participants: :LoadTrigger, :MergeLoader, :GPDBService, :Region, :LogicEngine, results:Region. Calls shown: executeFunction(), Import(), put(), assemble(), process(), put(); a "par" (parallel) fragment is marked]
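One way to read the multiple-tables case is: load several tables in parallel, then assemble the rows that share a key into one object per case. The sketch below shows that pattern with `CompletableFuture` from the standard library. `MergeLoaderSketch`, the supplier-per-table design, and the `caseId` column stored as a `Long` are assumptions made for illustration; this is not the MergeLoader from the case study.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.CompletableFuture;
import java.util.function.Supplier;

// Hypothetical parallel load followed by assembly on a shared key.
final class MergeLoaderSketch {
    // Each supplier stands in for importing one table; every row carries a Long "caseId".
    // The result maps caseId -> table name -> rows of that table for the case.
    static Map<Long, Map<String, List<Map<String, Object>>>> loadAndAssemble(
            Map<String, Supplier<List<Map<String, Object>>>> tableLoaders) {
        // Start every table load in parallel on the common pool.
        Map<String, CompletableFuture<List<Map<String, Object>>>> futures = new LinkedHashMap<>();
        tableLoaders.forEach((table, loader) ->
                futures.put(table, CompletableFuture.supplyAsync(loader)));

        // Assemble: wait for each load, then group its rows under their caseId.
        Map<Long, Map<String, List<Map<String, Object>>>> cases = new TreeMap<>();
        futures.forEach((table, future) -> {
            for (Map<String, Object> row : future.join()) {
                Long caseId = (Long) row.get("caseId");
                cases.computeIfAbsent(caseId, id -> new LinkedHashMap<>())
                        .computeIfAbsent(table, t -> new ArrayList<>())
                        .add(row);
            }
        });
        return cases;
    }
}
```

The loads overlap in time, while the assembly step runs on the calling thread after each load completes, so the grouping maps need no extra synchronization.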

Impacts & Results

• Simplified implementation & code reduction

• Maintained or improved data motion rates
  – Case study CPU bound
  – Additional improvements in the backlog

• Improved system throughput

Questions?

Join the Apache Geode Community!

•  Check out: http://geode.incubator.apache.org

•  Subscribe: [email protected]

•  Download: http://geode.incubator.apache.org/releases/