
Delivering Actionable Insights on Large-scale Data Sets


About Me

Sakshi Ganeriwal

• Senior Software Engineer, Lighthouse Analytics
• 3+ years building scalable platforms with Akka Streams, Squbs, Hadoop MapReduce, Spark, Elasticsearch, Druid
• Machine Learning and Data Science Enthusiast
• Contributed to Squbs for the Spark Streaming integration


Agenda

© 2019 PayPal Inc. Confidential and proprietary.

Use cases Tracking Platform solves

Challenges faced

Solutions incorporated

Connecting pieces together


PayPal is more than a button

• Loyalty • Faster Conversion • Reduction in Cart Abandonment • Credit • Customer Acquisition • APV Lift • Invoicing • Offers

• CBT • Mobile • In-Store • Online


Velocity and Scale

• ~267 M Active users (Q4’18)
• ~164 Billion TPV (Q4’18)
• ~250 M Fast Decisions/day
• ~25 Billion Computations/day
• ~60 Billion Queries/day
• ~150 PB Data


PayPal Datasets


• Social Media • Demographics • Xoom • Disputes • Email • Application Logs • Invoice • Credit • Reversals • User Behavior Tracking • CBT • Risk • Consumer • Merchants • Partners • Venmo • Marketing • Transaction • Spending


Use Cases


Insights

Real Time

Direct: Detect Anomalies in
• Infosec • New Experiments • Site Speed • Campaign behavior • Merchant Integration

ML Model Based
• Bot • Intent • Promotions and marketing

Batch

Direct
• Funnel Analysis • Flow Comparisons • Notification • Segment comparisons • Trending reports

ML Model Based
• Build Descriptive, Predictive and Prescriptive models


What Data Do We Process?


Types of data affect choice of modeling methods and frameworks

• Explore • Enroll – New Acct • Manage via Self-Service • Transact • Resolve

Structured data…
• Numbers • Strings • Data • Geo..

… + Unstructured data
• ChatBot
• Text – emails, customer interaction records
• Voice - IVR
• Social Media Features
• Images


ML Models


Inferencing in Production Ecosystem

• Segment Model: Model on different subsets of data
• Model Ensemble: Different models on same data
• Model Composition: Model sequencing and selection

[Diagram: arrangements of "Model 1" boxes for each pattern; label: "or Segment the data"]

• Multiple models at checkpoint (Acct Takeover, Card Auth, Linkage…)
• Analysis of models’ performance (sample group)
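The three inferencing patterns above can be sketched in plain Scala. This is only an illustration: the Features and Model types and the segmented, ensemble and composed helpers are hypothetical names, not part of the platform described in the slides, and averaging is just one possible way to combine ensemble scores.

```scala
// Illustrative sketch only: all names here are hypothetical.
object ModelPatterns {
  type Features = Map[String, Double]
  type Model = Features => Double // a model maps features to a score

  // Segment model: route each record to the model trained for its segment.
  def segmented(segmentOf: Features => String,
                models: Map[String, Model],
                default: Model): Model =
    f => models.getOrElse(segmentOf(f), default)(f)

  // Model ensemble: run different models on the same data and average the scores.
  def ensemble(models: Seq[Model]): Model = {
    require(models.nonEmpty, "ensemble needs at least one model")
    f => models.map(m => m(f)).sum / models.size
  }

  // Model composition: run a first model, then select the next model from its score.
  def composed(first: Model, select: Double => Model): Model =
    f => select(first(f))(f)
}
```

For example, ModelPatterns.ensemble(Seq(a, b)) returns a new Model that scores a record with both a and b and averages the results, so patterns can be nested (an ensemble inside a segment, and so on).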


Our Platform


Acquisition Layer


Challenges & Solution

• Responsive • Scalable • Resilient • Message-Driven

[Diagram: request load spread across Servers A, B, C and D, with rates of 50, 25, 25 and 75 rps]

25 rps × 60 sec/min × 60 min/h = 90,000 rph !!


Collector & Enricher Flow

[Diagram: HTTP Flows (Req → Extract → Respond → Resp, shown twice) feed a MergeHub; the Enrichment Flow runs MergeHub → Enrich Flow → Kafka Sink]


Collector & Enricher Flow

[Diagram: HTTP Flows (Req → Extract → Respond → Resp, shown twice) feed a MergeHub; the Enrichment Flow runs MergeHub → Enrich Flow → Persistent Buffer → Kafka Sink]

Persistent Buffer: Deals with Kafka Rebalancing/Unavailability
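The idea behind a persistent buffer can be sketched without any streaming library. The FileBackedBuffer class below is a hypothetical, much simplified stand-in (one record per line in a file, single-threaded, records must not contain newlines), not the Squbs implementation. It assumes Scala 2.13 or later for scala.jdk.CollectionConverters.

```scala
import java.nio.charset.StandardCharsets.UTF_8
import java.nio.file.{Files, Path, StandardOpenOption}
import scala.jdk.CollectionConverters._

// Hypothetical, simplified stand-in for a persistent buffer: one record per line.
final class FileBackedBuffer(file: Path) {
  if (!Files.exists(file)) Files.createFile(file)

  // Persist the record first, so a crash or a broker outage cannot lose it.
  def enqueue(record: String): Unit = {
    Files.write(file, (record + "\n").getBytes(UTF_8), StandardOpenOption.APPEND)
    ()
  }

  // Try to deliver buffered records in order. `send` returns false when the
  // downstream (for example Kafka) is unavailable; undelivered records stay on disk.
  def drain(send: String => Boolean): Int = {
    val pending = Files.readAllLines(file, UTF_8).asScala.toList
    val remaining = pending.dropWhile(send)
    Files.write(file, remaining.asJava, UTF_8)
    pending.size - remaining.size // number of records delivered
  }
}
```

Upstream callers only ever see enqueue, so they can keep responding to HTTP requests while the sink is rebalancing; drain is retried until it has emptied the file.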


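import java.io.File
import akka.http.scaladsl.model.{HttpRequest, HttpResponse}
import akka.http.scaladsl.unmarshalling.Unmarshal
import akka.kafka.scaladsl.Producer
import akka.stream.scaladsl.{Flow, Keep, MergeHub}
import org.apache.kafka.clients.producer.ProducerRecord
import org.squbs.pattern.stream.PersistentBuffer

// PerpetualStream: Enrichment Flow
def streamGraph =
  MergeHub.source[Beacon]
    .via(enrichFlow)
    // Persist to disk so beacons survive Kafka rebalancing or unavailability
    .via(new PersistentBuffer[EnrichedBeacon](new File("/var/tmp/pb")))
    .map(bcn => new ProducerRecord[Array[Byte], EnrichedBeacon]("beacons", bcn))
    .toMat(Producer.plainSink(settings))(Keep.both)

// FlowDefinition: HTTP Flow
// Look up the materialized value of the enrichment stream (the MergeHub sink)
val (enrichStream, _) = matValue("/user/enrich/enrichstream")

def flow =
  Flow[HttpRequest]
    .mapAsync(1)(Unmarshal(_).to[Beacon])
    .alsoTo(enrichStream) // fan each beacon out to the enrichment stream
    .map(beacon => HttpResponse(entity = s"Received Id: ${beacon.id}"))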


Messaging & Processing


Messaging Layer
• Fault Tolerant • Low latency • Streaming support • Scalable

Processing Layer
• Filter, Transform, Cleanup • Convert to Efficient Storage format • Partition by important columns • Fault Tolerant • Streaming support

[Diagram: Input data stream → Input Table (rawRecords DataFrame)]
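The filter, transform, cleanup and partition steps of the processing layer can be illustrated with a small sketch. Plain Scala collections stand in for the streaming DataFrame here, and the RawRecord and CleanRecord shapes, the field names and the choice of (country, day) as partition columns are all assumptions made for the example. Conversion to an efficient storage format is not shown.

```scala
// Hypothetical record shapes; plain collections stand in for a streaming DataFrame.
object ProcessingSketch {
  final case class RawRecord(userId: String, page: String, country: String, ts: Long)
  final case class CleanRecord(userId: String, page: String, country: String, day: Long)

  private val MillisPerDay = 24L * 60 * 60 * 1000

  def process(raw: Seq[RawRecord]): Map[(String, Long), Seq[CleanRecord]] =
    raw
      // Filter: drop records without a user id or with an invalid timestamp.
      .filter(r => r.userId.trim.nonEmpty && r.ts > 0)
      // Transform and clean up: normalize casing and whitespace, derive the day.
      .map(r => CleanRecord(r.userId.trim, r.page.toLowerCase,
                            r.country.toUpperCase, r.ts / MillisPerDay))
      // Partition by the columns that later queries filter on.
      .groupBy(r => (r.country, r.day))
}
```

Each key of the resulting map corresponds to one output partition, which is the property that lets downstream readers skip data they do not need.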


Aggregation & Visualization


Aggregation Layer
• Support massive real time data ingestion
• Sub Second Queries
• Flexible Data Exploration
• Support Business Intelligence

Visualization Layer
• Interfaces with the Aggregation Layer
• Varied kinds of graphs
• Low Latency
• Support custom queries
• Supports custom dashboards


Connecting the Streaming Architectures together


Constellation of Microservices

PayPal’s Analytics Dashboard - Herald


Conclusion


Takeaways

• Modeling: Review business performance of DL vs simple model
• Model deployment: Choose Real-time vs Near Real-time vs Offline
• Data: Have a data store strategy with clearly defined data processing flows, and know your data
• Infrastructure: Analyze ROI for GPU inferencing (unlike training)
• DevOps: Automated deployment & config management
• Architecture: Make the pipeline modular and reactive


To be continued …


QUESTIONS?


Thank You!!
