Delivering Actionable Insights on Large-scale Data Sets
About Me
Sakshi Ganeriwal
• Senior Software Engineer, Lighthouse Analytics
• 3+ years building scalable platforms with Akka Streams, Squbs, Hadoop MapReduce, Spark, Elasticsearch, Druid
• Machine Learning and Data Science Enthusiast
• Contributed to Squbs for the Spark Streaming integration
Agenda
© 2019 PayPal Inc. Confidential and proprietary.
Use cases Tracking Platform solves
Challenges faced
Solutions incorporated
Connecting pieces together
PayPal is more than a button
Loyalty
Faster Conversion
Reduction in Cart Abandonment
Credit
Customer Acquisition
APV Lift
Invoicing
Offers
CBT
Mobile
In-Store
Online
Velocity and Scale
~267 M active users (Q4’18)
~164 billion TPV (Q4’18)
~250 M fast decisions/day
~25 billion computations/day
~60 billion queries/day
~150 PB data
PayPal Datasets
Social Media
Demographics
Xoom
Disputes
Email
Application Logs
Invoice
Credit
Reversals
User Behavior Tracking
CBT
Risk
Consumer
Merchants
Partners
Venmo
Marketing
Transaction
Spending
Use Cases
Insights: Real Time
Direct: detect anomalies in
• Infosec
• New Experiments
• Site Speed
• Campaign behavior
• Merchant Integration
ML Model Based
• Bot
• Intent
• Promotions and marketing
Insights: Batch
Direct
• Funnel Analysis
• Flow Comparisons
• Notification
• Segment comparisons
• Trending reports
ML Model Based
• Build Descriptive, Predictive and Prescriptive models
What Data Do We Process?
Types of data affect choice of modeling methods and frameworks
Explore
Enroll – New Acct
Manage via Self-Service
Transact
Resolve
Structured data: Numbers, Strings, Data, Geo..
+ Unstructured data:
ChatBot
Text – emails, customer interaction records
Voice - IVR
Social Media Features
Images
ML Models
Inferencing in Production Ecosystem
• Segment Model: model on different subsets of data
• Model Ensemble: different models on same data
• Model Composition: model sequencing and selection
[Figure: model boxes, each labeled "Model 1", with the alternative "or Segment the data"]
• Multiple models at checkpoint (Acct Takeover; Card Auth; Linkage…)
• Analysis of models’ performance (sample group)
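The three inferencing patterns named on this slide can be sketched in plain Scala. This is only an illustration: `Txn`, `Model`, the risk-score range, and the averaging rule used for the ensemble are assumptions, not details from the talk.

```scala
// Hypothetical types for illustration; none of these names come from the talk.
object ModelPatterns {
  final case class Txn(segment: String, amount: Double)
  type Model = Txn => Double // assumed to return a score in [0, 1]

  // Segment model: pick a model by data subset, falling back to a default.
  def segmentModel(bySegment: Map[String, Model], default: Model): Model =
    txn => bySegment.getOrElse(txn.segment, default)(txn)

  // Model ensemble: run several models on the same input and average the scores.
  def ensemble(models: Seq[Model]): Model = {
    require(models.nonEmpty, "ensemble needs at least one model")
    txn => models.map(m => m(txn)).sum / models.size
  }

  // Model composition: the first model's score decides whether the second runs.
  def compose(first: Model, second: Model, threshold: Double): Model =
    txn => {
      val score = first(txn)
      if (score >= threshold) second(txn) else score
    }
}
```

Because every pattern returns another `Model`, the patterns nest: an ensemble can sit behind one segment, or a composition can gate an expensive model behind a cheap one.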
Our Platform
Acquisition Layer
Challenges & Solution
• Responsive
• Scalable
• Resilient
• Message-Driven
[Figure: load in requests per second (rps) across Servers A to D; the labels read 50 rps, 25 rps, 25 rps, 75 rps]
25 rps × 60 sec/min × 60 min/h = 90,000 rph!
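The conversion above is simple enough to check in code. A minimal sketch follows; the object and method names are hypothetical.

```scala
// Hypothetical helper: converts a steady request rate to an hourly volume.
object Throughput {
  // requests/second * 60 seconds/minute * 60 minutes/hour
  def perHour(rps: Long): Long = rps * 60L * 60L
}
```

`Throughput.perHour(25)` gives 90000, matching the figure on the slide.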
Collector & Enricher Flow
[Figure: HTTP Flows (Req → Extract → Respond → Resp) feed a MergeHub; the Enrichment Flow runs MergeHub → Enrich Flow → Kafka Sink]
[Figure: the same pipeline with a Persistent Buffer between the Enrich Flow and the Kafka Sink]
Persistent Buffer: deals with Kafka rebalancing/unavailability
// PerpetualStream: Enrichment Flow
def streamGraph = MergeHub.source[Beacon]
  .via(enrichFlow)
  .via(PersistentBuffer[EnrichedBeacon](new File("/var/tmp/pb")))
  .map(bcn => new ProducerRecord[Array[Byte], EnrichedBeacon]("beacons", bcn))
  .toMat(Producer.plainSink(settings))(Keep.both)
// FlowDefinition: HTTP Flow
val (enrichStream, _) = matValue("/user/enrich/enrichstream")
def flow = Flow[HttpRequest]
  .mapAsync(1)(Unmarshal(_).to[Beacon])
  .alsoTo(enrichStream)
  .map(beacon => HttpResponse(entity = s"Received Id: ${beacon.id}"))
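To show why a persistent buffer helps when Kafka is rebalancing or unavailable, here is a minimal file-backed queue using only the Scala 2.13+ standard library. It is not the squbs `PersistentBuffer` implementation; `FileBackedBuffer` and its methods are hypothetical names, and the one-element-per-line file format is an assumption made to keep the sketch short.

```scala
import java.nio.charset.StandardCharsets.UTF_8
import java.nio.file.{Files, Path, StandardOpenOption}
import scala.jdk.CollectionConverters._

// Illustration only: a hypothetical file-backed FIFO, NOT the squbs PersistentBuffer.
final class FileBackedBuffer(file: Path) {
  if (!Files.exists(file)) Files.createFile(file)

  // Write the element to disk before returning, so a restart does not lose it.
  def enqueue(element: String): Unit = {
    require(!element.contains("\n"), "elements must be single-line")
    Files.write(file, (element + "\n").getBytes(UTF_8), StandardOpenOption.APPEND)
  }

  // Elements still waiting on disk, oldest first.
  def pending: List[String] = Files.readAllLines(file, UTF_8).asScala.toList

  // Deliver in order, stop at the first failure, keep the rest on disk.
  // Returns the number of elements delivered.
  def drain(deliver: String => Boolean): Int = {
    val all = pending
    val remaining = all.dropWhile(deliver)
    Files.write(file, remaining.map(_ + "\n").mkString.getBytes(UTF_8))
    all.size - remaining.size
  }
}
```

While the downstream sink is unavailable, `drain` delivers nothing and the elements stay on disk; once the sink recovers, a later `drain` picks up where the last one stopped. A crash between delivery and the file rewrite would redeliver elements, so this sketch is at-least-once at best.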
Messaging & Processing
Messaging Layer
• Fault Tolerant
• Low latency
• Streaming support
• Scalable
[Figure: input data stream mapped to an input table (rawRecords DataFrame)]
Processing Layer
• Filter, Transform, Cleanup
• Convert to Efficient Storage format
• Partition by important columns
• Fault Tolerant
• Streaming support
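The filter, cleanup, and partition steps of the processing layer can be sketched on an in-memory collection with Scala 2.13+. This stands in for the streaming DataFrame and is not the platform's actual job; the record fields and the choice of `country` as the partition column are assumptions.

```scala
object ProcessingSketch {
  // Hypothetical record shapes; the talk does not show its schema.
  final case class RawRecord(id: String, country: String, amountCents: String)
  final case class CleanRecord(id: String, country: String, amountCents: Long)

  // Filter and clean up: drop records with a blank id or a non-numeric amount,
  // trim whitespace, and normalise the partition column to upper case.
  def clean(raw: RawRecord): Option[CleanRecord] =
    for {
      amount <- raw.amountCents.trim.toLongOption
      if raw.id.trim.nonEmpty
    } yield CleanRecord(raw.id.trim, raw.country.trim.toUpperCase, amount)

  // Partition by an important column so later reads can skip unrelated data.
  def partitionByCountry(records: Seq[RawRecord]): Map[String, Seq[CleanRecord]] =
    records.flatMap(r => clean(r)).groupBy(_.country)
}
```

In a real streaming job the same three steps would typically be expressed as DataFrame operations and a partitioned columnar write; the collection version only shows the shape of the logic.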
Aggregation & Visualization
Aggregation Layer
• Support massive real time data ingestion
• Sub Second Queries
• Flexible Data Exploration
• Support Business Intelligence
Visualization Layer
• Interfaces with the Aggregation Layer
• Varied kinds of graphs
• Low Latency
• Support custom queries
• Supports custom dashboards
Connecting the Streaming Architectures together
Constellation of Microservices
PayPal’s Analytics Dashboard - Herald
Conclusion
Takeaways
• Modeling: Review business performance of DL vs simple model
• Model deployment: Choose Real-time vs Near Real-time vs Offline
• Data: Have a data store strategy with clearly defined data processing flows, and know your data
• Infrastructure: Analyze ROI for GPU inferencing (unlike training)
• DevOps: Automated deployment & config management
• Architecture: Make the pipeline modular and reactive
To be continued …
QUESTIONS?
Thank You!! [email protected]