A little history
Xoom.com
• Digital remittance
• Founded in 2001
• Acquired bluekite.com in 2014
• Acquired by PayPal in 2016
A little history (translated)
Xoom.com
• Digital remittance → Highly regulated environment
• Founded in 2001 → 16 years of code and data
• Acquired bluekite.com in 2014 → Polyglot code and persistence
• Acquired by PayPal in 2016 → New rules
Break up the monolith(s)
Throwing down the gauntlet
• Decouple teams
• Reduce time to build and deploy
• Understand our resource needs
• Scale appropriately
Challenges and risks
Microservices to the rescue
• Programming paradigms and idioms
• Service discovery
• Monitoring
• Performance
• Infrastructure as code
• Build and deployment pipeline
• Data ownership
Programming paradigms and idioms
• Network operations
  • Circuit breakers
  • Aggressive timeouts
  • Retries
  • Throttles
• API designs
  • RPC vs REST
  • Batch operations
  • Response code granularity
• Contracts
  • Packaging
  • Metadata
  • Uniform management
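The network-operation idioms above can be sketched in a few lines. This is a minimal illustration, not the implementation used at Xoom: the names `CircuitBreaker`, `CircuitOpenError`, and `call_with_retries` are hypothetical, and the aggressive timeout is assumed to be enforced inside the wrapped call (for example through the HTTP client's own timeout setting).

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without attempting it."""


class CircuitBreaker:
    """Opens after `max_failures` consecutive failures, then allows a trial
    call once `reset_after` seconds have passed (the half-open state)."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("circuit open; failing fast")
            # Otherwise we are half-open: let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            # A failed trial call, or too many failures, (re)opens the circuit.
            if self.opened_at is not None or self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        # Success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result


def call_with_retries(breaker, func, attempts=3, backoff=0.0, sleep=time.sleep):
    """Retry a failing call with exponential backoff, but stop immediately
    if the breaker is open."""
    last_error = None
    for attempt in range(attempts):
        try:
            return breaker.call(func)
        except CircuitOpenError:
            raise
        except Exception as exc:
            last_error = exc
            if attempt < attempts - 1:
                sleep(backoff * (2 ** attempt))
    raise last_error
```

Retries and breakers pull in opposite directions: retries add load to a struggling dependency, while the breaker sheds it. Routing every retry through the breaker, as this sketch does, keeps the retry loop from hammering a service that is already failing.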
Service discovery
• Custom, local, layer-7 load balancers
• ZooKeeper back end
• Apache Curator
• Registration, health checks, and routing
• Service Portal
• Integrating with linkerd.io
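Registration, health checks, and routing can be illustrated with a small in-memory model. This is a sketch only: `ServiceRegistry` is a hypothetical stand-in for the ZooKeeper-backed store, the addresses are made up, and round-robin is an assumed routing policy rather than the one the service-proxy actually uses.

```python
import itertools


class ServiceRegistry:
    """In-memory stand-in for a ZooKeeper-backed registry (hypothetical)."""

    def __init__(self):
        self._instances = {}  # service name -> {address: healthy flag}
        self._cursors = {}    # service name -> round-robin counter

    def register(self, service, address):
        """Add an instance; new instances start out healthy."""
        self._instances.setdefault(service, {})[address] = True

    def deregister(self, service, address):
        """Remove an instance, ignoring unknown services or addresses."""
        self._instances.get(service, {}).pop(address, None)

    def report_health(self, service, address, healthy):
        """Record the result of a health check for a registered instance."""
        if address in self._instances.get(service, {}):
            self._instances[service][address] = healthy

    def route(self, service):
        """Return the next healthy address for `service`, round-robin."""
        instances = self._instances.get(service, {})
        healthy = sorted(addr for addr, ok in instances.items() if ok)
        if not healthy:
            raise LookupError(f"no healthy instance of {service!r}")
        cursor = self._cursors.setdefault(service, itertools.count())
        return healthy[next(cursor) % len(healthy)]
```

A local proxy built this way lets applications address a service by name and leaves instance selection to the proxy, so unhealthy instances drop out of rotation without any client change.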
The service-proxy solution
[Diagram: ZooKeeper, Host, Service-proxy, App A, App B]
Monitoring
• Define required measurements
  • persistence operations
  • remote calls
  • service endpoints
  • third-party service endpoints
• Define metric types
  • gauges
  • counters
  • histograms
• Standard naming scheme
• Self-service dashboards
• Time series explosion
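A standard naming scheme and the time series explosion can both be made concrete. The sketch below assumes a dotted `<service>.<component>.<operation>.<type>` convention, which is an illustration and not the scheme from the talk; `metric_name` and `series_count` are hypothetical helpers.

```python
from math import prod

# The three metric types named on the slide.
METRIC_TYPES = {"gauge", "counter", "histogram"}


def metric_name(service, component, operation, metric_type):
    """Build a dotted metric name under the assumed scheme
    <service>.<component>.<operation>.<type>."""
    if metric_type not in METRIC_TYPES:
        raise ValueError(f"unknown metric type: {metric_type!r}")
    parts = [service, component, operation, metric_type]
    return ".".join(p.strip().lower().replace(" ", "_") for p in parts)


def series_count(tag_values):
    """Upper bound on distinct time series for one measurement:
    the product of the number of values each tag can take."""
    return prod(len(values) for values in tag_values.values())
```

Because the series count is a product, every new tag multiplies storage and query cost. For example, 40 hosts, 25 endpoints, and 5 status codes already give an upper bound of 5,000 series for a single measurement.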
Grafana and InfluxDB
Performance
• Additional network latency has been offset by:
  • Reduced contention on datastores
  • Limiting the scope of database transactions
  • Optimization through observability
• Throughput has improved dramatically
• Latency distribution is wider
• Latency-sensitive APIs are deployed nearby
Throughput and response latency
Infrastructure as code
• TDD isn’t just for applications
• Terraform and Packer for host provisioning on AWS and vSphere
• Puppet and Ansible acceptance testing using Beaker
• Network gear
• Standardize app packaging
• Docker
• Contracts for deployment
• Application control plane
Build and deploy pipeline
• Git-flow
  • Branch per feature
• Docker-flow
  • Container per branch
• Seed jobs
  • Build job per branch
• Automated and self-service deployments
  • Dev and QA teams can choose branches to deploy and test
• Fidelity of environments
  • Environment fidelity ∝ automation success
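A container per branch needs a predictable mapping from branch names to image tags, because Git branch names such as `feature/foo` contain characters Docker tags do not allow. The sketch below is one way to do that mapping; `docker_tag_for_branch`, `image_ref`, and the repository name in the test are hypothetical and not taken from the pipeline described here.

```python
import re


def docker_tag_for_branch(branch):
    """Derive a valid Docker tag from a Git branch name.

    Docker tags may contain letters, digits, underscores, periods, and
    hyphens, may not start with a period or hyphen, and are limited to
    128 characters.
    """
    tag = re.sub(r"[^A-Za-z0-9_.-]", "-", branch)
    tag = tag.lstrip(".-")
    return (tag or "unnamed")[:128]


def image_ref(repository, branch):
    """Full image reference for the container built from `branch`."""
    return f"{repository}:{docker_tag_for_branch(branch)}"
```

With a deterministic mapping, the build job for a branch and the self-service deployment tooling can each compute the image name independently, with no shared lookup table.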
Data ownership
• Hard problem
• Start eliminating cross-domain joins now
• Two years on, we are just now migrating the last auth-server client from tables to APIs
• Analytics becomes more complicated
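Eliminating a cross-domain join usually means composing the result in application code: read the rows your service owns, then ask the owning service for the rest. The sketch below assumes a hypothetical batch endpoint, represented by the `fetch_users` callable, and made-up field names (`user_id`, `email`).

```python
def transfers_with_user_email(transfers, fetch_users):
    """Attach each sender's email to transfer records without a SQL join.

    `transfers` is a list of dicts owned by this service. `fetch_users` is
    a callable standing in for the other domain's batch API (hypothetical):
    it takes a list of user ids and returns {user_id: {"email": ...}}.
    """
    # Deduplicate ids so the other service is asked once per user.
    user_ids = sorted({t["user_id"] for t in transfers})
    users = fetch_users(user_ids)  # one batch call, not one call per row
    return [
        {**t, "email": users.get(t["user_id"], {}).get("email")}
        for t in transfers
    ]
```

The batch call is what keeps this affordable: a per-row lookup would turn one join into N network round trips. Users the other service does not return simply get `None`, where an inner join would have silently dropped the row.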
Current status
• ~100 distinct microservices across 3 production data centers
• Most new features are developed as microservices
• Monoliths still exist, but are being chipped away at
Lessons learned
• Measure everything, and be prepared to scale your monitoring system
• Application packaging contracts and delivery pipelines are mandatory
• Staff a tooling team for build, test, and deployment automation
• Enroll your network operations team
• The infrastructure and culture we built in order to move to microservices have paid off
• Elimination of the monoliths isn’t that important