tomer-gabel
Service Scheduling
Topology: service→host mapping
Server inventory
Service catalogue
Formally,
“scheduling”
Service Scheduling
• A hard problem!
• Multiple dimensions:
– Resource utilization (disk space, I/O, RAM, network, power…)
– Resource availability
– Failover (physical server, rack, row…)
– Custom constraints (zoning, e.g. PCI compliance)
Service Scheduling
• The middle ground:
– Naïve automatic scheduler
– Human-configured overrides for zoning, optimization
• Easy but limited scale
– A few hundred servers
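The "middle ground" above can be sketched roughly as follows. This is a minimal illustration, not the scheduler the talk describes: the class `NaiveScheduler`, the "fewest services so far" rule and the shape of the override map are all assumptions made for the example.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: naive least-loaded placement plus human-configured pins. */
final class NaiveScheduler {
    /**
     * @param services  service names, in the order they should be placed
     * @param hosts     available host names (must be non-empty)
     * @param overrides manual pins (service -> host), e.g. for zoning
     * @return the resulting topology, i.e. a service -> host mapping
     */
    static Map<String, String> schedule(List<String> services,
                                        List<String> hosts,
                                        Map<String, String> overrides) {
        if (hosts.isEmpty()) {
            throw new IllegalArgumentException("no hosts");
        }
        // Number of services placed on each host so far.
        Map<String, Integer> load = new LinkedHashMap<>();
        for (String host : hosts) {
            load.put(host, 0);
        }

        Map<String, String> topology = new LinkedHashMap<>();
        for (String service : services) {
            String host = overrides.get(service);
            if (host == null) {
                // Naive rule (an assumption): pick the host with the fewest
                // services so far; ties go to the host listed first.
                host = hosts.get(0);
                for (String candidate : hosts) {
                    if (load.get(candidate) < load.get(host)) {
                        host = candidate;
                    }
                }
            } else if (!load.containsKey(host)) {
                throw new IllegalArgumentException("override names unknown host: " + host);
            }
            topology.put(service, host);
            load.merge(host, 1, Integer::sum);
        }
        return topology;
    }
}
```

A rule this simple ignores every dimension listed earlier (disk, RAM, failover domains), which is why the overrides carry the zoning and optimization decisions.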
In practice
• Static topology
– Managed with Frying Pan
– Exported to Chef
– Deployed via configuration files
• Live registry in ZooKeeper
– Deployment only
– … for now
Protocol
• Style
– RPC
– Message passing
• Transport
– TCP
– HTTP
• Serialization
– JSON
– Protocol Buffers
– Thrift
– Avro
Load balancing
• Centralized
– Simple
– Limited flexibility
– Limited scale
– Thin implementation, highly portable
– Suitable for static topologies
• Distributed
– Highly scalable
– Flexible
– Fully dynamic
– Fat implementation, difficult to port
• Quasi-distributed
– e.g. Synapse
– Best of both worlds?
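To make the "distributed" option concrete: each client chooses an endpoint itself instead of going through a central balancer. The sketch below is a minimal round-robin picker under assumed names (`RoundRobinBalancer`, `pick`); it holds a fixed endpoint list, whereas a fully dynamic client would refresh that list from a registry.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical client-side balancer: rotates through a fixed endpoint list. */
final class RoundRobinBalancer {
    private final List<String> endpoints;
    private final AtomicInteger next = new AtomicInteger();

    RoundRobinBalancer(List<String> endpoints) {
        if (endpoints.isEmpty()) {
            throw new IllegalArgumentException("no endpoints");
        }
        this.endpoints = List.copyOf(endpoints);
    }

    /** Returns the next endpoint; safe to call from multiple threads. */
    String pick() {
        // floorMod keeps the index non-negative even after the counter overflows.
        int index = Math.floorMod(next.getAndIncrement(), endpoints.size());
        return endpoints.get(index);
    }
}
```

Even this tiny version hints at the portability cost noted above: every client language needs its own copy of the logic.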
[Diagram: Frying Pan, Chef, Nginx]
To our shame
• There’s always IDL.
• Informal
– Usually ad-hoc documentation
• Formal
– Swagger, Apiary etc.
– ProtoBuf, Thrift, Avro
– WSDL, god forbid!
• … or
– Ad-hoc
public interface SiteMembersService {
    SiteMemberDto getMemberById(
        Guid<SiteMember> memberId,
        UserGuid userId);

    SiteMemberDto getMemberOrOwnerById(
        Guid<SiteMember> memberId,
        Guid<SMCollection> collectionId);

    SiteMemberDto getMemberDtoByEmailAndCollectionId(
        String email,
        Guid<SMCollection> collectionId);

    List<SiteMemberDto> listMembersByCollectionId(
        Guid<SMCollection> collectionId);
}
In Detail
• Java interfaces?
+ Ridiculously simple
+ Lend themselves well to RPC
– Coupled to JVM
• JSON serialization
+ Jackson-based
+ Custom, extensible mapping
– Reflection-based
• Server stack (JVM)
– Jetty
– Spring + Spring MVC
– Custom handler
• Client stack (JVM)
– Spring
– Proxy classes generated at runtime
– AsyncHttpClient
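As a rough illustration of "proxy classes generated at runtime", the JDK's `java.lang.reflect.Proxy` can turn any call on a Java interface into a request description. This is only a sketch of the general technique, not the framework described in the talk: `GreetingService`, `RpcProxySketch` and the request string format are invented for the example, and the network call and JSON mapping are stubbed out.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Proxy;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical service interface, used only for this illustration. */
interface GreetingService {
    String greet(String name);
}

final class RpcProxySketch {
    /** Records what a real client would serialize and send over HTTP. */
    static final List<String> sentRequests = new ArrayList<>();

    @SuppressWarnings("unchecked")
    static <T> T clientFor(Class<T> api) {
        InvocationHandler handler = (proxy, method, args) -> {
            if (method.getDeclaringClass() == Object.class) {
                // toString/hashCode/equals are answered locally, not remotely.
                switch (method.getName()) {
                    case "toString": return "proxy for " + api.getName();
                    case "hashCode": return System.identityHashCode(proxy);
                    default: return proxy == args[0]; // equals
                }
            }
            // Stand-in for "serialize the arguments and POST them to the service".
            sentRequests.add(api.getSimpleName() + "." + method.getName()
                    + Arrays.toString(args));
            // Stubbed reply; only valid for String-returning methods. A real
            // client would deserialize the response into the declared return type.
            return "stubbed response";
        };
        return (T) Proxy.newProxyInstance(
                api.getClassLoader(), new Class<?>[] {api}, handler);
    }
}
```

Calling `RpcProxySketch.clientFor(GreetingService.class).greet("Ada")` records one request and returns the stub. The caller only ever sees the interface, which is what makes the approach simple and also what couples it to the JVM.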
Cascade Failures
• What is a cascade failure?
• Mitigations
– Bulkheading
– Circuit breakers
– Load shedding
• We don’t do any of that.
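For readers unfamiliar with the mitigations listed, here is a minimal circuit breaker sketch. It is a generic illustration of the pattern, not something taken from the system described here. The class name, the consecutive-failure threshold and the injected clock are assumptions, and holding a lock for the whole call is a simplification that would serialize real remote calls.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

/** Hypothetical count-based circuit breaker with a fixed cool-down. */
final class CircuitBreaker {
    private final int failureThreshold;
    private final long openMillis;
    private final LongSupplier clock; // injected so the cool-down is testable
    private int consecutiveFailures = 0;
    private long openedAt = 0;
    private boolean open = false;

    CircuitBreaker(int failureThreshold, long openMillis, LongSupplier clock) {
        this.failureThreshold = failureThreshold;
        this.openMillis = openMillis;
        this.clock = clock;
    }

    synchronized <T> T call(Supplier<T> remoteCall, T fallback) {
        if (open && clock.getAsLong() - openedAt < openMillis) {
            // Open: fail fast without touching the struggling dependency.
            return fallback;
        }
        // Closed, or cool-down elapsed: let the call through (a trial call
        // in the latter case, often described as "half-open").
        try {
            T result = remoteCall.get();
            consecutiveFailures = 0;
            open = false;
            return result;
        } catch (RuntimeException e) {
            consecutiveFailures++;
            if (open || consecutiveFailures >= failureThreshold) {
                // Trip the breaker, or re-open it after a failed trial call.
                open = true;
                openedAt = clock.getAsLong();
            }
            return fallback;
        }
    }
}
```

The point of failing fast is that a slow or dead downstream service stops consuming the caller's threads, which is how one failure is kept from cascading.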
Does it go?
• Short answer: yes.
• Battle-tested
– Evolving since 2010.
– 200 services in production.
• Known quantity
– Easy to operate
– Performs well enough
– Most problems have easy workarounds
Not all is well, though
• Polyglot development
– Custom client stack
– Expensive to port!
• Implicit state
– Transparently handled by the framework
– Thread local storage
– Hard to go async!
[Diagram: Client, Proxy, Service A, Service B; implicit state carried between them: session info, transaction ID, A/B experiment]
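Why thread-local implicit state makes it "hard to go async" can be shown in a few lines: a value stored in a `ThreadLocal` is invisible to code running on another thread unless it is captured and passed along explicitly. The names below (`ImplicitStateSketch`, `TRANSACTION_ID`, the value `tx-42`) are illustrative assumptions.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

final class ImplicitStateSketch {
    /** Hypothetical per-request state, e.g. a transaction ID set by a framework. */
    static final ThreadLocal<String> TRANSACTION_ID = new ThreadLocal<>();

    /**
     * Returns what a worker thread observes: element 0 read implicitly from
     * the thread-local, element 1 captured on the calling thread and passed
     * to the worker explicitly.
     */
    static String[] observeOnWorker() {
        TRANSACTION_ID.set("tx-42");
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            // Runs on the worker thread, which has its own (empty) slot.
            Callable<String> readImplicitly = () -> TRANSACTION_ID.get();
            String implicit = pool.submit(readImplicitly).get();

            // Capture on the calling thread, then hand the value over.
            String captured = TRANSACTION_ID.get();
            Callable<String> readExplicitly = () -> captured;
            String explicit = pool.submit(readExplicitly).get();

            return new String[] {implicit, explicit};
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("worker task failed", e);
        } finally {
            TRANSACTION_ID.remove();
            pool.shutdown();
        }
    }
}
```

The implicit read comes back `null` on the worker thread, so any framework that hides session or transaction data in thread-locals has to re-plumb it at every asynchronous boundary.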
Codebase modeling
• A product comprises multiple services
• Services have dependencies
– Creating a DAG
– Tends to cluster around domains
• Org structure reflects the clustering (Conway)
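Because service dependencies form a DAG, a valid build or deployment order can be derived with a topological sort. The sketch below uses a depth-first traversal. The class and method names are assumptions, and nothing here is claimed to match the tooling mentioned in the talk.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical helper: orders services so that dependencies come first. */
final class DependencyOrder {
    /** @param dependsOn service -> the services it depends on */
    static List<String> buildOrder(Map<String, List<String>> dependsOn) {
        List<String> order = new ArrayList<>();
        Set<String> done = new HashSet<>();
        Set<String> inProgress = new HashSet<>();
        for (String service : dependsOn.keySet()) {
            visit(service, dependsOn, done, inProgress, order);
        }
        return order;
    }

    private static void visit(String service,
                              Map<String, List<String>> dependsOn,
                              Set<String> done,
                              Set<String> inProgress,
                              List<String> order) {
        if (done.contains(service)) {
            return;
        }
        if (!inProgress.add(service)) {
            // Revisiting a node that is still on the stack means a cycle,
            // i.e. the graph is not a DAG.
            throw new IllegalStateException("dependency cycle at " + service);
        }
        for (String dependency : dependsOn.getOrDefault(service, List.of())) {
            visit(dependency, dependsOn, done, inProgress, order);
        }
        inProgress.remove(service);
        done.add(service);
        order.add(service); // appended only after all of its dependencies
    }
}
```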
Codebase modeling
Repository-per-service
• Small repositories
• Artifacts built independently
• Binary dependencies
• Requires specialized tools to manage:
– Versions
– Build dependencies
Mono-repo
• Repository contains everything
• Code is built atomically
• Source dependencies
• Requires a specialized build tool
At Wix
• One repo per domain
• Dependencies:
– Declared in POMs
– Version management via custom plugin
– Builds managed by custom tool*
• Custom dashboard, “Wix Lifecycle”
* Lifecycle – Dependency Management Algorithm
Version management
[INFO] QuickRelease /home/builduser/agent01/work/d9922a1c87aee4bb bf1bc8bcfb2eccebc4268651c5f19faa689be6e4
[08:10:55][INFO] Adding tag RC;.;1.20.0
[08:10:56][INFO] Tag RC;.;1.20.0 added successfully
[08:10:56][INFO] Working on onboarding-server-web
[08:10:56][INFO] onboarding-server-web-1.19.0-SNAPSHOT jar deployable copied
[08:10:56][INFO] onboarding-server-web-1.19.0-SNAPSHOT jar sources copied
[08:10:56][INFO] onboarding-server-web-1.19.0-SNAPSHOT jar copied
[08:10:56][INFO] onboarding-server-web-1.19.0-SNAPSHOT jar tests copied
[08:10:56][INFO] onboarding-server-web pom deployed
[08:10:57][INFO] Deploying artifacts to release artifacts repository
[08:10:57][INFO] Deploying onboarding-server-web to RELEASE
[08:10:57][INFO] pushing new pom
[08:10:59] 2016-02-22 08:10:39 [INFO ] /usr/bin/git push --tag origin master exitValue = 0
• All artifacts share a common parent
– Master list of versions
• Manually-triggered release builds
– Custom release plugin
– Increments version
– Updates master
– Pushes changes to git
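The "increments version" step might look something like the following. This is a guess at the policy rather than the actual release plugin: it assumes three-part `major.minor.patch` versions and that a release bumps the minor component, which is one reading of the log above (a `1.19.0-SNAPSHOT` artifact alongside a `1.20.0` tag). `VersionBump` and `nextMinor` are hypothetical names.

```java
/** Hypothetical version helper; the bump policy is an assumption. */
final class VersionBump {
    private static final String SNAPSHOT = "-SNAPSHOT";

    /** Drops any -SNAPSHOT suffix, bumps the minor component, resets patch to 0. */
    static String nextMinor(String version) {
        String base = version.endsWith(SNAPSHOT)
                ? version.substring(0, version.length() - SNAPSHOT.length())
                : version;
        String[] parts = base.split("\\.");
        if (parts.length != 3) {
            throw new IllegalArgumentException("expected major.minor.patch: " + version);
        }
        int major = Integer.parseInt(parts[0]);
        int minor = Integer.parseInt(parts[1]);
        return major + "." + (minor + 1) + ".0";
    }
}
```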
Health
• Host monitoring
– Nagios Sensu alerts
– Usual host metrics
– Health-check endpoint in framework
• End-to-end
– Pingdom
• Business
– Custom BI toolchain
Instrumentation
• Metrics
– Codahale Metrics
– Reporting to Graphite and Anodot
– Implicit collection (e.g. RPC)
– APIs for custom metrics
• Alerts via Anodot
• Custom New Relic error reporting
Debugging
• Logs
– Good old Logback
– No centralized aggregation
– Not particularly useful
• Feature toggle overrides
• Distributed tracing
WE’RE DONE HERE!
… AND YES, WE’RE HIRING :-)
Thank you for listening
@tomerg
http://il.linkedin.com/in/tomergabel
Wix Engineering blog:
http://engineering.wix.com