Wai Keen Woon, CTO CDN Division OnApp Malaysia, gave an interesting overview of what the Puppet architecture at OnApp looks like. The CDN division at OnApp is a large provider of CDN services, and as such makes a very interesting candidate for a case study.
Puppet Deployment at OnApp
Wai Keen Woon CTO, CDN Division [email protected]
WARNING
<ObligatoryPlug>
About OnApp
OnApp launched July 1st 2010
Backed by LDC
The leading cloud management software for hosts
The instant global CDN for hosts
Deep industry knowledge
100+ employees in US, EU, APAC
A leading provider of software for hosts
Vital Statistics
1 in 3 public clouds
cloud deployments
global clients
800+
300+
Customer Stories
Instant CDN that gives you…
- 75+ PoPs
- Get paid for idle capacity
- Low cost, high margin
OK.
</ObligatoryPlug>
Systems Overview
- Core & Development
  - ~20 physical servers
  - ~200 VMs
  - Homogeneous environment: 64-bit Debian everywhere
  - Mainly use OpenVZ and KVM for virtualization
- CDN Delivery Edge Servers
  - 100+ servers in 60+ cities
  - Running on the OnApp platform, on either Xen or KVM
- Puppet has been integral to our setup since day 1
Why Puppet?
- More reliable configuration of servers. Less need to “run ssh in a for loop” and miss something.
- Self-documenting: our manifests are almost able to bootstrap an empty server.
  - Our manifests can't bootstrap an empty environment yet.
  - Limitation: manifests describe what/where/how something is set up, but not *why*.
- Nice syntax, easy on the eyes. Comprehensive built-in resource types. Able to fall back to dumb ways of doing things if required (use file, exec et al.).
Core Infra Environments
- Systems manifest describes everything.
- Three environments: alpha, beta, omega.
What Would OnApp Setup...
- Essential utilities (tcpdump, less, vim, etc.).
- Users and their SSH keys, sudoers.
  - Developer's shell => /bin/false if production.
- Base firewall rules.
- Nagios agent.
- Uniform locale settings: UTC timezone, en_US.UTF-8 locale.
- SMTP that smarthosts to our central relay.
- Syslogd for remote logs to the central logging server.
- Finally, the services.
Core Infra Manifest Excerpt
    # Env config definitions
    $portal_domain  = "portal.alpha.onappcdn.com"
    $portal_db_host = "portal.alpha.onappcdn.com"
    $portal_db_user = "aflexi_webportal"
    $auth_nameservers = {
      "ns1" => "175.143.72.214",
      "ns2" => "175.143.72.214",
      "ns3" => "175.143.72.214",
      "ns4" => "175.143.72.214",
    }
    $monitoring_host_server = [
      "monitoring.alpha.onappcdn.com",
      "dns.alpha.onappcdn.com",
    ]

    # Node definition
    node "monitoring.alpha.onappcdn.com" {
      include base
      include s_db_monitoring
      include s_monitoring_server
      include collectd::rrdcached
      include s_munin
      include s_monitoring_alerts
      include s_monitoring_graph
    }

    # Class definition
    class collectd::rrdcached {
      package { "rrdcached":
        ensure => latest,
      }
      service { "rrdcached":
        ensure => running,
      }
    }
The excerpt is made up of three kinds of definitions: env config definitions (the $variables), node definitions, and class definitions.
Package Repo Integration
- Jenkins builds debs of our code and stores them in an apt repository for the environment they are built for.
- Puppet keeps packages up-to-date (ensure => latest) and restarts services on package upgrades.

    puppet-agent[25431]: (/Stage[main]/Debian/Exec[apt-get-update]/returns) executed successfully
    puppet-agent[25431]: (/Stage[main]/Python::Aflexi::Mq/Package[python-aflexi-mqcore]/ensure) ensure changed '7065.20120530.113915-1' to '7066.20120604.090916-1'
    puppet-agent[25431]: (/Stage[main]/S_mq/Service[worker-rabbitmq]) Triggered 'refresh' from 1 events
    puppet-agent[25431]: Finished catalog run in 16.08 seconds
Nagios Integration
- Plugs into Nagios using “exported resources”.
Nagios Integration
Server manifest (exports the service that is checked):

    @@nagios_service { "check_load_$fqdn":
      check_command       => "check_nrpe_1arg!check_load",
      use                 => "generic-service",
      host_name           => $fqdn,
      service_description => "check_load",
      tag                 => $domain,
    }

Nagios service manifest (collects the resources to check):

    Nagios_service <<| tag == "onappcdn.com" |>> {
      target  => "/etc/n3/conf.d/services.cfg",
      require => Package["nagios3"],
      notify  => Exec["reload-nagios"],
    }
Nagios Integration
- What's logged on the Nagios server when puppet runs?

    puppet-agent[15293]: (/Stage[main]/Nagios::Monitor_private/Nagios_host[hrm.onappcdn.com]/ensure) created
    puppet-agent[15293]: (/Stage[main]/Nagios::Monitor_private/Nagios_service[check_load_hrm.onappcdn.com]/ensure) created
    nagios3: Nagios 3.2.1 starting... (PID=5601)
    puppet-agent[15293]: (/Stage[main]/Nagios::Base/Exec[reload-nagios]) Triggered 'refresh' from 8 events
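The export/collect flow above can be modelled in a few lines of Python. This is only a conceptual sketch, not how Puppet implements it: `ResourceStore`, `ExportedResource` and `export_load_check` are hypothetical names, and the store stands in for the storedconfigs database. Each monitored server exports a check tagged with its domain, and the Nagios server later collects everything carrying the tag it asks for.

```python
from dataclasses import dataclass, field


@dataclass
class ExportedResource:
    """One exported resource, as a node would declare it with '@@type { title: ... }'."""
    rtype: str
    title: str
    exported_by: str
    tag: str
    params: dict = field(default_factory=dict)


class ResourceStore:
    """Hypothetical stand-in for the database that holds exported resources."""

    def __init__(self):
        self._resources = {}

    def export(self, resource):
        # Keyed by (type, title): exporting again on a later run overwrites the entry.
        self._resources[(resource.rtype, resource.title)] = resource

    def collect(self, rtype, tag):
        # Rough equivalent of: Type <<| tag == "..." |>>
        matches = (
            r for r in self._resources.values()
            if r.rtype == rtype and r.tag == tag
        )
        return sorted(matches, key=lambda r: r.title)


def export_load_check(store, fqdn, domain):
    """What each monitored server contributes during its own run."""
    store.export(ExportedResource(
        rtype="nagios_service",
        title=f"check_load_{fqdn}",
        exported_by=fqdn,
        tag=domain,
        params={"check_command": "check_nrpe_1arg!check_load", "host_name": fqdn},
    ))


store = ResourceStore()
export_load_check(store, "hrm.onappcdn.com", "onappcdn.com")
export_load_check(store, "dns.alpha.onappcdn.com", "alpha.onappcdn.com")

# The Nagios server collects only the resources tagged with the domain it asks for.
collected = store.collect("nagios_service", "onappcdn.com")
print([r.title for r in collected])
```

Filtering on the tag is what keeps one environment's checks out of another environment's Nagios configuration.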
Monitoring Puppet Itself
- Lots of tools/dashboards out there to achieve this.
- For us: “grep -i err */syslog”. Dumb, but works until we need to Really Address it.
- Common issues:
  - Puppet gets “stuck”, and only one puppet instance can run at any one time.
  - Manifest errors: syntax, merge issues.
  - Badly written manifests (vague dependencies, conditions/commands not robust enough).
  - An important dependent resource failing (e.g. apt-get install fails due to a dpkg-configure error).
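A slightly less dumb variant of the grep, as a minimal Python sketch. It is not something from the talk: `puppet_errors` is a hypothetical helper, and the failing log line in `sample` is made up for illustration (the other lines are taken from the log excerpts in this deck). Unlike the plain grep, it only looks at lines written by puppet-agent.

```python
import re

# Matches syslog lines written by puppet-agent, e.g. "puppet-agent[25431]: ..."
PUPPET_LINE = re.compile(r"puppet-agent\[(?P<pid>\d+)\]: (?P<message>.*)", re.IGNORECASE)


def puppet_errors(lines):
    """Yield (pid, message) for puppet-agent lines whose message contains 'err'."""
    for line in lines:
        match = PUPPET_LINE.search(line)
        if match and "err" in match.group("message").lower():
            yield int(match.group("pid")), match.group("message")


sample = [
    "puppet-agent[25431]: Finished catalog run in 16.08 seconds",
    "puppet-agent[25431]: (/Stage[main]/S_mq/Service[worker-rabbitmq]) Triggered 'refresh' from 1 events",
    # Illustrative failure line, invented for this example:
    "puppet-agent[1234]: Could not retrieve catalog from remote server: Error 400 on SERVER",
    "nagios3: Nagios 3.2.1 starting... (PID=5601)",
]

errors = list(puppet_errors(sample))
print(errors)
```

Like the grep, this is a substring match, so any message that happens to contain "err" is reported too.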
File/Dir Organization
- We use git to revision-control our puppet manifests.
- The style we adopted mainly comes from Hunter Haugen*.
- A branch for each environment, plus a “common” branch.
- Each branch is checked out as a separate directory in /etc/puppet/environments/$env.
- The puppetmaster's includedir is configured to that directory.

* http://hunnur.com/blog/2010/10/dynamic-git-branch-puppet-environments/

- Common branch
    manifests/
      alpha.pp
      beta.pp
    modules/
      base/
      users/
- Alpha env branch
    modules/
      python/
      services/
      nameserver/
- Beta env branch
    modules/
      python/
      services/
      nameserver/
File/Dir Organization
- Common goes into its own branch for convenience: less merging needed for manifests that we are Really Sure won't differ between environments.
- The system manifest goes into common/manifests/$env.pp.
  - Initially tried putting the manifest into the alpha/beta/omega branches as site.pp: merge hell.
- Introduced an extra variable, $effective_env.
  - Abstracts the puppet environment name from the environment that the manifest runs in.
File/Dir Organization
- Hotfixes branch off omega and are merged to alpha/beta/omega.
- Development branches off alpha.
  - This branch can be trialed as a separate environment (use --environment to specify a custom env on the puppet client).
  - Merge to alpha → beta → omega.
  - Or merge as a feature branch to any other environment.
- “git diff branchA branchB” shows the differences between environments clearly.
Edge Servers
- Our edge servers are hosted on OnApp cloud (only).
- When creating an edge server, the cloud control panel:
  - Instantiates a VM from a lightly customized Debian image.
  - Configures the package repositories.
  - Issues a puppet run to set it up.
- Advantage of setting it up through puppet instead of a “gold image”: our system can be installed on bare metal if needed, and can be reproducibly installed on $future_debian_release.
Edge Servers – External Node Classifier
- No text manifest: all code, using an “external node classifier”.
- Assign variables and classes specific to the edge server through the node classifier, e.g. its password and the services it runs.
- In Python:

    import yaml

    output = {}
    output["classes"] = ["class1", "class2"]
    output["parameters"] = {"param1": "value1"}
    print(yaml.dump(output))
Edge Servers – External Node Classifier
- This YAML-encoded structure...

    $ puppet-nodeclassifier 85206671.onappcdn.com
    classes: [base, nginx]
    parameters: {edge_secret_key: 86zFsrM7Ma, monitoring_domain: monitoring.alpha.onappcdn.com}

- ... is equivalent to this textual manifest:

    node "85206671.onappcdn.com" {
      $edge_secret_key = "86zFsrM7Ma"
      $monitoring_domain = "monitoring.alpha.onappcdn.com"
      include base
      include nginx
    }
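To make the equivalence concrete, here is a small Python sketch that turns a classifier result into the matching textual node block. `render_node_manifest` is a hypothetical helper written for this illustration, not part of Puppet or of our classifier. It assumes all parameters are plain strings.

```python
def render_node_manifest(fqdn, enc_output):
    """Render the textual node block that matches a node classifier result."""
    lines = [f'node "{fqdn}" {{']
    # Parameters become node-scoped variables.
    for name, value in enc_output.get("parameters", {}).items():
        lines.append(f'  ${name} = "{value}"')
    # Classes become include statements.
    for class_name in enc_output.get("classes", []):
        lines.append(f"  include {class_name}")
    lines.append("}")
    return "\n".join(lines)


enc_output = {
    "classes": ["base", "nginx"],
    "parameters": {
        "edge_secret_key": "86zFsrM7Ma",
        "monitoring_domain": "monitoring.alpha.onappcdn.com",
    },
}

manifest = render_node_manifest("85206671.onappcdn.com", enc_output)
print(manifest)
```

Going the other way (classifier output instead of text) is what lets the control panel decide per server which classes and secrets apply, without anyone editing a manifest.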
Edge Servers Storedconfigs
- Puppet stores facts about the edge servers in MySQL.
- We make minimal use of this, for example sizing nginx's in-memory cache depending on the amount of memory the server has.
- Could probably use it more, e.g. setting the number of threads based on CPU core count.
- The data's always there if we ever want to query it...
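A rough Python sketch of the kind of fact-based sizing described above. Everything specific here is an assumption for illustration: the function name `nginx_tuning`, the 25% cache fraction, the 64 MB floor, and the use of the `memorysize_mb` and `processorcount` fact names. Stored facts come back as strings, so they are converted first.

```python
def nginx_tuning(facts, cache_fraction=0.25, min_cache_mb=64):
    """Derive cache size and thread count from a node's stored facts.

    The fraction and the floor are made-up defaults, not OnApp's real values.
    """
    memory_mb = float(facts["memorysize_mb"])   # facts are stored as strings
    cores = int(facts["processorcount"])
    cache_mb = max(min_cache_mb, int(memory_mb * cache_fraction))
    return {"cache_size_mb": cache_mb, "threads": max(1, cores)}


# Hypothetical facts for two edge servers of different sizes.
big = nginx_tuning({"memorysize_mb": "2048.00", "processorcount": "4"})
small = nginx_tuning({"memorysize_mb": "128.00", "processorcount": "1"})
print(big, small)
```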
Q&A
- Questions? Comments?
- P.S. Final plug: we're hiring sysadmins!