Symantec Confidential – Cloud Platform Engineering
Continuous Validation at scale
Vijay Seshadri
Cloud Platform Engineering (CPE), Symantec
Agenda
1. CPE Overview
2. What is Continuous Validation?
3. SCTF Overview & Usage
4. SCTF Design and Roadmap
CPE Overview
• CPE Charter
– Consolidated cloud infrastructure that offers platform services for Symantec cloud applications
• Symantec Cloud Infrastructure already operating at scale
– Compute: reputation-based security
– Storage: consumer and enterprise backup
– Network: hosted email security
• How do we apply the best practices and insights from operating at scale to the new platform?
• Core objectives
– Secure, scalable and reliable OpenStack-based cloud platform
CPE Platform Architecture
[Figure: Cloud Platform Engineering (CPE) architecture diagram. Labels recovered from the slide:]
• Core Services (Compute, Networking, Storage, Big Data, Messaging): Compute (Nova), Image (Glance), SDN (Neutron), Load Balancing, DNS, SQL, Object Store, K/V Store, Batch Analytics, Stream Processing, Msg Queue, Mem Cache, Email Relay, SSL
• Identity & Access (Keystone) and Supporting Services: Authn, Roles, User Mgmt, Tenancy, Quotas, Logging, Metering, Monitoring, Deployment
• Other labels: Cloud Applications, CLIs, Scripts, Web Portal, REST/JSON API
CPE Reference App #1 - Log Collection service
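[Figure: log collection reference application running on the CPE Cloud. Labels recovered from the slide: Log Sources (e.g., security metadata, install logs, telemetry), Log Collection App, Keystone Authentication, DNS queries, LB, Compute (VM0, VM1), Object Store (Swift), Container]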
1. Acquire an authentication token
2. Create two VMs, associate a network and start them using a CentOS image
3. Create a LB endpoint, place the two VMs in it and configure a DNS entry
4. Provision a container in the Object store
5. Deploy and start the Flask application
6. Fetch log files from the Object store
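Steps 1 and 6 of the walkthrough can be sketched as plain HTTP requests. The sketch below only builds the requests and never sends them. It assumes the Keystone v3 password-auth payload and the usual Swift object URL layout; the slides do not say which API versions CPE used, and the endpoints, user, project, container and object names are all hypothetical.

```python
import json
import urllib.request

# Hypothetical endpoints; a real deployment publishes its own in the service catalog.
KEYSTONE_URL = "https://keystone.example.com:5000"
SWIFT_URL = "https://swift.example.com/v1/AUTH_demo"


def build_token_request(user, password, project, domain_id="default"):
    """Step 1: build a Keystone v3 password-auth request (assumed API version)."""
    body = {
        "auth": {
            "identity": {
                "methods": ["password"],
                "password": {
                    "user": {
                        "name": user,
                        "domain": {"id": domain_id},
                        "password": password,
                    }
                },
            },
            "scope": {"project": {"name": project, "domain": {"id": domain_id}}},
        }
    }
    return urllib.request.Request(
        KEYSTONE_URL + "/v3/auth/tokens",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def build_fetch_log_request(token, container, obj):
    """Step 6: build a Swift GET for one log object, authenticated by token."""
    url = "{}/{}/{}".format(SWIFT_URL, container, obj)
    return urllib.request.Request(url, headers={"X-Auth-Token": token}, method="GET")
```

With Keystone v3, a successful token request returns the token in the `X-Subject-Token` response header, and that value is what later calls pass as `X-Auth-Token`.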
Problem Statement
• Cloud infrastructure at scale is a highly dynamic environment
– Diversity of cloud workloads
• Cannot predict application behaviors and patterns
– Addition and removal of resources (machines, network equipment etc.)
– Configuration drift over a period of time
– External events causing huge variations in network, compute and storage consumption
– Stability issues occur when you cross scale boundaries (jump an order of magnitude)
• Key Question – What validation tools/frameworks do we need to identify issues at scale and remediate them?
What capabilities do we need in a validation framework?
• Ability to test generic REST/JSON endpoints (services)
– Including OpenStack and platform services
• Ability to quickly create tests for functionality, stability and performance
– Should not be burdensome for developers
• Ability to customize/extend test conditions and/or verification functions
• Independent channel of verification
– Higher order verification
• E.g., don’t just check the return status from individual services; verify the end-to-end function
– Extensible, pluggable design
• Provide continuous visibility into the health and performance of production cloud
– Proactively monitor transient and persistent errors
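To illustrate "higher order verification", the sketch below accepts a VM create call only when an independent check also confirms the result. Every name here (`FakeCloud`, `create_vm`, `ping_vm`, `verify_vm_create`) is a hypothetical stand-in used to keep the example runnable; none of them is an SCTF or OpenStack API.

```python
class FakeCloud:
    """In-memory stand-in for a cloud API, used only to make the sketch runnable."""

    def __init__(self, boots_ok=True):
        self.boots_ok = boots_ok
        self.reachable = set()

    def create_vm(self, name):
        # The API reports success even if the guest never becomes reachable.
        if self.boots_ok:
            self.reachable.add(name)
        return 202  # HTTP "Accepted"

    def ping_vm(self, name):
        # Independent channel: ask the VM itself, not the API that created it.
        return name in self.reachable


def verify_vm_create(cloud, name):
    """Pass only if the API call succeeds AND the VM answers independently."""
    status = cloud.create_vm(name)
    if status not in (200, 201, 202):
        return False, "create returned {}".format(status)
    if not cloud.ping_vm(name):
        return False, "API reported success but VM is unreachable"
    return True, "ok"
```

A status-only check would pass in both cases above; the second channel is what separates "the service accepted the request" from "the VM actually works".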
Continuous Validation State Transitions
Symantec Cloud Test Framework (SCTF)
• What is SCTF?
– A set of Python libraries, scripts and simple text files (YAML) that facilitate the validation of a cloud infrastructure
– Primitives for expressing REST requests and validating responses
[Screenshot: example SCTF test case. Callouts: built-in exec function, test command, validation condition]
How to run SCTF?
[Screenshot: SCTF run. Callouts: input YAML file, test case name, validation summary]
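The callouts above (exec function, test command, validation condition, validation summary) suggest a shape like the following. This is a minimal sketch only: the real SCTF YAML schema is not visible in the text, so the field names `name`, `exec` and `expect` are invented for illustration, and the test case is written as a Python dict rather than YAML to avoid a third-party parser.

```python
import subprocess
import sys

# Hypothetical test case; the real SCTF YAML schema is not shown in the text.
TEST_CASE = {
    "name": "echo_smoke_test",
    "exec": [sys.executable, "-c", "print('hello')"],  # test command
    "expect": {"returncode": 0, "stdout_contains": "hello"},  # validation condition
}


def run_exec_test(case):
    """Run the command and compare the outcome with the expected values."""
    proc = subprocess.run(case["exec"], capture_output=True, text=True, timeout=30)
    expect = case["expect"]
    passed = (
        proc.returncode == expect["returncode"]
        and expect["stdout_contains"] in proc.stdout
    )
    return {"name": case["name"], "passed": passed}


def summarize(results):
    """Produce a one-line validation summary."""
    ok = sum(1 for r in results if r["passed"])
    return "{}/{} passed".format(ok, len(results))
```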
SCTF Usage – Simple web request
[Screenshot: simple web request test. Callouts: built-in web service function, request URL and method, response code]
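A web request test of this kind reduces to three inputs: a URL, a method and an expected response code. The sketch below shows that check with the Python standard library against a throwaway local server, so it needs no real endpoint. `HealthHandler`, `check_response_code` and the `/health` path are hypothetical names, not SCTF's built-in web service function.

```python
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class HealthHandler(BaseHTTPRequestHandler):
    """Tiny local service so the example needs no external endpoint."""

    def do_GET(self):
        code = 200 if self.path == "/health" else 404
        self.send_response(code)
        self.end_headers()

    def log_message(self, *args):
        pass  # keep output quiet


def check_response_code(url, method, expected_code):
    """Issue the request and report whether the response code matches."""
    req = urllib.request.Request(url, method=method)
    # Empty ProxyHandler: ignore any proxy configured in the environment.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(req, timeout=5) as resp:
            actual = resp.status
    except urllib.error.HTTPError as err:
        actual = err.code  # urllib raises on 4xx/5xx responses
    return actual == expected_code, actual


def demo():
    server = HTTPServer(("127.0.0.1", 0), HealthHandler)  # port 0 = any free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    base = "http://127.0.0.1:{}".format(server.server_port)
    try:
        return (
            check_response_code(base + "/health", "GET", 200),
            check_response_code(base + "/missing", "GET", 404),
        )
    finally:
        server.shutdown()
        server.server_close()
```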
SCTF Usage – Reusable Primitives
[Screenshot: reusable test procedure. Callouts: test procedure name, variable definitions, test case definition]
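One way to picture a reusable primitive is a named procedure with placeholders that each test case fills in with its own variables. The sketch below uses `string.Template` for the substitution; the `PROCEDURES` table, the `get_object` procedure and the `expand` helper are assumptions made for illustration and do not reflect SCTF's actual syntax.

```python
from string import Template

# Hypothetical reusable procedure: a request template with named variables.
PROCEDURES = {
    "get_object": {
        "method": "GET",
        "url": "$endpoint/$container/$object",
        "expect_code": 200,
    }
}


def expand(procedure_name, variables):
    """Return a concrete test case by substituting variables into a procedure."""
    template = PROCEDURES[procedure_name]
    case = dict(template)  # shallow copy so the shared procedure is not modified
    # substitute() raises KeyError if a variable is missing, which surfaces typos early.
    case["url"] = Template(template["url"]).substitute(variables)
    return case
```

The same procedure can then back many test cases that differ only in their variable definitions.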
SCTF Usage – Independent channel of verification
[Screenshot: independent verification test. Callouts: built-in exec function started after VM create, ssh command line, retry args]
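Retry arguments matter here because a freshly created VM usually refuses ssh until it has finished booting. A minimal sketch of that retry loop follows; `retry` and its `attempts`/`delay` parameters are assumed names, and `FakeSsh` stands in for a real ssh command so the example runs anywhere.

```python
import time


def retry(check, attempts=5, delay=0.01):
    """Call check() until it returns True or attempts run out.

    Returns (passed, tries_used).
    """
    for attempt in range(1, attempts + 1):
        if check():
            return True, attempt
        if attempt < attempts:
            time.sleep(delay)  # wait before the next try, but not after the last
    return False, attempts


class FakeSsh:
    """Stand-in for an ssh probe: fails until the VM has 'booted'."""

    def __init__(self, ready_after):
        self.calls = 0
        self.ready_after = ready_after

    def __call__(self):
        self.calls += 1
        return self.calls >= self.ready_after
```

For example, `retry(FakeSsh(3), attempts=5)` succeeds on the third try, while a probe that never becomes ready exhausts its attempts and fails the test.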
SCTF Design
SCTF Roadmap
• Stream files – enable large file downloads
• Test Runner – execute all test files in a directory hierarchy
• Preserve comments – retain comments after programmatic manipulation
• Improve error reporting – make stack traces and error reporting more descriptive
• Incorporate Salt to allow remote execution and job management
• Allow tests to be run in parallel (multiple possible ways)
– Use pykka (https://github.com/jodal/pykka) for actors in a single process
– Call out to Julia (http://julialang.org/) and use its parallel facilities
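The test runner and parallel execution items can be sketched together. Note the swap: this sketch uses a standard-library thread pool, not pykka actors or Julia as named on the slide, purely to stay self-contained. `discover_tests`, `run_all` and the `run_one` callback are hypothetical names, and the `.yaml`/`.yml` suffixes are an assumption about how test files are named.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def discover_tests(root, suffixes=(".yaml", ".yml")):
    """Return every test file under root, sorted for a stable run order."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(suffixes):
                found.append(os.path.join(dirpath, name))
    return sorted(found)


def run_all(root, run_one, workers=4):
    """Run run_one(path) for each discovered file, several at a time."""
    paths = discover_tests(root)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in the same order as paths.
        results = list(pool.map(run_one, paths))
    return dict(zip(paths, results))
```

Threads suit this workload because cloud tests spend most of their time waiting on network calls; tests that share state would still need to be serialized.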
SCTF Roadmap – Cont’d
• Allow test results to be written to files and databases
• Allow test documentation to be queried
• Determine why the test failed
– Diagnosis
– Remediation
– Validate remediation
• Add timing and metadata to test output
• Performance as test criteria
• Add extension type to allow type handlers to be added at run-time
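"Performance as test criteria" can be read as: a test fails if it is functionally correct but too slow. The sketch below records timing in the result and applies a time budget. `timed_test`, `max_seconds` and the result field names are assumptions for illustration, not SCTF's output format.

```python
import time


def timed_test(name, func, max_seconds):
    """Run func(), record how long it took, and fail if it exceeds the budget."""
    start = time.perf_counter()
    try:
        functional_ok = bool(func())
        error = None
    except Exception as exc:  # record the failure instead of aborting the run
        functional_ok = False
        error = repr(exc)
    elapsed = time.perf_counter() - start
    return {
        "name": name,
        "passed": functional_ok and elapsed <= max_seconds,
        "functional_ok": functional_ok,
        "elapsed_seconds": elapsed,
        "max_seconds": max_seconds,
        "error": error,
    }
```

Keeping `functional_ok` separate from `passed` lets a report distinguish "wrong answer" from "right answer, too slow", which is useful input for the diagnosis step listed above.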
Summary/Conclusion
• We plan to use SCTF as a primary means of functional and performance validation
– Enable continuous monitoring of the stability and performance of the CPE cloud
– Ability to associate diagnosis and remediation with failing functional tests
– Scale the ability to generate tests along with the cloud
– Enable shorter mean time to resolution
• Planning to collaborate with other similar open source projects
• Our primary motivation is to ensure the stability of an OpenStack based cloud when deployed at scale