PRESENTATION TITLE ON ONE LINE AND ON TWO LINES
First and last namePosition, company
Cry in the dojo, laugh in the battlefield: how we constantly
try to bring Scylla to its knees so you don't have to.
QA Manager, Scylla
Roy Dahan
Roy Dahan
Roy has over 10 years of experience testing
large-scale distributed systems, with a focus on
storage/data systems, and managing small to large
teams responsible for all aspects of testing using a
highly automated approach.
Our Goal
▪ Achieving the Highest Levels of System Stability & Availability
▪ Maintaining Data Integrity
▪ Preventing Performance Degradations Over Time
▪ Increasing User Confidence
All of the above, even when BAD THINGS happen in
“Production-like Environments”
How We Test Scylla
Scylla Testing
▪ Unit: ✓ scylla-unittest
▪ Functional: ✓ dtest
▪ Compatibility: ✓ dtest ✓ Driver Tests
▪ Integration Tests: ✓ Janus-Graph ✓ Titan-test ✓ Spark
▪ Scale / Performance: ✓ S-C-T
▪ Stress / Load: ✓ S-C-T ✓ Cassandra Stress
▪ System / Longevity: ✓ S-C-T ✓ Jepsen
Distributed Tests (dtest)
▪ Functional “Black Box” Tests
▪ Verifies our Compatibility with Cassandra
▪ Enhanced & Extended to Catch Scylla Regressions
▪ Around 10% (208) of the Reported Issues on the Scylla Project
Reference a dtest (Detected/Reproduced by dtest)
▪ About 675 Tests Run Regularly as Part of the “Regression Suite”
Scylla-Cluster-Tests (SCT)
▪ Automation Library and Test Collection for Scylla & Cassandra
Clusters
▪ Supports Multiple Backends such as: AWS / GCE / OpenStack /
Libvirt
▪ Tests are Based on Chaos Engineering Principles:
o Build a Hypothesis around Steady State Behavior
o Vary Real-world Events
o Automate Experiments to Run Continuously
▪ Around 4% (105) of the Reported Issues on the Scylla Project
Reference an SCT Test (Detected/Reproduced by an SCT Test)
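The three principles above can be illustrated with a minimal sketch. It is not SCT code: the function names, the injected callbacks, and the 50% allowed throughput drop are assumptions chosen only for illustration.

```python
def steady_state_holds(baseline_ops, samples, max_drop=0.5):
    """Hypothesis: every throughput sample stays above an allowed floor.

    `max_drop` is an illustrative assumption (0.5 = at most a 50% drop).
    """
    floor = baseline_ops * (1.0 - max_drop)
    return all(sample >= floor for sample in samples)


def run_experiment(measure, inject_event, baseline_ops, max_drop=0.5):
    """Inject one real-world event and check the hypothesis around it.

    `measure` returns the current throughput (op/s) and `inject_event`
    applies a disruption; both are hypothetical callbacks.
    """
    before = measure()
    inject_event()
    after = measure()
    return steady_state_holds(baseline_ops, [before, after], max_drop)
```

Running `run_experiment` repeatedly from a scheduler would cover the third principle, automating the experiments so that they run continuously.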
SCT Longevity Testing
Test Setup (Our Defaults):
▪ Cluster of N Scylla DB nodes (N=6)
▪ Set of X Loader Nodes (X=2)
▪ Scylla Monitoring Server
[Diagram: two clients (loaders) connected to the cluster of nodes]
SCT Longevity Testing
Test Setup - Example on GCE:
SCT Longevity Testing
The Test flow:
▪ Client Side Loaders Run Workloads: a Set of Cassandra-Stress Loads
(Write, Mixed, Counters, User Profiles)
▪ During X hours / days / weeks
▪ A “Nemesis” Out of the Predefined List is
Randomly Selected
o Some Nemeses Disrupt Nodes in the
Cluster
o Some Run Standard Cluster
Operations
Current Nemesis types:
StopStartService
StopWaitStartService
Drainer
Decommission
CorruptThenRepair
CorruptThenRebuild
NoCorruptRepair
Refresh
MajorCompaction
ModifyTableProperties
Enospc
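The random selection described above can be sketched as a small loop. This is a minimal illustration, not the SCT implementation: `run_nemesis_loop`, the `disrupt` callback, and the injectable `sleep` are hypothetical, while the Nemesis names are the ones listed on the slide.

```python
import random
import time

# Nemesis names as listed on the slide.
NEMESIS_TYPES = [
    "StopStartService", "StopWaitStartService", "Drainer", "Decommission",
    "CorruptThenRepair", "CorruptThenRebuild", "NoCorruptRepair", "Refresh",
    "MajorCompaction", "ModifyTableProperties", "Enospc",
]


def run_nemesis_loop(disrupt, interval_s, rounds, rng=None, sleep=time.sleep):
    """Pick a random nemesis each round, run it, then wait for the interval.

    `disrupt` is a hypothetical callback that applies the named nemesis.
    Returns a list of (name, elapsed_seconds) in execution order.
    """
    rng = rng or random.Random()
    history = []
    for _ in range(rounds):
        name = rng.choice(NEMESIS_TYPES)
        started = time.monotonic()
        disrupt(name)
        history.append((name, time.monotonic() - started))
        sleep(interval_s)
    return history
```

Recording the elapsed time of each execution is what makes a per-Nemesis summary (count and average duration) possible at the end of a run.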
SCT Longevity Testing
Test Fixture Example:
test_duration: 5760
stress_cmd: ["cassandra-stress write cl=QUORUM duration=5760m -schema 'replication(factor=3) compaction(strategy=SizeTieredCompactionStrategy)' -port jmx=6868 -mode cql3 native -rate threads=1000 -pop seq=1..100000000 -log interval=5",
             "cassandra-stress counter_write cl=QUORUM duration=5760m -schema 'replication(factor=3) compaction(strategy=DateTieredCompactionStrategy)' -port jmx=6868 -mode cql3 native -rate threads=1000 -pop seq=1..1000000",
             "cassandra-stress user profile=/tmp/cs_mv_profile.yaml ops'(insert=3,read1=1,read2=1,read3=1)' cl=QUORUM duration=5760m -port jmx=6868 -mode cql3 native -rate threads=100"]
n_db_nodes: 6
n_loaders: 2
n_monitor_nodes: 1
nemesis_class_name: 'ChaosMonkey'
nemesis_interval: 5
failure_post_behavior: keep
space_node_threshold: 644245094
ip_ssh_connections: 'private'
experimental: 'true'
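As a quick sanity check on the fixture, the sketch below verifies that every stress command runs for the same duration as the test and converts that duration to days. It assumes `test_duration` is expressed in minutes, which matches the `duration=5760m` arguments. The stress commands are abbreviated here, and the helper names are hypothetical.

```python
import re

fixture = {
    "test_duration": 5760,  # assumed to be minutes, matching duration=5760m
    "stress_cmd": [
        # Abbreviated versions of the three commands in the fixture.
        "cassandra-stress write cl=QUORUM duration=5760m -rate threads=1000",
        "cassandra-stress counter_write cl=QUORUM duration=5760m -rate threads=1000",
        "cassandra-stress user profile=/tmp/cs_mv_profile.yaml cl=QUORUM duration=5760m -rate threads=100",
    ],
}


def stress_durations_minutes(commands):
    """Extract the duration=<N>m value from each cassandra-stress command."""
    return [int(re.search(r"duration=(\d+)m", cmd).group(1)) for cmd in commands]


def minutes_to_days(minutes):
    """Convert a duration in minutes to days."""
    return minutes / (60 * 24)


durations = stress_durations_minutes(fixture["stress_cmd"])
mismatched = [d for d in durations if d != fixture["test_duration"]]
print(minutes_to_days(fixture["test_duration"]), mismatched)  # 4.0 []
```

5760 minutes is 4 days, so the load keeps running for the whole test.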
SCT Longevity Testing
Nemesis Code Examples:

def disrupt_destroy_data_then_repair(self):
    self._set_current_disruption('CorruptThenRepair %s' % self.target_node)
    # Delete set of sstables from data directory
    self._destroy_data()
    # Try to save the node
    self.repair_nodetool_repair()

def disrupt_stop_wait_start_scylla_server(self, sleep_time=300):
    self._set_current_disruption('StopWaitStartService %s' % self.target_node)
    # Stop the service and wait until the node is seen as down
    self.target_node.remoter.run('sudo systemctl stop scylla-server.service')
    self.target_node.wait_db_down()
    self.log.info("Sleep for %s seconds", sleep_time)
    time.sleep(sleep_time)
    # Start the service again and wait until the node is back up
    self.target_node.remoter.run('sudo systemctl start scylla-server.service')
    self.target_node.wait_db_up()
SCT Longevity Testing
Test Verification & Analysis:
▪ Application Load (cassandra-stress) Doesn’t Stop
▪ Auto Detection of:
• Coredumps
• Errors
• Exceptions
• Operation failures (repair, add node, refresh, compaction, etc.)
▪ Auto Detection of Performance Degradations (unexpected lower throughput
/ higher latencies due to operations)
▪ Compare Nemesis Execution Durations Across Builds to Detect Possible
Regressions
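The auto detection of coredumps, errors and exceptions can be approximated by scanning log lines for known patterns. The sketch below is only an illustration: the regular expressions and the `scan_log` helper are assumptions, and a real suite would tune the patterns to its own log format.

```python
import re

# Hypothetical patterns; not the ones used by SCT.
PATTERNS = {
    "coredump": re.compile(r"core dump|coredump", re.IGNORECASE),
    "exception": re.compile(r"\bexception\b", re.IGNORECASE),
    "error": re.compile(r"\berror\b", re.IGNORECASE),
}


def scan_log(lines):
    """Return {category: [1-based line numbers]} for lines matching each pattern."""
    hits = {name: [] for name in PATTERNS}
    for lineno, line in enumerate(lines, start=1):
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                hits[name].append(lineno)
    return hits
```

A test run could then be failed whenever any category is non-empty, or the hits could be attached to the report for later analysis.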
SCT Longevity Testing
Longevity monitoring example:
“Total Requests Served” (op/s) correlated with Nemesis executions.
SCT Longevity Testing
Longevity monitoring example:
“Requests Rate Served” (op/s per instance) correlated with Nemesis executions.
SCT Longevity Testing
Longevity monitoring example:
“CPU utilization” (% per instance) correlated with Nemesis executions.
SCT Longevity Testing
Test Summary Output - Nemesis Execution:

50GB DataSet Test: (Nemesis every 5 minutes, 4 days)
-----------------------------------------------
| Nemesis Type          | Count | Avg Time(s) |
-----------------------------------------------
| CorruptThenRebuild    |   103 |       93.79 |
| Decommission          |   111 |      231.89 |
| Drainer               |   109 |       48.27 |
| CorruptThenRepair     |   113 |      285.71 |
| Refresh               |    95 |        7.72 |
| NoCorruptRepair       |    97 |      331.73 |
| StopStartService      |   133 |       26.92 |
| MajorCompaction       |   134 |       20.63 |
| ModifyTable           |   197 |        1.50 |
| Enospc                |   114 |       26.33 |
| StopWaitStartService  |    98 |       66.30 |
-----------------------------------------------

1TB DataSet Test: (Nemesis every 30 minutes, 6 days)
-----------------------------------------------
| Nemesis Type          | Count | Avg Time(s) |
-----------------------------------------------
| CorruptThenRebuild    |     2 |      732.50 |
| Decommission          |     7 |     2913.86 |
| Drainer               |     6 |      213.00 |
| CorruptThenRepair     |     5 |     4942.60 |
| Refresh               |     6 |       10.50 |
| NoCorruptRepair       |     3 |     2835.33 |
| StopStartService      |     2 |      195.00 |
| MajorCompaction       |     3 |      663.33 |
| ModifyTable           |     6 |        4.67 |
| Enospc                |     6 |      221.00 |
| StopWaitStartService  |     6 |      492.17 |
-----------------------------------------------
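A summary like the ones above can be produced from a list of per-execution records. The sketch below assumes each record is a `(nemesis_name, duration_seconds)` pair. The `summarize` and `format_table` helpers are hypothetical and do not reproduce the SCT reporting code.

```python
from collections import defaultdict


def summarize(executions):
    """Group (nemesis_name, duration_seconds) records into (count, average)."""
    totals = defaultdict(lambda: [0, 0.0])
    for name, seconds in executions:
        totals[name][0] += 1
        totals[name][1] += seconds
    return {
        name: (count, round(total / count, 2))
        for name, (count, total) in totals.items()
    }


def format_table(summary):
    """Render the summary as fixed-width rows, sorted by nemesis name."""
    rows = ["| {:<21} | {:>5} | {:>11} |".format("Nemesis Type", "Count", "Avg Time(s)")]
    for name, (count, avg) in sorted(summary.items()):
        rows.append("| {:<21} | {:>5} | {:>11.2f} |".format(name, count, avg))
    return "\n".join(rows)
```

Keeping the raw records, rather than only the averages, also allows comparing the duration distribution of a given Nemesis between builds.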
SCT Longevity Testing
Nemesis Execution Analysis:
Auto-analysis and reports based on test
statistics stored automatically in ElasticSearch
Example of Issue detected by Longevity
Example of Nemesis Added due to Issue
Example of Nemesis Added due to Issue
def disrupt_modify_table_comment(self):
    self._set_current_disruption('ModifyTableProperties %s' % self.target_node)
    # Set a random 24-letter table comment (Python 2 code, hence xrange)
    comment = ''.join(random.choice(string.ascii_letters) for i in xrange(24))
    cmd = "ALTER TABLE keyspace1.standard1 with comment = '{}';".format(comment)
    self.target_node.remoter.run('cqlsh -e "{}" {}'.format(cmd, self.target_node.private_ip_address), verbose=True)

def disrupt_modify_table_gc_grace_time(self):
    self._set_current_disruption('ModifyTableProperties %s' % self.target_node)
    # Pick a random gc_grace_seconds value between 216000 and 863999
    gc_grace_seconds = random.choice(xrange(216000, 864000))
    cmd = "ALTER TABLE keyspace1.standard1 with comment = 'gc_grace_seconds changed' AND" \
          " gc_grace_seconds = {};".format(gc_grace_seconds)
    self.target_node.remoter.run('cqlsh -e "{}" {}'.format(cmd, self.target_node.private_ip_address), verbose=True)
Multi DC Longevity - The plot thickens
Test Setup (Our Defaults):
▪ Cluster of N Scylla DB nodes (N=15)
▪ Across M “Data Centers” (M=3)
▪ Set of X Loader nodes (X=3)
▪ Scylla Monitoring Server
▪ Set of Cassandra-Stress commands
running on the loaders (Write,
Mixed, Counters, User Profiles)
The tc utility is used to impose random network delays,
packet drops, and packet reordering between Data Centers.
[Diagram: three data centers (DC1, DC2, DC3), each with its own client]
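The kind of tc invocation mentioned above can be sketched with the netem queueing discipline. The helpers below only build command strings and do not run them. The interface name, the value ranges, and the helper names are illustrative assumptions, not the values used in SCT.

```python
import random


def build_netem_command(interface, rng):
    """Build a `tc qdisc ... netem` command with random delay, loss and reordering.

    The ranges below are illustrative assumptions. netem needs a delay to be
    set for reordering to have any effect, so a delay is always included.
    """
    delay_ms = rng.randint(10, 200)
    loss_pct = rng.randint(1, 5)
    reorder_pct = rng.randint(1, 25)
    return (
        "tc qdisc add dev {iface} root netem "
        "delay {delay}ms loss {loss}% reorder {reorder}%"
    ).format(iface=interface, delay=delay_ms, loss=loss_pct, reorder=reorder_pct)


def build_netem_cleanup(interface):
    """Build the command that removes the root qdisc and restores normal traffic."""
    return "tc qdisc del dev {} root".format(interface)
```

Note that a root netem qdisc affects all traffic leaving the interface. Limiting the impairment to traffic between Data Centers would require additional tc classes and filters, which this sketch leaves out.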
Performance Regression
▪ Set of Predefined Workloads & Setups
○ Write
○ Read
○ Mixed
○ Customer Workloads
▪ Storing Results (Op/s, Throughput, Latency) in ElasticSearch
▪ Master Daily Regression Suite - Automatically Compare Results
with a Previous Build & “Best” Build
▪ Release Regression Suite - Automatically Compare Results with
Previous Releases (including RCs)
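The automatic comparison against a previous or “best” build can be sketched as a threshold check per metric. The metric names, the 5% tolerance, and the `find_regressions` helper are assumptions for illustration. The slides do not describe the actual comparison logic.

```python
def find_regressions(current, baseline, tolerance=0.05):
    """List metrics that got worse than the baseline by more than `tolerance`.

    Higher is better for throughput, lower is better for latency.
    Returns {metric: percent_change} for the regressed metrics only.
    """
    higher_is_better = {"op_rate": True, "latency_99th_ms": False}
    regressions = {}
    for metric, better_high in higher_is_better.items():
        base, cur = baseline[metric], current[metric]
        change = (cur - base) / base
        worse = -change if better_high else change
        if worse > tolerance:
            regressions[metric] = round(change * 100, 1)
    return regressions
```

The same function can be called twice per run, once with the previous build and once with the “best” build as the baseline, so that slow drifts across many builds are also caught.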
Performance Regression
Test-Write - Total Op rate (op/s) by Release:
Performance Regression
Test-Write - 99th Percentile Latency (ms) by Release:
Large Scale Tests
▪ Clusters of 100s of Nodes
▪ 10s of TB of Data
▪ Multi-Core Scylla Nodes
▪ Many sstables
Sample of a 101-node Scylla cluster running on AWS.
On QA Roadmap
Longevity:
▪ Embed CharybdeFS (fault injection FS) in Longevity
▪ Extend workload types
▪ Two+ Nemesis in Parallel
▪ Adding more “Sudden Death” Types of Nemesis
▪ Enable “sstables integrity checker”
Load & Scale:
▪ XXL Cluster Sizes (1000+ nodes)
▪ Enhance Load Testing to More Server Dimensions (network, Disk)
On QA Roadmap
Performance:
▪ Add more “Real World Workloads” to Daily Regressions
▪ Performance Impact Per Operation (e.g. repair, majorCompaction)
▪ Collecting Latency Histograms for Various Load Types
3rd Party Integration:
▪ Spark & Titan Integration Suites
▪ Java & Golang Driver Integration Suites
Tools & Infrastructure:
▪ Enhance auto analysis based on Statistics in ElasticSearch
▪ Running SCT using an Existing Env
THANK YOU
Please stay in touch
Any questions?