25
Support Operations Engineering: Scaling Developer Products to the Millions Junade Ali - @IcyApril

Scaling Developer Products to the Millions - USENIX · Support Operations Engineering: Scaling Developer Products to the Millions Junade Ali - @IcyApril Challenges Used by 20+ million

  • Upload
    others

  • View
    0

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Scaling Developer Products to the Millions - USENIX · Support Operations Engineering: Scaling Developer Products to the Millions Junade Ali - @IcyApril Challenges Used by 20+ million

Support Operations Engineering:Scaling Developer Products to the Millions

Junade Ali - @IcyApril

Page 2: Scaling Developer Products to the Millions - USENIX · Support Operations Engineering: Scaling Developer Products to the Millions Junade Ali - @IcyApril Challenges Used by 20+ million
Page 3: Scaling Developer Products to the Millions - USENIX · Support Operations Engineering: Scaling Developer Products to the Millions Junade Ali - @IcyApril Challenges Used by 20+ million

Challenges

● Used by 20+ million web properties○ Free, self-service, and Enterprise service levels○ Pro-bono enterprise-grade protection to at-risk Public Interest Groups

● Ever growing customer support requests○ ~15,000 customer support tickets per month○ Complex and varied web hosting environments○ Everyone from florists to Fortune 1000 companies○ 24x7 TSE coverage

Page 4: Scaling Developer Products to the Millions - USENIX · Support Operations Engineering: Scaling Developer Products to the Millions Junade Ali - @IcyApril Challenges Used by 20+ million

First Support Operations (SOPS) service

● HelperBot Stateless○ A diagnostics API

● Exposed in many contexts○ Internal service-to-service○ API Gateway○ Customer communication webhooks

● Uses many data sources & active tests

Page 5: Scaling Developer Products to the Millions - USENIX · Support Operations Engineering: Scaling Developer Products to the Millions Junade Ali - @IcyApril Challenges Used by 20+ million
Page 6: Scaling Developer Products to the Millions - USENIX · Support Operations Engineering: Scaling Developer Products to the Millions Junade Ali - @IcyApril Challenges Used by 20+ million

Campaign Metrics

● Chrome 68 Release● 91,895 daily tests● 1 month of human

manual testing

Page 7: Scaling Developer Products to the Millions - USENIX · Support Operations Engineering: Scaling Developer Products to the Millions Junade Ali - @IcyApril Challenges Used by 20+ million

The Need for Automation

● Customer Tooling > Agent Tooling● Tooling != Automation● Automation > Customer Tooling

Page 8: Scaling Developer Products to the Millions - USENIX · Support Operations Engineering: Scaling Developer Products to the Millions Junade Ali - @IcyApril Challenges Used by 20+ million

NLP is far from perfect...

● State of the Art NLP wasn’t suitable○ ~70-80% accuracy○ ~50% for best commercial POC

● Tolerances for false positives vary○ Free or paid?○ General question or sensitive issue?

Page 9: Scaling Developer Products to the Millions - USENIX · Support Operations Engineering: Scaling Developer Products to the Millions Junade Ali - @IcyApril Challenges Used by 20+ million
Page 10: Scaling Developer Products to the Millions - USENIX · Support Operations Engineering: Scaling Developer Products to the Millions Junade Ali - @IcyApril Challenges Used by 20+ million

Scope for Failure

NLP Pipeline1. NER2. Multi-Classifier3. Over-Engineering*4. Formal Contracts** applied depending on risk sensitivity

False Positive Rate:● Multi-Classifier: 21%● Over-Engineering: 1-2%● Formal Contracts: 0%

Page 11: Scaling Developer Products to the Millions - USENIX · Support Operations Engineering: Scaling Developer Products to the Millions Junade Ali - @IcyApril Challenges Used by 20+ million

Novel Safety Engineering Approaches

● Baseline○ Failure is tolerable due to majority benefit○ I.e. Low risk & free user wait time for response

● Binary Classifier○ Higher risk, but not sensitive

● Formally Defined Safety Checks○ Sensitive requests○ May require customer validation actions

Page 12: Scaling Developer Products to the Millions - USENIX · Support Operations Engineering: Scaling Developer Products to the Millions Junade Ali - @IcyApril Challenges Used by 20+ million

https://ieeexplore.ieee.org/abstract/document/8820497/

Page 13: Scaling Developer Products to the Millions - USENIX · Support Operations Engineering: Scaling Developer Products to the Millions Junade Ali - @IcyApril Challenges Used by 20+ million

Over-Engineering for Safety

● Binary Classification○ Cascading failure to reduce false positives○ Non-sensitive requests by paying users○ Convolutional Neural Network

● Use of Diagnostics○ Corresponding failed diagnostics is also tolerable

Page 14: Scaling Developer Products to the Millions - USENIX · Support Operations Engineering: Scaling Developer Products to the Millions Junade Ali - @IcyApril Challenges Used by 20+ million

Cascading Failure can be a good thing...

Page 15: Scaling Developer Products to the Millions - USENIX · Support Operations Engineering: Scaling Developer Products to the Millions Junade Ali - @IcyApril Challenges Used by 20+ million

Formally Defined Run-Time Contracts

How?1. Contracts + data stored2. Customer validation3. Contracts revalidated4. Downstream APIs revalidateFailure cases halt processing and remove data fields to prevent software errors.Expected failures linked to JIRAs, unexpected to Sentry/PagerDuty.

Page 16: Scaling Developer Products to the Millions - USENIX · Support Operations Engineering: Scaling Developer Products to the Millions Junade Ali - @IcyApril Challenges Used by 20+ million

Data Matters

● Simplified taxonomy○ Encourages greater accuracy

● Classification to fill in the gaps○ Used to add additional dimensions to reporting

● Make everything self-serve○ Attach repeat configuration change items to JIRAs

Page 17: Scaling Developer Products to the Millions - USENIX · Support Operations Engineering: Scaling Developer Products to the Millions Junade Ali - @IcyApril Challenges Used by 20+ million
Page 18: Scaling Developer Products to the Millions - USENIX · Support Operations Engineering: Scaling Developer Products to the Millions Junade Ali - @IcyApril Challenges Used by 20+ million

Next-Gen Security Operations Centre

● Proactive messaging for self-serve users● Can same be applied to a SOC?

○ Active testing○ Analysis of passive traffic data flow

Page 19: Scaling Developer Products to the Millions - USENIX · Support Operations Engineering: Scaling Developer Products to the Millions Junade Ali - @IcyApril Challenges Used by 20+ million

Multi-Dimensional Visibility

Colour = path ratio < 0.554 in redScatter size = UA ratio*2500

Page 20: Scaling Developer Products to the Millions - USENIX · Support Operations Engineering: Scaling Developer Products to the Millions Junade Ali - @IcyApril Challenges Used by 20+ million

Additional properties for disambiguation

Page 21: Scaling Developer Products to the Millions - USENIX · Support Operations Engineering: Scaling Developer Products to the Millions Junade Ali - @IcyApril Challenges Used by 20+ million

Intelligent Threat Fingerprinting

Εtot - success rate of brute force attackRV - abnormal HTTP status (429, 5xx, etc)𝓝zones - normalised sites attacked

Page 22: Scaling Developer Products to the Millions - USENIX · Support Operations Engineering: Scaling Developer Products to the Millions Junade Ali - @IcyApril Challenges Used by 20+ million

Intelligent Threat Fingerprinting

On these 3 aggregate properties, unsupervised clusterization is able to correlate to fingerprint of attack. E.g. Cluster 1 (highest success):● median success rate of 30.5%● 99.5% req from same UA● 99.45% same country

Page 23: Scaling Developer Products to the Millions - USENIX · Support Operations Engineering: Scaling Developer Products to the Millions Junade Ali - @IcyApril Challenges Used by 20+ million

Current state

● HelperBot formed of 6 services○ From chatbot to SOC anomaly detection○ 10 ancillary SOPS services

● Metrics○ TSF: 57.3% deflection (excl. email tickets)○ HelperBot: ~60% free ticket automation○ ~78% without human interaction

● Plenty more to do○ 24% of all tickets automated ○ 35% planned EOY ‘19, 50% in ‘20○ Groundwork laid to drive ever greater automation

Page 24: Scaling Developer Products to the Millions - USENIX · Support Operations Engineering: Scaling Developer Products to the Millions Junade Ali - @IcyApril Challenges Used by 20+ million

SOPS Principles

● Favour automation over tooling● Question the fundamentals● Context-Sensitive Safety● Be diligently data driven● Build services as an asset

Page 25: Scaling Developer Products to the Millions - USENIX · Support Operations Engineering: Scaling Developer Products to the Millions Junade Ali - @IcyApril Challenges Used by 20+ million

Thank you!Get in contact:@[email protected]