Engineering Your Startup to Innovate at Scale
Randy Shoup @randyshoup
linkedin.com/in/randyshoup
Background
• VP Engineering at Stitch Fix
  o Combining “Art and Science” to revolutionize apparel retail
• Consulting “CTO as a service”
  o Helping companies scale engineering organizations and technology
• Director of Engineering for Google App Engine
  o World’s largest Platform-as-a-Service
• Chief Engineer / Distinguished Architect at eBay
  o Multiple generations of eBay’s infrastructure
Stitch Fix
Combining Humans and Data Science
• 1:1 Ratio of Data Science to Engineering
  o Almost 100 software engineers
  o Almost 100 data scientists and algorithm developers
  o Unique in our industry
• Apply intelligence to *every* part of the business
  o Buying
  o Inventory management
  o Logistics optimization
  o Styling recommendations
  o Demand prediction
• Humans and machines augmenting each other
Styling at Stitch Fix
[Diagram: Personal styling, Inventory]
Personalized Recommendations
[Diagram: Inventory, Algorithmic recommendations, Machine learning]
Expert Human Curation
[Diagram: Human curation, Algorithmic recommendations]
Data-Driven Experimentation
• Experimenting with …
  o Algorithms
  o Client Experience
  o Stylist Interactions
  o Growth engineering
Faster is Better
Willingness to go fast + Ability to go fast
Lack of Fear + Capability
High-Performing Organizations
• Multiple deploys per day vs. one per month
• Commit to deploy in less than 1 hour vs. one week
• Recover from failure in less than 1 hour vs. one day
• Change failure rate of 0-15% vs. 31-45%
https://puppet.com/resources/whitepaper/state-of-devops-report
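The four measures on this slide (deploy frequency, commit-to-deploy time, time to recover, change failure rate) can be computed from a simple log of deployments. The following is a minimal sketch, not how the report itself collects its data: the `Deploy` record and all function names are hypothetical, introduced only for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import List, Optional


@dataclass
class Deploy:
    """One production deployment (hypothetical record shape)."""
    committed_at: datetime                   # when the change was committed
    deployed_at: datetime                    # when it reached production
    failed: bool = False                     # did the change cause a failure?
    recovered_at: Optional[datetime] = None  # when service was restored


def deploys_per_day(deploys: List[Deploy], days: int) -> float:
    """Deploy frequency over an observation window of `days` days."""
    return len(deploys) / days


def median_lead_time(deploys: List[Deploy]) -> timedelta:
    """Median time from commit to deploy."""
    return median(d.deployed_at - d.committed_at for d in deploys)


def change_failure_rate(deploys: List[Deploy]) -> float:
    """Fraction of deploys that caused a failure."""
    return sum(d.failed for d in deploys) / len(deploys)


def median_time_to_recover(deploys: List[Deploy]) -> timedelta:
    """Median time from a failed deploy to recovery."""
    return median(
        d.recovered_at - d.deployed_at
        for d in deploys
        if d.failed and d.recovered_at is not None
    )
```

Tracking these numbers over time gives a team its own baseline to compare against the high-performer ranges quoted on the slide.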
High-Performing Organizations
→ 2.5x more likely to exceed business goals
  o Profitability
  o Market share
  o Productivity
https://puppet.com/resources/whitepaper/state-of-devops-report
Speed vs. Stability?
Faster is Better
Innovating at Scale
• Organizing for Speed
• What to Build / What NOT to Build
• When to Build
• How to Build
• Delivering and Operating
Organizing for Speed
Conway’s Law
• Organization determines architecture
  o Design of a system will be a reflection of the communication paths within the organization
• Modular system requires modular organization
  o Small, independent teams lead to more flexible, composable systems
  o Larger, interdependent teams lead to larger systems
• We can engineer the system we want by engineering the organization
Small “Service” Teams
• Full-Stack, “2 Pizza” Teams
  o No team should be larger than can be fed by 2 large pizzas
  o Typically 4-6 people
  o All disciplines required for the team to function
• Aligned to Business Domains
  o Clear, well-defined area of responsibility
  o Single service or set of related services
  o Deep understanding of business problems
• Growth through “cellular mitosis”
Ideally, 80% of project work should be within a team boundary.
Autonomy and Accountability
• Give teams autonomy
  o Freedom to choose technology, methodology, working environment
  o Responsibility for the results of those choices
• Hold team accountable for *results*
  o Give a team a goal, not a solution
  o Let team own the best way to achieve the goal
  o Innovate and experiment to achieve the goal
What to Build / What NOT to Build
“Building the wrong thing is the biggest waste in software development.”
-- Mary and Tom Poppendieck, Lean Software Development
What problem are you trying to solve?
“A problem well-stated is a problem half-solved.”
-- Charles Kettering, former head of research for General Motors
What Problem Are You Trying to Solve?
• Focus on what is important for your business
• Problem might be solved without any technology at all
  o Redefine the problem
  o Change the business process
  o Implement manually for a while before automating in an application
Buy, Not Build
• Use Cloud Infrastructure
  o Faster, cheaper, better than we can do ourselves
  o Stitch Fix has no owned physical infrastructure anywhere in the world
• Prefer Open Source
  o Kubernetes, Docker, Istio
  o MySQL, Postgres, Redis, Elasticsearch
  o Machine learning models
  o Etc.
  o Usually better than the commercial alternatives (!)
Buy, Not Build
• Third-Party Services
  o Stitch Fix uses >50 third party services
  o Logging, monitoring, alerting
  o Project management, bug tracking
  o Billing, fraud detection
  o Etc.
• Focus on our core competency
  o Use services for everything else (!)
Soon it will be just as common to run your own data center as it is to run your own electrical power generation.
Experimental Discipline
• State your hypothesis
  o What metrics do you expect to move and why
  o Understand your baseline
• Run a real A | B test
  o Sample size
  o Isolated treatment and control groups
  o No peeking or quitting early!
• Obsessively log and measure
  o Understand customer and system behavior
  o Understand why this experiment worked or did not
• Listen to the data
  o Data trumps hope and intuition
  o Develop insights for next experiment
• Thinking of the experiment is art; evaluating it is science
• Rinse and Repeat
  o This is a journey, not a single step
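The "sample size" and "no peeking" points above can be made concrete with a small calculation. This is a minimal sketch using the standard normal-approximation formula for comparing two conversion rates; it is not taken from the talk, and the function names and the defaults (5% significance, 80% power) are assumptions chosen for illustration.

```python
import math
from statistics import NormalDist


def sample_size_per_group(baseline: float, expected: float,
                          alpha: float = 0.05, power: float = 0.8) -> int:
    """Users needed in EACH group to detect a change from `baseline`
    to `expected` conversion rate (two-sided test, normal approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = baseline * (1 - baseline) + expected * (1 - expected)
    return math.ceil((z_alpha + z_beta) ** 2 * variance
                     / (expected - baseline) ** 2)


def two_proportion_p_value(conv_a: int, n_a: int,
                           conv_b: int, n_b: int) -> float:
    """Two-sided p-value for the difference between control (a) and
    treatment (b) conversion rates, using a pooled z-test."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / std_err
    return 2 * (1 - NormalDist().cdf(abs(z)))
```

Computing the sample size before the experiment starts, and evaluating the p-value only once that many users have been observed in each group, is one way to enforce the "no peeking or quitting early" rule.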
eBay Machine-Learned Ranking
• Ranking function for search results
  o Which item should appear 1st, 10th, 100th, 1000th
  o Before: Small number of hand-tuned factors
  o Goal: Thousands of factors
• Incremental Experimentation
  o Predictive models: query->view, view->purchase, etc.
  o Hundreds of parallel A | B tests
  o Full year of steady, incremental improvements
→ 2% increase in eBay revenue (~$120M / year)
eBay Site Speed
• Reduce user-experienced latency for search results
• Iterative Process
  o Implement a potential improvement
  o Release to the site in an A | B test
  o Monitor metrics: time to first byte, time to click, click rate, purchase rate
→ 2% increase in eBay revenue (~$120M / year)
When to Build
Prioritization
• Scarce resources require prioritization
  o We always have more to do than resources to do it
  o Opportunity cost -- deciding to do X means deciding not to do Y
  o Every decision is a tradeoff
• Priority ← Return on Investment
  o Impact / Effort
• Prioritization is a business decision, not a technical decision
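The "Priority ← Return on Investment" rule above amounts to ranking candidate work by impact divided by effort. A minimal sketch follows; the `Candidate` type, its fields, and the units are hypothetical, and in practice the impact and effort estimates would come from the business, not from code.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    """A piece of work under consideration (hypothetical shape)."""
    name: str
    impact: float  # estimated benefit, in any consistent unit
    effort: float  # estimated cost, e.g. person-weeks

    @property
    def roi(self) -> float:
        """Return on investment: impact per unit of effort."""
        return self.impact / self.effort


def prioritize(candidates: List[Candidate]) -> List[Candidate]:
    """Order candidates from highest to lowest return on investment."""
    return sorted(candidates, key=lambda c: c.roi, reverse=True)
```

For example, a bug fix with impact 5 and effort 1 (ROI 5) ranks ahead of a rewrite with impact 8 and effort 8 (ROI 1), even though the rewrite has the larger absolute impact.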
Fewer Things, More Done
• Maximize resources applied to
  o Priority 1, then
  o Priority 2
  o etc.
• Incremental Delivery
  o Deliver increments along the way instead of everything at the end
• Deliver Value Faster
  o Time Value of Money
  o Benefit now is worth more than benefit in the future
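The time-value-of-money argument for incremental delivery can be illustrated with a standard present-value calculation. This is a sketch under stated assumptions: the 10% per-period discount rate and the delivery schedules in the usage note are made-up numbers, and the function names are hypothetical.

```python
from typing import Iterable, Tuple


def present_value(benefit: float, periods_from_now: int,
                  discount_rate: float) -> float:
    """Value today of a benefit received `periods_from_now` periods later."""
    return benefit / (1 + discount_rate) ** periods_from_now


def total_present_value(deliveries: Iterable[Tuple[int, float]],
                        discount_rate: float) -> float:
    """Sum the present value of (period, benefit) deliveries."""
    return sum(present_value(benefit, period, discount_rate)
               for period, benefit in deliveries)
```

With an assumed 10% discount rate per period, delivering 4 units of benefit all at period 4 is worth about 2.73 today, while delivering 1 unit at each of periods 1 through 4 is worth about 3.17: the same total work, but more value because part of it arrives sooner.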
“When you solve problem one, problem two gets a promotion.”
How to Build
Microservices
• Single-purpose
• Simple, well-defined interface
• Modular and composable
• Independently deployable
[Diagram: services A, B, C, D, E]
Evolution to Microservices
• eBay
  o 5th generation today
  o Monolithic Perl → Monolithic C++ → Java → microservices
• Twitter
  o 3rd generation today
  o Monolithic Rails → JS / Rails / Scala → microservices
• Amazon
  o Nth generation today
  o Monolithic Perl / C++ → Java / Scala → microservices
No one starts with microservices…
Past a certain scale, everyone ends up with microservices
If you don’t end up regretting your early technology decisions, you probably over-engineered.
Quality Discipline
• Quality and Reliability are “Priority-0 features”
  o Equally important to users as product features and engaging user experience
• Developers responsible for
  o Features
  o Quality
  o Performance
  o Reliability
  o Manageability
Test-Driven Development
• Tests help you go faster
  o Tests “have your back”
  o Development velocity
• Tests make better code
  o Confidence to break things
  o Courage to refactor mercilessly
• Tests make better systems
  o Catch bugs earlier, fail faster
Optimizing Developer Effort
• 75% reading existing code
• 20% modifying existing code
• 5% writing new code
https://blogs.msdn.microsoft.com/peterhal/2006/01/04/what-do-programmers-really-do-anyway-aka-part-2-of-the-yardstick-saga/
“Do you have time to do it twice?”
“We don’t have time to do it right!”
The more constrained you are on time or resources, the more important it is to build it right the first time.
Build It Right (Enough) The First Time
• Build one great thing instead of two half-finished things
• Right ≠ Perfect (80 / 20 Rule)
→ Basically no bug tracking system (!)
  o Bugs are fixed as they come up
  o Backlog contains features we want to build
  o Backlog contains technical debt we want to repay
Delivering and Operating
“You Build It, You Run It.”
-- Werner Vogels
Continuous Delivery
• Repeatable Deployment Pipeline
  o Low-risk, push-button deployment
  o Rapid release cadence
  o Rapid rollback and recovery
• Most applications deployed multiple times per day
• More solid systems
  o Release smaller units of work
  o Smaller changes to roll back or roll forward
  o Faster to repair, easier to understand, simpler to diagnose
Feature Flags
• Configuration “flag” to enable / disable a feature for a particular set of users
  o Independently discovered at eBay, Yahoo, Facebook, Google, etc.
• More solid systems
  o Decouple feature delivery from code delivery
  o Rapid on and off
  o Develop / test / verify in production
  o Dark launches
• Enables experimentation
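A feature flag as described above can be as small as a configuration record plus one check at the call site. The following is a minimal sketch, not any particular company's implementation: the `FeatureFlag` class, its fields, and the hash-based percentage rollout are assumptions made for illustration.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Set


@dataclass
class FeatureFlag:
    """Hypothetical flag: a kill switch, an allow-list, and a % rollout."""
    name: str
    enabled: bool = False            # global on/off switch ("rapid on and off")
    rollout_percent: int = 0         # share of remaining users who see the feature
    allowed_users: Set[str] = field(default_factory=set)  # always-on users

    def is_enabled_for(self, user_id: str) -> bool:
        if not self.enabled:
            return False
        if user_id in self.allowed_users:
            return True
        # Hash flag name + user id so each user lands in a stable bucket
        # (0-99) that is independent across different flags.
        digest = hashlib.sha256(f"{self.name}:{user_id}".encode()).hexdigest()
        return int(digest, 16) % 100 < self.rollout_percent
```

Because the code ships with the flag off, the feature can be deployed, tested by allow-listed users in production, and then ramped up or switched off without another deployment. The stable bucketing also gives consistent treatment and control groups for experiments.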
Canaries and Dark Launches
• Canary Deployment
  o Deploy to a small number of systems or users
  o Avoid catastrophic failures
• Dark Launch
  o Execute the code and systems for a feature without displaying results to the user
  o Work out performance bottlenecks and system interactions
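A dark launch can be sketched as running the new code path alongside the current one and logging, but never returning, its result. This is an illustrative sketch only: `serve_with_dark_launch`, `current_impl`, and `candidate_impl` are hypothetical names, and a production version would likely run the candidate asynchronously so it cannot add latency to the user's request.

```python
import logging
import time
from typing import Any, Callable

logger = logging.getLogger("dark_launch")


def serve_with_dark_launch(request: Any,
                           current_impl: Callable[[Any], Any],
                           candidate_impl: Callable[[Any], Any]) -> Any:
    """Serve the request with the current implementation while also
    exercising the candidate; the user only ever sees the current result."""
    result = current_impl(request)
    try:
        started = time.perf_counter()
        shadow = candidate_impl(request)
        elapsed_ms = (time.perf_counter() - started) * 1000
        # Record agreement and timing to surface bugs and bottlenecks.
        logger.info("dark launch: match=%s elapsed_ms=%.1f",
                    shadow == result, elapsed_ms)
    except Exception:
        # A failing candidate must never affect the user-visible response.
        logger.exception("dark launch candidate failed")
    return result
```

The logged match rate and timings are what let a team work out performance bottlenecks and system interactions before any user depends on the new path.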
Faster is Better
Thanks!
• Stitch Fix is hiring!
  o www.stitchfix.com/careers
  o Based in San Francisco
  o Hiring everywhere!
  o More than half remote, all across US
  o Application development, Platform engineering, Data Science
• Please contact me
  o @randyshoup
  o linkedin.com/in/randyshoup