WHEN THE CEPH HITS THE FAN Dr. Wolfgang Schulze Director Global Storage Consulting Practice Red Hat October 20, 2016

Red Hat Storage Day - When the Ceph Hits the Fan


CAN THE CEPH EVEN HIT THE FAN?

•  After all…
    •  Architecture has no single point of failure
    •  Code base is very solid and had many years to mature
    •  Designed from the ground up to accommodate for failures
    •  Supposed to be self-healing and self-managing
    •  It simplifies day-to-day data center operations

WHAT IS “HITTING THE FAN”, ANYWAYS?

•  Example scenarios:
    •  Heavy storm takes out data center, cluster fails to restart automatically
    •  Increased workload makes cluster unstable
    •  Performance is fine when cluster is empty to moderately filled, but when getting close to physical capacity, write performance drops
    •  Nearly full cluster has become unresponsive and denies writes
    •  Bulk deletion of objects takes so long that the client application times out
    •  Rebalancing after a partial electric outage impacts clients with slow/blocked requests
•  Result in each case: customer files a support ticket
    •  Sev 1: Production is down
    •  Sev 2: Production is impacted

TICKET QUEUE IN RED HAT SUPPORT

Real screenshot, dated 2016-10-19
Customer names removed
Many of these tickets could have been avoided if best practices had been followed

A SAD, BUT TRUE STORY

•  Customer bought Red Hat Ceph Storage subscriptions
•  They were sure they had enough experience on their team and specifically declined offers for training and consulting
•  They designed and deployed Ceph cluster without guidance
    •  Originally for feasibility study, but everything seemed to work fine, so they put it into production
    •  Nobody noticed that the journal size was configured to only 100 MB instead of best practice size of 5 GB
•  A couple of months later after a power failure, the Ceph cluster failed to recover
•  Support ticket went on for several weeks, at the end some permanent data loss
•  End result: Partial data loss, unhappy management, unhappy customers
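A misconfigured journal size like the one in this story is the kind of thing a small pre-production check can catch. The following is a minimal sketch, not a Red Hat tool: it assumes a FileStore-style `ceph.conf` where `osd journal size` is given in MB, and it uses the 5 GB figure from this slide as the threshold. The function name and the sample configuration are made up for illustration.

```python
import configparser

# Best-practice journal size quoted on the slide (5 GB), expressed in MB.
RECOMMENDED_JOURNAL_MB = 5 * 1024


def journal_size_warnings(conf_text, minimum_mb=RECOMMENDED_JOURNAL_MB):
    """Return one warning per section whose journal size is below minimum_mb."""
    parser = configparser.ConfigParser(strict=False, interpolation=None)
    parser.read_string(conf_text)
    warnings = []
    for section in parser.sections():
        for key, value in parser.items(section):
            # Ceph accepts both "osd journal size" and "osd_journal_size".
            if key.replace("_", " ").strip() != "osd journal size":
                continue
            size_mb = int(value)
            if size_mb < minimum_mb:
                warnings.append(
                    f"[{section}] osd journal size = {size_mb} MB, "
                    f"below {minimum_mb} MB"
                )
    return warnings


# Hypothetical configuration resembling the one described in the story.
sample_conf = """
[global]
fsid = 00000000-0000-0000-0000-000000000000

[osd]
osd journal size = 100
"""

for warning in journal_size_warnings(sample_conf):
    print(warning)
```

On a real node the text would come from the cluster's own configuration file rather than a string literal. The check only looks at the configured value and says nothing about how existing journals were actually created.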

SOME COMMON MISCONCEPTIONS

•  The new tools make Ceph easy to set up
•  You don’t need detailed planning or architecture design
•  Ceph works on any hardware, and you can mix & match hardware
•  Storage infrastructure people will know how to handle the product
•  Server people will know how to handle the product
•  Ceph community bits are just fine (“We use a stable release”)
•  Using community bits is more “cutting edge”

COMMON TROUBLE #1 UPSTREAM BITS FOR PRODUCTION SYSTEMS

Observation
•  User is running upstream bits
•  This happens even with users who are paying for a Red Hat Support subscription
•  People misinterpret the phrase “stable release” in community release notes

Problem
•  Red Hat Support won’t be able to help
•  Red Hat only supports long term stable releases
•  What could be a safe and fully documented upgrade to a newer LTS version suddenly becomes a “migration” with risks and pitfalls

Mitigation
•  Use supported bits, stay informed about roadmap, get involved

COMMON TROUBLE #2 USE OF UNSUPPORTED FEATURES

Observation
•  User deploys system into production using features which are not (yet) supported
•  Examples: CephFS, BlueStore

Problem
•  Red Hat Support won’t be able to help
    •  Unless you have a support exception, the conversation may end quickly
    •  Red Hat Engineering will not build hotfixes for you

Mitigation
•  Try to get a support exception from Red Hat
•  Don’t use the feature

COMMON TROUBLE #3 USE OF UNSUPPORTED CONFIGURATIONS

Observation
•  Users deploy Ceph in a way that is not approved and has not been tested
•  Examples:
    •  Running Ceph on unsupported operating system versions (e.g. Gentoo, Debian)
    •  Deploying

Problem
•  Red Hat Support won’t be able to help
    •  Unless you have a support exception, the conversation may end quickly
    •  Red Hat Engineering will not build hotfixes for you

Mitigation
•  Read documentation, consider health check before go-live

COMMON TROUBLE #4 POORLY MANAGED CLUSTER GROWTH

Observation
•  Adding disks (or even entire nodes) to clusters of relatively small total capacity
•  Backfill/recovery starves client I/O

Problem
•  In older versions of Ceph, default configuration values are not ideal for this (osd_max_backfills, osd_recovery_max_active, osd_recovery_op_priority)
•  If you fail to adjust these before you change the physical configuration, you will indeed have huge impact

Mitigation
•  Know your stuff, think ahead, estimate impact, gradually weigh in
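One way to read "gradually weigh in" is to throttle recovery first and then raise a new OSD's CRUSH weight in small steps. The sketch below only builds the command strings and runs nothing. The throttle values of 1, the OSD id, the target weight, and the step count are illustrative assumptions, not recommendations from the slides, and the helper names are hypothetical.

```python
def throttle_command(max_backfills=1, recovery_max_active=1, recovery_op_priority=1):
    """Build a 'ceph tell ... injectargs' command that lowers recovery pressure."""
    args = (
        f"--osd-max-backfills {max_backfills} "
        f"--osd-recovery-max-active {recovery_max_active} "
        f"--osd-recovery-op-priority {recovery_op_priority}"
    )
    return f"ceph tell 'osd.*' injectargs '{args}'"


def reweight_plan(osd_id, target_weight, steps):
    """Return 'ceph osd crush reweight' commands that reach the target in equal steps."""
    if steps < 1:
        raise ValueError("steps must be at least 1")
    commands = []
    for i in range(1, steps + 1):
        weight = round(target_weight * i / steps, 4)
        commands.append(f"ceph osd crush reweight osd.{osd_id} {weight}")
    return commands


# Hypothetical example: bring osd.12 up to CRUSH weight 2.0 in four steps.
print(throttle_command())
for command in reweight_plan(osd_id=12, target_weight=2.0, steps=4):
    print(command)
```

In practice an operator would wait for the cluster to settle between steps and restore the original throttle values afterwards. Neither of those is modelled here.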

COMMON TROUBLE #5 POOR SKILLS AND OPERATIONAL PRACTICES

Observations
•  Subject matter experts who brought Ceph to the organization were hired guns, or employees who have since left
•  Team that ends up managing cluster considers it some sort of black art

Problem
•  Operators who don’t know what they are doing put your data at risk
•  The built-in safety/durability may be compromised

Mitigation
•  Make sure users receive proper training, and avoid staff SPOF
•  Conduct controlled emergency drills to practice for outages
•  Maintain separate cluster with same version for experiments and dry run, or learn how to do it with a cloud based environment

COMMON TROUBLE #6 RISKY CONFIGURATION CHOICES

Observations
•  Users read somewhere that mounting XFS OSD’s with the ‘nobarrier’ option will result in performance gains

Problem
•  While the performance gets noticeably better, you are introducing a risk for data corruption during power outages
•  The built-in safety/durability may be compromised

Mitigation
•  Do not use ‘nobarrier’ mount option unless you understand fully what hardware you have, and unless you know what you are doing
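Whether any OSD file system is mounted with 'nobarrier' can be checked from the mount table. The sketch below assumes input in the Linux `/proc/mounts` layout (device, mount point, file system type, comma-separated options) and that the option appears there literally as `nobarrier`. The function name and the sample lines are invented for illustration.

```python
def xfs_mounts_without_barriers(mounts_text):
    """Return mount points of XFS file systems mounted with 'nobarrier'."""
    flagged = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        _device, mount_point, fs_type, options = fields[:4]
        if fs_type == "xfs" and "nobarrier" in options.split(","):
            flagged.append(mount_point)
    return flagged


# Hypothetical mount table with one risky OSD mount.
sample_mounts = """\
/dev/sdb1 /var/lib/ceph/osd/ceph-0 xfs rw,noatime,nobarrier,inode64 0 0
/dev/sdc1 /var/lib/ceph/osd/ceph-1 xfs rw,noatime,inode64 0 0
/dev/sda1 /boot ext4 rw,relatime 0 0
"""

print(xfs_mounts_without_barriers(sample_mounts))
```

On a live host the same function could be fed the contents of `/proc/mounts`. A hit is a reason to ask about the underlying hardware rather than proof of a problem.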

COMMON TROUBLE #7 POOR NETWORK CONFIGURATION

Observations
•  Users don’t pay enough attention to network configuration
•  Network inconsistencies (e.g. Jumbo Frames) and bottlenecks go undetected … until Ceph performs poorly.

Problem
•  Troubleshooting networking issues is difficult and experts hard to find
•  Ceph heavily relies on proper configuration

Mitigation
•  Invest in your team and network maintenance skills
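A jumbo-frame mismatch is one inconsistency that is easy to test for once the MTU of each cluster-facing interface has been collected. How the values are gathered is left open here (on Linux they can be read from `/sys/class/net/<interface>/mtu`). The sketch assumes the most common MTU is the intended one, which may not hold in every environment, and the host names are made up.

```python
from collections import Counter


def mtu_outliers(mtu_by_host):
    """Return hosts whose MTU differs from the most common MTU in the mapping."""
    if not mtu_by_host:
        return {}
    expected, _count = Counter(mtu_by_host.values()).most_common(1)[0]
    return {host: mtu for host, mtu in mtu_by_host.items() if mtu != expected}


# Hypothetical inventory: one node was left at the standard 1500-byte MTU.
collected = {
    "osd-node-1": 9000,
    "osd-node-2": 9000,
    "osd-node-3": 1500,
    "mon-1": 9000,
}

print(mtu_outliers(collected))
```

This only compares host settings. Switch ports along the path need the same scrutiny, and that is outside what this check can see.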

WHAT TO DO WHEN THINGS WENT WRONG

1.  Stay calm and don’t make it worse!
    •  Poorly skilled operators may turn a problem into a catastrophe
2.  Contact Red Hat Support immediately
    •  Sev 1 and Sev 2 issues are handled with top priority
    •  Chances are that they will be able to help right away and get your cluster humming again
3.  Contact your trusted Red Hat Services or Sales contacts
    •  If problems persist or you feel you need extra help, you might want to get a Ceph expert from Red Hat Professional Services

GOOD PRACTICES TO AVOID PROBLEMS

1.  Don’t stumble into implementation/deployment without careful planning
    •  Capture and document requirements, do a POC, do an actual design
    •  Engage experts early to help with cluster design and hardware choices
2.  Unless you love to take risks, use supported bits
3.  Stay close to the recommended reference architectures from Red Hat partners
4.  Make sure your staff receives proper training
    •  Red Hat Global Learning provides excellent training for Gluster and Ceph
5.  Plan for growth
6.  Don’t let things linger. Ceph does not like it when the cluster is 90% full
7.  Have an expert perform regular Storage Health Checks to detect problems while they are still small
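Points 5 and 6 can be made concrete with a little arithmetic: a cluster that looks comfortable today may be close to full once a node fails and its data has to be re-replicated onto the remaining capacity. The sketch below is a rough model. The 0.85 and 0.95 thresholds are assumed to match Ceph's near-full and full ratio defaults of that era and should be checked against the running version. The capacities are invented, and the model ignores uneven data distribution across OSDs.

```python
def fill_state(used_tb, total_tb, nearfull=0.85, full=0.95):
    """Classify raw utilization against assumed near-full and full thresholds."""
    ratio = used_tb / total_tb
    if ratio >= full:
        return "full"
    if ratio >= nearfull:
        return "nearfull"
    return "ok"


def ratio_after_node_loss(used_tb, total_tb, node_tb):
    """Utilization once one node's capacity is gone and its data is re-replicated."""
    remaining_tb = total_tb - node_tb
    if remaining_tb <= 0:
        raise ValueError("node capacity must be smaller than total capacity")
    return used_tb / remaining_tb


# Hypothetical cluster: 100 TB raw across five 20 TB nodes, 70 TB used.
used, total, node = 70.0, 100.0, 20.0
print(fill_state(used, total))                         # state today
print(ratio_after_node_loss(used, total, node))        # utilization after losing a node
print(fill_state(used, total - node))                  # state after losing a node
```

Running the numbers for planned growth in the same way shows how much headroom is left before a single failure pushes the cluster into the range the slide warns about.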

STORAGE DESIGN CONSULTING

•  Specialists from Red Hat Consulting will help plan your Ceph deployment
•  Start: Storage Discovery Session
•  We can help discover requirements and design a storage solution that matches
•  You will receive a detailed Storage Solution architecture document which will articulate design choices and lay out a step-by-step plan for implementation

STORAGE HEALTH CHECKS

•  Standard 3-day engagement done by Red Hat storage experts
•  Comprehensive top-to-bottom analysis of your software-defined storage platform
•  Six focus areas
    1.  Lifecycle
    2.  Configuration
    3.  Organization
    4.  Use Case
    5.  Hardware
    6.  Operational
•  Clear read-out of issues
•  Actionable recommendations

POSITIVE NOTE

•  I asked my consultants for feedback on this presentation. Here is one comment

WHERE TO GO NEXT

RED HAT SUBSCRIPTIONS https://access.redhat.com/subscription-value
Evaluation, Pre-production, and Production subscriptions available

CONSULTING http://www.redhat.com/en/services/consulting/storage

TRAINING https://www.redhat.com/en/services/training

TEST DRIVE http://red.ht/cephtestdrive

To engage a Territory Service Manager in your area, ask for a local Red Hat Storage sales professional at: NORTH AMERICA: 1 (888) REDHAT-1; LATIN AMERICA: 54 (11) 4329-7300; EMEA: 00800 7334 2835; APJ: 65 6490 4200; Brazil: 55 (11) 3529-6000; Australia: 1800 733 428; New Zealand: 0800 733 428