AGENDA
• Introduction
• 3 Tier Approach
• Forwarding Architecture
• Indexing Architecture
• Search Architecture
• Sizing Recap
• Sizing Examples
• Monitoring
• Q&A
Sizing Considerations
Vital Info
• Amount of incoming data
• Amount of indexed (stored) data
• Number of concurrent users
• Number of saved searches
• Types of searches
• Specific Splunk apps
http://docs.splunk.com/Documentation/Splunk/latest/Installation/Performancechecklist
Splunk 3 Tier Architecture
Enterprise-class Scale, Resilience and Interoperability
Send data from thousands of servers using any combination of Splunk forwarders
Auto load-balanced forwarding to Splunk Indexers
Offload search load to Splunk Search Heads
Reference Hardware
All instances x64, CPU > 2 GHz per core
Indexer
• Core Splunk*: 12 CPU cores, 12 GB of RAM, 800 IOPS/indexer, RAID 1+0, data ingest: 150-250 GB/day
• Enterprise Security (ES)†: 16 CPU cores, 32 GB of RAM, 800 IOPS/indexer, RAID 1+0, data ingest: 100 GB/day
Search Head
• Core Splunk*: 16 CPU cores, 12 GB of RAM, 2 x 300 GB 10k rpm SAS in RAID 1
• Enterprise Security (ES)†: 16 CPU cores, 32 GB of RAM, 2 x 300 GB 10k rpm SAS in RAID 1
* http://docs.splunk.com/Documentation/Splunk/latest/Capacity/Referencehardware
† http://docs.splunk.com/Documentation/ES/latest/Install/DeploymentPlanning
Required Reading
Distributed Deployment Manual
• http://docs.splunk.com/Documentation/Splunk/latest/Deploy/Distributedoverview
Highlights
• Reference hardware specs
• How searches affect performance
• Dense/Rare/Sparse
• App considerations
• Summary table
Forwarding Tier
Design Factors
Syslog Collectors (HA)
DB Connect Inputs
• E.g. McAfee ePO data
TA Inputs
• E.g. Check Point
Assorted Inputs
• Microsoft AD logs
• Microsoft Exchange Server
• Microsoft SharePoint logs
• Log4j, Linux, IIS
• …
Syslog Collectors
• Best practice to use dedicated syslog servers
• Syslog-NG/rsyslog recommended
• Syslog can write events to dedicated log files, allowing for easy sourcetype classification on inputs
Syslog Collectors
• Using a Load Balancer/VIP with Linux Heartbeat to provide failover for the syslog listener
• Syslog-NG PE client-side failover
[Diagram: Router (Physical), two Load Balancers, two Syslog-NG Servers; Syslog 514/tcp & 514/udp]
Forwarder for TAs
• TA-McAfee requires DB Connect to pull endpoint events
• TA-Checkpoint uses the LEA Client to retrieve firewall log events
• Not an HA design, but could be hosted on a VM to standby or failover
[Diagram: Heavy Forwarder (Linux) with TA-McAfee (DB Connect) and TA-Checkpoint; ePO Database; Checkpoint Server; Firewall]
Deployment Server
● Deployment Server to manage Linux and Windows forwarders
● Not an HA design, but could be hosted on a VM to standby or failover
[Diagram: Deployment Server; Splunk Forwarders get apps from splkds.internal.door2door.com:8089]
Forwarding Tier
[Diagram: Linux Forwarders; Windows Forwarders (Windows SharePoint Server, Windows AD Server); two Syslog-NG Servers with Router (Physical) and two Load Balancers, Syslog 514/tcp & 514/udp; Heavy Forwarder (Linux) with TA-McAfee (DB Connect) and TA-Checkpoint, ePO Database, Checkpoint Server, Firewall; Deployment Server; Indexers. Splunk AutoLB to splkidx.internal.door2door.com:9997. Splunk Forwarders get apps from splkds.internal.door2door.com:8089.]
Forwarding Tier Design Best Practices
Use a Syslog Server for Syslog data
Be careful with intermediate forwarders
• They can introduce bottlenecks
• Reduce the distribution of events across Indexers
May need to increase the UF thruput setting for high velocity sources
• [thruput]
• maxKBps
AutoLB will spread over all available indexers, but don't assume evenly!
• Enable forceTimebasedAutoLB for UF
Data Distribution Imbalance
Even data distribution is crucial in parallel computing
Ways to improve data distribution:
• Enable parallel pipelines on heavy forwarders (in server.conf)
• Route directly from Universal Forwarders where possible
• Make the following changes to forwarders' outputs.conf:
• forceTimebasedAutoLB = true
• autoLBFrequency = x
Examine saved search time windows. The example below has many searches over a 5 minute window, and some searches over a 1 minute window. The search window (5 minutes, or 1 minute if possible) should be evenly divisible by autoLBFrequency times the number of indexers.
| tstats summariesonly=t count WHERE index="*" by splunk_server _time | timechart span=5m sum(count) by splunk_server
6 Indexers; autoLBFrequency=30: Uneven distribution of workload over 5 minute periods. Unpredictable workload variation
6 Indexers; autoLBFrequency=15: Better distribution over 5 minutes. autoLBFrequency=10 would be even better as there are 6 indexers
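The divisibility guideline above can be checked with a few lines of Python. This is a minimal sketch under a simplifying assumption that is not stated on the slide: a forwarder is modelled as visiting each indexer once per rotation, so one full rotation lasts autoLBFrequency times the number of indexers. The function names are hypothetical.

```python
def rotation_seconds(num_indexers: int, auto_lb_frequency: int) -> int:
    """Seconds for one forwarder to visit every indexer once (simplified model)."""
    return num_indexers * auto_lb_frequency


def window_is_balanced(num_indexers: int, auto_lb_frequency: int,
                       window_seconds: int = 300) -> bool:
    """True when the search window holds a whole number of rotations."""
    return window_seconds % rotation_seconds(num_indexers, auto_lb_frequency) == 0


if __name__ == "__main__":
    # The three settings discussed on the slide, for 6 indexers and a 5 minute window.
    for frequency in (30, 15, 10):
        print(frequency,
              rotation_seconds(6, frequency),
              window_is_balanced(6, frequency))
```

Under this model only autoLBFrequency=10 gives a rotation (60 seconds) that fits a whole number of times into a 5 minute window, which is consistent with the slide's remark that 10 would be better than 15 for 6 indexers.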
Data Imbalance - Troubleshoot
Troubleshooting:
• Validate firewall rules are in place
• Check that all forwarders have the correct outputs
• Ensure indexers are all listening on the proper port
• Does splunkd.log have anything to say?
• Use the Indexing Overview and Configuration Overview (btool saves the day)
Other Causes:
• Simple misconfiguration
• Data processing queues filling up and forwarders timing out and jumping to the next indexer
• Check Distributed Indexing Performance in the DMC for queue filling - a typical sign of disk performance issues
• Indexer affinity - the forwarders get stuck to one indexer because EOF is never met
• forceTimebasedAutoLB can help! http://blogs.splunk.com/2014/03/18/time-based-load-balancing/
How Many Deployment Servers?
Rule of thumb says: 1 per 10k clients @ 10 – 15 min polling period
Adjust polling period to increase total clients supported
Small deployments can share the same instance as other management instances (License Master, Cluster Master, etc.)
Low requirement for disk performance (good candidate for virtualization)
Or use something other than deployment server
• Puppet, SCCM, CFEngine, Chef…
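The rule of thumb above translates directly into arithmetic. The sketch below is illustrative only: the 10,000 clients per server figure comes from the slide, while the function names and the check-in rate calculation are assumptions added to show why a longer polling period lets one server handle more clients.

```python
CLIENTS_PER_SERVER = 10_000  # slide's rule of thumb at a 10-15 minute polling period


def deployment_servers_needed(clients: int,
                              clients_per_server: int = CLIENTS_PER_SERVER) -> int:
    """Round the client count up to whole deployment servers."""
    if clients <= 0:
        return 0
    return (clients + clients_per_server - 1) // clients_per_server


def checkins_per_second(clients: int, polling_minutes: float) -> float:
    """Average check-in rate if clients poll evenly across the period."""
    return clients / (polling_minutes * 60)


if __name__ == "__main__":
    print(deployment_servers_needed(25_000))
    print(round(checkins_per_second(10_000, 10), 1))
    print(round(checkins_per_second(10_000, 15), 1))
```

Lengthening the polling period lowers the average check-in rate per client, which is the lever the slide refers to when it says to adjust the polling period to support more clients.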
Indexing Tier
Design Factors
Peak ingest volume
High Availability – Indexer Replication
10% Disk Space Contingency
Data retention
Cluster Sizing Calculator
http://splunk-sizing.appspot.com
How Many Indexers?
Rule of thumb says:
1 indexer per 150 - 250 GB/day
80 – 100 GB with Enterprise Security
Leave room for:
• Daily peaks
Need more indexers for:
• Heavy reporting
• More users
• Slower disks, slower CPUs, fewer CPUs
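As a rough illustration of the rule of thumb above, the following sketch turns a daily ingest volume into a range of indexer counts. The per-indexer ranges are taken from the slide; the function name and the `peak_factor` parameter (a multiplier for daily peaks) are hypothetical and would need to be chosen from real measurements.

```python
import math

CORE_GB_PER_INDEXER = (150, 250)  # slide: 1 indexer per 150-250 GB/day
ES_GB_PER_INDEXER = (80, 100)     # slide: 80-100 GB/day with Enterprise Security


def indexers_needed(daily_gb: float, enterprise_security: bool = False,
                    peak_factor: float = 1.0) -> tuple:
    """Return (optimistic, conservative) indexer counts for a daily volume."""
    low, high = ES_GB_PER_INDEXER if enterprise_security else CORE_GB_PER_INDEXER
    peak_gb = daily_gb * peak_factor  # assumed headroom for daily peaks
    return math.ceil(peak_gb / high), math.ceil(peak_gb / low)


if __name__ == "__main__":
    print(indexers_needed(1000))
    print(indexers_needed(1000, enterprise_security=True))
    print(indexers_needed(1000, peak_factor=1.5))
```

The result is only a starting point: heavy reporting, more users, or slower hardware all push the count toward or beyond the conservative end.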
Storage Calculations
RAID Configuration
• Amount of raw disk
• Fault tolerance
• Available IOPS
Filesystem Overhead
• inodes consume space
Wiggle room
• Additional replicated buckets when a node fails
• Unbalanced replicated buckets
Splunk internal logs, Summary Indexes, Report Acceleration, Accelerated Data Models
Storage Types
Local vs Direct Attached vs SAN vs NAS
SSD/Flash vs Spinning Disk
• SSDs offer much higher IOPS with much lower latency
• Significant performance increases with Sparse Searches
Index Replication (aka Index Clustering)
What is it?
• Data is replicated to 1 or more indexers based on indexes
• Splunk Cluster Master controlled
Basics
• Master Node (manages indexing and searching location)
• Horizontal Scaling
HA vs DR
• HA - Data is made available on 1 or more indexers in one location
• DR – Multisite clustering. All data exists in multiple locations
Benefits of Clustering
• Data redundancy
• Data availability
• Indexer resiliency
• Simpler management of indexers
• Simpler setup of distributed search
• Multi-site clustering allows site-specific search to reduce WAN traffic
Index Clustering Sizing
Replication factor
• Determine the number of rebuildable copies of data to maintain
Search factor
• Determine the number of searchable copies of the data
Data Retention equation based on syslog data
• Total disk usage across cluster in GB = (RepFactor * 0.15 + SearchFactor * 0.35) * DatasetSizeGB
Increase in I/O, CPU, and disk requirement
• Means daily indexing volume per server will be lower
Search factor increases disk usage by ~30% (raw data + tsidx)
Replication factor increases disk usage by ~10% (only raw data)
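The data retention equation above is easy to evaluate for a given cluster. The sketch below simply encodes the slide's formula; the function and parameter names are hypothetical, and the 0.15 and 0.35 coefficients are the slide's syslog-based figures, so other data types may compress differently.

```python
def cluster_disk_usage_gb(rep_factor: int, search_factor: int,
                          dataset_size_gb: float) -> float:
    """Total disk usage across the cluster, per the slide's equation.

    0.15 and 0.35 are the slide's syslog-based coefficients for the
    replication factor and search factor terms respectively.
    """
    return (rep_factor * 0.15 + search_factor * 0.35) * dataset_size_gb


if __name__ == "__main__":
    # Example: replication factor 3, search factor 2, 1000 GB of data.
    print(round(cluster_disk_usage_gb(3, 2, 1000)))
```

Dividing the result by the number of indexers gives a first estimate of per-indexer storage, before adding the contingency and overhead items listed under Storage Calculations.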
Cluster Master Server
• Indexer apps are deployed via the CM
• Not an HA design, but could be hosted on a VM to standby or failover
Search Tier
Design Factors
• High Availability
• Search Head Clustering
• # users
• # concurrent searches
• Forward all data to indexers
Search Head Clustering
What is it?
• Group search heads into a cluster as a single entity
• Provides HA at the Search Head layer
• Search Head Captain controlled
• Raft protocol to pick captain
Basics
• A captain gets elected dynamically (pre 6.3) or can be defined manually (6.3)
• Knowledge objects and search artifacts are replicated
• Search workload distribution
• Replication using local storage, NOT over NFS
SHC & Deployer
• Search Head Cluster apps need to be installed by the Deployer
• A minimum of 3 Search Heads are required for a SHC
• No Exchange, no DBX with SHC
• ES will still require a separate Search Head or dedicated SHC
• Use LDAP/AD/SSO for user authentication
• Load Balancer configured for sticky sessions
How Many Search Heads?
Rule of thumb says: 1 per 20 – 40 concurrent queries
Limit is concurrent queries
Search query normally uses up to 1 CPU core
• 6.3 parallelization can leverage more
Don't add search heads; add indexers: indexers do most work
• Unless you need HA/Search Clustering
Scale vertically if infrastructure allows it. Add CPU, add memory.
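A small sketch of the search head rule of thumb, assuming the 20 to 40 concurrent queries per search head range from the slide. The function name is hypothetical, and the result ignores the separate requirement of at least 3 members for a Search Head Cluster.

```python
QUERIES_PER_SEARCH_HEAD = (20, 40)  # slide: 1 search head per 20-40 concurrent queries


def search_heads_needed(concurrent_queries: int) -> tuple:
    """Return (optimistic, conservative) search head counts."""
    if concurrent_queries <= 0:
        return (0, 0)
    low, high = QUERIES_PER_SEARCH_HEAD
    optimistic = (concurrent_queries + high - 1) // high    # 40 queries per head
    conservative = (concurrent_queries + low - 1) // low    # 20 queries per head
    return optimistic, conservative


if __name__ == "__main__":
    print(search_heads_needed(100))
```

Because indexers do most of the work, a result at the optimistic end combined with more indexers is often closer to the slide's advice than adding search heads.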
Real World Examples
Cisco Unified Computing System (UCS)
• Search Head: UCS C220 M4, 24 cores
• Indexer: UCS C240 M4, 24 cores
Cisco Validated Design (CVD) for UCS
267 page Reference Manual for deploying 1 TB/day on UCS
Validated and Benchmarked by Cisco and Splunk
Distributed Deployment – Common Components
Search-Head: 3 X Cisco UCS C220-M4 Rack Servers, each with:
▫ CPU: 2 X E5-2680 v3 (24 cores)
▫ Memory: 256 GB
▫ Cisco 12 Gbps SAS modular RAID controller (2 GB FBWC cache)
▫ Cisco VIC 1227
▫ 2 X 600 GB 15K SFF SAS drives (RAID 1)
Admin/Master Nodes: 2 X Cisco UCS C220-M4 Rack Servers, each with:
▫ 2 X E5-2620 v3 (12 cores)
▫ Memory: 256 GB
▫ Cisco 12 Gbps SAS modular RAID controller (2 GB FBWC cache)
▫ Cisco VIC 1227
▫ 2 X 600 GB 15K SFF SAS drives (RAID 1)
Network Fabric: 2 X Cisco UCS 6248UP 48-Port Fabric Interconnects
Distributed Deployment – Retention vs. Performance

Distributed Deployment with High Capacity
Indexer: 16 X C240-M4 rack servers, each with:
▫ CPU: 2 X E5-2680 v3 (24 cores)
▫ Memory: 256 GB
▫ Cisco 12 Gbps SAS modular RAID controller (2 GB FBWC cache)
▫ Cisco VIC 1227
▫ 24 X 1.2 TB 10K SAS in RAID 10
▫ 2 X 120 GB SSD in RAID 1 for OS
Retention Capability: >1 TB/Day w/ 1 year+ retention
Indexing Capacity: 4 TB/Day
Indexing Capacity w/ Replication: 2 TB/Day
Raw Index Capacity: 236 TB
Expected Data Capacity: At 2:1 compression: 472 TB
Key Use-Cases:
▫ Enterprises requiring larger data retention
Servers Count: 21 (37 RU)
Scalability:
▫ Additional Search-Head(s)
▫ 1 to 16 additional Indexers (refer to High Capacity Indexer configuration)

Distributed Deployment with High Performance
Indexer: 16 X C220-M4 rack servers, each with:
▫ CPU: 2 X E5-2680 v3 (24 cores)
▫ Memory: 256 GB
▫ Cisco 12 Gbps SAS modular RAID controller (2 GB FBWC cache)
▫ Cisco VIC 1227
▫ 6 X 800 GB SSD-EP in RAID 5
▫ 2 X 600 GB 10K SFF SAS HDD w/ RAID 1 for OS
Retention Capability: >1.25 TB/Day w/ 90 day retention
Indexing Capacity: 8 TB/Day
Indexing Capacity w/ Replication: 4 TB/Day
Raw Index Capacity: 64 TB
Expected Data Capacity: At 2:1 compression: 128 TB
Key Use-Cases:
▫ Ability to support large number of concurrent users that require faster response time
Servers Count: 21 (21 RU)
Scalability:
▫ Additional Search-Head(s)
▫ 1 to 16 additional Indexers (refer to High Performance Indexer configuration)
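The retention figures in the comparison above can be sanity-checked from the raw index capacity and the 2:1 compression ratio. This is a back-of-the-envelope sketch with a hypothetical function name; it assumes a single copy of the data and ignores replication and other overheads, so real retention would be lower.

```python
def retention_days(raw_capacity_tb: float, compression_ratio: float,
                   daily_ingest_tb: float) -> float:
    """Days of data that fit, assuming one copy and no other overhead."""
    expected_capacity_tb = raw_capacity_tb * compression_ratio
    return expected_capacity_tb / daily_ingest_tb


if __name__ == "__main__":
    # High Capacity: 236 TB raw, 2:1 compression, 1 TB/day
    print(retention_days(236, 2, 1.0))
    # High Performance: 64 TB raw, 2:1 compression, 1.25 TB/day
    print(retention_days(64, 2, 1.25))
```

Under these assumptions the High Capacity design holds more than a year of data at 1 TB/day and the High Performance design holds more than 90 days at 1.25 TB/day, in line with the stated retention capabilities.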
Cloud Deployments
Cloud Considerations
• Authentication restrictions
• Data transfer costs
• Security – SSL Tunnel
• Zones
• Hybrid deployments
VMware: http://www.splunk.com/web_assets/pdfs/secure/Splunk_and_VMware_VMs_Tech_Brief.pdf
AWS: https://www.splunk.com/pdfs/technical-briefs/deploying-splunk-enterprise-on-amazon-web-services-technical-brief.pdf
Azure: http://www.splunk.com/pdfs/technical-briefs/deploying-splunk-enterprise-on-microsoft-azure.pdf
Real World Examples
Amazon Web Services EC2
• Search Head: c4.4xlarge + EBS storage, c4.8xlarge + EBS storage
• Indexer: c4.4xlarge + EBS storage, c4.8xlarge + EBS storage, d2.4xlarge (IR)
Built for 100% Uptime
• High availability across Indexers & Search Heads
• Multiple AWS availability zones
• Dedicated Cloud environments: Secure, 10x Bursting
• Splunk Cloud fully monitored using Splunk Enterprise
What You Do: Forward data, Search, Monitor, Get value fast
What We Do: Hardware setup, Storage, Scaling, Monitoring
Hybrid Search
[Diagram: Search Head(s) and Indexer(s) across On Premises, Private Cloud and Public Cloud]
Single Pane of Glass Visibility
Top 5 Things To Consider
• Indexer storage requirements
• Minimum buy-in for a SHC is 3
• Use VMs for CM/LS/DS/Deployer if possible
• Consider a dedicated SH for management
• Distributed Management Console
• Splunk Health Check Overview App
• Search Activity App
• When in doubt – add another Indexer
More Is Better?
CPUs
• 8, 12, 16, 24, 32, etc.
• Pipelines - new 6.3 feature for parallelization!
• Indexing can handle higher bursts with multiple index pipeline sets
• Certain searches can be improved with multiple search pipeline sets
• Historical batch – return the data without worrying about time order (… | stats count)
• Indexers still need to do the heavy lifting (search exists on indexer AND search head)
Memory
• Good for search heads and indexers (16+ GB)
• Benefits from extra RAM used by OS for caching
Disks
• Faster is better - 10k – 15k rpm strongly recommended, SSD preferred
• More disks in RAID 1+0 = Faster
• RAID 5+1 or 6 can be good for Cold buckets
• SSDs can also provide benefit for rare term searches and many concurrent jobs
Monitoring Tools
So what's out there and what's the difference?
Distributed Management Console (DMC) – Built in and only available on v6.2+
• http://docs.splunk.com/Documentation/Splunk/latest/Admin/ConfiguretheMonitoringConsole
• Splunk supported and focuses on all facets of the deployment
Fire Brigade
• https://splunkbase.splunk.com/app/1632/
• Detailed look at index/bucket activity and capacity
SoS (Splunk on Splunk)
• https://splunkbase.splunk.com/app/748/
• Legacy Splunk troubleshooting tool
Splunk Health Overview
• https://splunkbase.splunk.com/app/1919/
• Combination of views found to be helpful in the field
Note: Deployment monitor app is deprecated – try to stay away from it
Many of these app functionalities are being rolled into the DMC
How are things, overall?
High level environment status – quick view of what's up/down/not reporting:
• Forwarder health - finding forwarders that we haven't seen for a while
• Data source health - how are our data feeds doing?
• REST endpoints (| rest /services/server/info) - looking at system information, possibly underprovisioned ones
Spotting warnings and errors within Splunk _internal:
• index=_internal sourcetype=splunkd (log_level=ERROR OR log_level=WARN) | cluster showcount=t | table cluster_count host log_level message | sort - cluster_count | rename cluster_count AS count, log_level AS level
• index=_internal sourcetype=splunkd log_level!=INFO | timechart count by component
Track resource usage:
• Say hello to _introspection (Splunk 6.1+)
• Captures disk and other resource metrics (by default on full installs)
• http://docs.splunk.com/Documentation/Splunk/latest/Troubleshooting/Abouttheplatforminstrumentationframework
Dashboards to help save the day:
• Health Status - Splunk Health Overview
• Instance - Distributed Management Console
• Indexing Performance - Distributed Management Console
• Resource Usage - Splunk Health Overview
• License Usage - Splunk Health Overview
Environment Overview
What are we reporting on?
• _internal
• _introspection
• metadata and using tstats http://docs.splunk.com/Documentation/Splunk/latest/SearchReference/Tstats
• REST endpoints
• | rest /services/server/info
• | rest /services/server/roles
• | rest /services/server/status/resource-usage
How to use the tools available to check overall health…