65
© 2018 SPLUNK INC. © 2018 SPLUNK INC. Architecting Splunk for High Availability and Disaster Recovery October 2018 | Version 1.0 Sean Delaney | Principal Architect | Splunk

Architecting Splunk for High Availability and Disaster ... · between Splunk and our competitors, stick to the facts. • Make sure all statements are not overstated and are supported

  • Upload
    others

  • View
    4

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Architecting Splunk for High Availability and Disaster ... · between Splunk and our competitors, stick to the facts. • Make sure all statements are not overstated and are supported

© 2018 SPLUNK INC.

© 2018 SPLUNK INC. CONFIDENTIAL INFORMATION. DO NOT DISTRIBUTE.

© 2018 SPLUNK INC.

Architecting Splunk for

High Availability and

Disaster Recovery

October 2018 | Version 1.0

Sean Delaney | Principal Architect | Splunk

Page 2: Architecting Splunk for High Availability and Disaster ... · between Splunk and our competitors, stick to the facts. • Make sure all statements are not overstated and are supported

© 2018 SPLUNK INC.

© 2018 SPLUNK INC. CONFIDENTIAL INFORMATION. DO NOT DISTRIBUTE.

During the course of this presentation, we may make forward-looking statements regarding future events or

the expected performance of the company. We caution you that such statements reflect our current

expectations and estimates based on factors currently known to us and that actual events or results could

differ materially. For important factors that may cause actual results to differ from those contained in our

forward-looking statements, please review our filings with the SEC.

The forward-looking statements made in this presentation are being made as of the time and date of its live

presentation. If reviewed after its live presentation, this presentation may not contain current or accurate

information. We do not assume any obligation to update any forward-looking statements we may make. In

addition, any information about our roadmap outlines our general product direction and is subject to change

at any time without notice. It is for informational purposes only and shall not be incorporated into any contract

or other commitment. Splunk undertakes no obligation either to develop the features or functionality

described or to include any such feature or functionality in a future release.

Splunk, Splunk>, Listen to Your Data, The Engine for Machine Data, Splunk Cloud, Splunk Light and SPL are trademarks and registered trademarks of Splunk Inc. in

the United States and other countries. All other brand names, product names, or trademarks belong to their respective owners. © 2018 Splunk Inc. All rights reserved.

Forward-Looking Statements

THIS SLIDE IS REQUIRED FOR ALL 3RD PARTY PRESENTATIONS.

Presentations to Third-Parties (i.e., non-Splunkers)

• When presenting to third parties, this slide must be

included immediately after the title slide.

• Only share confidential information on a “need-to-know”

basis and make sure the audience members are bound

by non-disclosure or confidentiality agreements.

• Before disclosing any customer or other third party

names, logos or use cases, confirm with Marketing that

we have the right consents.

• Don’t bash the competition. If making comparisons

between Splunk and our competitors, stick to the facts.

• Make sure all statements are not overstated and are

supported by facts.

Confidential Information

• If the presentation contains confidential information, a

confidentiality notice must appear on every slide.

Please contact [email protected] for help

implementing this notice: Confidential information. Do

not distribute.

• Examples of confidential information: financial results,

strategic business and marketing plans, product

launches and roadmaps, customer lists and use cases.

NOTICE FROM SPLUNK LEGAL

Page 3: Architecting Splunk for High Availability and Disaster ... · between Splunk and our competitors, stick to the facts. • Make sure all statements are not overstated and are supported

© 2018 SPLUNK INC.

Sean Delaney ▶ Principal Field Architect

▶ 7+ years at Splunk

▶ Large scale deployments

▶ 9th .conf

Page 4: Architecting Splunk for High Availability and Disaster ... · between Splunk and our competitors, stick to the facts. • Make sure all statements are not overstated and are supported

© 2018 SPLUNK INC.

© 2018 SPLUNK INC. CONFIDENTIAL INFORMATION. DO NOT DISTRIBUTE.

Agenda

Top Takeaways

Recover in the event of a disaster

Disaster Recovery

High Availability

• Data Collection

• Indexing & Searching

Maintain an acceptable level of continuous service

Page 5: Architecting Splunk for High Availability and Disaster ... · between Splunk and our competitors, stick to the facts. • Make sure all statements are not overstated and are supported

© 2018 SPLUNK INC.

Disaster Recovery (DR)

Page 6: Architecting Splunk for High Availability and Disaster ... · between Splunk and our competitors, stick to the facts. • Make sure all statements are not overstated and are supported

© 2018 SPLUNK INC.

© 2018 SPLUNK INC. CONFIDENTIAL INFORMATION. DO NOT DISTRIBUTE.

© 2018 SPLUNK INC.

What is Disaster Recovery?

Set of processes necessary to ensure recovery of service after a disaster

DR

Page 7: Architecting Splunk for High Availability and Disaster ... · between Splunk and our competitors, stick to the facts. • Make sure all statements are not overstated and are supported

© 2018 SPLUNK INC.

© 2018 SPLUNK INC. CONFIDENTIAL INFORMATION. DO NOT DISTRIBUTE.

Disaster Recovery Steps

1 Backup necessary data

Backup to a medium at least as resilient as source

Local Backup vs. Remote

2 Restore

Ensure this works

Backup is worthless without restore

DR

Page 8: Architecting Splunk for High Availability and Disaster ... · between Splunk and our competitors, stick to the facts. • Make sure all statements are not overstated and are supported

© 2018 SPLUNK INC.

© 2018 SPLUNK INC. CONFIDENTIAL INFORMATION. DO NOT DISTRIBUTE.

Backup

1 a Configurations

$SPLUNK_HOME/etc/*

b Indexes Buckets: Hot*, Warm, Cold, Frozen

DR

Page 9: Architecting Splunk for High Availability and Disaster ... · between Splunk and our competitors, stick to the facts. • Make sure all statements are not overstated and are supported

© 2018 SPLUNK INC.

© 2018 SPLUNK INC. CONFIDENTIAL INFORMATION. DO NOT DISTRIBUTE.

Backup Configurations

$SPLUNK_HOME/etc/*

DR

Splunk Instance

Page 10: Architecting Splunk for High Availability and Disaster ... · between Splunk and our competitors, stick to the facts. • Make sure all statements are not overstated and are supported

© 2018 SPLUNK INC.

© 2018 SPLUNK INC. CONFIDENTIAL INFORMATION. DO NOT DISTRIBUTE.

Backup: Bucket Lifecycle

10

Events

[Hot Bucket is Full]

[Out of Space or Bucket is Old]

[Explicit User Action]

$ Thawed Path

$ Home Path $ Cold Path [Cheaper Storage]

$ Frozen Path or Deleted

DR

[Out of volume space or too many warms]

Page 11: Architecting Splunk for High Availability and Disaster ... · between Splunk and our competitors, stick to the facts. • Make sure all statements are not overstated and are supported

© 2018 SPLUNK INC.

© 2018 SPLUNK INC. CONFIDENTIAL INFORMATION. DO NOT DISTRIBUTE.

Backup Data

Bucket Type State Can Backup?

Read + Write No*

Read Only Yes

Read Only Yes

DR

*Unless using snapshot aware FS or roll to warm first (which introduces a performance penalty).

Page 12: Architecting Splunk for High Availability and Disaster ... · between Splunk and our competitors, stick to the facts. • Make sure all statements are not overstated and are supported

© 2018 SPLUNK INC.

© 2018 SPLUNK INC. CONFIDENTIAL INFORMATION. DO NOT DISTRIBUTE.

Restore Configurations DR

$SPLUNK_HOME/etc/*

New Splunk Instance

$SPLUNK_HOME/etc/*

Page 13: Architecting Splunk for High Availability and Disaster ... · between Splunk and our competitors, stick to the facts. • Make sure all statements are not overstated and are supported

© 2018 SPLUNK INC.

© 2018 SPLUNK INC. CONFIDENTIAL INFORMATION. DO NOT DISTRIBUTE.

Restore Data DR

New Splunk Instance

$Indexes_Location ($SPLUNK_HOME/var/lib/splunk)

$Indexes_Location ($SPLUNK_HOME/var/lib/splunk)

Splunk advises restoring fully from a backup rather than restoring on top of a partially corrupted datastore.

Page 14: Architecting Splunk for High Availability and Disaster ... · between Splunk and our competitors, stick to the facts. • Make sure all statements are not overstated and are supported

© 2018 SPLUNK INC.

© 2018 SPLUNK INC. CONFIDENTIAL INFORMATION. DO NOT DISTRIBUTE.

Restore Configurations DR

$SPLUNK_HOME/etc/*

New Splunk Instance

$SPLUNK_HOME/etc/*

Page 15: Architecting Splunk for High Availability and Disaster ... · between Splunk and our competitors, stick to the facts. • Make sure all statements are not overstated and are supported

© 2018 SPLUNK INC.

© 2018 SPLUNK INC. CONFIDENTIAL INFORMATION. DO NOT DISTRIBUTE.

KV store backup scripts added in 7.1, previous versions backup KV Store files

Run on the Searchead to take a stateful snapshot of the KV store

Backup KV Store DR

http://docs.splunk.com/Documentation/Splunk/latest/Admin/BackupKVstore

./splunk backup

kvstore

$SPLUNK_HOME/var/lib/splunk/kvstorebackup

Page 16: Architecting Splunk for High Availability and Disaster ... · between Splunk and our competitors, stick to the facts. • Make sure all statements are not overstated and are supported

© 2018 SPLUNK INC.

© 2018 SPLUNK INC. CONFIDENTIAL INFORMATION. DO NOT DISTRIBUTE.

▶ Option 1: Backup all data on each node

• Will also result in backups of duplicate data

▶ Option 2: Identify one copy of each bucket on the cluster and backup only those (requires scripting)

• Decide whether or not you need to also backup index files

Backup Clustered Data DR

Non-clustered buckets: db_<newest_time>_<oldest_time>_<localid> Clustered original bucket: db_<newest_time>_<oldest_time>_<localid>_<guid> Clustered replicated bucket copies: rb_<newest_time>_<oldest_time>_<localid>_<guid>

Bucket naming conventions

Page 17: Architecting Splunk for High Availability and Disaster ... · between Splunk and our competitors, stick to the facts. • Make sure all statements are not overstated and are supported

© 2018 SPLUNK INC.

© 2018 SPLUNK INC. CONFIDENTIAL INFORMATION. DO NOT DISTRIBUTE.

Putting Restore Together

2 a (New) Splunk Instance

b Configurations

c Data/Indexes

DR

Page 18: Architecting Splunk for High Availability and Disaster ... · between Splunk and our competitors, stick to the facts. • Make sure all statements are not overstated and are supported

© 2018 SPLUNK INC.

© 2018 SPLUNK INC. CONFIDENTIAL INFORMATION. DO NOT DISTRIBUTE.

Considerations

Recovery Time and Tolerable Loss

vs.

Complexity and Cost

DR

Page 19: Architecting Splunk for High Availability and Disaster ... · between Splunk and our competitors, stick to the facts. • Make sure all statements are not overstated and are supported

© 2018 SPLUNK INC.

© 2018 SPLUNK INC. CONFIDENTIAL INFORMATION. DO NOT DISTRIBUTE.

Other elements in your environment

▶ Job Artifacts, DM, Collections etc.

▶ Utility/Management Instances:

• Deployment Server

• License Master

• Cluster Master

• Deployer

Page 20: Architecting Splunk for High Availability and Disaster ... · between Splunk and our competitors, stick to the facts. • Make sure all statements are not overstated and are supported

© 2018 SPLUNK INC.

High Availability (HA)

Page 21: Architecting Splunk for High Availability and Disaster ... · between Splunk and our competitors, stick to the facts. • Make sure all statements are not overstated and are supported

© 2018 SPLUNK INC.

© 2018 SPLUNK INC. CONFIDENTIAL INFORMATION. DO NOT DISTRIBUTE.

© 2018 SPLUNK INC.

What is High Availability?

A design methodology whereby a system is continuously operational, bounded by a set of

predetermined tolerances. Note: “high availability” !=“complete availability”

HA

Page 22: Architecting Splunk for High Availability and Disaster ... · between Splunk and our competitors, stick to the facts. • Make sure all statements are not overstated and are supported

© 2018 SPLUNK INC.

© 2018 SPLUNK INC. CONFIDENTIAL INFORMATION. DO NOT DISTRIBUTE.

Splunk High Availability

1 Data Collection/Reception

2 Searching

3 Indexing

HA

Page 23: Architecting Splunk for High Availability and Disaster ... · between Splunk and our competitors, stick to the facts. • Make sure all statements are not overstated and are supported

© 2018 SPLUNK INC.

© 2018 SPLUNK INC. CONFIDENTIAL INFORMATION. DO NOT DISTRIBUTE.

Data Collection

Forwarder

Indexers

HA

Forwarder Forwarder

. . .

. . .

A B

Page 24: Architecting Splunk for High Availability and Disaster ... · between Splunk and our competitors, stick to the facts. • Make sure all statements are not overstated and are supported

© 2018 SPLUNK INC.

© 2018 SPLUNK INC. CONFIDENTIAL INFORMATION. DO NOT DISTRIBUTE.

Data Collection

Forwarder

Indexers

HA

Forwarder Forwarder

. . .

. . . outputs.conf: [tcpout] defaultGroup = mygroup [tcpout:mygroup] server = A:9997, B:9997 autoLB = true

A B

Page 25: Architecting Splunk for High Availability and Disaster ... · between Splunk and our competitors, stick to the facts. • Make sure all statements are not overstated and are supported

© 2018 SPLUNK INC.

© 2018 SPLUNK INC. CONFIDENTIAL INFORMATION. DO NOT DISTRIBUTE.

Searching

2 Search Head Clustering (SHC)

HA

Page 26: Architecting Splunk for High Availability and Disaster ... · between Splunk and our competitors, stick to the facts. • Make sure all statements are not overstated and are supported

© 2018 SPLUNK INC.

© 2018 SPLUNK INC. CONFIDENTIAL INFORMATION. DO NOT DISTRIBUTE.

Searching

Indexer A Indexer B Indexer N . . .

Typical Search Hierarchy

HA

Page 27: Architecting Splunk for High Availability and Disaster ... · between Splunk and our competitors, stick to the facts. • Make sure all statements are not overstated and are supported

© 2018 SPLUNK INC.

© 2018 SPLUNK INC. CONFIDENTIAL INFORMATION. DO NOT DISTRIBUTE.

Searching

Indexer A Indexer B Indexer N . . .

Typical Search Hierarchy

HA

Page 28: Architecting Splunk for High Availability and Disaster ... · between Splunk and our competitors, stick to the facts. • Make sure all statements are not overstated and are supported

© 2018 SPLUNK INC.

© 2018 SPLUNK INC. CONFIDENTIAL INFORMATION. DO NOT DISTRIBUTE.

Search Head Clustering (SHC) HA

▶ Improved horizontal scaling

▶ Improved high availability

▶ No single point of failure

Page 29: Architecting Splunk for High Availability and Disaster ... · between Splunk and our competitors, stick to the facts. • Make sure all statements are not overstated and are supported

© 2018 SPLUNK INC.

© 2018 SPLUNK INC. CONFIDENTIAL INFORMATION. DO NOT DISTRIBUTE.

SHC

Indexer A Indexer B Indexer N . . .

HA

A B C

Indexer C

Replication protocol syncs: - Configurations - Job Artifacts

Page 30: Architecting Splunk for High Availability and Disaster ... · between Splunk and our competitors, stick to the facts. • Make sure all statements are not overstated and are supported

© 2018 SPLUNK INC.

© 2018 SPLUNK INC. CONFIDENTIAL INFORMATION. DO NOT DISTRIBUTE.

SHC

Indexer A Indexer B Indexer N . . .

HA

A B C

Indexer C

Replication protocol syncs: - Configurations - Job Artifacts

Deployer

Configurations

Page 31: Architecting Splunk for High Availability and Disaster ... · between Splunk and our competitors, stick to the facts. • Make sure all statements are not overstated and are supported

© 2018 SPLUNK INC.

© 2018 SPLUNK INC. CONFIDENTIAL INFORMATION. DO NOT DISTRIBUTE.

SHC

Indexer A Indexer B Indexer N . . .

HA

A B C

Indexer C

Replication protocol syncs: - Configurations - Job Artifacts

Deployer

Configurations

Deployer ensures identical

deployed configurations

Page 32: Architecting Splunk for High Availability and Disaster ... · between Splunk and our competitors, stick to the facts. • Make sure all statements are not overstated and are supported

© 2018 SPLUNK INC.

© 2018 SPLUNK INC. CONFIDENTIAL INFORMATION. DO NOT DISTRIBUTE.

SHC

Indexer A Indexer B Indexer N . . .

HA

A B C

Indexer C

Replication protocol syncs: - Configurations - Job Artifacts

Deployer

Configurations

Captain

Captain plays a special

role in cluster orchestration

and job scheduling.

Page 33: Architecting Splunk for High Availability and Disaster ... · between Splunk and our competitors, stick to the facts. • Make sure all statements are not overstated and are supported

© 2018 SPLUNK INC.

© 2018 SPLUNK INC. CONFIDENTIAL INFORMATION. DO NOT DISTRIBUTE.

SHC Operation - High Level

▶ Deployer ensures all SHC members have identical baseline configurations

• Subsequent UI changes propagated using an internal replication mechanism

▶ Job Scheduler gets disabled on all members but the Captain ▶ Captain selects members to run scheduled jobs based on load

• Selection based on load statistics.

• Captain orchestrates job artifact replication to selected members/candidates of the cluster.

▶ Transparent job artifact proxying (and eventual replication) if artifact not present on user’s SH.

HA

Page 34: Architecting Splunk for High Availability and Disaster ... · between Splunk and our competitors, stick to the facts. • Make sure all statements are not overstated and are supported

© 2018 SPLUNK INC.

© 2018 SPLUNK INC. CONFIDENTIAL INFORMATION. DO NOT DISTRIBUTE.

SHC Operation - High Level

▶ Majority requirement and failure handling

• Surviving majority (>=51%)

▶ Site-awareness gotchas

• No notion of site in SHC (unlike in index replication)

• Case for static captain election

▶ Latency and number of nodes

HA

Page 35: Architecting Splunk for High Availability and Disaster ... · between Splunk and our competitors, stick to the facts. • Make sure all statements are not overstated and are supported

© 2018 SPLUNK INC.

© 2018 SPLUNK INC. CONFIDENTIAL INFORMATION. DO NOT DISTRIBUTE.

Stretched SHC

Indexer Indexer

HA

Deployer Captain

Indexer Indexer

Site A Site B

Page 36: Architecting Splunk for High Availability and Disaster ... · between Splunk and our competitors, stick to the facts. • Make sure all statements are not overstated and are supported

© 2018 SPLUNK INC.

© 2018 SPLUNK INC. CONFIDENTIAL INFORMATION. DO NOT DISTRIBUTE.

Stretched SHC

Indexer Indexer

HA

Deployer

Captain must be statically

elected as there is no SHC

majority

Indexer Indexer

Captain

Site A Site B

Page 37: Architecting Splunk for High Availability and Disaster ... · between Splunk and our competitors, stick to the facts. • Make sure all statements are not overstated and are supported

© 2018 SPLUNK INC.

© 2018 SPLUNK INC. CONFIDENTIAL INFORMATION. DO NOT DISTRIBUTE.

Stretched SHC

Indexer Indexer

HA

Deployer

Indexer Indexer

Deployer Captain

Deployer launched in

second site, if SCH

updates required

Site A Site B

Page 38: Architecting Splunk for High Availability and Disaster ... · between Splunk and our competitors, stick to the facts. • Make sure all statements are not overstated and are supported

© 2018 SPLUNK INC.

© 2018 SPLUNK INC. CONFIDENTIAL INFORMATION. DO NOT DISTRIBUTE.

Indexing

3 Indexer Clustering

HA

Page 39: Architecting Splunk for High Availability and Disaster ... · between Splunk and our competitors, stick to the facts. • Make sure all statements are not overstated and are supported

© 2018 SPLUNK INC.

© 2018 SPLUNK INC. CONFIDENTIAL INFORMATION. DO NOT DISTRIBUTE.

Index Replication ▶ Cluster = a group of search peers (indexers) that replicate each others' buckets ▶ Data Availability

• Availability for ingestion and searching

▶ Data Fidelity

• Forwarder Acknowledgement, assurance

▶ Disaster Recovery

• Site awareness

▶ Search Affinity

• Local search preference vs. remote

HA

Trade offs

• Extra storage

• Slightly increased processing load.

Page 40: Architecting Splunk for High Availability and Disaster ... · between Splunk and our competitors, stick to the facts. • Make sure all statements are not overstated and are supported

© 2018 SPLUNK INC.

© 2018 SPLUNK INC. CONFIDENTIAL INFORMATION. DO NOT DISTRIBUTE.

Cluster Components ▶ Master Node

• Orchestrates replication/remedial process. Informs the SH where to find searchable data. Helps manage peer configurations.

▶ Peer Nodes

• Receive and index data. Replicate data to/from other peers. Peer Nodes Number ≥ RF

▶ Search Head(s)

• Must use one to search across the cluster.

▶ Forwarders

• Use with auto-lb and indexer acknowledgement

HA

Page 41: Architecting Splunk for High Availability and Disaster ... · between Splunk and our competitors, stick to the facts. • Make sure all statements are not overstated and are supported

© 2018 SPLUNK INC.

© 2018 SPLUNK INC. CONFIDENTIAL INFORMATION. DO NOT DISTRIBUTE.

Single Site Cluster

Architecture

Credit: Splunk Docs Team

HA

Page 42: Architecting Splunk for High Availability and Disaster ... · between Splunk and our competitors, stick to the facts. • Make sure all statements are not overstated and are supported

© 2018 SPLUNK INC.

© 2018 SPLUNK INC. CONFIDENTIAL INFORMATION. DO NOT DISTRIBUTE.

Credit: Splunk Docs Team

HA Replication Factor (RF) ▶ Number of copies of data in

the cluster. Default RF=3 ▶ Cluster can tolerate RF-1 node

failures

Page 43: Architecting Splunk for High Availability and Disaster ... · between Splunk and our competitors, stick to the facts. • Make sure all statements are not overstated and are supported

© 2018 SPLUNK INC.

© 2018 SPLUNK INC. CONFIDENTIAL INFORMATION. DO NOT DISTRIBUTE.

Credit: Splunk Docs Team

HA Search Factor (SF) ▶ Number of copies of data in

the cluster. Default SF=2 ▶ Requires more storage ▶ Replicated vs. Searchable

Bucket

Page 44: Architecting Splunk for High Availability and Disaster ... · between Splunk and our competitors, stick to the facts. • Make sure all statements are not overstated and are supported

© 2018 SPLUNK INC.

© 2018 SPLUNK INC. CONFIDENTIAL INFORMATION. DO NOT DISTRIBUTE.

Clustered Indexing ▶ Originating peer node streams copies of data to other clustered peers.

• Receiving peers store those copies.

▶ Master determines replicated data destination.

• Instructs peers what peers to stream data to. Does not sit on data path.

▶ Master manages all peer-to-peer interactions and coordinates remedial activities. ▶ Master keeps track of which peers have searchable data.

• Ensures that there are always SF copies of searchable data available.

HA

Page 45: Architecting Splunk for High Availability and Disaster ... · between Splunk and our competitors, stick to the facts. • Make sure all statements are not overstated and are supported

© 2018 SPLUNK INC.

© 2018 SPLUNK INC. CONFIDENTIAL INFORMATION. DO NOT DISTRIBUTE.

Clustered Searching ▶ Search head coordinates all searches in the cluster ▶ SH relies on master to tell it who its peers are.

• The master keeps track of which peers have searchable data

▶ Only one replicated bucket is searchable a.k.a primary

• i.e., searches occur over primary buckets, only.

▶ Primary buckets may change over time

• Peers know their status and therefore know where to search

HA

Page 46: Architecting Splunk for High Availability and Disaster ... · between Splunk and our competitors, stick to the facts. • Make sure all statements are not overstated and are supported

© 2018 SPLUNK INC.

© 2018 SPLUNK INC. CONFIDENTIAL INFORMATION. DO NOT DISTRIBUTE.

Multisite Clustering

▶ Site awareness introduced in Splunk 6.1 ▶ Improved disaster recovery

• Multisite clusters provide site failover capability

▶ Search Affinity

• Search heads will scope searches to local site, whenever possible

• Ability to turn off for better thruput vs. X-Site bandwidth

Page 47: Architecting Splunk for High Availability and Disaster ... · between Splunk and our competitors, stick to the facts. • Make sure all statements are not overstated and are supported

© 2018 SPLUNK INC.

© 2018 SPLUNK INC. CONFIDENTIAL INFORMATION. DO NOT DISTRIBUTE.

Multi Site Cluster

Architecture

Credit: Splunk Docs Team

Differences vs. single site ▶ Assign a site to each node ▶ Specify RF and SF on a site

by site basis

Page 48: Architecting Splunk for High Availability and Disaster ... · between Splunk and our competitors, stick to the facts. • Make sure all statements are not overstated and are supported

© 2018 SPLUNK INC.

© 2018 SPLUNK INC. CONFIDENTIAL INFORMATION. DO NOT DISTRIBUTE.

Multisite Clustering Cont’d

▶ Each node belongs to an assigned site, except for the Master Node, which controls all sites but it’s not logically a member of any

▶ Replication of bucket copies occurs in a site-aware manner.

• Multisite replication determines # copies on each site. Ex. 3 site cluster: site_replication_factor = origin:2, site1:1, site2:1, site3:1, total:4

▶ Bucket-fixing activities respect site boundaries when applicable ▶ Searches are fulfilled by local peers whenever possible (a.k.a search

affinity)

• Each site must have at least a full set of searchable data

Page 49: Architecting Splunk for High Availability and Disaster ... · between Splunk and our competitors, stick to the facts. • Make sure all statements are not overstated and are supported

© 2018 SPLUNK INC.

Rolling Restarts and Upgrades

Page 50: Architecting Splunk for High Availability and Disaster ... · between Splunk and our competitors, stick to the facts. • Make sure all statements are not overstated and are supported

© 2018 SPLUNK INC.

© 2018 SPLUNK INC. CONFIDENTIAL INFORMATION. DO NOT DISTRIBUTE.

Rolling Restarts/Upgrades

▶ New features in 7.1.0

▶ Searchable Rolling Restarts

• Perform a restart of an Indexer Cluster without effecting Search

▶ Upgrade a Splunk Clustered Deployment (SHC and Indexer Cluster) without downtime

• Rolling upgrades apply to versions 7.1 and above (7.1 -> 7.2)

Page 51: Architecting Splunk for High Availability and Disaster ... · between Splunk and our competitors, stick to the facts. • Make sure all statements are not overstated and are supported

© 2018 SPLUNK INC.

© 2018 SPLUNK INC. CONFIDENTIAL INFORMATION. DO NOT DISTRIBUTE.

Searchable Rolling Restart

▶ The master restarts peers one at a time. ▶ The master runs health checks to confirm that the cluster is in a searchable

state before it initiates the searchable rolling restart. ▶ The peer waits for in-progress searches to complete, up to a maximum time

period, as determined by the decommission_search_jobs_wait_secs attribute in server.conf. The default for this attribute is 180 seconds. This covers the majority of searches in most cases.

▶ Searchable rolling restart applies to both historical searches and real-time searches.

▶ splunk rolling-restart cluster-peers -searchable true

Page 52: Architecting Splunk for High Availability and Disaster ... · between Splunk and our competitors, stick to the facts. • Make sure all statements are not overstated and are supported

© 2018 SPLUNK INC.

© 2018 SPLUNK INC. CONFIDENTIAL INFORMATION. DO NOT DISTRIBUTE.

1. Perform pre-flight indexer cluster health checks

• splunk show cluster-status --verbose

2. Upgrade the cluster master

3. Initialize rolling upgrade on the indexer cluster

• splunk upgrade-init cluster-peers

4. Select a cluster peer and gracefully shutdown the cluster-peer

• splunk offline

5. Upgrade the selected cluster-peer and restart

6. Repeat steps 4 and 5 for all cluster peers

7. Finalize rolling upgrade on the indexer cluster

• splunk upgrade-finalize cluster-peers

Searchable Rolling Upgrades

Rolling Upgrade Steps:

Page 53: Architecting Splunk for High Availability and Disaster ... · between Splunk and our competitors, stick to the facts. • Make sure all statements are not overstated and are supported

© 2018 SPLUNK INC.

© 2018 SPLUNK INC. CONFIDENTIAL INFORMATION. DO NOT DISTRIBUTE.

© 2018 SPLUNK INC.

Searchable Rolling Upgrades

SH-A SH-B SH-C

CM: Idx1: Idx2: IdxN:

Stop the cluster master 1

Upgrade and restart

cluster master

2

Rolling upgrade of

Search Heads

3

Repeat step 5 for multi-

site deployments

6

Upgrade each indexer

ensuring search

continuity. Repeat for

all indexers

5

Deliver business continuity by making search highly available

Deployer

Upgrade SHC deployer

4

Page 54: Architecting Splunk for High Availability and Disaster ... · between Splunk and our competitors, stick to the facts. • Make sure all statements are not overstated and are supported

© 2018 SPLUNK INC.

Smart Store

Page 55: Architecting Splunk for High Availability and Disaster ... · between Splunk and our competitors, stick to the facts. • Make sure all statements are not overstated and are supported

© 2018 SPLUNK INC.

© 2018 SPLUNK INC. CONFIDENTIAL INFORMATION. DO NOT DISTRIBUTE.

© 2018 SPLUNK INC.

▶ Data Resiliency

Will my data be protected if something goes wrong?

• SmartStore relies on the remote store to provide data resiliency

− SmartStore and indexer clustering will not protect data in the remote store

− When used on SmartStore enabled indexes, indexer clustering will only protect hot buckets

▶ Search Resiliency

Will my search results be complete if something goes wrong?

• Indexer clustering must be used to provide highly available search

− SmartStore does not natively provide search HA

High Availability Data vs. Search

Page 56: Architecting Splunk for High Availability and Disaster ... · between Splunk and our competitors, stick to the facts. • Make sure all statements are not overstated and are supported

© 2018 SPLUNK INC.

© 2018 SPLUNK INC. CONFIDENTIAL INFORMATION. DO NOT DISTRIBUTE.

© 2018 SPLUNK INC.

Architecture Components

Hot/Warm

Storage Cold

Storage

Classic Architecture

Remote Storage

Smart Store Architecture

Hot/Cache

Storage

Hot/Cache

Storage

Page 57: Architecting Splunk for High Availability and Disaster ... · between Splunk and our competitors, stick to the facts. • Make sure all statements are not overstated and are supported

© 2018 SPLUNK INC.

© 2018 SPLUNK INC. CONFIDENTIAL INFORMATION. DO NOT DISTRIBUTE.

© 2018 SPLUNK INC.

Failure Scenarios Failed peers < replication factor

STUB

CACHED

HOT HOT copy bucket

STUB push metadata

STUB push metadata

Available copies of a bucket exist in the cluster

▶ Cluster master initiates a fix-up operation

• Peers with an existing copy of buckets are instructed to copy them to other peers until RF/SF is met

• Hot bucket: full copy

• Cached bucket: metadata push

− Peers don’t need the full contents of cached buckets, as it can be fetched from the remote store

− Metadata replication is done with a POST to a REST endpoint on the target peer(s)

Page 58: Architecting Splunk for High Availability and Disaster ... · between Splunk and our competitors, stick to the facts. • Make sure all statements are not overstated and are supported

© 2018 SPLUNK INC.

© 2018 SPLUNK INC. CONFIDENTIAL INFORMATION. DO NOT DISTRIBUTE.

© 2018 SPLUNK INC.

Retention for SmartStore enabled indexes is managed globally

• Retention for “classic” indexes is managed on a per-indexer basis

▶ Retention Options

• Total Index size in MB

− Oldest bucket is purged when [maxGlobalDataSizeMB] is reached

• Age of events

− Bucket is purged when the oldest event reaches [frozenTimePeriodInSecs]

Retention Options

Page 59: Architecting Splunk for High Availability and Disaster ... · between Splunk and our competitors, stick to the facts. • Make sure all statements are not overstated and are supported

© 2018 SPLUNK INC.

Final Words

Page 60: Architecting Splunk for High Availability and Disaster ... · between Splunk and our competitors, stick to the facts. • Make sure all statements are not overstated and are supported

© 2018 SPLUNK INC.

© 2018 SPLUNK INC. CONFIDENTIAL INFORMATION. DO NOT DISTRIBUTE.

Putting it together

Search Head Clustering

Indexer Clustering

Forwarding Layer – autoLB

………..

………..

……….. Deployer Master

Page 61: Architecting Splunk for High Availability and Disaster ... · between Splunk and our competitors, stick to the facts. • Make sure all statements are not overstated and are supported

© 2018 SPLUNK INC.

© 2018 SPLUNK INC. CONFIDENTIAL INFORMATION. DO NOT DISTRIBUTE.

Top Takeaways END ▶ DR – Process of backing-up and restoring service in case of disaster

• Configuration files – copy of $SPLUNK_HOME/etc/ folder

• Indexed data – backup and restore buckets

• Hot, warm, cold, frozen

• Can’t backup hot (without snapshots) but can safely backup warm and cold

▶ HA – continuously operational system bounded by a set of tolerances

• Data collection

• Autolb from forwarders to multiple indexers

• Use Indexer Acknowledgement to protect in flight data

• Searching

• Search Head Clustering (SHC)

• Indexing

• Use Index Replication

Page 62: Architecting Splunk for High Availability and Disaster ... · between Splunk and our competitors, stick to the facts. • Make sure all statements are not overstated and are supported

© 2018 SPLUNK INC.

© 2018 SPLUNK INC. CONFIDENTIAL INFORMATION. DO NOT DISTRIBUTE.

SVAs – Splunk Validated Architectures

▶ Recommended and supported Splunk deployment topologies based upon the following design pillars:

• Availability

• Performance

• Scalability

• Security

• Manageability

https://www.splunk.com/pdfs/white-papers/splunk-validated-

architectures.pdf

Page 63: Architecting Splunk for High Availability and Disaster ... · between Splunk and our competitors, stick to the facts. • Make sure all statements are not overstated and are supported

© 2018 SPLUNK INC.

© 2018 SPLUNK INC. CONFIDENTIAL INFORMATION. DO NOT DISTRIBUTE.

Resources

63

END • Splunk Validated Architectures

– https://www.splunk.com/pdfs/white-papers/splunk-validated-architectures.pdf

• Docs – High Availability – http://docs.splunk.com/Documentation/Splunk/latest/Deploy/Useclusters – http://docs.splunk.com/Documentation/Splunk/latest/Deploy/Indexercluster – https://docs.splunk.com/Documentation/Splunk/latest/DistSearch/AboutSHC

– Rolling restarts and upgrades: – http://docs.splunk.com/Documentation/Splunk/latest/DistSearch/SHCrollingupgrade – http://docs.splunk.com/Documentation/Splunk/latest/Indexer/Userollingrestart#Use_searchabl

e_rolling_restart – http://docs.splunk.com/Documentation/Splunk/latest/Indexer/Searchablerollingupgrade

Page 64: Architecting Splunk for High Availability and Disaster ... · between Splunk and our competitors, stick to the facts. • Make sure all statements are not overstated and are supported

© 2018 SPLUNK INC.

Q&A

Page 65: Architecting Splunk for High Availability and Disaster ... · between Splunk and our competitors, stick to the facts. • Make sure all statements are not overstated and are supported

© 2018 SPLUNK INC.

Don't forget to rate this session

in the .conf18 mobile app

Thank You