Copyright © 2013 Splunk Inc.
Sean Blake & Simeon Yep #splunkconf
Best Practices: Deploying Splunk on Physical, Virtual, and Cloud Infrastructure
Legal Notices
During the course of this presentation, we may make forward-looking statements regarding future events or the expected performance of the company. We caution you that such statements reflect our current expectations and estimates based on factors currently known to us and that actual events or results could differ materially. For important factors that may cause actual results to differ from those contained in our forward-looking statements, please review our filings with the SEC. The forward-looking statements made in this presentation are being made as of the time and date of its live presentation. If reviewed after its live presentation, this presentation may not contain current or accurate information. We do not assume any obligation to update any forward-looking statements we may make. In addition, any information about our roadmap outlines our general product direction and is subject to change at any time without notice. It is for informational purposes only and shall not be incorporated into any contract or other commitment. Splunk undertakes no obligation either to develop the features or functionality described or to include any such feature or functionality in a future release.
Splunk, Splunk>, Splunk Storm, Listen to Your Data, SPL and The Engine for Machine Data are trademarks and registered trademarks of Splunk Inc. in the United States and other countries. All other brand names, product names, or trademarks belong to their respective owners.
©2013 Splunk Inc. All rights reserved.
About Us
• Simeon Yep
• 5+ years @ Splunk
• Experience
  – Customer Success Manager
  – On-site Consultant (20 TB/day)
  – Technical Sales
  – Strategic Accounts
• Based in HQ (San Francisco)
• Currently: Sales Engineering Manager, Business Development
• Sean Blake
• 2+ years @ Splunk
• Experience
  – Professional Services
  – Multitude of customers
  – Development background
• Not at HQ (Washington DC)
• Currently: Professional Services Manager, Public Sector
Splunk: Indexer
• Processes raw data and stores it onto disk
• Input processing
  – Parsing (char set determination, linebreaking)
  – Merging (line merging, time extraction)
  – Typing (punctuation, anonymization)
• Indexer pipe
  – Internal stuff
  – Write to disk (compressed)
• Performs HEAVY lifting for searches!
Splunk: Searcher
• Spawns search process (splunkd-search)
  – 1:1 ratio of search process to CPU core
  – Communicates via REST API (https)
[Diagram: a splunkd search process alongside four splunkd instances]
Splunk: Forwarder
• Sends data to a Splunk indexer in Splunk format
• Install onto the remote system for data ingestion
• Low impact - basically reads in data for transmission
• Full vs. Light vs. Universal?
Splunk: Universal Forwarder
• A brief history…
• Light weight = forwarding only
• Python/Splunkweb removed
• Searching/indexing removed
• Deployment server removed
• LWF (4.1 and earlier) ~ UF
Brief Summary
• Indexers: heavy lifting (index AND search)
• Searchers: spawn the initial search – distribute as necessary
• Forwarders: send data to the indexer for indexing
Infrastructure: Best Practices
• Rule of thumb: more is better
  – Distributed search
  – Splunk scales horizontally
  – Higher quantities will parallelize CPU/IO
  – Map-reduce
• Rule of thumb: 100 GB/day indexing volume per reference server
  – Reference server: 2x4 core CPU, 16GB RAM, Fast Disks in RAID 1+0
  – Exceptions and qualifications
• Leverage deployment server to manage configuration
  – Alternatives: Opscode’s Chef, Puppet, Solaris Package Manager, etc.
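The 100 GB/day rule of thumb can be turned into a quick first-pass estimate. The sketch below is only an illustration of that arithmetic: the function name `indexers_needed` is hypothetical, and it ignores the "exceptions and qualifications" (search load, retention, disk speed) that the slide warns about.

```python
import math


def indexers_needed(daily_volume_gb, per_indexer_gb_per_day=100.0):
    """Estimate how many reference-class indexers a daily volume calls for.

    The default of 100 GB/day per indexer is the rule of thumb from the
    slide; it is a starting point, not a guarantee.
    """
    if daily_volume_gb < 0 or per_indexer_gb_per_day <= 0:
        raise ValueError("volumes must be non-negative and the per-indexer rate positive")
    # Always plan for at least one indexer, and round partial indexers up.
    return max(1, math.ceil(daily_volume_gb / per_indexer_gb_per_day))


print(indexers_needed(450))  # 5
```

Rounding up rather than to the nearest integer follows the "more is better" guidance: a partially loaded extra indexer still adds CPU and I/O for searches.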
Infrastructure: Best Practices
• Use commodity servers (reference server)
  – Reference server: 2x4 core CPU, 16GB RAM, 4x300GB SAS drive (RAID 1+0)
  – 2 vCPU indexer is a poor choice
• Use Fast Persistent Storage, don’t skimp
  – RAID 1+0 arrays, SAN, NFS is not ideal
  – This will affect the user experience and limit growth if too slow
  – We are often constrained by this first, so make it a high priority
• Distributed Search considerations
  – Indexers require good IO performance
  – Searchers are not as IO dependent as indexers
  – Leverage blades or virtual machines for intermediate forwarders
Data Collection: Review
• How do we get data into Splunk?
  – File or directory monitoring (e.g. access log files from web servers)
  – Network input over TCP/UDP (e.g. syslog from a router)
  – Scripted input (e.g. text output from running a shell script)
  – Modular inputs (5.x)
Textual Machine Data
Data Inputs: Best Practices
• Rule of thumb: persist raw data to disk
  – Splunk tracks files very well (CRC checks)
  – Recovery and reliability
• Network inputs
  – Stream to syslog-ng or similar
  – Output to a file
• Forwarders vs. network stream or NFS
  – File input is more steady state than network stream
  – Data distribution: load balance to many indexers
  – Pre-processing: anonymize data or route it to a 3rd party
Data Inputs: Best Practices
• Rule of thumb: set index and sourcetype on forwarder
  – It’s the easiest method
  – Improves indexer efficiency
• Rule of thumb: sourcetype your data
  – Use built-in sourcetypes or create new ones
  – Examples: csv, log4j, access_combined, iis, syslog
  – sourcetype is an indexed field
  – Organizes your data for efficient retrieval
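Setting index and sourcetype on the forwarder usually means adding them to a monitor stanza in the forwarder's inputs.conf. As a rough sketch, the helper below renders such a stanza as text. The helper name `monitor_stanza`, the log path, and the index name `web` are assumptions made for illustration; `access_combined` is one of the built-in sourcetypes listed above.

```python
def monitor_stanza(path, index, sourcetype):
    """Render an inputs.conf monitor stanza that sets index and sourcetype.

    The path, index and sourcetype values are supplied by the caller; nothing
    here checks that the index actually exists on the indexers.
    """
    lines = [
        f"[monitor://{path}]",       # file or directory the forwarder should watch
        f"index = {index}",          # destination index, decided at the forwarder
        f"sourcetype = {sourcetype}",  # sourcetype assigned before the data leaves the host
    ]
    return "\n".join(lines) + "\n"


# Hypothetical web server access log sent to a hypothetical "web" index.
print(monitor_stanza("/var/log/httpd/access_log", "web", "access_combined"))
```

Generating stanzas from a template like this fits the earlier advice to manage configuration centrally, whether through deployment server or a tool such as Chef or Puppet.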
Virtual Machine Specs
• Virtual Machine considerations (VMware, Citrix XenServer)
  – Follow standard guidelines
    · Bigger != Better
    · Many = Better
  – Always set vCPU, RAM for “full reservation”
• Indexer
  – 8 vCPU recommended; 4 vCPU per VM minimum; 8GB RAM
    · Full reservation: if it’s 14000MHz worth of CPU, then set it that way
• Search head
  – 12 vCPU recommended; 8 vCPU per VM minimum; 12GB RAM
• Maintain expected performance = full reservation
  – Splunk constantly reads from and writes to disk, which is intensive and demanding
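The "full reservation" figure is simply the VM's vCPU count multiplied by the clock speed of a host core, plus all of its assigned RAM. The sketch below works through that arithmetic and checks a VM against the minimums as read from this slide. The names `VmSpec` and `full_reservation` are hypothetical, and the 1750 MHz core clock is an assumed value chosen so that 8 vCPUs come to the 14000 MHz mentioned above.

```python
from dataclasses import dataclass

# Minimums as read from the slide: role -> (minimum vCPUs, minimum RAM in GB).
MINIMUMS = {"indexer": (4, 8), "search_head": (8, 12)}


@dataclass
class VmSpec:
    role: str            # "indexer" or "search_head"
    vcpus: int           # vCPUs assigned to the VM
    ram_gb: int          # RAM assigned to the VM
    core_clock_mhz: int  # clock speed of one host core (assumed known)


def full_reservation(spec):
    """Return (cpu_mhz, ram_mb) to reserve so the VM gets everything it is assigned."""
    min_vcpus, min_ram_gb = MINIMUMS[spec.role]
    if spec.vcpus < min_vcpus or spec.ram_gb < min_ram_gb:
        raise ValueError(f"{spec.role} is below the minimum of {min_vcpus} vCPU / {min_ram_gb} GB")
    return spec.vcpus * spec.core_clock_mhz, spec.ram_gb * 1024


print(full_reservation(VmSpec("indexer", 8, 16, 1750)))  # (14000, 16384)
```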
Storage Considerations
• Splunk recommends 800 IOPS
  – Don’t use NFS for primary storage
  – Thick provisioned disks “Eager Zeroed Thick”
    · Avoid double I/O when writing to disk
    · One write to “zero” it then another to actually write
    · Does not pertain to NFS
    · Common mistake and why we have caution on VMFS
• Performance testing
  – SplunkIT app – tests indexing and searching
  – bonnie++ (blog information available)
  – iozone (contact for app, still in development)
• Use raw volumes
  – VMFS experiences lower performance, but see above
Virtual Machine Notes
• Snapshots will degrade performance
  – All writes to filesystem are in turn written to the snapshot, I/O hit
  – When consolidating it also requires I/O to move blocks around along with incoming data
• Distributed Resource Scheduler (DRS)
  – Great tool, hurts Splunk
  – If avoidable, pin the VM to the host
  – If not, ensure anti-affinity so indexers are not overloading a single host
• Highest priority for CPU and memory shares
• Do NOT set Max Resources to less than the assigned value for the VM
• No memory overcommit
• Possible 20-30% overhead against data volumes
Cloud Providers
• Splunk – Storm; Enterprise SaaS
• Amazon Web Services (AWS)
  – Large market share
  – Many best practices
  – Splunk friendly
• Azure
  – Splunk runs on Azure VMs
  – Azure app data is not trivial to retrieve
• Other
  – Rackspace
AWS – Overview
• Availability Zones – concept of regions
• Amazon Machine Image (AMI)
  – Amazon Linux based
  – Best performance
  – Cost effective (extra $$ for Windows)
• Instance type
  – Spot vs. On-demand vs. Reserved
• Instance size
  – Small, Medium, Large, Extra Large (2-8 EC2 Compute Units, 1-8 GB RAM)
  – Standard vs. Cluster compute vs. GPU (varying CPU and RAM)
  – XL standard behaves similar to a reference server (50-100 GB/day)
AWS – Instance Selection
• Which instance do I want?
• Splunk test results
  – c1.xlarge is most cost effective (High-CPU Extra Large)
AWS – Storage Selection
• Storage:
  – Elastic Block Store (EBS)
  – Simple Storage Service (S3)
• EBS considerations
  – Behaves like a volume
  – RAID 1+0 – improved performance
  – Zone limited
  – Provisioned IOPS – testing underway
• Use Snapshots
• S3 optimal for long term due to zone availability
AWS – Infrastructure Practices
• After selection process is done
  – Create your own “Custom AMI”
  – Use your configuration tool to push AMI or bits
• Authentication and Authorization
  – Managing Users and Roles
  – SSO or LDAP (SSL Tunnel)
• Security
  – Create SSL tunnels or similar for distributed environments
  – Enable SSL and use your own certificates
AWS – Infrastructure Practices
• Search head pooling requires NFS
• EBS != NFS; must build NFS server
• Example configuration (500+ GB/day)
  – m1.small as forwarder (N)
  – m1.large as indexer (10)
  – c1.xlarge as search head (1)
• Who does it?
  – Best Buy – scales 1000s of systems with Chef
AWS – Content
• AWS usage app
  – Splunk developed
  – Track usage, cost, and capacity of your AWS instances with improved granularity
• AWS S3 add-on
  – Splunk developed
  – Modular input for S3 data
• AWS CloudFormation
  – Allows easy creation of a multi-node Splunk distributed environment
Azure – H/W Platform
• What is a “Role”
  – VM Role (Virtual Machine)
  – Worker Role
  – Web Role
• Operating Systems
  – Windows and Linux based
• Sizing
  – XS, S, M, L, XL
  – Medium or Higher (CPU = 1.6GHz, recommended minimum = 3 GHz)
Azure – Storage
• Windows Azure Storage
  – “Blobs” = Block Blobs and Page Blobs
• VHD (Virtual Hard Disk/Drive)
  – VHD = base storage unit (exists as a Page Blob)
  – Drives, disks, and images are all VHDs that exist in Blob storage
• Windows Azure Drive
  – AKA: X drive
  – Persistent Storage (network attached durable drive)
  – RAID?
Azure – Data Collection
• Use Forwarders + Standard Collection Methods
  – File/directory monitoring
  – Network input
  – Scripted input…Azure Apps
• Azure Apps
  – Output typically written to Blob Storage (must write a scripted input)
  – Alternative is to natively (within app) send events to a Splunk instance
    · Some content on how this can be done
Azure – Infrastructure Summary
• Use Extra Large instances
• Azure App
• Use Persistent Storage
  – Azure (X) Drives can be moved if the VM dies
Azure – Infrastructure Summary
• Leverage Deployment server + scripting to manage distributed environment
• PowerShell scripts to automate deployment
Cloud Summary
• Great information for major cloud providers
• Reference architecture and automation templates publicly available
• Consider best fit for your scenario
  – Security requirements
  – Pricing model
  – Topology fit
Plan for Growth
• Splunk is very flexible, but ensure you have enough at all tiers (forwarders, indexers, search)
• A bottleneck today can be remedied…but something else will take its place
• Use more nodes to scale up, not bigger machines (when in doubt = reference architecture)
[Diagram: growth stages: indexer → indexers → search head & indexers → search head, deployment server & indexers]
Scaling Up
• Use off-the-shelf dual-socket machines with direct-attached storage
• Use more nodes to scale up, not bigger machines
• More cores are more expensive and don’t scale aggregate IO as much
  – You can work with this (up to a point) with multiple instances
• Number of nodes depends mostly on search requirements
  – Ignore the 100GB/day/instance rule-of-thumb at scale
  – Indexers can index 250GB/day safely, and over 500GB/day possibly
    · Search requirements drive this heavily, lots of read I/O will take away from the writes
Scaling Indexing
• Make sure you have enough at every point:
• Readers/forwarders to read the data?
  – May need several for large syslog volumes
• Indexers to receive the data?
• Forwarders to spread data over indexers?
  – AutoLB by default sends at least 30 seconds of data to a single indexer
  – So a single indexer can be backed up while others are idle
  – Decrease the AutoLB interval, and increase the input queue size to contain it
• Try to avoid Heavy Forwarders and bottlenecks
  – Don’t funnel (if you don’t have to), go directly from UF to indexers, parse/filter on indexers
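A small model can show why the AutoLB interval matters. The sketch below is a deliberately simplified illustration, not Splunk's actual load-balancing algorithm: it assumes a single forwarder that sends at a constant rate and moves to a different indexer at every interval. The function names and the 2 MB/s rate are hypothetical.

```python
import math


def burst_mb(rate_mb_per_s, interval_s):
    """Data one indexer receives in a row before the forwarder switches target."""
    return rate_mb_per_s * interval_s


def indexers_touched(window_s, interval_s, indexer_count):
    """Upper bound on how many indexers receive data during an observation window.

    Assumes the forwarder picks a different indexer at every switch.
    """
    return min(indexer_count, math.ceil(window_s / interval_s))


# One forwarder at a hypothetical 2 MB/s, ten indexers, observed for one minute.
for interval in (30, 5):
    print(interval, burst_mb(2.0, interval), indexers_touched(60, interval, 10))
# 30 60.0 2
# 5 10.0 10
```

With the 30-second default, each indexer takes a 60 MB burst and only two of the ten see any data in the minute. A 5-second interval shrinks the burst to 10 MB and spreads it across all ten, which is also why the input queue needs to be sized for the burst.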
Scaling Search
• Use parallel dispatch from search head to indexer peers
  – Requires disabling SSL on splunkd
• Use job servers: isolate jobs and users from each other on different search heads
• Partition groups of users from each other on different search heads
• SHP can be used if you have fast shared NFS between search heads
Ongoing Operations
• Software configuration management and deployment
• Backups, retention, archiving, disaster recovery
  – You can attend Dritan’s session: Architect Splunk for High Availability and Disaster Recovery
  – Lower RTO = Lots more $; Lower RPO = Lots more $; Lower both = $$$$$$
  – 5.x = clustering
  – Align data retention policy with search use cases
• Service resiliency
  – Splunk is fine for this re: indexing – forwarder LB handles it
  – In 5.x+, use index replication (clustering) for HA; however, if you need DR, a conversation still needs to take place
• Capacity monitoring, review, and planning
  – Perform over time, especially when data and users are onboarded
Expert’s Tool Bag
• Metadata searches
  – Hosts, Sources, Sourcetypes
  – Latest, Oldest, and Last Event time
  – Total Event Count
  – Fast!!!
• Field optimization
  – Use Advanced charting (disables “Preview”)
  – Use “| fields <relevant_field>”
  – Turn off field discovery
• Inspect search
  – Displays statistics about your search
  – Data is from $SPLUNK_HOME/var/run/splunk/dispatch/<job_id>
Expert’s Tool Bag
• Configuration check
  – ./splunk cmd btool <config_file> list --debug
  – What Splunk currently thinks (may not be what is loaded)
• Remote commands
  – ./splunk <command_set> -uri https://remoteserver:8089
  – Good for searching
  – Clean your history!
• Splunk logs
  – Tune logging level via management UI
  – $SPLUNK_HOME/etc/log.cfg
  – Not ideal for Production
Expert’s Tool Bag
• Debugging forwarders with Splunk
  – Forwarder connections and stats
  – index=_internal source=*metrics.log group=tcpin_connections
  – Data transferred, connected?, 30 second intervals
• Debugging time stamp issues
  – Check MAX_TIMESTAMP_LOOKAHEAD and TIME_FORMAT
  – Leverage _indextime, timestartpos, timeendpos fields to debug
• Bundle replication (distributed environments)
  – What is it?
  – Optimize bundles so large ones are NOT transferred all the time
  – MySQL App for lookups
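When debugging time stamp issues, it can help to reproduce the interaction between the lookahead window and the format string outside Splunk. The sketch below is only an approximation: it uses Python's `datetime.strptime`, whose directives resemble but are not identical to Splunk's TIME_FORMAT, and the function `extract_timestamp` and the sample event are hypothetical. It tries to parse a timestamp using only the first `max_lookahead` characters of an event.

```python
from datetime import datetime


def extract_timestamp(event, time_format, max_lookahead):
    """Parse a timestamp from the start of an event, looking no further than max_lookahead characters.

    Returns a datetime, or None when no prefix of the window matches the format.
    """
    window = event[:max_lookahead]
    # Try the whole window first, then progressively shorter prefixes.
    for end in range(len(window), 0, -1):
        try:
            return datetime.strptime(window[:end], time_format)
        except ValueError:
            continue
    return None


event = "2013-10-01 12:34:56 INFO service started"  # made-up sample event
fmt = "%Y-%m-%d %H:%M:%S"

print(extract_timestamp(event, fmt, 19))  # 2013-10-01 12:34:56
print(extract_timestamp(event, fmt, 10))  # None
```

A window of 19 characters covers the full timestamp, while a window of 10 cuts it off after the date and nothing is extracted. A window that truncates the timestamp mid-field can be worse than none, because a partial value may still parse and yield the wrong time.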
More Information
• Contact: [email protected] or [email protected]
• Applications: apps.splunk.com
• Answers: answers.splunk.com
• Education: www.splunk.com/view/education/SP-CAAAAH9
• Professional Services: www.splunk.com/view/professional-services/SP-CAAABH9
• Videos: www.splunk.com/videos
Other Sessions You Can Attend
• Architect Splunk for High Availability and Disaster Recovery
  – Dritan Bitincka, Wed @ 9:00
• Architecting and Sizing Your Splunk Deployments
  – Simeon Yep, Wed @ 3:00
• Onboard Data into Splunk, Correctly
  – Matthew Settipane, Wed @ 4:30
• The S.o.S App: All Splunk on Splunk Action, All The Time
  – Octavio DiSciullo, Thurs @ 9:00
• Planning and Execution for Successful Deployments
  – Chris Olson & Pete Sicilia, Thurs @ 10:15
Next Steps
1. Download the .conf2013 Mobile App
   If not iPhone, iPad or Android, use the Web App
2. Take the survey & WIN A PASS FOR .CONF2014… Or one of these bags!
3. View the other “Deploying” sessions
   All sessions are available on the Mobile App
   Videos will be available shortly