Resolving and Preventing MySQL Downtime
Common MySQL service-impacting challenges, resolutions, and prevention.
Jervin Real
Jervin Real
• Technical Services Manager – APAC
• Engineer Engineering Engineers
What is Downtime?
Application
Users
BOSS
Why Prevent Downtime?
• Your business loses money when the application is down
• You and your team’s reputation suffers
Agenda
• Real world adventures
• Problems
• Solutions
• Prevention
• Putting them all together
I Had a Crash on You
I Had a Crash on You: Page Corruption
• Disk bad sectors problem
• No monitoring, checks
• Page corruption on disk level, crashes when reading page from disk
• … and it keeps crashing
I Had a Crash on You: Page Corruption
• Percona Server, we tried:
• innodb_corrupt_table_action = salvage
• Worked!
• Dropped table, recreated - application back online
• Worst case:
• innodb_force_recovery > 0
• Data Recovery
I Had a Crash on You: Assertion
• Running 5.6.11, early adopter, InnoDB FULLTEXT
• Upgrade to 5.6.18, MySQL crashed
• Data was unusable - bug#72079
I Had a Crash on You: Assertion
• Downgrade and restore from backup
• Re-execute upgrade to avoid the bug
I Had a Crash on You: Assertion
• innodb_corrupt_table_action = salvage|warn
• pt-table-checksum
• Regularly read through all your data and check the error log for errors
• RAID card health checks
• Can vary by vendor
• SMART checks
• Be vigilant for disk level errors
• Plan your upgrades properly
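Checking the error log for trouble can be automated. The following is a minimal sketch, not a monitoring tool: the signature list is illustrative and assumed, and the function names are hypothetical. Extend the patterns with messages you have actually seen in your own logs.

```python
import re

# Illustrative signatures only (assumption): tune these to your own error log.
PATTERNS = [
    re.compile(r"page corruption", re.IGNORECASE),
    re.compile(r"assertion failure", re.IGNORECASE),
    re.compile(r"is marked as crashed", re.IGNORECASE),
]


def find_suspect_lines(lines):
    """Return (line_number, text) for every line matching a signature."""
    hits = []
    for number, text in enumerate(lines, start=1):
        if any(pattern.search(text) for pattern in PATTERNS):
            hits.append((number, text.rstrip("\n")))
    return hits


def scan_error_log(path):
    """Scan a log file on disk; the path is supplied by the caller."""
    with open(path, errors="replace") as handle:
        return find_suspect_lines(handle)
```

Run it from cron and alert whenever the returned list is not empty.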
Nobody’s Watching
Nobody’s Watching: Nobody Cared
• Percona XtraDB Cluster, 3 nodes
• A few months ago, node 3 went down due to a conflict, but nobody noticed
• A few hours ago, node 2 was killed by the OOM killer and the cluster lost quorum
• EVERYBODY NOTICED!
Nobody’s Watching: Nobody Cared
• Bootstrap the remaining node
• mysql> SET GLOBAL wsrep_provider_options='pc.bootstrap=1';
• SST the second and third nodes
• Define wsrep_notify_cmd temporarily
• Implement better alerting
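Better alerting for this case could look like the sketch below. It assumes the caller has already loaded the output of SHOW GLOBAL STATUS LIKE 'wsrep_%' into a dict. The function name and the expected size of 3 are assumptions for illustration.

```python
def cluster_alerts(status, expected_size=3):
    """Return a list of alert strings for one node's wsrep status.

    `status` maps status variable names to values, e.g. the rows of
    SHOW GLOBAL STATUS LIKE 'wsrep_%' (loading them is left to the caller).
    """
    alerts = []
    size = int(status.get("wsrep_cluster_size", 0))
    if size < expected_size:
        alerts.append(f"cluster size {size} is below expected {expected_size}")
    if status.get("wsrep_cluster_status") != "Primary":
        alerts.append("node is not part of a Primary component")
    if status.get("wsrep_local_state_comment") != "Synced":
        alerts.append("node is not Synced")
    return alerts
```

With a check like this in place, the loss of node 3 would have raised the cluster size alert months before quorum was lost.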
Nobody’s Watching: Dropped the Bomb
• New sysadmin received disk space alert
• du -hx --max-depth=1 /
• /var has lots of data
• find /var/ -size +5G -exec rm -rf {} \;
• Bam, ibdata1 gone!
• Restart maintenance occurred later in the day ...
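A safer habit than piping find straight into rm is to list candidates first and flag anything under a protected directory. The sketch below only reports and never deletes. The default protected path /var/lib/mysql is an assumption about where the datadir lives, and the function name is hypothetical.

```python
import os


def large_files(root, min_bytes, protected=("/var/lib/mysql",)):
    """List files of at least min_bytes under root, largest first.

    Returns (path, size, is_protected) tuples and deletes nothing.
    The default protected directory is an assumed datadir location.
    """
    protected_abs = [os.path.abspath(p) for p in protected]
    results = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                size = os.lstat(path).st_size
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if size < min_bytes:
                continue
            absolute = os.path.abspath(path)
            guarded = any(
                absolute == p or absolute.startswith(p + os.sep)
                for p in protected_abs
            )
            results.append((path, size, guarded))
    return sorted(results, key=lambda item: item[1], reverse=True)
```

Anything reported with is_protected set to True deserves a conversation with the DBA before it is touched.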
Nobody’s Watching: Dropped the Bomb
• Restore from backup
• Really, they were lucky!
• What if there were no backups and innodb_file_per_table = 0?
Nobody’s Watching: Prevention
• Percona Monitoring Plugins
• pmp-check-deleted-files
• pmp-check-mysql-status
• pmp-check-mysql-innodb
• Define a script executable by mysql user
• Triggered on node state changes
• Take backups, and alert on failure
• https://github.com/dotmanila/pyxbackup
Self Induced Pain
Self Induced Pain: Query Cache Lock
• “Waiting for query cache lock”
Self Induced Pain: Query Cache Lock
• Global mutex, point of contention
• More so on a hot dataset/table
• Worse with a large QC
Self Induced Pain: Query Cache Lock
• Set it to a small size to reduce performance overhead
• Disable it completely to avoid contention
• Hint offending queries to skip the query cache i.e. SELECT SQL_NO_CACHE
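To judge how bad the contention is before choosing one of these options, count the threads stuck in that state. The sketch assumes the rows of SHOW FULL PROCESSLIST have already been fetched as dicts keyed by column name. The function names are hypothetical.

```python
from collections import Counter

QC_STATE = "Waiting for query cache lock"


def state_counts(processlist):
    """Count threads per State value; NULL states are counted as ''."""
    return Counter((row.get("State") or "") for row in processlist)


def query_cache_waiters(processlist):
    """Number of threads currently waiting on the query cache mutex."""
    return state_counts(processlist)[QC_STATE]
```

Sampled every few seconds, a persistently high count points at the query cache rather than at the queries themselves.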
Self Induced Pain: Buffer Pool Dump/Restore
• Dumps buffer pool page list to disk
• Reloads buffer pool based on this list at startup
• Meant to help speed up buffer pool warmup
Self Induced Pain: Buffer Pool Dump/Restore
• Maintenance restart, buffer dump and restore enabled
• Yey! Expecting everything to go well.
• 30 mins in, performance still really bad, IO thrashing
• Large buffer pool, busy read/write
Self Induced Pain: Buffer Pool Dump/Restore
• Extend your maintenance period to let the server warm up if possible, otherwise the buffer pool reload and production traffic will contend on IO
• RAID1 of 2 SATA disks is not a license to use buffer pool warmup on 240GB of buffer pool
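A back-of-the-envelope estimate shows why. The sketch assumes the default 16KB InnoDB page size, every page read individually, and roughly 300 random read IOPS for the mirrored pair. The IOPS figure is an assumption, not a measurement, and the function name is hypothetical.

```python
def warmup_seconds(pool_bytes, page_bytes=16 * 1024, read_iops=300, fraction=1.0):
    """Crude warmup time: pages to load divided by assumed random read IOPS.

    read_iops=300 is an assumed figure for two mirrored SATA disks;
    fraction is the share of the buffer pool that gets reloaded.
    """
    pages = int(pool_bytes * fraction) // page_bytes
    return pages / read_iops


pool = 240 * 1024**3  # 240GB buffer pool from the slide
print(f"{warmup_seconds(pool) / 3600:.1f} hours")  # prints "14.6 hours"
```

The real reload can be faster when reads are more sequential and slower when it competes with live traffic, but the order of magnitude makes the point: hours, not minutes.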
Self Induced Pain: Prevention
• Percona Toolkit
• pt-sift
• pt-stalk
• pt-kill
• Optimize for IO
MySQL, MySQL! What Have Suffereth Ye Thee?
MySQL, MySQL! What Have Suffereth Ye Thee?: Grind to a Halt
• Slow queries
• Connections build up
• Slow response times
• Long running transactions
• Stop the world scenario
MySQL, MySQL! What Have Suffereth Ye Thee?: Grind to a Halt
--innodb--
txns: 486xACTIVE (28s) 994xnot (0s) 227xLOCK WAIT (25844s)
0 queries inside InnoDB, 0 queries in queue
Main thread: sleeping, pending reads 0, writes 28, flush 1
Log: lsn = 2147483647, chkp = 2147483647, chkp age = 210625191
MySQL, MySQL! What Have Suffereth Ye Thee?: Grind to a Halt
---TRANSACTION 230207990, ACTIVE 13779 sec fetching rows
mysql tables in use 1, locked 1
80337 lock struct(s), heap size 8271400, 10979242 row lock(s)
MySQL thread id 671621, OS thread handle 0x7fe03528a700, query id 37505085 localhost magento Sending data
SELECT `sales_flat_quote_item`.* FROM `sales_flat_quote_item` LIMIT 376 OFFSET 491056
MySQL, MySQL! What Have Suffereth Ye Thee?: Grind to a Halt
• Kill long running trx
• pt-kill for persistent long running trx
• Deploy immediate code changes to disable erring code
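Finding which transactions to kill can be scripted from the TRANSACTIONS section of SHOW ENGINE INNODB STATUS. The sketch below is an illustration only, with a hypothetical function name and an assumed 60 second threshold. In production, pt-kill already does this job.

```python
import re

TRX_HEADER = re.compile(r"^---TRANSACTION (\d+), ACTIVE (\d+) sec")
THREAD_ID = re.compile(r"^MySQL thread id (\d+),")


def long_transactions(status_text, min_seconds=60):
    """Return dicts for transactions active at least min_seconds.

    Each dict carries trx_id, active_sec and the MySQL thread_id
    (None if no thread line follows the transaction header).
    """
    found = []
    current = None
    for line in status_text.splitlines():
        if line.startswith("---TRANSACTION"):
            current = None  # a new block starts, e.g. "not started"
            header = TRX_HEADER.match(line)
            if header and int(header.group(2)) >= min_seconds:
                current = {
                    "trx_id": int(header.group(1)),
                    "active_sec": int(header.group(2)),
                    "thread_id": None,
                }
                found.append(current)
            continue
        thread = THREAD_ID.match(line)
        if thread and current is not None and current["thread_id"] is None:
            current["thread_id"] = int(thread.group(1))
    return found
```

Each returned thread_id is the value to pass to KILL once you have confirmed the transaction is safe to terminate.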
MySQL, MySQL! What Have Suffereth Ye Thee?: CPU Load
• MySQL is still responding
• All sorts of mutexes
• trx_sys->mutex
• block->lock
• lock_sys->mutex
• lock_sys->wait_mutex
• … and is killing latency
• Service impact means lost income
MySQL, MySQL! What Have Suffereth Ye Thee?: CPU Load
• innodb_thread_concurrency > 0
MySQL, MySQL! What Have Suffereth Ye Thee?: Prevention
• pt-kill --log
• Separate your OLTP from analytics if possible
• Proactive analysis on performance and queries
• pt-query-digest (PMM)
• pt-stalk
• MySQL Server Configuration
• Remember to tune innodb_thread_concurrency (default is 0)
• innodb_concurrency_tickets, innodb_sync_spin_loops, etc
• Application Stack Configuration (Schema Design)
• Single tenant per schema
• Multiple tenants per schema (each table has client_id column)
• All tenants in one schema
Wizard of OS: Disk Performance
• Disk performance cascading to MySQL to application
Wizard of OS: Disk Performance
• Slow writes, binlogs, redo logs, syncs
• Transactions stalling on COMMIT, updating, inserting …
• Replication getting delayed if node is a slave
• Translates to latency
Wizard of OS: Disk Performance
• RAID Controller in Write-Through
• Could also be bad disk
• Default IO elevator – deadline|noop
• Bad mount options - +noatime
• NFS?
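Two of these causes are quick to verify on Linux. The sketch parses the contents of /sys/block/<device>/queue/scheduler, where the active elevator is shown in brackets, and /proc/mounts style text. Reading the files is left to the caller, and the function names are hypothetical.

```python
def active_scheduler(scheduler_text):
    """Return the bracketed elevator, e.g. 'noop deadline [cfq]' -> 'cfq'."""
    for token in scheduler_text.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    return None


def mount_options(mounts_text, mount_point):
    """Return the option list for mount_point from /proc/mounts style text."""
    for line in mounts_text.splitlines():
        fields = line.split()
        # fields: device, mount point, filesystem type, options, dump, pass
        if len(fields) >= 4 and fields[1] == mount_point:
            return fields[3].split(",")
    return None
```

For example, "noatime" in mount_options(text, "/var/lib/mysql") tells you whether the datadir mount carries the option, with /var/lib/mysql being an assumed mount point.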
Wizard of OS: Swapping
• Swapping heavily, with significant amount of RAM free
Wizard of OS: Swapping
• Swapping induces significant amount of IO
• Swapping in and out of disk is mighty expensive
• Affects MySQL in magnificent ways
• Swap insanity!
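The telltale pattern, swap in use while plenty of RAM is free, can be spotted from /proc/meminfo. A minimal sketch follows. Both thresholds are arbitrary assumptions and the function names are hypothetical. It reports swap usage, not swap activity, so confirm with the swap in/out columns of vmstat.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo style text into {field: integer value}."""
    values = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])  # kB for the fields used here
    return values


def swap_with_free_ram(meminfo, min_swap_used_kb=1024 * 1024, min_free_fraction=0.10):
    """True when swap is in use although a notable share of RAM is free.

    Both thresholds are illustrative assumptions, not recommendations.
    """
    swap_used = meminfo["SwapTotal"] - meminfo["SwapFree"]
    free_fraction = meminfo["MemFree"] / meminfo["MemTotal"]
    return swap_used >= min_swap_used_kb and free_fraction >= min_free_fraction
```

A True result on a dedicated database host is a good reason to look at the NUMA allocation policy next.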
Wizard of OS: Swapping
• NUMA Interleave
• Percona Server is NUMA configurable
• numa_interleave
• flush_caches
• Check numastat - perl check_numa.pl
Wizard of OS: Prevention
• Tune
• NUMA Policy
• vm.swappiness (always have swap space)
• mount options - noatime
• Disk scheduler/IO Elevator – noop|deadline
• Blog: Linux performance tuning tips for MySQL
• Blog: InnoDB performance optimization basics (redux)
Summary
• Be proactive and analyze your performance regularly
• https://www.percona.com/blog/2016/04/18/percona-monitoring-and-management/
• Monitor, monitor wisely
• Test, tune, repeat
• Plan and plan more
Join us at Percona Live Europe
When: October 3-5, 2016
Where: Amsterdam, Netherlands
The Percona Live Open Source Database Conference is a great event for users of any level using open source database technologies.
§ Get briefed on the hottest topics
§ Learn about building and maintaining high-performing deployments
§ Listen to technical experts and top industry leaders
Get the advanced registration rate before prices go up on Sep 5th! Register now!
Sponsorship opportunities available as well here.
Questions?