Running MongoDB in Production
Tim Vaillancourt, Sr. Technical Operations Architect, Percona
2
`whoami`
{name: "tim", lastname: "vaillancourt", employer: "percona", techs: [
  "mongodb", "mysql", "cassandra", "redis", "rabbitmq", "solr", "python", "golang"
]}
3
Agenda
● Backups
● Security
● Monitoring
● Architecture and High-Availability
● Hardware
● Tuning MongoDB
● Tuning Linux
● Troubleshooting
● Schema
● Data Integrity
● Scaling (Reads/Writes)
4
Terminology
● Data
  ○ Document: single *SON object, often nested
  ○ Field: single field in a document
  ○ Collection: grouping of documents
  ○ Database: grouping of collections
  ○ Capped Collection: a fixed-size FIFO collection
● Replication
  ○ Oplog: a special capped collection for replication
  ○ Primary: the replica set node that can receive writes
  ○ Secondary: a read-only replica of the Primary
  ○ Voting: the process to elect a Primary node
  ○ Hidden-Secondary: a replica that cannot become Primary
Backups
"An admin is only worth the backups they keep" ~ Unknown
6
Backups: Logical
● 'mongodump' tool from the mongo-tools project
● Supports
  ○ Multi-threaded dumping in 3.2+
  ○ Optional gzip compression of data
  ○ Optional dumping of the oplog for single-node consistency
  ○ Replica set awareness (Read Preference)
    ■ i.e.: primary, primaryPreferred, secondary, secondaryPreferred, nearest
● Process
  ○ Tool issues a .find() with a $snapshot query
  ○ Stores BSON data in one file per collection
  ○ Stores BSON oplog data in "oplog.bson"
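Putting those options together, a sketch of an invocation (hostnames and output path are placeholders, not from the slides):

```shell
# Hypothetical mongodump run: secondary read preference, oplog capture
# for single-node point-in-time consistency, gzip-compressed output.
HOST="rs0/mongo1.example.com:27017,mongo2.example.com:27017"
OUT="/backup/$(date +%Y%m%d)"
CMD="mongodump --host=$HOST --readPreference=secondaryPreferred --oplog --gzip --out=$OUT"
echo "$CMD"
```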
7
Backups: Logical
● Limitations
  ○ Not Sharding aware
  ○ Sharded backups are not point-in-time consistent
  ○ Fetching from the storage engine, serialization, networking, etc. is very inefficient
  ○ Indexes are rebuilt serially (per collection)
    ■ Only index metadata is in the backup
    ■ Often the indexing process takes longer than restoring the data!
  ○ Wire Protocol Compression (added in 3.4+) not supported: https://jira.mongodb.org/browse/TOOLS-1668 (please vote/watch the issue!)
  ○ Requires the oplog to be as long as the backup run time
8
Backups: Binary
● Options
  ○ Cold Backup
  ○ LVM Snapshot
  ○ Hot Backup
    ■ Percona Server for MongoDB (FREE!)
    ■ MongoDB Enterprise Hot Backup (non-free)
    ■ NOTE: MMAPv1 not supported
● Benefits
  ○ Indexes are backed up == faster restore!
  ○ Storage-engine format backed up == faster backup AND restore!
● Limitations
  ○ Increased backup storage requirements
  ○ Compression is storage-engine dependent
9
Backups: Binary
● Limitations
  ○ CPU architecture limitations (64-bit vs 32-bit)
  ○ Cascading corruption
  ○ Batteries not included
    ■ Not Sharding aware
    ■ Not Replica Set aware
● Process
  ○ Cold Backup
    ■ Stop a mongod SECONDARY, copy/archive the dbPath
  ○ LVM Snapshot
    ■ Optionally call 'db.fsyncLock()' (not required in 3.2+ with Journaling)
    ■ Create an LVM snapshot of the dbPath
    ■ Copy/archive the dbPath
10
Backups: Binary
● Process
  ○ LVM Snapshot
    ■ Remove the LVM snapshot (as quickly as possible!)
    ■ NOTE: LVM snapshots can cause up to 30%* write latency impact to disk (due to copy-on-write)
  ○ Hot Backup (PSMDB or MongoDB Enterprise)
    ■ Pay $$$ for MongoDB Enterprise or download PSMDB for free(!)
    ■ db.adminCommand({
        createBackup: 1,
        backupDir: "/data/mongodb/backup"
      })
    ■ Copy/archive the output path
    ■ Delete the backup output path
    ■ NOTE: the RocksDB-based createBackup creates filesystem hardlinks whenever possible!
    ■ NOTE: delete the RocksDB backupDir as soon as possible to reduce bloom filter overhead!
11
Backups: Architecture
● Risks
  ○ Dynamic nature of Replica Sets
  ○ Impact of backups on live nodes
● Example: Cheap Disaster Recovery
  ○ Place a 'hidden: true' SECONDARY in another location
  ○ Optionally use a cloud object store (AWS S3, Google GS, etc.)
● Example: Replica Set Tags
  ○ "tags" allow fine-grained server selection with key/value pairs
  ○ Use key/value pairs to fence various application workflows
  ○ Example:
    ■ { "role": "backup" } == Backup Node
    ■ { "role": "application" } == App Node
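As an illustration (not actual driver code), tag-based server selection boils down to matching every key/value pair of the requested tag set against each member's tags:

```python
def select_members(members, tags):
    """Return members whose tags contain every key/value pair in `tags`."""
    return [m for m in members
            if all(m.get("tags", {}).get(k) == v for k, v in tags.items())]

# Hypothetical replica set members with the roles from the slide
members = [
    {"host": "db1:27017", "tags": {"role": "application"}},
    {"host": "db2:27017", "tags": {"role": "backup"}},
]
print(select_members(members, {"role": "backup"}))
# → [{'host': 'db2:27017', 'tags': {'role': 'backup'}}]
```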
12
Backups: mongodb_consistent_backup
● Python project by Percona-Lab for consistent backups
● URL: https://github.com/Percona-Lab/mongodb_consistent_backup
● Best-effort support, not a "Percona Product"
● Created to solve limitations in MongoDB backup tools:
  ○ Replica Set and Sharded Cluster awareness
  ○ Cluster-wide point-in-time consistency
  ○ In-line oplog backup (vs post-backup)
  ○ Notifications of success / failure
● Extra Features
  ○ Remote upload (AWS S3, Google Cloud Storage and Rsync)
  ○ Archiving (Tar or ZBackup deduplication, optional AES-at-rest)
  ○ CentOS/RHEL 7 RPMs and Docker-based releases (.deb soon!)
13
Backups: mongodb_consistent_backup
● Extra Features
  ○ Single Python PEX binary
  ○ Multithreaded / concurrent
  ○ Auto-scales to available CPUs
● Low-Impact
  ○ Tool focuses on low impact
  ○ Uses Secondary nodes only
  ○ Considers (scoring)
    ■ Replication lag
    ■ Replication priority
    ■ Replication health / state
    ■ Hidden-Secondary state (preferred by the tool)
    ■ Fails if the chosen Secondary becomes Primary (on purpose)
14
Backups: mongodb_consistent_backup
● 1.2.0
  ○ Multi-threaded Rsync upload
  ○ Replica Set Tags support
  ○ Support for MongoDB SSL / TLS connections and client auth
  ○ Rotation / expiry of old backups (locally-stored only)
● Future
  ○ Incremental backups
  ○ Binary-level backups (Hot Backup, Cold Backup, LVM, cloud-based, etc.)
  ○ More notification methods (PagerDuty, Email, etc.)
  ○ Restore helper tool
  ○ Instrumentation / metrics
  ○ <YOUR AWESOME IDEA HERE> we take GitHub PRs (and it's Python)!
15
Backups: mongodb_consistent_backup
● Simple Restore
  ○ Seamless restore: "mongorestore --oplogReplay --gzip --dir /path/to/backup"
● Restore an Entire Cluster
  ○ mongorestore backups of the config servers
    ■ If restoring old/SCCC config servers, restore to every node
    ■ If restoring replica-set config servers:
      ● Ensure the Replica Set is initiated (rs.initiate() / rs.config())
      ● Ensure SECONDARY members are added (via the PRIMARY)
      ● Restore to the PRIMARY only
  ○ Update "config.shards" documents if shard hosts/ports changed
  ○ mongorestore each shard from its backup subdirectory (matches the shard name)
  ○ Start a mongos process and test / QA
    ■ Tip: stopping the balancer may simplify troubleshooting any problems
16
More on Backups (some overlap)
Room: Field Suite #2
Time: Wednesday, 16:30 - 16:55
Security
"Think of the network like a public place" ~ Unknown
18
Security: Authorization
● Always enable auth on Production installs!
● Built-in Roles
  ○ Database User: read or write data from collections
    ■ "All Databases" or single-database
  ○ Database Admin: non-R/W commands (create/drop/list/etc.)
  ○ Backup and Restore: back up or restore data
  ○ Cluster Admin: add/drop/list shards
  ○ Superuser/Root: all capabilities
● User-Defined Roles
  ○ Exact Resource+Action specification
  ○ Very fine-grained ACLs
    ■ DB- and collection-specific
19
Security: Filesystem Access
● Use a service user+group
  ○ 'mongod' or 'mongodb' on most systems
  ○ Ensure the data path, log file and key file(s) are owned by this user+group
● Data Path
  ○ Mode: 0750
● Log File
  ○ Mode: 0640
  ○ Contains real queries and their fields!!!
    ■ See Log Redaction in PSMDB (or MongoDB Enterprise) to remove these fields
● Key File(s)
  ○ Files include: keyFile and SSL certificates or keys
  ○ Mode: 0600
20
Security: Network Access
● Firewall
  ○ Single TCP port
    ■ MongoDB Client API
    ■ MongoDB Replication API
    ■ MongoDB Sharding API
  ○ Sharding
    ■ Only the 'mongos' process needs access to the shards
    ■ The client driver does not need to reach the shards directly
  ○ Replication
    ■ All nodes must be accessible to the driver
● Internal Authentication: use a key for inter-node replication/sharding
● Creating a dedicated network segment for databases is recommended!
● Do NOT allow MongoDB to talk to the internet!!!
21
Security: System Access
● Recommended to restrict system access to Database Administrators
● A "shell" on a system can be enough to take the system over!
● SELinux
  ○ Built-in Linux kernel security mechanism
  ○ Massively reduces the attack surface of a system by using ACLs/policies
  ○ Modes
    ■ Enforcing: do not allow policy violations
    ■ Permissive: log and allow policy violations
    ■ Disabled: I really don't like security!
  ○ Enforcing mode supported with Percona Server for MongoDB when using CentOS / RHEL 7+ RPMs
    ■ SELinux is NOT supported by MongoDB Community or Enterprise binaries!!
22
Security: External Authentication
● LDAP Authentication
  ○ Supported in PSMDB and MongoDB Enterprise
  ○ The following components are necessary for external authentication to work:
    ■ LDAP Server: remotely stores all user credentials (i.e. user name and associated password)
    ■ SASL Daemon: used as a MongoDB server-local proxy for the remote LDAP service
    ■ SASL Library: used by the MongoDB client and server to create authentication mechanism-specific data
  ○ Creating a user:
    db.getSiblingDB("$external").createUser({ user: "christian", roles: [{ role: "read", db: "test" }] });
  ○ Authenticating as a user:
    db.getSiblingDB("$external").auth({ mechanism: "PLAIN", user: "christian", pwd: "secret", digestPassword: false });
  ○ Other auth methods are possible with MongoDB Enterprise
23
Security: SSL Connections and Auth
● SSL / TLS Connections
  ○ Supported since MongoDB 2.6.x
    ■ May need to compile it in yourself on older binaries
    ■ Supported 100% in Percona Server for MongoDB
  ○ Minimum of a 128-bit key length for security
  ○ Relaxed and strict (requireSSL) modes
  ○ System (default) or custom Certificate Authorities are accepted
● SSL Client Authentication (x.509)
  ○ MongoDB supports x.509 certificate authentication for use with a secure TLS/SSL connection as of 2.6.x
  ○ x.509 client authentication allows clients to authenticate to servers with certificates rather than with a username and password
  ○ Enabled with: security.clusterAuthMode: x509
24
Security: Encryption at Rest
● MongoDB Enterprise
  ○ Encryption supported in Enterprise binaries ($$$)
● Percona Server for MongoDB
  ○ Use a CryptFS/LUKS block device for encryption of the data volume
  ○ Documentation published (or coming soon)
  ○ Completely open-source / free
● Application-Level
  ○ Selectively encrypt only required fields in the application
  ○ Benefits
    ■ The data is only readable by the application (reduced touch points)
    ■ The resource cost of encryption is lower when it's applied selectively
    ■ Offloading of encryption overhead from the database
25
Security: Network Firewall
● MongoDB only requires a single TCP port to be reachable (on all nodes)
  ○ Default port 27017
  ○ This does not include monitoring tools, etc.
    ■ Percona PMM requires inbound connectivity to 1-2 TCP ports
● Restrict TCP port access to nodes that require it!
● Sharded Cluster
  ○ Application servers only need access to 'mongos'
  ○ Block direct TCP access from application -> shard/mongod instances
    ■ Unless 'mongos' is bound to localhost!
● Advanced
  ○ Move inter-node replication to its own network fabric, VLAN, etc.
  ○ Accept client connections on a public interface
26
More on Security (some overlap)
Room: Field Suite #2
Time: Tuesday, 17:25 to 17:50
Monitoring
28
Monitoring: Methodology
● Monitor often
  ○ Polling every 60 - 300 seconds is not enough!
  ○ Problems can begin/end in seconds
● Correlate Database and Operating System metrics together!
● Monitor a lot
  ○ Store more than you graph
  ○ Example: PMM gathers 700-900 metrics per polling interval
● Process
  ○ Use monitoring to troubleshoot Production events / incidents
  ○ Iterate and improve monitoring
    ■ Add graphing for whatever made you SSH to a host
    ■ Blind QA with someone unfamiliar with the problem
29
Monitoring: Important Metrics
● Database
  ○ Operation counters
  ○ Cache traffic and capacity
  ○ Checkpoint / compaction performance
  ○ Concurrency tickets (WiredTiger and RocksDB)
  ○ Document and index scanning
  ○ Various engine-specific details
● Operating System
  ○ CPU
  ○ Disk
    ■ Bandwidth / utilisation
    ■ Average wait time
  ○ Memory and Network
30
Monitoring: Percona PMM
● Open-source monitoring from Percona!
● Based on open-source technology
  ○ Prometheus
  ○ Grafana
  ○ Go language
● Simple deployment
● Examples in this demo are from PMM!
● Correlation of OS and DB metrics
● 800+ metrics per ping
Architecture and High-Availability
32
High Availability
● Replication
  ○ Asynchronous
    ■ Write Concerns can provide pseudo-synchronous replication
    ■ Changelog-based, using the "Oplog"
  ○ Maximum of 50 members
  ○ Maximum of 7 voting members
    ■ Use "votes: 0" for members beyond 7
  ○ Oplog
    ■ The "oplog.rs" capped collection in the "local" database, storing changes to data
    ■ Read by secondary members for replication
    ■ Written to by the local node after the "apply" of each operation
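A toy sketch of the changelog idea above (the real oplog format differs; the op codes here only mimic it):

```python
# Secondaries tail the primary's changelog and re-apply each operation
# locally, in order. `data` stands in for a node's collection.
def apply_op(data, op):
    if op["op"] == "i":        # insert
        data[op["o"]["_id"]] = op["o"]
    elif op["op"] == "u":      # update ($set-style only, for brevity)
        data[op["o2"]["_id"]].update(op["o"]["$set"])
    elif op["op"] == "d":      # delete
        data.pop(op["o"]["_id"], None)

oplog = [
    {"op": "i", "o": {"_id": 1, "name": "tim"}},
    {"op": "u", "o2": {"_id": 1}, "o": {"$set": {"employer": "percona"}}},
]
replica = {}
for entry in oplog:           # replay the changelog on a fresh node
    apply_op(replica, entry)
print(replica)  # {1: {'_id': 1, 'name': 'tim', 'employer': 'percona'}}
```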
33
Architecture
● Datacenter Recommendations
  ○ A minimum of 3 physical servers is required for High-Availability
  ○ Ensure only 1 member per Replica Set is on any single physical server!!!
● EC2 / Cloud Recommendations
  ○ Place Replica Set members in an odd number of Availability Zones, in the same region
  ○ Use a hidden-secondary node for Backup and Disaster Recovery in another region
  ○ Entire Availability Zones have been lost before!
Hardware
35
Hardware: Mainframe vs Commodity
● Databases: The Past
  ○ Buy some really amazing, expensive hardware
  ○ Buy some crazy-expensive license
    ■ Don't run a lot of servers, due to the above
  ○ Scale up:
    ■ Buy even more amazing hardware for the monolithic host
    ■ Hardware came on a truck
  ○ HA: when it rains, it pours
● Databases: A New Era
  ○ Everything fails, nothing is precious
  ○ Elastic infrastructures ("the cloud", Mesos, etc.)
  ○ Scale out: add more cheap, commodity servers
  ○ HA: lots of cheap, commodity servers - still up!
36
Hardware: Block Devices
● Isolation
  ○ Run mongod dbPaths on a separate volume
  ○ Optionally, run the mongod journal on a separate volume
● RAID Level
  ○ RAID 10 == the performance/durability sweet spot
  ○ RAID 0 == fast and dangerous
● SSDs
  ○ Benefit MMAPv1 a lot
  ○ Benefit WT and RocksDB a bit less
  ○ Keep about 30% free for internal GC on the SSD
37
Hardware: Block Devices
● EBS / NFS / iSCSI
  ○ Risks / Drawbacks
    ■ Exponentially more things to break
    ■ Block device requests wrapped in TCP are extremely slow
    ■ You probably already paid for some fast local disks
    ■ More difficult (sometimes nearly impossible) to troubleshoot
    ■ MongoDB doesn't really benefit from remote storage features/flexibility
      ● Built-in high availability of data via replication
      ● MongoDB replication can bootstrap new members
      ● Strong write concerns can be specified for critical data
38
Hardware: CPUs
● Cores vs Core Speed
  ○ Lots of cores > faster cores (4 CPUs minimum recommended)
  ○ Thread-per-connection model
● CPU Frequency Scaling
  ○ 'cpufreq': a daemon for dynamic scaling of the CPU frequency
  ○ A terrible idea for databases, or for any predictability!
  ○ Disable it, or set the governor to 100% frequency always, i.e. mode: 'performance'
  ○ Disable any BIOS-level performance/efficiency tunable
  ○ ENERGY_PERF_BIAS
    ■ A CentOS/RedHat tuning for the energy-vs-performance balance
    ■ RHEL 6 = 'performance'
    ■ RHEL 7 = 'normal' (!)
    ■ My advice: use 'tuned' to set it to 'performance'
39
Hardware: Network Infrastructure
● Datacenter Tiers
  ○ Network Edge
  ○ Public Server VLAN
    ■ Servers with public NAT and/or port forwards from the Network Edge
    ■ Examples: proxies, static content, etc.
    ■ Calls backends in the Backend Server VLAN
  ○ Backend Server VLAN
    ■ Servers with port forwarding from the Public Server VLAN (w/source-IP ACLs)
    ■ Optional load balancer for stateless backends
    ■ Examples: webservers, application servers/workers, etc.
    ■ Calls data stores in the Data VLAN
  ○ Data VLAN
    ■ Servers, filers, etc. with port forwarding from the Backend Server VLAN (w/source-IP ACLs)
    ■ Examples: databases, queues, filers, caches, HDFS, etc.
40
Hardware: Network Infrastructure
● Network Fabric
  ○ Try to use 10GbE for low latency
  ○ Use Jumbo Frames for efficiency
  ○ Try to keep all MongoDB nodes on the same segment
    ■ Goal: few or no network hops between nodes
    ■ Check with 'traceroute'
● Outbound / Public Access
  ○ Databases don't need to talk to the internet*
    ■ Store a copy of your Yum, DockerHub, etc. repos locally
    ■ Deny any access to the public internet, or have no route to it
    ■ Hackers will try to upload a dump of your data out of the network!!
● Cloud?
  ○ Try to replicate the above with the features of your provider
41
Hardware: Why So Quick?
● MongoDB allows you to scale reads and writes with more nodes
  ○ Single-instance performance is important, but not a deal-breaker
● You are the most expensive resource!
  ○ Not the hardware anymore
Tuning MongoDB
43
Tuning MongoDB: MMAPv1
● MMAP: a kernel-level function to map file blocks to memory
● MMAPv1 syncs data to disk once per 60 seconds (default)
  ○ Override with the --syncDelay <seconds> flag
  ○ If a server with no journal crashes, it can lose up to 1 minute of data!!!
● In-memory buffering of the Journal
  ○ Synced every 30 ms if the journal is on a different disk
  ○ Or every 100 ms otherwise
  ○ Or 1/3rd of the above if a change uses the Journaled write concern (explained later)
44
Tuning MongoDB: MMAPv1
● Fragmentation
  ○ Can cause serious slowdowns on scans, range queries, etc.
  ○ db.<collection>.stats()
    ■ Shows various storage info for a collection
    ■ Fragmentation can be estimated by dividing 'storageSize' by 'size'
    ■ Any value > 1 indicates fragmentation
  ○ Compact when you near a value of 2, by rebuilding secondaries or using the 'compact' command
  ○ WiredTiger and RocksDB have little to no fragmentation, due to checkpoints / compaction
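The 'storageSize' / 'size' arithmetic as a small sketch (the stats numbers here are made up for illustration):

```python
def fragmentation_ratio(stats):
    """storageSize / size from db.<collection>.stats(); > 1 indicates fragmentation."""
    return stats["storageSize"] / stats["size"]

# Hypothetical stats() output for an MMAPv1 collection
stats = {"size": 1_000_000, "storageSize": 1_900_000}
ratio = fragmentation_ratio(stats)
print(round(ratio, 2))  # 1.9 — nearing 2, consider compacting
```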
45
Tuning MongoDB: WiredTiger
● WT syncs data to disk in a process called "checkpointing"
  ○ Every 60 seconds, or at >= 2GB of data changes
● In-memory buffering of the Journal
  ○ Journal buffer size: 128KB
  ○ Synced every 50 ms (as of 3.2)
  ○ Or on every change with the Journaled write concern (explained later)
  ○ While journal records remain in the buffer between syncs, updates can be lost following a hard shutdown!
46
Tuning MongoDB: RocksDB
● A level-based strategy using immutable data level files
  ○ Built-in compression
  ○ Block and filesystem caches
● RocksDB uses "compaction" to apply changes to data files
  ○ Tiered level compaction
  ○ Follows the same logic as MMAPv1 for journal buffering
● MongoRocks
  ○ A layer between RocksDB and MongoDB's storage engine API
  ○ Developed in partnership with Facebook
47
Tuning MongoDB: Storage Engine Caches
● WiredTiger
  ○ In-heap
    ■ 50% of available system memory
    ■ Uncompressed WT pages
  ○ Filesystem Cache
    ■ 50% of available system memory
    ■ Compressed pages
● RocksDB
  ○ Internal testing planned by Percona in the future
  ○ 30% in-heap cache recommended by Facebook / Parse
48
Tuning MongoDB: Durability
● storage.journal.enabled = <true/false>
  ○ Default since 2.0 on 64-bit builds
  ○ Always enable, unless data is transient
  ○ Always enable on cluster config servers
● storage.journal.commitIntervalMs = <ms>
  ○ Max time between journal syncs
● storage.syncPeriodSecs = <secs>
  ○ Max time between data file flushes
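In mongod.conf YAML, these durability settings look like the sketch below (the values shown are illustrative defaults, not a recommendation):

```yaml
# mongod.conf (sketch)
storage:
  journal:
    enabled: true
    commitIntervalMs: 100   # max ms between journal syncs
  syncPeriodSecs: 60        # max secs between data file flushes
```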
49
Tuning MongoDB: Don’t Enable!
● "cpu"
  ○ External monitoring is recommended instead
● "rest"
  ○ Will be deprecated in 3.6+
● "smallfiles"
  ○ In most situations this is not necessary, unless:
    ■ You use MMAPv1, and
    ■ It is a Development / Test environment, or
    ■ You have 100s-1000s of databases with very little data inside (unlikely)
● Profiling mode '2'
  ○ Unless troubleshooting an issue / intentional
Tuning Linux
51
Tuning Linux: The Linux Kernel
● Still on Linux 2.6.x?
● Avoid kernels earlier than 3.10.x - 3.12.x
● Large improvements in parallel efficiency in 3.10+ (for free!)
● More: https://blog.2ndquadrant.com/postgresql-vs-kernel-versions/
52
Tuning Linux: NUMA
● A memory architecture that takes into account the locality of memory, caches and CPUs, for lower latency
  ○ But no databases want to use it :(
● The MongoDB codebase is not NUMA-"aware", causing unbalanced memory allocations on NUMA systems
● Disable NUMA
  ○ In the server BIOS
  ○ Using 'numactl' in init scripts BEFORE the 'mongod' command (recommended for future compatibility):

numactl --interleave=all /usr/bin/mongod <other flags>
53
Tuning Linux: Transparent HugePages
● Introduced in RHEL/CentOS 6, Linux 2.6.38+
● Merges memory pages in the background (the khugepaged process)
● Decreases overall performance when used with MongoDB!
● "AnonHugePages" in /proc/meminfo shows usage
● Disable Transparent HugePages!
  ○ Add "transparent_hugepage=never" to the kernel command line (GRUB)
  ○ Reboot the system
    ■ Disabling online does not clear previously-created THP pages
    ■ Rebooting tests that your system will come back up!
54
Tuning Linux: Time Source
● Replication and clustering need consistent clocks
  ○ mongodb_consistent_backup relies on time sync, for example!
● Use a consistent time source/server
  ○ "It's OK if everyone is equally wrong"
● Non-Virtualised
  ○ Run an NTP daemon on all MongoDB and monitoring hosts
  ○ Enable the service so it starts on reboot
● Virtualised
  ○ Check if your VM platform has an "agent" syncing time
  ○ VMware and Xen are known to have their own time sync
  ○ If no time sync is provided, install the NTP daemon
55
Tuning Linux: I/O Scheduler
● The algorithm the kernel uses to commit reads and writes to disk
● CFQ
  ○ "Completely Fair Queueing"
  ○ Default scheduler in 2.6-era Linux distributions
  ○ Perhaps too clever/inefficient for database workloads
  ○ Probably good for a laptop
● Deadline
  ○ The best general default, IMHO
  ○ Predictable I/O request latencies
● Noop
  ○ Use with virtualised servers
  ○ Use with real-hardware BBU RAID controllers
56
Tuning Linux: Filesystems
● Filesystem Types
  ○ Use XFS or EXT4, not EXT3
    ■ EXT3 has very poor pre-allocation performance
    ■ XFS is strongly recommended for WiredTiger
    ■ EXT4 "data=ordered" mode recommended
  ○ Btrfs: not tested, yet!
● Filesystem Options
  ○ Set 'noatime' on MongoDB data volumes in '/etc/fstab':
○ Remount the filesystem after an options change, or reboot
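A hypothetical '/etc/fstab' entry with 'noatime' set (the device, mount point and filesystem here are placeholders):

```
# /etc/fstab — sketch of a MongoDB data volume entry
/dev/sdb1  /data/mongodb  xfs  defaults,noatime  0  0
```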
57
Tuning Linux: Block Device Readahead
● A tuning that causes data ahead of a requested block on disk to be read and then cached
● Assumptions:
  ○ There is a sequential read pattern
  ○ Something will benefit from the extra cached blocks
● Risks
  ○ Too high a value wastes cache space
  ○ Increases eviction work
    ■ MongoDB tends to have very random disk patterns
● A good start for MongoDB volumes is a readahead of '32' (16KB)
  ○ Let MongoDB worry about optimising the access pattern
58
Tuning Linux: Block Device Readahead
● Change readahead
  ○ Add a file to '/etc/udev/rules.d'
    ■ /etc/udev/rules.d/60-mongodb-disk.rules:
      # set deadline scheduler and 32/16kb read-ahead for /dev/sda
      ACTION=="add|change", KERNEL=="sda", ATTR{queue/scheduler}="deadline", ATTR{bdi/read_ahead_kb}="16"
  ○ Reboot (or use CLI tools to apply)
59
Tuning Linux: Virtual Memory Dirty Pages
● Dirty Pages
  ○ Pages stored in cache that still need to be written to storage
● Dirty Ratio
  ○ Max percent of total memory that can be dirty
  ○ The VM stalls and flushes when this limit is reached
  ○ Start with '10'; the default (30) is too high
● Dirty Background Ratio
  ○ A separate threshold for background dirty-page flushing
  ○ Flushes without pauses
  ○ Start with '3'; the default (15) is too high
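As sysctl settings, the suggested starting points above would be sketched as:

```
# /etc/sysctl.conf — dirty-page starting points from the slide
vm.dirty_ratio = 10
vm.dirty_background_ratio = 3
```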
60
Tuning Linux: Swappiness
● A Linux kernel sysctl setting for preferring RAM or disk for swap
  ○ Linux default: 60
  ○ To avoid disk-based swap: 1 (not zero!)
  ○ To allow some disk-based swap: 10
  ○ '0' can cause more swapping than '1' on recent kernels
    ■ More on this here: https://www.percona.com/blog/2014/04/28/oom-relation-vm-swappiness0-new-kernel/
61
Tuning Linux: Ulimit
● Allows per-Linux-user resource constraints
  ○ Number of user-level processes
  ○ Number of open files
  ○ CPU seconds
  ○ Scheduling priority
  ○ And others…
● MongoDB
  ○ Should probably have a dedicated VM, container or server
  ○ Creates a new thread (counted as a process by ulimit)
    ■ For every new connection to the database
    ■ Plus various background tasks / threads
  ○ Creates an open file for each active data file on disk
    ■ 64,000 open files and 64,000 max processes is a good start
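A sketch of those starting values in a limits.d file, assuming the service user is 'mongod' (the file name is arbitrary):

```
# /etc/security/limits.d/99-mongod.conf
mongod  soft  nofile  64000
mongod  hard  nofile  64000
mongod  soft  nproc   64000
mongod  hard  nproc   64000
```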
62
Tuning Linux: Ulimit
● Setting ulimits
  ○ An /etc/security/limits.d file
  ○ A systemd service
  ○ An init script
● Ulimits are already set by the Percona and MongoDB packages!
  ○ Example: the PSMDB RPM (systemd)
63
Tuning Linux: Network Stack
● Defaults are not good for > 100mbps Ethernet
● Suggested starting point:
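The slide's settings listing did not survive extraction; a commonly-suggested starting point (these values are illustrative, not from the original slide — tune for your network) looks like:

```
# /etc/sysctl.conf — sketch of network tunings for busy database hosts
net.core.somaxconn = 4096
net.core.netdev_max_backlog = 4096
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
net.ipv4.tcp_fin_timeout = 30
```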
● Set Network Tunings:
  ○ Add the above sysctl tunings to /etc/sysctl.conf
  ○ Run "/sbin/sysctl -p" as root to set the tunings
  ○ Run "/sbin/sysctl -a" to verify the changes
64
Tuning Linux: More on this...
https://www.percona.com/blog/2016/08/12/tuning-linux-for-mongodb/
65
Tuning Linux: “Tuned”
● Tuned
  ○ A "framework" for applying tunings to Linux
    ■ RedHat/CentOS 7 only, for now
    ■ Debian added tuned; not sure if it is compatible yet
    ■ Cannot tune NUMA, filesystem type or fs mount opts
    ■ Can set sysctls, THP, I/O scheduler, etc.
● My apology to the community for writing "Tuning Linux for MongoDB":
  ○ https://github.com/Percona-Lab/tuned-percona-mongodb
Troubleshooting
"The problem with troubleshooting is trouble shoots back" ~ Unknown
67
Troubleshooting: Usual Suspects
● Locking
  ○ Collection-level locks
  ○ Document-level locks
  ○ Software mutexes/semaphores
● Limits
  ○ Max connections
  ○ Operation rate limits
  ○ Resource limits
● Resources
  ○ Lack of IOPS, RAM, CPU, network, etc.
68
Troubleshooting: MongoDB Resources
● Memory
● CPU
  ○ System CPU
    ■ FS cache
    ■ Networking
    ■ Disk I/O
    ■ Threading
  ○ User CPU (MongoDB)
    ■ Compression (WiredTiger and RocksDB)
    ■ Session Management
    ■ BSON (de)serialisation
    ■ Filtering / scanning / sorting
69
Troubleshooting: MongoDB Resources
● User CPU (MongoDB)
  ○ Optimiser
● Disk
  ○ Data file reads/writes
  ○ Journaling
  ○ Error logging
● Network
  ○ Query request/response
  ○ Replication
● Disk I/O
  ○ Journaling
  ○ Oplog reads / writes
  ○ Background flushing / compactions / etc.
70
Troubleshooting: MongoDB Resources
● Disk I/O
  ○ Page faults (data not in cache)
  ○ Swapping
● Network
  ○ Client API
  ○ Replication
  ○ Sharding
    ■ Chunk moves
    ■ mongos -> shards
71
Troubleshooting: db.currentOp()
● A function that dumps status info about running operations, plus various lock/execution details
● Only queries currently in progress are shown
● The provided operation ID number can be used to kill long-running queries
● Includes
  ○ Original query
  ○ Parsed query
  ○ Query runtime
  ○ Locking details
● Filter documents
  ○ { "$ownOps": true } == only show operations for the current user
  ○ https://docs.mongodb.com/manual/reference/method/db.currentOp/#examples
72
Troubleshooting: db.stats()
● Returns
  ○ Document-data size (dataSize)
  ○ Index-data size (indexSize)
  ○ Real storage size (storageSize)
  ○ Average object size
  ○ Number of indexes
  ○ Number of objects
73
Troubleshooting: db.currentOp()
74
Troubleshooting: Log File
● Interesting details are logged to the mongod/mongos log files
  ○ Slow queries
  ○ Storage engine details (sometimes)
  ○ Index operations
  ○ Sharding
    ■ Chunk moves
  ○ Elections / Replication
  ○ Authentication
  ○ Network
    ■ Connections
    ■ Errors
    ■ Client / inter-node connections
75
Troubleshooting: Log File - Slow Query
2017-09-19T20:58:03.896+0200 I COMMAND [conn175] command config.locks appName: "MongoDB Shell" command: findAndModify { findAndModify: "locks", query: { ts: ObjectId('59c168239586572394ae37ba') }, update: { $set: { state: 0 } }, writeConcern: { w: "majority", wtimeout: 15000 }, maxTimeMS: 30000 } planSummary: IXSCAN { ts: 1 } update: { $set: { state: 0 } }
keysExamined:1 docsExamined:1 nMatched:1 nModified:1 keysInserted:1 keysDeleted:1 numYields:0 reslen:604 locks:{ Global: { acquireCount: { r: 2, w: 2 } }, Database: { acquireCount: { w: 2 } }, Collection: { acquireCount: { w: 1 } }, Metadata: { acquireCount: { w: 1 } }, oplog: { acquireCount: { w: 1 } } } protocol:op_command 106ms
76
Troubleshooting: Operation Profiler
● Writes slow database operations to a new MongoDB collection for analysis
  ○ Capped collection "system.profile" in each database, 1MB by default
  ○ The collection is capped, i.e.: profile data doesn't last forever
● Support for operationProfiling data in Percona Monitoring and Management is on the current roadmap
● Enable operationProfiling in "slowOp" mode
  ○ Start with a very high threshold and decrease it in steps
  ○ Usually 50-100 ms is a good threshold
  ○ Enable in mongod.conf:

operationProfiling:
  mode: slowOp
  slowOpThresholdMs: 100
77
Troubleshooting: Operation Profiler
● Useful profile metrics
  ○ op/ns/query: type, namespace and query of a profile
  ○ keysExamined: # of index keys examined
  ○ docsExamined: # of docs examined to achieve the result
  ○ writeConflicts: # of write conflicts encountered during an update
  ○ numYields: # of times the operation yielded for others
  ○ locks: detailed lock statistics
78
Troubleshooting: .explain()
● Shows the query explain plan for query cursors
● This will include
  ○ Winning Plan
    ■ Query stages
      ● Query stages may include sharding info in clusters
    ■ Index chosen by the optimiser
  ○ Rejected plans
79
Troubleshooting: .explain() and Profiler
80
Troubleshooting: Cluster Metadata
● The "config" database on Cluster Config servers
  ○ Use .find() queries to view cluster metadata
● Contains
  ○ actionlog (3.0+)
  ○ changelog
  ○ databases
  ○ collections
  ○ shards
  ○ chunks
  ○ settings
  ○ mongos
  ○ locks
  ○ lockpings
81
Troubleshooting: Percona PMM QAN
● The Query Analytics tool enables DBAs and developers to analyse queries over periods of time and find performance problems
● Helps you optimise database performance by making sure that queries are executed as expected and within the shortest time possible
● A central, web-based location for visualising the data
● Data is collected from the MongoDB Profiler (required) by an agent
● Great for reducing access to systems while providing valuable data to development teams!
● Query Normalisation
  ○ i.e.: "{ item: 123456 }" -> "{ item: ##### }"
● Command-line equivalent: the pt-mongodb-query-digest tool
82
Troubleshooting: Percona PMM QAN
83
Troubleshooting: mlogfilter
● A useful tool for processing mongod.log files
● A log-aware replacement for 'grep', 'awk' and friends
● Generally focus on
  ○ mlogfilter --scan <file>
    ■ Shows all collection-scan queries
  ○ mlogfilter --slow <ms> <file>
    ■ Shows all queries that are slower than X milliseconds
  ○ mlogfilter --op <op-type> <file>
    ■ Shows all queries of operation type X (e.g.: find, aggregate, etc.)
● More on this tool here: https://github.com/rueckstiess/mtools/wiki/mlogfilter
84
Troubleshooting: Common Problems
● Sharding
  ○ removeShard doesn't complete
    ■ Check the 'dbsToMove' array of the removeShard response:

mongos> db.adminCommand({removeShard: "test2"})
{
    "msg" : "draining started successfully",
    "state" : "started",
    "shard" : "test2",
    "note" : "you need to drop or movePrimary these databases",
    "dbsToMove" : [
        "wikipedia"
    ],
    "ok" : 1
}

    ■ Why?

mongos> use config
switched to db config
mongos> db.databases.find()
{ "_id" : "wikipedia", "primary" : "test2", "partitioned" : true }
85
Troubleshooting: Common Problems
● Sharding
  ○ removeShard doesn't complete
    ■ Try:
      ● Use movePrimary to move the database(s) Primary role to other shards
      ● Run the removeShard command once the shard being removed is NOT primary for any database
        ○ This starts the draining of the shard
      ● Run the same removeShard command to check on progress
        ○ If the draining and removal are complete, this will respond with success
  ○ Jumbo Chunks
    ■ Will prevent balancing from occurring
    ■ The config.chunks collection document will contain jumbo: true as a key/value pair
    ■ Sharding 'split' commands can be used to reduce the chunk size (sh.splitAt, etc.)
    ■ https://www.percona.com/blog/2016/04/11/dealing-with-jumbo-chunks-in-mongodb/
Schema Design & Workflow
"The problem with troubleshooting is trouble shoots back" ~ Unknown
87
Schema Design: Data Types
● Strings
  ○ Only use strings if required
  ○ Do not store numbers as strings!
  ○ Look for {field: "123456"} instead of {field: 123456}
    ■ "12345678" moved to an integer uses 25% less space
    ■ Range queries on proper integers are more efficient
  ○ Example JavaScript to convert a field in an entire collection:

db.items.find().forEach(function(x) {
    db.items.update(
        { _id: x._id },
        { $set: { itemId: parseInt(x.itemId) } }
    );
});
88
Schema Design: Data Types
● Strings
  ○ Do not store dates as strings!
    ■ The field "2017-08-17 10:00:04 CEST" stored as a date uses 52.5% less space!
  ○ Do not store booleans as strings!
    ■ "true" -> true = 47% less space wasted
● DBRefs
  ○ DBRefs provide pointers to another document
  ○ DBRefs can be cross-collection
● NumberDecimal (3.4+)
  ○ A high-precision decimal type (Decimal128), avoiding floating-point error
89
Schema Design: Indexes
● MongoDB supports BTree, text and geo indexes
● Default behaviour: the collection is locked until indexing completes
● {background: true}
  ○ Runs indexing in the background, avoiding pauses
  ○ Hard to monitor and troubleshoot progress
  ○ Unpredictable performance impact
● Avoid drivers that auto-create indexes
  ○ Use real performance data to make indexing decisions; find out before Production!
● Too many indexes hurts write performance for the entire collection
● Indexes have a forward or backward direction
  ○ Try to cover .sort() with an index, and match its direction!
90
Schema Design: Indexes
● Compound Indexes○ Several fields supported○ Fields can be in forward or backward direction
■ Consider any .sort() query options and match sort direction!○ Composite Keys are read Left -> Right
■ The index can be partially read
■ Left-most field prefixes do not need their own indexes!
■ Given the first index below, the rest are redundant duplicates:
● {username: 1, status: 1, date: 1, count: -1}
● {username: 1, status: 1, date: 1}
● {username: 1, status: 1}
● {username: 1}
● Use db.collection.getIndexes() to view current Indexes
91
Schema Design: Query Efficiency
● Query Efficiency Ratios
○ Index: keysExamined / nreturned
○ Document: docsExamined / nreturned
● End goal: examine only as many index keys/docs as you return (a ratio of 1)!
○ Example: a query scanning 10 documents to return 1 has a ratio of 10, far from the ideal of 1
○ Tip: when using covered indexes zero documents are fetched (docsExamined: 0)!
92
Schema Workflow
● MongoDB optimised for single-document operations● Single Document / Centralised
○ Great cache/disk-footprint efficiency
○ Centralised schemas may create a hotspot for write locking
● Multi Document / Decentralised○ MongoDB rarely stores data sequentially on disk○ Multi-document operations are less efficient○ Less potential for hotspots/write locking○ Increased overhead due to fan-out of updates○ Example: Social Media status update, graph relationships, etc○ More on this later..
93
Schema Workflow
● Read Heavy Workflow○ Read-heavy apps benefit from pre-computed results○ Consider moving expensive reads computation to insert/update/delete○ Example 1: An app does ‘count’ queries often
■ Move .count() read query to a summary document with counters■ Increment/decrement single count value at write-time
○ Example 2: An app that does groupings of data■ Move .aggregate() read query that is in-line to the user to a backend summary worker■ Read from a summary collection, like a view
● Write Heavy Workflow○ Reduce indexing as much as possible○ Consider batching or a decentralised model with lazy updating (eg: social media
graph)
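The pre-computed counter pattern in Example 1 can be simulated in a few lines. Plain JavaScript objects stand in for an orders collection and a summary document (all names are illustrative); in MongoDB the write-time step would be a $inc update on the summary document:

```javascript
// Maintain a summary counter at write time instead of running a
// count() query at read time.
const orders = [];              // stands in for an "orders" collection
const summary = { pending: 0 }; // stands in for a summary document

function insertOrder(order) {
  orders.push(order);
  // Write path pays one extra $inc-style update...
  if (order.status === "pending") summary.pending += 1;
}

insertOrder({ _id: 1, status: "pending" });
insertOrder({ _id: 2, status: "shipped" });
insertOrder({ _id: 3, status: "pending" });

// ...and the read path becomes an O(1) document fetch:
console.log(summary.pending); // 2
```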
94
Schema Workflow
● Batching Inserts/Updates
○ Requires fewer network round-trips
○ Allows the server to do some internal batching
○ Individual operations will be slower overall
○ Suited for queue-worker scenarios batching many changes
○ Traditional user-facing database traffic should aim to operate on a single (or few) document(s)
● Thread-per-connection model○ 1 x DB operation = 1 x CPU core only○ Executing Parallel Reads
■ Large batch queries benefit from several parallel sessions■ Break query range or conditions into several client->server threads■ Not recommended for Primary nodes or Secondaries with heavy reads
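Splitting a large query into parallel client->server sessions usually means partitioning a key range. A sketch of the range-splitting step (numeric bounds are an illustrative assumption; each sub-range would be issued as its own .find() in its own session):

```javascript
// Split a numeric key range into N sub-ranges for parallel sessions.
// Ranges are [inclusive, exclusive) and cover [min, max) exactly.
function splitRange(min, max, parts) {
  const step = Math.ceil((max - min) / parts);
  const ranges = [];
  for (let lo = min; lo < max; lo += step) {
    ranges.push([lo, Math.min(lo + step, max)]);
  }
  return ranges;
}

console.log(splitRange(0, 100, 4));
// [[0,25],[25,50],[50,75],[75,100]] -- one query per client thread,
// e.g. {_id: {$gte: lo, $lt: hi}}
```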
95
Schema Workflow
● No list of fields specified in .find()○ MongoDB returns entire documents unless fields are specified○ Only return the fields required for an application operation!○ Covered-index operations require only the index fields to be specified
● Using $where operators○ This executes JavaScript with a global lock
● Many $and or $or conditions
○ MongoDB (like most databases) doesn’t handle large lists of $and or $or efficiently
○ Try to avoid this sort of model with
■ Data locality■ Background Summaries / Views
96
Fan Out / Fan In
● Fan-Out Systems○ Decentralised○ Data is eventually written in many locations○ Complex write path (several updates)
■ Good use-case for Queue/Worker model■ Batching possible
○ Simple read path (data locality)● Fan-In
○ Centralised○ Simple Write path
■ Possible Write locking○ Complex Read Path
■ Potential for latency due to network
Data Integrity“The problem with troubleshooting is trouble shoots back” ~ Unknown
98
Data Integrity: `whoami` (continued)
● Very Paranoid
● Previous RDBMS-backed work
○ Online Marketing / Publishing
■ Paid for clicks coming in
■ Downtime = loss of revenue and of (paid-for) traffic
○ Warehousing / Pricing SaaS
■ Stores real items in warehouses/stores/etc
■ Downtime = many businesses (customers)/warehouses/etc at a stand-still
■ Integrity problems =
● Orders shipped but not paid for
● Orders paid for but not shipped, etc
○ Moved on to Gaming, Percona
● So why MongoDB?
2010
99
Data Integrity: Storage and Journaling
● The Journal provides durability in the event of failure of the server
● Changes are written ahead to the journal for each write operation
● On crash recovery, the server○ Finds the last point of consistency to disk○ Searches the journal file(s) for the record matching the checkpoint○ Applies all changes in the journal since the last point of consistency
● Journal data is stored in the ‘journal’ subdirectory of the server data path (dbPath)
● Dedicated disks for data (random I/O) and journal (sequential I/O) improve performance
100
Data Integrity: Write Concern
● MongoDB Replication is Asynchronous● Write Concerns
○ Allow control of data integrity of a write to a Replica Set○ Write Concern Modes
■ “w: <num>” - Writes must be acknowledged by the defined number of nodes
■ “majority” - Writes must be acknowledged by a majority of nodes
■ “<replica set tag>” - Writes must be acknowledged by a member with the specified replica set tags
○ Durability
■ By default write concerns are NOT durable
■ “j: true” - Optionally, wait for node(s) to acknowledge journaling of the operation
■ In 3.4+ “writeConcernMajorityJournalDefault” allows enforcement of “j: true” via replica set configuration!
● Must specify “j: false” or alter “writeConcernMajorityJournalDefault” to disable
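Per-operation write concern is passed as an option in the shell. A sketch, to be run in the mongo shell against a replica set (the collection name and timeout are illustrative):

```javascript
// Wait for a majority of nodes to acknowledge AND journal the write,
// giving up after 5 seconds ("orders" is a hypothetical collection).
db.orders.insert(
  { _id: 1, status: "pending" },
  { writeConcern: { w: "majority", j: true, wtimeout: 5000 } }
);
```

The wtimeout bound matters: without it, a write waiting on unreachable members can block indefinitely.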
101
Data Integrity: Replica Set Rollbacks
● Consider this when using “w:1” Write Concern○ A PRIMARY writes 10 documents with w:1 Write Concern to the oplog, then dies○ SECONDARY (2x) nodes applied 5 and 7 of the changes written○ The SECONDARY with 7 changes wins PRIMARY election○ The PRIMARY that died comes back alive○ The old-PRIMARY node becomes RECOVERING then SECONDARY○ 3 documents are “rolled back” to disk
■ A BSON file is written to the ‘rollback’ dir on-disk when a PRIMARY crashes while ahead of its SECONDARYs
■ Monitor for this file existing on disk!!
● Risk: the application and/or end-user thinks this data was written!
● Majority Write Concern and a correct Read Concern can avoid this!
102
Data Integrity: Read Concern
● New feature in MongoDB and PSMDB 3.2+● Like write concerns, the consistency of reads can be
tuned per session or operation● Levels
○ “local” - Default, return the current node’s most-recent version of the data
○ “majority” - Reads return the most-recent version of the data that has been ack’d on a majority of nodes. Not supported on MMAPv1.
○ “linearizable” (3.4+) - Reads return data that reflects a “majority” read of all changes prior to the read
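Like write concern, read concern is set per operation. A mongo-shell sketch against a replica set (collection and filter are illustrative):

```javascript
// Read only the majority-committed version of the data (3.2+,
// not supported on MMAPv1); "orders" is a hypothetical collection.
db.orders.find({ status: "pending" }).readConcern("majority");
```

Pairing this with a "majority" write concern is what closes the rollback window described earlier.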
103
Data Integrity: Replication
● Oplog○ Ordered by time○ Written to locally after apply-time of
■ Client API change■ Or replication change
○ A crashed node will resume replication using last position from local oplog○ Size of Oplog
■ Monitor this closely!■ The length of time from start to end of the oplog affects the impact of adding new nodes■ If a node is brought online with a backup within the window it avoids a full sync■ If a node is brought online with a backup older than the window it will full sync!!!
● Lag
○ Because replication is asynchronous, lag is possible
○ Use of Read Concerns and/or Write Concerns can work around this!
104
Data Integrity: Replication
● Replset as Data Redundancy (use at own risk)○ Lots of Replset Members○ Read and Write Concern○ Proper Geolocation/Node Redundancy○ Cheaper, non-redundant storage becomes possible
■ JBOD■ RAID0■ InMemory (faster)
105
Data Integrity: Disaster Recovery
● Storing data in many physical locations provides improved recovery options and integrity
● Hidden Secondaries○ Store a Hidden Secondary in another location
● Upload completed backups to another location● Use a geo-distributed architecture
○ Sharding: Look into Tag-aware Sharding○ Replica Set: Multi-locations of members
Scaling Reads“The problem with troubleshooting is trouble shoots back” ~ Unknown
107
Scaling Reads: Schema and Summaries
● Correct data types○ “true” vs true○ “123456” vs 123456○ "2017-09-19T16:50:58.347Z" vs ISODate("2017-09-19T16:50:58.347Z")
● Predictive Workflow○ Read-heavy apps benefit from pre-computed results○ Consider moving expensive reads computation to insert/update/delete
■ Example: move .count() read query to a summary-document, increment/decrement summary count at write-time
● External Caching○ MongoDB is fast, memcache is even faster (although very simple)
108
Scaling Reads: Read Preference
● Modes○ primary (default)○ primaryPreferred○ secondary○ secondaryPreferred (recommended for Read Scaling!)○ nearest
● Tags○ Select nodes based on key/value pairs (one or more)○ Often used for
■ Datacenter awareness, eg: { “dc”: “eu-east” }■ Specific workflows, eg: Analytics, BI, Batch summaries, Backups
109
Scaling Reads: Secondary Members
● Linear* scale-out of reads, utilization of all nodes!● Be aware
○ rs.add() new Replicas with no data triggers initial sync from Primary!■ This can also happen if the backup is too old
○ 50 member maximum, Primary included● Adding a Replica
○ Ensure configuration replSetName and key files match○ Logical Restore
■ mongorestore a mongodump-based backup (containing oplog)■ Use bsondump to find the last document in the “oplog.bson” file■ Create oplog and insert the last “oplog.bson” document
use local
db.runCommand( { create: "oplog.rs", capped: true, size: (20 * 1024 * 1024 * 1024) } );
db.oplog.rs.insert(<last doc>);
110
Scaling Reads: Secondary Members
● Adding a Replica○ Logical Restore
■ Create oplog and insert the last “oplog.bson” document■ Gather the “system.replset” document from the “local” database of an existing node
use local
db.system.replset.find()
{ … }
■ Insert the “system.replset” document into the “local” database of the new node
use local
db.system.replset.insert(< document >);
■ On the Primary, add the new node with rs.reconfig() or rs.add()● Adding the node as “hidden: true” during recovery causes less visible lag
Scaling Writes (Sharding)“The problem with troubleshooting is trouble shoots back” ~ Unknown
112
Sharding: Components
● Sharding Routers○ ‘mongos’ binary is the router○ Applications connect to the Sharding Router○ Abstraction layer to driver
● Config Metadata
○ Read by the Sharding Routers (‘mongos’)
○ Config Servers
■ 1+ ‘mongod’ servers storing the metadata
■ 3.4+ config servers are required to be a Replica Set
● Shards
○ 1+ ‘mongod’ instances that store an exclusive piece of the cluster data
○ Often a Replica Set
■ 3 x members or more recommended
○ Standalone nodes are supported (misconception)
113
Sharding: Config Metadata
● Metadata Collections○ databases: Sharding-enabled databases○ collections: Sharding-enabled collections
■ Shard Key■ Shard Primary
○ chunks: List of collection chunks■ Chunk range of shard key■ Mapping of responsible shard
○ shards: List of shards in cluster○ actionlog: Log of performed actions○ locks: Distributed Locks○ lockpings: Pings to Distributed Locks○ mongos: List of mongos instances (online and offline, check ‘ping’)
114
Sharding: Shard Key
● The shard key determines the distribution of the collection’s documents among the cluster’s shards.
● The shard key is either an indexed field or indexed compound fields that exists in every document in the collection.
● MongoDB partitions data in the collection using ranges of shard key values.
● Each range defines a non-overlapping range of shard key values and is associated with a chunk.
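The range-to-chunk mapping can be sketched in a few lines. Values and shard names below are illustrative; in a real cluster this mapping lives in the config.chunks collection and is consulted by mongos:

```javascript
// Conceptual chunk map: each chunk owns a [min, max) slice of the
// shard key range and is assigned to exactly one shard.
const chunks = [
  { min: 1,  max: 50,  shard: "shard0" },
  { min: 50, max: 101, shard: "shard1" },
];

function shardForKey(key) {
  const chunk = chunks.find((c) => key >= c.min && key < c.max);
  return chunk ? chunk.shard : null; // null: key outside all ranges
}

console.log(shardForKey(10)); // shard0
console.log(shardForKey(50)); // shard1: lower bound inclusive, upper exclusive
```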
115
Sharding: Chunks
● A contiguous range of shard key values within a particular shard● Chunk ranges are inclusive of the lower boundary and exclusive of the
upper boundary (minKey and maxKey)● MongoDB splits chunks when they grow beyond the configured chunk
size, which by default is 64 megabytes.● MongoDB migrates chunks when a shard contains too many chunks
116
Sharding: Balancer
● The balancer is a background process that monitors the number of chunks on each shard.
● When the number of chunks on a given shard reaches the migration thresholds, the balancer attempts to automatically migrate chunks between shards to reach an equal number of chunks per shard.
● The balancer does not move any data itself! The data moves directly from shard to shard.
● Serial operation in pre-3.4, multiple migrations possible in 3.4+
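The balancer's trigger condition boils down to the spread in chunk counts. A sketch of that decision (the threshold of 8 here matches the documented value for large collections; smaller collections use lower thresholds, so treat the default as illustrative):

```javascript
// Migrate when the chunk-count gap between the most- and least-loaded
// shards reaches the migration threshold.
function needsBalancing(chunkCounts, threshold = 8) {
  return Math.max(...chunkCounts) - Math.min(...chunkCounts) >= threshold;
}

console.log(needsBalancing([100, 100, 104])); // false: spread of 4
console.log(needsBalancing([120, 100, 100])); // true: spread of 20
```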
117
Sharding: Deployment Methods
● Small Deployment / Getting Started○ Useful for app that will scale up eventually○ Start with 1 shard and a good shard key○ Add replica set members and/or shards as the system scales
● Chunk Pre-Splitting○ Chunks that grow larger than 64mb are split by the shard Primary
■ Chunk splits impact write performance!○ Pre-creating an even distribution of chunks spanning the shard key range improves
write performance greatly!○ Example: A shard key with a possible value of 1-100 pre-split on 10 shards will
result in 10 pre-split chunks per shard
○ More on this topic:
https://docs.mongodb.com/manual/tutorial/create-chunks-in-sharded-cluster/
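The 1-100 example above can be pre-split with a short loop in the mongo shell, run via a mongos before loading data (the namespace and shard-key field are hypothetical):

```javascript
// Pre-split an empty sharded collection whose numeric shard key spans
// 1-100 into ten chunks ("mydb.items" and "itemId" are illustrative).
for (var i = 10; i < 100; i += 10) {
  sh.splitAt("mydb.items", { itemId: i });
}
// Chunks can then be distributed before the data load, e.g.:
// sh.moveChunk("mydb.items", { itemId: 10 }, "shard0001");
```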
118
Sharding: Tag Aware Sharding
● In sharded clusters, you can tag specific ranges of the shard key and associate those tags with a shard or subset of shards.
● MongoDB routes reads and writes that fall into a tagged range only to those shards configured for the tag.
● Use Cases○ Isolate a subset of data on a specific set of shards.○ Ensure that the most relevant data reside on
shards that are geographically closest to the application servers.
○ Route data based on Hardware / Performance
119
Sharding: Multi-Datacenter
● Application Usage○ Use “nearest” Read Preference to route reads
to closest servers○ Use shard tag to route ops to correct shard(s)
● Mongos / Routers○ At least 2 x per datacenter or running locally to
application servers● Config Servers
○ Replica Set based (CSRS) set required
○ At least 2 x Config Servers per Datacenter
○ 50 max members in a Config Server set
○ Not all members need votes (7 voting-member limit)
120
Sharding: Multi-Datacenter
Coming in 3.6...
122
MongoDB 3.6
● Transactions○ Previously did not exist in MongoDB except TokuMX
● Better Default Security○ bindIp set to localhost○ Addition of source IP restrictions
● Better concurrency on metadata refreshing to remove slowdowns
● Metadata refresh can now retry 10 times, not just 3
● New cluster-based logical clock vs system-based clocks
○ Watch this feature closely!○ NTP may no longer be necessary
● RepairDB removed
● Move Chunk command reports statistics
123
MongoDB 3.6: More on this...
Speaker:David Murphy
Location:Goldsmith 2
Date:Tuesday,
17:55-18:20
124
Questions?