Running MongoDB in Production
Tim Vaillancourt, Sr. Technical Operations Architect, Percona
2
`whoami`
{name: "tim", lastname: "vaillancourt", employer: "percona", techs: [
  "mongodb", "mysql", "cassandra", "redis", "rabbitmq", "solr", "python", "golang"
]}
3
Agenda
● Backups
● Security
● Monitoring
● Architecture and High-Availability
● Hardware
● Tuning MongoDB
● Tuning Linux
● Troubleshooting
● Schema
● Data Integrity
● Scaling (Reads/Writes)
4
Terminology
● Data
  ○ Document: single *SON object, often nested
  ○ Field: single field in a document
  ○ Collection: grouping of documents
  ○ Database: grouping of collections
  ○ Capped Collection: a fixed-size FIFO collection
● Replication
  ○ Oplog: a special capped collection for replication
  ○ Primary: the replica set node that can receive writes
  ○ Secondary: a read-only replica of the Primary
  ○ Voting: the process to elect a Primary node
  ○ Hidden-Secondary: a replica that cannot become Primary
Backups
"An admin is only worth the backups they keep" ~ Unknown
6
Backups: Logical
● 'mongodump' tool from the mongo-tools project
● Supports
  ○ Multi-threaded dumping in 3.2+
  ○ Optional gzip compression of data
  ○ Optional dumping of the oplog for single-node consistency
  ○ Replica set awareness (Read Preference)
    ■ i.e.: primary, primaryPreferred, secondary, secondaryPreferred, nearest
● Process
  ○ Tool issues a .find() with a $snapshot query
  ○ Stores BSON data in one file per collection
  ○ Stores BSON oplog data in "oplog.bson"
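Putting those options together, a sketch of an invocation (hostnames and output path are placeholders, not from the slides):

```shell
# Hypothetical mongodump run: secondary read preference, oplog capture
# for single-node point-in-time consistency, gzip-compressed output.
HOST="rs0/mongo1.example.com:27017,mongo2.example.com:27017"
OUT="/backup/$(date +%Y%m%d)"
CMD="mongodump --host=$HOST --readPreference=secondaryPreferred --oplog --gzip --out=$OUT"
echo "$CMD"
```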
7
Backups: Logical
● Limitations
  ○ Not Sharding aware
  ○ Sharded backups are not point-in-time consistent
  ○ Fetching from the storage engine, serialization, networking, etc. is very inefficient
  ○ Indexes are rebuilt serially (per collection)
    ■ Only index metadata is in the backup
    ■ Often the indexing process takes longer than restoring the data!
  ○ Wire Protocol Compression (added in 3.4+) not supported: https://jira.mongodb.org/browse/TOOLS-1668 (please vote/watch the issue!)
  ○ Requires the oplog to be as long as the backup run time
8
Backups: Binary
● Options
  ○ Cold Backup
  ○ LVM Snapshot
  ○ Hot Backup
    ■ Percona Server for MongoDB (FREE!)
    ■ MongoDB Enterprise Hot Backup (non-free)
    ■ NOTE: MMAPv1 not supported
● Benefits
  ○ Indexes are backed up == faster restore!
  ○ Storage-engine format backed up == faster backup AND restore!
● Limitations
  ○ Increased backup storage requirements
  ○ Compression is storage-engine dependent
9
Backups: Binary
● Limitations
  ○ CPU architecture limitations (64-bit vs 32-bit)
  ○ Cascading corruption
  ○ Batteries not included
    ■ Not Sharding aware
    ■ Not Replica Set aware
● Process
  ○ Cold Backup
    ■ Stop a mongod SECONDARY, copy/archive the dbPath
  ○ LVM Snapshot
    ■ Optionally call 'db.fsyncLock()' (not required in 3.2+ with Journaling)
    ■ Create an LVM snapshot of the dbPath
    ■ Copy/archive the dbPath
10
Backups: Binary
● Process
  ○ LVM Snapshot
    ■ Remove the LVM snapshot (as quickly as possible!)
    ■ NOTE: LVM snapshots can cause up to 30%* write latency impact to disk (due to copy-on-write)
  ○ Hot Backup (PSMDB or MongoDB Enterprise)
    ■ Pay $$$ for MongoDB Enterprise or download PSMDB for free(!)
    ■ db.adminCommand({
        createBackup: 1,
        backupDir: "/data/mongodb/backup"
      })
    ■ Copy/archive the output path
    ■ Delete the backup output path
    ■ NOTE: the RocksDB-based createBackup creates filesystem hardlinks whenever possible!
    ■ NOTE: delete the RocksDB backupDir as soon as possible to reduce bloom filter overhead!
11
Backups: Architecture
● Risks
  ○ Dynamic nature of Replica Sets
  ○ Impact of backups on live nodes
● Example: Cheap Disaster Recovery
  ○ Place a 'hidden: true' SECONDARY in another location
  ○ Optionally use a cloud object store (AWS S3, Google GS, etc.)
● Example: Replica Set Tags
  ○ "tags" allow fine-grained server selection with key/value pairs
  ○ Use key/value pairs to fence various application workflows
  ○ Example:
    ■ { "role": "backup" } == Backup Node
    ■ { "role": "application" } == App Node
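As an illustration (not actual driver code), tag-based server selection boils down to matching every key/value pair of the requested tag set against each member's tags:

```python
def select_members(members, tags):
    """Return members whose tags contain every key/value pair in `tags`."""
    return [m for m in members
            if all(m.get("tags", {}).get(k) == v for k, v in tags.items())]

# Hypothetical replica set members with the roles from the slide
members = [
    {"host": "db1:27017", "tags": {"role": "application"}},
    {"host": "db2:27017", "tags": {"role": "backup"}},
]
print(select_members(members, {"role": "backup"}))
# → [{'host': 'db2:27017', 'tags': {'role': 'backup'}}]
```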
12
Backups: mongodb_consistent_backup
● Python project by Percona-Lab for consistent backups
● URL: https://github.com/Percona-Lab/mongodb_consistent_backup
● Best-effort support, not a "Percona Product"
● Created to solve limitations in MongoDB backup tools:
  ○ Replica Set and Sharded Cluster awareness
  ○ Cluster-wide point-in-time consistency
  ○ In-line oplog backup (vs post-backup)
  ○ Notifications of success / failure
● Extra Features
  ○ Remote upload (AWS S3, Google Cloud Storage and Rsync)
  ○ Archiving (Tar or ZBackup deduplication, optional AES-at-rest)
  ○ CentOS/RHEL 7 RPMs and Docker-based releases (.deb soon!)
13
Backups: mongodb_consistent_backup
● Extra Features
  ○ Single Python PEX binary
  ○ Multithreaded / concurrent
  ○ Auto-scales to available CPUs
● Low-Impact
  ○ Tool focuses on low impact
  ○ Uses Secondary nodes only
  ○ Considers (scoring)
    ■ Replication lag
    ■ Replication priority
    ■ Replication health / state
    ■ Hidden-Secondary state (preferred by the tool)
    ■ Fails if the chosen Secondary becomes Primary (on purpose)
14
Backups: mongodb_consistent_backup
● 1.2.0
  ○ Multi-threaded Rsync upload
  ○ Replica Set Tags support
  ○ Support for MongoDB SSL / TLS connections and client auth
  ○ Rotation / expiry of old backups (locally-stored only)
● Future
  ○ Incremental backups
  ○ Binary-level backups (Hot Backup, Cold Backup, LVM, cloud-based, etc.)
  ○ More notification methods (PagerDuty, Email, etc.)
  ○ Restore helper tool
  ○ Instrumentation / metrics
  ○ <YOUR AWESOME IDEA HERE> we take GitHub PRs (and it's Python)!
15
Backups: mongodb_consistent_backup
● Simple Restore
  ○ Seamless restore: "mongorestore --oplogReplay --gzip --dir /path/to/backup"
● Restore an Entire Cluster
  ○ mongorestore backups of the config servers
    ■ If restoring old/SCCC config servers, restore to every node
    ■ If restoring replica-set config servers:
      ● Ensure the Replica Set is initiated (rs.initiate() / rs.config())
      ● Ensure SECONDARY members are added (via the PRIMARY)
      ● Restore to the PRIMARY only
  ○ Update "config.shards" documents if shard hosts/ports changed
  ○ mongorestore each shard from its backup subdirectory (matches the shard name)
  ○ Start a mongos process and test / QA
    ■ Tip: stopping the balancer may simplify troubleshooting any problems
16
More on Backups (some overlap)
Room: Field Suite #2
Time: Wednesday, 16:30 - 16:55
Security
"Think of the network like a public place" ~ Unknown
18
Security: Authorization
● Always enable auth on Production installs!
● Built-in Roles
  ○ Database User: read or write data from collections
    ■ "All Databases" or single-database
  ○ Database Admin: non-R/W commands (create/drop/list/etc.)
  ○ Backup and Restore: back up or restore data
  ○ Cluster Admin: add/drop/list shards
  ○ Superuser/Root: all capabilities
● User-Defined Roles
  ○ Exact Resource+Action specification
  ○ Very fine-grained ACLs
    ■ DB- and collection-specific
19
Security: Filesystem Access
● Use a service user+group
  ○ 'mongod' or 'mongodb' on most systems
  ○ Ensure the data path, log file and key file(s) are owned by this user+group
● Data Path
  ○ Mode: 0750
● Log File
  ○ Mode: 0640
  ○ Contains real queries and their fields!!!
    ■ See Log Redaction in PSMDB (or MongoDB Enterprise) to remove these fields
● Key File(s)
  ○ Files include: keyFile and SSL certificates or keys
  ○ Mode: 0600
20
Security: Network Access
● Firewall
  ○ Single TCP port
    ■ MongoDB Client API
    ■ MongoDB Replication API
    ■ MongoDB Sharding API
  ○ Sharding
    ■ Only the 'mongos' process needs access to the shards
    ■ The client driver does not need to reach the shards directly
  ○ Replication
    ■ All nodes must be accessible to the driver
● Internal Authentication: use a key for inter-node replication/sharding
● Creating a dedicated network segment for databases is recommended!
● Do NOT allow MongoDB to talk to the internet!!!
21
Security: System Access
● Recommended to restrict system access to Database Administrators
● A "shell" on a system can be enough to take the system over!
● SELinux
  ○ Built-in Linux kernel security mechanism
  ○ Massively reduces the attack surface of a system by using ACLs/policies
  ○ Modes
    ■ Enforcing: do not allow policy violations
    ■ Permissive: log and allow policy violations
    ■ Disabled: I really don't like security!
  ○ Enforcing mode supported with Percona Server for MongoDB when using CentOS / RHEL 7+ RPMs
    ■ SELinux is NOT supported by MongoDB Community or Enterprise binaries!!
22
Security: External Authentication
● LDAP Authentication
  ○ Supported in PSMDB and MongoDB Enterprise
  ○ The following components are necessary for external authentication to work:
    ■ LDAP Server: remotely stores all user credentials (i.e. user name and associated password)
    ■ SASL Daemon: used as a MongoDB server-local proxy for the remote LDAP service
    ■ SASL Library: used by the MongoDB client and server to create authentication mechanism-specific data
  ○ Creating a user:
    db.getSiblingDB("$external").createUser({ user: "christian", roles: [{ role: "read", db: "test" }] });
  ○ Authenticating as a user:
    db.getSiblingDB("$external").auth({ mechanism: "PLAIN", user: "christian", pwd: "secret", digestPassword: false });
  ○ Other auth methods are possible with MongoDB Enterprise
23
Security: SSL Connections and Auth
● SSL / TLS Connections
  ○ Supported since MongoDB 2.6.x
    ■ May need to compile it in yourself on older binaries
    ■ Supported 100% in Percona Server for MongoDB
  ○ Minimum of a 128-bit key length for security
  ○ Relaxed and strict (requireSSL) modes
  ○ System (default) or custom Certificate Authorities are accepted
● SSL Client Authentication (x.509)
  ○ MongoDB supports x.509 certificate authentication for use with a secure TLS/SSL connection as of 2.6.x
  ○ x.509 client authentication allows clients to authenticate to servers with certificates rather than with a username and password
  ○ Enabled with: security.clusterAuthMode: x509
24
Security: Encryption at Rest
● MongoDB Enterprise
  ○ Encryption supported in Enterprise binaries ($$$)
● Percona Server for MongoDB
  ○ Use a CryptFS/LUKS block device for encryption of the data volume
  ○ Documentation published (or coming soon)
  ○ Completely open-source / free
● Application-Level
  ○ Selectively encrypt only required fields in the application
  ○ Benefits
    ■ The data is only readable by the application (reduced touch points)
    ■ The resource cost of encryption is lower when it's applied selectively
    ■ Offloading of encryption overhead from the database
25
Security: Network Firewall
● MongoDB only requires a single TCP port to be reachable (on all nodes)
  ○ Default port 27017
  ○ This does not include monitoring tools, etc.
    ■ Percona PMM requires inbound connectivity to 1-2 TCP ports
● Restrict TCP port access to nodes that require it!
● Sharded Cluster
  ○ Application servers only need access to 'mongos'
  ○ Block direct TCP access from application -> shard/mongod instances
    ■ Unless 'mongos' is bound to localhost!
● Advanced
  ○ Move inter-node replication to its own network fabric, VLAN, etc.
  ○ Accept client connections on a public interface
26
More on Security (some overlap)
Room: Field Suite #2
Time: Tuesday, 17:25 to 17:50
Monitoring
28
Monitoring: Methodology
● Monitor often
  ○ Polling every 60 - 300 seconds is not enough!
  ○ Problems can begin/end in seconds
● Correlate Database and Operating System metrics together!
● Monitor a lot
  ○ Store more than you graph
  ○ Example: PMM gathers 700-900 metrics per polling interval
● Process
  ○ Use monitoring to troubleshoot Production events / incidents
  ○ Iterate and improve monitoring
    ■ Add graphing for whatever made you SSH to a host
    ■ Blind QA with someone unfamiliar with the problem
29
Monitoring: Important Metrics
● Database
  ○ Operation counters
  ○ Cache traffic and capacity
  ○ Checkpoint / compaction performance
  ○ Concurrency tickets (WiredTiger and RocksDB)
  ○ Document and index scanning
  ○ Various engine-specific details
● Operating System
  ○ CPU
  ○ Disk
    ■ Bandwidth / utilisation
    ■ Average wait time
  ○ Memory and Network
30
Monitoring: Percona PMM
● Open-source monitoring from Percona!
● Based on open-source technology
  ○ Prometheus
  ○ Grafana
  ○ Go language
● Simple deployment
● Examples in this demo are from PMM!
● Correlation of OS and DB metrics
● 800+ metrics per ping
Architecture and High-Availability
32
High Availability
● Replication
  ○ Asynchronous
    ■ Write Concerns can provide pseudo-synchronous replication
    ■ Changelog-based, using the "Oplog"
  ○ Maximum of 50 members
  ○ Maximum of 7 voting members
    ■ Use "votes: 0" for members beyond 7
  ○ Oplog
    ■ The "oplog.rs" capped collection in the "local" database, storing changes to data
    ■ Read by secondary members for replication
    ■ Written to by the local node after the "apply" of each operation
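A toy sketch of the changelog idea above (the real oplog format differs; the op codes here only mimic it):

```python
# Secondaries tail the primary's changelog and re-apply each operation
# locally, in order. `data` stands in for a node's collection.
def apply_op(data, op):
    if op["op"] == "i":        # insert
        data[op["o"]["_id"]] = op["o"]
    elif op["op"] == "u":      # update ($set-style only, for brevity)
        data[op["o2"]["_id"]].update(op["o"]["$set"])
    elif op["op"] == "d":      # delete
        data.pop(op["o"]["_id"], None)

oplog = [
    {"op": "i", "o": {"_id": 1, "name": "tim"}},
    {"op": "u", "o2": {"_id": 1}, "o": {"$set": {"employer": "percona"}}},
]
replica = {}
for entry in oplog:           # replay the changelog on a fresh node
    apply_op(replica, entry)
print(replica)  # {1: {'_id': 1, 'name': 'tim', 'employer': 'percona'}}
```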
33
Architecture
● Datacenter Recommendations
  ○ A minimum of 3 physical servers is required for High-Availability
  ○ Ensure only 1 member per Replica Set is on any single physical server!!!
● EC2 / Cloud Recommendations
  ○ Place Replica Set members in an odd number of Availability Zones, in the same region
  ○ Use a hidden-secondary node for Backup and Disaster Recovery in another region
  ○ Entire Availability Zones have been lost before!
Hardware
35
Hardware: Mainframe vs Commodity
● Databases: The Past
  ○ Buy some really amazing, expensive hardware
  ○ Buy some crazy-expensive license
    ■ Don't run a lot of servers, due to the above
  ○ Scale up:
    ■ Buy even more amazing hardware for the monolithic host
    ■ Hardware came on a truck
  ○ HA: when it rains, it pours
● Databases: A New Era
  ○ Everything fails, nothing is precious
  ○ Elastic infrastructures ("the cloud", Mesos, etc.)
  ○ Scale out: add more cheap, commodity servers
  ○ HA: lots of cheap, commodity servers - still up!
36
Hardware: Block Devices
● Isolation
  ○ Run mongod dbPaths on a separate volume
  ○ Optionally, run the mongod journal on a separate volume
● RAID Level
  ○ RAID 10 == the performance/durability sweet spot
  ○ RAID 0 == fast and dangerous
● SSDs
  ○ Benefit MMAPv1 a lot
  ○ Benefit WT and RocksDB a bit less
  ○ Keep about 30% free for internal GC on the SSD
37
Hardware: Block Devices
● EBS / NFS / iSCSI
  ○ Risks / Drawbacks
    ■ Exponentially more things to break
    ■ Block device requests wrapped in TCP are extremely slow
    ■ You probably already paid for some fast local disks
    ■ More difficult (sometimes nearly impossible) to troubleshoot
    ■ MongoDB doesn't really benefit from remote storage features/flexibility
      ● Built-in high availability of data via replication
      ● MongoDB replication can bootstrap new members
      ● Strong write concerns can be specified for critical data
38
Hardware: CPUs
● Cores vs Core Speed
  ○ Lots of cores > faster cores (4 CPUs minimum recommended)
  ○ Thread-per-connection model
● CPU Frequency Scaling
  ○ 'cpufreq': a daemon for dynamic scaling of the CPU frequency
  ○ A terrible idea for databases, or for any predictability!
  ○ Disable it, or set the governor to 100% frequency always, i.e. mode: 'performance'
  ○ Disable any BIOS-level performance/efficiency tunable
  ○ ENERGY_PERF_BIAS
    ■ A CentOS/RedHat tuning for the energy-vs-performance balance
    ■ RHEL 6 = 'performance'
    ■ RHEL 7 = 'normal' (!)
    ■ My advice: use 'tuned' to set it to 'performance'
39
Hardware: Network Infrastructure
● Datacenter Tiers
  ○ Network Edge
  ○ Public Server VLAN
    ■ Servers with public NAT and/or port forwards from the Network Edge
    ■ Examples: proxies, static content, etc.
    ■ Calls backends in the Backend Server VLAN
  ○ Backend Server VLAN
    ■ Servers with port forwarding from the Public Server VLAN (w/source-IP ACLs)
    ■ Optional load balancer for stateless backends
    ■ Examples: webservers, application servers/workers, etc.
    ■ Calls data stores in the Data VLAN
  ○ Data VLAN
    ■ Servers, filers, etc. with port forwarding from the Backend Server VLAN (w/source-IP ACLs)
    ■ Examples: databases, queues, filers, caches, HDFS, etc.
40
Hardware: Network Infrastructure
● Network Fabric
  ○ Try to use 10GbE for low latency
  ○ Use Jumbo Frames for efficiency
  ○ Try to keep all MongoDB nodes on the same segment
    ■ Goal: few or no network hops between nodes
    ■ Check with 'traceroute'
● Outbound / Public Access
  ○ Databases don't need to talk to the internet*
    ■ Store a copy of your Yum, DockerHub, etc. repos locally
    ■ Deny any access to the public internet, or have no route to it
    ■ Hackers will try to upload a dump of your data out of the network!!
● Cloud?
  ○ Try to replicate the above with the features of your provider
41
Hardware: Why So Quick?
● MongoDB allows you to scale reads and writes with more nodes
  ○ Single-instance performance is important, but not a deal-breaker
● You are the most expensive resource!
  ○ Not the hardware anymore
Tuning MongoDB
43
Tuning MongoDB: MMAPv1
● MMAP: a kernel-level function to map file blocks to memory
● MMAPv1 syncs data to disk once per 60 seconds (default)
  ○ Override with the --syncDelay <seconds> flag
  ○ If a server with no journal crashes, it can lose up to 1 minute of data!!!
● In-memory buffering of the Journal
  ○ Synced every 30 ms if the journal is on a different disk
  ○ Or every 100 ms otherwise
  ○ Or 1/3rd of the above if a change uses the Journaled write concern (explained later)
44
Tuning MongoDB: MMAPv1
● Fragmentation
  ○ Can cause serious slowdowns on scans, range queries, etc.
  ○ db.<collection>.stats()
    ■ Shows various storage info for a collection
    ■ Fragmentation can be estimated by dividing 'storageSize' by 'size'
    ■ Any value > 1 indicates fragmentation
  ○ Compact when you near a value of 2, by rebuilding secondaries or using the 'compact' command
  ○ WiredTiger and RocksDB have little to no fragmentation, due to checkpoints / compaction
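The 'storageSize' / 'size' arithmetic as a small sketch (the stats numbers here are made up for illustration):

```python
def fragmentation_ratio(stats):
    """storageSize / size from db.<collection>.stats(); > 1 indicates fragmentation."""
    return stats["storageSize"] / stats["size"]

# Hypothetical stats() output for an MMAPv1 collection
stats = {"size": 1_000_000, "storageSize": 1_900_000}
ratio = fragmentation_ratio(stats)
print(round(ratio, 2))  # 1.9 — nearing 2, consider compacting
```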
45
Tuning MongoDB: WiredTiger
● WT syncs data to disk in a process called "checkpointing"
  ○ Every 60 seconds, or at >= 2GB of data changes
● In-memory buffering of the Journal
  ○ Journal buffer size: 128KB
  ○ Synced every 50 ms (as of 3.2)
  ○ Or on every change with the Journaled write concern (explained later)
  ○ While journal records remain in the buffer between syncs, updates can be lost following a hard shutdown!
46
Tuning MongoDB: RocksDB
● A level-based strategy using immutable data level files
  ○ Built-in compression
  ○ Block and filesystem caches
● RocksDB uses "compaction" to apply changes to data files
  ○ Tiered level compaction
  ○ Follows the same logic as MMAPv1 for journal buffering
● MongoRocks
  ○ A layer between RocksDB and MongoDB's storage engine API
  ○ Developed in partnership with Facebook
47
Tuning MongoDB: Storage Engine Caches
● WiredTiger
  ○ In-heap
    ■ 50% of available system memory
    ■ Uncompressed WT pages
  ○ Filesystem Cache
    ■ 50% of available system memory
    ■ Compressed pages
● RocksDB
  ○ Internal testing planned by Percona in the future
  ○ 30% in-heap cache recommended by Facebook / Parse
48
Tuning MongoDB: Durability
● storage.journal.enabled = <true/false>
  ○ Default since 2.0 on 64-bit builds
  ○ Always enable, unless data is transient
  ○ Always enable on cluster config servers
● storage.journal.commitIntervalMs = <ms>
  ○ Max time between journal syncs
● storage.syncPeriodSecs = <secs>
  ○ Max time between data file flushes
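In mongod.conf YAML, these durability settings look like the sketch below (the values shown are illustrative defaults, not a recommendation):

```yaml
# mongod.conf (sketch)
storage:
  journal:
    enabled: true
    commitIntervalMs: 100   # max ms between journal syncs
  syncPeriodSecs: 60        # max secs between data file flushes
```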
49
Tuning MongoDB: Don’t Enable!
● "cpu"
  ○ External monitoring is recommended instead
● "rest"
  ○ Will be deprecated in 3.6+
● "smallfiles"
  ○ In most situations this is not necessary, unless:
    ■ You use MMAPv1, and
    ■ It is a Development / Test environment, or
    ■ You have 100s-1000s of databases with very little data inside (unlikely)
● Profiling mode '2'
  ○ Unless troubleshooting an issue / intentional
Tuning Linux
51
Tuning Linux: The Linux Kernel
● Still on Linux 2.6.x?
● Avoid kernels earlier than 3.10.x - 3.12.x
● Large improvements in parallel efficiency in 3.10+ (for free!)
● More: https://blog.2ndquadrant.com/postgresql-vs-kernel-versions/
52
Tuning Linux: NUMA
● A memory architecture that takes into account the locality of memory, caches and CPUs, for lower latency
  ○ But no databases want to use it :(
● The MongoDB codebase is not NUMA-"aware", causing unbalanced memory allocations on NUMA systems
● Disable NUMA
  ○ In the server BIOS
  ○ Using 'numactl' in init scripts BEFORE the 'mongod' command (recommended for future compatibility):

numactl --interleave=all /usr/bin/mongod <other flags>
53
Tuning Linux: Transparent HugePages
● Introduced in RHEL/CentOS 6, Linux 2.6.38+
● Merges memory pages in the background (the khugepaged process)
● Decreases overall performance when used with MongoDB!
● "AnonHugePages" in /proc/meminfo shows usage
● Disable Transparent HugePages!
  ○ Add "transparent_hugepage=never" to the kernel command line (GRUB)
  ○ Reboot the system
    ■ Disabling online does not clear previously-created THP pages
    ■ Rebooting tests that your system will come back up!
54
Tuning Linux: Time Source
● Replication and clustering need consistent clocks
  ○ mongodb_consistent_backup relies on time sync, for example!
● Use a consistent time source/server
  ○ "It's OK if everyone is equally wrong"
● Non-Virtualised
  ○ Run an NTP daemon on all MongoDB and monitoring hosts
  ○ Enable the service so it starts on reboot
● Virtualised
  ○ Check if your VM platform has an "agent" syncing time
  ○ VMware and Xen are known to have their own time sync
  ○ If no time sync is provided, install the NTP daemon
55
Tuning Linux: I/O Scheduler
● The algorithm the kernel uses to commit reads and writes to disk
● CFQ
  ○ "Completely Fair Queueing"
  ○ Default scheduler in 2.6-era Linux distributions
  ○ Perhaps too clever/inefficient for database workloads
  ○ Probably good for a laptop
● Deadline
  ○ The best general default, IMHO
  ○ Predictable I/O request latencies
● Noop
  ○ Use with virtualised servers
  ○ Use with real-hardware BBU RAID controllers
56
Tuning Linux: Filesystems
● Filesystem Types
  ○ Use XFS or EXT4, not EXT3
    ■ EXT3 has very poor pre-allocation performance
    ■ XFS is strongly recommended for WiredTiger
    ■ EXT4 "data=ordered" mode recommended
  ○ Btrfs: not tested, yet!
● Filesystem Options
  ○ Set 'noatime' on MongoDB data volumes in '/etc/fstab':
○ Remount the filesystem after an options change, or reboot
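A hypothetical '/etc/fstab' entry with 'noatime' set (the device, mount point and filesystem here are placeholders):

```
# /etc/fstab — sketch of a MongoDB data volume entry
/dev/sdb1  /data/mongodb  xfs  defaults,noatime  0  0
```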
57
Tuning Linux: Block Device Readahead
● A tuning that causes data ahead of a requested block on disk to be read and then cached
● Assumptions:
  ○ There is a sequential read pattern
  ○ Something will benefit from the extra cached blocks
● Risks
  ○ Too high a value wastes cache space
  ○ Increases eviction work
    ■ MongoDB tends to have very random disk patterns
● A good start for MongoDB volumes is a readahead of '32' (16KB)
  ○ Let MongoDB worry about optimising the access pattern
58
Tuning Linux: Block Device Readahead
● Change readahead
  ○ Add a file to '/etc/udev/rules.d'
    ■ /etc/udev/rules.d/60-mongodb-disk.rules:
      # set deadline scheduler and 32/16kb read-ahead for /dev/sda
      ACTION=="add|change", KERNEL=="sda", ATTR{queue/scheduler}="deadline", ATTR{bdi/read_ahead_kb}="16"
  ○ Reboot (or use CLI tools to apply)
59
Tuning Linux: Virtual Memory Dirty Pages
● Dirty Pages
  ○ Pages stored in cache that still need to be written to storage
● Dirty Ratio
  ○ Max percent of total memory that can be dirty
  ○ The VM stalls and flushes when this limit is reached
  ○ Start with '10'; the default (30) is too high
● Dirty Background Ratio
  ○ A separate threshold for background dirty-page flushing
  ○ Flushes without pauses
  ○ Start with '3'; the default (15) is too high
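As sysctl settings, the suggested starting points above would be sketched as:

```
# /etc/sysctl.conf — dirty-page starting points from the slide
vm.dirty_ratio = 10
vm.dirty_background_ratio = 3
```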
60
Tuning Linux: Swappiness
● A Linux kernel sysctl setting for preferring RAM or disk for swap
  ○ Linux default: 60
  ○ To avoid disk-based swap: 1 (not zero!)
  ○ To allow some disk-based swap: 10
  ○ '0' can cause more swapping than '1' on recent kernels
    ■ More on this here: https://www.percona.com/blog/2014/04/28/oom-relation-vm-swappiness0-new-kernel/
61
Tuning Linux: Ulimit
● Allows per-Linux-user resource constraints
  ○ Number of user-level processes
  ○ Number of open files
  ○ CPU seconds
  ○ Scheduling priority
  ○ And others…
● MongoDB
  ○ Should probably have a dedicated VM, container or server
  ○ Creates a new thread (counted as a process by ulimit)
    ■ For every new connection to the database
    ■ Plus various background tasks / threads
  ○ Creates an open file for each active data file on disk
    ■ 64,000 open files and 64,000 max processes is a good start
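A sketch of those starting values in a limits.d file, assuming the service user is 'mongod' (the file name is arbitrary):

```
# /etc/security/limits.d/99-mongod.conf
mongod  soft  nofile  64000
mongod  hard  nofile  64000
mongod  soft  nproc   64000
mongod  hard  nproc   64000
```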
62
Tuning Linux: Ulimit
● Setting ulimits
  ○ An /etc/security/limits.d file
  ○ A systemd service
  ○ An init script
● Ulimits are already set by the Percona and MongoDB packages!
  ○ Example: the PSMDB RPM (systemd)
63
Tuning Linux: Network Stack
● Defaults are not good for > 100mbps Ethernet
● Suggested starting point:
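The slide's settings listing did not survive extraction; a commonly-suggested starting point (these values are illustrative, not from the original slide — tune for your network) looks like:

```
# /etc/sysctl.conf — sketch of network tunings for busy database hosts
net.core.somaxconn = 4096
net.core.netdev_max_backlog = 4096
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
net.ipv4.tcp_fin_timeout = 30
```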
● Set Network Tunings:
  ○ Add the above sysctl tunings to /etc/sysctl.conf
  ○ Run "/sbin/sysctl -p" as root to set the tunings
  ○ Run "/sbin/sysctl -a" to verify the changes
64
Tuning Linux: More on this...
https://www.percona.com/blog/2016/08/12/tuning-linux-for-mongodb/
65
Tuning Linux: “Tuned”
● Tuned
  ○ A "framework" for applying tunings to Linux
    ■ RedHat/CentOS 7 only, for now
    ■ Debian added tuned; not sure if it is compatible yet
    ■ Cannot tune NUMA, filesystem type or fs mount opts
    ■ Can set sysctls, THP, I/O scheduler, etc.
● My apology to the community for writing "Tuning Linux for MongoDB":
  ○ https://github.com/Percona-Lab/tuned-percona-mongodb
Troubleshooting
"The problem with troubleshooting is trouble shoots back" ~ Unknown
67
Troubleshooting: Usual Suspects
● Locking
  ○ Collection-level locks
  ○ Document-level locks
  ○ Software mutexes/semaphores
● Limits
  ○ Max connections
  ○ Operation rate limits
  ○ Resource limits
● Resources
  ○ Lack of IOPS, RAM, CPU, network, etc.
68
Troubleshooting: MongoDB Resources
● Memory
● CPU
  ○ System CPU
    ■ FS cache
    ■ Networking
    ■ Disk I/O
    ■ Threading
  ○ User CPU (MongoDB)
    ■ Compression (WiredTiger and RocksDB)
    ■ Session Management
    ■ BSON (de)serialisation
    ■ Filtering / scanning / sorting
69
Troubleshooting: MongoDB Resources
● User CPU (MongoDB)
  ○ Optimiser
● Disk
  ○ Data file reads/writes
  ○ Journaling
  ○ Error logging
● Network
  ○ Query request/response
  ○ Replication
● Disk I/O
  ○ Journaling
  ○ Oplog reads / writes
  ○ Background flushing / compactions / etc.
70
Troubleshooting: MongoDB Resources
● Disk I/O
  ○ Page faults (data not in cache)
  ○ Swapping
● Network
  ○ Client API
  ○ Replication
  ○ Sharding
    ■ Chunk moves
    ■ mongos -> shards
71
Troubleshooting: db.currentOp()
● A function that dumps status info about running operations, plus various lock/execution details
● Only queries currently in progress are shown
● The provided operation ID number can be used to kill long-running queries
● Includes
  ○ Original query
  ○ Parsed query
  ○ Query runtime
  ○ Locking details
● Filter documents
  ○ { "$ownOps": true } == only show operations for the current user
  ○ https://docs.mongodb.com/manual/reference/method/db.currentOp/#examples
72
Troubleshooting: db.stats()
● Returns
  ○ Document-data size (dataSize)
  ○ Index-data size (indexSize)
  ○ Real storage size (storageSize)
  ○ Average object size
  ○ Number of indexes
  ○ Number of objects
73
Troubleshooting: db.currentOp()
74
Troubleshooting: Log File
● Interesting details are logged to the mongod/mongos log files
  ○ Slow queries
  ○ Storage engine details (sometimes)
  ○ Index operations
  ○ Sharding
    ■ Chunk moves
  ○ Elections / Replication
  ○ Authentication
  ○ Network
    ■ Connections
    ■ Errors
    ■ Client / inter-node connections
75
Troubleshooting: Log File - Slow Query
2017-09-19T20:58:03.896+0200 I COMMAND [conn175] command config.locks appName: "MongoDB Shell" command: findAndModify { findAndModify: "locks", query: { ts: ObjectId('59c168239586572394ae37ba') }, update: { $set: { state: 0 } }, writeConcern: { w: "majority", wtimeout: 15000 }, maxTimeMS: 30000 } planSummary: IXSCAN { ts: 1 } update: { $set: { state: 0 } }
keysExamined:1 docsExamined:1 nMatched:1 nModified:1 keysInserted:1 keysDeleted:1 numYields:0 reslen:604 locks:{ Global: { acquireCount: { r: 2, w: 2 } }, Database: { acquireCount: { w: 2 } }, Collection: { acquireCount: { w: 1 } }, Metadata: { acquireCount: { w: 1 } }, oplog: { acquireCount: { w: 1 } } } protocol:op_command 106ms
76
Troubleshooting: Operation Profiler
● Writes slow database operations to a new MongoDB collection for analysis
  ○ Capped collection "system.profile" in each database, 1MB by default
  ○ The collection is capped, i.e.: profile data doesn't last forever
● Support for operationProfiling data in Percona Monitoring and Management is on the current roadmap
● Enable operationProfiling in "slowOp" mode
  ○ Start with a very high threshold and decrease it in steps
  ○ Usually 50-100 ms is a good threshold
  ○ Enable in mongod.conf:

operationProfiling:
  mode: slowOp
  slowOpThresholdMs: 100
77
Troubleshooting: Operation Profiler
● Useful profile metrics
  ○ op/ns/query: type, namespace and query of a profile
  ○ keysExamined: # of index keys examined
  ○ docsExamined: # of docs examined to achieve the result
  ○ writeConflicts: # of write conflicts encountered during an update
  ○ numYields: # of times the operation yielded for others
  ○ locks: detailed lock statistics
78
Troubleshooting: .explain()
● Shows the query explain plan for query cursors
● This will include
  ○ Winning Plan
    ■ Query stages
      ● Query stages may include sharding info in clusters
    ■ Index chosen by the optimiser
  ○ Rejected plans
79
Troubleshooting: .explain() and Profiler
80
Troubleshooting: Cluster Metadata
● The "config" database on Cluster Config servers
  ○ Use .find() queries to view cluster metadata
● Contains
  ○ actionlog (3.0+)
  ○ changelog
  ○ databases
  ○ collections
  ○ shards
  ○ chunks
  ○ settings
  ○ mongos
  ○ locks
  ○ lockpings
81
Troubleshooting: Percona PMM QAN
● The Query Analytics tool enables DBAs and developers to analyse queries over periods of time and find performance problems
● Helps you optimise database performance by making sure that queries are executed as expected and within the shortest time possible
● A central, web-based location for visualising the data
● Data is collected from the MongoDB Profiler (required) by an agent
● Great for reducing access to systems while providing valuable data to development teams!
● Query Normalisation
  ○ i.e.: "{ item: 123456 }" -> "{ item: ##### }"
● Command-line equivalent: the pt-mongodb-query-digest tool
82
Troubleshooting: Percona PMM QAN
83
Troubleshooting: mlogfilter
● A useful tool for processing mongod.log files
● A log-aware replacement for 'grep', 'awk' and friends
● Generally focus on
  ○ mlogfilter --scan <file>
    ■ Shows all collection-scan queries
  ○ mlogfilter --slow <ms> <file>
    ■ Shows all queries that are slower than X milliseconds
  ○ mlogfilter --op <op-type> <file>
    ■ Shows all queries of operation type X (e.g.: find, aggregate, etc.)
● More on this tool here: https://github.com/rueckstiess/mtools/wiki/mlogfilter
84
Troubleshooting: Common Problems
● Sharding
  ○ removeShard doesn't complete
    ■ Check the 'dbsToMove' array of the removeShard response:

mongos> db.adminCommand({removeShard: "test2"})
{
    "msg" : "draining started successfully",
    "state" : "started",
    "shard" : "test2",
    "note" : "you need to drop or movePrimary these databases",
    "dbsToMove" : [
        "wikipedia"
    ],
    "ok" : 1
}

    ■ Why?

mongos> use config
switched to db config
mongos> db.databases.find()
{ "_id" : "wikipedia", "primary" : "test2", "partitioned" : true }
85
Troubleshooting: Common Problems
● Sharding
  ○ removeShard doesn't complete
    ■ Try:
      ● Use movePrimary to move the database(s) Primary role to other shards
      ● Run the removeShard command once the shard being removed is NOT primary for any database
        ○ This starts the draining of the shard
      ● Run the same removeShard command to check on progress
        ○ If the draining and removal are complete, this will respond with success
  ○ Jumbo Chunks
    ■ Will prevent balancing from occurring
    ■ The config.chunks collection document will contain jumbo: true as a key/value pair
    ■ Sharding 'split' commands can be used to reduce the chunk size (sh.splitAt, etc.)
    ■ https://www.percona.com/blog/2016/04/11/dealing-with-jumbo-chunks-in-mongodb/
Schema Design & Workflow
"The problem with troubleshooting is trouble shoots back" ~ Unknown
87
Schema Design: Data Types
● Strings
  ○ Only use strings if required
  ○ Do not store numbers as strings!
  ○ Look for {field: "123456"} instead of {field: 123456}
    ■ "12345678" moved to an integer uses 25% less space
    ■ Range queries on proper integers are more efficient
  ○ Example JavaScript to convert a field in an entire collection:

db.items.find().forEach(function(x) {
    db.items.update(
        { _id: x._id },
        { $set: { itemId: parseInt(x.itemId) } }
    );
});
88
Schema Design: Data Types
● Strings
  ○ Do not store dates as strings!
    ■ The field "2017-08-17 10:00:04 CEST" stored as a date uses 52.5% less space!
  ○ Do not store booleans as strings!
    ■ "true" -> true = 47% less space wasted
● DBRefs
  ○ DBRefs provide pointers to another document
  ○ DBRefs can be cross-collection
● NumberDecimal (3.4+)
  ○ A high-precision decimal type (Decimal128), avoiding floating-point error
89
Schema Design: Indexes
● MongoDB supports BTree, text and geo indexes
● Default behaviour: the collection is locked until indexing completes
● {background: true}
  ○ Runs indexing in the background, avoiding pauses
  ○ Hard to monitor and troubleshoot progress
  ○ Unpredictable performance impact
● Avoid drivers that auto-create indexes
  ○ Use real performance data to make indexing decisions; find out before Production!
● Too many indexes hurts write performance for the entire collection
● Indexes have a forward or backward direction
  ○ Try to cover .sort() with an index, and match its direction!
90
Schema Design: Indexes
● Compound Indexes○ Several fields supported○ Fields can be in forward or backward direction
■ Consider any .sort() query options and match sort direction!○ Composite Keys are read Left -> Right
■ The index can be partially read
■ Left-most field prefixes do not need their own indexes!
■ Given the first index below, the rest are redundant duplicates:
● {username: 1, status: 1, date: 1, count: -1}
● {username: 1, status: 1, date: 1}
● {username: 1, status: 1}
● {username: 1}
● Use db.collection.getIndexes() to view current Indexes
91
Schema Design: Query Efficiency
● Query Efficiency Ratios
○ Index: keysExamined / nreturned
○ Document: docsExamined / nreturned
● End goal: examine only as many index keys/docs as you return (a ratio of 1)!
○ Example: a query scanning 10 documents to return 1 has a ratio of 10, far from the ideal of 1
○ Tip: when using covered indexes zero documents are fetched (docsExamined: 0)!
92
Schema Workflow
● MongoDB optimised for single-document operations● Single Document / Centralised
○ Great cache/disk-footprint efficiency
○ Centralised schemas may create a hotspot for write locking
● Multi Document / Decentralised○ MongoDB rarely stores data sequentially on disk○ Multi-document operations are less efficient○ Less potential for hotspots/write locking○ Increased overhead due to fan-out of updates○ Example: Social Media status update, graph relationships, etc○ More on this later..
93
Schema Workflow
● Read Heavy Workflow○ Read-heavy apps benefit from pre-computed results○ Consider moving expensive reads computation to insert/update/delete○ Example 1: An app does ‘count’ queries often
■ Move .count() read query to a summary document with counters■ Increment/decrement single count value at write-time
○ Example 2: An app that does groupings of data■ Move .aggregate() read query that is in-line to the user to a backend summary worker■ Read from a summary collection, like a view
● Write Heavy Workflow○ Reduce indexing as much as possible○ Consider batching or a decentralised model with lazy updating (eg: social media
graph)
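The pre-computed counter pattern in Example 1 can be simulated in a few lines. Plain JavaScript objects stand in for an orders collection and a summary document (all names are illustrative); in MongoDB the write-time step would be a $inc update on the summary document:

```javascript
// Maintain a summary counter at write time instead of running a
// count() query at read time.
const orders = [];              // stands in for an "orders" collection
const summary = { pending: 0 }; // stands in for a summary document

function insertOrder(order) {
  orders.push(order);
  // Write path pays one extra $inc-style update...
  if (order.status === "pending") summary.pending += 1;
}

insertOrder({ _id: 1, status: "pending" });
insertOrder({ _id: 2, status: "shipped" });
insertOrder({ _id: 3, status: "pending" });

// ...and the read path becomes an O(1) document fetch:
console.log(summary.pending); // 2
```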
94
Schema Workflow
● Batching Inserts/Updates
○ Requires fewer network round-trips
○ Allows the server to do some internal batching
○ Individual operations will be slower overall
○ Suited for queue-worker scenarios batching many changes
○ Traditional user-facing database traffic should aim to operate on a single (or few) document(s)
● Thread-per-connection model○ 1 x DB operation = 1 x CPU core only○ Executing Parallel Reads
■ Large batch queries benefit from several parallel sessions■ Break query range or conditions into several client->server threads■ Not recommended for Primary nodes or Secondaries with heavy reads
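Splitting a large query into parallel client->server sessions usually means partitioning a key range. A sketch of the range-splitting step (numeric bounds are an illustrative assumption; each sub-range would be issued as its own .find() in its own session):

```javascript
// Split a numeric key range into N sub-ranges for parallel sessions.
// Ranges are [inclusive, exclusive) and cover [min, max) exactly.
function splitRange(min, max, parts) {
  const step = Math.ceil((max - min) / parts);
  const ranges = [];
  for (let lo = min; lo < max; lo += step) {
    ranges.push([lo, Math.min(lo + step, max)]);
  }
  return ranges;
}

console.log(splitRange(0, 100, 4));
// [[0,25],[25,50],[50,75],[75,100]] -- one query per client thread,
// e.g. {_id: {$gte: lo, $lt: hi}}
```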
95
Schema Workflow
● No list of fields specified in .find()○ MongoDB returns entire documents unless fields are specified○ Only return the fields required for an application operation!○ Covered-index operations require only the index fields to be specified
● Using $where operators○ This executes JavaScript with a global lock
● Many $and or $or conditions
○ MongoDB (like most databases) doesn’t handle large lists of $and or $or efficiently
○ Try to avoid this sort of model with
■ Data locality■ Background Summaries / Views
96
Fan Out / Fan In
● Fan-Out Systems○ Decentralised○ Data is eventually written in many locations○ Complex write path (several updates)
■ Good use-case for Queue/Worker model■ Batching possible
○ Simple read path (data locality)● Fan-In
○ Centralised○ Simple Write path
■ Possible Write locking○ Complex Read Path
■ Potential for latency due to network
Data Integrity“The problem with troubleshooting is trouble shoots back” ~ Unknown
98
Data Integrity: `whoami` (continued)
● Very Paranoid
● Previous RDBMS-backed work
○ Online Marketing / Publishing
■ Paid for clicks coming in
■ Downtime = loss of revenue and of (paid-for) traffic
○ Warehousing / Pricing SaaS
■ Stores real items in warehouses/stores/etc
■ Downtime = many businesses (customers)/warehouses/etc at a stand-still
■ Integrity problems =
● Orders shipped but not paid for
● Orders paid for but not shipped, etc
○ Moved on to Gaming, Percona
● So why MongoDB?
2010
99
Data Integrity: Storage and Journaling
● The Journal provides durability in the event of failure of the server
● Changes are written ahead to the journal for each write operation
● On crash recovery, the server○ Finds the last point of consistency to disk○ Searches the journal file(s) for the record matching the checkpoint○ Applies all changes in the journal since the last point of consistency
● Journal data is stored in the ‘journal’ subdirectory of the server data path (dbPath)
● Dedicated disks for data (random I/O) and journal (sequential I/O) improve performance
100
Data Integrity: Write Concern
● MongoDB Replication is Asynchronous● Write Concerns
○ Allow control of data integrity of a write to a Replica Set○ Write Concern Modes
■ “w: <num>” - Writes must be acknowledged by the defined number of nodes
■ “majority” - Writes must be acknowledged by a majority of nodes
■ “<replica set tag>” - Writes must be acknowledged by a member with the specified replica set tags
○ Durability
■ By default write concerns are NOT durable
■ “j: true” - Optionally, wait for node(s) to acknowledge journaling of the operation
■ In 3.4+ “writeConcernMajorityJournalDefault” allows enforcement of “j: true” via replica set configuration!
● Must specify “j: false” or alter “writeConcernMajorityJournalDefault” to disable
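Per-operation write concern is passed as an option in the shell. A sketch, to be run in the mongo shell against a replica set (the collection name and timeout are illustrative):

```javascript
// Wait for a majority of nodes to acknowledge AND journal the write,
// giving up after 5 seconds ("orders" is a hypothetical collection).
db.orders.insert(
  { _id: 1, status: "pending" },
  { writeConcern: { w: "majority", j: true, wtimeout: 5000 } }
);
```

The wtimeout bound matters: without it, a write waiting on unreachable members can block indefinitely.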
101
Data Integrity: Replica Set Rollbacks
● Consider this when using “w:1” Write Concern○ A PRIMARY writes 10 documents with w:1 Write Concern to the oplog, then dies○ SECONDARY (2x) nodes applied 5 and 7 of the changes written○ The SECONDARY with 7 changes wins PRIMARY election○ The PRIMARY that died comes back alive○ The old-PRIMARY node becomes RECOVERING then SECONDARY○ 3 documents are “rolled back” to disk
■ A BSON file is written to the ‘rollback’ dir on-disk when a PRIMARY crashes while ahead of its SECONDARYs
■ Monitor for this file existing on disk!!
● Risk: the application and/or end-user thinks this data was written!
● Majority Write Concern and a correct Read Concern can avoid this!
102
Data Integrity: Read Concern
● New feature in MongoDB and PSMDB 3.2+● Like write concerns, the consistency of reads can be
tuned per session or operation● Levels
○ “local” - Default, return the current node’s most-recent version of the data
○ “majority” - Reads return the most-recent version of the data that has been ack’d on a majority of nodes. Not supported on MMAPv1.
○ “linearizable” (3.4+) - Reads return data that reflects a “majority” read of all changes prior to the read
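Like write concern, read concern is set per operation. A mongo-shell sketch against a replica set (collection and filter are illustrative):

```javascript
// Read only the majority-committed version of the data (3.2+,
// not supported on MMAPv1); "orders" is a hypothetical collection.
db.orders.find({ status: "pending" }).readConcern("majority");
```

Pairing this with a "majority" write concern is what closes the rollback window described earlier.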
103
Data Integrity: Replication
● Oplog○ Ordered by time○ Written to locally after apply-time of
■ Client API change■ Or replication change
○ A crashed node will resume replication using last position from local oplog○ Size of Oplog
■ Monitor this closely!■ The length of time from start to end of the oplog affects the impact of adding new nodes■ If a node is brought online with a backup within the window it avoids a full sync■ If a node is brought online with a backup older than the window it will full sync!!!
● Lag
○ Because replication is asynchronous, lag is possible
○ Use of Read Concerns and/or Write Concerns can work around this!
104
Data Integrity: Replication
● Replset as Data Redundancy (use at own risk)○ Lots of Replset Members○ Read and Write Concern○ Proper Geolocation/Node Redundancy○ Cheaper, non-redundant storage becomes possible
■ JBOD■ RAID0■ InMemory (faster)
105
Data Integrity: Disaster Recovery
● Storing data in many physical locations provides improved recovery options and integrity
● Hidden Secondaries○ Store a Hidden Secondary in another location
● Upload completed backups to another location● Use a geo-distributed architecture
○ Sharding: Look into Tag-aware Sharding○ Replica Set: Multi-locations of members
Scaling Reads“The problem with troubleshooting is trouble shoots back” ~ Unknown
107
Scaling Reads: Schema and Summaries
● Correct data types○ “true” vs true○ “123456” vs 123456○ "2017-09-19T16:50:58.347Z" vs ISODate("2017-09-19T16:50:58.347Z")
● Predictive Workflow○ Read-heavy apps benefit from pre-computed results○ Consider moving expensive reads computation to insert/update/delete
■ Example: move .count() read query to a summary-document, increment/decrement summary count at write-time
● External Caching○ MongoDB is fast, memcache is even faster (although very simple)
108
Scaling Reads: Read Preference
● Modes○ primary (default)○ primaryPreferred○ secondary○ secondaryPreferred (recommended for Read Scaling!)○ nearest
● Tags○ Select nodes based on key/value pairs (one or more)○ Often used for
■ Datacenter awareness, eg: { “dc”: “eu-east” }■ Specific workflows, eg: Analytics, BI, Batch summaries, Backups
109
Scaling Reads: Secondary Members
● Linear* scale-out of reads, utilization of all nodes!● Be aware
○ rs.add() new Replicas with no data triggers initial sync from Primary!■ This can also happen if the backup is too old
○ 50 member maximum, Primary included● Adding a Replica
○ Ensure configuration replSetName and key files match○ Logical Restore
■ mongorestore a mongodump-based backup (containing oplog)■ Use bsondump to find the last document in the “oplog.bson” file■ Create oplog and insert the last “oplog.bson” document
use local
db.runCommand( { create: "oplog.rs", capped: true, size: (20 * 1024 * 1024 * 1024) } );
db.oplog.rs.insert(<last doc>);
110
Scaling Reads: Secondary Members
● Adding a Replica○ Logical Restore
■ Create oplog and insert the last “oplog.bson” document■ Gather the “system.replset” document from the “local” database of an existing node
use local
db.system.replset.find()
{ … }
■ Insert the “system.replset” document into the “local” database of the new node
use local
db.system.replset.insert(< document >);
■ On the Primary, add the new node with rs.reconfig() or rs.add()● Adding the node as “hidden: true” during recovery causes less visible lag
Scaling Writes (Sharding)“The problem with troubleshooting is trouble shoots back” ~ Unknown
112
Sharding: Components
● Sharding Routers○ ‘mongos’ binary is the router○ Applications connect to the Sharding Router○ Abstraction layer to driver
● Config Metadata
○ Read by the Sharding Routers (‘mongos’)
○ Config Servers
■ 1+ ‘mongod’ servers storing the metadata
■ 3.4+ config servers are required to be a Replica Set
● Shards
○ 1+ ‘mongod’ instances that store an exclusive piece of the cluster data
○ Often a Replica Set
■ 3 x members or more recommended
○ Standalone nodes are supported (misconception)
113
Sharding: Config Metadata
● Metadata Collections○ databases: Sharding-enabled databases○ collections: Sharding-enabled collections
■ Shard Key■ Shard Primary
○ chunks: List of collection chunks■ Chunk range of shard key■ Mapping of responsible shard
○ shards: List of shards in cluster○ actionlog: Log of performed actions○ locks: Distributed Locks○ lockpings: Pings to Distributed Locks○ mongos: List of mongos instances (online and offline, check ‘ping’)
114
Sharding: Shard Key
● The shard key determines the distribution of the collection’s documents among the cluster’s shards.
● The shard key is either an indexed field or indexed compound fields that exists in every document in the collection.
● MongoDB partitions data in the collection using ranges of shard key values.
● Each range defines a non-overlapping range of shard key values and is associated with a chunk.
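The range-to-chunk mapping can be sketched in a few lines. Values and shard names below are illustrative; in a real cluster this mapping lives in the config.chunks collection and is consulted by mongos:

```javascript
// Conceptual chunk map: each chunk owns a [min, max) slice of the
// shard key range and is assigned to exactly one shard.
const chunks = [
  { min: 1,  max: 50,  shard: "shard0" },
  { min: 50, max: 101, shard: "shard1" },
];

function shardForKey(key) {
  const chunk = chunks.find((c) => key >= c.min && key < c.max);
  return chunk ? chunk.shard : null; // null: key outside all ranges
}

console.log(shardForKey(10)); // shard0
console.log(shardForKey(50)); // shard1: lower bound inclusive, upper exclusive
```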
115
Sharding: Chunks
● A contiguous range of shard key values within a particular shard● Chunk ranges are inclusive of the lower boundary and exclusive of the
upper boundary (minKey and maxKey)● MongoDB splits chunks when they grow beyond the configured chunk
size, which by default is 64 megabytes.● MongoDB migrates chunks when a shard contains too many chunks
116
Sharding: Balancer
● The balancer is a background process that monitors the number of chunks on each shard.
● When the number of chunks on a given shard reaches the migration thresholds, the balancer attempts to automatically migrate chunks between shards to reach an equal number of chunks per shard.
● The balancer does not move any data itself! The data moves directly from shard to shard.
● Serial operation in pre-3.4, multiple migrations possible in 3.4+
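The balancer's trigger condition boils down to the spread in chunk counts. A sketch of that decision (the threshold of 8 here matches the documented value for large collections; smaller collections use lower thresholds, so treat the default as illustrative):

```javascript
// Migrate when the chunk-count gap between the most- and least-loaded
// shards reaches the migration threshold.
function needsBalancing(chunkCounts, threshold = 8) {
  return Math.max(...chunkCounts) - Math.min(...chunkCounts) >= threshold;
}

console.log(needsBalancing([100, 100, 104])); // false: spread of 4
console.log(needsBalancing([120, 100, 100])); // true: spread of 20
```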
117
Sharding: Deployment Methods
● Small Deployment / Getting Started○ Useful for app that will scale up eventually○ Start with 1 shard and a good shard key○ Add replica set members and/or shards as the system scales
● Chunk Pre-Splitting○ Chunks that grow larger than 64mb are split by the shard Primary
■ Chunk splits impact write performance!○ Pre-creating an even distribution of chunks spanning the shard key range improves
write performance greatly!○ Example: A shard key with a possible value of 1-100 pre-split on 10 shards will
result in 10 pre-split chunks per shard
○ More on this topic:
https://docs.mongodb.com/manual/tutorial/create-chunks-in-sharded-cluster/
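The 1-100 example above can be pre-split with a short loop in the mongo shell, run via a mongos before loading data (the namespace and shard-key field are hypothetical):

```javascript
// Pre-split an empty sharded collection whose numeric shard key spans
// 1-100 into ten chunks ("mydb.items" and "itemId" are illustrative).
for (var i = 10; i < 100; i += 10) {
  sh.splitAt("mydb.items", { itemId: i });
}
// Chunks can then be distributed before the data load, e.g.:
// sh.moveChunk("mydb.items", { itemId: 10 }, "shard0001");
```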
118
Sharding: Tag Aware Sharding
● In sharded clusters, you can tag specific ranges of the shard key and associate those tags with a shard or subset of shards.
● MongoDB routes reads and writes that fall into a tagged range only to those shards configured for the tag.
● Use Cases○ Isolate a subset of data on a specific set of shards.○ Ensure that the most relevant data reside on
shards that are geographically closest to the application servers.
○ Route data based on Hardware / Performance
119
Sharding: Multi-Datacenter
● Application Usage○ Use “nearest” Read Preference to route reads
to closest servers○ Use shard tag to route ops to correct shard(s)
● Mongos / Routers○ At least 2 x per datacenter or running locally to
application servers● Config Servers
○ Replica Set based (CSRS) set required
○ At least 2 x Config Servers per Datacenter
○ 50 max members in a Config Server set
○ Not all members need votes (7 voting-member limit)
120
Sharding: Multi-Datacenter
Coming in 3.6...
122
MongoDB 3.6
● Transactions○ Previously did not exist in MongoDB except TokuMX
● Better Default Security○ bindIp set to localhost○ Addition of source IP restrictions
● Better concurrency on metadata refreshing to remove slowdowns
● Metadata refresh can now retry 10 times, not just 3
● New cluster-based logical clock vs system-based clocks
○ Watch this feature closely!○ NTP may no longer be necessary
● RepairDB removed
● Move Chunk command reports statistics
123
MongoDB 3.6: More on this...
Speaker:David Murphy
Location:Goldsmith 2
Date:Tuesday,
17:55-18:20
124
Questions?