44
ADVANCED POSTGRES MONITORING DENISH PATEL

Advanced Postgres Monitoring

Embed Size (px)

Citation preview

ADVANCED POSTGRES MONITORING

DENISH PATEL

SPEAKER

WHO IS THIS GUY?

▸ Sr. Database Architect at Medallia

▸ Recent fun employments:

▸ Principal Database Engineer@ WithMe

▸ Lead Database Architect @ OmniTI

▸ Expertise in PostgreSQL , Oracle, MySQL, NoSQL

▸ Contact : [email protected] or [email protected]

▸ Twitter: @DenishPatel

▸ Blog: http://www.pateldenish.com

▸ Postgres Slack Channel (https://postgres-slack.herokuapp.com/)

2

AGENDA

DISCUSSION LIST

▸ What to look for monitoring solution in general?

▸ Comparison - selected open source and commercial monitoring solutions

▸ Which metrics to collect and how?

▸ Which metrics to alert on and how to define thresholds?

▸ How to keep up with monitoring changes ?

▸ How to react on alerts at 3AM?

▸ Open discussion3

SELECTION CRITERIA

WHAT TO LOOK FOR IN MONITORING SOLUTION ?

▸ Blend of system monitoring with Postgres support

▸ Centralized monitoring

▸ Hosted vs On-premise

▸ Security concerns of clients

▸ Alerting and Dashboard/Graphs

▸ Easy installation and configuration

▸ Postgres Support

▸ pg_stat_statements

▸ Resource monitoring - CPU, RAM, DISK IO & Network

▸ pgbouncer support

4

COMPARISON

MONITORING SOLUTIONS

▸ Open Source

▸ Sensu

▸ Zabbix

▸ Zenoss (Limited capabilities)

▸ Nagios

▸ (Stop using Nagios so it can die peacefully!!)

▸ SAAS Offerings

▸ Wavefront

▸ Circonus

▸ Vividcortex

▸ OkMeter

▸ NewRelic5

COMPARISON

OPEN SOURCE SOLUTIONS

6

Postgres support?

Configuration Reco Confidence

Sensu Yescheck_postgres

Easy Graphite

HIGH

Zabbix Yes Plugin Easy

MED

Zenoss Yes Plugin Easy

MED

Nagios Yes check_postgres Difficult

LOW

COMPARISON

SAAS OFFERINGS

7

Postgres support Configuration Confidence

Wavefront Collectd

Yes collectd plugins HIGH

Circonus Yes Default checks HIGH

Vividcortex Yes Default checks HIGH

Okmeter Yes One click install pgbouncer

HIGH

New Relic Yes Plugins - missing some metrics

MEDIUM

COMPARISON

SAAS OFFERINGS

▸ Capacity Planning

▸ Real Time Analytics

▸ Anomaly Detection

▸ Data Retention

▸ Support Reviews

▸ Pricing

8

WAVEFRONT.COM

WAVEFRONT

▸ Nice Dashboard and alerting functionality

▸ Very scalable solution

▸ Works with existing metrics collection tools i.e collectd

▸ Real time analytics capability

▸ Complete monitoring suite

9

OKMETER.IO

OKMETER

▸ It is agent based system so you just need to install agent in your environment to monitor application, database or any other servers

▸ Very easy to install and configure

▸ Provides easy to configure Postgres Server monitoring using pg_stat_statements with server stats. Once you install agent, you get everything without any effort

▸ Built-in pgbouncer monitoring

▸ Built-in all resources monitoring ; Disk, CPU, Network & Memory10

OKMETER.IO

PGBOUNCER STATS

11

USE CASE

MONITORING SOLUTION

▸ 150+ DB clusters across the globe

▸ Easy installation

▸ Standardization

▸ Centralized solution

▸ Real time analytics

▸ Support new Infra - Docker/Aurora/Mesos

12

METRICS COLLECTION

SETUP ROLE

13

create role collectd login encrypted password 'XXX'; create schema collectd; set search_path = collectd,pg_catalog; grant usage on schema collectd to collectd; alter role collectd set search_path = collectd,pg_catalog;

▸ Things to consider:

▸ Separate role for monitoring

▸ No SUPER ROLE

▸ Limited permissions

METRICS COLLECTION

COLLECTD PLUGIN

14

LoadPlugin postgresql # https://github.com/collectd/collectd/blob/master/src/postgresql_default.conf <Plugin postgresql> <Database dba> Host "localhost" Port "5432" User "collectd" Query backends Query transactions Query queries Query table_states Query disk_io Query disk_usage Query query_plans Query connections #custom Query slow_queries #custom Query txn_wraparound #custom Query locks #custom Query wal_files #custom Query scans #custom Query seq_scans #custom Query avg_querytime #custom Query checkpoints #custom Query slave_lag #custom </Database> </Plugin>

METRICS COLLECTION

BACKENDS

15

Query backends> Statement "SELECT count(*) AS count \ FROM pg_stat_activity \ WHERE datname = $1;"

Param database

<Result> Type "pg_numbackends" ValuesFrom "count" </Result> </Query>

METRICS COLLECTION

PG_STAT_ACTIVITY

16

create or replace function pg_stat_activity()

returns set of pg_catalog.pg_stat_activity

as $$

begin

return query(select * from pg_catalog.pg_stat_activity);

end $$

language plpgsql security definer;

revoke all on function pg_stat_activity() from public;

grant execute on function pg_stat_activity() to collectd;

METRICS COLLECTION

TRANSACTIONS

17

<Query transactions> Statement "SELECT xact_commit, xact_rollback \ FROM pg_stat_database \ WHERE datname = $1;" Param database <Result> Type "pg_xact" InstancePrefix "commit" ValuesFrom "xact_commit" </Result> <Result> Type "pg_xact" InstancePrefix "rollback" ValuesFrom "xact_rollback" </Result> </Query>

METRICS COLLECTION

QUERIES (DML)

18

<Query queries> Statement "SELECT sum(n_tup_ins) AS ins, \ sum(n_tup_upd) AS upd, \ sum(n_tup_del) AS del, \ sum(n_tup_hot_upd) AS hot_upd \ FROM pg_stat_user_tables;" <Result> Type "pg_n_tup_c" InstancePrefix "ins" ValuesFrom "ins" </Result> .. . . . </Query>

METRICS COLLECTION

TABLE_STATES

19

<Query table_states> Statement "SELECT sum(n_live_tup) AS live, sum(n_dead_tup) AS dead \ FROM pg_stat_user_tables;"

<Result> Type "pg_n_tup_g" InstancePrefix "live" ValuesFrom "live" </Result> <Result> Type "pg_n_tup_g" InstancePrefix "dead" ValuesFrom "dead" </Result>

</Query>

METRICS COLLECTION

QUERY_PLANS

20

<Query query_plans> Statement "SELECT sum(seq_scan) AS seq, \ sum(seq_tup_read) AS seq_tup_read, \ sum(idx_scan) AS idx, \ sum(idx_tup_fetch) AS idx_tup_fetch \ FROM pg_stat_user_tables;"

<Result> Type "pg_scan" InstancePrefix "seq" ValuesFrom "seq" . . </Query>

METRICS COLLECTION

DISK_IO

21

<Query disk_io>

Statement "SELECT coalesce(sum(heap_blks_read), 0) AS heap_read, \

coalesce(sum(heap_blks_hit), 0) AS heap_hit, \

coalesce(sum(idx_blks_read), 0) AS idx_read, \

coalesce(sum(idx_blks_hit), 0) AS idx_hit, \

coalesce(sum(toast_blks_read), 0) AS toast_read, \

coalesce(sum(toast_blks_hit), 0) AS toast_hit, \

coalesce(sum(tidx_blks_read), 0) AS tidx_read, \

coalesce(sum(tidx_blks_hit), 0) AS tidx_hit \

FROM pg_statio_user_tables;"

METRICS COLLECTIONS

DISK USAGE / DB SIZE

22

<Query disk_usage>

Statement "SELECT pg_database_size($1) AS size;"

Param database

<Result>

Type pg_db_size

ValuesFrom "size"

</Result>

</Query>

METRICS COLLECTION

CONNECTIONS #CUSTOM

23

<Query connections> Statement "SELECT COUNT(state) AS count, state FROM (SELECT CASE WHEN state = 'idle' THEN 'idle' WHEN state = 'idle in transaction' THEN 'idle_in_transaction' WHEN state = 'active' THEN 'active' ELSE 'unknown' END AS state FROM collectd.pg_stat_activity) state GROUP BY state UNION SELECT COUNT(*) AS count, 'waiting' AS state FROM collectd.pg_stat_activity WHERE waiting ;" <Result> Type "pg_numbackends" InstancePrefix "state" InstancesFrom "state" ValuesFrom "count" </Result> </Query>

CONNECTIONS

24

METRICS COLLECTION

SLOW_QUERIES

25

<Query slow_queries>

Statement "SELECT COUNT(*) AS count FROM collectd.pg_stat_activity WHERE state='active'

AND now()-query_start > '300 seconds'::interval

AND query ~* '^(insert|update|delete|select)' ;"

<Result>

Type "counter"

InstancePrefix "pg_slow_queries"

ValuesFrom "count"

</Result>

</Query>

METRICS COLLECTION

TXN_WRAPAROUND

26

<Query txn_wraparound>

Statement "SELECT age(datfrozenxid) as txn_wrap_age FROM pg_database ;"

<Result>

Type "counter"

InstancePrefix "txn_wraparound"

ValuesFrom "txn_wrap_age"

</Result>

</Query>

METRICS COLLECTION

LOCKS

27

<Query locks>

Statement "SELECT COUNT(mode) AS count, mode FROM pg_locks GROUP BY mode

UNION SELECT COUNT(*) AS count, 'waiting' AS mode FROM pg_locks

WHERE granted is false ;"

<Result>

Type "gauge"

InstancePrefix "pg_locks"

InstancesFrom "mode"

ValuesFrom "count"

</Result>

</Query>

DB LOCKS

28

METRICS COLLECTION

WAL_FILES

29

<Query wal_files>

Statement "SELECT archived_count AS count, failed_count AS failed FROM pg_stat_archiver;" <Result>

Type "gauge"

InstancePrefix "pg_wal_count"

ValuesFrom "count"

</Result>

<Result>

Type "gauge"

InstancePrefix "pg_wal_failed"

ValuesFrom "failed"

</Result> </Query>

METRICS COLLECTION

SCANS

30

<Query scans>

Statement "SELECT sum(idx_scan) as index_scans, sum(seq_scan) as seq_scans,

sum(idx_tup_fetch) as index_tup_fetch, sum(seq_tup_read) as seq_tup_read

FROM pg_stat_all_tables ; "

<Result>

Type "pg_scan"

InstancePrefix "index"

ValuesFrom "index_scans"

</Result>

.

.

</Query>

METRIC COLLECTION

SEQ_SCANS

31

<Query seq_scans>

Statement "SELECT CASE WHEN status='OK' THEN 0 ELSE 1 END AS status

FROM ( SELECT get_seq_scan_on_large_tables AS status

FROM collectd.get_seq_scan_on_large_tables) AS foo;"

<Result>

Type "gauge"

InstancePrefix "pg_seq_scans"

ValuesFrom "status"

</Result>

</Query>

METRICS COLLECTION

SEQ_SCAN_ON_LARGE_TABLES

32

CREATE MATERIALIZED VIEW collectd.seq_scan_on_large_tables AS SELECT relid, schemaname, relname, seq_scan, seq_tup_read , pg_relation_size(relid) as relsize, now() as refreshed_at FROM pg_stat_all_tables WHERE pg_relation_size(relid) > 1073741824 AND schemaname not in ('pg_catalog', 'information_schema') UNION ALL SELECT 0,'0','0','0',0,0,now();

ALTER materialized VIEW collectd.seq_scan_on_large_tables OWNER TO collectd;

METRICS COLLECTION

GET_SEQ_SCAN_ON_LARGE_TABLES

33

CREATE OR REPLACE FUNCTION collectd.get_seq_scan_on_large_tables() RETURNS text AS $$

DECLARE v_matview text; v_refreshed_at timestamptz; v_tables_with_seq_scan text[]; BEGIN SELECT refreshed_at INTO v_refreshed_at FROM collectd.seq_scan_on_large_tables WHERE relid=0; -- refresh MV every 4 hours IF v_refreshed_at < now() - interval '4 hours' and pg_is_in_recovery() is false THEN REFRESH MATERIALIZED VIEW collectd.seq_scan_on_large_tables; END IF;

SELECT ARRAY (SELECT base.relname ||':'|| (current.seq_scan-base.seq_scan) INTO v_tables_with_seq_scan FROM collectd.seq_scan_on_large_tables AS base LEFT JOIN pg_stat_all_tables AS current ON (base.schemaname=base.schemaname AND base.relname=current.relname) WHERE (current.seq_scan-base.seq_scan) > 0 AND ((current.seq_tup_read-base.seq_tup_read)/(current.seq_scan-base.seq_scan)) > 50000 ) AS tables_with_seq_scan;

IF v_tables_with_seq_scan = '{}' THEN RETURN 'OK'; ELSE RETURN 'PROBLEM: Seq scan on table: '|| array_to_string(v_tables_with_seq_scan,'&'); END If; END; $$ LANGUAGE 'plpgsql' SECURITY DEFINER;

METRICS COLLECTION

AVG_QUERYTIME

34

<Query avg_querytime>

Statement "SELECT sum(total_time)/sum(calls) AS avg_querytime FROM collectd.get_stat_statements() ;"

<Result>

Type "gauge"

InstancePrefix "pg_avg_querytime"

ValuesFrom "avg_querytime"

</Result>

</Query>

<Query scans>

METRICS COLLECTION

GET_STAT_STATEMENTS

35

create extension IF NOT EXISTS pg_stat_statements WITH SCHEMA collectd;

alter schema collectd owner to collectd;

CREATE OR REPLACE FUNCTION collectd.get_stat_statements() RETURNS SETOF pg_stat_statements AS

$$

SELECT * FROM pg_stat_statements

WHERE dbid IN (SELECT oid FROM pg_database WHERE datname = current_database());

$$ LANGUAGE sql VOLATILE SECURITY DEFINER;

AVG QUERYTIME

36

OKMETER

37

METRICS COLLECTIONS

CHECKPOINTS

38

<Query checkpoints>

Statement "SELECT (checkpoints_timed + checkpoints_req) AS total_checkpoints

FROM pg_stat_bgwriter ;"

<Result>

Type "counter"

InstancePrefix "pg_checkpoints"

ValuesFrom "total_checkpoints"

</Result>

</Query>

METRICS COLLECTION

SLAVE LAG

39

<Query slave_lag>

Statement "SELECT CASE WHEN pg_is_in_recovery = 'false' THEN 0

ELSE COALESCE(ROUND(EXTRACT(epoch FROM now() pg_last_xact_replay_timestamp())),0) END AS seconds

FROM pg_is_in_recovery();"

<Result>

Type "counter"

InstancePrefix "slave_lag"

ValuesFrom "seconds"

</Result>

</Query>

ALERTING

SETUP ALERTS ON DB METRICS

▸ Uptime

▸ Waiting Connections

▸ # of connections waiting > 5

▸ Slow queries

▸ # of slow queries > 5

▸ Seq scan on large tables

▸ TXN Wraparound

▸ Age Over 1.5B

▸ Disk space usage

▸ 85%?

▸ Slave lag

▸ 5 minutes?

40

MONITORING CHANGES

HOW TO KEEP UP?▸ Design with failover in mind

▸ Keep eyes on new features for monitoring in latest DB or OS version

▸ Postgres 9.5 enhancements

▸ Commit timestamp tracking

▸ SELECT * FROM pg_last_committed_xact();

▸ cluster_name

▸ $ ps -ef | grep checkpointer

▸ postgres 12181 12178 0 11:12 ? 00:00:00 postgres: personnel: checkpointer process

▸ postgres 12207 12204 0 11:12 ? 00:00:00 postgres: reportsdb: checkpointer process

▸ postgres 12233 12230 0 11:12 ? 00:00:00 postgres: management: checkpointer process

▸ A bunch of changes coming in Postgres 9.6

▸ Improve the pg_stat_activity view provides more details about waiting on what resources

▸ Deploy monitoring through config management tools 41

INCIDENT MANAGEMENT

HOW TO BE READY TO HANDLE 3AM CALL?▸ PagerDuty calendar : https://www.pagerduty.com/

▸ Document metrics

▸ URL for the Dashboard

▸ Alert resolution procedure

▸ Clear SLAs (Decision)

▸ Escalation policy

▸ Scenarios

▸ Wait for server to bring backup

▸ Failover

▸ Review alerts before going OnCall

▸ Oncall notification

▸ Think for the worst and document accordingly

▸ What if you are in movie theatre/beach etc.?

▸ What if you can’t jump on the server?

▸ Keep the document up-to-date 42

ANYTHING ELSE

OPEN DISCUSSION

43

KEEP IN TOUCH

THANKS & Q/A

▸ You!

▸ Conference committee

▸ Contact for further Q/A

▸ Twitter: @DenishPatel

▸ pateldenish.com

[email protected]

[email protected]