Why PostgreSQL for Analytics Infrastructure (DW)?

Huy Nguyen
CTO, Cofounder - Holistics.io

Grokking TechTalk - Database Systems
Ho Chi Minh City - Aug 2016

About Me

● Cofounder of Holistics.io
○ Data Reporting (BI) and Infrastructure SaaS
● Cofounder of Grokking Vietnam
○ Building community of world-class engineers in Vietnam
● Previous
○ Growth Team at Facebook (US)
○ Built Data Pipeline at Viki (Singapore)

Background: What is Analytics/DW?

A Typical Web Application

Data-related Business Problems:

• Daily/weekly registered users by platform and country?

• How many video uploads do we have every day?


A Typical Data Pipeline

[Diagram: Operational Data (Live Databases / Production DBs, Event Logs (behavioural), CSVs / Excels / Google Sheets) → Data Warehouse (Analytics Database; steps: Daily Snapshot, Import, Pre-aggregate, Modify / Transform) → Reporting / Analysis (Reporting / BI, Data Science / ML)]

What database should we pick?

Transactional Applications vs Analytics Applications

Ref: http://www.slideshare.net/PGExperts/really-big-elephants-postgresql-dw-15833438 (slide 5)

Transactional Applications

● Many single-row writes
● Current, single data

Queries:

● Generated by user activities; 10 to 1000 users
● < 1s response time
● Short queries

Analytics Applications

● Few large batch imports
● Years of data, many sources

Queries:

● Generated by large reports; 1 to 10 users
● Queries run for hours
● Long queries

Ref: http://www.slideshare.net/PGExperts/really-big-elephants-postgresql-dw-15833438 (slide 8)

Complex Query...

Why start with Postgres?

1. Simple to Get Started

2. Rich Features for Analytics

– Data Pipeline (ETL)

– Data Analysis

3. Scale Up

[Diagram: Data Growth across three stages: (1) Start, (2) Grow, (3) Scale]

1. Simple to Get Started

● Data requests grow gradually as your company grows
● Business users care about results (not backend)

Postgres:

● Free (open-source)
● Easy to set up

→ Need something quick to start, easy to fine-tune along the way


[Diagram: the pipeline split into two parts: Data Pipeline (ETL), from Live Databases into the Analytics Database (Daily Snapshot, Import, Pre-aggregate, Modify / Transform), and Data Analysis, from the Analytics Database into Reporting / BI and Data Science / ML]

2a. Data Pipeline (ETL) & Performance

● Managing Table Data: table partitioning
● Managing Disk Space: tablespace
● Write Performance: unlogged table
● Others: foreign data wrapper, point-in-time recovery

Managing Data Tables

Analytics tables hold lots of data

pageviews (+ 100k records a day)

⇒ Table grows big quickly, difficult to manage!

Solution: Split (partition) into multiple tables

pageviews_2015_06
pageviews_2015_07
pageviews_2015_08
pageviews_2015_09

Problem: Difficult to query data across multiple months

Managing Data Tables: parent table

pageviews_parent (parent table)
    pageviews_2015_06
    pageviews_2015_07
    pageviews_2015_08
    pageviews_2015_09

ALTER TABLE pageviews_2015_09 INHERIT pageviews_parent;

ALTER TABLE pageviews_2015_09 ADD CHECK (date_d >= '2015-09-01' AND date_d < '2015-10-01');
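A minimal sketch of how the parent-table setup might look end to end, using table inheritance. Only `date_d` and the table names come from the slides; the other columns (`user_id`, `url`) and the sample row are hypothetical.

```sql
-- Hypothetical column list: only date_d and the table names come from the slides.
CREATE TABLE pageviews_parent (
    date_d  date NOT NULL,
    user_id integer,
    url     text
);

-- Each monthly child inherits the parent's columns and declares its date range.
CREATE TABLE pageviews_2015_08 (
    CHECK (date_d >= DATE '2015-08-01' AND date_d < DATE '2015-09-01')
) INHERITS (pageviews_parent);

CREATE TABLE pageviews_2015_09 (
    CHECK (date_d >= DATE '2015-09-01' AND date_d < DATE '2015-10-01')
) INHERITS (pageviews_parent);

-- Load data directly into the child that matches the month.
INSERT INTO pageviews_2015_09 (date_d, user_id, url)
VALUES (DATE '2015-09-15', 1, '/videos/42');

-- 'partition' is the default; it lets the planner skip children whose
-- CHECK constraint contradicts the WHERE clause.
SET constraint_exclusion = partition;

-- Querying the parent covers all children. The plan is expected to leave out
-- pageviews_2015_08, because its CHECK range cannot match this filter.
EXPLAIN
SELECT COUNT(*)
FROM pageviews_parent
WHERE date_d >= DATE '2015-09-10' AND date_d < DATE '2015-09-20';
```

With this layout, retiring an old month is a metadata operation (`ALTER TABLE pageviews_2015_08 NO INHERIT pageviews_parent;` or `DROP TABLE`), rather than a large `DELETE` on one big table.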

Managing Disk-spaces

Analytics DB holds lots of data; disk space is limited

● SSD: fast, expensive
● SATA: cheap, slow

Data has different access frequency

● Hot Data
● Warm Data
● Cold Data

Tablespace: defines where your tables are stored on disk

Managing Disk-spaces: tablespace

CREATE TABLESPACE hot_data LOCATION '/disk0/ssd/';
CREATE TABLESPACE warm_data LOCATION '/disk1/sata2/';

-- beginning of the month

CREATE TABLE pageviews_2016_08 (...) TABLESPACE hot_data;
ALTER TABLE pageviews_2016_07 SET TABLESPACE warm_data;

Combining TABLESPACE and PARENT TABLE

pageviews_parent (parent table)
    pageviews_2015_06
    pageviews_2015_07
    pageviews_2015_08
    pageviews_2015_09
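One way the two features could be combined at the start of a month is sketched below. It assumes the tablespaces `hot_data` and `warm_data` have already been created, that `pageviews_parent` has a `date_d` column, and that `pageviews_2016_07` exists; the final catalog query is only a sanity check, not part of the slides.

```sql
-- Assumptions: tablespaces hot_data / warm_data exist, pageviews_parent has a
-- date_d column, and last month's table pageviews_2016_07 already exists.

-- Beginning of August 2016: the new month's partition goes on fast storage.
CREATE TABLE pageviews_2016_08 (
    CHECK (date_d >= DATE '2016-08-01' AND date_d < DATE '2016-09-01')
) INHERITS (pageviews_parent)
  TABLESPACE hot_data;

-- Last month's partition is read less often, so move it to cheaper disks.
ALTER TABLE pageviews_2016_07 SET TABLESPACE warm_data;

-- Sanity check: where does each partition live?
-- (tablespace is NULL for tables in the database's default tablespace)
SELECT tablename, tablespace
FROM pg_tables
WHERE tablename LIKE 'pageviews%'
ORDER BY tablename;
```

`SET TABLESPACE` physically copies the table's files and locks the table while it runs, so it is best scheduled in a quiet window.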


Write Performance: unlogged table

● Transactional Safety: every update is 2 writes:
○ Update data inside table
○ Write WAL (Write Ahead Log)
● UNLOGGED TABLE
○ Skips the WAL
○ Improved write performance
○ Trade-off: contents are truncated after a crash and not replicated to standbys

Analytics tables can be rebuilt from source

CREATE UNLOGGED TABLE daily_summary (...);

INSERT INTO daily_summary …;

http://pgsnaga.blogspot.com/2011/10/data-loading-into-unlogged-tables-and.html
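As an illustration, a rebuildable summary table might be refreshed as sketched below. The `users` source table mirrors the one used later in the slides; the columns of `daily_summary` (`date_d`, `signups`) are hypothetical, since the slides elide them.

```sql
-- Hypothetical source table, mirroring the users table used later in the slides.
CREATE TABLE IF NOT EXISTS users (
    id         int,
    name       varchar,
    gender     varchar(10),
    created_at timestamp
);

-- Rebuild the summary from scratch inside one transaction.
-- No WAL is written for the table's contents because it is UNLOGGED.
BEGIN;
DROP TABLE IF EXISTS daily_summary;
CREATE UNLOGGED TABLE daily_summary AS
SELECT created_at::date AS date_d,   -- hypothetical summary columns
       COUNT(*)         AS signups
FROM users
GROUP BY 1;
COMMIT;

-- relpersistence is 'u' for unlogged tables and 'p' for regular ones.
SELECT relname, relpersistence
FROM pg_class
WHERE relname = 'daily_summary';
```

If the table later needs crash safety, PostgreSQL 9.5 and newer can convert it with `ALTER TABLE daily_summary SET LOGGED;`.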

2b. Data Analysis (writing SQLs)

● Extract / transform
● Aggregate / summarize
● Statistical analysis

2b. Data Analysis with Postgres

● SQL features
○ WITH clause
○ Window functions
○ Aggregation functions
○ Statistical functions
● Data structures
○ JSON / JSONB
○ Arrays
○ PostGIS (geo data)
○ Geometry (point, line, etc)
○ HyperLogLog (extension)
● PL/pgSQL
● Full-text search (n-gram)
● Performance:
○ Parallel queries (PostgreSQL 9.6)
○ Materialized views
○ BRIN index
● Others:
○ DISTINCT ON
○ VALUES
○ generate_series()
○ Support FULL OUTER JOIN
○ Better EXPLAIN
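Three of the smaller features in the list (`VALUES`, `generate_series()` and `DISTINCT ON`) are easy to show in isolation. The sketch below uses made-up sample rows inlined with `VALUES`, so it needs no tables; the column names are hypothetical.

```sql
-- 1) Daily signups with zero-filled gaps.
--    generate_series() produces one row per day, even for days without signups.
WITH signups (user_id, platform, created_at) AS (
    VALUES (1, 'web',     TIMESTAMP '2016-08-01 09:00'),
           (2, 'ios',     TIMESTAMP '2016-08-01 10:30'),
           (3, 'android', TIMESTAMP '2016-08-03 08:15')
),
days AS (
    SELECT d::date AS date_d
    FROM generate_series(TIMESTAMP '2016-08-01',
                         TIMESTAMP '2016-08-03',
                         INTERVAL '1 day') AS d
)
SELECT days.date_d,
       COUNT(s.user_id) AS daily_signups   -- counts 0 when no signup matches
FROM days
LEFT JOIN signups s ON s.created_at::date = days.date_d
GROUP BY days.date_d
ORDER BY days.date_d;

-- 2) First signup per platform.
--    DISTINCT ON keeps the first row of each platform in ORDER BY order.
SELECT DISTINCT ON (platform) platform, user_id, created_at
FROM (VALUES ('web', 1, TIMESTAMP '2016-08-01 09:00'),
             ('web', 4, TIMESTAMP '2016-08-02 11:00'),
             ('ios', 2, TIMESTAMP '2016-08-01 10:30'))
     AS s (platform, user_id, created_at)
ORDER BY platform, created_at;
```

The first query reports 2, 0 and 1 signups for 1, 2 and 3 August; the second keeps user 2 for `ios` and user 1 for `web`.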

CTE - Problem with Nested Queries

SELECT ...
FROM (
    SELECT ...
    FROM t1
    JOIN (SELECT ... FROM ...) a ON (...)
) b
JOIN (SELECT ... FROM ...) c ON (...)

Nested queries are

a) hard to read
b) cannot be reused

CTE - Common Table Expressions (WITH clause)

WITH a AS (
    SELECT ... FROM ...
), b AS (
    SELECT ... FROM t1 JOIN a ON (...)
), c AS (
    SELECT ... FROM ...
)
SELECT ... FROM b JOIN c ON ...

● SQL’s “private methods”
● A WITH view can be referenced multiple times
● Allows chaining instead of nesting
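To make the chaining concrete, here is a small sketch that finds days with above-average signups. The sample rows are invented and inlined as a `users` CTE; `daily` is referenced twice, once by `average` and once by the final query.

```sql
WITH users (id, gender, created_at) AS (
    -- Invented sample rows standing in for a real users table.
    VALUES (1, 'male',   TIMESTAMP '2016-08-01 09:00'),
           (2, 'female', TIMESTAMP '2016-08-01 12:00'),
           (3, 'female', TIMESTAMP '2016-08-02 08:00'),
           (4, 'male',   TIMESTAMP '2016-08-03 10:00'),
           (5, 'male',   TIMESTAMP '2016-08-03 11:00'),
           (6, 'female', TIMESTAMP '2016-08-03 15:00')
),
daily AS (
    -- Step 1: signups per day.
    SELECT created_at::date AS date_d, COUNT(*) AS signups
    FROM users
    GROUP BY 1
),
average AS (
    -- Step 2: reuses "daily" instead of repeating the subquery.
    SELECT AVG(signups) AS avg_signups
    FROM daily
)
-- Step 3: reuses "daily" a second time.
SELECT d.date_d, d.signups, a.avg_signups
FROM daily d
CROSS JOIN average a
WHERE d.signups > a.avg_signups
ORDER BY d.date_d;
```

With these rows the daily counts are 2, 1 and 3, so only 2016-08-03 is above the average of 2. Note that before PostgreSQL 12 each CTE is always computed separately from the outer query, which can matter for performance on large tables.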

CTE (cont.)

● Recursive CTE
● Writeable CTE

-- move data from A to B
WITH deleted_rows AS (
    DELETE FROM a WHERE ...
    RETURNING *
)
INSERT INTO b
SELECT * FROM deleted_rows;
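The slide shows a writeable CTE but no recursive one, so here is a hedged sketch of the latter: walking a parent/child hierarchy. The `categories` data and its column names are invented for illustration.

```sql
WITH RECURSIVE categories (id, parent_id, name) AS (
    -- Invented hierarchy; parent_id is NULL for the root.
    VALUES (1, NULL::int, 'Video'),
           (2, 1,         'Drama'),
           (3, 2,         'Korean Drama'),
           (4, 1,         'Movies')
),
tree AS (
    -- Anchor: start from the root rows.
    SELECT id, name, 1 AS depth, name AS path
    FROM categories
    WHERE parent_id IS NULL

    UNION ALL

    -- Recursive step: attach each child to the rows found so far.
    SELECT c.id, c.name, t.depth + 1, t.path || ' > ' || c.name
    FROM categories c
    JOIN tree t ON c.parent_id = t.id
)
SELECT id, depth, path
FROM tree
ORDER BY path;
```

Each row comes back with its depth and the full path from the root, for example `Video > Drama > Korean Drama` at depth 3.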

Limitation of GROUP BY aggregate

SELECT gender, COUNT(1) AS signups
FROM users
GROUP BY 1

● GROUP BY aggregate: reduces a partition of data into 1 value

What if we want to work through each row of each partition?

Window functions

● Window functions: a moving frame over 1 partition of data
● Examples:
○ Calculate moving average
○ Cumulative sum
○ Ranking by partition
○ …

Example: Cumulative Sum

CREATE TABLE users ( id INT, gender VARCHAR(10), created_at TIMESTAMP);

SELECT created_at::date AS date_d,
       COUNT(1) AS daily_signups,
       SUM(COUNT(1)) OVER (ORDER BY created_at::date) AS cumulative_signups
FROM users U
GROUP BY 1
ORDER BY 1

| date_d     | daily_signups | cumulative_signups |
| 2016-08-01 | 100           | 100                |
| 2016-08-02 | 50            | 150                |
| 2016-08-03 | 80            | 230                |

Example: Partition by gender and rank by signup time

CREATE TABLE users ( id INT, name VARCHAR, gender VARCHAR(10), created_at TIMESTAMP);

SELECT gender, name,
       RANK() OVER (PARTITION BY gender ORDER BY created_at) AS signup_rnk
FROM users U
ORDER BY 1, 3;

| gender | name  | signup_rnk |
| female | Lan   | 1          |
| female | Tuyet | 2          |
| ...    |       |            |
| male   | Hung  | 1          |
| male   | Son   | 2          |
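Moving average is listed among the window-function examples but not shown. A minimal sketch, reusing the daily figures from the cumulative-sum table as inline data; the 2-day window size is an arbitrary choice for illustration.

```sql
WITH daily (date_d, daily_signups) AS (
    -- Daily figures taken from the cumulative-sum example.
    VALUES (DATE '2016-08-01', 100),
           (DATE '2016-08-02', 50),
           (DATE '2016-08-03', 80)
)
SELECT date_d,
       daily_signups,
       -- Frame = previous row and current row, i.e. a 2-day moving average.
       AVG(daily_signups) OVER (
           ORDER BY date_d
           ROWS BETWEEN 1 PRECEDING AND CURRENT ROW
       ) AS moving_avg_2d
FROM daily
ORDER BY date_d;
```

The averages come out as 100, 75 and 65. The frame counts rows, not calendar days, so this assumes one row per day; days with no data would need to be filled in first, for example with `generate_series()`.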


PostgreSQL is well suited for data analysis!


3. Scaling Up

● PostgreSQL downsides:
○ Optimized for transactional applications
○ Single-core execution per query (before the parallel queries of 9.6); row-based storage
● CitusDB Extension
○ Automated data sharding and parallelization
○ Columnar Storage Format (better storage and performance)
● Vertica (HP)
○ Columnar Storage, Parallel Execution
○ Started by Michael Stonebraker (Postgres original author)
● Amazon Redshift
○ Based on ParAccel DB, itself derived from PostgreSQL 8.0.2
○ Columnar Storage & Parallel Executions

Other Proprietary DW Databases (Relational)

● Greenplum
● Teradata
● Infobright
● Google BigQuery
● Aster Data
● Paraccel (Postgres fork)
● Vertica (from Postgres author)
● CitusDB (Postgres extension)
● Amazon Redshift (from Paraccel)
