41
Copyright©2018 NTT Corp. All Rights Reserved. Introducing PMDK into PostgreSQL Challenges and implementations towards PMEM-generation elephant Takashi Menjo , Yoshimi Ichiyanagi NTT Software Innovation Center PGCon 2018, Day 1 (May 31, 2018)

Introducing PMDK into PostgreSQL - PGCon · Distro Ubuntu 17.10 Linux kernel 4.16 PMDK 1.4 Filesystem ext4 (DAX available) PostgreSQL base a467832 (master @ Mar 18, 2018) postgresql.conf

  • Upload
    others

  • View
    5

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Introducing PMDK into PostgreSQL - PGCon · Distro Ubuntu 17.10 Linux kernel 4.16 PMDK 1.4 Filesystem ext4 (DAX available) PostgreSQL base a467832 (master @ Mar 18, 2018) postgresql.conf

Copyright©2018 NTT Corp. All Rights Reserved.

Introducing PMDK into PostgreSQLChallenges and implementations towards PMEM-generation

elephant

Takashi Menjo†, Yoshimi Ichiyanagi†

†NTT Software Innovation Center

PGCon 2018, Day 1 (May 31, 2018)

Page 2: Introducing PMDK into PostgreSQL - PGCon · Distro Ubuntu 17.10 Linux kernel 4.16 PMDK 1.4 Filesystem ext4 (DAX available) PostgreSQL base a467832 (master @ Mar 18, 2018) postgresql.conf

2Copyright©2018 NTT Corp. All Rights Reserved.

• Worked on system software

• Distributed block storage (Sheepdog)

• Operating system (Linux)

• First time to dive into PostgreSQL

• Try to refine open-source software by a new storage and a

new library

• Choose PostgreSQL because the NTT group promotes it

My background

Any discussions and comments are welcome :-)

Page 3: Introducing PMDK into PostgreSQL - PGCon · Distro Ubuntu 17.10 Linux kernel 4.16 PMDK 1.4 Filesystem ext4 (DAX available) PostgreSQL base a467832 (master @ Mar 18, 2018) postgresql.conf

3Copyright©2018 NTT Corp. All Rights Reserved.

1. Introduction

• Persistent Memory (PMEM), DAX for files, and PMDK

2. Hacks and evaluation

i. XLOG segment file (Write-Ahead Logging)

ii. Relation segment file (Table, Index, …)

3. Tips related to PMEM

• Programming, benchmark, and operation

4. Conclusion

Overview of my talk

Page 4: Introducing PMDK into PostgreSQL - PGCon · Distro Ubuntu 17.10 Linux kernel 4.16 PMDK 1.4 Filesystem ext4 (DAX available) PostgreSQL base a467832 (master @ Mar 18, 2018) postgresql.conf

4Copyright©2018 NTT Corp. All Rights Reserved.

1. Introduction

• Persistent Memory (PMEM), DAX for files, and PMDK

2. Hacks and evaluation

i. XLOG segment file (Write-Ahead Logging)

ii. Relation segment file (Table, Index, …)

3. Tips related to PMEM

• Programming, benchmark, and operation

4. Conclusion

Overview of my talk

Page 5: Introducing PMDK into PostgreSQL - PGCon · Distro Ubuntu 17.10 Linux kernel 4.16 PMDK 1.4 Filesystem ext4 (DAX available) PostgreSQL base a467832 (master @ Mar 18, 2018) postgresql.conf

5Copyright©2018 NTT Corp. All Rights Reserved.

• Emerging memory-like storage device

• Non-volatile, byte-addressable, and as fast as DRAM

• Several types released or announced

• NVDIMM-N (Micron, HPE, Dell, ...)

• Based on DRAM and NAND flash

• HPE Scalable Persistent Memory

• Intel Persistent Memory (in 2018?)

Persistent Memory (PMEM)

We use this.

Page 6: Introducing PMDK into PostgreSQL - PGCon · Distro Ubuntu 17.10 Linux kernel 4.16 PMDK 1.4 Filesystem ext4 (DAX available) PostgreSQL base a467832 (master @ Mar 18, 2018) postgresql.conf

6Copyright©2018 NTT Corp. All Rights Reserved.

How PMEM is fast

Source: J. Arulraj and A. Pavlo. How to Build a Non-Volatile Memory Database Management System (Table 1). Proc. SIGMOD '17.

Page 7: Introducing PMDK into PostgreSQL - PGCon · Distro Ubuntu 17.10 Linux kernel 4.16 PMDK 1.4 Filesystem ext4 (DAX available) PostgreSQL base a467832 (master @ Mar 18, 2018) postgresql.conf

7Copyright©2018 NTT Corp. All Rights Reserved.

Databases on the way to PMEM

“Non-volatile Memory Logging”

in PGCon 2016 [4]

[1] https://www.snia.org/sites/default/files/PM-Summit/2017/presentations/Tom_Talpey_Persistent_Memory_in_Windows_Server_2016.pdf

[2] https://www.snia.org/sites/default/files/PM-Summit/2017/presentations/Zora_Caklovic_Bringing_Persistent_Memory_Technology_to_SAP_HANA.pdf

[3] https://mariadb.com/about-us/newsroom/press-releases/mariadb-launches-innovation-labs-explore-and-conquer-new-frontiers

[4] https://www.pgcon.org/2016/schedule/attachments/430_Non-volatile_Memory_Logging.pdf

Microsoft SQL Server 2016

utilizes NVDIMM-N for Tail-of-Log [1]

SAP HANA is leveraging

PMEM for Main Store [2]

MariaDB Labs is

collaborating with Intel [3]

Page 8: Introducing PMDK into PostgreSQL - PGCon · Distro Ubuntu 17.10 Linux kernel 4.16 PMDK 1.4 Filesystem ext4 (DAX available) PostgreSQL base a467832 (master @ Mar 18, 2018) postgresql.conf

8Copyright©2018 NTT Corp. All Rights Reserved.

• Hardware support

• BIOS detecting and configuring PMEM

• ACPI 6.0 or later: NFIT

• Asynchronous DRAM Refresh (ADR)

• :

• Software support

• Operating system (device drivers)

• Direct-Access for files (DAX)

• Linux (ext4 and xfs) and Windows (NTFS)

• Persistent Memory Development Kit (PMDK)

• Linux and Windows, on x64

What we need to use PMEM

For NVDIMM, at least.

Page 9: Introducing PMDK into PostgreSQL - PGCon · Distro Ubuntu 17.10 Linux kernel 4.16 PMDK 1.4 Filesystem ext4 (DAX available) PostgreSQL base a467832 (master @ Mar 18, 2018) postgresql.conf

9Copyright©2018 NTT Corp. All Rights Reserved.

DAX, PMDK, and what we did

Memory-

mapped file

PostgreSQL

PMEM

Library

(PMDK)

DAX-enabled

filesystem

Traditional

filesystem

Page cache

PostgreSQL PostgreSQL

User

Kernel

File I/O

APIs

File I/O

APIsLoad/Store

instructions

Context

switch

Context

switch

No context

switch

Page 10: Introducing PMDK into PostgreSQL - PGCon · Distro Ubuntu 17.10 Linux kernel 4.16 PMDK 1.4 Filesystem ext4 (DAX available) PostgreSQL base a467832 (master @ Mar 18, 2018) postgresql.conf

10Copyright©2018 NTT Corp. All Rights Reserved.

DAX, PMDK, and what we did

Memory-

mapped file

PostgreSQL

PMEM

Library

(PMDK)

DAX-enabled

filesystem

Traditional

filesystem

Page cache

PostgreSQL PostgreSQL

User

Kernel

File I/O

APIs

File I/O

APIsLoad/Store

instructions

Context

switch

Context

switch

What we did

No context

switch

Page 11: Introducing PMDK into PostgreSQL - PGCon · Distro Ubuntu 17.10 Linux kernel 4.16 PMDK 1.4 Filesystem ext4 (DAX available) PostgreSQL base a467832 (master @ Mar 18, 2018) postgresql.conf

11Copyright©2018 NTT Corp. All Rights Reserved.

• With DAX only

• Use PMEM faster without change of the application

• With DAX and PMDK

• Improve the performance of I/O-intensive workload

• By reducing context switches and the overhead of API calls

• Micro-benchmark

• DAX+PMDK is 2.5X as fast as

DAX-only

• Try to introduce

PMDK into PostgreSQL

Benefits of DAX and PMDK

(HPE NVDIMM, Linux kernel 4.16, ext4, PMDK 1.4)

2.5

X

Page 12: Introducing PMDK into PostgreSQL - PGCon · Distro Ubuntu 17.10 Linux kernel 4.16 PMDK 1.4 Filesystem ext4 (DAX available) PostgreSQL base a467832 (master @ Mar 18, 2018) postgresql.conf

12Copyright©2018 NTT Corp. All Rights Reserved.

1. Introduction

• Persistent Memory (PMEM), DAX for files, and PMDK

2. Hacks and evaluation

i. XLOG segment file (Write-Ahead Logging)

ii. Relation segment file (Table, Index, …)

3. Tips related to PMEM

• Programming, benchmark, and operation

4. Conclusion

Overview of my talk

Page 13: Introducing PMDK into PostgreSQL - PGCon · Distro Ubuntu 17.10 Linux kernel 4.16 PMDK 1.4 Filesystem ext4 (DAX available) PostgreSQL base a467832 (master @ Mar 18, 2018) postgresql.conf

13Copyright©2018 NTT Corp. All Rights Reserved.

• How to hack

Replace read/write calls with memory copy

Easier way, reasonable for our first step

• Have data structures on DRAM persist on PMEM directly,

similar to in-memory database

• What we hack

i) XLOG segment files

Critical for transaction performance

ii) Relation segment files

Many writes occur during checkpoint

• Other files (CLOG, pg_control, ...)

Approach

Page 14: Introducing PMDK into PostgreSQL - PGCon · Distro Ubuntu 17.10 Linux kernel 4.16 PMDK 1.4 Filesystem ext4 (DAX available) PostgreSQL base a467832 (master @ Mar 18, 2018) postgresql.conf

14Copyright©2018 NTT Corp. All Rights Reserved.

read/write (POSIX) libpmem (PMDK)

Open fd = open(path, ...);

pmem = pmem_map_file(path, len, ...);

Write nbytes = write(fd, buf, count);

pmem_memcpy_nodrain(pmem, buf, count);

Sync ret = fdatasync(fd); pmem_drain();

Read nbytes = read(fd, buf, count);

memcpy // from <string.h>(buf, pmem, count);

Close ret = close(fd); ret = pmem_unmap(pmem, len);

How to hack

Replace read/write calls with memory copy

Page 15: Introducing PMDK into PostgreSQL - PGCon · Distro Ubuntu 17.10 Linux kernel 4.16 PMDK 1.4 Filesystem ext4 (DAX available) PostgreSQL base a467832 (master @ Mar 18, 2018) postgresql.conf

15Copyright©2018 NTT Corp. All Rights Reserved.

• Contains Write-Ahead Log records

• Guarantees durability of updates

• By having the records persist before committing transaction

• Fixed length (16-MiB per file)

• Each file has a monotonically increasing “segment number”

• Critical for transaction performance

• Backend cannot commit a transaction before the commit log

record persists on storage

• A transaction takes less time if the record persists sooner

i) XLOG segment file

Page 16: Introducing PMDK into PostgreSQL - PGCon · Distro Ubuntu 17.10 Linux kernel 4.16 PMDK 1.4 Filesystem ext4 (DAX available) PostgreSQL base a467832 (master @ Mar 18, 2018) postgresql.conf

16Copyright©2018 NTT Corp. All Rights Reserved.

• Memory-map every segment file

• Fixed-length (16-MiB) file is highly compatible with memory-

mapping

• Memory-copy to it from the XLOG buffer

• Patch <backend/access/xlog.c> and so on

• 15 files changed, 847 insertions, 174 deletions

• Available on pgsql-hackers mailing list (search “PMDK”)

i) How we hack XLOG

Page 17: Introducing PMDK into PostgreSQL - PGCon · Distro Ubuntu 17.10 Linux kernel 4.16 PMDK 1.4 Filesystem ext4 (DAX available) PostgreSQL base a467832 (master @ Mar 18, 2018) postgresql.conf

17Copyright©2018 NTT Corp. All Rights Reserved.

i) Evaluation setup

Hardware

CPU E5-2667 v4 x 2 (8 cores per node)

DRAM [Node0/1] 32 GiB each

PMEM (NVDIMM-N) [Node0] 64 GiB (HPE 8GB NVDIMM x 8)

NVMe SSD [Node0] Intel SSD DC P3600 400GB

Optane SSD [Node0] Intel Optane SSD DC 4800X 750GB

Software

Distro Ubuntu 17.10

Linux kernel 4.16

PMDK 1.4

Filesystem ext4 (DAX available)

PostgreSQL base a467832 (master @ Mar 18, 2018)

postgresql.conf

{max,min}_wal_size 20GB

shared_buffers 16085MB

checkpoint_timeout 12min

checkpoint_completion_target 0.7

32 GiB

DRAM

64 GiB

PMEM

32 GiB

DRAM

CPU

CPU

Node0

Node1

Server

Client

NUMA node

NVMe

SSD

Optane

SSD

Page 18: Introducing PMDK into PostgreSQL - PGCon · Distro Ubuntu 17.10 Linux kernel 4.16 PMDK 1.4 Filesystem ext4 (DAX available) PostgreSQL base a467832 (master @ Mar 18, 2018) postgresql.conf

18Copyright©2018 NTT Corp. All Rights Reserved.

• Compare transaction throughput by using pgbench

• pgbench -i -s 200

• pgbench -M prepared -h /tmp -p 5432 -c 32 -j 32 -T 1800

• The checkpoint runs twice during the benchmark

• Run 3 times to find the median value for the result

• Conditions:

i) Evaluation of XLOG hacks

(a) (b) (c) (d) (e)

wal_sync_method fdatasync open_datasync fdatasync open_datasync libpmem (New)

PostgreSQL No hack No hackHacked

with PMDK

FS for XLOG ext4 (No DAX) ext4 (DAX enabled)

Device for XLOG Optane SSD PMEM

PGDATA NVMe SSD / ext4 (No DAX)

Page 19: Introducing PMDK into PostgreSQL - PGCon · Distro Ubuntu 17.10 Linux kernel 4.16 PMDK 1.4 Filesystem ext4 (DAX available) PostgreSQL base a467832 (master @ Mar 18, 2018) postgresql.conf

19Copyright©2018 NTT Corp. All Rights Reserved.

i) Results

Hig

her

is b

ett

er

+3%

Page 20: Introducing PMDK into PostgreSQL - PGCon · Distro Ubuntu 17.10 Linux kernel 4.16 PMDK 1.4 Filesystem ext4 (DAX available) PostgreSQL base a467832 (master @ Mar 18, 2018) postgresql.conf

20Copyright©2018 NTT Corp. All Rights Reserved.

• Improve transaction throughput by 3%

• Roughly the same improvement as Yoshimi reported on

pgsql-hackers

• Seems small in the percentage, but not-so-small in the

absolute value (+1,200 TPS)

• Future work

• Performance profiling

• Searching for a query pattern for which our hack is more

effective

i) Discussion

Abstract:

Page 21: Introducing PMDK into PostgreSQL - PGCon · Distro Ubuntu 17.10 Linux kernel 4.16 PMDK 1.4 Filesystem ext4 (DAX available) PostgreSQL base a467832 (master @ Mar 18, 2018) postgresql.conf

21Copyright©2018 NTT Corp. All Rights Reserved.

1. Introduction

• Persistent Memory (PMEM), DAX for files, and PMDK

2. Hacks and evaluation

i. XLOG segment file (Write-Ahead Logging)

ii. Relation segment file (Table, Index, …)

3. Tips related to PMEM

• Programming, benchmark, and operation

4. Conclusion

Overview of my talk

Page 22: Introducing PMDK into PostgreSQL - PGCon · Distro Ubuntu 17.10 Linux kernel 4.16 PMDK 1.4 Filesystem ext4 (DAX available) PostgreSQL base a467832 (master @ Mar 18, 2018) postgresql.conf

22Copyright©2018 NTT Corp. All Rights Reserved.

• So-called data file (or checkpoint file)

• Table, Index, TOAST, Materialized View, ...

• Variable length up to 1-GiB

• A huge table and so forth consist of multiple segment files

• Critical for checkpoint duration

• Dirty pages on the shared buffer are written back to the

segment files

• A checkpoint takes less time if the pages are written back

sooner

ii) Relation segment file

Page 23: Introducing PMDK into PostgreSQL - PGCon · Distro Ubuntu 17.10 Linux kernel 4.16 PMDK 1.4 Filesystem ext4 (DAX available) PostgreSQL base a467832 (master @ Mar 18, 2018) postgresql.conf

23Copyright©2018 NTT Corp. All Rights Reserved.

• Memory-map only every 1-GiB segment file

• Memory-mapped file cannot extend or shrink

• Remapping the file seems difficult for me to implement

• Memory-copy to it from the shared buffer

• Patch <backend/storage/smgr/md.c> and so on

• 2 files changed, 152 insertions

• Under test, not published yet

ii) How we hack Relation

Page 24: Introducing PMDK into PostgreSQL - PGCon · Distro Ubuntu 17.10 Linux kernel 4.16 PMDK 1.4 Filesystem ext4 (DAX available) PostgreSQL base a467832 (master @ Mar 18, 2018) postgresql.conf

24Copyright©2018 NTT Corp. All Rights Reserved.

ii) Evaluation setup

Hardware

CPU E5-2667 v4 x 2 (8 cores per node)

DRAM [Node0/1] 32 GiB each

PMEM (NVDIMM-N) [Node0] 64 GiB (HPE 8GB NVDIMM x 8)

NVMe SSD [Node0] Intel SSD DC P3600 400GB

Optane SSD [Node0] Intel Optane SSD DC 4800X 750GB

Software

Distro Ubuntu 17.10

Linux kernel 4.16

PMDK 1.4

Filesystem ext4 (DAX available)

PostgreSQL base a467832 (master @ Mar 18, 2018)

postgresql.conf

{max,min}_wal_size 20GB

shared_buffers 16085MB

checkpoint_timeout 1d

checkpoint_completion_target 0.0

32 GiB

DRAM

64 GiB

PMEM

32 GiB

DRAM

CPU

CPU

Node0

Node1

Server

Client

NUMA node

NVMe

SSD

Optane

SSD

<= To complete ckpt ASAP

<= Not to kick ckpt automatically

Page 25: Introducing PMDK into PostgreSQL - PGCon · Distro Ubuntu 17.10 Linux kernel 4.16 PMDK 1.4 Filesystem ext4 (DAX available) PostgreSQL base a467832 (master @ Mar 18, 2018) postgresql.conf

25Copyright©2018 NTT Corp. All Rights Reserved.

• Compare checkpoint duration time as follows:

• Conditions:

ii) Evaluation of Relation hacks

(a) (b) (c)

PostgreSQL No hack No hack Hacked with PMDK

FS for PGDATA ext4 (No DAX) ext4 (DAX enabled)

Device for PGDATA Optane SSD PMEM

Profile by Linux perf? No Yes

Server

pgbench

Rel. seg.

Server

Rel. seg.

-i -s 800

Flush

Server

pgbench

Rel. seg.

Server

psql

Rel. seg.

7.37 GiB

CHECKPOINTRestart A few min.

log_

filename

total=?Shared

buffer

Page 26: Introducing PMDK into PostgreSQL - PGCon · Distro Ubuntu 17.10 Linux kernel 4.16 PMDK 1.4 Filesystem ext4 (DAX available) PostgreSQL base a467832 (master @ Mar 18, 2018) postgresql.conf

26Copyright©2018 NTT Corp. All Rights Reserved.

ii) ResultsLow

er is

bette

r

-30%

Page 27: Introducing PMDK into PostgreSQL - PGCon · Distro Ubuntu 17.10 Linux kernel 4.16 PMDK 1.4 Filesystem ext4 (DAX available) PostgreSQL base a467832 (master @ Mar 18, 2018) postgresql.conf

27Copyright©2018 NTT Corp. All Rights Reserved.

ii) Profiling

-20 points: Overhead of write() (Skyblue)

-7 points: Context switches (Green)

-3 points: Memory copy

(Deep blue and brown)+5 points: LockBufHdr (Orange)

Page 28: Introducing PMDK into PostgreSQL - PGCon · Distro Ubuntu 17.10 Linux kernel 4.16 PMDK 1.4 Filesystem ext4 (DAX available) PostgreSQL base a467832 (master @ Mar 18, 2018) postgresql.conf

28Copyright©2018 NTT Corp. All Rights Reserved.

• Shorten checkpoint duration by 30%

• The server can give its computing resource to the other

purposes

• Reduce the overhead of system calls and the

context switches

• Benefits of using memory-mapped files!

• The time of LockBufHdr became rather longer

• Open issue…

ii) Discussion

Page 29: Introducing PMDK into PostgreSQL - PGCon · Distro Ubuntu 17.10 Linux kernel 4.16 PMDK 1.4 Filesystem ext4 (DAX available) PostgreSQL base a467832 (master @ Mar 18, 2018) postgresql.conf

29Copyright©2018 NTT Corp. All Rights Reserved.

• (i) Improve transaction throughput by 3%

• With 1,000-line hack for WAL

• (ii) Shorten checkpoint duration by 30%

• With 150-line hack for Relation

• We must bring out more potential from PMEM

• Not so bad in an easier way, but far from “2.5X” in micro -

benchmark

• I think another way is to have data structures on DRAM

persist on PMEM directly

Conclusion of evaluation

Page 30: Introducing PMDK into PostgreSQL - PGCon · Distro Ubuntu 17.10 Linux kernel 4.16 PMDK 1.4 Filesystem ext4 (DAX available) PostgreSQL base a467832 (master @ Mar 18, 2018) postgresql.conf

30Copyright©2018 NTT Corp. All Rights Reserved.

1. Introduction

• Persistent Memory (PMEM), DAX for files, and PMDK

2. Hacks and evaluation

i. XLOG segment file (Write-Ahead Logging)

ii. Relation segment file (Table, Index, …)

3. Tips related to PMEM

• Programming, benchmark, and operation

4. Conclusion

Overview of my talk

Page 31: Introducing PMDK into PostgreSQL - PGCon · Distro Ubuntu 17.10 Linux kernel 4.16 PMDK 1.4 Filesystem ext4 (DAX available) PostgreSQL base a467832 (master @ Mar 18, 2018) postgresql.conf

31Copyright©2018 NTT Corp. All Rights Reserved.

• The data should reach nothing but PMEM

• Don’t stop at half, volatile middle layer such as CPU caches

• Or it will be lost when the program or system crashes

• x64 offers two instruction families

• CLFLUSH – Flush data out of CPU

caches to memory

• MOVNT – Store data to memory,

bypassing CPU caches

• PMDK supports both

• pmem_flush

• pmem_memcpy_nodrain

CPU cache flush and cache-bypassing store

CPU

Caches

PMEM

MOVNTCLFLUSH

Page 32: Introducing PMDK into PostgreSQL - PGCon · Distro Ubuntu 17.10 Linux kernel 4.16 PMDK 1.4 Filesystem ext4 (DAX available) PostgreSQL base a467832 (master @ Mar 18, 2018) postgresql.conf

32Copyright©2018 NTT Corp. All Rights Reserved.

• The two are not compatible

• Memory-mapped file cannot be extended while being

mapped

• Neither naive way is perfect

• Remapping a segment file on extend is time-consuming

• Pre-allocating maximum size (1GiB per segment) wastes

PMEM free space

• We must rethink traditional data structure towards

PMEM era

Memory-mapped file and Relation extension

Page 33: Introducing PMDK into PostgreSQL - PGCon · Distro Ubuntu 17.10 Linux kernel 4.16 PMDK 1.4 Filesystem ext4 (DAX available) PostgreSQL base a467832 (master @ Mar 18, 2018) postgresql.conf

33Copyright©2018 NTT Corp. All Rights Reserved.

• Hugepage will improve performance of PMEM

• By reducing page walk and Translation Lookaside Buffer

(TLB) miss

• PMDK on x64 considers hugepage

• By aligning the mapping address on hugepage boundary

(2MiB or 1GiB) when the file is large enough

• Pre-warming page table for PMEM will also make

the performance better

• By reducing page fault on main runtime

Page table and hugepage

Page 34: Introducing PMDK into PostgreSQL - PGCon · Distro Ubuntu 17.10 Linux kernel 4.16 PMDK 1.4 Filesystem ext4 (DAX available) PostgreSQL base a467832 (master @ Mar 18, 2018) postgresql.conf

34Copyright©2018 NTT Corp. All Rights Reserved.

• Critical for stable performance on multi process-

ing system

• Accessing to local memory (DRAM and/or PMEM) is fast

while remote is slow

• This applies to PCIe SSD, but PMEM is more sensitive

• Binding processes to a certain node

• numactl --cpunodebind=X --membind=X

-- pgctl ...

Controlling NUMA effect

Page 35: Introducing PMDK into PostgreSQL - PGCon · Distro Ubuntu 17.10 Linux kernel 4.16 PMDK 1.4 Filesystem ext4 (DAX available) PostgreSQL base a467832 (master @ Mar 18, 2018) postgresql.conf

35Copyright©2018 NTT Corp. All Rights Reserved.

• Something other than storage access could be

hotspot of transaction when we use PMEM

• Such as...

• Concurrency control such as locking

• Redundant internal memory copy

• Pre-processing such as SQL parse

• We fell into this trap and avoided it by prepared statement

New common sense of hotspot

PMEM

accessOthers

Disk accessOthers🔥

🔥

(Conceptual figure)

Page 36: Introducing PMDK into PostgreSQL - PGCon · Distro Ubuntu 17.10 Linux kernel 4.16 PMDK 1.4 Filesystem ext4 (DAX available) PostgreSQL base a467832 (master @ Mar 18, 2018) postgresql.conf

36Copyright©2018 NTT Corp. All Rights Reserved.

1. Introduction

• Persistent Memory (PMEM), DAX for files, and PMDK

2. Hacks and evaluation

i. XLOG segment file (Write-Ahead Logging)

ii. Relation segment file (Table, Index, …)

3. Tips related to PMEM

• Programming, benchmark, and operation

4. Conclusion

Overview of my talk

Page 37: Introducing PMDK into PostgreSQL - PGCon · Distro Ubuntu 17.10 Linux kernel 4.16 PMDK 1.4 Filesystem ext4 (DAX available) PostgreSQL base a467832 (master @ Mar 18, 2018) postgresql.conf

37Copyright©2018 NTT Corp. All Rights Reserved.

• Applied PMDK into PostgreSQL

• In an easier way to use memory-mapped files and memory

copy

• Achieved not-so-bad results

• +3% transaction throughput

• -30% checkpoint duration

• Showed tips related to PMEM

• PMEM will change software design drastically

• We should change software and our mind to bring out

PMEM’s potential much more

• Let’s try PMEM and PMDK :)

Conclusion

Page 38: Introducing PMDK into PostgreSQL - PGCon · Distro Ubuntu 17.10 Linux kernel 4.16 PMDK 1.4 Filesystem ext4 (DAX available) PostgreSQL base a467832 (master @ Mar 18, 2018) postgresql.conf

38Copyright©2018 NTT Corp. All Rights Reserved.

Backup slides

Page 39: Introducing PMDK into PostgreSQL - PGCon · Distro Ubuntu 17.10 Linux kernel 4.16 PMDK 1.4 Filesystem ext4 (DAX available) PostgreSQL base a467832 (master @ Mar 18, 2018) postgresql.conf

39Copyright©2018 NTT Corp. All Rights Reserved.

How to hack

Disk

XLOG

buffer

Shared

buffer

Rel.XLOG

write()

PMEM

XLOG

buffer

Shared

buffer

Rel.XLOG

write()

PMEM

XLOG

buffer

Shared

buffer

Rel.XLOG

Memory

copy

Our HacksCurrent

PostgreSQL

On DAX-enabled

filesystem

Commit Checkpoint

Memory-

mapped

Page 40: Introducing PMDK into PostgreSQL - PGCon · Distro Ubuntu 17.10 Linux kernel 4.16 PMDK 1.4 Filesystem ext4 (DAX available) PostgreSQL base a467832 (master @ Mar 18, 2018) postgresql.conf

40Copyright©2018 NTT Corp. All Rights Reserved.

• Defines behavior between user space and OS

• Here we focus on NVM.PM.FILE mode

SNIA NVM Programming Model

https://www.snia.org/sites/default/files/technical_work/final/NVMProgrammingModel_v1.2.pdf

Page 41: Introducing PMDK into PostgreSQL - PGCon · Distro Ubuntu 17.10 Linux kernel 4.16 PMDK 1.4 Filesystem ext4 (DAX available) PostgreSQL base a467832 (master @ Mar 18, 2018) postgresql.conf

41Copyright©2018 NTT Corp. All Rights Reserved.

read/write memory-mapped file PMDK (libpmem)

Open fd = open(path, flags, mode);

fd = open(path, flags, mode);

pmem = pmem_map_file(path, len, flags, mode,...);

Allocate - // map cannot be extended// so pre-allocate the fileerr = posix_fallocate(fd, 0, len);

Map - pmem = mmap(NULL, len, ..., fd, -1);

(Close) - // accessing file via mapped// address; not fd any moreret = close(fd);

Write nbytes = write(fd, buf, count);

memcpy(pmem, buf, count);

// bypassing cache if able // instead of flushing itpmem_memcpy_nodrain(pmem, buf, count);Flush - for(i=0; i<count; i+=64)

_mm_clflush(pmem[i]);

Sync ret = fdatasync(fd); _mm_sfence(); pmem_drain();

Read nbytes = read(fd, buf, count);

memcpy(buf, pmem, count);

memcpy(buf, pmem, count);

Unmap - ret = munmap(pmem, len); ret = pmem_unmap(pmem, len);

Close ret = close(fd); - -

API walkthrough

Blue: Intel Intrinsics; Red: PMDK (libpmem)