Bruce Momjian

Postgres Blog


This blog is about my work on the Postgres open source database, and is published on Planet PostgreSQL. PgLife allows monitoring of all Postgres community activity.

Administration Rerouting Server Log Entries

Friday, April 16, 2021

Postgres has a where, when, and what set of options to control database server log messages. One frequent request is to allow log messages to be split into multiple files or sent to multiple servers. Postgres doesn't support this capability internally, but rsyslog, enabled with log_destination=syslog, allows complex filtering of database server log messages. Specifically, rsyslog's complex filters, combined with log_line_prefix, allow almost unlimited control over how log messages are recorded.
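
As a minimal sketch (the facility, prefix, filter text, and file path below are illustrative assumptions, not a recommended configuration), the Postgres side can be set with ALTER SYSTEM, and an rsyslog rule can then route matching messages to their own file:

-- send server log messages to syslog and tag each line with the database name
ALTER SYSTEM SET log_destination = 'syslog';
ALTER SYSTEM SET syslog_facility = 'local0';
ALTER SYSTEM SET log_line_prefix = 'db=%d,user=%u ';
SELECT pg_reload_conf();
-- matching rsyslog rule (rsyslog.conf syntax, illustrative):
--   if $syslogfacility-text == 'local0' and $msg contains 'db=sales' then /var/log/postgres/sales.log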

 


Administration The Power of Synchronous_commit

Wednesday, April 14, 2021

Postgres has a lot of settings — 329 in Postgres 13. Most of these settings control one specific thing, and the defaults are fine for most use cases. Synchronous_commit is one of the settings that isn't simple; it controls perhaps too many things, though there are operational restrictions that prevent it from being split into separate settings.
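
Counting them is easy, though the exact number varies by major version, platform, and which extensions are loaded:

SELECT count(*) FROM pg_settings;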

In fact, this setting is so complex that when I started to research what it controls, I realized our existing documentation was not completely clear on all of its capabilities. Therefore, I rewrote this setting's documentation in October of 2020, and the updated documentation was released in our November 12th minor releases. As you can see from the table added to the docs, synchronous_commit controls not only the local durability of commits, but also standby behavior.

First, let's focus on its local behavior, that is, without considering standby servers. The value off causes commits to return without waiting for their write-ahead log records to be durably written to disk. It does, however, guarantee data consistency, meaning that, in the event of a crash, the transactions it does recover will be consistent with each other. It just doesn't guarantee that all committed transactions will be restored in the event of a crash. This is in contrast to the setting fsync=off, which does not guarantee data consistency, meaning that, after an operating system crash, some transactions might be partial or inconsistent with other records.

The local option waits for the commit to be durably recorded, meaning any crash recovery will restore all committed database changes. All the other options add waiting for standby servers, while synchronous_standby_names controls which or how many standby servers to wait for. The synchronous_commit option remote_write waits for one or more standby servers to have their transaction information durably stored in case of a Postgres crash. Option on (the default) is similar to remote_write but is durable against standby operating system crashes, not just Postgres crashes. Option remote_apply waits for the transaction to be replayed and visible to future read-only transactions on the standby server. (I covered the details of when commits are visible in a previous blog entry.)
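
As a hedged sketch (the standby names are made up), the standby-related behavior only applies once synchronous_standby_names lists at least one standby, and then synchronous_commit chooses how far each commit waits:

-- wait for at least one of two named standbys (illustrative names)
ALTER SYSTEM SET synchronous_standby_names = 'FIRST 1 (standby1, standby2)';
SELECT pg_reload_conf();
-- wait until the commit has been replayed and is visible on the standby
SET synchronous_commit = remote_apply;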

Given the scope of synchronous_commit, you might assume changing it requires a server restart. On the contrary, it can be set to different values in different sessions on the same server, and can even be changed during a session. Yeah, kind of crazy, but it works, and it allows you to control transaction durability and visibility depending on the importance of the transaction. As Postgres covers in its documentation, while Postgres is a fully durable relational database, it does allow fine-grained control over the durability and even standby visibility of transactions.
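
Here is a minimal sketch of that per-session and per-transaction control; the tables and workload are hypothetical:

-- low-importance writes: don't wait for the WAL flush at commit
SET synchronous_commit = off;
INSERT INTO activity_log VALUES (now(), 'page view');  -- hypothetical table

-- critical write: wait for standby replay for just this transaction
BEGIN;
SET LOCAL synchronous_commit = remote_apply;
INSERT INTO payments VALUES (42, 100.00);  -- hypothetical table
COMMIT;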


Administration Shared Memory Sizing

Monday, April 12, 2021

Postgres makes extensive use of operating system shared memory, and I have already written a presentation about it. The Postgres documentation gives specific instructions on how to determine the amount of shared memory allocated, particularly for sizing huge pages.
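
The documentation's method is roughly the following; the process id and output shown here are illustrative:

$ head -1 $PGDATA/postmaster.pid
4170
$ pmap 4170 | awk '/rw-s/ && /zero/ {print $2}'
6490428K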

However, exactly what is inside Postgres shared memory was somewhat of a mystery to end users until Postgres 13 added pg_shmem_allocations, which gives a detailed list of the uses and sizes of allocated shared memory:

SELECT *, pg_size_pretty(allocated_size) FROM pg_shmem_allocations ORDER BY size DESC;
 
                name                 |    off     |    size    | allocated_size | pg_size_pretty 
-------------------------------------+------------+------------+----------------+----------------
 Buffer Blocks                       |   69908224 | 6442450944 |     6442450944 | 6144 MB
 Buffer Descriptors                  |   19576576 |   50331648 |       50331648 | 48 MB
 <anonymous>                         |     (null) |   43655168 |       43655168 | 42 MB
 Buffer IO Locks                     | 6512359168 |   25165824 |       25165824 | 24 MB
 Checkpointer Data                   | 6597411712 |   18874432 |       18874496 | 18 MB
 XLOG Ctl                            |      53888 |   16803472 |       16803584 | 16 MB
 Checkpoint BufferIds                | 6537524992 |   15728640 |       15728640 | 15 MB
 Xact                                |   16857856 |    2116320 |        2116352 | 2067 kB
 (null)                              | 6616370560 |    1921664 |        1921664 | 1877 kB
 Subtrans                            |   19107968 |     267008 |         267008 | 261 kB
 CommitTs                            |   18974208 |     133568 |         133632 | 131 kB
 MultiXactMember                     |   19441792 |     133568 |         133632 | 131 kB
 Serial                              | 6596798464 |     133568 |         133632 | 131 kB
 Backend Activity Buffer             | 6597153664 |     132096 |         132096 | 129 kB
 shmInvalBuffer                      | 6597332864 |      69464 |          69504 | 68 kB
 Notify                              | 6616303744 |      66816 |          66816 | 65 kB
 MultiXactOffset                     |   19374976 |      66816 |          66816 | 65 kB
 Backend Status Array                | 6597082240 |      54696 |          54784 | 54 kB
 Backend SSL Status Buffer           | 6597285760 |      42312 |          42368 | 41 kB
 Shared Buffer Lookup Table          | 6553253632 |      33624 |          33664 | 33 kB
 KnownAssignedXids                   | 6597042560 |      31720 |          31744 | 31 kB
 ProcSignal                          | 6597403392 |       8264 |           8320 | 8320 bytes
 Backend Client Host Name Buffer     | 6597145344 |       8256 |           8320 | 8320 bytes
 Backend Application Name Buffer     | 6597137024 |       8256 |           8320 | 8320 bytes
 KnownAssignedXidsValid              | 6597074304 |       7930 |           7936 | 7936 bytes
 AutoVacuum Data                     | 6616286208 |       5368 |           5376 | 5376 bytes
 Background Worker Data              | 6597328256 |       4496 |           4608 | 4608 bytes
 Fast Path Strong Relation Lock Data | 6594531840 |       4100 |           4224 | 4224 bytes
 PROCLOCK hash                       | 6593836672 |       2904 |           2944 | 2944 bytes
 PREDICATELOCK hash                  | 6594981376 |       2904 |           2944 | 2944 bytes
 SERIALIZABLEXID hash                | 6596446976 |       2904 |           2944 | 2944 bytes
 PREDICATELOCKTARGET hash            | 6594536064 |       2904 |           2944 | 2944 bytes
 LOCK hash                           | 6593141504 |       2904 |           2944 | 2944 bytes
 Async Queue Control                 | 6616301184 |       2492 |           2560 | 2560 bytes
 ReplicationSlot Ctl                 | 6616291584 |       2480 |           2560 | 2560 bytes
 Wal Receiver Ctl                    | 6616295936 |       2248 |           2304 | 2304 bytes
 BTree Vacuum State                  | 6616298880 |       1476 |           1536 | 1536 bytes
 Wal Sender Ctl                      | 6616294784 |       1040 |           1152 | 1152 bytes
 Shared MultiXact State              |   19575424 |       1028 |           1152 | 1152 bytes
 PMSignalState                       | 6597402368 |       1020 |           1024 | 1024 bytes
 Sync Scan Locations List            | 6616300416 |        656 |            768 | 768 bytes
 ReplicationOriginState              | 6616294144 |        568 |            640 | 640 bytes
 Proc Array                          | 6597041920 |        528 |            640 | 640 bytes
 Logical Replication Launcher Data   | 6616298240 |        424 |            512 | 512 bytes
 Control File                        |   16857472 |        296 |            384 | 384 bytes
 Proc Header                         | 6596932224 |        104 |            128 | 128 bytes
 PredXactList                        | 6596241792 |         88 |            128 | 128 bytes
 OldSnapshotControlData              | 6616298752 |         68 |            128 | 128 bytes
 CommitTs shared                     |   19107840 |         32 |            128 | 128 bytes
 Buffer Strategy Status              | 6593141376 |         28 |            128 | 128 bytes
 RWConflictPool                      | 6596505344 |         24 |            128 | 128 bytes
 FinishedSerializableTransactions    | 6596798336 |         16 |            128 | 128 bytes
 Prepared Transaction Table          | 6597328128 |         16 |            128 | 128 bytes
 SerialControlData                   | 6596932096 |         12 |            128 | 128 bytes

The null name is unused shared memory. Summing the allocated_size column yields a number similar to the output of pmap, which is described in the documentation mentioned above. On my system, the pmap total is ~1% higher than the allocated_size total, probably because pmap also counts shared library memory.
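
You can compute that total directly from the view:

SELECT pg_size_pretty(sum(allocated_size)) FROM pg_shmem_allocations;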


Administration Replica Scaling by the Numbers

Friday, April 9, 2021

I have previously blogged about scaling of read-only and read/write workloads. Inspired by this email thread, I want to be more specific about exactly what scaling you can expect by using read-only replicas. First, as the email poster pointed out, read replicas still have to replay write transactions from the primary server. While the replica servers don't need to execute the write queries when using binary replication, they still need to perform the heap/index writes that the primary performs. (Logical replication does add significant cpu overhead.)

Let's look at rough numbers of the load on binary-replication replicas. Suppose the write load on the primary is 30% — that leaves 70% of the i/o for read-only queries, and even more than 70% of the cpu capacity since binary replication replay uses much less cpu than running the actual write queries. If there are three replicas, that means you have 210% of the primary's read-query capacity available on the replicas.

However, if you have a 75% write load on the primary, you only have an extra 25% capacity on each replica, so your scaling ability is much more limited — it would take four replicas just to double the read capacity of the primary. This explains why workloads with high write percentages can only scale read processing effectively by using sharding.
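
As a back-of-the-envelope illustration (ignoring the cpu savings of replay versus running the original write queries), the extra read capacity is simply the replica count times the non-write fraction of the load:

-- illustrative arithmetic only
SELECT 3 * (1 - 0.30) AS extra_capacity_30pct_writes,  -- 2.10, i.e., 210%
       4 * (1 - 0.75) AS extra_capacity_75pct_writes;  -- 1.00, i.e., 100%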


Administration Operating System Choice

Wednesday, April 7, 2021

In the 1990s, most server operating systems were closed-source and produced by a few billion-dollar companies. In recent years, Red Hat Enterprise Linux (rhel) became the default enterprise server operating system, with centos filling the need for installations on smaller platforms and in the cloud. (Centos was sometimes the only operating system used.) With centos moving away from being a stable release that tracks rhel, there is renewed interest in other operating systems.

There are many viable operating systems, and with so many choices, selection can be difficult. When choosing an operating system to host a database like Postgres, there are some important criteria:

  • Does the operating system have a server focus, or desktop/laptop/mobile focus?
  • How easy is it to administer, and does your staff have the appropriate skills?
  • What support options are available? What is the operating system's support life span?
  • Are upgrades easily handled? How often do operating system upgrades require rebooting?
  • Is the operating system easily installed on developer workstations, laptops, and smaller devices?
  • Is there sufficient hardware support? Is performance acceptable?
  • Is cloud deployment easy?

Once you deploy a database like Postgres on an operating system, changing operating systems can require significant testing and disruption, so it is wise to choose the right operating system at the start. While I didn't mention the operating system's license above, the license can often influence the items listed, like easy installation on smaller hardware and in the cloud, and the availability of multiple support providers. Debian, Ubuntu, FreeBSD, Fedora, Suse, Rocky Linux, and others all have different strengths. I think the above bulleted list is a good start in evaluating those options. Fortunately, Postgres is supported on all popular operating systems, so Postgres should not be a limiting factor in choosing one.


Administration Many Upgrade Methods

Monday, April 5, 2021

Databases are often a critical part of enterprise infrastructure, so when and how to upgrade them is a common discussion topic among database administrators, and Postgres is no exception. The Postgres community has a web page about when to upgrade, and the Postgres documentation has a section about upgrading.

Minor upgrades with Postgres are simple. Though three primary major version upgrade methods are listed in the documentation, there are actually eight possible options:

  1. pg_dumpall/restore
  2. pg_dump/restore databases serially
  3. pg_dump/restore databases in parallel
  4. pg_dump/restore in parallel mode and databases serially
  5. pg_dump/restore in parallel mode and databases in parallel
  6. pg_upgrade in copy mode (with optional parallel mode)
  7. pg_upgrade in link mode (with optional parallel mode)
  8. Logical replication

Options 1–5 are based on pg_dump/pg_dumpall, options 6 & 7 use pg_upgrade, and #8 uses logical replication. The big question is which option to choose. Option 1 is the simplest and is ideal for small clusters that can tolerate the downtime of serially dumping and reloading. Options 3–5 use the same method as 1 & 2 but allow parallelism in dump and restore. It can be tricky to figure out which databases to dump/reload in parallel, and how much parallelism to specify for each dump/restore operation — you basically have to figure out how to fully utilize the cpu and i/o resources without overwhelming them and making things slower. I wish there were simple rules people could follow when specifying parallelism, but I have never seen them; it is probably necessary to do performance testing of various options.
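
As a hedged sketch of the parallel-mode options (directory-format dumps are required for parallel dumping; the database name, path, and job counts are illustrative):

pg_dump -Fd -j 4 -f /backups/mydb.dump mydb     # parallel dump of one database
createdb mydb                                   # on the new cluster
pg_restore -j 4 -d mydb /backups/mydb.dump      # parallel restore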

Options 6 & 7 use pg_upgrade, which can rapidly perform upgrades with minimal downtime. Option 6 allows restarting the old server even after the new server is started, but requires copying all of the user heap and index files, which can be slow. Option 7 can be very fast, and its speed depends primarily on how quickly it can create empty user objects — terabyte upgrades can be performed in minutes. It is also the only option that doesn't require twice as much storage space for the upgrade. Options 6 & 7 do not remove bloat that might exist in the old cluster.
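
Here is a hedged sketch of option 7 (the binary and data directory paths are illustrative); note that once the new server has been started after a --link upgrade, the old cluster should no longer be used:

pg_upgrade \
    --old-bindir  /usr/lib/postgresql/12/bin \
    --new-bindir  /usr/lib/postgresql/13/bin \
    --old-datadir /var/lib/postgresql/12/main \
    --new-datadir /var/lib/postgresql/13/main \
    --link --jobs 4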

Option 8 uses logical replication, which requires significant configuration, but the upgrade can be very quick, even faster than pg_upgrade.
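
A minimal sketch of the logical replication approach, assuming the schema has already been loaded on the new server (e.g., via pg_dump --schema-only); the connection string and names are made up:

-- on the old server (requires wal_level = logical)
CREATE PUBLICATION upgrade_pub FOR ALL TABLES;

-- on the new server
CREATE SUBSCRIPTION upgrade_sub
    CONNECTION 'host=oldserver dbname=mydb user=replicator'
    PUBLICATION upgrade_pub;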

Hopefully this blog post has improved your understanding of available options for Postgres major version upgrades. Here are the criteria you need to consider when choosing an upgrade method:

  • Reliability/simplicity: 1 & 6
  • Minimal downtime: 7 & 8
  • Limited storage required: 7
  • Bloat removal: all but 6 & 7