Bruce Momjian

Postgres Blog


This blog is about my work on the Postgres open source database, and is published on Planet PostgreSQL. PgLife allows monitoring of all Postgres community activity.




Performance Hash Aggregation

Wednesday, May 12, 2021

Hash aggregation has always confused me. I know what hash joins are, and have even written about them, and I know what aggregates are. So, what is hash aggregation? Is it a hash join that uses an aggregate function?

Well, no. Hash aggregates allow accumulation/aggregation of duplicate values or removal of duplicate values. There are four ways to perform this aggregation, as listed in AggStrategy:

  1. Plain: simple agg across all input rows
  2. Sorted: grouped agg, input must be sorted
  3. Hashed: grouped agg, use internal hashtable
  4. Mixed: grouped agg, hash and sort both used

The first option does not retain or "pass upward in the executor plan" any individual values. The second option processes the rows in order, allowing easy accumulation or removal of duplicate values. The last two use hash aggregation, that is, accumulation or removal of duplicates using hashing. Hash aggregation is used to perform group by, distinct, and union (without all) processing.

Internally, how is hash aggregation performed? Prior to Postgres 13, hash aggregation could only create hash structures in memory, so was only used when the number of distinct hash values was expected to be small. In Postgres 13, hash aggregates can spill to disk — this allows hash aggregation to be used by many more queries.
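For example (the table and column names here are hypothetical), a group by on an unordered input typically produces a hashed aggregate plan:

```sql
EXPLAIN SELECT customer_id, count(*)
FROM orders
GROUP BY customer_id;

-- abbreviated plan; actual costs and row counts will vary:
--  HashAggregate
--    Group Key: customer_id
--    ->  Seq Scan on orders
```

The planner builds an internal hash table keyed on customer_id, accumulating the count for each key as rows arrive, with no sort required.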

 


Performance Clustering a Table

Monday, May 10, 2021

Having written over 600 blog entries, I thought I would have already covered the complexities of the cluster command, but it seems I have not, so let's do that now.

Cluster is an unusual sql command because, like non-unique create index, it only affects performance. In fact, cluster requires the existence of an index. So, what does cluster do? Well, what does create index do? Let's look at how storage works in Postgres.

User data rows are stored in heap files in the file system, and those rows are stored in an indeterminate order. If the table is initially loaded in insert/copy order, later inserts, updates, and deletes will cause rows to be added in unpredictable order in the heap files. Create index creates a secondary file with entries pointing to heap rows, and index entries are ordered to match the values in the columns specified in the create index command. By quickly finding desired values in the index, index pointers can be followed to quickly find matching heap rows.

For most cases, creating indexes is sufficient to yield acceptable performance. However, there are a few cases where having the index ordered, but the heap unordered, causes major performance problems. This is where cluster becomes useful — it orders the heap to match the ordering of one of its indexes. (Some non-btree indexes can't be clustered because they lack linear ordering.)

How can this heap ordering improve performance? Well, if you are looking for only one row, it doesn't really matter where it is in the heap file — it is only going to take one heap access to retrieve it. However, suppose you are retrieving one-hundred rows matching an indexed column? Well, we can find the one-hundred matching index entries quickly, but what about the one-hundred heap rows? If they are scattered on one-hundred 8kB heap pages, that requires many i/o accesses. But what if the matching heap rows are on adjacent heap pages — that reduces the number of heap pages needed. If those heap pages are all in memory, it might not matter, but if some are on storage, reducing the number of heap accesses can yield significant performance benefits.
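As a rough back-of-the-envelope calculation (the ~260-byte row size is just an assumption):

```
8192 bytes/page ÷ ~260 bytes/row ≈ 31 rows/page
100 adjacent rows  → ⌈100 ÷ 31⌉ = 4 heap page reads
100 scattered rows → up to 100 heap page reads
```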

When does heap ordering help performance? I can think of three cases:

  • Accessing a single indexed value that has many duplicates, e.g., col = 5 where there are many matching rows
  • Accessing a range of values, e.g., col >= 10 AND col < 20
  • Accessing values that are frequently accessed by other sessions, e.g., unpaid invoice rows

For these workloads, ordering the heap can greatly reduce the number of heap accesses.

However, there are two big downsides to using cluster. First, while the cluster command is creating a new heap file ordered to match an index, no other sessions can read or write the table. Second, unlike index-organized tables (which Postgres does not support since they have serious downsides), the heap does not remain clustered — later insert and update operations will place rows in indeterminate order in the heap, causing the heap to become less ordered over time, so a later cluster operation is required to restore ideal ordering. (A non-default heap fill factor can improve update locality.) Fortunately, Postgres remembers previous cluster operations, so running cluster with no arguments re-clusters all tables that have been previously clustered.
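The command forms look like this (the table and index names are hypothetical):

```sql
CLUSTER my_table USING i_my_table;  -- rewrite the heap in index order
CLUSTER my_table;                   -- reuse the index recorded by a previous cluster
CLUSTER;                            -- re-cluster every previously clustered table
```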

Let's see how explain makes use of heap rows being ordered. Actually, this is independent of the cluster command — Postgres maintains an indication of how ordered the heap is based on every column, and potentially expression indexes, not just columns involved in previous cluster operations. Cluster is really just a way of forcing heap ordering, but heap ordering might happen naturally based on the order of operations, and Postgres can take advantage of that.

In the example below, the rows are automatically ordered because of their insert order, and queries on pg_stats and pg_statistic verify the correlation is one:

-- use a second column so index-only scans are not used, and so the row has a typical length
CREATE TABLE public.cluster_test (x INTEGER, y TEXT);
 
CREATE INDEX i_cluster_test ON cluster_test (x);
 
INSERT INTO public.cluster_test
        SELECT *, repeat('x', 250) FROM generate_series(1, 100000);
 
-- compute correlation statistics
ANALYZE cluster_test;
 
-- uses view pg_stats
SELECT correlation
FROM pg_stats
WHERE schemaname = 'public' AND tablename = 'cluster_test' AND attname = 'x';
 correlation
-------------
           1
 
-- uses table pg_statistic, the SQL is from psql's '\d+ pg_stats'
SELECT CASE
        WHEN stakind1 = 3 THEN stanumbers1[1]
        WHEN stakind2 = 3 THEN stanumbers2[1]
        WHEN stakind3 = 3 THEN stanumbers3[1]
        WHEN stakind4 = 3 THEN stanumbers4[1]
        WHEN stakind5 = 3 THEN stanumbers5[1]
        ELSE NULL::real
        END AS correlation
FROM pg_namespace JOIN pg_class ON (pg_namespace.oid = relnamespace)
        JOIN pg_attribute ON (pg_class.oid = pg_attribute.attrelid)
        JOIN pg_statistic ON (pg_class.oid = starelid AND pg_attribute.attnum = staattnum)
WHERE nspname = 'public' AND relname = 'cluster_test' AND attname = 'x';
 correlation
-------------
           1
 
EXPLAIN SELECT * FROM cluster_test WHERE x < 74000;
                                         QUERY PLAN
--------------------------------------------------------------------------------------------
 Index Scan using i_cluster_test on cluster_test  (cost=0.29..4841.80 rows=73743 width=258)
   Index Cond: (x < 74000)
 
EXPLAIN SELECT * FROM cluster_test WHERE x < 75000;
                             QUERY PLAN
---------------------------------------------------------------------
 Seq Scan on cluster_test  (cost=0.00..4954.00 rows=74752 width=258)
   Filter: (x < 75000)

You can see the Postgres optimizer switches from index scan to sequential scan between 74k and 75k row accesses. This next example inserts the rows in random order, which yields a correlation near zero, and a lower row count at which the optimizer stops using an index, i.e., 28k vs. 75k:

-- empty the table, then reload it in random order
TRUNCATE public.cluster_test;
 
INSERT INTO public.cluster_test
        SELECT *, repeat('x', 250) FROM generate_series(1, 100000) ORDER BY random();
 
-- compute correlation statistics
ANALYZE cluster_test;
 
-- uses view pg_stats
SELECT correlation
FROM pg_stats
WHERE schemaname = 'public' AND tablename = 'cluster_test' AND attname = 'x';
  correlation
---------------
 -0.0048559047
 
EXPLAIN SELECT * FROM cluster_test WHERE x < 3;
                                      QUERY PLAN
--------------------------------------------------------------------------------------
 Index Scan using i_cluster_test on cluster_test  (cost=0.42..12.45 rows=2 width=258)
   Index Cond: (x < 3)
(2 rows)
 
EXPLAIN SELECT * FROM cluster_test WHERE x < 4;
                                 QUERY PLAN
-----------------------------------------------------------------------------
 Bitmap Heap Scan on cluster_test  (cost=4.44..16.30 rows=3 width=258)
   Recheck Cond: (x < 4)
   ->  Bitmap Index Scan on i_cluster_test  (cost=0.00..4.44 rows=3 width=0)
         Index Cond: (x < 4)
 
EXPLAIN SELECT * FROM cluster_test WHERE x < 27000;
                                    QUERY PLAN
-----------------------------------------------------------------------------------
 Bitmap Heap Scan on cluster_test  (cost=809.99..8556.02 rows=27042 width=258)
   Recheck Cond: (x < 27000)
   ->  Bitmap Index Scan on i_cluster_test  (cost=0.00..803.23 rows=27042 width=0)
         Index Cond: (x < 27000)
 
EXPLAIN SELECT * FROM cluster_test WHERE x < 28000;
                             QUERY PLAN
---------------------------------------------------------------------
 Seq Scan on cluster_test  (cost=0.00..8658.00 rows=28058 width=258)
   Filter: (x < 28000)

Notice that it switches from an index scan to a bitmap heap scan after only three rows because statistics indicate that the matching rows are scattered randomly in the heap. When the heap ordering closely matches the index ordering, a bitmap heap scan has no advantage over a plain index scan.

Using cluster, we can force the heap to match the index ordering, and again cause the index to be used for a larger number of rows:

CLUSTER cluster_test USING i_cluster_test;
 
ANALYZE cluster_test;
 
EXPLAIN SELECT * FROM cluster_test WHERE x < 74000;
                                         QUERY PLAN
--------------------------------------------------------------------------------------------
 Index Scan using i_cluster_test on cluster_test  (cost=0.29..4836.03 rows=73642 width=258)
   Index Cond: (x < 74000)
 
EXPLAIN SELECT * FROM cluster_test WHERE x < 75000;
                             QUERY PLAN
---------------------------------------------------------------------
 Seq Scan on cluster_test  (cost=0.00..4954.00 rows=74696 width=258)
   Filter: (x < 75000)

Of course, explain only estimates the most efficient way to retrieve data — proper performance testing is required to determine when cluster is helpful.

Time-series data can be difficult to use with cluster. The most recent data is usually the most frequently accessed. If the table has few updates and deletes, new rows are typically appended to the end of the file, giving good correlated ordering which can be detected and exploited by Postgres. However, if there are many updates/deletes, inserted and updated rows could be placed in unused space anywhere in the table, so correlation would be low. In fact, if you previously clustered the table, and you are only accessing recent data, you might get an unrepresentatively high correlation value and inefficient plans because, while most of the table rows are clustered, the new rows, which are the ones most frequently accessed, are not ordered to match an index. Table partitioning can be thought of as a crude form of clustering, which can help with these workloads by improving data locality using time-based partitions. Anyway, hopefully this blog post has given you some tips on when clustering can be useful.
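For example, time-based partitions might look like this (the table names and ranges are illustrative):

```sql
CREATE TABLE events (ts TIMESTAMPTZ, payload TEXT) PARTITION BY RANGE (ts);

CREATE TABLE events_2021_04 PARTITION OF events
        FOR VALUES FROM ('2021-04-01') TO ('2021-05-01');

CREATE TABLE events_2021_05 PARTITION OF events
        FOR VALUES FROM ('2021-05-01') TO ('2021-06-01');
```

Each partition keeps its rows physically together, so queries against recent data touch only the newest partition's heap pages.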


Performance Postgres on Big Iron

Friday, May 7, 2021

Postgres has always run well on commodity hardware, and works well on Raspberry Pi and consumer devices too. However, we have always lagged in optimizing Postgres on very large hardware, mostly because our community has limited access to such hardware. In recent years, Postgres support companies have given the community access to large hardware, and have run performance tests themselves, which has allowed the community to enhance Postgres for such hardware. However, since typical workloads do not use large hardware, there is always a lag in how quickly Postgres can adjust to ever-larger hardware platforms.

This email report from July of 2020 is about a server actively using 5k tables. When there are only a few hundred connections, basically 2–3x the number of cpus, the system works fine, but the system experiences a steep performance decline as thousands of active connections try to issue queries. Whether the thousands of connections happen because of increased user demand or increased slowness due to the number of active connections is unclear. The email thread suggests various lock table and autovacuum improvements that might improve the situation, but there is no clear conclusion. One bright spot is that two weeks after this report, another report complained about the performance of autovacuum in a database with many tables, and supplied a patch which was applied in December and will be in Postgres 14.

Ideally we would have been able to test this patch against the workload reported in the July email to see if it fixes the problem. However, it is difficult to get the patch into production use, or get a test server of suitable size and produce a workload that matches the report. This is why it is much slower to improve the performance of large systems compared to simple workloads on commodity hardware — we make progress, but it is slower.


Object bundles Set of Record

Wednesday, May 5, 2021

I previously covered the use of jsonb_to_record(), which uses record types to cast multiple json values. Record types are effectively special sql rows.

There is an even more complicated sql type called set of record. It is a set of records, or set of rows. While the record type allows you to create a row with multiple columns, set of record allows you to create multiple rows, each having multiple columns. If you do \dfS in psql, you will see many functions defined as accepting or returning record values, and many returning set of record values, particularly database system information functions, e.g., pg_get_keywords().

As an example, the unnest function converts an array to a list of rows, effectively a set of records:

SELECT unnest(ARRAY[1,2]);
 unnest
--------
      1
      2
 
SELECT * FROM unnest(ARRAY[1,2]);
 unnest
--------
      1
      2
 
SELECT * FROM unnest(ARRAY[1,2], ARRAY[3, 4]) AS x (a, b);
 a | b
---+---
 1 | 3
 2 | 4

While the single-column unnest can be done in the target list, without a from, multi-column rows/records do require a from clause. One other popular set of record-returning function is generate_series.
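For example, generate_series behaves the same way in a from clause:

```sql
SELECT * FROM generate_series(1, 3) AS g (n);
 n
---
 1
 2
 3
```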


NoSQL Jsonb Multi-Column Type Casting

Monday, May 3, 2021

I already covered jsonb type casting of single values. However, if you are extracting multiple values from jsonb, that casting method can be cumbersome, and it also has limitations. Fortunately, there is another way — a way to retrieve multiple jsonb or json values and cast everything in one location.

Here is a simple table taken from my previous blog post:

CREATE TABLE test(x JSONB);
 
INSERT INTO test VALUES ('{"a": "xyz", "b": 5}');

Previously, to cast the first column as an sql text data type and the second column as integer, we would have to do:

SELECT (x->'a')::TEXT, (x->'b')::INTEGER FROM test;
 text  | int4
-------+------
 "xyz" |    5

though this can be done more simply using ->>, which also removes the double-quotes:

SELECT x->>'a', (x->'b')::INTEGER FROM test;
 ?column? | int4
----------+------
 xyz      |    5

Using jsonb_to_record(), it can be simplified to:

SELECT a, b, pg_typeof(a) AS a_type, pg_typeof(b) AS b_type
FROM test, jsonb_to_record(test.x) AS x (a TEXT, b INTEGER);
  a  | b | a_type | b_type
-----+---+--------+---------
 xyz | 5 | text   | integer

As you can see, the casting of both columns is done in one place. Of course, this casting method becomes more useful the more columns you have to cast.

Jsonb and json columns allow the creation of data rows inside of columns, unconstrained by the requirement of consistent column names or data types across rows. Effectively, the query above creates a query-scoped relation x to extract columns from the "row inside a column" jsonb column x. The query-scoped relation's columns x.a and x.b hold the values of the two json keys (a, b) we wish to extract. The casts happen as the values are extracted from the jsonb column, meaning that any reference to the query-scoped columns, e.g., in a where clause, will use the casted data types. For example, you can see in the queries below that b is treated as an integer:

SELECT x.a, b, pg_typeof(a) AS a_type, pg_typeof(b) AS b_type
FROM test, jsonb_to_record(test.x) AS x (a TEXT, b INTEGER)
WHERE b <= 4;
 a | b | a_type | b_type
---+---+--------+--------
 
SELECT x.a, b, pg_typeof(a) AS a_type, pg_typeof(b) AS b_type
FROM test, jsonb_to_record(test.x) AS x (a TEXT, b INTEGER)
WHERE b > 4;
  a  | b | a_type | b_type
-----+---+--------+---------
 xyz | 5 | text   | integer

While creating query-scoped relations might be overkill for simple casts that can be done in the target list, more complex cases easily benefit from query-scoped relations. Query-scoped relations are also useful for multi-key json access, because json stores the keys/values as text and the function parses it only once, even if multiple keys are requested. These email threads (1, 2) cover more of the details.


Internals Separating Cluster and Database Attributes

Friday, April 30, 2021

I have already covered the use of clusters, databases, and schemas, but a recent email thread highlighted the complexity of restoring database attributes and the objects inside of databases. In summary, create database copies the contents of databases, but not their attributes, except for database attributes that are tied to the database contents, like collation and encoding. The thread ends with application of a documentation patch that mentions that database-level permissions are also not copied. Thanks to Tom Lane for clarifying this issue.
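To make this concrete (the database names are hypothetical):

```sql
CREATE DATABASE newdb TEMPLATE olddb;
-- tables, functions, and other objects are copied
-- collation and encoding follow the template
-- GRANT ... ON DATABASE and ALTER DATABASE ... SET values are not copied
```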


Community Release Size Analysis, The 2021 Edition

Wednesday, April 28, 2021

In 2011, I wrote a query to analyze the sizes of all the major Postgres releases, so I thought it was time to update it, with the results below:

 version  |  reldate   | months | relnotes |  lines  | change | % change
----------+------------+--------+----------+---------+--------+----------
 4.2      | 1994-03-17 |        |          |  250872 |        |
 1.0      | 1995-09-05 |     18 |          |  172470 | -78402 |      -31
 1.01     | 1996-02-23 |      6 |          |  179463 |   6993 |        4
 1.09     | 1996-11-04 |      8 |          |  178976 |   -487 |        0
 6.0      | 1997-01-29 |      3 |          |  189399 |  10423 |        5
 6.1      | 1997-06-08 |      4 |          |  200709 |  11310 |        5
 6.2      | 1997-10-02 |      4 |          |  225848 |  25139 |       12
 6.3      | 1998-03-01 |      5 |          |  260809 |  34961 |       15
 6.4      | 1998-10-30 |      8 |          |  297918 |  37109 |       14
 6.5      | 1999-06-09 |      7 |          |  331278 |  33360 |       11
 7.0      | 2000-05-08 |     11 |          |  383270 |  51992 |       15
 7.1      | 2001-04-13 |     11 |          |  410500 |  27230 |        7
 7.2      | 2002-02-04 |     10 |      250 |  394274 | -16226 |       -3
 7.3      | 2002-11-27 |     10 |      305 |  453282 |  59008 |       14
 7.4      | 2003-11-17 |     12 |      263 |  508523 |  55241 |       12
 8.0      | 2005-01-19 |     14 |      230 |  654437 | 145914 |       28
 8.1      | 2005-11-08 |     10 |      174 |  630422 | -24015 |       -3
 8.2      | 2006-12-05 |     13 |      215 |  684646 |  54224 |        8
 8.3      | 2008-02-04 |     14 |      214 |  762697 |  78051 |       11
 8.4      | 2009-07-01 |     17 |      314 |  939098 | 176401 |       23
 9.0      | 2010-09-20 |     15 |      237 |  999862 |  60764 |        6
 9.1      | 2011-09-12 |     12 |      203 | 1069547 |  69685 |        6
 9.2      | 2012-09-10 |     12 |      238 | 1148192 |  78645 |        7
 9.3      | 2013-09-09 |     12 |      177 | 1195627 |  47435 |        4
 9.4      | 2014-12-18 |     15 |      211 | 1261024 |  65397 |        5
 9.5      | 2016-01-07 |     13 |      193 | 1340005 |  78981 |        6
 9.6      | 2016-09-29 |      9 |      214 | 1380458 |  40453 |        3
 10       | 2017-10-05 |     12 |      189 | 1495196 | 114738 |        8
 11       | 2018-10-18 |     12 |      170 | 1562537 |  67341 |        4
 12       | 2019-10-03 |     11 |      180 | 1616912 |  54375 |        3
 13       | 2020-09-24 |     12 |      178 | 1656030 |  39118 |        2
 Averages |            |     11 |      219 |         |        |     6.70

While the average change in the number of lines of code per release over the life of the project is 6.70%, the average for the last ten years is 4.67%. This is probably just symptomatic of a maturing code base and software. The number of features also shows a steady decline, but in many ways newer features are more complex to implement. The pace of releases has remained constant.


Community Roads and Bridges: The Unseen Labor Behind Our Digital Infrastructure

Monday, April 26, 2021

Having been involved in open source for the past 30 years, I have always felt like an outsider in social settings, having to explain the purpose of open source and its communities. (I remember having to explain what email, Usenet, and the Internet were too, but that need has passed.)

It is not often that I get to learn something about open source from someone else, but this 143-page report by the Ford Foundation gives a very detailed review of the landscape and health of the open source ecosystem. It also highlights various challenges and how they might be addressed, many of which relate directly to Postgres and its continued health.


Business Database Software Bundles

Friday, April 23, 2021

When you buy a proprietary database from one of those billion-dollar companies, you get more than a database — you get backup, monitoring, high availability, and maybe scaling features. These all come from the same company, though some might require extra cost.

With Postgres, the story is more complicated. While the Postgres project produces an excellent, full-featured database, there are areas in backup, monitoring, and high availability where the project tooling doesn't meet everyone's needs, and users have to look outside the project for solutions. Sometimes other open source projects fill the gap, but in other cases Postgres companies do — Cybertec, edb, HighGo, Ongres, Postgres Pro, sra oss, and others often provide the last mile of service and solutions to meet enterprise needs.

Ideally, the Postgres project would provide all the software needed to run Postgres efficiently, but it is not clear there is a "one size fits all" solution for many of these needs, so having multiple groups is ideal because they can efficiently focus on and adjust to those needs.


Business Oracle vs. PostgreSQL

Wednesday, April 21, 2021

Following my blog entry about assumptions, this long email thread gives an interesting comparison of Oracle and Postgres. The original poster compares install times, initial database sizes, and relative bug counts. From there, things get complicated. There are various negative comments about the complexity of administering Oracle, and the simplicity of Postgres, but the discussion takes a turn when someone says comparing Postgres to Oracle is like comparing a rubber duck and a 300,000-ton super tanker. He also states, "Oracle is also the single most feature-rich database out there — the feature set of Postgres isn't even 1% of what Oracle has."

It is actually good someone said that since it brought the discussion to a higher level where assumptions could be addressed. Someone easily disputed the 1% mention, and then got into a finer-grained comparison:

I'd say it's more in the vicinity of 80% or 90%, depending on which features you find more important, would be more realistic.

But then Postgres has features that Oracle has not, like …

From a dba point of view, the percentage is probably lower than 80%, from a developer's point of view, Oracle lacks a lot of things and the percentage would be greater than 100%.

I thought that was interesting — to compare the dba and developer feature sets separately, and I think that is accurate. Also, when someone mentions over 100%, it makes it clear that there are useful features Postgres has that Oracle lacks. The email thread ends by discussing the lack of simple cross-database queries in Postgres. My previous blog post explained the use of multiple clusters, databases, and schemas; the mismatch between how Postgres and other databases handle these can be confusing.
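For example, one workaround for the lack of cross-database queries is the dblink extension (a minimal sketch; the database and table names are made up):

```sql
CREATE EXTENSION dblink;

SELECT *
FROM dblink('dbname=otherdb', 'SELECT id, name FROM customers')
        AS t (id INTEGER, name TEXT);
```

Because dblink returns set of record, the query must supply a column definition list naming the result columns and their types.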


Business Challenging Assumptions

Monday, April 19, 2021

We all operate on assumptions, e.g., we expect to be alive tomorrow, we expect our family to always support us, we assume our car will start. Everyone needs to make some assumptions to function. But assumptions aren't always true, e.g., some day the sun will rise, and you will not be alive to see it. So, where does that leave us? We have to make assumptions to survive, but what we assume today might be false tomorrow. Fundamentally, successful people, while they rely on assumptions, are always alert for signs that their assumptions are false.

There are a number of assumptions about software that drive user behavior. While some of these assumptions might have been true in the past, they are now of questionable validity:

  • Software that requires payment is always better than free software
  • Open source code is less secure because it makes security flaws visible
  • Open source isn't developed by serious people
  • Oracle is the best database
  • I will always be employed for my Oracle skills

People who actively question assumptions and regularly test their validity are more likely to take advantage of new opportunities, and I think the list above validates that. I often speak to current and prospective edb customers, and they ask many questions — sometimes I think they are going to ask questions forever. However, often the source of the questions are based on incorrect or no longer valid assumptions they have about the software world. I find that explicitly stating the assumptions, and explaining why they might no longer be valid, to be an effective way to get beyond specific questions. It helps them realize that Postgres is a great opportunity for their organizations, but only if they are willing to see the world as it is now, and not as it was.


Administration Rerouting Server Log Entries

Friday, April 16, 2021

Postgres has a where, when, and what set of options to control database server log messages. One frequent request is to allow log messages to be split into multiple files or sent to multiple servers. Postgres doesn't support this capability internally, but rsyslog, enabled with log_destination=syslog, allows complex filtering of database server log messages. Specifically, rsyslog's complex filters, combined with log_line_prefix, allow almost unlimited control over how log messages are recorded.
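For example (illustrative settings; the filter string must match whatever your log_line_prefix emits), a per-database split might look like:

```
# postgresql.conf
log_destination = 'syslog'
log_line_prefix = 'db=%d '

# /etc/rsyslog.d/postgres.conf
if $programname == 'postgres' and $msg contains 'db=sales' then {
        action(type="omfile" file="/var/log/postgres-sales.log")
}
```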


Administration The Power of Synchronous_commit

Wednesday, April 14, 2021

Postgres has a lot of settings — 329 in Postgres 13. Most of these settings control one specific thing, and the defaults are fine for most use cases. Synchronous_commit is a setting that isn't simple and perhaps controls too many things, though there are operational restrictions which prevent it from being split into separate settings.

In fact, this setting is so complex that when I started to research what it controls, I realized our existing documentation was not completely clear on all of its capabilities. Therefore, I rewrote this setting's documentation in October of 2020, and the updated documentation was released in our November 12th minor releases. As you can see from the table added to the docs, synchronous_commit controls not only the local durability of commits, but also standby behavior.

First, let's focus on its local behavior, that is, without considering standby servers. The value off causes commits not to wait for their associated records to be flushed to the write-ahead log. It does, however, guarantee data consistency, meaning that in the event of a crash, the transactions it does recover will be consistent with each other. It just doesn't guarantee that all committed transactions will be restored in the event of a crash. This is in contrast to the setting fsync=off, which does not guarantee data consistency, meaning, after an operating system crash, some transactions might be partial or inconsistent with other records.

The local option waits for the commit to be durably recorded, meaning any crash recovery will restore all committed database changes. All the other options add waiting for standby servers, while synchronous_standby_names controls which or how many standby servers to wait for. The synchronous_commit option remote_write waits for one or more standby servers to have their transaction information durably stored in case of a Postgres crash. Option on (the default) is similar to remote_write but is durable against standby operating system crashes, not just Postgres crashes. Option remote_apply waits for the transaction to be replayed and visible to future read-only transactions on the standby server. (I covered the details of when commits are visible in a previous blog entry.)

Given the scope of synchronous_commit, you might assume changing it requires a server restart. On the contrary, it can be set to different values in different sessions on the same server, and can even be changed mid-session. Yeah, kind of crazy, but it works, and it allows you to control transaction durability and visibility depending on the importance of the transaction. As Postgres covers in its documentation, while Postgres is a fully durable relational database, it does allow fine-grained control over the durability and even standby visibility of transactions.
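For example (the table names are hypothetical):

```sql
-- relax durability for low-value data in this session
SET synchronous_commit = off;
INSERT INTO page_views VALUES (now(), '/index.html');

-- require standby visibility for one critical transaction
BEGIN;
SET LOCAL synchronous_commit = remote_apply;
INSERT INTO payments VALUES (1001, 49.99);
COMMIT;
```

SET LOCAL restricts the stronger setting to the single transaction, so the session's cheaper default resumes after the commit.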


Administration Shared Memory Sizing

Monday, April 12, 2021

Postgres makes extensive use of operating system shared memory, and I have already written a presentation about it. The Postgres documentation gives specific instructions on how to determine the amount of shared memory allocated, specifically for sizing huge pages.

However, exactly what is inside Postgres shared memory was somewhat of a mystery to end users until Postgres 13 added pg_shmem_allocations, which gives a detailed list of the uses and sizes of allocated shared memory:

SELECT *, pg_size_pretty(allocated_size) FROM pg_shmem_allocations ORDER BY size DESC;
 
                name                 |    off     |    size    | allocated_size | pg_size_pretty 
-------------------------------------+------------+------------+----------------+----------------
 Buffer Blocks                       |   69908224 | 6442450944 |     6442450944 | 6144 MB
 Buffer Descriptors                  |   19576576 |   50331648 |       50331648 | 48 MB
 <anonymous>                         |     (null) |   43655168 |       43655168 | 42 MB
 Buffer IO Locks                     | 6512359168 |   25165824 |       25165824 | 24 MB
 Checkpointer Data                   | 6597411712 |   18874432 |       18874496 | 18 MB
 XLOG Ctl                            |      53888 |   16803472 |       16803584 | 16 MB
 Checkpoint BufferIds                | 6537524992 |   15728640 |       15728640 | 15 MB
 Xact                                |   16857856 |    2116320 |        2116352 | 2067 kB
 (null)                              | 6616370560 |    1921664 |        1921664 | 1877 kB
 Subtrans                            |   19107968 |     267008 |         267008 | 261 kB
 CommitTs                            |   18974208 |     133568 |         133632 | 131 kB
 MultiXactMember                     |   19441792 |     133568 |         133632 | 131 kB
 Serial                              | 6596798464 |     133568 |         133632 | 131 kB
 Backend Activity Buffer             | 6597153664 |     132096 |         132096 | 129 kB
 shmInvalBuffer                      | 6597332864 |      69464 |          69504 | 68 kB
 Notify                              | 6616303744 |      66816 |          66816 | 65 kB
 MultiXactOffset                     |   19374976 |      66816 |          66816 | 65 kB
 Backend Status Array                | 6597082240 |      54696 |          54784 | 54 kB
 Backend SSL Status Buffer           | 6597285760 |      42312 |          42368 | 41 kB
 Shared Buffer Lookup Table          | 6553253632 |      33624 |          33664 | 33 kB
 KnownAssignedXids                   | 6597042560 |      31720 |          31744 | 31 kB
 ProcSignal                          | 6597403392 |       8264 |           8320 | 8320 bytes
 Backend Client Host Name Buffer     | 6597145344 |       8256 |           8320 | 8320 bytes
 Backend Application Name Buffer     | 6597137024 |       8256 |           8320 | 8320 bytes
 KnownAssignedXidsValid              | 6597074304 |       7930 |           7936 | 7936 bytes
 AutoVacuum Data                     | 6616286208 |       5368 |           5376 | 5376 bytes
 Background Worker Data              | 6597328256 |       4496 |           4608 | 4608 bytes
 Fast Path Strong Relation Lock Data | 6594531840 |       4100 |           4224 | 4224 bytes
 PROCLOCK hash                       | 6593836672 |       2904 |           2944 | 2944 bytes
 PREDICATELOCK hash                  | 6594981376 |       2904 |           2944 | 2944 bytes
 SERIALIZABLEXID hash                | 6596446976 |       2904 |           2944 | 2944 bytes
 PREDICATELOCKTARGET hash            | 6594536064 |       2904 |           2944 | 2944 bytes
 LOCK hash                           | 6593141504 |       2904 |           2944 | 2944 bytes
 Async Queue Control                 | 6616301184 |       2492 |           2560 | 2560 bytes
 ReplicationSlot Ctl                 | 6616291584 |       2480 |           2560 | 2560 bytes
 Wal Receiver Ctl                    | 6616295936 |       2248 |           2304 | 2304 bytes
 BTree Vacuum State                  | 6616298880 |       1476 |           1536 | 1536 bytes
 Wal Sender Ctl                      | 6616294784 |       1040 |           1152 | 1152 bytes
 Shared MultiXact State              |   19575424 |       1028 |           1152 | 1152 bytes
 PMSignalState                       | 6597402368 |       1020 |           1024 | 1024 bytes
 Sync Scan Locations List            | 6616300416 |        656 |            768 | 768 bytes
 ReplicationOriginState              | 6616294144 |        568 |            640 | 640 bytes
 Proc Array                          | 6597041920 |        528 |            640 | 640 bytes
 Logical Replication Launcher Data   | 6616298240 |        424 |            512 | 512 bytes
 Control File                        |   16857472 |        296 |            384 | 384 bytes
 Proc Header                         | 6596932224 |        104 |            128 | 128 bytes
 PredXactList                        | 6596241792 |         88 |            128 | 128 bytes
 OldSnapshotControlData              | 6616298752 |         68 |            128 | 128 bytes
 CommitTs shared                     |   19107840 |         32 |            128 | 128 bytes
 Buffer Strategy Status              | 6593141376 |         28 |            128 | 128 bytes
 RWConflictPool                      | 6596505344 |         24 |            128 | 128 bytes
 FinishedSerializableTransactions    | 6596798336 |         16 |            128 | 128 bytes
 Prepared Transaction Table          | 6597328128 |         16 |            128 | 128 bytes
 SerialControlData                   | 6596932096 |         12 |            128 | 128 bytes

The null name is unused shared memory. Summing the allocated_size column yields a number similar to the output of pmap, which is described in the documentation above. On my system, pmap is ~1% higher than allocated_size, probably because shared library memory is also counted by pmap.
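Rather than summing the column by hand, the view can produce the total directly on any Postgres 13 or later server:

```sql
-- total shared memory allocated, in human-readable form
SELECT pg_size_pretty(SUM(allocated_size)) AS total_allocated
FROM pg_shmem_allocations;
```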

View or Post Comments

Administration Replica Scaling by the Numbers

Friday, April 9, 2021

I have previously blogged about scaling of read-only and read/write workloads. Inspired by this email thread, I want to be more specific about exactly what scaling you can expect by using read-only replicas. First, as the email poster pointed out, read replicas still have to replay write transactions from the primary server. While the replica servers don't need to execute the write queries when using binary replication, they still need to perform the same heap/index writes that the primary performs. (Logical replication replay, in contrast, adds significant cpu overhead.)

Let's look at rough numbers of the load on binary-replication replicas. Suppose the write load on the primary is 30% — that leaves 70% of the i/o for read-only queries, and even more than 70% of the cpu capacity since binary replication replay uses much less cpu than running the actual write queries. If there are three replicas, that means you have 210% of the primary's read-query capacity available on the replicas.

However, if you have a 75% write load on the primary, you only have an extra 25% capacity on the replicas, so your scaling ability is much more limited — it would take four replicas just to double the read capacity of the primary. This explains why high write workloads only effectively scale read-only processing using sharding.
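The arithmetic above can be checked directly: with write fraction w on the primary, N binary-replication replicas supply roughly N × (1 − w) of one server's read capacity (a rough sketch that ignores replay's lower cpu cost):

```sql
-- replica read capacity as a fraction of one full server's capacity
SELECT 3 * (1 - 0.30) AS low_write_load,   -- 2.10, i.e. 210%
       4 * (1 - 0.75) AS high_write_load;  -- 1.00, i.e. one server's worth
```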

View or Post Comments

Administration Operating System Choice

Wednesday, April 7, 2021

In the 1990's, most server operating systems were closed-source and produced by a few billion-dollar companies. In recent years, Red Hat Enterprise Linux (rhel) became the default enterprise server operating system, with centos filling the need for installations on smaller platforms and in the cloud. (Centos was sometimes the only operating system used.) With centos moving away from being a stable release that tracks rhel, there is renewed interest in other operating systems.

There are many viable operating system choices, and with so many choices, selection can be difficult. When choosing an operating system to host a database like Postgres, there are some important criteria:

  • Does the operating system have a server focus, or desktop/laptop/mobile focus?
  • How easy is it to administer, and does your staff have the appropriate skills?
  • What support options are available? What is the operating system's support life span?
  • Are upgrades easily handled? How often do operating system upgrades require rebooting?
  • Is the operating system easily installed on developer workstations, laptops, and smaller devices?
  • Is there sufficient hardware support? Is performance acceptable?
  • Is cloud deployment easy?

Once you deploy a database like Postgres on an operating system, changing operating systems can require significant testing and disruption, so it is wise to choose the right one at the start. While I didn't mention operating system licenses before, the license can often influence the items listed above, like easy installation on smaller hardware and in the cloud, and the availability of multiple support providers. Debian, Ubuntu, FreeBSD, Fedora, Suse, Rocky Linux, and others all have different strengths. I think the above bulleted list is a good start in evaluating those options. Fortunately, Postgres is supported on all popular operating systems, so Postgres should not be a limiting factor in choosing one.

View or Post Comments

Administration Many Upgrade Methods

Monday, April 5, 2021

Databases are often a critical part of enterprise infrastructure, so when and how to upgrade them is a common discussion topic among database administrators, and Postgres is no exception. The Postgres community has a web page about when to upgrade, and the Postgres documentation has a section about upgrading.

Minor upgrades with Postgres are simple. Though the documentation lists three primary major-version upgrade methods, there are actually eight possible options:

  1. pg_dumpall/restore
  2. pg_dump/restore databases serially
  3. pg_dump/restore databases in parallel
  4. pg_dump/restore in parallel mode and databases serially
  5. pg_dump/restore in parallel mode and databases in parallel
  6. pg_upgrade in copy mode (with optional parallel mode)
  7. pg_upgrade in link mode (with optional parallel mode)
  8. Logical replication

Options 1–5 are based on pg_dump/pg_dumpall, options 6 & 7 use pg_upgrade, and #8 uses logical replication. The big question is which option to choose. Option 1 is the simplest and ideal for small clusters that can tolerate the downtime of serially dumping and reloading. Options 3–5 use the same method as 1 & 2 but allow parallelism in dump and restore. It can be tricky to figure out which databases to dump/reload in parallel, and how much parallelism to specify for each dump/restore operation — you basically have to figure out how to fully utilize the cpu and i/o resources without overwhelming them and making things slower. I wish there were simple rules people could follow when specifying parallelism, but I have never seen any; it is probably necessary to do performance testing of various options.

Options 6 & 7 use pg_upgrade which can rapidly perform upgrades with minimal downtime. Option 6 allows restarting the old server even after the new server is started, but requires copying all of the user heap and index files, which can be slow. Option 7 can be very fast and its speed depends primarily on how quickly it can create empty user objects — terabyte upgrades can be performed in minutes. It also is the only option that doesn't require twice as much storage space for the upgrade. Options 6 & 7 do not remove bloat that might exist in the old cluster.

Option 8 uses logical replication, which requires significant configuration, but the upgrade can be very quick, even faster than pg_upgrade.
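As a sketch of option 8, a logical-replication upgrade looks roughly like this (the publication/subscription names and connection string are hypothetical, and the schema must first be copied to the new cluster, e.g. with pg_dump --schema-only):

```sql
-- On the old (publisher) cluster:
CREATE PUBLICATION upgrade_pub FOR ALL TABLES;

-- On the new (subscriber) cluster, after restoring the schema:
CREATE SUBSCRIPTION upgrade_sub
    CONNECTION 'host=old-server dbname=mydb'
    PUBLICATION upgrade_pub;

-- Once the subscriber has caught up, point applications at the
-- new cluster and drop the subscription:
DROP SUBSCRIPTION upgrade_sub;
```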

Hopefully this blog post has improved your understanding of available options for Postgres major version upgrades. Here are the criteria you need to consider when choosing an upgrade method:

  • Reliability/simplicity: 1 & 6
  • Minimal downtime: 7 & 8
  • Limited storage required: 7
  • Bloat removal: all but 6 & 7
View or Post Comments