Bruce Momjian

Postgres Blog

This blog is about my work on the Postgres open source database, and is published on Planet PostgreSQL. PgLife allows monitoring of all Postgres community activity.


Thoughts Two Interviews

Friday, July 10, 2020

I have done two interviews in the past month. The first one was done by the Linux Inlaws and was published on Hacker Public Radio. This interview discusses the history and open source aspects of the Postgres project, and its health and future direction.

The second interview is more personal, discussing how I got involved in computers and my early experiences with Postgres. It also discusses technology disruption, and its challenge to time management. The final quarter covers religious topics.


News Postgres Marketing

Wednesday, July 8, 2020

Postgres is mostly a technology-driven community, so marketing often suffers. However, one great thing about the community is that it is distributed, so anyone can get involved and help. Here are some examples of very successful community-driven marketing ideas:

Here are front and rear images of these objects. Of course, I have also accumulated many Postgres pins over the years. I am inspired by these marketing efforts and hope they continue.


Indexing Boolean Indexes

Friday, July 3, 2020

For btree and hash indexes to be used for lookups, values being requested must be very restrictive, roughly 3-5% of a table's rows. Let's consider a boolean column — it can contain only three values: true, false, and null. By definition, at least one of those three values will be in more than 5% of the table's rows, so why index them? A better solution, particularly for boolean fields, is to create a partial index so only rare values are indexed. Partial indexes can also be used to index non-boolean fields when there are a large number of common values that aren't worth indexing.
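As a sketch (the table and column names are hypothetical), a partial index on a rare boolean value might look like:

```sql
-- assume most orders are processed, so unprocessed rows are rare
CREATE TABLE orders (id INTEGER, processed BOOLEAN);

-- index only the rare value, keeping the index small
CREATE INDEX i_orders_unprocessed ON orders (processed)
    WHERE processed = false;

-- this query can use the partial index
SELECT id FROM orders WHERE processed = false;
```

Queries looking for the common value, e.g., processed = true, simply fall back to a sequential scan, which is what the optimizer would have chosen anyway.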


Indexing Global Indexes

Wednesday, July 1, 2020

Postgres indexes can only be defined on single tables. Why would you want to have indexes that reference multiple tables, i.e., global indexes?

This email covers some of the reasons why global indexes can be useful. One big use-case is the ability to create indexes on partitioned tables that index all its child tables, rather than requiring a separate index on each child table. This would allow references to partitioned tables as foreign keys without requiring the partition key to be part of the foreign key reference; Postgres 12 allows such foreign keys if they match partition keys.

A second use-case for global indexes is the ability to add a uniqueness constraint to a partitioned table where the unique columns are not part of the partition key. A third use-case is the ability to index values that only appear in a few partitions, and are not part of the partition key. A global index would avoid having to check each partition table's index for the desired value.
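To illustrate the second use-case (table and column names are hypothetical), Postgres currently rejects a unique constraint on a partitioned table unless it includes the partition key:

```sql
CREATE TABLE measurements (
    city_id  INTEGER,
    logdate  DATE,
    reading  NUMERIC
) PARTITION BY RANGE (logdate);

-- fails: the unique columns must include the partition key (logdate),
-- because each partition has its own index and uniqueness can only be
-- enforced within a single partition
ALTER TABLE measurements ADD CONSTRAINT measurements_city_uniq UNIQUE (city_id);
```

A global index spanning all partitions would remove this restriction.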

It is still unclear if these use-cases justify the architectural changes needed to enable global indexes. Some of these features can be simulated using triggers and user lookup tables. A large global index might also reintroduce problems that prompted the creation of partitioning in the first place.


Hardware Hardware Acceleration for Databases

Monday, June 29, 2020

There is a long history of hardware acceleration, i.e., hardware modules helping the cpu. There was the 80287 math coprocessor, sound cards, and video cards. The computer industry is constantly moving things from the cpu to the motherboard and external cards, and back again. Movement is mostly determined by whether the cpu is able to efficiently perform the task, the transfer bandwidth needed to perform the task, and the flexibility of replaceable external cards.

This year, the big question for database software is if and how to make use of graphics processing units (gpus) and field-programmable gate arrays (fpgas). This article does a good job of explaining the history of hardware acceleration, and mentions Netezza's (no longer used) use of fpgas as hardware acceleration for databases.

The same historic hardware acceleration questions apply to database acceleration today:

  • Are they better suited than cpus to do some database processing tasks, and how common are those tasks?
  • Is there sufficient transfer bandwidth to gpus and fpgas to justify their use?
  • Is hardware acceleration worth the deployment complexity?

PgOpenCL, PG-Strom, and HeteroDB are projects that are experimenting with the value of gpus and fpgas in Postgres. As cloud providers increase the availability of gpus and fpgas, we might start to see their usage increase.


Data modeling Can Case Comparison Be Controlled?

Friday, June 26, 2020

Computer tasks are one of the most precise activities we do on a daily basis. Driving, cooking, walking, and reading are fairly imprecise compared to computer interaction.

Computers represent symbols like "a" and "A" precisely and require external facilities to define relationships between them. This email thread makes a convincing argument that you usually want case-preserving, but less-precise case-insensitive behavior.

Let's go over some Postgres case-precision behaviors like the handling of character strings, identifiers, and keywords. For example, these queries do the same thing:

SELECT COUNT(*) FROM pg_class;
Select Count(*) From Pg_Class;

This is because Postgres, and the sql standard, ignore the case of keywords, e.g., select. They also ignore the case of identifiers, e.g., pg_class, when not double-quoted. Double-quoting adds case precision to identifiers:

SELECT "count"() FROM "pg_class";
SELECT "COUNT"() FROM "pg_class";
ERROR:  function COUNT() does not exist
LINE 1: SELECT "COUNT"() FROM "pg_class";
HINT:  No function matches the given name and argument types. You might need to add explicit type casts.
SELECT "Count"() FROM "Pg_Class";
ERROR:  relation "Pg_Class" does not exist
LINE 1: SELECT "Count"() FROM "Pg_Class";
SELECT "count"() FROM "PG_CLASS";
ERROR:  relation "PG_CLASS" does not exist
LINE 1: SELECT "count"() FROM "PG_CLASS";

There is no ability to add case precision to keywords:

"SELECT" COUNT() FROM pg_class;
ERROR:  syntax error at or near ""SELECT""
LINE 1: "SELECT" COUNT() FROM pg_class;

When comparing values, Postgres is precise by default:

SELECT 'a' = 'A';
SELECT '-' = '_';
SELECT '.' = ',';

For certain symbols, it is sometimes visually hard to see the difference.

As shown above, double-quoting adds precision to identifiers. For value comparisons, you have to be explicit to remove precision:

SELECT upper('a') = upper('A');
SELECT lower('a') = lower('A');

Most people aren't comparing constants in sql but rather comparing column values:

SELECT oid FROM pg_class WHERE relname = 'pg_class';
SELECT oid FROM PG_CLASS WHERE relname = 'pg_class';
SELECT oid FROM pg_class WHERE relname = 'Pg_Class';
SELECT oid FROM pg_class WHERE relname = 'PG_CLASS';

Notice that these queries use pg_class as an identifier (without single quotes) and as a value (with single quotes). The identifier usage is case insensitive; the value usage is case sensitive. You can explicitly reduce comparison precision using function calls:

SELECT oid FROM pg_class WHERE lower(relname) = lower('pg_class');
SELECT oid FROM pg_class WHERE lower(relname) = lower('Pg_Class');
SELECT oid FROM pg_class WHERE lower(relname) = lower('PG_CLASS');

These convert the column values and constants to lower case before comparison. (Upper case could also have been used.) Indexes are also case sensitive by default, and it obviously would be inefficient to lower-case every index entry for comparison, so function calls on columns cannot use an ordinary column index:

EXPLAIN SELECT oid FROM pg_class WHERE relname = 'pg_class';
                                        QUERY PLAN
 Index Scan using pg_class_relname_nsp_index on pg_class  (cost=0.27..8.29 rows=1 width=4)
   Index Cond: (relname = 'pg_class'::name)
EXPLAIN SELECT oid FROM pg_class WHERE lower(relname) = lower('pg_class');
                       QUERY PLAN
 Seq Scan on pg_class  (cost=0.00..19.76 rows=2 width=4)
   Filter: (lower((relname)::text) = 'pg_class'::text)

They can use expression indexes that are created to match function calls:

-- create a user table because users can't create indexes on system tables
CREATE TABLE my_pg_class AS SELECT * FROM pg_class;
-- create non-expression index
CREATE INDEX i_my_pg_class_relname ON my_pg_class (relname);
-- create expression index
CREATE INDEX i_my_pg_class_relname_lower ON my_pg_class ((lower(relname)));
-- The optimizer needs statistics.  Running ANALYZE after the expression
-- index is created generates statistics on the expression.
-- Autovacuum would have eventually done this automatically.
ANALYZE my_pg_class;
-- use non-expression index
EXPLAIN SELECT oid FROM my_pg_class WHERE relname = 'pg_class';
                                       QUERY PLAN
 Index Scan using i_my_pg_class_relname on my_pg_class  (cost=0.27..8.29 rows=1 width=4)
   Index Cond: (relname = 'pg_class'::name)
-- use expression index
EXPLAIN SELECT oid FROM my_pg_class WHERE lower(relname) = lower('pg_class');
                                          QUERY PLAN
 Index Scan using i_my_pg_class_relname_lower on my_pg_class  (cost=0.27..8.29 rows=1 width=4)
   Index Cond: (lower((relname)::text) = 'pg_class'::text)

The citext extension allows the creation of columns whose values are automatically compared in a case-insensitive manner:

-- the citext extension must be installed first
CREATE EXTENSION citext;
-- 'x' column added so the row has a typical length
CREATE TABLE my_pg_class2 AS SELECT oid, relname::citext, repeat('x', 256) FROM pg_class;
CREATE INDEX i_my_pg_class_relname2 ON my_pg_class2 (relname);
ANALYZE my_pg_class2;
\d my_pg_class2
            Table "public.my_pg_class2"
 Column  |  Type  | Collation | Nullable | Default
---------+--------+-----------+----------+---------
 oid     | oid    |           |          |
 relname | citext | C         |          |
 repeat  | text   |           |          |
Indexes:
    "i_my_pg_class_relname2" btree (relname)
SELECT oid FROM my_pg_class2 WHERE relname = 'pg_class';
SELECT oid FROM my_pg_class2 WHERE relname = 'PG_CLASS';
EXPLAIN SELECT oid FROM my_pg_class2 WHERE relname = 'pg_class';
                                        QUERY PLAN
 Index Scan using i_my_pg_class_relname2 on my_pg_class2  (cost=0.27..8.29 rows=1 width=4)
   Index Cond: (relname = 'pg_class'::citext)
EXPLAIN SELECT oid FROM my_pg_class2 WHERE relname = 'PG_CLASS';
                                        QUERY PLAN
 Index Scan using i_my_pg_class_relname2 on my_pg_class2  (cost=0.27..8.29 rows=1 width=4)
   Index Cond: (relname = 'PG_CLASS'::citext)

There are more facilities available to further reduce precision:

If Postgres 12 or later is compiled with icu library support (check the system table column pg_collation.collprovider), you can use nondeterministic collations that are case- and accent-insensitive.
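For example (the collation and table names here are hypothetical; the icu locale syntax follows the Postgres documentation):

```sql
-- case-insensitive collation using icu
CREATE COLLATION case_insensitive (
    provider = icu, locale = 'und-u-ks-level2', deterministic = false
);

CREATE TABLE users (name text COLLATE case_insensitive);
INSERT INTO users VALUES ('Bruce');

-- matches despite the case difference
SELECT * FROM users WHERE name = 'BRUCE';
```

Unlike the lower()/citext approaches above, the comparison rules live in the column's collation, so ordinary operators and indexes honor them.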


Data modeling Force One Row

Monday, June 22, 2020

How can you force a table to have at most one row? It is actually very easy by creating a unique expression index on a constant, with no column name references:

-- create the table and a unique expression index on a constant
CREATE TABLE onerow (a INTEGER, b TEXT);
CREATE UNIQUE INDEX onerow_expr_idx ON onerow ((1));
-- this adds a single row
INSERT INTO onerow VALUES (1, 'foo')
        ON CONFLICT ((1)) DO UPDATE SET a = excluded.a, b = excluded.b;
SELECT * FROM onerow;
 a |  b
 1 | foo
-- this updates the single row
INSERT INTO onerow VALUES (2, 'bar')
        ON CONFLICT ((1)) DO UPDATE SET a = excluded.a, b = excluded.b;
SELECT * FROM onerow;
 a |  b
 2 | bar
-- this also updates the single row
INSERT INTO onerow VALUES (3, 'baz')
        ON CONFLICT ((1)) DO UPDATE SET a = excluded.a, b = excluded.b;
SELECT * FROM onerow;
 a |  b
 3 | baz
-- try INSERT without ON CONFLICT
INSERT INTO onerow VALUES (4, 'foo2');
ERROR:  duplicate key value violates unique constraint "onerow_expr_idx"
DETAIL:  Key ((1))=(1) already exists.

Using on conflict, it is possible to add a row or, if a row already exists, replace it with new values.


Data modeling Storing Binary Data in the Database

Friday, June 19, 2020

There are some very good responses in an email thread about whether to store binary data in Postgres or externally. The binary storage options discussed were:

  • In the database (toast helps with performance)
  • In another database, like SQLite
  • In a local or network file system
  • Using cloud storage

This email reply had many good insights, and this wiki page has even more. I have covered data storage outside of databases before.


Conference Dinner Q&A

Wednesday, June 17, 2020

My employer, EnterpriseDB, has been organizing events where potential customers and interested people can ask me questions while enjoying a meal. I thought the idea was strange, but I have done it ten times, and they have gone very well. The Q&A portion usually lasts one hour and forty-five minutes. During a November, 2019 event in Utrecht, the Netherlands, a sketch artist was present. The artist illustrated my ideas as I spoke and created this diagram, which I found quite interesting.


Client Controlling Server Variables at Connection Time

Monday, June 15, 2020

I have recently covered the importance of libpq environment variables and connection specification options. While most libpq options control how to connect to the Postgres server, there is one special option that can change variables on the server you connect to, e.g.:

$ psql 'options=-cwork_mem=100MB dbname=test'
psql (13devel)
Type "help" for help.
test=> SHOW work_mem;
 work_mem
----------
 100MB

This can also be done using environment variables, with all the benefits of environment variables:

$ PGOPTIONS="-c work_mem=100MB" psql test

These settings can also be set at the user, database, and cluster level on the database side too, but having control on the client side is often useful.
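For comparison, a sketch of setting the same variable on the database side (the role and database names are hypothetical):

```sql
-- per-role default
ALTER ROLE appuser SET work_mem = '100MB';

-- per-database default
ALTER DATABASE test SET work_mem = '100MB';

-- cluster-wide default; requires a configuration reload
ALTER SYSTEM SET work_mem = '100MB';
SELECT pg_reload_conf();
```

The client-side options shown above override these server-side defaults for a single connection.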


Client Connect Parameter Specification Options

Friday, June 12, 2020

I have previously covered the importance of libpq and environment variables. While you can specify discrete connection command-line parameters and environment variables, there is a catch-all setting that allows connection options to be specified in a single string, e.g.:

$ psql -d test
psql (13devel)
Type "help" for help.
test=> \q
$ psql --dbname test
psql (13devel)
Type "help" for help.
test=> \q
$ psql 'dbname=test'
psql (13devel)
Type "help" for help.
test=> \q

The first psql command uses a single-letter command-line option. The second one uses a long-format option. The third uses a parameter key word. Multiple key words can be used to specify multiple connection options:

$ psql ' port=5433 dbname=test'

You can also use a uri syntax to specify the same parameters as above:

$ psql postgresql://


Client Controlling Connection Parameters Using Environment Variables

Wednesday, June 10, 2020

Libpq is the Postgres connection library used by almost every non-jdbc application. It allows many connection parameters, which can be specified on the command line or embedded in applications:

$ psql -h -d mydb

In the above case, the psql host name and database name are specified on the command-line and interpreted by libpq. However, it is also possible to specify parameters using environment variables, which are also interpreted by libpq:

$ PGDATABASE=mydb psql

There is obviously no value in specifying libpq parameters using environment variables in this example, but there are use cases. For example, if you want to perform multiple operations on the same host name and database, you can do:

$ export
$ export PGDATABASE=mydb
$ vacuumdb
$ reindexdb

This avoids specifying the host and database names multiple times, though with a loss of clarity. Sometimes environment variables are best used as defaults when connection options are not specified:

$ export
$ export PGDATABASE=mydb
$ reindexdb
$ reindexdb --dbname mydb2
$ reindexdb --host --dbname mydb3

This reindexes databases mydb and mydb2 on host, and database mydb3 on

Another use-case for environment variables is to set parameters for users, without having to pass them as parameters to commands:

$ # must use 'sh' so the redirect happens as root
$ sudo sh -c "echo PGHOST='' >> ~bruce/.profile"

By appending this to bruce's .profile file, all applications that use libpq without a specific host name will connect to automatically. Once bruce logs out and back in again, all his applications will start using the new .profile pghost setting.

Finally, environment variables make it possible to set default connection values for all users. For example, on Debian, to default all tcp (non-Unix Domain socket) connections to fully verify ssl certificates, you can do:

$ sudo sh -c "echo PGSSLMODE='verify-full' >> /etc/profile.d/"

Of course, you can change environment variables set at login and applications can override connection parameters set by environment variables.


Administration Safety Systems Can Reduce Safety

Monday, June 8, 2020

What is the purpose of safety systems? To make things safer? To make them appear safer? To satisfy some external requirement? The purpose of safety systems is not always clear, but even for safety systems whose sole purpose is to increase safety — do they always succeed in increasing safety? The simple answer is "no". Here are three examples:

In all three cases, safety systems did not prevent disasters — they caused them. Safety systems are often very useful, and the modern world could not operate without them. However, they also add complexity, and that added complexity can introduce failure modes that did not exist without the safety systems.

So, where does that leave us? Safety systems are useful, but too many of them are bad? How many is too many? These are hard questions, but there are some guidelines:

  • How serious is the failure that the safety system is trying to prevent?
  • How likely is the failure that the safety system is trying to prevent?
  • How likely is the safety system to fail?
  • What impact will failure of the safety system have on the overall system?

This gets into a very complex calculus where you are computing the likelihood and seriousness of the failure that the safety system is trying to prevent, and the likelihood and seriousness of safety system failure. The big point is that while you are computing the likelihood and seriousness of failures and adding safety systems to compensate for them, you have to be aware of the cost of adding those safety systems, in both the likelihood and impact of their failure.

What does this have to do with databases? Well, you might have a database on an airplane or at a nuclear reactor site, in which case the database is part of a complex system. However, databases are also complex systems, and we regularly add safety systems to increase their reliability. How often do we consider the cost of those safety systems, in terms of the likelihood and seriousness of failures? Let's look at some typical database safety systems:

  • Backups
  • Error reporting
  • Replication
  • Connection pooling

Let's walk through the calculus for backups:

  • How serious is the failure that the safety system is trying to prevent? High
  • How likely is the failure that the safety system is trying to prevent? High
  • How reliable is the safety system? Medium
  • What impact will failure of the safety system have on the overall system? Low

The last question does not consider the seriousness of a backup failing to perform its safety goal, but rather how likely a backup is to cause a failure on its own. It could fill up the disk with backup files, or cause too much load on the system, but those risks are low. Similarly, error reporting has minimal risk of destabilizing the system, except for consuming I/O and storage space.

Replication and connection poolers are in a different class of risk. Replication's goal is to allow for controlled switchover or failover in case of primary server failure, but what is its reliability and the impact if it fails? If synchronous replication is used, failure to replicate will cause the system to stop accepting writes. This can be caused by a network failure to the replicas, a replica outage, or even misconfiguration. Even failure of asynchronous replication can cause the write-ahead log directory to fill up storage, causing an outage.

Similarly, the failure of a connection pooler can cause a total outage. You can use multiple poolers, but what if the failure of one pooler prevents the other from working, or if they both work but conflict with each other? Multiple poolers can also add complexity to the system which makes debugging other problems harder. This is a great example where, to avoid the problems of safety system failure, you create two safety systems, but the two safety systems can interact in ways that make two safety systems less reliable than one safety system. Multi-master replication can have similar downsides. Even auto-failover has risks, and auto-failback, even more.

The bottom line is that safety systems can be useful, but they can also add complexity which makes systems more fragile and harder to control; consider how safety systems interact and implement them only where their value is clear.


Administration When Does a Commit Happen?

Wednesday, June 3, 2020

Most people who deal with relational databases think of transaction commits as binary operations — the query is running and not yet committed, then it is completed and committed. However, internally, there are many stages to a commit:

  1. Write commit record to the write-ahead log
  2. Flush the write-ahead log to durable storage
  3. Update pg_xact (slide 57)
  4. Transfer to replicas
  5. Mark the commit as visible to other sessions (ProcArrayEndTransaction() updates PGPROC)
  6. Communicate commit to the client

These steps are implemented in RecordTransactionCommit().

What is interesting is that this process can be interrupted at anytime, by perhaps a server crash or network failure. For example:

  • The commit could be flushed to disk but not yet visible to other clients
  • The commit could be replicated (and visible to queries on replicas) but not visible to queries on the primary
  • Queries could appear committed to other sessions before the client issuing the query receives notification

This email thread explains the problem that commit with standbys is not always an atomic operation. Postgres has the function txid_status() which allows clients to check if a transaction, perhaps from a previous disconnected session, was committed.
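A sketch of using txid_status() (the table name and transaction id are illustrative):

```sql
-- record the transaction id before committing
BEGIN;
SELECT txid_current();        -- suppose this returns 530
INSERT INTO t VALUES (1);
COMMIT;

-- later, perhaps from a new connection after a network failure,
-- check the fate of that transaction
SELECT txid_status(530);      -- 'committed', 'aborted', or 'in progress'
```

This lets an application resolve the ambiguity when the connection drops between step 2 and step 6 above, instead of blindly retrying the transaction.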


Presentation Lessons from the Online Conference Trenches

Monday, June 1, 2020

Having presented at two online conferences in the past two weeks, presenting at two this coming week, and presenting at many edb-sponsored webinars, I have learned a few things about online presentations that might be helpful for Postgres organizers, speakers, and attendees:

For Organizers: With no need for speakers and attendees to travel to online conferences, there are many more potential speakers available, and many more potential attendees than for in-person conferences. However, the technical challenges of hosting an online conference are significant because producing and consuming content can require multiple platforms that must be integrated seamlessly for a positive user experience. The content production platform, e.g. Zoom, is often different than the consumption platform, e.g. YouTube, Facebook Live. If the user experience is poor, people will leave because they are not bound to a physical location like an in-person event. Adjusting to the time zones of speakers and attendees can be complex — doing a 24-hour conference like Percona Live (organizer tips) solves many of the time zone problems, but requires moderators from many time zones. The moderator's job is much more extensive for online conferences since they control access, deal with technical problems, and manage the all-important chat channel. For online conferences, chat is the best way to promote attendee engagement. If chat is done well, user engagement during presentations can be even better than in-person conferences.

For Speakers: Just like for organizers, speakers have more technical challenges than in-person conferences — it is harder to engage the audience, more things can go wrong, and attendees can more easily leave. As a speaker, I have a checklist that I always reference before each presentation:

  • Mute phone, chat, email, and upgrade notifications
  • Use a laptop on AC power with wired Ethernet, for reliability
  • Have a count-up clock to keep track of the talk duration
  • Use a headset so you don't lean toward a microphone
  • Use a presentation remote so you don't lean forward to change slides
  • Turn on your video camera to increase audience engagement
  • Make sure the lighting is good and the background is uncluttered

For Attendees: Many online conferences are a mix of different technologies, not always seamlessly integrated, so anticipate that it will take time to get initially connected. Consider watching the conference on a large television, or from a tablet you can carry around. Use chat to engage with the speaker and other attendees. Feel free to switch to a more interesting presentation without guilt. When you get frustrated, consider how much time you are saving by not having to travel.


Administration Visualizing Collations

Friday, May 29, 2020

There is still significant confusion about character sets, encodings, and collations. This is because in the real, non-digital world, we usually treat languages, their characters, and ordering as unified, but in the digital world, they are treated separately, and their distinction can be hard to visualize.

These two posted queries illustrate collation in a very creative way. The first query, SELECT chr(i) FROM generate_series(1, 50000) i ORDER BY chr(i) COLLATE "C", outputs characters in their "C" binary order, with ascii as the first 128 characters, successive groups of languages following, and ending with pictographic languages.

The second query, SELECT chr(i) FROM generate_series(1, 50000) i ORDER BY chr(i) COLLATE "en_US.utf8", outputs the same 50,000 characters in "United States English" utf8 order. The output starts with pictographic languages, not ascii. The Latin alphabet appears, but not until line 19068. What is interesting is that there are 118 symbols grouped together that look like 'a', 'a' with diacritics, or have 'a' as part of their symbol. Then 'b' appears with a group of 38 symbols that look like or use 'b', and so on through the Latin alphabet. (If you highlight a character and paste it into a search box, Google will tell you about that Unicode symbol.)

I found it interesting that it groups letters that look like Latin letters, even if they are not from Latin alphabets and don't sound like Latin letters. Cyrillic is grouped in a section after the Latin alphabet section. These sql queries are the clearest example I have seen of collation ordering.

If I had used a different collation, instead of "United States English", there would have been a different ordering. This is why index storage is sensitive to collations, i.e., indexes with different collations store the same strings in a different order. Collation affects other things like upper/lower case processing, the ordering of query output, and certain optimizations.


Administration What is an Lsn?

Wednesday, May 27, 2020

You might have seen that there is a pg_lsn data type:

test=> \dTS pg_lsn
              List of data types
   Schema   |  Name  |       Description
------------+--------+-------------------------
 pg_catalog | pg_lsn | PostgreSQL LSN datatype

Client programs pg_receivewal and pg_recvlogical have options that take lsn values, but what is an lsn? It stands for "Log Sequence Number" — it is a 64-bit value that represents a position in the write-ahead log. It is usually displayed as two 32-bit hex values, separated by a slash. For example, pg_controldata displays lsn values:

$ pg_controldata
Latest checkpoint location:           0/15AE1B8
Latest checkpoint's REDO location:    0/15AE180

So, the next time you view or need to specify a write-ahead log location, you are using an lsn.
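As a small illustration, pg_lsn values can be queried and even subtracted, yielding a byte difference (the lsn values below are the pg_controldata values shown above):

```sql
-- current write-ahead log insert position
SELECT pg_current_wal_lsn();

-- lsn subtraction returns the number of bytes between two positions
SELECT '0/15AE1B8'::pg_lsn - '0/15AE180'::pg_lsn;  -- 56 bytes
```

This arithmetic is handy for measuring replication lag or write-ahead log volume between two points in time.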


Administration Taking Snapshots of Clusters Which Use Tablespaces

Monday, May 25, 2020

Postgres already documents the ability to backup the database cluster using file system snapshots. Unfortunately, database clusters that use tablespaces often cannot use this method if the storage system doesn't support simultaneous snapshots across file systems.

However, simultaneous snapshots across file systems might not be a hard requirement for Postgres snapshot backups. It might be possible for snapshots to be non-simultaneous as long as the write-ahead log that spans the time frame between snapshots is included in the backup, and checkpoints do not happen during that time frame.

Internally, starting Postgres from a snapshot backup replays write-ahead log records to make a single file system snapshot consistent. Potentially, it could do the same for non-simultaneous snapshots of multiple file systems. However, documenting this, giving users a reliable list of steps to perform, and making sure it always works is probably too complex to justify.


Administration Moving Tables, Indexes, and Tablespaces Between Clusters

Friday, May 22, 2020

Currently, it is impossible to move tables, indexes, and entire tablespaces from one cluster to another — that is because each table and index file is bound to the cluster's infrastructure because of:

  • Table and index definitions
  • pg_xact (commit/abort/in-progress transaction status records)
  • pg_multixact (used for multi-session row locking)

Fyi, you can easily move tablespaces to new directories as long as they remain in the same cluster, and move tables and indexes between tablespaces.
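As a sketch (the tablespace and table names are hypothetical), moving objects between tablespaces within a cluster looks like:

```sql
-- move one table's storage to another tablespace
ALTER TABLE mytable SET TABLESPACE new_space;

-- move every table currently in a tablespace; a similar
-- ALTER INDEX ALL IN TABLESPACE form exists for indexes
ALTER TABLE ALL IN TABLESPACE old_space SET TABLESPACE new_space;
```

Moving between clusters is the hard part, for the reasons listed above.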

So, how could it be made possible? Freeze can remove references to pg_xact and pg_multixact, assuming there are no active transactions during the freeze operation. Table and index definitions can be more complex, but it certainly seems possible. This requires more research.


Administration Why Pgdata Should Not Be at the Top of a Mount Point

Wednesday, May 20, 2020

This email thread is illustrative of why it is unwise to place the Postgres data directory (pgdata) at the top of a mount point. Instead, create a subdirectory under the mount point and put pgdata there. This has the advantage of avoiding possible data corruption if mounting fails, and allows more efficient use of pg_upgrade.


Conference Percona Live Online

Monday, May 18, 2020

I am planning to virtually attend and present at the Percona Live Online conference tomorrow, May 19. It starts at 10am, Eastern us time, and spans 24 hours, so it covers every time zone. I am giving my Will Postgres Live Forever? presentation at noon, Eastern us time.

Attendance is free, so you might want to check it out. I saw some interesting topics on the program. I am also curious to experience a 24-hour virtual conference, though I am unlikely to remain awake that long.


Administration Using Non-Login Roles

Monday, May 18, 2020

When we talk about database roles, most people immediately think of login roles, which allow people to log in. However, another user management feature is the ability to create non-login roles, formerly called groups. Non-login roles can also be assigned permissions, e.g., via grant, and can have login roles as members. Non-login roles can even be members of other non-login roles.

What is the value of using non-login roles? They allow a group of people to be assigned as members of a non-login role, and that role can be used to abstract permission assignment. For example, if you have shop foremen, you can configure the login roles of all foremen to be members of a non-login foreman role. As people are added and removed from that staff position, they can be added/removed from the non-login role without the need to change permissions for the foreman role.

A further advantage of non-login roles, as explained in a recent email, is that Postgres can start to suffer performance problems if more than a few dozen roles are granted permission on an object. A much simpler and more manageable solution is to add users to a non-login role and assign object permissions to that non-login role.
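The foreman pattern above can be sketched in SQL; this is a minimal sketch, with hypothetical role and table names (foreman, alice, work_orders):

```sql
-- Create a non-login role (a "group") and give it object permissions.
CREATE ROLE foreman NOLOGIN;
GRANT SELECT, UPDATE ON work_orders TO foreman;

-- Login roles get the group's permissions through membership.
CREATE ROLE alice LOGIN PASSWORD 'secret';
GRANT foreman TO alice;

-- When Alice leaves the position, only membership changes;
-- the object permissions on foreman are untouched.
REVOKE foreman FROM alice;
```

Object permissions are granted once, to the non-login role, no matter how many members come and go.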

View or Post Comments

News Draft of Postgres 13 Release Notes

Friday, May 15, 2020

I have completed the draft version of the Postgres 13 release notes, containing 181 items. The release notes will be continually updated until the final release, which is expected to be in September or October of this year. Beta testing will start in the next few weeks.

View or Post Comments

Internals Avoiding Cache Wipe, Synchronized Scans

Wednesday, May 13, 2020

Whenever you are dealing with a lot of data, it helps to cache it. Postgres does this using shared_buffers. However, one risk of caching data is that a large query that accesses a lot of data might remove frequently-accessed data from the cache; this is called cache wipe. To avoid this, Postgres limits the number of shared buffers used by data operations that are expected to access a lot of data.

Looking at C function GetAccessStrategy(), you can see there are four shared buffer access strategies. The first one, bas_normal, is used for normal scans; the rest exist to avoid cache wipe:

  • bas_bulkread is enabled for large reads and uses 256 kilobytes of shared buffers (typically 32 8kB shared buffers)
  • bas_bulkwrite is enabled for large writes and uses 16 megabytes of shared buffers
  • bas_vacuum is for vacuum operations and uses 256 kilobytes of shared buffers

For example, function initscan() uses bas_bulkread if the scan is expected to access more than one-quarter of shared buffers. Similarly, table rewrites, create table as, and copy from use bas_bulkwrite. Bas_bulkwrite is larger because we can't discard written buffers from the cache until they are written to storage, unlike unmodified buffers which can be discarded anytime.

You might wonder, with operations using so few shared buffers, what happens if another session needs to scan the same data? Doesn't performance suffer? Well, another Postgres facility that was developed independently helps with this: synchronized scans. The top of syncscan.c explains it well:

When multiple backends run a sequential scan on the same table, we try to keep them synchronized to reduce the overall I/O needed. The goal is to read each page into shared buffer cache only once, and let all backends that take part in the shared scan process the page before it falls out of the cache.

Both synchronized scans and shared buffer access strategies work automatically, so most Postgres users don't even know they exist, but they do improve Postgres performance.

View or Post Comments

Internals Why Do We Freeze?

Monday, May 11, 2020

You might have seen autovacuum running, and noticed that it sometimes performs freeze operations on transaction ids (32 bits) and multi-xacts (used for multi-session row locking). The frequency of freeze operations is controlled by autovacuum_freeze_max_age and autovacuum_multixact_freeze_max_age. You can reduce the frequency of freezing if you are sure the freeze operation will complete before transaction wraparound is reached.
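As an illustration, these are the two settings mentioned above, shown with their default values (a configuration sketch, not a recommendation; raising them reduces freeze frequency but delays wraparound protection):

```
# postgresql.conf fragment (defaults shown)
autovacuum_freeze_max_age = 200000000            # transaction ids
autovacuum_multixact_freeze_max_age = 400000000  # multi-xacts
```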

There is regular discussion about how freezing could be avoided, and this email does the best job of explaining the options. We could expand transaction ids to 64 bits, either on each row or perhaps with a page-level default, but pg_xact (commit/abort/in-progress transaction status records) still needs cleanup. This area probably needs more thought.

View or Post Comments

Internals Postgres Internals Website

Friday, May 8, 2020

I am often asked how someone can learn more about Postgres, particularly the internals. There is the Postgres developer page, which has links to many resources, and the developer's faq item about learning about the internals. One link on that page I was not aware of is Hironobu Suzuki's very detailed website about Postgres internals. It has a lot of details I have never seen written before, so I suggest those interested should check it out.

View or Post Comments

Internals Portability's Surprising Win

Wednesday, May 6, 2020

When writing software, it is necessary to decide whether to use external facilities available in command-line tools, libraries, frameworks, and the operating system, or write the facilities yourself. Why would you write them yourself? You might be worried about adding reliance on an external facility or a facility might not have sufficient flexibility or performance.

The Postgres development team has had to make similar decisions. Fortunately, we have tended to favor reliance on common operating system interfaces, tools, and libraries, e.g., OpenSSL, bison. We have avoided reliance on external facilities that are uncommon or not well maintained.

Postgres has reproduced facilities that were commonly available in the operating system, tools, or libraries only when there was a clear benefit. Reproducing such facilities for a small benefit, like a little more control or a little more performance, is rarely wise. While relying on external facilities often makes Postgres less flexible and perhaps less performant, there are long-term benefits:

  • As external facilities improve their feature-set and performance, Postgres benefits from these improvements with little effort
  • As new demands are required of these external facilities, Postgres again benefits effortlessly

Let's be specific. In the early days of Postgres, file systems were not optimized for database storage. Fortunately, Postgres never implemented file systems on raw devices. Now that modern file systems, like ext4, give good database performance, Postgres benefits from file system improvements with almost no effort. Even when ssds started being used, the only change needed in Postgres was the ability to set random_page_cost at the tablespace level to handle databases where some tablespaces are on ssds and some are on magnetic storage.

When virtualization, cloud, containers, and container orchestration (e.g., Kubernetes) became popular, Postgres had to do almost nothing to run well on these platforms. (It is true that some of the Postgres enterprise deployment tools required repackaging and re-engineering.)

Because of this philosophy, Postgres has remained relatively light-weight compared to other relational database systems, and this has benefited Postgres in environments where nimble deployments are favored.

View or Post Comments

Performance With ... Materialized and Optimizer Control

Monday, May 4, 2020

Before Postgres 12, queries specified as common table expressions (with clauses) always behaved as optimization barriers, meaning that common table expression queries were executed independently, and were not moved to later parts of the query.

Starting in Postgres 12, if a common table expression is referenced only once, and the keyword materialized is not used, it can be moved to a place later in the query where it can be better optimized; this improves optimization possibilities. However, if the movement of common table expression queries increases the from clause table count above the geqo_threshold, Postgres will decide it can't efficiently optimize such a high table count query and will use the genetic query optimizer.

So, while the new Postgres 12 behavior of inlining common table expressions usually increases the quality of optimized plans, in some cases it can decrease it by triggering the genetic query optimizer. In a way, before Postgres 12, or with the use of materialized, the query author does the optimization by creating common table expressions; in other cases the optimizer has greater control, though even the optimizer can determine the query is too complex and fall back to less-precise genetic query optimization.
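A minimal sketch of the two behaviors, using a hypothetical table:

```sql
-- Postgres 12+: a single-reference CTE is inlined, so the outer
-- predicate can be pushed down into the scan of big_table and
-- can use an index on key.
WITH w AS (SELECT * FROM big_table)
SELECT * FROM w WHERE key = 123;

-- MATERIALIZED restores the pre-12 behavior: the CTE is executed
-- independently and acts as an optimization barrier.
WITH w AS MATERIALIZED (SELECT * FROM big_table)
SELECT * FROM w WHERE key = 123;
```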

View or Post Comments

Performance Background Writes

Friday, May 1, 2020

Postgres must guarantee durability and good performance. To meet these objectives, Postgres does writes to the file system and storage in the background as much as possible. In fact, there are only two major cases where writes happen in the foreground:

  1. Write-ahead log writes happen before commits are acknowledged to the client
  2. A needed shared_buffer is dirty and must be written to storage so it can be replaced

Write-ahead log writes (#1) can be controlled using various settings. Dirty shared buffer writes (#2) that happen in the foreground are minimized if the background writer is operating efficiently. You can monitor such writes using the probes buffer-write-dirty-start and buffer-write-dirty-done.
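For illustration, here are some of the settings involved, shown with their default values (a configuration sketch, not tuning advice):

```
# postgresql.conf fragment (defaults shown)
synchronous_commit = on       # acknowledge commits only after WAL is flushed (#1)
wal_writer_delay = 200ms      # how often the WAL writer flushes WAL in the background
bgwriter_delay = 200ms        # time between background writer rounds
bgwriter_lru_maxpages = 100   # max dirty shared buffers written per round (#2)
```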

View or Post Comments

Performance Optimal Use of Ssds

Wednesday, April 29, 2020

Ssds have different performance characteristics than magnetic disks, and using them optimally isn't always clear. Ssds have several performance benefits:

  1. Very fast fsyncs
  2. Much faster random reads and writes
  3. Faster sequential reads and writes

So, if all your data is stored on ssds, you will certainly improve performance. If you are mixing ssds and magnetic disks, ideally you should use ssds in ways that give the greatest benefit. Starting with number one, putting the write-ahead log on ssds is a great way to improve fsync performance.

For number two, moving indexes to tablespaces using ssd storage can greatly improve performance because index access is usually random. When using ssds, the default value for random_page_cost should be lowered, perhaps to 1.1. This can be set at the tablespace level if there is a mix of tablespaces on ssds and magnetic disks.

For number three, it is also possible to create tablespaces on ssds for current data, and place archive data on tablespaces that use magnetic disks. By using table partitioning, a partitioned table can transparently span ssds and magnetic disk tablespaces.
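The tablespace approach above can be sketched in SQL; the tablespace names, paths, and tables are hypothetical:

```sql
-- Tablespaces on different storage, with a per-tablespace cost setting
-- reflecting the cheap random reads of ssds.
CREATE TABLESPACE ssd_ts LOCATION '/ssd/pgdata';
CREATE TABLESPACE magnetic_ts LOCATION '/hdd/pgdata';
ALTER TABLESPACE ssd_ts SET (random_page_cost = 1.1);

-- Indexes benefit most from ssd random-read speed.
CREATE INDEX orders_id_idx ON orders (id) TABLESPACE ssd_ts;

-- A partitioned table can transparently span both storage types:
-- current data on ssds, archive data on magnetic disks.
CREATE TABLE events (ts timestamptz, payload text) PARTITION BY RANGE (ts);
CREATE TABLE events_2020 PARTITION OF events
    FOR VALUES FROM ('2020-01-01') TO ('2021-01-01') TABLESPACE ssd_ts;
CREATE TABLE events_2019 PARTITION OF events
    FOR VALUES FROM ('2019-01-01') TO ('2020-01-01') TABLESPACE magnetic_ts;
```

Queries against events route automatically to the right partition, so the storage split is invisible to applications.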

View or Post Comments

Performance Does Postgres Support Compression?

Monday, April 27, 2020

I am often asked if Postgres supports compression, and my answer is always a complicated dance around what "compression" level they are asking about. There are six possible levels of database compression:

  1. single field
  2. across rows in a single page
  3. across rows in a single column
  4. across all columns and rows in a table
  5. across tables in a database
  6. across databases

Number one (single field) is currently done by toast. Number two (across rows in a single page) is a practical optimization where a compression routine blindly looks for repeating values in a page without understanding its structure. The difficulty of implementing this arises when a page is stored using its compressed length (rather than the uncompressed 8k), the page contents change, and the new contents compress less well than the previous contents. In this case, the compressed page contents would be larger and it would be very complex to fit the page into the existing space in the file. A different file layout is really required for this, so pages can be placed anywhere in the file, without affecting index access. A team is working on adding this feature using Postgres's table access method interface.
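For number one, toast compression happens automatically for wide field values, and the per-column storage strategy can be adjusted; a small sketch with a hypothetical table:

```sql
-- Wide values in body are automatically compressed and/or moved
-- out of line by toast.
CREATE TABLE docs (id int, body text);

-- Per-column storage strategies control how toast treats the value:
-- EXTENDED (the default) allows compression and out-of-line storage;
-- EXTERNAL allows out-of-line storage but disables compression.
ALTER TABLE docs ALTER COLUMN body SET STORAGE EXTERNAL;
```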

Number three (across rows in a single column) is the classic definition of a columnar database. A team is also working on that. Just like number two, this requires using a different storage layout than Postgres's default, and the table access method interface makes this possible.

Number four can be done using file system compression. Numbers five and six would be nice, but it is unclear how this could be done efficiently without adding unacceptable complexity to the database.

View or Post Comments

Performance Multi-Host Technologies

Friday, April 24, 2020

There are so many multi-host technologies and it is hard to remember the benefits of each one, so I decided to create a list:

High availability: Streaming replication is the simplest way to have multiple copies of your database, ready for fail over

Read scaling: Pgpool allows replicas (slide 17) to handle a database's read-only workload

Write scaling: Sharding allows for write scaling

Partial replication: Logical replication allows partial replication

Reduce latency: Multi-master replication allows servers to be located close to users, reducing transmission latency

Hopefully this is helpful to people.

View or Post Comments

Performance Performance Goalposts

Wednesday, April 22, 2020

In talking to EnterpriseDB customers, I am often asked about the performance limits of Postgres: How many connections can it handle? How many tps?

Well, those are good questions, but it is hard to give accurate answers since so much depends on the hardware and workload. Eventually, testing of actual workloads on intended hardware has to be done, but not giving ball-park answers is unhelpful. What answer can I give? Well, I came up with this chart:

  1. Connections < 250: direct connect
  2. Queries < 250: pgbouncer
  3. Write queries < 250: Pgpool with read-only replicas (slide 17)
  4. Write queries >= 250: sharding

Earlier items use simpler architectures than later items, and are therefore preferred.

For under 250 simultaneous connections (#1), it isn't generally necessary to use any complex architecture, though using a pooler is recommended for workloads with many short-lived sessions to reduce connection startup time. Even when under this limit, performance can be limited by the number of cpus. Oltp databases typically cannot process more than 5 x cpu cores of simultaneous queries without time-slicing among cpus. Olap workloads are closer to 2 x cpu cores, or less if parallelism is used.

For over 250 simultaneous connections but under 250 simultaneous queries (#2), a connection pooler can be beneficial because it reduces the overhead of managing many open connections. For over 250 simultaneous queries but under 250 simultaneous write queries (#3), a combination of Pgpool with read queries routed to streaming replicas can be very efficient. For over 250 simultaneous write queries (#4), a sharding solution should be considered.

These numbers can vary greatly based on workload and hardware, but I think they are good starting points to consider.

View or Post Comments

Performance Fast Enough?

Monday, April 20, 2020

Everyone wants their software to perform as fast as possible. Some people think that unless the program ends before it begins, it's too slow.

However, realistically, making something as fast as possible is not a universal good, meaning that increasing performance beyond a certain point can cause problems that far exceed the value of the improved performance. Let's give some examples:

  • Most people use high-level languages that are compiled into cpu instructions or interpreted while being run. In an ideal world, every program would be written in assembly language. (Postgres does use some assembly language code for locking.) However, writing something like a database in assembly language, though it might give slightly better performance if developers knew the behavior of every cpu, would be a huge challenge for even the most expert developers, and code maintenance and feature additions might be impossible. In general, there just isn't enough benefit to using assembly language for anything but the most discrete, performance-critical functions.
  • Java isn't often chosen for its performance, but rather its ability to allow development teams to produce complex software efficiently. Lighter-weight languages might give better performance, but they don't offer the same development efficiency as Java.
  • Sql is a heavy-weight way to request data, but it is very efficient for developers since they can express their requests in a declarative way. You could probably write a custom data storage program to run faster than sql, but the effort involved to create and maintain it would be nearly insurmountable.
  • You can even lay out data in sql in more efficient ways, and sometimes it is worth it, but data maintainability, access flexibility, and storage efficiency can suffer.

The bottom line is that performance is rarely a universal good — it has to be balanced against development time, solution flexibility, and maintainability. Sometimes people suggest that Postgres should offer some "super fancy" optimization, and sometimes we can implement it, but we always have to balance development time, solution flexibility, and maintainability with improved performance.

View or Post Comments

Community No Travel

Friday, April 17, 2020

With the Coronavirus outbreak, almost all Postgres events through June have been either cancelled, rescheduled, or moved online. This has given me time to consider my past travel. I have been blessed to visit so many places, not as a tourist, but rather a guest. I see countries more as a native than as a tourist, and I have many event organizers and hosts to thank for this.

It is hard to know when on-site events will resume, but I can remember what it was like to travel roughly 90 days a year. Going to social outings at home often felt like riding a train through my home town, waving from behind a train window to my friends on the platform. I would sit at home and wonder how long I would be there until I had to leave again. Now, sitting at home, the images of our family travels appearing on our kitchen slideshow show places that seem farther away than ever, and I wonder if I will ever see these places again.

I am sure many others have similar feelings, and I have hope that, someday, we will all return to the road to spend time together again.

View or Post Comments

Thoughts Database Interoperability at Risk

Monday, March 16, 2020

This article parallels Oracle's copying of the sql syntax from ibm in the late 1970's with Google's copying of the Java api. It also explains the possible impact of the current case soon to be decided by the US Supreme Court.

One thing it does not fully cover is the impact on sql-level interoperability between databases. If Oracle can claim the Java api as copyrightable, the sql language could be considered copyrightable, allowing ibm to sue all relational database vendors and users for copyright infringement. It might also allow database vendors to sue competitors when their sql-level features are copied, requiring either huge payments or the removal of interoperability syntax.

The Postgres copyright is open, so any database vendor copying Postgres syntax is fine, but Postgres copying the syntax of other databases could be problematic. Relational database interoperability has been of huge benefit to data management, and this single case could call that into question.

View or Post Comments

Presentation Databases, Containers, and the Cloud

Saturday, March 7, 2020

A few months ago, I wrote a high-level presentation about the deployment benefits of using containers and cloud infrastructure for databases. I am now ready to share this presentation. I have also added QR codes to the first and last slides of all my presentations, linking to my website.

View or Post Comments