Bruce Momjian

Postgres Blog


This blog is about my work on the Postgres open source database, and is published on Planet PostgreSQL. PgLife allows monitoring of all Postgres community activity.




Pg upgrade The Future of Major Version Upgrades

Monday, December 31, 2018

Pg_upgrade has matured to become a popular method to perform major version upgrades. This email thread considers what better upgrades would look like. Options include:

  • Migrating analyze statistics
  • Logical replication (requires double the disk space, managing schema changes)
  • Multi-master replication (similar to previous, but also requires conflict resolution)
  • Allowing the new server to read the old server's data files

Upgrade improvements have to be significantly better in at least one of these measurements (faster, easier, more reliable, less required storage) and not significantly worse in any of them. For example, a solution that is 50% faster but more complex or less reliable will have a hard time gaining acceptance. Of course, if a solution is one-hundred-times faster, it can be worse in some of those areas.

View or Post Comments

Pg upgrade Zero Downtime Pg_upgrade

Friday, December 28, 2018

Pg_upgrade can upgrade a multi-terabyte system in 3–5 minutes using link mode. There are some things that can be done to make it faster — particularly, migration of analyze statistics from the old to new clusters.

However, even if pg_upgrade could perform an upgrade in zero time, would that be a zero-downtime upgrade? I am not sure, since, as my previous blog post explains, the work of switching clients from the old cluster to the new one is itself a form of downtime: running sessions are either terminated and restarted, or new connections must wait for old connections to complete. Multi-master replication seems to be unique in offering a way for new sessions to start on the new cluster while old sessions finish on the old cluster, but unfortunately it adds complexity.

View or Post Comments

Performance Threaded Postgres

Wednesday, December 26, 2018

This amazing work by Konstantin Knizhnik produced some experimental numbers on the benefits of moving Postgres from process forking to threading. (On Windows, the much slower CreateProcess is used.)

His proof-of-concept showed that you have to get near 100 simultaneous queries before you start to see benefits. A few conclusions from the thread are that threading Postgres would open up opportunities for much simpler optimizations, particularly in parallel query and perhaps a built-in connection pooler. The downside is that some server-side languages like PL/Perl and PL/Python have interpreters that cannot be loaded multiple times into the same executable, making them of limited use in a threaded database server. Languages like PL/Java, which are designed to run multiple threads safely, would benefit from a threaded Postgres.

View or Post Comments

Multi-host Why Use Multi-Master?

Monday, December 24, 2018

Multi-master replication sounds great when you first hear about it — identical data is stored on more than one server, and you can query any server. What's not to like? Well, there is actually quite a bit not to like, but it isn't obvious. The crux of the problem is that databases are modified, not just read. If this were a web server farm serving static data, using multiple web servers to handle the load would be easy. Databases, however, because they are frequently modified, make multi-master configurations problematic.

For example, how do you want to handle a write to one of the database servers in a multi-master setup? Do you lock rows on the other servers before performing the write to make sure they don't make similar conflicting writes (synchronous), or do you tell them later and programmatically or administratively deal with write conflicts (asynchronous)? Locking remote rows before local writes can lead to terrible performance, and telling them later means your data is inconsistent and conflicts need to be resolved.

In practice, few people use synchronous multi-master setups — the slowdown is too dramatic, and the benefits of being able to write to multiple servers are minimal. Remember, all the data still must be written to all the servers, so there is no write-scaling benefit. (Read load balancing can be accomplished with streaming replication and Pgpool-II.)

The only use case where I know multi-master can be a win is geographically distributed servers, where there is some locality of access, e.g., Asian customers are most likely to write first to Asian servers. The servers can be located close to large user communities, reducing latency, and conflicting writes are minimized.

Some users like multi-master setups because they make upgrades and load balancing seem easier, but the downsides are significant.

View or Post Comments

Performance Compiled PL/pgSQL?

Friday, December 21, 2018

PL/pgSQL is closely aligned with sql. When first executed in a session, the PL/pgSQL source is compiled to an abstract syntax tree, which is then reused every time the function is executed in that session. Other languages have different compile behavior:

  • PL/Perl compiles to bytecode on first call, similar to PL/pgSQL
  • PL/Python loads pyc bytecode files
  • Spi loads machine-language instruction files (object files)

This email thread covers some of the details. Keep in mind that most server-side functions spend the majority of their time running sql queries, so the method of compilation is often insignificant.

View or Post Comments

Performance Wal Directory Location

Wednesday, December 19, 2018

You might have noticed that initdb has a -X/--waldir option to relocate the write-ahead log (pg_wal) directory. This allows the wal I/O traffic to be on a device different from the default data directory. (This is similar to the use of tablespaces.) If you wish to move the pg_wal (or older pg_xlog) directory to a different device after running initdb, you can simply shut down the database server, move pg_wal to a new location, and create a symbolic link from the data directory to the new pg_wal location.
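Here is a minimal sketch of that procedure, assuming the data directory is /u/pg/data and the new device is mounted at /ssd (both paths are only examples):

$ # stop the server before touching pg_wal
$ pg_ctl stop
 
$ # move the directory to the new device and leave a symbolic link behind
$ mv /u/pg/data/pg_wal /ssd/pg_wal
$ ln -s /ssd/pg_wal /u/pg/data/pg_wal
 
$ pg_ctl start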

View or Post Comments

Performance Windows and Shared Buffers

Monday, December 17, 2018

Perhaps one overlooked change in Postgres 10 was the removal of the recommendation of smaller shared_buffers on Windows. This email thread discussed its removal. So, if you have been minimizing the size of shared_buffers on Windows, you can stop now.

View or Post Comments

Optimizer Micro-Optimizations

Friday, December 14, 2018

This email thread discusses the tradeoffs of adding optimization improvements that affect only a small percentage of queries, i.e., micro-optimizations. The big sticking point is that we don't know if the time required to check for these optimizations is worth it. For short-running queries, it probably isn't, but for long-running queries, it probably is. The problem is that we don't know the final cost of the query until the end of the optimization stage — this makes it impossible to decide during optimization whether checks for micro-optimizations are worthwhile.

During the email discussion, optimizing X = X clauses was considered to be a win for all queries, so it was applied. An optimization to convert or queries to use union is still being considered. Figuring out a way to estimate the cost before optimization starts was recently discussed.

View or Post Comments

Optimizer Optimizer Hints

Wednesday, December 12, 2018

You might know that Postgres has optimizer hints listed as a feature we do not want to implement. I have covered this topic in the past.

This presentation from Tatsuro Yamada covers the trade-offs of optimizer hints in more detail than I have ever seen before. Starting on slide 29, Yamada-san explains the number-one reason to use optimizer hints — as a short-term fix for inefficient query plans. He uses pg_hint_plan (produced by ntt) to fix inefficient plans in his use case. He then goes on to consider possible non-hint solutions to inefficient plans, such as recording data distributions found during execution for use in the optimization of later queries.

Most interesting to me was his reproduction of the optimizer hints discussion on the Postgres wiki, including his analysis of how pg_hint_plan fits those criteria. There are certainly environments where optimizer hints are helpful, and it seems Yamada-san has found one. The reason the community does not plan to support hints is that optimizer hints are considered likely to cause more problems for users than they solve. While Postgres has some crude manual optimizer controls, it would certainly be good if Postgres could come up with additional solutions that further minimize the occurrence of significantly inefficient plans.

View or Post Comments

Memory Allocating work_mem

Monday, December 10, 2018

I have already discussed the complexities of allocating shared_buffers, work_mem, and the remainder as kernel cache. However, even if these could be optimally configured, work_mem (and its related setting maintenance_work_mem) is configured per query, and can potentially be allocated multiple times if multiple query nodes require it. So, even if you know the optimal amount of work_mem to allocate for the entire cluster, you still have to figure out how to allocate it among sessions.
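As a simple illustration of the per-session nature of the setting (the values below are only examples), a session can raise its own limit before running a large report query; note that a query containing, say, both a sort and a hash join can use up to this amount more than once:

SHOW work_mem;
 work_mem
----------
 4MB
 
-- raise the limit for this session only; each sort or hash node
-- in a query may use up to this much memory
SET work_mem = '256MB';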

This detailed email thread explores some possible ways to simplify this problem, but comes up with few answers. As stated, it is a hard problem, and doing it right is going to take a lot of work. Even DB2's solution was criticized, though it was our initial approach to solving the problem. With additional parallelism, configuring work_mem is becoming even more complex for users, to say nothing of the complexity of allocating parallel workers itself.

This is eventually going to have to be addressed, and doing it in pieces is not going to be fruitful. It is going to need an entirely new subsystem, perhaps with dynamic characteristics unseen in any other Postgres modules.

View or Post Comments

Memory The Memory Resource Triad

Friday, December 7, 2018

There are three resources that affect query performance: cpu, memory, and storage. Allocating cpu and storage for optimal query performance is straightforward, or at least linear. For cpu, for each query you must decide the optimal number of cpus to allocate for parallel query execution. Of course, only certain queries can benefit from parallelism, and the optimal number of cpus changes based on the other queries being executed. So, it isn't simple, but it is linear.

For storage, it is more of a binary decision — should the table or index be stored on fast media (ssds) or slow media (magnetic disks), the main difference being random access speed? (Some nas servers have even more finely-grained tiered storage.) It takes analysis to decide the optimal storage type for each table or index, but the Postgres statistics views can help to identify which objects would benefit.
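One hedged example of using the statistics views this way: pg_statio_user_tables reports heap block reads that were not satisfied by shared buffers (they may still have come from the kernel cache), which helps identify tables that might benefit from faster storage:

-- tables whose heap blocks are most often read from outside shared buffers
SELECT relname, heap_blks_read, heap_blks_hit
FROM pg_statio_user_tables
ORDER BY heap_blks_read DESC
LIMIT 10;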

Unfortunately, for memory, resource allocation is more complex. Rather than being a linear or binary problem, it is a multi-dimensional problem — let me explain. As stated above, for cpus you have to decide how many cpus to use, and for storage, what type. For memory, you have to decide how much memory to allocate to shared_buffers at database server start, and then decide how much memory to allocate to each query for sorting and hashing via work_mem. Whatever memory is not allocated gets used as kernel cache, which Postgres relies on for consistent performance (since all reads and writes go through that cache). So, to optimize memory allocation, you have to choose the best sizes for:

  1. shared buffers (fixed size at server start)
  2. work_mem (potentially optimized per query based on workload)
  3. Kernel buffers (the remainder)

You have to choose #1, which you can't change once the server starts; #2, which you can change, but optimizing it requires visibility into the memory allocation of other sessions and the amount of unused memory; and #3, which is simply whatever has not been allocated by the previous two items.

As you can see, this is an almost impossible problem, which is why administrators normally choose reasonable values, monitor for resource exhaustion, and adjust specific problem queries. The recommended default for #1 (shared buffers) is 25% of ram. You can go larger, particularly if a larger value allows you to keep more of your working set in memory, and if you don't need a large kernel cache to absorb many writes. For #2, you can monitor log_temp_files and increase work_mem when you see queries using temporary storage for sorts or hashes. One creative idea is to use alter system to automatically increase or decrease work_mem based on the amount of free kernel buffers. (I suggested a similar solution for autovacuum.) Finally, you should probably be monitoring free kernel buffers anyway to make sure #1 and #2 have not allocated too much memory.
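A sketch of the monitoring side, with illustrative values only (the 64MB setting is not a recommendation):

-- log every query that spills a sort or hash to temporary files
ALTER SYSTEM SET log_temp_files = 0;
SELECT pg_reload_conf();
 
-- per-database temporary file activity since the last statistics reset
SELECT datname, temp_files, temp_bytes FROM pg_stat_database;
 
-- raise the default once monitoring shows memory is available
ALTER SYSTEM SET work_mem = '64MB';
SELECT pg_reload_conf();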

The good news is that Postgres is less sensitive to non-optimal memory allocation than many other relational systems. This is because less-than-optimal memory allocation means there are more kernel buffers, and more kernel buffers help reduce the impact of less-than-optimal shared buffers or work_mem. In contrast, more-than-optimal memory allocation can cause sessions to receive memory allocation errors, the Linux oom killer to kill processes, or the kernel to be starved of kernel buffers to absorb writes.

In summary, Postgres is much more forgiving of under-allocation than over-allocation. Sometimes it is more helpful to understand how a system behaves when imperfectly configured than to understand how to perfectly configure it, especially when perfect configuration is impossible.

View or Post Comments

Internals The Meaning of Wal

Wednesday, December 5, 2018

The write-ahead log (wal) is very important for Postgres reliability. However, how it works is often unclear.

The "write-ahead" part of the write-ahead log means that all database changes must be written to pg_wal files before commit. However, shared buffers dirtied by a transaction can be written (and fsync'ed) before or after the transaction commits.

Huh? Postgres allows dirty buffers to be written to storage before the transaction commits? Yes. When dirty buffers are written to storage, each modified row is marked with the currently-executing transaction id that modified it. Any session viewing those rows knows to ignore those changes until the transaction commits. If it did not, a long transaction could dirty all the available shared buffers and prevent future database changes.

Also, Postgres allows dirty buffers to be written to storage after the transaction commits? Yes. If Postgres crashes after the transaction commits but before the modified shared buffer pages are written during checkpoint, the wal is replayed during recovery, which restores any changes made by the transaction.

It is really writes to the pg_xact files that record commits; these files are consulted by all running sessions to determine which transaction ids are committed, aborted, or in progress. Allowing shared buffer writes to happen independently of transaction status recording allows for flexible handling of workloads.

View or Post Comments

Data modeling Views vs. Materialized Views

Monday, December 3, 2018

Views and materialized views are closely related. Views effectively run the view query on every access, while materialized views store the query output in a table and reuse the results on every materialized view reference, until the materialized view is refreshed. This cache effect becomes even more significant when the underlying query or tables are slow, such as analytics queries and foreign data wrapper tables. You can think of materialized views as cached views.
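A minimal sketch of the difference, using a hypothetical orders table:

-- the view's query is re-run on every reference
CREATE VIEW recent_orders AS
    SELECT * FROM orders WHERE order_date > CURRENT_DATE - 7;
 
-- the materialized view stores the query output and reuses it
CREATE MATERIALIZED VIEW recent_orders_mv AS
    SELECT * FROM orders WHERE order_date > CURRENT_DATE - 7;
 
-- re-run the underlying query and replace the stored results
REFRESH MATERIALIZED VIEW recent_orders_mv;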

View or Post Comments

Data modeling Extensibility

Friday, November 30, 2018

Extensibility was built into Postgres from its creation. In the early years, extensibility was often overlooked and made Postgres server programming harder. However, in the last 15 years, extensibility allowed Postgres to adapt to modern workloads at an amazing pace. The non-relational data storage options mentioned in this presentation would not have been possible without Postgres's extensibility.

View or Post Comments

Data modeling Data Storage Options

Wednesday, November 28, 2018

Since I have spent three decades working with relational databases, you might think I believe all storage requires relational storage. However, I have used enough non-relational data stores to know that each type of storage has its own benefits and costs.

It is often complicated to know which data store to use for your data. Let's look at the different storage levels, from simplest to most complex:

  1. Flat files: Flat files are exactly what they sound like — an unstructured stream of bytes stored in a file. There is no structure defined in the file itself — structure must be implemented in the application that reads the file. This is obviously the simplest way to store data, and works well for small data volumes when only a single user and a few well-coordinated applications need access. File locking is required to serialize multi-user access. Changes typically require a complete file rewrite.
  2. Word processing documents: This is similar to flat files, but defines structure in the data file, e.g., highlighting, sections. The same flat file limitations apply.
  3. Spreadsheet: This is similar to word processing documents, but adds more capabilities, including computations and the definition of relationships between data elements. Data is more atomized in this format than in the previous one.
  4. NoSQL stores: This removes many of the limitations from previous data stores. Multi-user access is supported, including locking, and modification of single data elements does not require rewriting all data.
  5. Relational databases: This is the most complex data storage option. Rigid structure is enforced internally, though unstructured options exist. Data access occurs using a declarative language that is dynamically optimized based on that structure. Multi-user and multi-application access is efficient.

You might think that since relational databases have the most features, everything should use them. However, with features come complexity and rigidity. Therefore, all levels are valid for some use cases:

  • Flat files are ideal for read-only, single-user, and single-application access, e.g., configuration files. Modifications are easy using a text editor.
  • How many organizations do you know that use word processing documents or spreadsheets to manage their operations? These are often used at the departmental level when only a few people must access or modify the data.
  • NoSQL became popular among application developers who didn't want the rigidity of relational systems, and wanted simpler multi-node deployment.
  • Relational databases still win for multi-user, multi-application workloads.

So, don't snicker the next time someone uses a spreadsheet to handle their workload — it might be the best tool for the job.

View or Post Comments

Configuration First Wins, Last Wins, Huh?

Monday, November 26, 2018

Someone recently pointed out an odd behavior in Postgres's configuration files. Specifically, they mentioned that the last setting for a variable in postgresql.conf is the one that is honored, while the first matching connection line in pg_hba.conf is honored. They are both configuration files in the cluster's data directory, but they behave differently. It is clear why they behave differently — the order of lines in pg_hba.conf is significant, so more specific lines can be placed before more general lines (see the use of reject lines). Still, it can be confusing, so I wanted to point it out.
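A hypothetical illustration of both behaviors (the values, roles, and addresses are only examples):

# postgresql.conf: the last setting wins
work_mem = '4MB'
work_mem = '8MB'                        # this is the value that takes effect
 
# pg_hba.conf: the first matching line wins
host  all  baduser  10.0.0.0/8  reject          # specific line placed first
host  all  all      10.0.0.0/8  scram-sha-256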

View or Post Comments

Conference Submitting Talks to Conferences

Monday, November 5, 2018

Having attended many conferences, I have a few suggestions on how to submit successful conference talks. First, determine the type of conference. Then, try to submit talks that match the conference type; possible topics include:

  • New Postgres features
  • User case studies
  • Internals
  • New workloads
  • Performance
  • Application development

Of course, only some of these topics match specific types of conferences.

Second, submit multiple talks. It is very possible that someone better known than you, or someone with a better abstract, will also submit to the conference. By submitting more than one topic, you increase your chances of submitting something unique and interesting.

Third, try to stand out. Maybe you have an interesting story about your topic, or an image that captures what you are going to talk about. Try to get that in front of the people who will be deciding if your talk should be accepted. Have you presented successfully before? Find a way to mention that too.

And finally, try to submit something that will attract attendees or will leave a lasting impression on your audience. You have to think creatively to accomplish this.

Oh, and once you are accepted, the hard part begins — you have to write your slides. However, don't think that some amazing idea will come to you 30 minutes before you present your talk. Accept the unlikelihood of that and create your slides weeks before you present them. Then, if you think of a way to improve your talk in the next few weeks, you have time to improve it. Creating slides solidifies your ideas, and improvements will often come to you as you think about how you will present your talk. A word of warning — don't change your slides the day of your talk; you are more likely to confuse yourself with unfamiliar slides than to improve the content.

Update: This blog covers some of the benefits of giving presentations. 2022-01-13

View or Post Comments

Community Users vs. Developers

Thursday, November 1, 2018

Some open source projects have a distinction between the developers of the open source software and its users. Since Postgres was originally developed in a university, and none of the university developers continued when Internet-based development started in 1996, all our active developers see themselves as stewards of code developed before we arrived. This causes a flatter organizational structure and helps to forge closer user/developer ties.

View or Post Comments

Community Long-Haul Behavior

Friday, October 19, 2018

People new to the Postgres community are often confused by the deliberateness of the community on some issues, and the willingness to break backward compatibility in other areas. The source code was created 32 years ago, and many of us have been involved with Postgres for 10-20 years. With that length of involvement, we are acutely aware of the long-term consequences of our decisions. Many proprietary and open source developers don't think years in advance, but with Postgres, it is our normal behavior.

This leads to some unusual Postgres behaviors:

  • Weekly source code comment improvements
  • Yearly source code reformatting
  • Exhaustive discussions about even minor changes, particularly user-visible ones
  • Api redesigns

This focus on detail often strikes new users as unusual, but it makes sense when a multi-year view is considered.

View or Post Comments

Community "Get Off My Lawn"

Wednesday, October 17, 2018

As you might know, Postgres is old-school in its communication methods, relying on email for the bulk of its development discussion, bug reports, and general assistance. That's the way it was done in 1996, and we are still doing it today. Of course, some projects have moved on to github or Slack.

As a project, we try to take advantage of new technologies while retaining technologies that are still optimal, and email is arguably one of them. Email hasn't changed much since 1996, but email clients have. Previously all email was composed using a text-editor-like interface, which allowed for complex formatting and editing. New email tools, like Gmail, provide a more simplified interface, especially on mobile devices. This simplified interface is great for composing emails while commuting on a train, but less than ideal for communicating complex topics to thousands of people.

This email captures the requirements of communicating technical points to thousands of people, and the need for fine-grained composition. Basically, when you are emailing thousands of people, taking time to make the communication clear is worthwhile. However, this email explains the reality that many email tools are more tailored for effortless communication on devices with limited features.

I am not sure what the community can do to improve the situation. Crafting complex technical communication is always going to be important for this project, but the tooling available to make this happen is getting rarer.

Update: This blog post explains many of the mechanics of email lists vs more modern methods. 2018-10-17

View or Post Comments

Community Three-Year Cycle

Monday, October 15, 2018

In the early years of Postgres's open source development, we focused on features that could be accomplished in a few weekends. Within a few years, we had completed many of those, and were challenged trying to accomplish big projects with a mostly volunteer workforce. Soon, however, large companies started to sponsor developers' time, and we launched into big feature development again.

Currently, the calendar is our only real challenge. We have major releases every year, but many features take multiple years to fully implement. We have seen this multi-year process with:

  • Windows port
  • Point-in-time recovery
  • Json
  • Streaming replication

and are in the process of completing even more:

So, if it seems like features appear in a limited form in one release, and the next two-to-three releases finally complete the feature, it is not your imagination.

View or Post Comments

Client Installing PL/v8

Friday, October 12, 2018

PL/v8 is the JavaScript server-side language for Postgres. It has been available for several years, but a change in the way Google packages the v8 engine has made it burdensome for packagers to build PL/v8 packages.

Therefore, few package managers still distribute PL/v8. This is disappointing since it undercuts some of our NoSQL story. We previously supported json storage and a json-specific server-side language. This second option is effectively no longer available, and those users who are using PL/v8 will need to find alternatives. This highlights the risk of software significantly relying on other software that it does not control and cannot maintain itself.

View or Post Comments

Client Multi-Host Libpq

Wednesday, October 10, 2018

Libpq is used by many client interface languages to communicate with the Postgres server. One new feature in Postgres 10 is the ability to specify multiple servers for connection attempts. Specifically, it allows the connection string to contain multiple sets of host, hostaddr, and port values. These are tried until one connects.
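For example (the host names are hypothetical), a client can list several hosts, and with target_session_attrs=read-write it will skip servers that only accept read-only connections:

$ psql 'host=db1.mydomain.com,db2.mydomain.com port=5432,5432 dbname=test'
 
$ # URI syntax, requiring a server that accepts writes
$ psql 'postgresql://db1.mydomain.com:5432,db2.mydomain.com:5432/test?target_session_attrs=read-write'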

NoSQL solutions have used this method of multi-host connectivity for a while, so it is good Postgres can now do it too. It doesn't have all the features of a separate connection pooler, but it doesn't have the administrative or performance overhead of one either, so it certainly fits a need for some environments.

View or Post Comments

Multi-host Trigger Me Writable

Tuesday, October 9, 2018

Postgres support for hot standby servers allows read-only queries to be run on standby servers, but how are read-only sessions handled when promoting a standby server to primary? After a standby is promoted to primary, new connections are read/write, but existing connections also change to read/write:

SELECT pg_is_in_recovery();
 pg_is_in_recovery
-------------------
 t
 
SHOW transaction_read_only;
 transaction_read_only
-----------------------
 on
 
\! touch /u/pg/data2/primary.trigger
 
-- wait five seconds for the trigger file to be detected
CREATE TABLE test (x INTEGER);
 
SELECT pg_is_in_recovery();
 pg_is_in_recovery
-------------------
 f
 
SHOW transaction_read_only;
 transaction_read_only
-----------------------
 off

Notice the session did not need to disconnect and reconnect — it was promoted to read/write automatically.

View or Post Comments

Administration Moving Tablespaces

Wednesday, October 3, 2018

Tablespaces are designed to allow Postgres clusters to be spread across multiple storage devices. Create tablespace creates a symbolic link in the pg_tblspc directory in the cluster's data directory pointing to the newly-created tablespace directory.

Unfortunately, though there is a command to move tables and indexes between tablespaces, there is no command to move tablespaces to different directories. However, since Postgres 9.2, the process of moving tablespaces is quite simple:

  1. Record the oid of the tablespace you want to move
  2. Shut down the Postgres cluster
  3. Move the tablespace directory, either within the same file system or to a different file system
  4. Update the oid symbolic link that represents the moved tablespace to the new tablespace directory location
  5. Restart the server

Here's an example of moving a tablespace:

$ mkdir /u/postgres/test_tblspc
 
$ psql test
 
CREATE TABLESPACE test_tblspc LOCATION '/u/postgres/test_tblspc';
 
CREATE TABLE test_table (x int) TABLESPACE test_tblspc;
 
INSERT INTO test_table VALUES (1);
 
SELECT oid, * FROM pg_tablespace;
  oid  |   spcname   | spcowner | spcacl | spcoptions
-------+-------------+----------+--------+------------
  1663 | pg_default  |       10 |        |
  1664 | pg_global   |       10 |        |
 16385 | test_tblspc |       10 |        |
 
SELECT pg_tablespace_location(16385);
 pg_tablespace_location
-------------------------
 /u/postgres/test_tblspc
 
\q
 
$ pg_ctl stop
 
$ mv /u/postgres/test_tblspc /u/postgres/test2_tblspc
 
$ cd $PGDATA/pg_tblspc/
 
$ ls -l
lrwxrwxrwx 1 postgres postgres 23 Sep  5 22:20 16385 -> /u/postgres/test_tblspc
 
$ ln -fs /u/postgres/test2_tblspc 16385
 
$ ls -l
lrwxrwxrwx 1 root root 24 Sep  5 22:25 16385 -> /u/postgres/test2_tblspc
 
$ pg_ctl start
 
$ psql test
 
SELECT * FROM test_table;
 x
---
 1
 
SELECT pg_tablespace_location(16385);
  pg_tablespace_location
--------------------------
 /u/postgres/test2_tblspc

View or Post Comments

Multi-host Switchover/Failover and Session Migration

Monday, October 1, 2018

I have already covered switchover and failover. In the case of failover, the old primary is offline so there are no options for migrating clients from the old primary to the new primary. If there is a switchover, there are options for client migration. Assuming you are using streaming replication, only one server can accept writes. Therefore, for client migration, you can either:

  • Force all clients on the old primary to exit, then promote the new primary
  • Wait for all clients on the old primary to exit, then promote the new primary

If you choose force, it will disrupt applications; they must be designed to handle disconnections and potentially reconfigure their sessions, e.g., session variables, cursors, open transactions. If you choose wait, how do you handle clients that want to connect while you are waiting for the existing clients to disconnect? Minor upgrades, which require a restart of the database server, have similar issues.

The only clean solution is to use multi-master replication so new clients can connect to the new primary while waiting for old-primary clients to finish and disconnect. However, it is operationally expensive to support multi-master just to minimize switchover disruption.

View or Post Comments

Presentation Postgres 11 Features Presentation

Friday, September 14, 2018

Now that I have given a presentation about Postgres 11 features in New York City, I have made my slides available online.

View or Post Comments

Backup and replication Multi-Host Pg_dump

Wednesday, September 12, 2018

You have probably looked at logical dumps as supported by pg_dump and restores by pg_restore or, more simply, psql. What you might not have realized are the many options for dumping and restoring when multiple computers are involved.

The simplest case is dumping and restoring on the same server:

$ pg_dump -h localhost -Fc test > /home/postgres/dump.sql
$ pg_restore -h localhost -d test < /home/postgres/dump.sql

or with a plain text dump:

$ pg_dump -h localhost -f /home/postgres/dump.sql test
$ psql -h localhost -f /home/postgres/dump.sql test

Where this gets interesting is with multiple hosts. You can:

$ # dump a remote database to your local machine
$ pg_dump -h remotedb.mydomain.com -f /home/postgres/dump.sql test
 
$ # dump a local database and write to a remote machine
$ pg_dump test | ssh postgres@remotedb.mydomain.com 'cat > dump.sql'
 
$ # dump a remote database and write to the same remote machine
$ pg_dump -h remotedb.mydomain.com test | ssh postgres@remotedb.mydomain.com 'cat > dump.sql'
 
$ # or a different remote machine
$ pg_dump -h remotedb1.mydomain.com test | ssh postgres@remotedb2.mydomain.com 'cat > dump.sql'

You also have similar restore options. I will use psql below but pg_restore works the same:

$ # dump a remote database and restore to your local machine
$ pg_dump -h remotedb.mydomain.com test1 | psql test2
 
$ # dump a local database and restore to a remote machine
$ pg_dump test | ssh postgres@remotedb.mydomain.com 'psql test'
 
$ # dump a remote database and restore to the same remote machine
$ pg_dump -h remotedb.mydomain.com test1 | ssh postgres@remotedb.mydomain.com 'psql test2'
 
$ # or a different remote machine
$ pg_dump -h remotedb1.mydomain.com test | ssh postgres@remotedb2.mydomain.com 'psql test'

As you can see, there is a lot of flexibility possible.

View or Post Comments

Monitoring Monitoring Complexity

Monday, September 10, 2018

I have always had trouble understanding the many monitoring options available in Postgres. I was finally able to collect all popular monitoring tools into a single chart (slide 96). It shows the various levels of monitoring: OS, process, query, parser, planner, executor. It also separates instant-in-time reporting and across-time alerting/aggregation options.

View or Post Comments

Security Signing Rows

Friday, September 7, 2018

With the rsa keys created in my previous blog entry, we can now properly sign rows to provide integrity and non-repudiation, which we did not have before. To show this, let's create a modified version of the previous schema by renaming the last column to signature:

CREATE TABLE secure_demo2 (
        id SERIAL, car_type TEXT, license TEXT, activity TEXT, 
        event_timestamp TIMESTAMP WITH TIME ZONE, username NAME, signature BYTEA);

Now, let's do the insert as before:

INSERT INTO secure_demo2
VALUES (DEFAULT, 'Mazda Miata', 'AWR-331', 'flat tire',
CURRENT_TIMESTAMP, 'user1', NULL) RETURNING *;
 id |  car_type   | license | activity  |        event_timestamp        | username | signature
----+-------------+---------+-----------+-------------------------------+----------+-----------
  1 | Mazda Miata | AWR-331 | flat tire | 2017-07-08 10:20:30.842572-04 | user1    | (null)

Now, we sign it using our private rsa key, rather than creating a random key for this:

SELECT ROW(id, car_type, license, activity, event_timestamp, username)
FROM secure_demo2
WHERE id = 1
 
-- set psql variables to match output columns
\gset
 
\set signature `echo :'row' | openssl pkeyutl -sign -inkey ~user1/.pgkey/rsa.key | xxd -p | tr -d '\n'`
 
UPDATE secure_demo2 SET signature = decode(:'signature', 'hex') WHERE id = 1;
 
SELECT * FROM secure_demo2;
 id |  car_type   | license | activity  |        event_timestamp        | username | signature
----+-------------+---------+-----------+-------------------------------+----------+-------------
  1 | Mazda Miata | AWR-331 | flat tire | 2017-07-08 10:20:30.842572-04 | user1    | \x857310...

To later verify that the data row has not been modified, we can do:

SELECT ROW(id, car_type, license, activity, event_timestamp, username), signature
FROM secure_demo2
WHERE id = 1;
                                     row                                     | signature
-----------------------------------------------------------------------------+-------------
 (1,"Mazda Miata",AWR-331,"flat tire","2017-07-08 10:20:30.842572-04",user1) | \x857310...
\gset
 
\echo `echo :'signature' | xxd -p -revert > sig.tmp`
\echo `echo :'row' | openssl pkeyutl -verify -pubin -inkey /u/postgres/keys/user1.pub -sigfile sig.tmp`
Signature Verified Successfully
 
\! rm sig.tmp

Because signature verification is done using the public key, anyone can verify that the data has not been modified. It also allows non-authors to verify that the data was created by the owner of the private key.

This and the previous two blog entries are related. The first one explained how to create and store a simple message authentication code (mac). The second one explained how to encrypt data on the client side using symmetric and public key cryptography. This blog entry shows how to do message authentication via public key signing, so anyone with access to the public key can verify authorship.

View or Post Comments

Security Client Row Access Control

Wednesday, September 5, 2018

Usually the database administrator controls who can access database data. However, it is possible for clients to completely control who can access data they add to the database, with the help of openssl.

First, let's create rsa keys for three users from the command line. We first create an rsa public/private key pair for each user in their home subdirectory and then make a copy of their rsa public key in the shared directory /u/postgres/keys:

# # must be run as the root user
# cd /u/postgres/keys
# for USER in user1 user2 user3
> do    mkdir ~"$USER"/.pgkey
>       chown -R "$USER" ~"$USER"/.pgkey
>       chmod 0700 ~"$USER"/.pgkey
>       openssl genpkey -algorithm RSA -out ~"$USER"/.pgkey/rsa.key
>       chmod 0600 ~"$USER"/.pgkey/*
>       openssl pkey -in ~"$USER"/.pgkey/rsa.key -pubout -out "$USER".pub
> done

A more sophisticated setup would include creating a certificate authority and signing certificates for each user using the created keys. This allows the certificate authority to prove that the public keys belong to the specified users.

With this in place, it is now possible to encrypt data on the client using a public key that can only be decrypted by someone with access to the matching private key. Here is an example for user user1:

# echo test4 | openssl pkeyutl -encrypt -pubin -inkey /u/postgres/keys/user1.pub |
> openssl pkeyutl -decrypt -inkey ~user1/.pgkey/rsa.key
test4

This encrypts text with user1's public key, then decrypts it with their private key.

Now, let's create a table to hold the encrypted data and add some encrypted data:

CREATE TABLE survey1 (id SERIAL, username NAME, result BYTEA);
 
\set enc `echo secret_message | openssl pkeyutl -encrypt -pubin -inkey /u/postgres/keys/user1.pub | xxd -p | tr -d '\n'`
INSERT INTO survey1 VALUES (DEFAULT, 'user1', decode(:'enc', 'hex'));
 
-- save data for the other two users
\set enc `echo secret_message | openssl pkeyutl -encrypt -pubin -inkey /u/postgres/keys/user2.pub | xxd -p | tr -d '\n'`
INSERT INTO survey1 VALUES (lastval(), 'user2', decode(:'enc', 'hex'));
 
\set enc `echo secret_message | openssl pkeyutl -encrypt -pubin -inkey /u/postgres/keys/user3.pub | xxd -p | tr -d '\n'`
INSERT INTO survey1 VALUES (lastval(), 'user3', decode(:'enc', 'hex'));

We could have placed all the user-encrypted data in the same row using a bytea array:

CREATE TABLE survey2 (id SERIAL, username NAME[], result BYTEA[]);
 
\set enc1 `echo secret_message | openssl pkeyutl -encrypt -pubin -inkey /u/postgres/keys/user1.pub | xxd -p | tr -d '\n'`
\set enc2 `echo secret_message | openssl pkeyutl -encrypt -pubin -inkey /u/postgres/keys/user2.pub | xxd -p | tr -d '\n'`
\set enc3 `echo secret_message | openssl pkeyutl -encrypt -pubin -inkey /u/postgres/keys/user3.pub | xxd -p | tr -d '\n'`
 
INSERT INTO survey2 VALUES (
        DEFAULT,
        '{user1, user2, user3}',
        ARRAY[decode(:'enc1', 'hex'), decode(:'enc2', 'hex'), decode(:'enc3', 'hex')]::bytea[]);

We could have stored the encrypted value only once using a random password and encrypted the password using each user's public key and stored those:

CREATE TABLE survey3 (id SERIAL, result BYTEA, username NAME[], keys BYTEA[]);
 
\set key `openssl rand -hex 32`
\set enc `echo secret_message | openssl enc -aes-256-cbc -pass pass\::key | xxd -p | tr -d '\n'`
 
\set enc1 `echo :'key' | openssl pkeyutl -encrypt -pubin -inkey /u/postgres/keys/user1.pub | xxd -p | tr -d '\n'`
\set enc2 `echo :'key' | openssl pkeyutl -encrypt -pubin -inkey /u/postgres/keys/user2.pub | xxd -p | tr -d '\n'`
\set enc3 `echo :'key' | openssl pkeyutl -encrypt -pubin -inkey /u/postgres/keys/user3.pub | xxd -p | tr -d '\n'`
 
INSERT INTO survey3 VALUES (
        DEFAULT,
        :'enc',
        '{user1, user2, user3}',
        ARRAY[decode(:'enc1', 'hex'), decode(:'enc2', 'hex'), decode(:'enc3', 'hex')]::bytea[]);

To decrypt data using the first schema (one user per row, no arrays), user1 would do:

SELECT * FROM survey1 WHERE username = 'user1';
 id | username |    result
----+----------+-------------
  1 | user1    | \x80c9d0...
 
-- set psql variables to match output columns
\gset
 
-- 'cut' removes \x
\set decrypt `echo :'result' | cut -c3- | xxd -p -revert | openssl pkeyutl -decrypt -inkey ~user1/.pgkey/rsa.key`
 
SELECT :'decrypt';
    ?column?
----------------
 secret_message

The process would be similar for survey2 and survey3. Of course, this does not prevent database administrators from removing data from the database, and because public keys are used to encrypt, they could add data too. A message authentication code (mac) would prevent this.

View or Post Comments

Security Cryptographically Authenticated Rows

Friday, August 31, 2018

When storing data in the database, there is an assumption that you have to trust the database administrator to not modify data in the database. While this is generally true, it is possible to detect changes (but not removal) of database rows.

To illustrate this, let's first create a table:

CREATE TABLE secure_demo (
        id SERIAL, car_type TEXT, license TEXT, activity TEXT, 
        event_timestamp TIMESTAMP WITH TIME ZONE, username NAME, hmac BYTEA);

The last column (hmac) is used for change detection. Let's insert a row into the table:

INSERT INTO secure_demo
VALUES (DEFAULT, 'Mazda Miata', 'AWR-331', 'flat tire',
CURRENT_TIMESTAMP, 'user1', NULL) RETURNING *;
 id |  car_type   | license | activity  |       event_timestamp        | username | hmac
----+-------------+---------+-----------+------------------------------+----------+------
  1 | Mazda Miata | AWR-331 | flat tire | 2017-07-06 20:15:59.16807-04 | user1    |

Notice that this query also returns a text representation of the inserted row, including the computed columns id and event_timestamp.

To detect row changes, it is necessary to generate a message authentication code (mac) which is generated with a secret known only to the client. It is necessary to generate the mac on the client so the secret is never transferred to the server. These psql queries update the inserted row to store the mac:

SELECT ROW(id, car_type, license, activity, event_timestamp, username)
FROM secure_demo
WHERE id = 1;
 
-- set psql variables to match output columns
\gset
 
\set hmac `echo :'row' | openssl dgst -sha256 -binary -hmac 'MY-SECRET' | xxd -p | tr -d '\n'`
 
UPDATE secure_demo SET hmac = decode(:'hmac', 'hex') WHERE id = 1;
 
SELECT * FROM secure_demo;
 id |  car_type   | license | activity  |       event_timestamp        | username |   hmac
----+-------------+---------+-----------+------------------------------+----------+-------------
  1 | Mazda Miata | AWR-331 | flat tire | 2017-07-06 20:15:59.16807-04 | user1    | \x9549f1...

To later verify that the data row has not been modified, do:

SELECT ROW(id, car_type, license, activity, event_timestamp, username), hmac
FROM secure_demo
WHERE id = 1;
                                    row                                     |   hmac
----------------------------------------------------------------------------+-------------
 (1,"Mazda Miata",AWR-331,"flat tire","2017-07-06 20:15:59.16807-04",user1) | \x9549f1...
 
\gset
 
\echo  ' E''\\\\x'`echo :'row' | openssl dgst -sha256 -binary -hmac 'MY-SECRET' | xxd -p | tr -d '\n'`''''
 E'\\x9549f10d54c6a368499bf98eaca716128c732132680424f74cb00e1d5a175b63'
 
\echo :'hmac'
 E'\\x9549f10d54c6a368499bf98eaca716128c732132680424f74cb00e1d5a175b63'

The database administrator could replace or remove the hmac value, but this would be detected. This is because computing a proper hmac requires the MY-SECRET key, which is never sent to the server.

The above solution only allows someone with access to the secret key to determine if the row has been modified, meaning only they can check the message's integrity. A more sophisticated solution would be to use a private key to sign a hash of the row value — this would allow anyone with access to the public key to check that the row has not been modified, but still only allow those with access to the private key to generate a new hmac. This would also allow for non-repudiation.

There is the risk that the row values returned by the insert do not match those that were supplied, so some client-side checks would need to be added. There is also no detection for removed rows; this is similar to the problem of trying to detect a blocked tls connection attempt.

Such a setup is clearly overkill for many databases, but there are some use cases where data integrity guaranteed by the client, independent of the database administrator, is useful. Non-repudiation using public key infrastructure is also sometimes useful.

View or Post Comments

Password Foreign Data Wrappers and Passwords

Wednesday, August 29, 2018

Foreign data wrappers (fdw) allow data to be read and written to foreign data sources, like NoSQL stores or other Postgres servers. Unfortunately the authentication supported by fdws is typically limited to passwords defined using create user mapping. For example, postgres_fdw only supports password-based authentication, e.g., scram. Though only the database administrator can see the password, this can still be a security issue.
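For reference, this is the kind of password-based setup the post describes, with hypothetical server and role names; the password ends up stored in the local cluster's catalogs:

CREATE EXTENSION postgres_fdw;
 
CREATE SERVER remote_pg FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host 'remotedb.mydomain.com', dbname 'test');
 
-- the password is visible to the local database administrator
CREATE USER MAPPING FOR app_user SERVER remote_pg
    OPTIONS (user 'app_user', password 'secret');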

Ideally, at least some of the Postgres fdws should support more sophisticated authentication methods, particularly SSL certificates. Another option would be to allow user authentication to be sent through fdws, so the user has the same permissions on the fdw source and target. There is no technical reason fdw authentication is limited to passwords. This problem has been discussed, and it looks like someone has a plan for solving it, so hopefully it will be improved soon.

View or Post Comments

Certificate Certificate Revocation Lists

Monday, August 27, 2018

If you are setting up Postgres server or client TLS/SSL certificates, be sure to also configure support for a certificate revocation list (crl). This list, distributed by the certificate authority, lists certificates that should no longer be trusted.
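A minimal sketch of where the crl is configured (file names are only examples): the server checks client certificates against ssl_crl_file, and libpq clients can check the server's certificate with the sslcrl parameter:

# postgresql.conf: check client certificates against a revocation list
ssl = on
ssl_ca_file = 'root.crt'
ssl_crl_file = 'root.crl'
 
$ # client side: verify the server certificate and consult the crl
$ psql 'host=db.mydomain.com sslmode=verify-full sslrootcert=root.crt sslcrl=root.crl dbname=test'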

While the crl will likely be empty initially, a time will come when a private key used by a certificate or device is exposed in an unauthorized manner, or when an employee who had access to private keys leaves your organization. When that happens, you will need the ability to invalidate certificates — having that ability pre-configured will help, especially during a crisis.

View or Post Comments

Multi-host Oracle Real Application Clusters (Rac)

Monday, July 9, 2018

I get asked about Oracle RAC often. My usual answer is that Oracle RAC gives you 50% of high reliability (storage is shared, mirroring helps) and 50% of scaling (CPU and memory are scaled, storage is not). The requirement to partition applications to specific nodes to avoid cache consistency overhead is another downside. (My scaling presentation shows Oracle RAC.)

I said the community is unlikely to go in the Oracle RAC direction because it doesn't fully solve a single problem and is overly complex. The community prefers simple solutions that fully solve a problem.

For me, streaming replication fully solves the high availability problem and sharding fully solves the scaling problem. Of course, if you need both, you have to deploy both, which gives you 100% of two solutions, rather than Oracle RAC which gives you 50% of each.

However, I do think database upgrades are easier with Oracle RAC, and I think it is much easier to add/remove nodes than with sharding. For me, this chart summarizes it:

                         HA   Scaling  Upgrade Add/Remove
        Oracle RAC       50%     50%    easy    easy
        Streaming Rep.  100%     25%*   hard    easy
        Sharding          0%    100%    hard    hard
        
        * Allows read scaling

(I posted this to pgsql-general in 2016.)

Update: great video on the topic 2020-12-08

View or Post Comments

Optimizer Query Planner Interview

Saturday, June 23, 2018

Software Engineering Radio has just posted a one-hour audio recording of an interview I did about the Postgres query optimizer. It is generic enough to be useful to anyone wanting to understand how relational databases optimize queries.

On an unrelated note, I am leaving soon for a 25-day European Postgres speaking tour. I am speaking at one-day conferences in Zürich, London, and Amsterdam, a user group in Frankfurt, and will be presenting at ten EnterpriseDB events across Europe.

View or Post Comments

Presentation Will Postgres Live Forever?

Thursday, June 7, 2018

I had the opportunity to present an unusual topic at this year's Postgres Vision conference: Will Postgres Live Forever? It is not something I talk about often but it brings out some interesting contrasts in how open source is different from proprietary software, and why innovation is fundamental to software usage longevity. For the answer to the question, you will have to read the slides.

View or Post Comments

Release notes Draft of Postgres 11 Release Notes

Thursday, May 17, 2018

I have completed the draft version of the Postgres 11 release notes. Consisting of 167 items, this release makes big advances in partitioning, parallelism, and server-side transaction control via procedures. One of the more unexpected yet useful features is "Allow 'quit' and 'exit' to exit psql when used in an empty buffer".

The release notes will be continually updated until the final release, which is expected in September or October of this year.

View or Post Comments

Documentation Intermediate Certificates

Monday, January 22, 2018

I previously mentioned the importance of high quality documentation, so we are always looking for improvements. This email thread from 2013 attempted to codify the rules for how to properly use intermediate ssl/tls certificates with Postgres. At that time, our documentation was updated to recommend storing intermediate certificates with root certificates because it was unclear under what circumstances intermediate certificates are transferred to the remote server to be chained to a trusted root certificate.

During research for my four security talks, I studied certificate handling. I found certificate chain resolution rules in the verify manual page. In testing various certificate locations, I also found that Postgres follows the same rules.

Based on this testing, I realized the conclusions reached in 2013 were inaccurate, or at least incomplete. While the documented procedure worked, the more practical and recommended approach is to store intermediate certificates (created with v3_ca extensions) with leaf certificates to be sent to the remote end. (I think the requirement of using the v3_ca extension when creating intermediate certificates is what caused much of the testing confusion in the past.)

This new procedure allows short-lived leaf and intermediate certificates to be replaced at expiration time while the long-lived root certificate stores remain unchanged. For example, for clients to verify the server's certificate, the server would contain the intermediate and server leaf certificates, and clients only need root certificates, which rarely change.
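A hedged sketch of that layout for server verification, with example file names: the server's certificate file contains the leaf certificate followed by the intermediate, and clients keep only the root:

$ # server sends leaf + intermediate; clients need only the root certificate
$ cat server.crt intermediate.crt > server-chain.crt
 
# postgresql.conf
ssl_cert_file = 'server-chain.crt'
ssl_key_file = 'server.key'
 
$ # client connects with just the long-lived root certificate
$ psql 'host=db.mydomain.com sslmode=verify-full sslrootcert=root.crt dbname=test'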

The documentation of all supported Postgres versions has been updated to recommend this new procedure. I have also added sample scripts showing how to create root-leaf and root-intermediate-leaf certificate chains.

These changes will be distributed in the next minor Postgres releases, scheduled for next month. Until this new documentation is released, you can read the updates in the Postgres 11 docs in the server and libpq ssl sections. I am hopeful this clarified documentation will encourage people to use ssl and ssl certificate verification.

View or Post Comments

Presentation Four New Security Talks

Monday, January 15, 2018

In the past few months I have completed four new security talks, totaling 294 slides. The first and third talks explain the fundamentals of cryptography and cryptographic hardware, respectively. The second and fourth talks cover the application of these fundamentals. The second talk covers tls, including the use of ssl certificates by Postgres. The fourth covers the use of cryptographic hardware by applications, including Postgres.

View or Post Comments

Press Video of Russian Interview

Wednesday, January 10, 2018

I just did a two-hour interview in English with the Russian Postgres user group. A video recording of the interview is online and covers questions asked by the Russian attendees.

View or Post Comments

Community Web Forums?

Friday, January 5, 2018

This email thread explores the idea of the community supporting web forums instead of, or in addition to, the email lists, where the majority of community development and discussion happens. Reasons stated for not pursuing web forums included:

  • It has been tried before and failed
  • Larger web forum communities already exist, e.g., Stack Overflow
  • Established Postgres community members prefer email

Ultimately I think the Postgres community needs to do a better job of publicizing the existence of external communities that help Postgres users, e.g., Slack. As an example, the Postgres irc channel is well publicized and currently has 1,100 connected users. Also, EnterpriseDB created the Postgres Rocks web forum six months ago.

View or Post Comments

Documentation Wal and Xlog

Wednesday, January 3, 2018

Postgres isn't the best at naming things. Of course, there is the old computer saying, "There are only two hard things in Computer Science: cache invalidation and naming things." With Postgres being 31 years old and developed by several different project teams, naming can be even more inconsistent.

One naming inconsistency, which we have lived with for years, is the name of the write-ahead log. Postgres references this using the acronym wal in server variables like wal_level, but the PGDATA directory containing the write-ahead log files was called pg_xlog. In "pg_xlog," the "x" stands for "trans" which is short for "transaction", and of course "log" means "log", so "xlog" was short for "transaction log." This was also confusing because there is a clog directory which records "transaction status" information (commits, aborts). So, "xlog" or "transaction log" was already a bad name, and having it also referenced as wal just made it worse.

Postgres 10 has made the difficult change of removing references to "xlog" and "clog," instead naming them "wal" (e.g., pg_wal) and "pg_xact" consistently. This email thread covers many of the gory details of what we changed and why. It isn't ideal to be changing the names of internal database objects, and it will cause some pain to those moving to Postgres 10, but future users of Postgres will have a more consistent experience of Postgres and how it works.

View or Post Comments