Bruce Momjian

Postgres Blog

This blog is about my work on the Postgres open source database, and is published on Planet PostgreSQL. PgLife allows monitoring of all Postgres community activity.


Sql Pagination Tips

Monday, August 10, 2020

We have all used applications that allow searches which return results in sections or "pages", like for products or flights. For example, you might search for a product, and there might be 1,500 matches, but you don't see 1,500 results — you only see ten results (1-10), and you have the option of seeing the next ten (11-20), and the next ten (21-30), etc. Pagination is done for several reasons:

  1. The user probably doesn't want to see all the results at once
  2. The user probably doesn't want to ever see all of the results
  3. It might be inefficient to transfer and display all results

There are several ways to enable pagination in a client/server architecture like that used by databases. One approach is to transfer all results to the client and let the client do the pagination. That handles the first issue, but it doesn't address cases where the transfer and storage of many results is inefficient. While this might work for 1,500 results, it is going to perform badly for 150 million results.

Therefore, some kind of pagination support on the server side is required if large result sets are possible. There are four common approaches:

  1. Cursors: Open a non-with hold cursor in a transaction block and fetch rows in pages. The downside of this is that the entire result set often has to be computed before returning any result. Also, the transaction must be kept open while users are paging through the results, leading to potentially long-running transactions, and therefore decreased cleanup efficiency. Using idle_in_transaction_session_timeout prevents sessions from keeping transactions open too long.
  2. With hold cursors: Using with hold cursors avoids keeping a transaction open during page fetches, but it does require the entire result set to be stored in server memory.
  3. limit/offset: These keywords allow select to return partial results, making them ideally suited to pagination. Limit/offset also allows the optimizer (slide 52) to efficiently access limited result sets, often using indexes, which are only efficient for small result sets. This also avoids creating the entire result set in memory.
  4. Limit/offset with repeatable read: Limit/offset alone has a problem displaying accurate data if the underlying results change during page views. For example, if I am viewing results 21-30, and a row is added which should appear in range 11-20, if I request the next page, what was item 30 will appear again as item 31. If an earlier item is deleted, what was item 31 will now be item 30, and will not appear in range 31-40. Effectively, paging back and forward will cause items to not appear, or appear twice. The fix for this is to use limit/offset in a repeatable read or serializable transaction, so that all page requests use the same snapshot. This does have the same problem that non-with hold cursors had with long-running transactions, so the use of idle_in_transaction_session_timeout is recommended.
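
As a sketch, here is what approaches 1 and 4 might look like; the product table and its name column are hypothetical:

```sql
-- Approach 1: a non-with hold cursor; the transaction stays open
-- between page fetches
BEGIN;
DECLARE page_cur CURSOR FOR SELECT * FROM product ORDER BY name;
FETCH 10 FROM page_cur;   -- rows 1-10
FETCH 10 FROM page_cur;   -- rows 11-20
COMMIT;

-- Approach 4: limit/offset inside a repeatable read transaction,
-- so every page request sees the same snapshot
BEGIN ISOLATION LEVEL REPEATABLE READ;
SELECT * FROM product ORDER BY name LIMIT 10 OFFSET 0;   -- rows 1-10
SELECT * FROM product ORDER BY name LIMIT 10 OFFSET 10;  -- rows 11-20
COMMIT;
```

In both cases, idle_in_transaction_session_timeout limits how long an abandoned session can hold the transaction open.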

One optimization to reduce query requests is to fetch a large page range in the application, display smaller page ranges to the user as requested, and request more page ranges as needed.
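
For example, a minimal sketch of this prefetching, again assuming a hypothetical product table:

```sql
-- Fetch five ten-row pages in one query; the application shows
-- rows 1-10 immediately and serves rows 11-50 from its own cache
SELECT * FROM product ORDER BY name LIMIT 50 OFFSET 0;
```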

One other thing people expect from pagination is an estimate of the number of matches. While you can run count(*) to find the number of matches, this can be very inefficient. A more creative approach is to use the optimizer's statistics as an estimate, since it already maintains statistics to generate efficient plans. For example, if you create this function:

CREATE OR REPLACE FUNCTION row_count_estimate(query TEXT)
RETURNS INTEGER
AS $$
DECLARE
        str TEXT;
        row_count INTEGER;
BEGIN
        -- get just the first row of the EXPLAIN output
        EXECUTE $a$EXPLAIN $a$ || query INTO str;
        -- extract the rows= value
        SELECT substring(substring(str FROM ' rows=[0-9]+') FROM 7) INTO row_count;
        RETURN row_count;
END;
$$ LANGUAGE plpgsql;

you can use it to estimate the number of rows in a result set. (Don't specify limit/offset since we want the total count.)

SELECT row_count_estimate('SELECT * FROM product WHERE name LIKE ''a%''');

To improve the accuracy of the estimates, you can increase the frequency of statistics updates by modifying autovacuum's analyze settings (the autovacuum_analyze_* parameters), either at the cluster or per-table level. You can also run analyze manually. Increasing the granularity of statistics can also improve estimates; this too can be set at the cluster or per-table level.
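
For example, here is one way to do both, assuming a hypothetical product table:

```sql
-- Analyze this table more often than the cluster default
ALTER TABLE product SET (autovacuum_analyze_scale_factor = 0.02);
-- Increase statistics granularity for one column (the default target is 100)
ALTER TABLE product ALTER COLUMN name SET STATISTICS 1000;
-- Manually refresh statistics now
ANALYZE product;
```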


Data modeling Invalid Times

Wednesday, August 5, 2020

Have you ever wondered how invalid or ambiguous times are handled by Postgres? For example, during a daylight saving time transition in the usa, time switches either from 2am to 3am, or from 2am back to 1am. On a spring-forward day (the former), how would 2:30am be represented? Is 1:30am during a fall-back day represented as the first or second 1:30am of that day? This email thread explains the problem, and this patch documents the behavior. (November 4, 2018 was a "fall back" date in the usa.)

The original poster, Michael Davidson, showed the first query, and Tom Lane explained that you would need to use one of the two later queries to qualify the 1am time:

SELECT '2018-11-04 01:00:00'::timestamp WITH TIME ZONE;
 2018-11-04 01:00:00-05
SELECT '2018-11-04 01:00:00 EST'::timestamp WITH TIME ZONE;
 2018-11-04 01:00:00-05
SELECT '2018-11-04 01:00:00 EDT'::timestamp WITH TIME ZONE;
 2018-11-04 01:00:00-04

I am not sure how to specify this for time zones that don't have daylight/non-daylight saving time abbreviations — I suppose utc offsets would need to be used, e.g.:

SELECT '2018-11-04 01:00:00-05'::timestamp WITH TIME ZONE;
 2018-11-04 01:00:00-05

Interestingly, in this case the time zone abbreviation contains more information than the more generic time zone string 'America/New_York'.


Sql Differences Between Dates

Monday, August 3, 2020

What is the difference between two dates? You would think there was one answer, but there isn't. You can give an answer in calendar terms (years/months/days), the number of days, or the number of seconds. Postgres offers all of these options:

SELECT age('2019-12-25', '2018-06-01');
 1 year 6 mons 24 days
SELECT '2019-12-25'::timestamp - '2018-06-01'::timestamp;
 572 days
SELECT '2019-12-25'::timestamptz - '2018-06-01'::timestamptz;
 572 days 01:00:00
SELECT '2019-12-25'::date - '2018-06-01'::date;
 572
SELECT EXTRACT(EPOCH FROM '2019-12-25'::timestamptz) - EXTRACT(EPOCH FROM '2018-06-01'::timestamptz);
 49424400

Unless you are Mr. Spock, you can probably see the value in all of these methods. Age() provides the simplest option for human consumption. Timestamp subtraction (no tz) returns an interval, which can show day and second differences, but doesn't handle daylight saving time changes. Timestamptz subtraction also returns an interval, but handles daylight saving time changes. Subtraction of date values returns an integer number of days and is good for cases where the smallest unit of calculation is a day, like bank loan interest or hotel nights. Extract with epoch is ideal when exact precision is required, like electricity consumption or astronomical measurements.

You can also categorize these methods by the time difference details they ignore. For example, age() compares months/days/seconds units to find the simplest difference, and ignores daylight saving time changes, and sometimes even leap days. Date and timestamp subtraction ignore daylight saving time changes. Subtraction using timestamptz and extract with epoch honor both. (Unfortunately, Postgres ignores leap seconds.)

Here is an example of age() ignoring leap days when the day of the month is the same, but reflecting leap days if the days of the month are different:

-- leap year, same day of the month
SELECT age('2020-03-01', '2020-02-01');
 1 mon
-- non-leap year, same day of the month
SELECT age('2021-03-01', '2021-02-01');
 1 mon
-- leap year, different day of the month
SELECT age('2020-03-01', '2020-02-02');
 28 days
-- non-leap year, different day of the month
SELECT age('2021-03-01', '2021-02-02');
 27 days

This is because age() tries to find the simplest difference. Here is an example of daylight saving time changes using age() and subtraction:

-- Daylight saving time started in America/New_York on March 8, 2020 02:00:00
SHOW timezone;
 America/New_York
SELECT age('2020-03-09', '2020-03-08');
 1 day
SELECT age('2020-03-09 00:00:00'::timestamptz, '2020-03-08 00:00:00'::timestamptz);
 1 day
SELECT '2020-03-09'::date - '2020-03-08'::date;
 1
SELECT '2020-03-09'::timestamp - '2020-03-08'::timestamp;
 1 day
SELECT '2020-03-09'::timestamptz - '2020-03-08'::timestamptz;
 23:00:00
-- a 23-hour day
SELECT EXTRACT(EPOCH FROM '2020-03-09'::timestamptz) - EXTRACT(EPOCH FROM '2020-03-08'::timestamptz);
 82800
-- a 24-hour day
SELECT EXTRACT(EPOCH FROM '2020-03-10'::timestamptz) - EXTRACT(EPOCH FROM '2020-03-09'::timestamptz);
 86400

If you are doing a later calculation using a computed date difference, consider how accurate you want the result to be:

  • Do you want it to be the same day based on the calendar? Use age()
  • Do you want the number of days to be exactly the same? Use date or timestamp subtraction
  • Do you want the number of seconds to be exactly the same? Use timestamptz subtraction or extract with epoch

I have already written about the complexities of using and computing interval values.


Sql Computing Interval Values

Friday, July 31, 2020

The interval data type stores time duration as months, days, and seconds. Years are represented as a fixed number of months, and hours and minutes as a fixed number of seconds. Using interval values makes time computation very simple:

-- Daylight saving time started in America/New_York on March 8, 2020 02:00:00
SHOW timezone;
 America/New_York
SELECT '2020-03-07 00:00:00'::timestamptz + '2 days';
 2020-03-09 00:00:00-04
SELECT '2020-03-07 00:00:00'::timestamptz + '48 hours';
 2020-03-09 01:00:00-04

This computation spans a usa daylight saving time change. The interval data type allows the user to specify whether they want days or hours added, with different results because of the daylight saving time change. This is a great use of the interval data type.

Using queries that output interval values is more nuanced. While you can specify units on input, e.g. days, hours, seconds, you can't specify whether you care about months, days, or seconds in interval output. Therefore, different operations are required to produce different outputs, e.g.:

SELECT '2020-03-09 00:00:00'::timestamptz - '2020-03-07 00:00:00'::timestamptz;
 1 day 23:00:00
SELECT age('2020-03-09 00:00:00'::timestamptz, '2020-03-07 00:00:00'::timestamptz);
 2 days

The first query uses subtraction, and computes based on seconds. The second query internally calls timestamp_age(). The comment in the source code is illustrative:

Calculate time difference while retaining year/month fields. Note that this does not result in an accurate absolute time span since year and month are out of context once the arithmetic is done.

The key part is the "out of context" mention. What it is saying is that when age() returns 2 days, that could have spanned 47, 48, or 49 hours, based on usa time zones, and that detail is lost in the output. Look at this:

SELECT '2020-03-07 00:00:00'::timestamptz + '1 day 23 hours';
 2020-03-09 00:00:00-04
SELECT '2020-03-07 00:00:00'::timestamptz + '47 hours';
 2020-03-09 00:00:00-04
SELECT '2020-03-07 00:00:00'::timestamptz + '2 days';
 2020-03-09 00:00:00-04
SELECT '2020-03-07 00:00:00'::timestamptz + '2 days -1 hours';
 2020-03-08 23:00:00-04

The first three return the same result, though the last does not because days is computed first, then hours. However, when the first and third queries are run in a different context, i.e., for a different date, they generate different outputs:

SELECT '2020-03-12 00:00:00'::timestamptz + '1 day 23 hours';
 2020-03-13 23:00:00-04
SELECT '2020-03-12 00:00:00'::timestamptz + '2 days';
 2020-03-14 00:00:00-04

When you are inputting interval values, be conscious of the month, day, and second values you specify, and when calling a function that produces interval output, consider how it computes its output.


Sql pgFormatter

Wednesday, July 29, 2020

In my years with Postgres, I have seen some amazingly complex queries posted to the email lists. I have never understood how people can read complex queries with no formatting, e.g., no keyword capitalization, no indenting, no line breaks for new clauses:

select n.nspname as "Schema", p.proname as "Name",
pg_catalog.format_type(p.prorettype, null) as "Result data type", case
when p.pronargs = 0 then cast('*' as pg_catalog.text) else
pg_catalog.pg_get_function_arguments(p.oid) end as "Argument data
types", pg_catalog.obj_description(p.oid, 'pg_proc') as "Description"
from pg_catalog.pg_proc p left join pg_catalog.pg_namespace n on n.oid =
p.pronamespace where p.prokind = 'a' and n.nspname <> 'pg_catalog' and
n.nspname <> 'information_schema' and
pg_catalog.pg_function_is_visible(p.oid) order by 1, 2, 4; 

Obviously, some people can read such queries, but I never can. I rely on clean formatting to conceptually understand queries and detect errors. I find this much clearer:

SELECT n.nspname AS "Schema", p.proname AS "Name",
    pg_catalog.format_type(p.prorettype, NULL) AS "Result data type", CASE WHEN
    p.pronargs = 0 THEN
        CAST('*' AS pg_catalog.text)
    ELSE
        pg_catalog.pg_get_function_arguments(p.oid)
    END AS "Argument data types", pg_catalog.obj_description(p.oid, 'pg_proc')
        AS "Description"
FROM pg_catalog.pg_proc p
    LEFT JOIN pg_catalog.pg_namespace n ON n.oid = p.pronamespace
WHERE p.prokind = 'a'
    AND n.nspname <> 'pg_catalog'
    AND n.nspname <> 'information_schema'
    AND pg_catalog.pg_function_is_visible(p.oid)
ORDER BY 1, 2, 4;

In the past, I used to clean up queries using a tool designed to reformat Informix 4GL programs, and I would use sed to fix some things the tool missed. That was obviously far from ideal, so last year I tried Gilles Darold's pgFormatter. My first test was to run the sql files I use for my presentations through the formatter. I found a few problems that Gilles fixed quickly. I then ran the Postgres regression test queries through the formatter, which turned up more problems. Again, Gilles fixed them, and I could then run all 78k lines of sql queries from the regression tests with satisfactory results.

Now, when I am presented with a query that looks like the first version, I run it through pgFormatter, and with my favorite flags, I get a query that looks like I typed it.


Sql Writing Style

Monday, July 27, 2020

There seem to be as many methods of writing sql queries as ways of writing essays. While spacing, capitalization, and naming are all personal preferences, there are also logical arguments for why certain styles are better than others. This web page outlines one set of styles, based on Joe Celko's SQL Programming Style. While I don't agree with all the style decisions, I feel it is a good way to think about your own style decisions and increase style consistency.


Security Encryption at Rest

Friday, July 24, 2020

There are many security guides that require encryption at rest. However, it is unclear exactly what "at rest" means. Encrypted at rest can potentially mean encrypted when the:

  • Storage is powered off
  • File system is unmounted
  • Database is not running
  • Data is in the kernel's file system cache
  • Write-ahead log is in archive storage
  • Data is in backups
  • Data is in the database process's memory

Odds are some of these items are required to implement a guide's "encryption at rest," but maybe not all of them. The last one is particularly difficult, especially since the unlock key probably also has to be somewhere in process memory. When implementing encryption at rest, it is good to be clear exactly what encryption levels are required, and what risks they are designed to minimize.


Security Passwords in Log files

Wednesday, July 22, 2020

No one likes user passwords appearing in log files (except the bad guys/gals). Because Postgres uses sql queries to manage user accounts, including password assignment, it is possible for passwords to appear in the server logs if, for example, log_statement is enabled or an error is generated, depending on the setting of log_min_error_statement.

One solution is to encrypt the password manually before issuing the query, but that can be error-prone. A simpler solution is to use psql to automatically encrypt passwords before sending them to the server, e.g.:

SET client_min_messages = 'log';
SET log_statement = 'all';
\password postgres
Enter new password:
Enter it again:
LOG:  statement: show password_encryption
LOG:  statement: ALTER USER postgres PASSWORD 'md567429efea5606f58dff8f67e3e2ad490'

Notice psql runs show password_encryption to determine if md5 or scram-sha-256 should be used. psql then hashes the supplied password and issues an alter user command. While it is not ideal that hashed passwords can appear in the server logs, it is better than having user-typed passwords in the logs. (It might be wise to set log_min_error_statement to "panic" to prevent logging of error queries.) Another secure option is to use syslog to send the logs to a secure server.


Administration Force Password Changes

Monday, July 20, 2020

I have written about the limitations of passwords before. Postgres has limited ability to control passwords — mostly how passwords are transferred and how long passwords remain valid. For more complex password requirements, the Postgres project recommends using an external authentication system, like ldap or gssapi, that manages passwords. Cert authentication has many advantages because it does not require a password.

Last year, one big change in password policy was the us nist announcement that periodic password changes are no longer recommended. This article explains why this is a logical approach to security. The new recommendations also suggest the limited value of password complexity requirements.

Users occasionally ask for more complex password management features to be built into Postgres. These new guidelines make it less likely such features will ever be added to core Postgres.


Security Credential Rotation Using Certificates

Friday, July 17, 2020

The traditional method of authentication is the local password. Authentication using external password managers is also very popular. However, passwords have known limitations, particularly around password selection.

Managing password changes can also be complex. If users change their own passwords, the process is simple. However, if someone else needs to change the password, and there is a delay between the user requesting the change and the change being made, it can be difficult for the user to know when to switch to using the new password. For applications that connect to the database by supplying passwords automatically, password changes can be even more complex. Programming applications to try both old and new passwords is awkward.

Instead of using something you know, e.g., passwords, another authentication approach is to use something you have. One ideal method of "something you have" is certificate authentication. It allows for more effective credential rotation because the old and new certificates can be valid at the same time. The process of replacing ssl certificates is:

  1. Create a new certificate, signed by a certificate authority in the same certificate chain as the server certificate
  2. Replace the old client certificate with the new one
  3. Add the old client certificate to the certificate revocation list (crl)

After step one, administrators don't need to rush step two because the old and new certificates are both valid. Once step two is complete, step three can be done. Step three is only useful if the server has been set up to honor certificate revocation lists. You also must set an expiration date for certificates to limit their lifetimes.

There is no similar way to do delayed authentication rotation using password-based methods.


Null Grouping Sets and Null Values

Wednesday, July 15, 2020

You might be aware that grouping sets, along with rollup and cube, allow queries to return additional group by rows. However, because these additional rows are added by Postgres, it is unclear what value to use for summary fields, so the sql standard specifies null for those fields. Let's start with a simple example using a simple group by:

CREATE TABLE test (x TEXT, y INTEGER);
INSERT INTO test VALUES ('a', 0), ('b', 0), (NULL, 0);
SELECT x, COUNT(*) FROM test GROUP BY x;
 x | count
 a |     1
 b |     1
   |     1

Each row represents a value in test and each row has a count of one since there is one matching row for each value. The last row is the number of nulls in the table, also one. Here is the same query using rollup:

SELECT x, COUNT(*) FROM test GROUP BY ROLLUP(x);
 x | count
 a |     1
 b |     1
   |     3
   |     1

Notice there are two rows where the first column is null. Let's verify these are nulls and not something else using psql's \pset:

\pset null '(null)'
SELECT x, COUNT(*) FROM test GROUP BY ROLLUP(x);
   x    | count
 a      |     1
 b      |     1
 (null) |     3
 (null) |     1
-- reset the psql NULL display
\pset null ''

Yes, they are both null. The first null is the one generated by rollup, and the last row is the number of null values.

This is quite confusing. If we expect a column involved in a grouping set-type operation to perhaps contain nulls, how can we distinguish between those values and values added by grouping set-type operations? Well, your first inclination might be to use coalesce:

SELECT COALESCE(x, '(null)'), COUNT(*) FROM test GROUP BY ROLLUP(x);
 coalesce | count
 a        |     1
 b        |     1
 (null)   |     3
 (null)   |     1

(Remember we turned off psql's special displaying of nulls, so (null) is coming from coalesce.) That didn't help. The reason is that the select target list is executed last, as covered in a previous blog post. The right way to do it is to force coalesce to run before the grouping set-type operation using a common table expression:

WITH cte AS (
        SELECT COALESCE(x, '(null)') AS x, y FROM test
)
SELECT x, COUNT(*) FROM cte GROUP BY ROLLUP(x);
   x    | count
 a      |     1
 b      |     1
 (null) |     1
        |     3

That's what we wanted — (null) for the inserted null, and a real null for the grouping set-type added row.

Another option is to use grouping, which returns a non-zero value if the row is generated by a grouping set, e.g.:

SELECT x, COUNT(*), GROUPING(x) FROM test GROUP BY ROLLUP(x);
 x | count | grouping
 a |     1 |        0
 b |     1 |        0
   |     3 |        1
   |     1 |        0

This can be combined with case to control the label of the grouping set null column:

SELECT CASE
           WHEN GROUPING(x) != 0 THEN x
           WHEN x IS NOT NULL THEN x
           ELSE '(null)'
       END AS y,
       COUNT(*)
FROM test
GROUP BY ROLLUP(x);
   y    | count
 a      |     1
 b      |     1
 (null) |     1
        |     3


NoSQL JSONB: A Container of Types

Monday, July 13, 2020

The popular json and jsonb data types are more than just a collection of single-data-type values like arrays. They can contain multiple data types:

CREATE TABLE test (x JSONB);
INSERT INTO test VALUES ('"abc"'), ('5'), ('true'), ('null'), (NULL);
\pset null (null)
SELECT x, jsonb_typeof(x), pg_typeof(x) FROM test;
   x    | jsonb_typeof | pg_typeof
 "abc"  | string       | jsonb
 5      | number       | jsonb
 true   | boolean      | jsonb
 null   | null         | jsonb
 (null) | (null)       | jsonb

You can see I stored a string, number, boolean, json null, and an sql null in a single jsonb column, and Postgres knew the type of each value in the jsonb column, as shown by jsonb_typeof(). However, pg_typeof() still sees all these as jsonb values, and that is the type exposed to anything referencing the json column.

Here is another example that uses #>> to convert all json values to text:

SELECT x, x::TEXT, x#>>'{}', pg_typeof(x#>>'{}') FROM test;
   x    |   x    | ?column? | pg_typeof
 "abc"  | "abc"  | abc      | text
 5      | 5      | 5        | text
 true   | true   | true     | text
 null   | null   | (null)   | text
 (null) | (null) | (null)   | text

(I am using #>> rather than the more popular ->> because it lets me access json values that are not associated with keys.) Notice that using :: to cast to text retains the double-quotes, while #>> removes them. (->> does also.) However, the return value for #>> did not change based on the json contents — it always returned text or sql null. Also notice that :: casting returns the string "null" for the jsonb null value, while #>> returns an sql null. (I used psql's \pset null above.)

There is a fundamental problem with the interaction between jsonb and sql data types. Postgres knows the data type of each json value inside the jsonb field (as shown by jsonb_typeof), but each field can have multiple json values inside, and each row can be different as well. Therefore, you only have two choices for interfacing jsonb to sql data types:

  1. Use ->> or #>> to map all values to text
  2. Cast values to an sql data type

For number one, all data types can be converted to text, so it is a simple solution, though it doesn't work well if you need to process the values in a non-textual way, e.g., a numeric comparison. You can cast jsonb values to any sql data type (number two) as long as all jsonb values can be cast successfully. For example:

-- start with an empty table
TRUNCATE test;
-- all inserted values must cast to a jsonb type
INSERT INTO test VALUES ('"1"'), ('2'), ('3e1'),  ('4f1'),  ('true'), ('null'), (NULL);
ERROR:  invalid input syntax for type json
LINE 1: ...NSERT INTO test VALUES ('"1"'), ('2'), ('3e1'),  ('4f1'),  (...
DETAIL:  Token "4f1" is invalid.
CONTEXT:  JSON data, line 1: 4f1
-- 4e1 uses exponential notation
INSERT INTO test VALUES ('"1"'), ('2'), ('3e1'),  ('4e1'),  ('true'), ('null'), (NULL);
SELECT x, x::TEXT, x#>>'{}', jsonb_typeof(x), pg_typeof(x#>>'{}') FROM test;
   x    |   x    | ?column? | jsonb_typeof | pg_typeof
 "1"    | "1"    | 1        | string       | text
 2      | 2      | 2        | number       | text
 30     | 30     | 30       | number       | text
 40     | 40     | 40       | number       | text
 true   | true   | true     | boolean      | text
 null   | null   | (null)   | null         | text
 (null) | (null) | (null)   | (null)       | text
-- all values can't be cast to numeric
SELECT x, x::TEXT, x#>>'{}', x::NUMERIC * 5, jsonb_typeof(x)
FROM test;
ERROR:  cannot cast jsonb string to type numeric
-- all values of jsonb type 'number' can be cast to numeric
SELECT x, x::TEXT, x#>>'{}', x::NUMERIC * 5, jsonb_typeof(x)
FROM test
WHERE jsonb_typeof(x) = 'number';
 x  | x  | ?column? | ?column? | jsonb_typeof
 2  | 2  | 2        |       10 | number
 30 | 30 | 30       |      150 | number
 40 | 40 | 40       |      200 | number
-- Use #>> to remove double-quotes from the jsonb string
SELECT x, x::TEXT, x#>>'{}', (x#>>'{}')::NUMERIC * 5, jsonb_typeof(x)
FROM test
WHERE jsonb_typeof(x) = 'number' OR
      jsonb_typeof(x) = 'string';
  x  |  x  | ?column? | ?column? | jsonb_typeof
 "1" | "1" | 1        |        5 | string
 2   | 2   | 2        |       10 | number
 30  | 30  | 30       |      150 | number
 40  | 40  | 40       |      200 | number

The first insert fails because 4f1 isn't double-quoted and can't be cast to a json numeric, but 4e1 can because it represents exponential notation. In trying to cast all values to numeric in the select, the "1" has double-quotes, so it fails. Using #>> removes the double quotes and allows the string value to be cast to numeric too. This discussion exposes the confusion of using json without casting.

I have been using single jsonb values, but the same behavior happens with jsonb documents:

-- start with an empty table
TRUNCATE test;
-- create document with keys 'a' and 'b'
INSERT INTO test VALUES ('{"a": "xyz", "b": 5}');
-- access key 'a'
SELECT x->'a', jsonb_typeof(x->'a'), pg_typeof(x->'a'), (x->'a')::TEXT, x->>'a', pg_typeof(x->>'a') FROM test;
 ?column? | jsonb_typeof | pg_typeof | text  | ?column? | pg_typeof
 "xyz"    | string       | jsonb     | "xyz" | xyz      | text
-- access key 'b'
SELECT x->'b', jsonb_typeof(x->'b'), pg_typeof(x->'b'), (x->'b')::TEXT, x->>'b', pg_typeof(x->>'b') FROM test;
 ?column? | jsonb_typeof | pg_typeof | text | ?column? | pg_typeof
 5        | number       | jsonb     | 5    | 5        | text

So, in summary:

  • If you want to use json values as text, use ->> or #>>
  • If you want to cast to another sql data type, use :: or cast(), but be sure all values can be cast to the new sql data type

Thoughts Two Interviews

Friday, July 10, 2020

I have done two interviews in the past month. The first one was done by the Linux Inlaws and was published on Hacker Public Radio. This interview discusses the history and open source aspects of the Postgres project, and its health and future direction.

The second interview is more personal, discussing how I got involved in computers and my early experiences with Postgres. It also discusses technology disruption, and its challenge to time management. The final quarter covers religious topics.


News Postgres Marketing

Wednesday, July 8, 2020

Postgres is mostly a technology-driven community, so marketing often suffers. However, one great thing about the community is that it is distributed, so anyone can get involved and help. The community has produced some very successful community-driven marketing items over the years; here are front and rear images of these objects. Of course, I have also accumulated many Postgres pins. I am inspired by these marketing efforts and hope they continue.


Indexing Boolean Indexes

Friday, July 3, 2020

For btree and hash indexes to be used for lookups, values being requested must be very restrictive, roughly 3-5% of a table's rows. Let's consider a boolean column — it can contain only three values: true, false, and null. By definition, at least one of those three values will be in more than 5% of the table's rows, so why index them? A better solution, particularly for boolean fields, is to create a partial index so only rare values are indexed. Partial indexes can also be used to index non-boolean fields when there are a large number of common values that aren't worth indexing.
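
For example, a minimal sketch of a partial index, using a hypothetical orders table where only a few rows are flagged:

```sql
CREATE TABLE orders (id INTEGER, flagged BOOLEAN);
-- Only the rare TRUE values are stored in the index
CREATE INDEX orders_flagged_idx ON orders (flagged) WHERE flagged;
-- This query can use the small partial index
SELECT * FROM orders WHERE flagged;
```

The index stays tiny because the common FALSE and NULL rows are never indexed.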


Indexing Global Indexes

Wednesday, July 1, 2020

Postgres indexes can only be defined on single tables. Why would you want to have indexes that reference multiple tables, i.e., global indexes?

This email covers some of the reasons why global indexes can be useful. One big use-case is the ability to create indexes on partitioned tables that index all its child tables, rather than requiring a separate index on each child table. This would allow references to partitioned tables as foreign keys without requiring the partition key to be part of the foreign key reference; Postgres 12 allows such foreign keys if they match partition keys.

A second use-case for global indexes is the ability to add a uniqueness constraint to a partitioned table where the unique columns are not part of the partition key. A third use-case is the ability to index values that only appear in a few partitions, and are not part of the partition key. A global index would avoid having to check each partition table's index for the desired value.

It is still unclear if these use-cases justify the architectural changes needed to enable global indexes. Some of these features can be simulated using triggers and user lookup tables. A large global index might also reintroduce problems that prompted the creation of partitioning in the first place.


Hardware Hardware Acceleration for Databases

Monday, June 29, 2020

There is a long history of hardware acceleration, i.e., hardware modules helping the cpu. There was the 80287 math coprocessor, sound cards, and video cards. The computer industry is constantly moving things from the cpu to the motherboard and external cards, and back again. Movement is mostly determined by whether the cpu is able to efficiently perform the task, the transfer bandwidth needed to perform the task, and the flexibility of replaceable external cards.

This year, the big question for database software is if and how to make use of graphics processing units (gpus) and field-programmable gate arrays (fpgas). This article does a good job of explaining the history of hardware acceleration, and mentions Netezza's (no longer used) use of fpgas as hardware acceleration for databases.

The same historic hardware acceleration questions apply to database acceleration today:

  • Are they better suited than cpus to do some database processing tasks, and how common are those tasks?
  • Is there sufficient transfer bandwidth to gpus and fpgas to justify their use?
  • Is hardware acceleration worth the deployment complexity?

PgOpenCL, PG-Strom, and HeteroDB are projects that are experimenting with the value of gpus and fpgas in Postgres. As cloud providers increase the availability of gpus and fpgas, we might start to see their usage increase.

View or Post Comments

Data modeling Can Case Comparison Be Controlled?

Friday, June 26, 2020

Computer tasks are one of the most precise activities we do on a daily basis. Driving, cooking, walking, and reading are fairly imprecise compared to computer interaction.

Computers represent symbols like "a" and "A" precisely and require external facilities to define relationships between them. This email thread makes a convincing argument that you usually want case-preserving, but less-precise case-insensitive behavior.

Let's go over some Postgres case-precision behaviors like the handling of character strings, identifiers, and keywords. For example, these queries do the same thing:

select count(*) from pg_class;
Select Count(*) From Pg_Class;

This is because Postgres, and the sql standard, ignore the case of keywords, e.g., select. They also ignore the case of identifiers, e.g., pg_class, when they are not double-quoted. Double-quoting adds case precision to identifiers:

SELECT "count"() FROM "pg_class";
SELECT "COUNT"() FROM "pg_class";
ERROR:  function COUNT() does not exist
LINE 1: SELECT "COUNT"() FROM "pg_class";
HINT:  No function matches the given name and argument types. You might need to add explicit type casts.
SELECT "Count"() FROM "Pg_Class";
ERROR:  relation "Pg_Class" does not exist
LINE 1: SELECT "Count"() FROM "Pg_Class";
SELECT "Count"() FROM "PG_CLASS";
ERROR:  relation "PG_CLASS" does not exist

There is no ability to add case precision to keywords:

"SELECT" COUNT() FROM pg_class;
ERROR:  syntax error at or near ""SELECT""
LINE 1: "SELECT" COUNT() FROM pg_class;

When comparing values, Postgres is precise by default:

SELECT 'a' = 'A';
SELECT '-' = '_';
SELECT '.' = ',';

For certain symbols, it is sometimes visually hard to see the difference.

As shown above, double-quoting adds precision to identifiers. For value comparisons, you have to be explicit to remove precision:

SELECT upper('a') = upper('A');
SELECT lower('a') = lower('A');

Most people aren't comparing constants in sql, but rather column values:

SELECT oid FROM pg_class WHERE relname = 'pg_class';
SELECT oid FROM PG_CLASS WHERE relname = 'pg_class';
SELECT oid FROM pg_class WHERE relname = 'Pg_Class';
SELECT oid FROM pg_class WHERE relname = 'PG_CLASS';

Notice that these queries use pg_class as an identifier (without single quotes) and as a value (with single quotes). The identifier usage is case insensitive; the value usage is case sensitive. You can explicitly reduce comparison precision using function calls:

SELECT oid FROM pg_class WHERE lower(relname) = lower('pg_class');
SELECT oid FROM pg_class WHERE lower(relname) = lower('Pg_Class');
SELECT oid FROM pg_class WHERE lower(relname) = lower('PG_CLASS');

These convert the column values and constants to lower case before comparison. (Upper case could also have been used.) Indexes are also case sensitive by default, and it obviously would be inefficient to lower-case every index entry for comparison, so function calls on columns cannot use an ordinary column index:

EXPLAIN SELECT oid FROM pg_class WHERE relname = 'pg_class';
                                        QUERY PLAN
 Index Scan using pg_class_relname_nsp_index on pg_class  (cost=0.27..8.29 rows=1 width=4)
   Index Cond: (relname = 'pg_class'::name)
EXPLAIN SELECT oid FROM pg_class WHERE lower(relname) = lower('pg_class');
                       QUERY PLAN
 Seq Scan on pg_class  (cost=0.00..19.76 rows=2 width=4)
   Filter: (lower((relname)::text) = 'pg_class'::text)

They can use expression indexes that are created to match function calls:

-- create a user table because users can't create indexes on system tables
CREATE TABLE my_pg_class AS SELECT * FROM pg_class;
-- create non-expression index
CREATE INDEX i_my_pg_class_relname ON my_pg_class (relname);
-- create expression index
CREATE INDEX i_my_pg_class_relname_lower ON my_pg_class ((lower(relname)));
-- The optimizer needs statistics.  Doing the ANALYZE after the expression
-- index creation allows statistics to be created for the expression.
-- Autovacuum would have eventually done this automatically.
ANALYZE my_pg_class;
-- use non-expression index
EXPLAIN SELECT oid FROM my_pg_class WHERE relname = 'pg_class';
                                       QUERY PLAN
 Index Scan using i_my_pg_class_relname on my_pg_class  (cost=0.27..8.29 rows=1 width=4)
   Index Cond: (relname = 'pg_class'::name)
-- use expression index
EXPLAIN SELECT oid FROM my_pg_class WHERE lower(relname) = lower('pg_class');
                                          QUERY PLAN
 Index Scan using i_my_pg_class_relname_lower on my_pg_class  (cost=0.27..8.29 rows=1 width=4)
   Index Cond: (lower((relname)::text) = 'pg_class'::text)

The citext extension allows the creation of columns whose values are automatically compared in a case-insensitive manner:

-- 'x' column added so the row has a typical length
CREATE TABLE my_pg_class2 AS SELECT oid, relname::citext, repeat('x', 256) FROM pg_class;
CREATE INDEX i_my_pg_class_relname2 ON my_pg_class2 (relname);
ANALYZE my_pg_class2;
\d my_pg_class2
            Table "public.my_pg_class2"
 Column  |  Type  | Collation | Nullable | Default
---------+--------+-----------+----------+---------
 oid     | oid    |           |          |
 relname | citext | C         |          |
 repeat  | text   |           |          |
Indexes:
    "i_my_pg_class_relname2" btree (relname)
SELECT oid FROM my_pg_class2 WHERE relname = 'pg_class';
SELECT oid FROM my_pg_class2 WHERE relname = 'PG_CLASS';
EXPLAIN SELECT oid FROM my_pg_class2 WHERE relname = 'pg_class';
                                        QUERY PLAN
 Index Scan using i_my_pg_class_relname2 on my_pg_class2  (cost=0.27..8.29 rows=1 width=4)
   Index Cond: (relname = 'pg_class'::citext)
EXPLAIN SELECT oid FROM my_pg_class2 WHERE relname = 'PG_CLASS';
                                        QUERY PLAN
 Index Scan using i_my_pg_class_relname2 on my_pg_class2  (cost=0.27..8.29 rows=1 width=4)
   Index Cond: (relname = 'PG_CLASS'::citext)

There are more facilities available to further reduce precision:

If Postgres 12 or later is compiled with icu library support (check the system table column pg_collation.collprovider), you can use nondeterministic collations that are case- and accent-insensitive.
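As a sketch (the collation name is arbitrary), a nondeterministic icu collation gives case-insensitive comparisons without function calls:

```sql
-- requires Postgres 12+ with icu support
CREATE COLLATION case_insensitive (
    provider = icu,
    locale = 'und-u-ks-level2',  -- secondary strength: case is ignored
    deterministic = false
);

SELECT 'a' = 'A' COLLATE case_insensitive;  -- true
```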

View or Post Comments

Data modeling Force One Row

Monday, June 22, 2020

How can you force a table to have at most one row? It is actually very easy by creating a unique expression index on a constant, with no column name references:

-- create the table (definition inferred from the inserts below) and
-- the unique expression index on a constant
CREATE TABLE onerow (a INTEGER, b TEXT);
CREATE UNIQUE INDEX onerow_expr_idx ON onerow ((1));
-- this adds a single row
INSERT INTO onerow VALUES (1, 'foo')
        ON CONFLICT ((1)) DO UPDATE SET a = excluded.a, b = excluded.b;
SELECT * FROM onerow;
 a |  b
 1 | foo
-- this updates the single row
INSERT INTO onerow VALUES (2, 'bar')
        ON CONFLICT ((1)) DO UPDATE SET a = excluded.a, b = excluded.b;
SELECT * FROM onerow;
 a |  b
 2 | bar
-- this also updates the single row
INSERT INTO onerow VALUES (3, 'baz')
        ON CONFLICT ((1)) DO UPDATE SET a = excluded.a, b = excluded.b;
SELECT * FROM onerow;
 a |  b
 3 | baz
-- try INSERT without ON CONFLICT
INSERT INTO onerow VALUES (4, 'foo2');
ERROR:  duplicate key value violates unique constraint "onerow_expr_idx"
DETAIL:  Key ((1))=(1) already exists.

By using on conflict, it is possible to add a row, but if a row already exists, to replace it with a new value.

View or Post Comments

Data modeling Storing Binary Data in the Database

Friday, June 19, 2020

There are some very good responses in an email thread about whether to store binary data in Postgres or externally. The binary storage options discussed were:

  • In the database (toast helps with performance)
  • In another database, like SQLite
  • In a local or network file system
  • In cloud storage

This email reply had many good insights, and this wiki page has even more. I have covered data storage outside of databases before.

View or Post Comments

Conference Dinner Q&A

Wednesday, June 17, 2020

My employer, EnterpriseDB, has been organizing events where potential customers and interested people can ask me questions while enjoying a meal. I thought the idea was strange, but I have done it ten times, and they have gone very well. The Q&A portion usually lasts one hour and forty-five minutes. During a November, 2019 event in Utrecht, the Netherlands, a sketch artist was present. The artist illustrated my ideas as I spoke and created this diagram, which I found quite interesting.

View or Post Comments

Client Controlling Server Variables at Connection Time

Monday, June 15, 2020

I have recently covered the importance of libpq environment variables and connection specification options. While most libpq options control how to connect to the Postgres server, there is one special option that can change variables on the server you connect to, e.g.:

$ psql 'options=-cwork_mem=100MB dbname=test'
psql (13devel)
Type "help" for help.
test=> SHOW work_mem;
 work_mem
----------
 100MB

This can also be done using environment variables, with all the benefits of environment variables:

$ PGOPTIONS="-c work_mem=100MB" psql test

These settings can also be set at the user, database, and cluster level on the server side, but having control on the client side is often useful.
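For reference, the server-side equivalents might look like this (the role and database names are examples):

```sql
ALTER ROLE bruce SET work_mem = '100MB';     -- user level
ALTER DATABASE test SET work_mem = '100MB';  -- database level
ALTER SYSTEM SET work_mem = '100MB';         -- cluster level
```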

View or Post Comments

Client Connect Parameter Specification Options

Friday, June 12, 2020

I have previously covered the importance of libpq and environment variables. While you can specify discrete connection command-line parameters and environment variables, there is a catch-all setting that allows connection options to be specified in a single string, e.g.:

$ psql -d test
psql (13devel)
Type "help" for help.
test=> \q
$ psql --dbname test
psql (13devel)
Type "help" for help.
test=> \q
$ psql 'dbname=test'
psql (13devel)
Type "help" for help.
test=> \q

The first psql command uses a single-letter command-line option. The second one uses a long-format option. The third uses a parameter key word. Multiple key words can be used to specify multiple connection options:

$ psql ' port=5433 dbname=test'

You can also use a uri syntax to specify the same parameters as above:

$ psql postgresql://

View or Post Comments

Client Controlling Connection Parameters Using Environment Variables

Wednesday, June 10, 2020

Libpq is the Postgres connection library used by almost every non-jdbc application. It allows many connection parameters, which can be specified on the command line or embedded in applications:

$ psql -h -d mydb

In the above case, the psql host name and database name are specified on the command-line and interpreted by libpq. However, it is also possible to specify parameters using environment variables, which are also interpreted by libpq:

$ PGDATABASE=mydb psql

There is obviously no value in specifying libpq parameters using environment variables in this example, but there are use cases. For example, if you want to perform multiple operations on the same host name and database, you can do:

$ export
$ export PGDATABASE=mydb
$ vacuumdb
$ reindexdb

This avoids specifying the host and database names multiple times, though with a loss of clarity. Sometimes environment variables are best used as defaults when connection options are not specified:

$ export
$ export PGDATABASE=mydb
$ reindexdb
$ reindexdb --dbname mydb2
$ reindexdb --host --dbname mydb3

This reindexes databases mydb and mydb2 on host, and database mydb3 on

Another use-case for environment variables is to set parameters for users, without having to pass them as parameters to commands:

$ # must use 'sh' so the redirect happens as root
$ sudo sh -c "echo PGHOST='' >> ~bruce/.profile"

By appending this to bruce's .profile file, all applications that use libpq without a specific host name will connect to automatically. Once bruce logs out and back in again, all his applications will start using the new .profile pghost setting.

Finally, environment variables make it possible to set default connection values for all users. For example, on Debian, to default all tcp (non-Unix Domain socket) connections to fully verify ssl certificates, you can do:

$ sudo sh -c "echo PGSSLMODE='verify-full' >> /etc/profile.d/"

Of course, you can change environment variables set at login and applications can override connection parameters set by environment variables.

View or Post Comments

Administration Safety Systems Can Reduce Safety

Monday, June 8, 2020

What is the purpose of safety systems? To make things safer? To make them appear safer? To satisfy some external requirement? The purpose of safety systems is not always clear, but even for safety systems whose sole purpose is to increase safety — do they always succeed in increasing safety? The simple answer is "no". Here are three examples:

In all three cases, safety systems did not prevent disasters — they caused them. Safety systems are often very useful, and the modern world could not operate without them. However, they also add complexity, and that added complexity can introduce failure modes that did not exist without the safety systems.

So, where does that leave us? Safety systems are useful, but are too many of them bad? How many is too many? These are hard questions, but there are some guidelines:

  • How serious is the failure that the safety system is trying to prevent?
  • How likely is the failure that the safety system is trying to prevent?
  • How likely is the safety system to fail?
  • What impact will failure of the safety system have on the overall system?

This gets into a very complex calculus where you are computing the likelihood and seriousness of the failure that the safety system is trying to prevent, and the likelihood and seriousness of safety system failure. The big point is that while you are computing the likelihood and seriousness of failures and adding safety systems to compensate for them, you have to be aware of the cost of adding those safety systems, in both the likelihood and impact of their failure.

What does this have to do with databases? Well, you might have a database on an airplane or at a nuclear reactor site, in which case the database is part of a complex system. However, databases are also complex systems, and we regularly add safety systems to increase their reliability. How often do we consider the cost of those safety systems, in terms of the likelihood and seriousness of failures? Let's look at some typical database safety systems:

  • Backups
  • Error reporting
  • Replication
  • Connection pooling

Let's walk through the calculus for backups:

  • How serious is the failure that the safety system is trying to prevent? High
  • How likely is the failure that the safety system is trying to prevent? High
  • How reliable is the safety system? Medium
  • What impact will failure of the safety system have on the overall system? Low

The last question is not about the seriousness of a backup failing to perform its safety goal, but rather how likely a backup is to cause a failure on its own. It could fill up the disk with backup files, or cause too much load on the system, but those risks are low. Similarly, error reporting has minimal risk of destabilizing the system, except for consuming I/O and storage space.

Replication and connection poolers are in a different class of risk. Replication's goal is to allow for controlled switchover or failover in case of primary server failure, but what is its reliability and the impact if it fails? If synchronous replication is used, failure to replicate will cause the system to stop accepting writes. This can be caused by a network failure to the replicas, a replica outage, or even misconfiguration. Even failure of asynchronous replication can cause the write-ahead log directory to fill up storage, causing an outage.

Similarly, the failure of a connection pooler can cause a total outage. You can use multiple poolers, but what if the failure of one pooler prevents the other from working, or if they both work but conflict with each other? Multiple poolers also add complexity to the system, which makes debugging other problems harder. This is a great example where, to avoid the problems of safety system failure, you create two safety systems, but the two safety systems can interact in ways that make them less reliable than one safety system. Multi-master replication can have similar downsides. Even auto-failover has risks, and auto-failback even more.

The bottom line is that safety systems can be useful, but they can also add complexity which makes systems more fragile and harder to control; consider how safety systems interact and implement them only where their value is clear.

View or Post Comments

Administration When Does a Commit Happen?

Wednesday, June 3, 2020

Most people who deal with relational databases think of transaction commits as binary operations — the query is running and not yet committed, then it is completed and committed. However, internally, there are many stages to a commit:

  1. Write commit record to the write-ahead log
  2. Flush the write-ahead log to durable storage
  3. Update the pg_xact (slide 57)
  4. Transfer to replicas
  5. Mark the commit as visible to other sessions (ProcArrayEndTransaction() updates PGPROC)
  6. Communicate commit to the client

These steps are implemented in RecordTransactionCommit().

What is interesting is that this process can be interrupted at any time, perhaps by a server crash or network failure. For example:

  • The commit could be flushed to disk but not yet visible to other clients
  • The commit could be replicated (and visible to queries on replicas) but not visible to queries on the primary
  • Queries could appear committed to other sessions before the client issuing the query receives notification

This email thread explains the problem that commit with standbys is not always an atomic operation. Postgres has the function txid_status() which allows clients to check if a transaction, perhaps from a previous disconnected session, was committed.
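As a sketch of how a client might use it (the table and transaction id are hypothetical), txid_status() reports 'committed', 'aborted', or 'in progress':

```sql
BEGIN;
-- record the transaction id before doing work
SELECT txid_current();
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
COMMIT;

-- after a disconnection, a new session can check whether it committed,
-- e.g., if txid_current() above returned 568
SELECT txid_status(568);
```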

View or Post Comments

Presentation Lessons from the Online Conference Trenches

Monday, June 1, 2020

Having presented at two online conferences in the past two weeks, presenting at two this coming week, and presenting at many edb-sponsored webinars, I have learned a few things about online presentations that might be helpful for Postgres organizers, speakers, and attendees:

For Organizers: With no need for speakers and attendees to travel to online conferences, there are many more potential speakers available, and many more potential attendees, than for in-person conferences. However, the technical challenges of hosting an online conference are significant because producing and consuming content can require multiple platforms that must be integrated seamlessly for a positive user experience. The content production platform, e.g., Zoom, is often different from the consumption platform, e.g., YouTube, Facebook Live. If the user experience is poor, people will leave because they are not bound to a physical location as they would be at an in-person event. Adjusting to the time zones of speakers and attendees can be complex — doing a 24-hour conference like Percona Live (organizer tips) solves many of the time zone problems, but requires moderators from many time zones. The moderator's job is much more extensive for online conferences since they control access, deal with technical problems, and manage the all-important chat channel. For online conferences, chat is the best way to promote attendee engagement. If chat is done well, user engagement during presentations can be even better than at in-person conferences.

For Speakers: Just like organizers, speakers face more technical challenges than at in-person conferences — it is harder to engage the audience, more things can go wrong, and attendees can more easily leave. As a speaker, I have a checklist that I always reference before each presentation:

  • Mute phone, chat, email, and upgrade notifications
  • Use a laptop on AC power with wired Ethernet, for reliability
  • Have a count-up clock to keep track of the talk duration
  • Use a headset so you don't lean toward a microphone
  • Use a presentation remote so you don't lean forward to change slides
  • Turn on your video camera to increase audience engagement
  • Make sure the lighting is good and the background is uncluttered

For Attendees: Many online conferences are a mix of different technologies, not always seamlessly integrated, so anticipate that it will take time to get initially connected. Consider watching the conference on a large television, or from a tablet you can carry around. Use chat to engage with the speaker and other attendees. Feel free to switch to a more interesting presentation without guilt. When you get frustrated, consider how much time you are saving by not having to travel.

View or Post Comments

Administration Visualizing Collations

Friday, May 29, 2020

There is still significant confusion about character sets, encodings, and collations. This is because in the real, non-digital world, we usually treat languages, their characters, and their ordering as unified, but in the digital world they are treated separately, and the distinction can be hard to visualize.

These two posted queries illustrate collation in a very creative way. The first query, SELECT chr(i) FROM generate_series(1, 50000) i ORDER BY chr(i) COLLATE "C", outputs characters in their "C" binary order, with ascii as the first 128 characters, successive groups of languages following, and ending with pictographic languages.

The second query, SELECT chr(i) FROM generate_series(1, 50000) i ORDER BY chr(i) COLLATE "en_US.utf8", outputs the same 50,000 characters in "United States English" utf8 order. The output starts with pictographic languages, not ascii. The Latin alphabet appears, but not until line 19068. What is interesting is that there are 118 symbols grouped together that look like 'a', 'a' with diacritics, or have 'a' as part of their symbol. Then 'b' appears with a group of 38 symbols that look like or use 'b', and so on through the Latin alphabet. (If you highlight a character and paste it into a search box, Google will tell you about that Unicode symbol.)

I found it interesting that it groups letters that look like Latin letters, even if they are not from Latin alphabets and don't sound like Latin letters. Cyrillic is grouped in a section after the Latin alphabet section. These sql queries are the clearest example I have seen of collation ordering.

If I had used a different collation instead of "United States English", there would have been a different ordering. This is why index storage is sensitive to collations, i.e., indexes with different collations store the same strings in a different order. Collation also affects other things, like upper/lower case processing, the ordering of query output, and certain optimizations.
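A small example, assuming the en_US.utf8 collation is installed, makes the ordering difference concrete:

```sql
SELECT s FROM (VALUES ('a'), ('B'), ('A'), ('b')) v(s)
ORDER BY s COLLATE "C";           -- A, B, a, b

SELECT s FROM (VALUES ('a'), ('B'), ('A'), ('b')) v(s)
ORDER BY s COLLATE "en_US.utf8";  -- a, A, b, B
```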

View or Post Comments

Administration What is an Lsn?

Wednesday, May 27, 2020

You might have seen that there is a pg_lsn data type:

test=> \dTS pg_lsn
              List of data types
   Schema   |  Name  |       Description
 pg_catalog | pg_lsn | PostgreSQL LSN datatype

Client programs pg_receivewal and pg_recvlogical have options that take lsn values, but what is an lsn? It stands for "Log Sequence Number" — it is a 64-bit value that represents a position in the write-ahead log. It is usually displayed as two 32-bit hex values, separated by a slash. For example, pg_controldata displays lsn values:

$ pg_controldata
Latest checkpoint location:           0/15AE1B8
Latest checkpoint's REDO location:    0/15AE180
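Because pg_lsn is a real data type, lsn arithmetic is easy; for example, the distance in bytes between the two checkpoint locations above:

```sql
-- subtracting two lsns yields the number of bytes between them
SELECT '0/15AE1B8'::pg_lsn - '0/15AE180'::pg_lsn;  -- 56

-- the current write-ahead log insert location is also an lsn
SELECT pg_current_wal_lsn();
```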

So, the next time you view or need to specify a write-ahead log location, you are using an lsn.

View or Post Comments

Administration Taking Snapshots of Clusters Which Use Tablespaces

Monday, May 25, 2020

Postgres already documents the ability to back up the database cluster using file system snapshots. Unfortunately, database clusters that use tablespaces often cannot use this method if the storage system doesn't support simultaneous snapshots across file systems.

However, simultaneous snapshots across file systems might not be a hard requirement for Postgres snapshot backups. It might be possible for snapshots to be non-simultaneous as long as the write-ahead log that spans the time frame between snapshots is included in the backup, and checkpoints do not happen during that time frame.

Internally, starting Postgres from a snapshot backup replays write-ahead log records to make a single file system snapshot consistent. Potentially, it could do the same for non-simultaneous snapshots of multiple file systems. However, documenting this, giving users a reliable list of steps to perform, and making sure it always works is probably too complex to justify.

View or Post Comments

Administration Moving Tables, Indexes, and Tablespaces Between Clusters

Friday, May 22, 2020

Currently, it is impossible to move tables, indexes, and entire tablespaces from one cluster to another — that is because each table and index file is bound to the cluster's infrastructure:

  • Table and index definitions
  • pg_xact (commit/abort/in-progress transaction status records)
  • pg_multixact (used for multi-session row locking)

Fyi, you can easily move tablespaces to new directories as long as they remain in the same cluster, and move tables and indexes between tablespaces.
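For example (the table, index, and tablespace names are hypothetical):

```sql
-- move a single table or index to another tablespace
ALTER TABLE mytable SET TABLESPACE fast_ssd;
ALTER INDEX mytable_pkey SET TABLESPACE fast_ssd;

-- move every table in one tablespace to another
ALTER TABLE ALL IN TABLESPACE old_space SET TABLESPACE fast_ssd;
```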

So, how could it be made possible? Freezing can remove references to pg_xact and pg_multixact, assuming there are no active transactions during the freeze operation. Moving table and index definitions is more complex, but it certainly seems possible. This requires more research.

View or Post Comments

Administration Why Pgdata Should Not Be at the Top of a Mount Point

Wednesday, May 20, 2020

This email thread is illustrative of why it is unwise to place the Postgres data directory (pgdata) at the top of a mount point. Instead, create a subdirectory under the mount point and put pgdata there. This has the advantage of avoiding possible data corruption if mounting fails, and allows more efficient use of pg_upgrade.

View or Post Comments

Conference Percona Live Online

Monday, May 18, 2020

I am planning to virtually attend and present at the Percona Live Online conference tomorrow, May 19. It starts at 10am, Eastern us time, and spans 24 hours, so it covers every time zone. I am giving my Will Postgres Live Forever? presentation at noon, Eastern us time.

Attendance is free, so you might want to check it out. I saw some interesting topics on the program. I am also curious to experience a 24-hour virtual conference, though I am unlikely to remain awake that long.

View or Post Comments

Administration Using Non-Login Roles

Monday, May 18, 2020

When we talk about database roles, most people immediately think of login roles, which allow people to log in. However, another user management feature is the ability to create non-login roles, formerly called groups. Non-login roles can also be assigned permissions, e.g., via grant, and can have login roles as members. Non-login roles can even be members of other non-login roles.

What is the value of using non-login roles? They allow a group of people to be assigned as members of a non-login role, and that role can be used to abstract permission assignment. For example, if you have shop foremen, you can make the login roles of all foremen members of a non-login foreman role. As people are added to and removed from that staff position, their login roles can be added to or removed from the non-login role without any need to change the permissions assigned to the foreman role.
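A minimal sketch of that setup (the object and role names are hypothetical):

```sql
-- a non-login role that carries the permissions
CREATE ROLE foreman NOLOGIN;
GRANT SELECT, UPDATE ON work_orders TO foreman;

-- a login role for a person; membership grants foreman's permissions
CREATE ROLE alice LOGIN;
GRANT foreman TO alice;

-- when alice changes positions, only the membership changes
REVOKE foreman FROM alice;
```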

A further advantage of non-login roles, as explained in a recent email, is that Postgres can start to suffer performance problems if more than a few dozen roles are granted permission on an object. A much simpler and more manageable solution is to add users to a non-login role and assign object permissions to that non-login role.

View or Post Comments

News Draft of Postgres 13 Release Notes

Friday, May 15, 2020

I have completed the draft version of the Postgres 13 release notes, containing 181 items. The release notes will be continually updated until the final release, which is expected to be in September or October of this year. Beta testing will start in the next few weeks.

View or Post Comments

Internals Avoiding Cache Wipe, Synchronized Scans

Wednesday, May 13, 2020

Whenever you are dealing with a lot of data, it helps to cache it. Postgres does this using shared_buffers. However, one risk of caching data is that a large query that accesses a lot of data might remove frequently-accessed data from the cache; this is called cache wipe. To avoid this, Postgres limits the number of shared buffers used by data operations that are expected to access a lot of data.

Looking at the C function GetAccessStrategy(), you can see there are four shared buffer access strategies. The first one, bas_normal, is used for normal scans; the rest exist to avoid cache wipe:

  • bas_bulkread is enabled for large reads and uses 256 kilobytes of shared buffers (typically 32 8kB shared buffers)
  • bas_bulkwrite is enabled for large writes and uses 16 megabytes of shared buffers
  • bas_vacuum is for vacuum operations and uses 256 kilobytes of shared buffers

For example, function initscan() uses bas_bulkread if the scan is expected to access more than one-quarter of shared buffers. Similarly, table rewrites, create table as, and copy from use bas_bulkwrite. Bas_bulkwrite is larger because we can't discard written buffers from the cache until they are written to storage, unlike unmodified buffers which can be discarded anytime.

You might wonder, with operations using so few shared buffers, what happens if another session needs to scan the same data? Doesn't performance suffer? Well, another Postgres facility, that was developed independently, helps with this: synchronized scans. The top of syncscan.c explains it well:

When multiple backends run a sequential scan on the same table, we try to keep them synchronized to reduce the overall I/O needed. The goal is to read each page into shared buffer cache only once, and let all backends that take part in the shared scan process the page before it falls out of the cache.

Both synchronized scans and shared buffer access strategies work automatically, so most Postgres users don't even know they exist, but they do improve Postgres performance.

View or Post Comments

Internals Why Do We Freeze?

Monday, May 11, 2020

You might have seen autovacuum running, and noticed that it sometimes performs freeze operations on transaction ids (32 bits) and multi-xacts (used for multi-session row locking). The frequency of freeze operations is controlled by autovacuum_freeze_max_age and autovacuum_multixact_freeze_max_age. You can reduce the frequency of freezing if you are sure the freeze operation will complete before transaction wraparound is reached.
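To see how close each table is to a forced freeze, you can compare the age of its relfrozenxid to the threshold; a sketch:

```sql
-- tables with the oldest frozen transaction ids; autovacuum forces
-- a freeze as age approaches autovacuum_freeze_max_age
SELECT relname, age(relfrozenxid)
FROM pg_class
WHERE relkind = 'r'
ORDER BY 2 DESC
LIMIT 5;

SHOW autovacuum_freeze_max_age;  -- 200 million by default
```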

There is regular discussion about how freezing could be avoided, and this email does the best job of explaining the options. We could expand transaction ids to 64 bits, either on each row or perhaps with a page-level default, but pg_xact (commit/abort/in-progress transaction status records) would still need cleanup. This area probably needs more thought.

View or Post Comments

Internals Postgres Internals Website

Friday, May 8, 2020

I am often asked how someone can learn more about Postgres, particularly the internals. There is the Postgres developer page, which has links to many resources, and the developer's faq item about learning about the internals. One link on that page I was not aware of is Hironobu Suzuki's very detailed website about Postgres internals. It has a lot of details I have never seen written before, so I suggest those interested should check it out.

View or Post Comments

Internals Portability's Surprising Win

Wednesday, May 6, 2020

When writing software, it is necessary to decide whether to use external facilities available in command-line tools, libraries, frameworks, and the operating system, or to write the facilities yourself. Why would you write them yourself? You might be worried about adding a dependency on an external facility, or a facility might not offer sufficient flexibility or performance.

The Postgres development team has had to make similar decisions. Fortunately, we have tended to favor reliance on common operating system interfaces, tools, and libraries, e.g., OpenSSL, bison. We have avoided reliance on external facilities that are uncommon or not well maintained.

Postgres has reproduced facilities that were commonly available in the operating system, tools, or libraries only when there was a clear benefit. Reproducing such facilities for a small benefit, like a little more control or a little more performance, is rarely wise. While relying on external facilities often makes Postgres less flexible and perhaps less performant, there are long-term benefits:

  • As external facilities improve their feature-set and performance, Postgres benefits from these improvements with little effort
  • As new demands are required of these external facilities, Postgres again benefits effortlessly

Let's be specific. In the early days of Postgres, file systems were not optimized for database storage. Fortunately, Postgres never implemented its own file system on raw devices. Now that modern file systems, like ext4, give good database performance, Postgres benefits from file system improvements with almost no effort. Even when ssds started being used, the only change needed in Postgres was the ability to set random_page_cost at the tablespace level to handle databases where some tablespaces are on ssds and some are on magnetic storage.

When virtualization, cloud, containers, and container orchestration (e.g., Kubernetes) became popular, Postgres had to do almost nothing to run well on these platforms. (It is true that some of the Postgres enterprise deployment tools required repackaging and re-engineering.)

Because of this philosophy, Postgres has remained relatively light-weight compared to other relational database systems, and this has benefited Postgres in environments where nimble deployments are favored.

View or Post Comments

Performance With ... Materialized and Optimizer Control

Monday, May 4, 2020

Before Postgres 12, queries specified as common table expressions (with clauses) always behaved as optimization barriers, meaning that common table expression queries were executed independently, and were not moved to later parts of the query.

Starting in Postgres 12, if a common table expression is referenced only once, and the keyword materialized is not used, it can be moved to a place later in the query where it can be better optimized; this improves optimization possibilities. However, if the movement of common table expression queries increases the from clause table count above the geqo_threshold, Postgres will decide it can't efficiently optimize such a high table count query and will use the genetic query optimizer.

So, while the new Postgres 12 behavior of in-lining common table expressions usually increases the quality of optimized plans, in some cases it can decrease plan quality by triggering the genetic query optimizer. In a way, in pre-Postgres 12, or when materialized is used, the query author does the optimization by creating common table expressions; in other cases the optimizer has greater control, though even the optimizer can determine the query is too complex and fall back to less-precise genetic query optimization.
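For example, with a hypothetical orders table:

```sql
-- In Postgres 12+, this single-reference common table expression is
-- in-lined, so an index on orders.customer_id can be used:
WITH recent AS (
    SELECT * FROM orders WHERE order_date > '2020-01-01'
)
SELECT * FROM recent WHERE customer_id = 42;

-- Adding MATERIALIZED restores the pre-12 optimization barrier, so
-- the common table expression is computed independently:
WITH recent AS MATERIALIZED (
    SELECT * FROM orders WHERE order_date > '2020-01-01'
)
SELECT * FROM recent WHERE customer_id = 42;
```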

View or Post Comments

Performance Background Writes

Friday, May 1, 2020

Postgres must guarantee durability and good performance. To meet these objectives, Postgres does writes to the file system and storage in the background as much as possible. In fact, there are only two major cases where writes happen in the foreground:

  1. Write-ahead log writes happen before commits are acknowledged to the client
  2. A needed shared_buffer is dirty and must be written to storage so it can be replaced

Write-ahead log writes (#1) can be controlled using various settings. Dirty shared buffer writes (#2) that happen in the foreground are minimized if the background writer is operating efficiently. You can monitor such writes by viewing the probes buffer-write-dirty-start and buffer-write-dirty-done.
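You can also compare foreground and background writes using the pg_stat_bgwriter view; buffers_backend counts dirty buffers that backends had to write themselves:

```sql
-- Ideally buffers_backend stays small relative to buffers_clean
-- (written by the background writer) and buffers_checkpoint
-- (written by the checkpointer).
SELECT buffers_checkpoint, buffers_clean, buffers_backend
FROM pg_stat_bgwriter;
```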

View or Post Comments

Performance Optimal Use of Ssds

Wednesday, April 29, 2020

Ssds have different performance characteristics than magnetic disks, and using them optimally isn't always clear. Ssds have several performance benefits:

  1. Very fast fsyncs
  2. Much faster random reads and writes
  3. Faster sequential reads and writes

So, if all your data is stored on ssds, you will certainly improve performance. If you are mixing ssds and magnetic disks, ideally you should use ssds in ways that give the greatest benefit. Starting with number one, putting the write-ahead log on ssds is a great way to improve fsync performance.

For number two, moving indexes to tablespaces using ssd storage can greatly improve performance because index access is usually random. When using ssds, the default value for random_page_cost should be lowered, perhaps to 1.1. This can be set at the tablespace level if there is a mix of tablespaces on ssds and magnetic disks.
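Both ideas can be sketched like this, assuming a hypothetical ssd_space tablespace and orders_pkey index:

```sql
-- The directory must already exist and be owned by the postgres user.
CREATE TABLESPACE ssd_space LOCATION '/ssd/pgdata';

-- Move a heavily used index to the ssd tablespace.
ALTER INDEX orders_pkey SET TABLESPACE ssd_space;

-- Lower random_page_cost only for tablespaces on ssds.
ALTER TABLESPACE ssd_space SET (random_page_cost = 1.1);
```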

For number three, it is also possible to create tablespaces on ssds for current data, and place archive data on tablespaces that use magnetic disks. By using table partitioning, a partitioned table can transparently span ssds and magnetic disk tablespaces.
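The partitioning approach can be sketched as follows, assuming hypothetical ssd_space and magnetic_space tablespaces:

```sql
CREATE TABLE measurements (
    reading_time timestamptz,
    value        numeric
) PARTITION BY RANGE (reading_time);

-- Current data on ssds
CREATE TABLE measurements_2020 PARTITION OF measurements
    FOR VALUES FROM ('2020-01-01') TO ('2021-01-01')
    TABLESPACE ssd_space;

-- Archive data on magnetic disks
CREATE TABLE measurements_2019 PARTITION OF measurements
    FOR VALUES FROM ('2019-01-01') TO ('2020-01-01')
    TABLESPACE magnetic_space;
```

Queries against measurements transparently span both storage types.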

View or Post Comments

Performance Does Postgres Support Compression?

Monday, April 27, 2020

I am often asked if Postgres supports compression, and my answer is always a complicated dance around what "compression" level they are asking about. There are six possible levels of database compression:

  1. single field
  2. across rows in a single page
  3. across rows in a single column
  4. across all columns and rows in a table
  5. across tables in a database
  6. across databases

Number one (single field) is currently done by toast. Number two (across rows in a single page) is a practical optimization where a compression routine blindly looks for repeating values in a page without understanding its structure. The difficulty of implementing this arises when a page is stored using its compressed length (rather than the uncompressed 8k), the page contents change, and the new contents compress less well than the previous contents. In this case, the compressed page contents would be larger, and it would be very complex to fit the page into the existing space in the file. A different file layout is really required for this, so pages can be placed anywhere in the file without affecting index access. A team is working on adding this feature using Postgres's table access method interface.
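Number one's per-field toast behavior can be controlled per column; a sketch, assuming a hypothetical documents table with a large body column:

```sql
-- EXTENDED (the default for compressible types) allows both
-- compression and out-of-line storage; EXTERNAL disables compression.
ALTER TABLE documents ALTER COLUMN body SET STORAGE EXTERNAL;

-- View current strategies
-- (attstorage: p=plain, m=main, e=external, x=extended).
SELECT attname, attstorage
FROM pg_attribute
WHERE attrelid = 'documents'::regclass AND attnum > 0;
```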

Number three (across rows in a single column) is the classic definition of a columnar database. A team is also working on that. Just like number two, this requires using a different storage layout than Postgres's default, and the table access method interface makes this possible.

Number four can be done using file system compression. Numbers five and six would be nice, but it is unclear how this could be done efficiently without adding unacceptable complexity to the database.

View or Post Comments

Performance Multi-Host Technologies

Friday, April 24, 2020

There are so many multi-host technologies and it is hard to remember the benefits of each one, so I decided to create a list:

High availability: Streaming replication is the simplest way to have multiple copies of your database, ready for fail over

Read scaling: Pgpool allows replicas (slide 17) to handle a database's read-only workload

Write scaling: Sharding allows for write scaling

Partial replication: Logical replication allows partial replication

Reduce latency: Multi-master replication allows servers to be located close to users, reducing transmission latency

Hopefully this is helpful to people.

View or Post Comments

Performance Performance Goalposts

Wednesday, April 22, 2020

In talking to EnterpriseDB customers, I am often asked about the performance limits of Postgres: How many connections can it handle? How many tps?

Well, those are good questions, but it is hard to give accurate answers since so much depends on the hardware and workload. Eventually, testing of actual workloads on intended hardware has to be done, but not giving ball-park answers is unhelpful. What answer can I give? Well, I came up with this chart:

  #  Measure        Threshold  Recommendation
  1  Connections    < 250      direct connect
  2  Queries        < 250      pgbouncer
  3  Write queries  < 250      Pgpool with read-only replicas (slide 17)
  4  Write queries  >= 250     sharding

Earlier items use simpler architectures than later items, and are therefore preferred.

For under 250 simultaneous connections (#1), it isn't generally necessary to use any complex architecture, though using a pooler is recommended for workloads with many short-lived sessions to reduce connection startup time. Even when under this limit, performance can be limited by the number of cpus. Oltp databases typically cannot process more simultaneous queries than five times the number of cpu cores without having to time-slice among cpus; olap workloads typically use two times the number of cpu cores, or less if parallelism is used.

For over 250 simultaneous connections but under 250 simultaneous queries (#2), a connection pooler can be beneficial because it reduces the overhead of managing many open connections. For over 250 simultaneous queries but under 250 simultaneous write queries (#3), a combination of Pgpool with read queries routed to streaming replicas can be very efficient. For over 250 simultaneous write queries (#4), a sharding solution should be considered.
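A minimal pgbouncer configuration for case #2 might look like this; the database name and pool sizes are illustrative only:

```ini
[databases]
; forward the hypothetical "appdb" to the local Postgres server
appdb = host=127.0.0.1 port=5432 dbname=appdb

[pgbouncer]
listen_port = 6432
; transaction pooling lets many client connections share
; a small number of server connections
pool_mode = transaction
max_client_conn = 1000
default_pool_size = 50
```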

These numbers can vary greatly based on workload and hardware, but I think they are good starting points to consider.

View or Post Comments

Performance Fast Enough?

Monday, April 20, 2020

Everyone wants their software to perform as fast as possible. Some people think that unless the program ends before it begins, it's too slow.

However, realistically, making something as fast as possible is not a universal good, meaning that increasing performance beyond a certain point can cause problems that far exceed the value of the improved performance. Let's give some examples:

  • Most people use high-level languages that are compiled into cpu instructions or interpreted while being run. In an ideal world, every program would be written in assembly language. (Postgres does use some assembly language code for locking.) However, writing something like a database in assembly language, though it might give slightly better performance if developers knew the behavior of every cpu, would be a huge challenge for even the most expert developers, and code maintenance and feature additions might be impossible. In general, there just isn't enough benefit to using assembly language for anything but the most discrete, performance-critical functions.
  • Java isn't often chosen for its performance, but rather its ability to allow development teams to produce complex software efficiently. Lighter-weight languages might give better performance, but they don't offer the same development efficiency as Java.
  • Sql is a heavy-weight way to request data, but it is very efficient for developers since they can express their requests in a declarative way. You could probably write a custom data storage program to run faster than sql, but the effort involved to create and maintain it would be nearly insurmountable.
  • You can even lay out data in sql in more efficient ways, and sometimes it is worth it, but data maintainability, access flexibility, and storage efficiency can suffer.

The bottom line is that performance is rarely a universal good — it has to be balanced against development time, solution flexibility, and maintainability. Sometimes people suggest that Postgres should offer some "super fancy" optimization, and sometimes we can implement it, but we always have to balance development time, solution flexibility, and maintainability with improved performance.

View or Post Comments

Community No Travel

Friday, April 17, 2020

With the Coronavirus outbreak, almost all Postgres events through June have been either cancelled, rescheduled, or moved online. This has given me time to consider my past travel. I have been blessed to visit so many places, not as a tourist, but rather a guest. I see countries more as a native than as a tourist, and I have many event organizers and hosts to thank for this.

It is hard to know when on-site events will resume, but I can remember what it was like to travel roughly 90 days a year. Going to social outings at home often felt like riding a train through my home town, waving from behind a train window to my friends on the platform. I would sit at home and wonder how long I would be there until I had to leave again. Now, sitting at home, the images of our family travels appearing on our kitchen slideshow show places that seem farther away than ever, and I wonder if I will ever see these places again.

I am sure many others have similar feelings, and I have hope that, someday, we will all return to the road to spend time together again.

View or Post Comments

Thoughts Database Interoperability at Risk

Monday, March 16, 2020

This article parallels Oracle's copying of the sql syntax from ibm in the late 1970's with Google's copying of the Java api. It also explains the possible impact of the current case soon to be decided by the US Supreme Court.

One thing it does not fully cover is the impact on sql-level interoperability between databases. If Oracle can claim the Java api as copyrightable, the sql language could be considered copyrightable, allowing ibm to sue all relational database vendors and users for copyright infringement. It might also allow database vendors to sue competitors when their sql-level features are copied, requiring either huge payments or the removal of interoperability syntax.

The Postgres copyright is open, so any database vendor copying Postgres syntax is fine, but Postgres copying the syntax of other databases could be problematic. Relational database interoperability has been of huge benefit to data management, and this single case could call that into question.

View or Post Comments

Presentation Databases, Containers, and the Cloud

Saturday, March 7, 2020

A few months ago, I wrote a high-level presentation about the deployment benefits of using containers and cloud infrastructure for databases. I am now ready to share this presentation. I have also added QR codes to the first and last slides of all my presentations, linking to my website.

View or Post Comments