Bruce Momjian

Postgres Blog

This blog is about my work on the Postgres open source database, and is published on Planet PostgreSQL. PgLife allows monitoring of all Postgres community activity.


Sql Pagination Tips

Monday, August 10, 2020

We have all used applications that allow searches which return results in sections or "pages", like for products or flights. For example, you might search for a product, and there might be 1,500 matches, but you don't see 1,500 results — you only see ten results (1-10), and you have the option of seeing the next ten (11-20), and the next ten (21-30), etc. Pagination is done for several reasons:

  1. The user probably doesn't want to see all the results at once
  2. The user probably doesn't want to ever see all of the results
  3. It might be inefficient to transfer and display all results

There are several ways to enable pagination in a client/server architecture like that used by databases. One approach is to transfer all results to the client and let the client do the pagination. That handles the first issue, but it doesn't address cases where the transfer and storage of many results is inefficient. While this might work for 1,500 results, it is going to perform badly for 150 million results.

Therefore, some kind of pagination support on the server side is required if large result sets are possible. There are four common approaches:

  1. Cursors: Open a non-with hold cursor in a transaction block and fetch rows in pages. The downside of this is that the entire result set often has to be computed before returning any result. Also, the transaction must be kept open while users are paging through the results, leading to potentially long-running transactions, and therefore decreased cleanup efficiency. Using idle_in_transaction_session_timeout prevents sessions from keeping transactions open too long.
  2. With hold cursors: Using with hold cursors avoids keeping a transaction open during page fetches, but it does require the entire result set to be stored in server memory.
  3. limit/offset: These keywords allow select to return partial results, making them ideally suited to pagination. Limit/offset also allows the optimizer (slide 52) to efficiently access limited result sets, often using indexes, which are only efficient for small result sets. This also avoids creating the entire result set in memory.
  4. Limit/offset with repeatable read: Limit/offset alone has a problem displaying accurate data if the underlying results change during page views. For example, if I am viewing results 21-30, and a row is added which should appear in range 11-20, if I request the next page, what was item 30 will appear again as item 31. If an earlier item is deleted, what was item 31 will now be item 30, and will not appear in range 31-40. Effectively, paging back and forward will cause items to not appear, or appear twice. The fix for this is to use limit/offset in a repeatable read or serializable transaction, so that all page requests use the same snapshot. This does have the same problem that non-with hold cursors had with long-running transactions, so the use of idle_in_transaction_session_timeout is recommended.
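
As a sketch, here is what approaches 1 and 4 might look like; the product table and its name column are hypothetical:

```sql
-- Approach 1: a non-with hold cursor; the transaction stays open
-- between page fetches
BEGIN;
DECLARE page_cur CURSOR FOR SELECT * FROM product ORDER BY name;
FETCH 10 FROM page_cur;   -- rows 1-10
FETCH 10 FROM page_cur;   -- rows 11-20
COMMIT;

-- Approach 4: limit/offset inside a repeatable read transaction,
-- so every page request sees the same snapshot
BEGIN ISOLATION LEVEL REPEATABLE READ;
SELECT * FROM product ORDER BY name LIMIT 10 OFFSET 0;   -- rows 1-10
SELECT * FROM product ORDER BY name LIMIT 10 OFFSET 10;  -- rows 11-20
COMMIT;
```

In both cases, idle_in_transaction_session_timeout limits how long an abandoned session can hold the transaction open.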

One optimization to reduce query requests is to fetch a large page range in the application, display smaller page ranges to the user as requested, and request more page ranges as needed.
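
For example, a minimal sketch of this prefetching, again assuming a hypothetical product table:

```sql
-- Fetch five ten-row pages in one query; the application shows
-- rows 1-10 immediately and serves rows 11-50 from its own cache
SELECT * FROM product ORDER BY name LIMIT 50 OFFSET 0;
```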

One other thing people expect from pagination is an estimate of the number of matches. While you can run count(*) to find the number of matches, this can be very inefficient. A more creative approach is to use the optimizer's statistics as an estimate, since it already maintains statistics to generate efficient plans. For example, if you create this function:

CREATE OR REPLACE FUNCTION row_count_estimate(query TEXT)
RETURNS INTEGER
AS $$
DECLARE
        str TEXT;
        row_count INTEGER;
BEGIN
        -- get just the first row of the EXPLAIN output
        EXECUTE $a$EXPLAIN $a$ || query INTO str;
        -- extract the rows= value
        SELECT substring(substring(str FROM ' rows=[0-9]+') FROM 7) INTO row_count;
        RETURN row_count;
END;
$$ LANGUAGE plpgsql;

you can use it to estimate the number of rows in a result set. (Don't specify limit/offset since we want the total count.)

SELECT row_count_estimate('SELECT * FROM product WHERE name LIKE ''a%''');

To improve the accuracy of the estimates, you can increase the frequency of statistics updates by modifying autovacuum's analyze settings (the autovacuum_analyze_* parameters), either at the cluster or per-table level. You can also run analyze manually. Increasing the granularity of statistics can also improve estimates; this too can be set at the cluster or per-table level.
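
For example, here is one way to do both, assuming a hypothetical product table:

```sql
-- Analyze this table more often than the cluster default
ALTER TABLE product SET (autovacuum_analyze_scale_factor = 0.02);
-- Increase statistics granularity for one column (the default target is 100)
ALTER TABLE product ALTER COLUMN name SET STATISTICS 1000;
-- Manually refresh statistics now
ANALYZE product;
```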


Data modeling Invalid Times

Wednesday, August 5, 2020

Have you ever wondered how invalid or ambiguous times are handled by Postgres? For example, during a daylight saving time transition in the usa, time switches either from 2am to 3am, or from 2am back to 1am. On a spring-forward day (the former), how would 2:30am be represented? Is 1:30am during a fall-back day represented as the first or second 1:30am of that day? This email thread explains the problem, and this patch documents the behavior. (November 4, 2018 was a "fall back" date in the usa.)

The original poster, Michael Davidson, showed the first query, and Tom Lane explained that you would need to use one of the two later queries to qualify the 1am time:

SELECT '2018-11-04 01:00:00'::timestamp WITH TIME ZONE;
 2018-11-04 01:00:00-05
SELECT '2018-11-04 01:00:00 EST'::timestamp WITH TIME ZONE;
 2018-11-04 01:00:00-05
SELECT '2018-11-04 01:00:00 EDT'::timestamp WITH TIME ZONE;
 2018-11-04 01:00:00-04

I am not sure how to specify this for time zones that don't have daylight/non-daylight saving time abbreviations — I suppose utc offsets would need to be used, e.g.:

SELECT '2018-11-04 01:00:00-05'::timestamp WITH TIME ZONE;
 2018-11-04 01:00:00-05

Interestingly, in this case the time zone abbreviation contains more information than the more generic time zone string 'America/New_York'.


Sql Differences Between Dates

Monday, August 3, 2020

What is the difference between two dates? You would think there was one answer, but there isn't. You can give an answer in calendar terms (years/months/days), the number of days, or the number of seconds. Postgres offers all of these options:

SELECT age('2019-12-25', '2018-06-01');
 1 year 6 mons 24 days
SELECT '2019-12-25'::timestamp - '2018-06-01'::timestamp;
 572 days
SELECT '2019-12-25'::timestamptz - '2018-06-01'::timestamptz;
 572 days 01:00:00
SELECT '2019-12-25'::date - '2018-06-01'::date;
 572
SELECT EXTRACT(EPOCH FROM '2019-12-25'::timestamptz) - EXTRACT(EPOCH FROM '2018-06-01'::timestamptz);
 49424400

Unless you are Mr. Spock, you can probably see the value in all of these methods. Age() provides the simplest option for human consumption. Timestamp subtraction (no tz) returns an interval, which can show day and second differences, but doesn't handle daylight saving time changes. Timestamptz subtraction also returns an interval, but handles daylight saving time changes. Subtraction of date values returns an integer number of days and is good for cases where the smallest unit of calculation is a day, like bank loan interest or hotel nights. Extract with epoch is ideal when exact precision is required, like electricity consumption or astronomical measurements.

You can also categorize these methods by the time difference details they ignore. For example, age() compares months/days/seconds units to find the simplest difference, and ignores daylight saving time changes, and sometimes even leap days. Date and timestamp subtraction ignore daylight saving time changes. Subtraction using timestamptz and extract with epoch honor both. (Unfortunately, Postgres ignores leap seconds.)

Here is an example of age() ignoring leap days when the day of the month is the same, but reflecting leap days if the days of the month are different:

-- leap year, same day of the month
SELECT age('2020-03-01', '2020-02-01');
 1 mon
-- non-leap year, same day of the month
SELECT age('2021-03-01', '2021-02-01');
 1 mon
-- leap year, different day of the month
SELECT age('2020-03-01', '2020-02-02');
 28 days
-- non-leap year, different day of the month
SELECT age('2021-03-01', '2021-02-02');
 27 days

This is because age() tries to find the simplest difference. Here is an example of daylight saving time changes using age() and subtraction:

-- Daylight saving time started in America/New_York on March 8, 2020 02:00:00
SHOW timezone;
 America/New_York
SELECT age('2020-03-09', '2020-03-08');
 1 day
SELECT age('2020-03-09 00:00:00'::timestamptz, '2020-03-08 00:00:00'::timestamptz);
 1 day
SELECT '2020-03-09'::date - '2020-03-08'::date;
 1
SELECT '2020-03-09'::timestamp - '2020-03-08'::timestamp;
 1 day
SELECT '2020-03-09'::timestamptz - '2020-03-08'::timestamptz;
 23:00:00
-- a 23-hour day
SELECT EXTRACT(EPOCH FROM '2020-03-09'::timestamptz) - EXTRACT(EPOCH FROM '2020-03-08'::timestamptz);
 82800
-- a 24-hour day
SELECT EXTRACT(EPOCH FROM '2020-03-10'::timestamptz) - EXTRACT(EPOCH FROM '2020-03-09'::timestamptz);
 86400

If you are doing a later calculation using a computed date difference, consider how accurate you want the result to be:

  • Do you want it to be the same day based on the calendar? Use age()
  • Do you want the number of days to be exactly the same? Use date or timestamp subtraction
  • Do you want the number of seconds to be exactly the same? Use timestamptz subtraction or extract with epoch

I have already written about the complexities of using and computing interval values.


Sql Computing Interval Values

Friday, July 31, 2020

The interval data type stores time duration as months, days, and seconds. Years are represented as a fixed number of months, and hours and minutes as a fixed number of seconds. Using interval values makes time computation very simple:

-- Daylight saving time started in America/New_York on March 8, 2020 02:00:00
SHOW timezone;
 America/New_York
SELECT '2020-03-07 00:00:00'::timestamptz + '2 days';
 2020-03-09 00:00:00-04
SELECT '2020-03-07 00:00:00'::timestamptz + '48 hours';
 2020-03-09 01:00:00-04

This computation spans a usa daylight saving time change. The interval data type allows the user to specify whether they want days or hours added, with different results because of the daylight saving time change. This is a great use of the interval data type.

Using queries that output interval values is more nuanced. While you can specify units on input, e.g. days, hours, seconds, you can't specify whether you care about months, days, or seconds in interval output. Therefore, different operations are required to produce different outputs, e.g.:

SELECT '2020-03-09 00:00:00'::timestamptz - '2020-03-07 00:00:00'::timestamptz;
 1 day 23:00:00
SELECT age('2020-03-09 00:00:00'::timestamptz, '2020-03-07 00:00:00'::timestamptz);
 2 days

The first query uses subtraction, and computes based on seconds. The second query internally calls timestamp_age(). The comment in the source code is illustrative:

Calculate time difference while retaining year/month fields. Note that this does not result in an accurate absolute time span since year and month are out of context once the arithmetic is done.

The key part is the "out of context" mention. What it is saying is that when age() returns 2 days, that could have spanned 47, 48, or 49 hours, based on usa time zones, and that detail is lost in the output. Look at this:

SELECT '2020-03-07 00:00:00'::timestamptz + '1 day 23 hours';
 2020-03-09 00:00:00-04
SELECT '2020-03-07 00:00:00'::timestamptz + '47 hours';
 2020-03-09 00:00:00-04
SELECT '2020-03-07 00:00:00'::timestamptz + '2 days';
 2020-03-09 00:00:00-04
SELECT '2020-03-07 00:00:00'::timestamptz + '2 days -1 hours';
 2020-03-08 23:00:00-04

The first three return the same result, though the last does not because days is computed first, then hours. However, when the first and third queries are run in a different context, i.e., for a different date, they generate different outputs:

SELECT '2020-03-12 00:00:00'::timestamptz + '1 day 23 hours';
 2020-03-13 23:00:00-04
SELECT '2020-03-12 00:00:00'::timestamptz + '2 days';
 2020-03-14 00:00:00-04

When you are inputting interval values, be conscious of the month, day, and second values you specify, and when calling a function that produces interval output, consider how it computes its output.


Sql pgFormatter

Wednesday, July 29, 2020

In my years with Postgres, I have seen some amazingly complex queries posted to the email lists. I have never understood how people can read complex queries with no formatting, e.g., no keyword capitalization, no indenting, no line breaks for new clauses:

select n.nspname as "Schema", p.proname as "Name",
pg_catalog.format_type(p.prorettype, null) as "Result data type", case
when p.pronargs = 0 then cast('*' as pg_catalog.text) else
pg_catalog.pg_get_function_arguments(p.oid) end as "Argument data
types", pg_catalog.obj_description(p.oid, 'pg_proc') as "Description"
from pg_catalog.pg_proc p left join pg_catalog.pg_namespace n on n.oid =
p.pronamespace where p.prokind = 'a' and n.nspname <> 'pg_catalog' and
n.nspname <> 'information_schema' and
pg_catalog.pg_function_is_visible(p.oid) order by 1, 2, 4; 

Obviously, some people can read such queries, but I never can. I rely on clean formatting to conceptually understand queries and detect errors. I find this much clearer:

SELECT n.nspname AS "Schema", p.proname AS "Name",
    pg_catalog.format_type(p.prorettype, NULL) AS "Result data type", CASE WHEN
    p.pronargs = 0 THEN
        CAST('*' AS pg_catalog.text)
    ELSE
        pg_catalog.pg_get_function_arguments(p.oid)
    END AS "Argument data types", pg_catalog.obj_description(p.oid, 'pg_proc')
        AS "Description"
FROM pg_catalog.pg_proc p
    LEFT JOIN pg_catalog.pg_namespace n ON n.oid = p.pronamespace
WHERE p.prokind = 'a'
    AND n.nspname <> 'pg_catalog'
    AND n.nspname <> 'information_schema'
    AND pg_catalog.pg_function_is_visible(p.oid)
ORDER BY 1, 2, 4;

In the past, I used to clean up queries using a tool designed to reformat Informix 4GL programs, and I would use sed to fix some things the tool missed. That was obviously far from ideal, so last year I tried Gilles Darold's pgFormatter. My first test was to run the sql files I use for my presentations through the formatter. I found a few problems that Gilles fixed quickly. I then ran the Postgres regression test queries through the formatter, which turned up more problems. Again, Gilles fixed them, and I could then run all 78k lines of sql queries from the regression tests with satisfactory results.

Now, when I am presented with a query that looks like the first version, I run it through pgFormatter, and with my favorite flags, I get a query that looks like I typed it.


Sql Writing Style

Monday, July 27, 2020

There seem to be as many methods of writing sql queries as ways of writing essays. While spacing, capitalization, and naming are all personal preferences, there are also logical arguments for why certain styles are better than others. This web page outlines one set of styles, based on Joe Celko's SQL Programming Style. While I don't agree with all the style decisions, I feel it is a good way to think about your own style decisions and increase style consistency.


Security Encryption at Rest

Friday, July 24, 2020

There are many security guides that require encryption at rest. However, it is unclear exactly what "at rest" means. Encrypted at rest can potentially mean encrypted when the:

  • Storage is powered off
  • File system is unmounted
  • Database is not running
  • Data is in the kernel's file system cache
  • Write-ahead log is in archive storage
  • Data is in backups
  • Data is in the database process's memory

Odds are some of these items are required to implement a guide's "encryption at rest," but maybe not all of them. The last one is particularly difficult, especially since the unlock key probably also has to be somewhere in process memory. When implementing encryption at rest, it is good to be clear exactly what encryption levels are required, and what risks they are designed to minimize.


Security Passwords in Log files

Wednesday, July 22, 2020

No one likes user passwords appearing in log files (except the bad guys/gals). Because Postgres uses sql queries to manage user accounts, including password assignment, it is possible for passwords to appear in the server logs if, for example, log_statement is enabled or an error is generated, depending on the setting of log_min_error_statement.

One solution is to encrypt the password manually before issuing the query, but that can be error-prone. A simpler solution is to use psql to automatically encrypt passwords before sending them to the server, e.g.:

SET client_min_messages = 'log';
SET log_statement = 'all';
\password postgres
Enter new password:
Enter it again:
LOG:  statement: show password_encryption
LOG:  statement: ALTER USER postgres PASSWORD 'md567429efea5606f58dff8f67e3e2ad490'

Notice psql runs show password_encryption to determine if md5 or scram-sha-256 should be used. psql then hashes the supplied password and issues an alter user command. While it is not ideal that hashed passwords can appear in the server logs, it is better than having user-typed passwords in the logs. (It might be wise to set log_min_error_statement to "panic" to prevent logging of error queries.) Another secure option is to use syslog to send the logs to a secure server.


Administration Force Password Changes

Monday, July 20, 2020

I have written about the limitations of passwords before. Postgres has limited ability to control passwords — mostly how passwords are transferred and how long passwords remain valid. For more complex password requirements, the Postgres project recommends using an external authentication system, like ldap or gssapi, that manages passwords. Cert authentication has many advantages because it does not require a password.

Last year, one big change in password policy was the us nist announcement that periodic password changes are no longer recommended. This article explains why this is a logical approach to security. The new recommendations also suggest the limited value of password complexity requirements.

Users occasionally ask for more complex password management features to be built into Postgres. These new guidelines make it less likely such features will ever be added to core Postgres.


Security Credential Rotation Using Certificates

Friday, July 17, 2020

The traditional method of authentication is the local password. Authentication using external password managers is also very popular. However, passwords have known limitations, particularly around password selection.

Managing password changes can also be complex. If users change their own passwords, the process is simple. However, if someone else needs to change the password, and there is a delay between the user requesting the change and the change being made, it can be difficult for the user to know when to switch to using the new password. For applications that connect to the database by supplying passwords automatically, password changes can be even more complex. Programming applications to try both old and new passwords is awkward.

Instead of using something you know, e.g., passwords, another authentication approach is to use something you have. One ideal method of "something you have" is certificate authentication. It allows for more effective credential rotation because the old and new certificates can be valid at the same time. The process of replacing ssl certificates is:

  1. Create a new certificate, signed by a certificate authority in the same certificate chain as the server certificate
  2. Replace the old client certificate with the new one
  3. Add the old client certificate to the certificate revocation list (crl)

After step one, administrators don't need to rush step two because the old and new certificates are both valid. Once step two is complete, step three can be done. Step three is only useful if the server has been set up to honor certificate revocation lists. You also must set an expiration date for certificates to limit their lifetimes.

There is no similar way to do delayed authentication rotation using password-based methods.


Null Grouping Sets and Null Values

Wednesday, July 15, 2020

You might be aware that grouping sets, along with rollup and cube, allow queries to return additional group by rows. However, because these additional rows are added by Postgres, it is unclear what value to use for summary fields, so the sql standard specifies null for those fields. Let's start with a simple example using a simple group by:

CREATE TABLE test (x TEXT, y INTEGER);
INSERT INTO test VALUES ('a', 0), ('b', 0), (NULL, 0);
SELECT x, COUNT(*) FROM test GROUP BY x;
 x | count
 a |     1
 b |     1
   |     1

Each row represents a value in test and each row has a count of one since there is one matching row for each value. The last row is the number of nulls in the table, also one. Here is the same query using rollup:

SELECT x, COUNT(*) FROM test GROUP BY ROLLUP(x);
 x | count
 a |     1
 b |     1
   |     3
   |     1

Notice there are two rows where the first column is null. Let's verify these are nulls and not something else using psql's \pset:

\pset null '(null)'
SELECT x, COUNT(*) FROM test GROUP BY ROLLUP(x);
   x    | count
 a      |     1
 b      |     1
 (null) |     3
 (null) |     1
-- reset the psql NULL display
\pset null ''

Yes, they are both null. The first null is the one generated by rollup, and the last row is the number of null values.

This is quite confusing. If we expect a column involved in a grouping set-type operation to perhaps contain nulls, how can we distinguish between those values and values added by grouping set-type operations? Well, your first inclination might be to use coalesce:

SELECT COALESCE(x, '(null)'), COUNT(*) FROM test GROUP BY ROLLUP(x);
 coalesce | count
 a        |     1
 b        |     1
 (null)   |     3
 (null)   |     1

(Remember we turned off psql's special displaying of nulls, so (null) is coming from coalesce.) That didn't help. The reason is that the select target list is executed last, as covered in a previous blog post. The right way to do it is to force coalesce to run before the grouping set-type operation using a common table expression:

WITH cte AS (
        SELECT COALESCE(x, '(null)') AS x, y FROM test
)
SELECT x, COUNT(*) FROM cte GROUP BY ROLLUP(x);
   x    | count
 a      |     1
 b      |     1
 (null) |     1
        |     3

That's what we wanted — (null) for the inserted null, and a real null for the grouping set-type added row.

Another option is to use grouping, which returns a non-zero value if the row is generated by a grouping set, e.g.:

SELECT x, COUNT(*), GROUPING(x) FROM test GROUP BY ROLLUP(x);
 x | count | grouping
 a |     1 |        0
 b |     1 |        0
   |     3 |        1
   |     1 |        0

This can be combined with case to control the label of the grouping set null column:

SELECT CASE
           WHEN GROUPING(x) != 0 THEN x
           WHEN x IS NOT NULL THEN x
           ELSE '(null)'
       END AS y,
       COUNT(*)
FROM test
GROUP BY ROLLUP(x);
   y    | count
 a      |     1
 b      |     1
 (null) |     1
        |     3


NoSQL JSONB: A Container of Types

Monday, July 13, 2020

The popular json and jsonb data types are more than just a collection of single-data-type values like arrays. They can contain multiple data types:

CREATE TABLE test (x JSONB);
INSERT INTO test VALUES ('"abc"'), ('5'), ('true'), ('null'), (NULL);
\pset null (null)
SELECT x, jsonb_typeof(x), pg_typeof(x) FROM test;
   x    | jsonb_typeof | pg_typeof
 "abc"  | string       | jsonb
 5      | number       | jsonb
 true   | boolean      | jsonb
 null   | null         | jsonb
 (null) | (null)       | jsonb

You can see I stored a string, number, boolean, json null, and an sql null in a single jsonb column, and Postgres knew the type of each value in the jsonb column, as shown by jsonb_typeof(). However, pg_typeof() still sees all these as jsonb values, and that is the type exposed to anything referencing the json column.

Here is another example that uses #>> to convert all json values to text:

SELECT x, x::TEXT, x#>>'{}', pg_typeof(x#>>'{}') FROM test;
   x    |   x    | ?column? | pg_typeof
 "abc"  | "abc"  | abc      | text
 5      | 5      | 5        | text
 true   | true   | true     | text
 null   | null   | (null)   | text
 (null) | (null) | (null)   | text

(I am using #>> rather than the more popular ->> because it lets me access json values that are not associated with keys.) Notice that using :: to cast to text retains the double-quotes, while #>> removes them. (->> does also.) However, the return value for #>> did not change based on the json contents — it always returned text or sql null. Also notice that :: casting returns the string "null" for the jsonb null value, while #>> returns an sql null. (I used psql's \pset null above.)

There is a fundamental problem with the interaction between jsonb and sql data types. Postgres knows the data type of each json value inside the jsonb field (as shown by jsonb_typeof), but each field can have multiple json values inside, and each row can be different as well. Therefore, you only have two choices for interfacing jsonb to sql data types:

  1. Use ->> or #>> to map all values to text
  2. Cast values to an sql data type

For number one, all data types can be converted to text, so it is a simple solution, though it doesn't work well if you need to process the values in a non-textual way, e.g., a numeric comparison. You can cast jsonb values to any sql data type (number two) as long as all jsonb values can be cast successfully. For example:

-- start with an empty table
TRUNCATE test;
-- all inserted values must cast to a jsonb type
INSERT INTO test VALUES ('"1"'), ('2'), ('3e1'),  ('4f1'),  ('true'), ('null'), (NULL);
ERROR:  invalid input syntax for type json
LINE 1: ...NSERT INTO test VALUES ('"1"'), ('2'), ('3e1'),  ('4f1'),  (...
DETAIL:  Token "4f1" is invalid.
CONTEXT:  JSON data, line 1: 4f1
-- 4e1 uses exponential notation
INSERT INTO test VALUES ('"1"'), ('2'), ('3e1'),  ('4e1'),  ('true'), ('null'), (NULL);
SELECT x, x::TEXT, x#>>'{}', jsonb_typeof(x), pg_typeof(x#>>'{}') FROM test;
   x    |   x    | ?column? | jsonb_typeof | pg_typeof
 "1"    | "1"    | 1        | string       | text
 2      | 2      | 2        | number       | text
 30     | 30     | 30       | number       | text
 40     | 40     | 40       | number       | text
 true   | true   | true     | boolean      | text
 null   | null   | (null)   | null         | text
 (null) | (null) | (null)   | (null)       | text
-- all values can't be cast to numeric
SELECT x, x::TEXT, x#>>'{}', x::NUMERIC * 5, jsonb_typeof(x)
FROM test;
ERROR:  cannot cast jsonb string to type numeric
-- all values of jsonb type 'number' can be cast to numeric
SELECT x, x::TEXT, x#>>'{}', x::NUMERIC * 5, jsonb_typeof(x)
FROM test
WHERE jsonb_typeof(x) = 'number';
 x  | x  | ?column? | ?column? | jsonb_typeof
 2  | 2  | 2        |       10 | number
 30 | 30 | 30       |      150 | number
 40 | 40 | 40       |      200 | number
-- Use #>> to remove double-quotes from the jsonb string
SELECT x, x::TEXT, x#>>'{}', (x#>>'{}')::NUMERIC * 5, jsonb_typeof(x)
FROM test
WHERE jsonb_typeof(x) = 'number' OR
      jsonb_typeof(x) = 'string';
  x  |  x  | ?column? | ?column? | jsonb_typeof
 "1" | "1" | 1        |        5 | string
 2   | 2   | 2        |       10 | number
 30  | 30  | 30       |      150 | number
 40  | 40  | 40       |      200 | number

The first insert fails because 4f1 isn't double-quoted and can't be cast to a json numeric, but 4e1 can because it represents exponential notation. In trying to cast all values to numeric in the select, the "1" has double-quotes, so it fails. Using #>> removes the double quotes and allows the string value to be cast to numeric too. This discussion exposes the confusion of using json without casting.

I have been using single jsonb values, but the same behavior happens with jsonb documents:

-- start with an empty table
TRUNCATE test;
-- create document with keys 'a' and 'b'
INSERT INTO test VALUES ('{"a": "xyz", "b": 5}');
-- access key 'a'
SELECT x->'a', jsonb_typeof(x->'a'), pg_typeof(x->'a'), (x->'a')::TEXT, x->>'a', pg_typeof(x->>'a') FROM test;
 ?column? | jsonb_typeof | pg_typeof | text  | ?column? | pg_typeof
 "xyz"    | string       | jsonb     | "xyz" | xyz      | text
-- access key 'b'
SELECT x->'b', jsonb_typeof(x->'b'), pg_typeof(x->'b'), (x->'b')::TEXT, x->>'b', pg_typeof(x->>'b') FROM test;
 ?column? | jsonb_typeof | pg_typeof | text | ?column? | pg_typeof
 5        | number       | jsonb     | 5    | 5        | text

So, in summary:

  • If you want to use json values as text, use ->> or #>>
  • If you want to cast to another sql data type, use :: or cast(), but be sure all values can be cast to the new sql data type

Thoughts Two Interviews

Friday, July 10, 2020

I have done two interviews in the past month. The first one was done by the Linux Inlaws and was published on Hacker Public Radio. This interview discusses the history and open source aspects of the Postgres project, and its health and future direction.

The second interview is more personal, discussing how I got involved in computers and my early experiences with Postgres. It also discusses technology disruption, and its challenge to time management. The final quarter covers religious topics.


News Postgres Marketing

Wednesday, July 8, 2020

Postgres is mostly a technology-driven community, so marketing often suffers. However, one great thing about the community is that it is distributed, so anyone can get involved and help. The community has produced some very successful community-driven marketing items over the years; here are front and rear images of these objects. Of course, I have also accumulated many Postgres pins. I am inspired by these marketing efforts and hope they continue.


Indexing Boolean Indexes

Friday, July 3, 2020

For btree and hash indexes to be used for lookups, values being requested must be very restrictive, roughly 3-5% of a table's rows. Let's consider a boolean column — it can contain only three values: true, false, and null. By definition, at least one of those three values will be in more than 5% of the table's rows, so why index them? A better solution, particularly for boolean fields, is to create a partial index so only rare values are indexed. Partial indexes can also be used to index non-boolean fields when there are a large number of common values that aren't worth indexing.
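
For example, a minimal sketch of a partial index, using a hypothetical orders table where only a few rows are flagged:

```sql
CREATE TABLE orders (id INTEGER, flagged BOOLEAN);
-- Only the rare TRUE values are stored in the index
CREATE INDEX orders_flagged_idx ON orders (flagged) WHERE flagged;
-- This query can use the small partial index
SELECT * FROM orders WHERE flagged;
```

The index stays tiny because the common FALSE and NULL rows are never indexed.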


Indexing Global Indexes

Wednesday, July 1, 2020

Postgres indexes can only be defined on single tables. Why would you want to have indexes that reference multiple tables, i.e., global indexes?

This email covers some of the reasons why global indexes can be useful. One big use-case is the ability to create indexes on partitioned tables that index all its child tables, rather than requiring a separate index on each child table. This would allow references to partitioned tables as foreign keys without requiring the partition key to be part of the foreign key reference; Postgres 12 allows such foreign keys if they match partition keys.

A second use-case for global indexes is the ability to add a uniqueness constraint to a partitioned table where the unique columns are not part of the partition key. A third use-case is the ability to index values that only appear in a few partitions, and are not part of the partition key. A global index would avoid having to check each partition table's index for the desired value.

It is still unclear if these use-cases justify the architectural changes needed to enable global indexes. Some of these features can be simulated using triggers and user lookup tables. A large global index might also reintroduce problems that prompted the creation of partitioning in the first place.


Hardware Hardware Acceleration for Databases

Monday, June 29, 2020

There is a long history of hardware acceleration, i.e., hardware modules helping the cpu. There was the 80287 math coprocessor, sound cards, and video cards. The computer industry is constantly moving things from the cpu to the motherboard and external cards, and back again. Movement is mostly determined by whether the cpu is able to efficiently perform the task, the transfer bandwidth needed to perform the task, and the flexibility of replaceable external cards.

This year, the big question for database software is if and how to make use of graphics processing units (gpus) and field-programmable gate arrays (fpgas). This article does a good job of explaining the history of hardware acceleration, and mentions Netezza's (no longer used) use of fpgas as hardware acceleration for databases.

The same historic hardware acceleration questions apply to database acceleration today:

  • Are they better suited than cpus to do some database processing tasks, and how common are those tasks?
  • Is there sufficient transfer bandwidth to gpus and fpgas to justify their use?
  • Is hardware acceleration worth the deployment complexity?

PgOpenCL, PG-Strom, and HeteroDB are projects that are experimenting with the value of gpus and fpgas in Postgres. As cloud providers increase the availability of gpus and fpgas, we might start to see their usage increase.

View or Post Comments

Data modeling Can Case Comparison Be Controlled?

Friday, June 26, 2020

Computer tasks are one of the most precise activities we do on a daily basis. Driving, cooking, walking, and reading are fairly imprecise compared to computer interaction.

Computers represent symbols like "a" and "A" precisely and require external facilities to define relationships between them. This email thread makes a convincing argument that you usually want case-preserving, but less-precise case-insensitive behavior.

Let's go over some Postgres case-precision behaviors like the handling of character strings, identifiers, and keywords. For example, these queries do the same thing:

select count(*) from pg_class;
Select Count(*) From Pg_Class;

This is because Postgres, and the sql standard, ignore the case of keywords, e.g., select. They also ignore the case of identifiers, e.g., pg_class, when they are not double-quoted. Double-quoting adds case precision to identifiers:

SELECT "count"() FROM "pg_class";
SELECT "COUNT"() FROM "pg_class";
ERROR:  function COUNT() does not exist
LINE 1: SELECT "COUNT"() FROM "pg_class";
HINT:  No function matches the given name and argument types. You might need to add explicit type casts.
SELECT "Count"() FROM "Pg_Class";
ERROR:  relation "Pg_Class" does not exist
LINE 1: SELECT "Count"() FROM "Pg_Class";
SELECT "Count"() FROM "PG_CLASS";
ERROR:  relation "PG_CLASS" does not exist

There is no ability to add case precision to keywords:

"SELECT" COUNT() FROM pg_class;
ERROR:  syntax error at or near ""SELECT""
LINE 1: "SELECT" COUNT() FROM pg_class;

When comparing values, Postgres is precise by default:

SELECT 'a' = 'A';
SELECT '-' = '_';
SELECT '.' = ',';

For certain symbols, it is sometimes visually hard to see the difference.

As shown above, double-quoting adds precision to identifiers. For value comparisons, you have to be explicit to remove precision:

SELECT upper('a') = upper('A');
SELECT lower('a') = lower('A');

Most people aren't comparing constants in sql, but rather column values:

SELECT oid FROM pg_class WHERE relname = 'pg_class';
SELECT oid FROM PG_CLASS WHERE relname = 'pg_class';
SELECT oid FROM pg_class WHERE relname = 'Pg_Class';
SELECT oid FROM pg_class WHERE relname = 'PG_CLASS';

Notice that these queries use pg_class as an identifier (without single quotes) and as a value (with single quotes). The identifier usage is case insensitive; the value usage is case sensitive. You can explicitly reduce comparison precision using function calls:

SELECT oid FROM pg_class WHERE lower(relname) = lower('pg_class');
SELECT oid FROM pg_class WHERE lower(relname) = lower('Pg_Class');
SELECT oid FROM pg_class WHERE lower(relname) = lower('PG_CLASS');

These convert the column values and constants to lower case before comparison. (Upper case could also have been used.) Indexes are also case sensitive by default, and it obviously would be inefficient to lower-case every index entry for comparison, so function calls on columns cannot use an ordinary column index:

EXPLAIN SELECT oid FROM pg_class WHERE relname = 'pg_class';
                                        QUERY PLAN
 Index Scan using pg_class_relname_nsp_index on pg_class  (cost=0.27..8.29 rows=1 width=4)
   Index Cond: (relname = 'pg_class'::name)
EXPLAIN SELECT oid FROM pg_class WHERE lower(relname) = lower('pg_class');
                       QUERY PLAN
 Seq Scan on pg_class  (cost=0.00..19.76 rows=2 width=4)
   Filter: (lower((relname)::text) = 'pg_class'::text)

They can use expression indexes that are created to match function calls:

-- create a user table because users can't create indexes on system tables
CREATE TABLE my_pg_class AS SELECT * FROM pg_class;
-- create non-expression index
CREATE INDEX i_my_pg_class_relname ON my_pg_class (relname);
-- create expression index
CREATE INDEX i_my_pg_class_relname_lower ON my_pg_class ((lower(relname)));
-- The optimizer needs statistics.  Doing the ANALYZE after the expression
-- index creation allows statistics to be created for the expression.
-- Autovacuum would have eventually done this automatically.
ANALYZE my_pg_class;
-- use non-expression index
EXPLAIN SELECT oid FROM my_pg_class WHERE relname = 'pg_class';
                                       QUERY PLAN
 Index Scan using i_my_pg_class_relname on my_pg_class  (cost=0.27..8.29 rows=1 width=4)
   Index Cond: (relname = 'pg_class'::name)
-- use expression index
EXPLAIN SELECT oid FROM my_pg_class WHERE lower(relname) = lower('pg_class');
                                          QUERY PLAN
 Index Scan using i_my_pg_class_relname_lower on my_pg_class  (cost=0.27..8.29 rows=1 width=4)
   Index Cond: (lower((relname)::text) = 'pg_class'::text)

The citext extension allows the creation of columns whose values are automatically compared in a case-insensitive manner:

-- 'x' column added so the row has a typical length
CREATE TABLE my_pg_class2 AS SELECT oid, relname::citext, repeat('x', 256) FROM pg_class;
CREATE INDEX i_my_pg_class_relname2 ON my_pg_class2 (relname);
ANALYZE my_pg_class2;
\d my_pg_class2
            Table "public.my_pg_class2"
 Column  |  Type  | Collation | Nullable | Default
---------+--------+-----------+----------+---------
 oid     | oid    |           |          |
 relname | citext | C         |          |
 repeat  | text   |           |          |
Indexes:
    "i_my_pg_class_relname2" btree (relname)
SELECT oid FROM my_pg_class2 WHERE relname = 'pg_class';
SELECT oid FROM my_pg_class2 WHERE relname = 'PG_CLASS';
EXPLAIN SELECT oid FROM my_pg_class2 WHERE relname = 'pg_class';
                                        QUERY PLAN
 Index Scan using i_my_pg_class_relname2 on my_pg_class2  (cost=0.27..8.29 rows=1 width=4)
   Index Cond: (relname = 'pg_class'::citext)
EXPLAIN SELECT oid FROM my_pg_class2 WHERE relname = 'PG_CLASS';
                                        QUERY PLAN
 Index Scan using i_my_pg_class_relname2 on my_pg_class2  (cost=0.27..8.29 rows=1 width=4)
   Index Cond: (relname = 'PG_CLASS'::citext)

There are more facilities available to further reduce precision:

If Postgres 12 or later is compiled with icu library support (check the system table column pg_collation.collprovider), you can use nondeterministic collations that are case- and accent-insensitive.
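As a sketch (the collation name is arbitrary), a nondeterministic icu collation gives case-insensitive comparisons without function calls:

```sql
-- requires Postgres 12+ with icu support
CREATE COLLATION case_insensitive (
    provider = icu,
    locale = 'und-u-ks-level2',  -- secondary strength: case is ignored
    deterministic = false
);

SELECT 'a' = 'A' COLLATE case_insensitive;  -- true
```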

View or Post Comments

Data modeling Force One Row

Monday, June 22, 2020

How can you force a table to have at most one row? It is actually very easy by creating a unique expression index on a constant, with no column name references:

-- create the table (definition inferred from the inserts below) and
-- the unique expression index on a constant
CREATE TABLE onerow (a INTEGER, b TEXT);
CREATE UNIQUE INDEX onerow_expr_idx ON onerow ((1));
-- this adds a single row
INSERT INTO onerow VALUES (1, 'foo')
        ON CONFLICT ((1)) DO UPDATE SET a = excluded.a, b = excluded.b;
SELECT * FROM onerow;
 a |  b
 1 | foo
-- this updates the single row
INSERT INTO onerow VALUES (2, 'bar')
        ON CONFLICT ((1)) DO UPDATE SET a = excluded.a, b = excluded.b;
SELECT * FROM onerow;
 a |  b
 2 | bar
-- this also updates the single row
INSERT INTO onerow VALUES (3, 'baz')
        ON CONFLICT ((1)) DO UPDATE SET a = excluded.a, b = excluded.b;
SELECT * FROM onerow;
 a |  b
 3 | baz
-- try INSERT without ON CONFLICT
INSERT INTO onerow VALUES (4, 'foo2');
ERROR:  duplicate key value violates unique constraint "onerow_expr_idx"
DETAIL:  Key ((1))=(1) already exists.

By using on conflict, it is possible to add a row, but if a row already exists, to replace it with a new value.

View or Post Comments

Data modeling Storing Binary Data in the Database

Friday, June 19, 2020

There are some very good responses in an email thread about whether to store binary data in Postgres or externally. The binary storage options discussed were:

  • In the database (toast helps with performance)
  • In another database, like SQLite
  • In a local or network file system
  • In cloud storage

This email reply had many good insights, and this wiki page has even more. I have covered data storage outside of databases before.

View or Post Comments

Conference Dinner Q&A

Wednesday, June 17, 2020

My employer, EnterpriseDB, has been organizing events where potential customers and interested people can ask me questions while enjoying a meal. I thought the idea was strange, but I have done it ten times, and they have gone very well. The Q&A portion usually lasts one hour and forty-five minutes. During a November, 2019 event in Utrecht, the Netherlands, a sketch artist was present. The artist illustrated my ideas as I spoke and created this diagram, which I found quite interesting.

View or Post Comments

Client Controlling Server Variables at Connection Time

Monday, June 15, 2020

I have recently covered the importance of libpq environment variables and connection specification options. While most libpq options control how to connect to the Postgres server, there is one special option that can change variables on the server you connect to, e.g.:

$ psql 'options=-cwork_mem=100MB dbname=test'
psql (13devel)
Type "help" for help.
test=> SHOW work_mem;
 work_mem
----------
 100MB

This can also be done using environment variables, with all the benefits of environment variables:

$ PGOPTIONS="-c work_mem=100MB" psql test

These settings can also be set at the user, database, and cluster level on the server side, but having control on the client side is often useful.
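For reference, the server-side equivalents might look like this (the role and database names are examples):

```sql
ALTER ROLE bruce SET work_mem = '100MB';     -- user level
ALTER DATABASE test SET work_mem = '100MB';  -- database level
ALTER SYSTEM SET work_mem = '100MB';         -- cluster level
```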

View or Post Comments

Client Connect Parameter Specification Options

Friday, June 12, 2020

I have previously covered the importance of libpq and environment variables. While you can specify discrete connection command-line parameters and environment variables, there is a catch-all setting that allows connection options to be specified in a single string, e.g.:

$ psql -d test
psql (13devel)
Type "help" for help.
test=> \q
$ psql --dbname test
psql (13devel)
Type "help" for help.
test=> \q
$ psql 'dbname=test'
psql (13devel)
Type "help" for help.
test=> \q

The first psql command uses a single-letter command-line option. The second one uses a long-format option. The third uses a parameter key word. Multiple key words can be used to specify multiple connection options:

$ psql ' port=5433 dbname=test'

You can also use a uri syntax to specify the same parameters as above:

$ psql postgresql://

View or Post Comments

Client Controlling Connection Parameters Using Environment Variables

Wednesday, June 10, 2020

Libpq is the Postgres connection library used by almost every non-jdbc application. It allows many connection parameters, which can be specified on the command line or embedded in applications:

$ psql -h -d mydb

In the above case, the psql host name and database name are specified on the command-line and interpreted by libpq. However, it is also possible to specify parameters using environment variables, which are also interpreted by libpq:

$ PGDATABASE=mydb psql

There is obviously no value in specifying libpq parameters using environment variables in this example, but there are use cases. For example, if you want to perform multiple operations on the same host name and database, you can do:

$ export
$ export PGDATABASE=mydb
$ vacuumdb
$ reindexdb

This avoids specifying the host and database names multiple times, though with a loss of clarity. Sometimes environment variables are best used as defaults when connection options are not specified:

$ export
$ export PGDATABASE=mydb
$ reindexdb
$ reindexdb --dbname mydb2
$ reindexdb --host --dbname mydb3

This reindexes databases mydb and mydb2 on host, and database mydb3 on

Another use-case for environment variables is to set parameters for users, without having to pass them as parameters to commands:

$ # must use 'sh' so the redirect happens as root
$ sudo sh -c "echo PGHOST='' >> ~bruce/.profile"

By appending this to bruce's .profile file, all applications that use libpq without a specific host name will connect to automatically. Once bruce logs out and back in again, all his applications will start using the new .profile pghost setting.

Finally, environment variables make it possible to set default connection values for all users. For example, on Debian, to default all tcp (non-Unix Domain socket) connections to fully verify ssl certificates, you can do:

$ sudo sh -c "echo PGSSLMODE='verify-full' >> /etc/profile.d/"

Of course, you can change environment variables set at login and applications can override connection parameters set by environment variables.

View or Post Comments

Administration Safety Systems Can Reduce Safety

Monday, June 8, 2020

What is the purpose of safety systems? To make things safer? To make them appear safer? To satisfy some external requirement? The purpose of safety systems is not always clear, but even for safety systems whose sole purpose is to increase safety — do they always succeed in increasing safety? The simple answer is "no". Here are three examples:

In all three cases, safety systems did not prevent disasters — they caused them. Safety systems are often very useful, and the modern world could not operate without them. However, they also add complexity, and that added complexity can introduce failure modes that did not exist without the safety systems.

So, where does that leave us? Safety systems are useful, but are too many of them bad? How many is too many? These are hard questions, but there are some guidelines:

  • How serious is the failure that the safety system is trying to prevent?
  • How likely is the failure that the safety system is trying to prevent?
  • How likely is the safety system to fail?
  • What impact will failure of the safety system have on the overall system?

This gets into a very complex calculus where you are computing the likelihood and seriousness of the failure that the safety system is trying to prevent, and the likelihood and seriousness of safety system failure. The big point is that while you are computing the likelihood and seriousness of failures and adding safety systems to compensate for them, you have to be aware of the cost of adding those safety systems, in both the likelihood and impact of their failure.

What does this have to do with databases? Well, you might have a database on an airplane or at a nuclear reactor site, in which case the database is part of a complex system. However, databases are also complex systems, and we regularly add safety systems to increase their reliability. How often do we consider the cost of those safety systems, in terms of the likelihood and seriousness of failures? Let's look at some typical database safety systems:

  • Backups
  • Error reporting
  • Replication
  • Connection pooling

Let's walk through the calculus for backups:

  • How serious is the failure that the safety system is trying to prevent? High
  • How likely is the failure that the safety system is trying to prevent? High
  • How reliable is the safety system? Medium
  • What impact will failure of the safety system have on the overall system? Low

The last question is not about the seriousness of a backup failing to perform its safety goal, but rather how likely a backup is to cause a failure on its own. It could fill up the disk with backup files, or cause too much load on the system, but those risks are low. Similarly, error reporting has minimal risk of destabilizing the system, except for consuming I/O and storage space.

Replication and connection poolers are in a different class of risk. Replication's goal is to allow for controlled switchover or failover in case of primary server failure, but what is its reliability and the impact if it fails? If synchronous replication is used, failure to replicate will cause the system to stop accepting writes. This can be caused by a network failure to the replicas, a replica outage, or even misconfiguration. Even failure of asynchronous replication can cause the write-ahead log directory to fill up storage, causing an outage.

Similarly, the failure of a connection pooler can cause a total outage. You can use multiple poolers, but what if the failure of one pooler prevents the other from working, or if they both work but conflict with each other? Multiple poolers also add complexity to the system, which makes debugging other problems harder. This is a great example where, to avoid the problems of safety system failure, you create two safety systems, but the two safety systems can interact in ways that make them less reliable than one safety system. Multi-master replication can have similar downsides. Even auto-failover has risks, and auto-failback even more.

The bottom line is that safety systems can be useful, but they can also add complexity which makes systems more fragile and harder to control; consider how safety systems interact and implement them only where their value is clear.

View or Post Comments

Administration When Does a Commit Happen?

Wednesday, June 3, 2020

Most people who deal with relational databases think of transaction commits as binary operations — the query is running and not yet committed, then it is completed and committed. However, internally, there are many stages to a commit:

  1. Write commit record to the write-ahead log
  2. Flush the write-ahead log to durable storage
  3. Update the pg_xact (slide 57)
  4. Transfer to replicas
  5. Mark the commit as visible to other sessions (ProcArrayEndTransaction() updates PGPROC)
  6. Communicate commit to the client

These steps are implemented in RecordTransactionCommit().

What is interesting is that this process can be interrupted at any time, perhaps by a server crash or network failure. For example:

  • The commit could be flushed to disk but not yet visible to other clients
  • The commit could be replicated (and visible to queries on replicas) but not visible to queries on the primary
  • Queries could appear committed to other sessions before the client issuing the query receives notification

This email thread explains the problem that commit with standbys is not always an atomic operation. Postgres has the function txid_status() which allows clients to check if a transaction, perhaps from a previous disconnected session, was committed.
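As a sketch of how a client might use it (the table and transaction id are hypothetical), txid_status() reports 'committed', 'aborted', or 'in progress':

```sql
BEGIN;
-- record the transaction id before doing work
SELECT txid_current();
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
COMMIT;

-- after a disconnection, a new session can check whether it committed,
-- e.g., if txid_current() above returned 568
SELECT txid_status(568);
```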

View or Post Comments

Presentation Lessons from the Online Conference Trenches

Monday, June 1, 2020

Having presented at two online conferences in the past two weeks, presenting at two this coming week, and presenting at many edb-sponsored webinars, I have learned a few things about online presentations that might be helpful for Postgres organizers, speakers, and attendees:

For Organizers: With no need for speakers and attendees to travel to online conferences, there are many more potential speakers available, and many more potential attendees, than for in-person conferences. However, the technical challenges of hosting an online conference are significant because producing and consuming content can require multiple platforms that must be integrated seamlessly for a positive user experience. The content production platform, e.g., Zoom, is often different from the consumption platform, e.g., YouTube, Facebook Live. If the user experience is poor, people will leave because they are not bound to a physical location as they would be at an in-person event. Adjusting to the time zones of speakers and attendees can be complex — doing a 24-hour conference like Percona Live (organizer tips) solves many of the time zone problems, but requires moderators from many time zones. The moderator's job is much more extensive for online conferences since they control access, deal with technical problems, and manage the all-important chat channel. For online conferences, chat is the best way to promote attendee engagement. If chat is done well, user engagement during presentations can be even better than at in-person conferences.

For Speakers: Just like organizers, speakers face more technical challenges than at in-person conferences — it is harder to engage the audience, more things can go wrong, and attendees can more easily leave. As a speaker, I have a checklist that I always reference before each presentation:

  • Mute phone, chat, email, and upgrade notifications
  • Use a laptop on AC power with wired Ethernet, for reliability
  • Have a count-up clock to keep track of the talk duration
  • Use a headset so you don't lean toward a microphone
  • Use a presentation remote so you don't lean forward to change slides
  • Turn on your video camera to increase audience engagement
  • Make sure the lighting is good and the background is uncluttered

For Attendees: Many online conferences are a mix of different technologies, not always seamlessly integrated, so anticipate that it will take time to get initially connected. Consider watching the conference on a large television, or from a tablet you can carry around. Use chat to engage with the speaker and other attendees. Feel free to switch to a more interesting presentation without guilt. When you get frustrated, consider how much time you are saving by not having to travel.

View or Post Comments

Administration Visualizing Collations

Friday, May 29, 2020

There is still significant confusion about character sets, encodings, and collations. This is because in the real, non-digital world, we usually treat languages, their characters, and their ordering as unified, but in the digital world they are treated separately, and the distinction can be hard to visualize.

These two posted queries illustrate collation in a very creative way. The first query, SELECT chr(i) FROM generate_series(1, 50000) i ORDER BY chr(i) COLLATE "C", outputs characters in their "C" binary order, with ascii as the first 128 characters, successive groups of languages following, and ending with pictographic languages.

The second query, SELECT chr(i) FROM generate_series(1, 50000) i ORDER BY chr(i) COLLATE "en_US.utf8", outputs the same 50,000 characters in "United States English" utf8 order. The output starts with pictographic languages, not ascii. The Latin alphabet appears, but not until line 19068. What is interesting is that there are 118 symbols grouped together that look like 'a', 'a' with diacritics, or have 'a' as part of their symbol. Then 'b' appears with a group of 38 symbols that look like or use 'b', and so on through the Latin alphabet. (If you highlight a character and paste it into a search box, Google will tell you about that Unicode symbol.)

I found it interesting that it groups letters that look like Latin letters, even if they are not from Latin alphabets and don't sound like Latin letters. Cyrillic is grouped in a section after the Latin alphabet section. These sql queries are the clearest example I have seen of collation ordering.

If I had used a different collation instead of "United States English", there would have been a different ordering. This is why index storage is sensitive to collations, i.e., indexes with different collations store the same strings in a different order. Collation also affects other things, like upper/lower case processing, the ordering of query output, and certain optimizations.
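A small example, assuming the en_US.utf8 collation is installed, makes the ordering difference concrete:

```sql
SELECT s FROM (VALUES ('a'), ('B'), ('A'), ('b')) v(s)
ORDER BY s COLLATE "C";           -- A, B, a, b

SELECT s FROM (VALUES ('a'), ('B'), ('A'), ('b')) v(s)
ORDER BY s COLLATE "en_US.utf8";  -- a, A, b, B
```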

View or Post Comments

Administration What is an Lsn?

Wednesday, May 27, 2020

You might have seen that there is a pg_lsn data type:

test=> \dTS pg_lsn
              List of data types
   Schema   |  Name  |       Description
 pg_catalog | pg_lsn | PostgreSQL LSN datatype

Client programs pg_receivewal and pg_recvlogical have options that take lsn values, but what is an lsn? It stands for "Log Sequence Number" — it is a 64-bit value that represents a position in the write-ahead log. It is usually displayed as two 32-bit hex values, separated by a slash. For example, pg_controldata displays lsn values:

$ pg_controldata
Latest checkpoint location:           0/15AE1B8
Latest checkpoint's REDO location:    0/15AE180
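Because pg_lsn is a real data type, lsn arithmetic is easy; for example, the distance in bytes between the two checkpoint locations above:

```sql
-- subtracting two lsns yields the number of bytes between them
SELECT '0/15AE1B8'::pg_lsn - '0/15AE180'::pg_lsn;  -- 56

-- the current write-ahead log insert location is also an lsn
SELECT pg_current_wal_lsn();
```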

So, the next time you view or need to specify a write-ahead log location, you are using an lsn.

View or Post Comments

Administration Taking Snapshots of Clusters Which Use Tablespaces

Monday, May 25, 2020

Postgres already documents the ability to back up the database cluster using file system snapshots. Unfortunately, database clusters that use tablespaces often cannot use this method if the storage system doesn't support simultaneous snapshots across file systems.

However, simultaneous snapshots across file systems might not be a hard requirement for Postgres snapshot backups. It might be possible for snapshots to be non-simultaneous as long as the write-ahead log that spans the time frame between snapshots is included in the backup, and checkpoints do not happen during that time frame.

Internally, starting Postgres from a snapshot backup replays write-ahead log records to make a single file system snapshot consistent. Potentially, it could do the same for non-simultaneous snapshots of multiple file systems. However, documenting this, giving users a reliable list of steps to perform, and making sure it always works is probably too complex to justify.

View or Post Comments

Administration Moving Tables, Indexes, and Tablespaces Between Clusters

Friday, May 22, 2020

Currently, it is impossible to move tables, indexes, and entire tablespaces from one cluster to another — that is because each table and index file is bound to the cluster's infrastructure:

  • Table and index definitions
  • pg_xact (commit/abort/in-progress transaction status records)
  • pg_multixact (used for multi-session row locking)

Fyi, you can easily move tablespaces to new directories as long as they remain in the same cluster, and move tables and indexes between tablespaces.
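For example (the table, index, and tablespace names are hypothetical):

```sql
-- move a single table or index to another tablespace
ALTER TABLE mytable SET TABLESPACE fast_ssd;
ALTER INDEX mytable_pkey SET TABLESPACE fast_ssd;

-- move every table in one tablespace to another
ALTER TABLE ALL IN TABLESPACE old_space SET TABLESPACE fast_ssd;
```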

So, how could it be made possible? Freezing can remove references to pg_xact and pg_multixact, assuming there are no active transactions during the freeze operation. Moving table and index definitions is more complex, but it certainly seems possible. This requires more research.

View or Post Comments

Administration Why Pgdata Should Not Be at the Top of a Mount Point

Wednesday, May 20, 2020

This email thread is illustrative of why it is unwise to place the Postgres data directory (pgdata) at the top of a mount point. Instead, create a subdirectory under the mount point and put pgdata there. This has the advantage of avoiding possible data corruption if mounting fails, and allows more efficient use of pg_upgrade.

View or Post Comments

Conference Percona Live Online

Monday, May 18, 2020

I am planning to virtually attend and present at the Percona Live Online conference tomorrow, May 19. It starts at 10am, Eastern us time, and spans 24 hours, so it covers every time zone. I am giving my Will Postgres Live Forever? presentation at noon, Eastern us time.

Attendance is free, so you might want to check it out. I saw some interesting topics on the program. I am also curious to experience a 24-hour virtual conference, though I am unlikely to remain awake that long.

View or Post Comments

Administration Using Non-Login Roles

Monday, May 18, 2020

When we talk about database roles, most people immediately think of login roles, which allow people to log in. However, another user management feature is the ability to create non-login roles, formerly called groups. Non-login roles can also be assigned permissions, e.g., via grant, and can have login roles as members. Non-login roles can even be members of other non-login roles.

What is the value of using non-login roles? They allow a group of people to be assigned as members of a non-login role, and that role can be used to abstract permission assignment. For example, if you have shop foremen, you can make the login roles of all foremen members of a non-login foreman role. As people are added to and removed from that staff position, their login roles can be added to or removed from the non-login role without any need to change the permissions assigned to the foreman role.
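A minimal sketch of that setup (the object and role names are hypothetical):

```sql
-- a non-login role that carries the permissions
CREATE ROLE foreman NOLOGIN;
GRANT SELECT, UPDATE ON work_orders TO foreman;

-- a login role for a person; membership grants foreman's permissions
CREATE ROLE alice LOGIN;
GRANT foreman TO alice;

-- when alice changes positions, only the membership changes
REVOKE foreman FROM alice;
```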

A further advantage of non-login roles, as explained in a recent email, is that Postgres can start to suffer performance problems if more than a few dozen roles are granted permission on an object. A much simpler and more manageable solution is to add users to a non-login role and assign object permissions to that non-login role.

View or Post Comments

News Draft of Postgres 13 Release Notes

Friday, May 15, 2020

I have completed the draft version of the Postgres 13 release notes, containing 181 items. The release notes will be continually updated until the final release, which is expected to be in September or October of this year. Beta testing will start in the next few weeks.

View or Post Comments

Internals Avoiding Cache Wipe, Synchronized Scans

Wednesday, May 13, 2020

Whenever you are dealing with a lot of data, it helps to cache it. Postgres does this using shared_buffers. However, one risk of caching data is that a large query that accesses a lot of data might remove frequently-accessed data from the cache; this is called cache wipe. To avoid this, Postgres limits the number of shared buffers used by data operations that are expected to access a lot of data.

Looking at the C function GetAccessStrategy(), you can see there are four shared buffer access strategies. The first one, bas_normal, is used for normal scans; the rest exist to avoid cache wipe:

  • bas_bulkread is enabled for large reads and uses 256 kilobytes of shared buffers (typically 32 8kB shared buffers)
  • bas_bulkwrite is enabled for large writes and uses 16 megabytes of shared buffers
  • bas_vacuum is for vacuum operations and uses 256 kilobytes of shared buffers

For example, function initscan() uses bas_bulkread if the scan is expected to access more than one-quarter of shared buffers. Similarly, table rewrites, create table as, and copy from use bas_bulkwrite. Bas_bulkwrite is larger because we can't discard written buffers from the cache until they are written to storage, unlike unmodified buffers which can be discarded anytime.

You might wonder, with operations using so few shared buffers, what happens if another session needs to scan the same data? Doesn't performance suffer? Well, another Postgres facility, that was developed independently, helps with this: synchronized scans. The top of syncscan.c explains it well:

When multiple backends run a sequential scan on the same table, we try to keep them synchronized to reduce the overall I/O needed. The goal is to read each page into shared buffer cache only once, and let all backends that take part in the shared scan process the page before it falls out of the cache.

Both synchronized scans and shared buffer access strategies work automatically, so most Postgres users don't even know they exist, but they do improve Postgres performance.

View or Post Comments

Internals Why Do We Freeze?

Monday, May 11, 2020

You might have seen autovacuum running, and noticed that it sometimes performs freeze operations on transaction ids (32 bits) and multi-xacts (used for multi-session row locking). The frequency of freeze operations is controlled by autovacuum_freeze_max_age and autovacuum_multixact_freeze_max_age. You can reduce the frequency of freezing if you are sure the freeze operation will complete before transaction wraparound is reached.
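To see how close each table is to a forced freeze, you can compare the age of its relfrozenxid to the threshold; a sketch:

```sql
-- tables with the oldest frozen transaction ids; autovacuum forces
-- a freeze as age approaches autovacuum_freeze_max_age
SELECT relname, age(relfrozenxid)
FROM pg_class
WHERE relkind = 'r'
ORDER BY 2 DESC
LIMIT 5;

SHOW autovacuum_freeze_max_age;  -- 200 million by default
```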

There is regular discussion about how freezing could be avoided, and this email does the best job of explaining the options. We could expand transaction ids to 64 bits, either on each row or perhaps with a page-level default, but pg_xact (commit/abort/in-progress transaction status records) would still need cleanup. This area probably needs more thought.

View or Post Comments

Internals Postgres Internals Website

Friday, May 8, 2020

I am often asked how someone can learn more about Postgres, particularly the internals. There is the Postgres developer page, which has links to many resources, and the developer's faq item about learning about the internals. One link on that page I was not aware of is Hironobu Suzuki's very detailed website about Postgres internals. It has a lot of details I have never seen written before, so I suggest those interested should check it out.

View or Post Comments

Internals Portability's Surprising Win

Wednesday, May 6, 2020

When writing software, it is necessary to decide whether to use external facilities available in command-line tools, libraries, frameworks, and the operating system, or to write the facilities yourself. Why would you write them yourself? You might be worried about adding a dependency on an external facility, or a facility might not offer sufficient flexibility or performance.

The Postgres development team has had to make similar decisions. Fortunately, we have tended to favor reliance on common operating system interfaces, tools, and libraries, e.g., OpenSSL, bison. We have avoided reliance on external facilities that are uncommon or not well maintained.

Postgres has reproduced facilities that were commonly available in the operating system, tools, or libraries only when there was a clear benefit. Reproducing such facilities for a small benefit, like a little more control or a little more performance, is rarely wise. While relying on external facilities often makes Postgres less flexible and perhaps less performant, there are long-term benefits:

  • As external facilities improve their feature-set and performance, Postgres benefits from these improvements with little effort
  • As new demands are required of these external facilities, Postgres again benefits effortlessly

Let's be specific. In the early days of Postgres, file systems were not optimized for database storage. Fortunately, Postgres never implemented its own file system on raw devices. Now that modern file systems, like ext4, give good database performance, Postgres benefits from file system improvements with almost no effort. Even when ssds started being used, the only change needed in Postgres was the ability to set random_page_cost at the tablespace level to handle databases where some tablespaces are on ssds and some are on magnetic storage.

When virtualization, cloud, containers, and container orchestration (e.g., Kubernetes) became popular, Postgres had to do almost nothing to run well on these platforms. (It is true that some of the Postgres enterprise deployment tools required repackaging and re-engineering.)

Because of this philosophy, Postgres has remained relatively light-weight compared to other relational database systems, and this has benefited Postgres in environments where nimble deployments are favored.

View or Post Comments

Performance With ... Materialized and Optimizer Control

Monday, May 4, 2020

Before Postgres 12, queries specified as common table expressions (with clauses) always behaved as optimization barriers, meaning that common table expression queries were executed independently, and were not moved to later parts of the query.

Starting in Postgres 12, if a common table expression is referenced only once, and the keyword materialized is not used, it can be moved to a place later in the query where it can be better optimized; this improves optimization possibilities. However, if the movement of common table expression queries increases the from clause table count above the geqo_threshold, Postgres will decide it can't efficiently optimize such a high table count query and will use the genetic query optimizer.

So, while the new Postgres 12 behavior of in-lining common table expressions usually increases the quality of optimized plans, in some cases it can decrease plan quality by triggering the genetic query optimizer. In a way, in pre-Postgres 12, or when materialized is used, the query author does the optimization by creating common table expressions; in other cases the optimizer has greater control, though even the optimizer can determine the query is too complex and fall back to less-precise genetic query optimization.
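For example, with a hypothetical orders table:

```sql
-- In Postgres 12+, this single-reference common table expression is
-- in-lined, so an index on orders.customer_id can be used:
WITH recent AS (
    SELECT * FROM orders WHERE order_date > '2020-01-01'
)
SELECT * FROM recent WHERE customer_id = 42;

-- Adding MATERIALIZED restores the pre-12 optimization barrier, so
-- the common table expression is computed independently:
WITH recent AS MATERIALIZED (
    SELECT * FROM orders WHERE order_date > '2020-01-01'
)
SELECT * FROM recent WHERE customer_id = 42;
```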

View or Post Comments

Performance Background Writes

Friday, May 1, 2020

Postgres must guarantee durability and good performance. To meet these objectives, Postgres does writes to the file system and storage in the background as much as possible. In fact, there are only two major cases where writes happen in the foreground:

  1. Write-ahead log writes happen before commits are acknowledged to the client
  2. A needed shared_buffer is dirty and must be written to storage so it can be replaced

Write-ahead log writes (#1) can be controlled using various settings. Dirty shared buffer writes (#2) that happen in the foreground are minimized if the background writer is operating efficiently. You can monitor such writes by viewing the probes buffer-write-dirty-start and buffer-write-dirty-done.
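You can also compare foreground and background writes using the pg_stat_bgwriter view; buffers_backend counts dirty buffers that backends had to write themselves:

```sql
-- Ideally buffers_backend stays small relative to buffers_clean
-- (written by the background writer) and buffers_checkpoint
-- (written by the checkpointer).
SELECT buffers_checkpoint, buffers_clean, buffers_backend
FROM pg_stat_bgwriter;
```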

View or Post Comments

Performance Optimal Use of Ssds

Wednesday, April 29, 2020

Ssds have different performance characteristics than magnetic disks, and using them optimally isn't always clear. Ssds have several performance benefits:

  1. Very fast fsyncs
  2. Much faster random reads and writes
  3. Faster sequential reads and writes

So, if all your data is stored on ssds, you will certainly improve performance. If you are mixing ssds and magnetic disks, ideally you should use ssds in ways that give the greatest benefit. Starting with number one, putting the write-ahead log on ssds is a great way to improve fsync performance.

For number two, moving indexes to tablespaces using ssd storage can greatly improve performance because index access is usually random. When using ssds, the default value for random_page_cost should be lowered, perhaps to 1.1. This can be set at the tablespace level if there is a mix of tablespaces on ssds and magnetic disks.
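Both ideas can be sketched like this, assuming a hypothetical ssd_space tablespace and orders_pkey index:

```sql
-- The directory must already exist and be owned by the postgres user.
CREATE TABLESPACE ssd_space LOCATION '/ssd/pgdata';

-- Move a heavily used index to the ssd tablespace.
ALTER INDEX orders_pkey SET TABLESPACE ssd_space;

-- Lower random_page_cost only for tablespaces on ssds.
ALTER TABLESPACE ssd_space SET (random_page_cost = 1.1);
```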

For number three, it is also possible to create tablespaces on ssds for current data, and place archive data on tablespaces that use magnetic disks. By using table partitioning, a partitioned table can transparently span ssds and magnetic disk tablespaces.
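The partitioning approach can be sketched as follows, assuming hypothetical ssd_space and magnetic_space tablespaces:

```sql
CREATE TABLE measurements (
    reading_time timestamptz,
    value        numeric
) PARTITION BY RANGE (reading_time);

-- Current data on ssds
CREATE TABLE measurements_2020 PARTITION OF measurements
    FOR VALUES FROM ('2020-01-01') TO ('2021-01-01')
    TABLESPACE ssd_space;

-- Archive data on magnetic disks
CREATE TABLE measurements_2019 PARTITION OF measurements
    FOR VALUES FROM ('2019-01-01') TO ('2020-01-01')
    TABLESPACE magnetic_space;
```

Queries against measurements transparently span both storage types.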

View or Post Comments

Performance Does Postgres Support Compression?

Monday, April 27, 2020

I am often asked if Postgres supports compression, and my answer is always a complicated dance around what "compression" level they are asking about. There are six possible levels of database compression:

  1. single field
  2. across rows in a single page
  3. across rows in a single column
  4. across all columns and rows in a table
  5. across tables in a database
  6. across databases

Number one (single field) is currently done by toast. Number two (across rows in a single page) is a practical optimization where a compression routine blindly looks for repeating values in a page without understanding its structure. The difficulty of implementing this arises when a page is stored using its compressed length (rather than the uncompressed 8k), the page contents change, and the new contents compress less well than the previous contents. In this case, the compressed page contents would be larger, and it would be very complex to fit the page into the existing space in the file. A different file layout is really required for this, so pages can be placed anywhere in the file without affecting index access. A team is working on adding this feature using Postgres's table access method interface.
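Number one's per-field toast behavior can be controlled per column; a sketch, assuming a hypothetical documents table with a large body column:

```sql
-- EXTENDED (the default for compressible types) allows both
-- compression and out-of-line storage; EXTERNAL disables compression.
ALTER TABLE documents ALTER COLUMN body SET STORAGE EXTERNAL;

-- View current strategies
-- (attstorage: p=plain, m=main, e=external, x=extended).
SELECT attname, attstorage
FROM pg_attribute
WHERE attrelid = 'documents'::regclass AND attnum > 0;
```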

Number three (across rows in a single column) is the classic definition of a columnar database. A team is also working on that. Just like number two, this requires using a different storage layout than Postgres's default, and the table access method interface makes this possible.

Number four can be done using file system compression. Numbers five and six would be nice, but it is unclear how this could be done efficiently without adding unacceptable complexity to the database.

View or Post Comments

Performance Multi-Host Technologies

Friday, April 24, 2020

There are so many multi-host technologies and it is hard to remember the benefits of each one, so I decided to create a list:

High availability: Streaming replication is the simplest way to have multiple copies of your database, ready for fail over

Read scaling: Pgpool allows replicas (slide 17) to handle a database's read-only workload

Write scaling: Sharding allows for write scaling

Partial replication: Logical replication allows partial replication

Reduce latency: Multi-master replication allows servers to be located close to users, reducing transmission latency

Hopefully this is helpful to people.

View or Post Comments

Performance Performance Goalposts

Wednesday, April 22, 2020

In talking to EnterpriseDB customers, I am often asked about the performance limits of Postgres: How many connections can it handle? How many tps?

Well, those are good questions, but it is hard to give accurate answers since so much depends on the hardware and workload. Eventually, testing of actual workloads on intended hardware has to be done, but not giving ball-park answers is unhelpful. What answer can I give? Well, I came up with this chart:

  #  Measure        Threshold  Recommendation
  1  Connections    < 250      direct connect
  2  Queries        < 250      pgbouncer
  3  Write queries  < 250      Pgpool with read-only replicas (slide 17)
  4  Write queries  >= 250     sharding

Earlier items use simpler architectures than later items, and are therefore preferred.

For under 250 simultaneous connections (#1), it isn't generally necessary to use any complex architecture, though using a pooler is recommended for workloads with many short-lived sessions to reduce connection startup time. Even when under this limit, performance can be limited by the number of cpus. Oltp databases typically cannot process more simultaneous queries than five times the number of cpu cores without having to time-slice among cpus; olap workloads typically use two times the number of cpu cores, or less if parallelism is used.

For over 250 simultaneous connections but under 250 simultaneous queries (#2), a connection pooler can be beneficial because it reduces the overhead of managing many open connections. For over 250 simultaneous queries but under 250 simultaneous write queries (#3), a combination of Pgpool with read queries routed to streaming replicas can be very efficient. For over 250 simultaneous write queries (#4), a sharding solution should be considered.
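A minimal pgbouncer configuration for case #2 might look like this; the database name and pool sizes are illustrative only:

```ini
[databases]
; forward the hypothetical "appdb" to the local Postgres server
appdb = host=127.0.0.1 port=5432 dbname=appdb

[pgbouncer]
listen_port = 6432
; transaction pooling lets many client connections share
; a small number of server connections
pool_mode = transaction
max_client_conn = 1000
default_pool_size = 50
```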

These numbers can vary greatly based on workload and hardware, but I think they are good starting points to consider.

View or Post Comments

Performance Fast Enough?

Monday, April 20, 2020

Everyone wants their software to perform as fast as possible. Some people think that unless the program ends before it begins, it's too slow.

However, realistically, making something as fast as possible is not a universal good, meaning that increasing performance beyond a certain point can cause problems that far exceed the value of the improved performance. Let's give some examples:

  • Most people use high-level languages that are compiled into cpu instructions or interpreted while being run. In an ideal world, every program would be written in assembly language. (Postgres does use some assembly language code for locking.) However, writing something like a database in assembly language, though it might give slightly better performance if developers knew the behavior of every cpu, would be a huge challenge for even the most expert developers, and code maintenance and feature additions might be impossible. In general, there just isn't enough benefit to using assembly language for anything but the most discrete, performance-critical functions.
  • Java isn't often chosen for its performance, but rather its ability to allow development teams to produce complex software efficiently. Lighter-weight languages might give better performance, but they don't offer the same development efficiency as Java.
  • Sql is a heavy-weight way to request data, but it is very efficient for developers since they can express their requests in a declarative way. You could probably write a custom data storage program to run faster than sql, but the effort involved to create and maintain it would be nearly insurmountable.
  • You can even lay out data in sql in more efficient ways, and sometimes it is worth it, but data maintainability, access flexibility, and storage efficiency can suffer.

The bottom line is that performance is rarely a universal good — it has to be balanced against development time, solution flexibility, and maintainability. Sometimes people suggest that Postgres should offer some "super fancy" optimization, and sometimes we can implement it, but we always have to balance development time, solution flexibility, and maintainability with improved performance.

View or Post Comments

Community No Travel

Friday, April 17, 2020

With the Coronavirus outbreak, almost all Postgres events through June have been either cancelled, rescheduled, or moved online. This has given me time to consider my past travel. I have been blessed to visit so many places, not as a tourist, but rather a guest. I see countries more as a native than as a tourist, and I have many event organizers and hosts to thank for this.

It is hard to know when on-site events will resume, but I can remember what it was like to travel roughly 90 days a year. Going to social outings at home often felt like riding a train through my home town, waving from behind a train window to my friends on the platform. I would sit at home and wonder how long I would be there until I had to leave again. Now, sitting at home, the images of our family travels appearing on our kitchen slideshow show places that seem farther away than ever, and I wonder if I will ever see these places again.

I am sure many others have similar feelings, and I have hope that, someday, we will all return to the road to spend time together again.

View or Post Comments

Thoughts Database Interoperability at Risk

Monday, March 16, 2020

This article parallels Oracle's copying of the sql syntax from ibm in the late 1970's with Google's copying of the Java api. It also explains the possible impact of the current case soon to be decided by the US Supreme Court.

One thing it does not fully cover is the impact on sql-level interoperability between databases. If Oracle can claim the Java api as copyrightable, the sql language could be considered copyrightable, allowing ibm to sue all relational database vendors and users for copyright infringement. It might also allow database vendors to sue competitors when their sql-level features are copied, requiring either huge payments or the removal of interoperability syntax.

The Postgres copyright is open, so any database vendor copying Postgres syntax is fine, but Postgres copying the syntax of other databases could be problematic. Relational database interoperability has been of huge benefit to data management, and this single case could call that into question.

View or Post Comments

Presentation Databases, Containers, and the Cloud

Saturday, March 7, 2020

A few months ago, I wrote a high-level presentation about the deployment benefits of using containers and cloud infrastructure for databases. I am now ready to share this presentation. I have also added QR codes to the first and last slides of all my presentations, linking to my website.

View or Post Comments