Bruce Momjian

Postgres Blog


This blog is about my work on the Postgres open source database, and is published on Planet PostgreSQL. PgLife allows monitoring of all Postgres community activity.




Sql Referencing Select Column Aliases in Where Clauses

Friday, March 24, 2017

Ever wonder why using select column aliases in where clauses fails?

SELECT random() AS confusion
FROM generate_series(1,10)
WHERE confusion > 0.5;
 
ERROR:  column "confusion" does not exist
LINE 3: WHERE confusion > 0.5;

It is because the order in which the clauses of a select statement are evaluated is specified by the sql standard, and it isn't top to bottom: the where clause is processed before the select list's aliases exist. Tom Lane's excellent email post goes into the details. The thread also mentions that, unfortunately, users of union often really want the behavior of union all.
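
A common workaround, shown here as a minimal sketch, is to compute the aliased expression in a subquery so the alias already exists when the outer where clause is evaluated:

SELECT confusion
FROM (SELECT random() AS confusion
      FROM generate_series(1,10)) AS sub
WHERE confusion > 0.5;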

 


Sql Character Sets, Encodings, and Collations

Wednesday, March 22, 2017

Understanding how character sets, encodings, and collations work together can be confusing. I would like to explain them in this blog post:

Character Sets

Postgres databases can be initialized to support a variety of character sets. A character set is a full set of the characters or glyphs that can be represented. Popular character sets are ascii (128 characters), Latin1 (ISO8859-1, 256 characters), and Unicode (128k+ characters).

Encodings

Encoding is the way a character set is stored. For single-byte character sets like ascii and Latin1, there is only a single way to encode the character set, i.e. as a single byte. For more complex multi-byte character sets there can be several ways to encode the character set. For example, Unicode can be encoded as UTF-8 (8-bit granularity), UTF-16 (16-bit), or UTF-32 (32-bit). (Postgres only supports UTF-8 for Unicode, server-side.) Asian languages also often support multiple encodings for a single character set, e.g. Big5, GB2312, Shift-JIS, EUC-JP.
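
For example, you can check the encoding used by the server and by your client session (the output below assumes a UTF-8 database; yours might differ):

SHOW server_encoding;
 server_encoding
-----------------
 UTF8

SHOW client_encoding;
 client_encoding
-----------------
 UTF8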

Collations

Collations specify the order of characters in a character set. Again, for single-byte character sets, there is usually only one possible collation, which uses encoded byte values to provide the ordering, though this can lead to odd orderings, e.g. Z (uppercase) ordered before a (lowercase), or z ordered before á. For a complex character set like Unicode, the user can often select the desired collation.

It is possible to illustrate different collations of the same character set using Unicode:

SHOW lc_collate;
 lc_collate
-------------
 en_US.UTF-8
 
SELECT * FROM (VALUES ('a'), ('_b'), ('_C')) AS f(x) ORDER BY x;
 x
----
 a
 _b
 _C
 
-- the same, with the collation explicitly stated
SELECT * FROM (VALUES ('a'), ('_b'), ('_C')) AS f(x) ORDER BY x COLLATE "en_US.utf8";
 x
----
 a
 _b
 _C
 
-- with a collation based on byte values, matching ASCII ordering
SELECT * FROM (VALUES ('a'), ('_b'), ('_C')) AS f(x) ORDER BY x COLLATE "C";
 x
----
 _C
 _b
 a

You can see in the first two queries that the en_US collation effectively ignores case and underscores when ordering these strings. The last query, using the C collation, reverses the order because the byte value of an uppercase letter (C) is less than that of a lowercase letter (b), and the byte value of the underscore is less than that of a lowercase letter (a).

This can be seen at the operating system level too:

$ echo $LANG
en_US.UTF-8
 
$ (echo 'a'; echo '_b'; echo '_C') | sort
a
_b
_C
 
$ (echo 'a'; echo '_b'; echo '_C') | LANG="en_US.UTF-8" sort
a
_b
_C
 
$ (echo 'a'; echo '_b'; echo '_C') | LANG="C" sort
_C
_b
a

The handling of whitespace can also be controlled by sql-level collations:

SELECT * FROM (VALUES (' '), ('a'), (' b')) AS f(x) ORDER BY x COLLATE "en_US.utf8";
 x
----
 
 a
  b
 
SELECT * FROM (VALUES (' '), ('a'), (' b')) AS f(x) ORDER BY x COLLATE "C";
 x
----
 
  b
 a

In the first query, while a single space is ordered before a, the leading space does not cause ' b' to be ordered before a; in the C collation it does. Things get more complicated with letters with accents, tildes, and pictographic languages.

Postgres relies on the operating system locale to support character sets, encodings, and collations. On Linux, you can see the list of supported locales via locale -a. While initdb sets the default locale for the cluster based on environment variables, the locale can be overridden in a variety of ways, including by CREATE TABLE.
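
For example, a collation can be attached to an individual column at table creation time; this sketch reuses the strings from the earlier queries:

CREATE TABLE names (name TEXT COLLATE "C");

INSERT INTO names VALUES ('a'), ('_b'), ('_C');

SELECT * FROM names ORDER BY name;
 name
------
 _C
 _b
 a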


Sql Outer Joins and Where Clauses

Monday, March 20, 2017

Postgres supports both traditional join syntax, which uses the where clause to specify joined columns, and ansi join syntax, which uses the word join in the from clause. While both syntaxes can be used for inner joins, only the ansi join syntax supports outer joins in Postgres.

Because column restrictions like col = 4 are not related to joins, you would think that restrictions have the same effect whether they appear in the where clause or as part of a join clause, e.g. a outer join b on a.x = b.x and col = 4. However, this is not always true. Restrictions in the join clause are processed during joins, while where clause restrictions are processed after joins.

This is only significant in outer joins (and cross joins) because outer joins manufacture null column values for unjoined rows. Here is an example:

CREATE TABLE test1 (x INTEGER);
CREATE TABLE test2 (x INTEGER);
 
INSERT INTO test1 VALUES (1), (2), (3);
INSERT INTO test2 VALUES (1), (2);
 
SELECT * FROM test1 LEFT JOIN test2 ON test1.x = test2.x;
 x |   x
---+--------
 1 |      1
 2 |      2
 3 |  (null)
 
SELECT * FROM test1 LEFT JOIN test2 ON test1.x = test2.x AND test2.x <= 2;
 x |   x
---+--------
 1 |      1
 2 |      2
 3 |  (null)
 
SELECT * FROM test1 LEFT JOIN test2 ON test1.x = test2.x WHERE test2.x <= 2;
 x | x
---+---
 1 | 1
 2 | 2

As you can see, the first select performs the outer join with no column restrictions. The second select returns the same result because the column restriction matches all rows in test2 (before the outer join happens). The final select operates on the result of the join and since null <= 2 returns null, which behaves like false, the third output row is suppressed.

What is even more interesting is seeing how a column restriction can cause a query that would normally be fully joined to return unjoined column values:

INSERT INTO test2 VALUES (3);
 
SELECT * FROM test1 LEFT JOIN test2 ON test1.x = test2.x;
 x | x
---+---
 1 | 1
 2 | 2
 3 | 3
 
SELECT * FROM test1 LEFT JOIN test2 ON test1.x = test2.x AND test2.x <= 2;
 x |   x
---+--------
 1 |      1
 2 |      2
 3 | (null)
 
SELECT * FROM test1 LEFT JOIN test2 ON test1.x = test2.x WHERE test2.x <= 2;
 x | x
---+---
 1 | 1
 2 | 2

In the second select, the test2.x value of 3 is excluded from the join, resulting in an outer-join value of null for the column.

In summary, be careful of the placement of restriction clauses on inner-side tables in outer-join queries and clearly determine whether you want restrictions to happen at join time or post-join.


Community 'You need to set a tone where everyone in the community feels valued'

Friday, March 17, 2017

As open source gains popularity in the enterprise, there is increased study of open source communities and how they function. Those studying such things often ask about Postgres because of its unusually healthy community and recent successes.

While I was in India in February, I was interviewed by an open source magazine; an excerpt from that interview is now online. It covers open source leadership, encouraging new developers, and healthy software ecosystems.


Internals Column Storage Internals

Wednesday, March 15, 2017

Postgres uses native cpu alignment to store values in a row. This allows blocks to be copied unchanged from disk into shared buffers and accessed as local variables, as outlined in this presentation.

This can be illustrated by pg_column_size(). First, an empty row size:

SELECT pg_column_size(row());
 pg_column_size
----------------
             24

Predictably, 2-byte, 4-byte, and 8-byte integers increase the row length by their respective sizes:

SELECT pg_column_size(row(1::int2));
 pg_column_size
----------------
             26
 
SELECT pg_column_size(row(1::int4));
 pg_column_size
----------------
             28
 
SELECT pg_column_size(row(1::int8));
 pg_column_size
----------------
             32
 

You can also see alignment effects — the first query needs no alignment padding, while the second has two bytes of padding before the int4, which is why removing one int2 column leaves the row length unchanged:

SELECT pg_column_size(row(1::int2, 1::int2, 1::int4));
 pg_column_size
----------------
             32
 
SELECT pg_column_size(row(1::int2, 1::int4));
 pg_column_size
----------------
             32

You can see the alignment requirements for each data type by querying the system catalogs:

SELECT typalign, typname FROM pg_type ORDER BY 1, 2;
 typalign | typname
----------+---------
 c        | bool
 c        | char
 c        | cstring
 c        | name
 c        | unknown
 c        | uuid
 d        | _box
 d        | _circle
 d        | _float8
…

The Postgres source code file src/include/catalog/pg_type.h documents the meaning of the letters:

'c' = CHAR alignment, ie no alignment needed.
's' = SHORT alignment (2 bytes on most machines).
'i' = INT alignment (4 bytes on most machines).
'd' = DOUBLE alignment (8 bytes on many machines, but by no means all).

It is possible to define table columns in an order that minimizes padding. Someday Postgres might do this automatically.
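
For example, placing the int8 value before the int2 value avoids padding, while the reverse order forces six bytes of padding before the int8 (a sketch; exact sizes depend on the platform's alignment rules):

-- int2 first forces padding before the 8-byte-aligned int8
SELECT pg_column_size(row(1::int2, 1::int8));
 pg_column_size
----------------
             40

-- int8 first needs no padding
SELECT pg_column_size(row(1::int8, 1::int2));
 pg_column_size
----------------
             34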

The 24-byte row header includes an 8-bit mask to record null values. You can see below that the 8-bit mask is sufficient for eight nulls, but the ninth null requires the null bit mask to be expanded, with additional alignment:

SELECT pg_column_size(row(NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL));
 pg_column_size
----------------
             24
 
SELECT pg_column_size(row(NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL));
 pg_column_size
----------------
             32

These things all happen automatically, but sometimes it is interesting to see it working.


Administration Getting the Most Out of Application_name

Monday, March 13, 2017

Application_name might be one of those Postgres settings that you have seen in various places but whose purpose you never understood. It is true that setting application_name doesn't change the behavior of Postgres (with one small exception), but it is very useful for monitoring. Its value appears in pg_stat_activity and can be prepended to every log line with log_line_prefix.

Applications can set application_name when they connect as part of the connection string. Psql, pg_dump, and other tools set application_name. (They actually set fallback_application_name so application_name can be overridden by users.)
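
For example, here are two ways to set it at connection time with psql, using a connection-string parameter or the PGAPPNAME environment variable (the name batch_job is a placeholder):

$ psql "dbname=postgres application_name=batch_job"

$ PGAPPNAME=batch_job psql postgres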

A more interesting use of application_name is to change it while the application is running. pgAdmin updates the application_name when a user changes screens so administrators can know exactly what screen users are on. While you can't change application_name while a query is running, you can change it between queries, which allows you to monitor the progress of long-running batch jobs, e.g.:

SELECT application_name
FROM pg_stat_activity
WHERE pid = pg_backend_pid();
 application_name
------------------
 psql
 
SET application_name = 'demo';
 
SELECT application_name
FROM pg_stat_activity
WHERE pid = pg_backend_pid();
 application_name
------------------
 demo

Some people might complain about the overhead of a separate query to update application_name. There are two solutions. One is to send two statements to the server in a single string, e.g. set application_name = 'demo2'; select 100. Some libraries like libpq support this, but psql sends these as two separate queries — you have to enable log_statement to see how queries are sent to the server.

Another approach is to bundle the setting of application_name inside the query:

SELECT 100 FROM set_config('application_name', 'demo3', false);
 
SELECT application_name
FROM pg_stat_activity
WHERE pid = pg_backend_pid();
 application_name
------------------
 demo3

Unfortunately, there is no way to guarantee that set_config() will be run first. In the first query below, set_config() is run first; in the second, it is run second, because the optimizer is allowed to evaluate from-clause expressions in any order:

SELECT 100 FROM set_config('application_name', 'demo4', false), pg_sleep(5);
 
SELECT 100 FROM pg_sleep(5), set_config('application_name', 'demo5', false);

As you can see, application_name gives you a way to monitor what Postgres is doing, not only at connection time but during the life of the connection.


Administration Dynamic Logging Control

Friday, March 10, 2017

Postgres has many logging options, e.g. log_statement, log_min_duration_statement, and log_line_prefix. These values can be set at various levels, which I have already covered. One level that has particular flexibility is postgresql.conf. Settings in this file affect all running sessions, unless the values are overridden at lower levels.

When debugging, it is often hard to know which log settings to enable. One approach is to enable all settings beforehand, but that can generate lots of log traffic and be hard to interpret. Ideally you can enable just the settings you need at the time you need them — this is where postgresql.conf comes in. With postgresql.conf or alter system, you can change any logging settings you want and then signal a reload by either sending a sighup signal to the server, running "pg_ctl reload", or executing "SELECT pg_reload_conf()". One nice thing is that after a reload, all running sessions receive the new settings when their transactions complete. This allows logging settings to be enabled and disabled as needed.
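
For example, this sketch temporarily logs every statement running longer than one second, then removes the override when debugging is done:

ALTER SYSTEM SET log_min_duration_statement = '1s';
SELECT pg_reload_conf();

-- when finished debugging
ALTER SYSTEM RESET log_min_duration_statement;
SELECT pg_reload_conf();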


Administration Postgres Alerts

Wednesday, March 8, 2017

It is cool to be able to stare at colorful graphs to see what Postgres is doing, but sometimes you just want to set something up, walk away, and be informed when there is a problem. That is what check_postgres and tail_n_mail are designed to do.

check_postgres is a script designed to be run from cron or a monitoring tool like Nagios. It reports on areas in the database that need administrator attention. tail_n_mail has a similar purpose, but monitors the Postgres log files for important messages.

Few people stare at their mobile phones waiting for something to happen — they configure their phones to notify them when something important happens. Why not do that for your Postgres clusters? That's what check_postgres and tail_n_mail are meant to do.


Performance Wait Event Reporting

Tuesday, February 28, 2017

Postgres often lacks the sophisticated reporting of more established enterprise relational database systems. Sometimes that is because Postgres doesn't require as much monitoring and tuning, but there are legitimate monitoring needs which Postgres doesn't support.

Thanks to the wait_event_type and wait_event columns added to the pg_stat_activity view in Postgres 9.6, it is possible to find which parts of the system are causing query delays. In Postgres 9.6 wait_event_type can have four values:

  • LWLockNamed
  • LWLockTranche
  • Lock
  • BufferPin

and this table lists all the possible wait_event column values, grouped by wait_event_type. Below is typical pg_stat_activity output during pgbench:

SELECT pid, wait_event_type, wait_event
FROM pg_stat_activity;
  pid  | wait_event_type |  wait_event   
-------+-----------------+---------------
 14154 | LWLockNamed     | WALWriteLock  
  8923 |                 |               
 14155 | LWLockNamed     | WALWriteLock  
 14158 | LWLockNamed     | WALWriteLock  
 14160 | LWLockNamed     | WALWriteLock  
 14161 | Lock            | transactionid 
 14163 | LWLockNamed     | WALWriteLock  
 14165 | LWLockNamed     | WALWriteLock  
 14166 | LWLockNamed     | WALWriteLock  
 14167 | Lock            | tuple         
 …

Postgres 10 will have even more monitoring capabilities, with at least four new wait event types. One nice thing about this feature is that it has almost no overhead, so it is enabled by default. Unfortunately, there is no efficient way to measure event duration except by periodic sampling of event states.
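
For example, one crude way to sample wait events is to aggregate the wait columns and repeat the query every second with psql's \watch (a sketch; waits shorter than the sampling interval will be missed):

SELECT wait_event_type, wait_event, COUNT(*)
FROM pg_stat_activity
WHERE wait_event IS NOT NULL
GROUP BY 1, 2
ORDER BY 3 DESC
\watch 1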

This excellent email post from Robert Haas statistically analyzes wait events to show how different workloads have different wait behaviors, e.g. heavy wal writes, unlogged tables, all data in shared buffers. The results match what you would expect from these kinds of workloads, but it is much more detailed than you would get from guessing.

These new pg_stat_activity columns give us a new window into performance. Before this, we had to guess what the bottleneck was, and we were very good at it. This new instrumentation gives us very accurate statistics on where we are losing performance. I expect this feature to yield major performance improvements in the years to come.


Indexing Creating Descending Indexes

Monday, February 27, 2017

You might have noticed that create index allows you to specify asc or desc for each column being indexed, and you might have wondered why. Asc is the default, so there is no need to use it. Desc actually does have a use, but it isn't obvious.

Postgres can access indexes forward or backward, so there is no need to specify ascending/descending if all columns are ascending or descending. The use-case for specifying the index order is for multi-column indexes with mixed ordering, assuming queries also use the same mixed ordering. For example, let's create a two-column table with 1000 rows, and an index in all-ascending order:

CREATE TABLE test(x INTEGER, y INTEGER);
INSERT INTO test
        SELECT x, x FROM generate_series(1, 1000) AS f(x);
 
CREATE INDEX i_test ON test (x, y);

Postgres has no problem using the index if the query is all ascending or all descending, but it can't be used with mixed-column ordered queries:

EXPLAIN (COSTS off) SELECT * FROM test ORDER BY x, y;
              QUERY PLAN
--------------------------------------
 Index Only Scan using i_test on test
 
EXPLAIN (COSTS off) SELECT * FROM test ORDER BY x DESC, y DESC;
                  QUERY PLAN
-----------------------------------------------
 Index Only Scan Backward using i_test on test
 
EXPLAIN (COSTS off) SELECT * FROM test ORDER BY x, y DESC;
       QUERY PLAN
------------------------
 Sort
   Sort Key: x, y DESC
   ->  Seq Scan on test

Creating an index that matches the mixed ordering of the query allows the index to be used:

CREATE INDEX i_test2 ON test (x, y DESC);
 
EXPLAIN (COSTS off) SELECT * FROM test ORDER BY x, y DESC;
              QUERY PLAN
---------------------------------------
 Index Only Scan using i_test2 on test

The mixed-ordered index can even be used for backward scans:

EXPLAIN (COSTS off) SELECT * FROM test ORDER BY x DESC, y;
                   QUERY PLAN
------------------------------------------------
 Index Only Scan Backward using i_test2 on test

I am not sure how many people use mixed-ordered queries that would benefit from indexes, but I wanted to show how this feature can be useful.


Indexing Index Order Does Matter

Friday, February 24, 2017

Postgres has supported multi-column indexes since 1997, e.g. create index i_test on test (a, b, c). It can easily use an index if the supplied columns are all at the front of the index, e.g. a and b in the previous index, but it can also use the index if some of the indexed column values are not supplied, e.g. columns a and c in the previous index. It does this by looking up a in the index, then looking through the index for matches of c, ignoring values of b, e.g.

CREATE TABLE test (a INT, b INT, c INT);
INSERT INTO test
        SELECT x, x, x FROM generate_series(1, 100000) AS f(x);
CREATE INDEX i_test ON test(a, b, c);
ANALYZE test;
 
EXPLAIN (COSTS false)
SELECT * FROM test WHERE a = 1 AND c = 1;
              QUERY PLAN
--------------------------------------
 Index Only Scan using i_test on test
   Index Cond: ((a = 1) AND (c = 1))

Of course, if possible, you should always put the most commonly-supplied columns first in the index because skipping columns during index scans (called "index skip scans") is expensive.

However, I had assumed that if all the indexed column values are supplied in the query, the order of the columns in the index doesn't matter, but Robert Haas recently mentioned this is not always true. For example, if a has many duplicate values, and b has mostly unique values, having a at the start of the index is suboptimal — better to use b first, which will more effectively narrow the search space when looking for matches on a.

A more specific example of this is range queries. In that case, if the range test of one column is less restrictive than the equality test of another, it would be better for the equality test column to be first in an index. For example, in the queries below, the first query uses an index because it is very restrictive on the first column, the second query uses an index because it is moderately restrictive on the first column, while the third does not use an index because it is effectively unrestrictive on the first column:

EXPLAIN SELECT * FROM test WHERE a = 1 AND b >= 1 AND b <= 100000;
                               QUERY PLAN
-------------------------------------------------------------------------
 Index Only Scan using i_test on test  (cost=0.42..8.44 rows=1 width=12)
   Index Cond: ((a = 1) AND (b >= 1) AND (b <= 100000))
 
EXPLAIN SELECT * FROM test WHERE a >= 1 AND a <= 50000 AND b = 1;
                                 QUERY PLAN
----------------------------------------------------------------------------
 Index Only Scan using i_test on test  (cost=0.42..1404.10 rows=1 width=12)
   Index Cond: ((a >= 1) AND (a <= 50000) AND (b = 1))
 
EXPLAIN SELECT * FROM test WHERE a >= 1 AND a <= 100000 AND b = 1;
                       QUERY PLAN
--------------------------------------------------------
 Seq Scan on test  (cost=0.00..2291.00 rows=1 width=12)
   Filter: ((a >= 1) AND (a <= 100000) AND (b = 1))

Notice the increasing costs, even though all queries match one indexed row.

Obviously, in cases where you are not specifying all indexed columns in every query, you should put the most frequently referenced columns first in the index to avoid the overhead of index skip scans. However, for cases where most indexed columns are going to be supplied in queries, placing the most restrictive columns first in indexes is a win.


Indexing Expression Index Statistics and Joins

Wednesday, February 22, 2017

In my previous blog post, I showed how statistics generated on expression indexes can be used to produce more accurate row counts, and potentially better plans. While I did show more accurate row counts via explain, I did not show changed query plans. I plan to do so in this blog post. First, the setup:

CREATE TABLE test1 AS
        SELECT * FROM generate_series(1, 100000) AS f(x);
CREATE TABLE test2 AS
        SELECT * FROM generate_series(1, 2) AS f(x);
 
ANALYZE test1;
ANALYZE test2;

then a join query using modulus with no expression index:

EXPLAIN SELECT test1.x
FROM test1, test2
WHERE test1.x % 2 = 1 AND test1.x = test2.x;
                           QUERY PLAN
-----------------------------------------------------------------
 Nested Loop  (cost=0.00..1959.02 rows=1 width=4)
   Join Filter: (test1.x = test2.x)
   ->  Seq Scan on test1  (cost=0.00..1943.00 rows=500 width=4)
         Filter: ((x % 2) = 1)
   ->  Materialize  (cost=0.00..1.03 rows=2 width=4)
         ->  Seq Scan on test2  (cost=0.00..1.02 rows=2 width=4)

A nested loop join is used, which is suboptimal because the row count for test1 is one hundred times too small. With proper statistics on the modulus operation on test1.x, a more efficient hash join is used:

CREATE INDEX i_test1 ON test1((x % 2));
ANALYZE test1;
ANALYZE test2;
 
EXPLAIN SELECT test1.x
FROM test1, test2
WHERE test1.x % 2 = 1 AND test1.x = test2.x;
                            QUERY PLAN
------------------------------------------------------------------
 Hash Join  (cost=1.04..2132.29 rows=1 width=4)
   Hash Cond: (test1.x = test2.x)
   ->  Seq Scan on test1  (cost=0.00..1943.00 rows=50197 width=4)
         Filter: ((x % 2) = 1)
   ->  Hash  (cost=1.02..1.02 rows=2 width=4)
         ->  Seq Scan on test2  (cost=0.00..1.02 rows=2 width=4)

Notice the test1 row count is now much more accurate, and that analyzing the base table also analyzes the expression index. The total cost is now slightly higher (2132.29 vs. 1959.02), but that is not because the hash join is more expensive. Rather, it is because the nested loop misestimated how many rows it would need to process because it didn't know the selectivity of the modulus operation.

One thing I learned in researching this blog post is how much the optimizer "loves" hash joins. If test2 has three or more rows, or if test1 has ten times more rows and parallelism is enabled, a hash join is used even without expression index statistics. Hash joins are very robust despite misestimation so they are favored by the optimizer. The takeaway is that the creation of expression indexes for statistical purposes is recommended only if testing shows they actually improve query plans, i.e. improving explain row counts alone has little benefit.


Indexing Statistics on Expression Indexes

Monday, February 20, 2017

Most people know that Postgres allows the creation of indexes on expressions. This is helpful if you need index lookups of expressions used in where clauses.

However, there is another benefit to expression indexes, and that is optimizer statistics. Not only do expression indexes allow rapid lookups of matching expressions, but they also provide optimizer statistics, which improve row estimates and hence query plans. Here is an example:

CREATE TABLE test (x INTEGER);
INSERT INTO test SELECT x FROM generate_series(1, 100) AS t(x);
 
ANALYZE test;
 
SELECT COUNT(*) FROM test WHERE x % 2 = 1;
 count
-------
    50
 
EXPLAIN SELECT * FROM test WHERE x % 2 = 1;
                     QUERY PLAN
----------------------------------------------------
 Seq Scan on test  (cost=0.00..2.50 rows=1 width=4)
   Filter: ((x % 2) = 1)

The optimizer doesn't know the selectivity of the modulus operator, so it initially assumes only one row is returned. Once an expression index is created and analyze statistics generated, the optimizer knows exactly how many rows will be returned:

CREATE INDEX i_test ON test ((x % 2));
 
ANALYZE test;
 
EXPLAIN SELECT * FROM test WHERE x % 2 = 1;
                     QUERY PLAN
-----------------------------------------------------
 Seq Scan on test  (cost=0.00..2.50 rows=50 width=4)
   Filter: ((x % 2) = 1)

Interestingly, the optimizer used expression index statistics, even though the expression index itself was not used. In the example above, the modulus operator is not selective enough to make the index useful, but expression statistics would be useful for more complex queries, e.g. with joins. This method can also be used to create statistics on functions.

It is also possible to create an expression index that generates cross-columns statistics, the benefits of which I mentioned in an earlier blog post. For example, this expression index would supply accurate statistics for state/city combinations, but queries would need to use the exact concatenation construction:

CREATE INDEX i_customer_state_city ON customer ((state || '|' || city));
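
For the statistics (and the index) to apply, queries must spell out the same expression, e.g.:

EXPLAIN SELECT * FROM customer
WHERE state || '|' || city = 'NY|New York City';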

It would be nice if there were a way to create expression statistics without the overhead of creating and maintaining indexes.


Performance Still No Query Hints?

Friday, February 17, 2017

So, it is 2017 and Postgres still doesn't support query hints like other relational databases? Yep, it's true, and probably will be forever since "'Oracle-style' optimizer hints" is listed in the "Features We Do Not Want" section of the Postgres todo list. A wiki page outlines the reasons for this.

While this seems perfectly logical to people who have used Postgres for years, it strikes new users as rigid and extreme. There are several reasons for this divergence.

First, what new users don't realize is that there are ways to control the optimizer, just not with inline query hints. The most useful ways are by changing planner constants and other planner options, which allow you to tailor the optimizer to your specific hardware and query types. This is a more holistic approach to optimizer tuning because, if set properly, these settings improve all queries, unlike "use this index"-style hints, which must be added to every query.
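
For example, these planner settings (shown at the session level, but normally set in postgresql.conf) describe your hardware to the optimizer; the values are only illustrative:

SET random_page_cost = 1.1;        -- random I/O is cheap on SSDs
SET effective_cache_size = '8GB';  -- approximate kernel plus shared buffer cache size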

There are also several crude enable settings that can be changed at the session level to prevent the optimizer from considering certain executor capabilities, like join and scan types. Unfortunately, these affect the entire query, i.e. you can't turn off sequential scans for just one table, and they don't allow the specification of which index to use. These settings are really designed for debugging.
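
For example, to see what plan the optimizer would choose if nested loop joins were unavailable in the current session (a debugging sketch only):

SET enable_nestloop = off;
-- rerun the problem query with EXPLAIN to compare plans
RESET enable_nestloop;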

In the same vein, an area that often causes suboptimal plans is queries that reference several columns from the same table where the column values are correlated, e.g. a row with a US state value of 'NY' is more likely to have a city value of 'New York City' than a row with a state value of 'AZ'. While inline query hints might be able to fix this case for constants, they can't help when variables are passed into the query, like with parameterized queries. A patch being considered for Postgres 10, multivariate statistics, would benefit all queries by collecting statistics on column combinations in addition to the single-column statistics already collected. This, again, is an example of a holistic approach to improving the optimizer.

There are two additional ways to control single queries that are often used in production. First, common table expressions allow a query to be broken up into stages that are executed sequentially, because each with clause acts as an optimization fence. Second, the use of offset 0 prevents subqueries from being moved into the outer query.
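
For example, both sketches below keep the subquery from being flattened into the outer query; they assume a table test with an integer column x, and reflect optimizer behavior as of the Postgres releases current when this was written:

-- the with clause acts as an optimization fence
WITH odd AS (
        SELECT * FROM test WHERE x % 2 = 1
)
SELECT * FROM odd WHERE x > 100;

-- offset 0 prevents the subquery from being pulled up
SELECT *
FROM (SELECT * FROM test WHERE x % 2 = 1 OFFSET 0) AS s
WHERE x > 100;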

I expect future improvements to address other optimizer shortcomings. One area where inline query hints would really help is fixing queries in an emergency, when you don't have time to research the cause. Postgres currently doesn't have an easy solution for that.


Performance Going Deep on Stats

Wednesday, February 15, 2017

Postgres includes lots of monitoring tools that allow you to see what is happening, but there are a few settings that really go into detail, e.g. log_statement_stats:

SET client_min_messages = log;
SET log_statement_stats = true;
 
SELECT 100;
LOG:  QUERY STATISTICS
DETAIL:  ! system usage stats:
1 !       0.000000 s user, 0.000000 s system, 0.000201 s elapsed
2 !       [0.000000 s user, 0.000000 s system total]
3 !       0/0 [0/0] filesystem blocks in/out
4 !       0/14 [0/363] page faults/reclaims, 0 [0] swaps
5 !       0 [0] signals rcvd, 0/0 [0/0] messages rcvd/sent
6 !       0/0 [3/0] voluntary/involuntary context switches
 ?column?
----------
      100

What do all those numbers mean? Well, the Linux getrusage() manual page has a clue — these are kernel resource usage statistics. The first two lines show the user and kernel ("system") cpu time used, as well as elapsed time. (The numbers in brackets on line #2 and following are totals for this process.) Line #3 shows I/O from/to the storage device (not from the kernel cache). Line #4 covers memory pages faulted into the process address space and reclaimed. Line #5 shows signal and ipc message activity. Line #6 shows process context switches.

Let's look at some more interesting queries by populating a new table:

CREATE TABLE test(x INTEGER);
INSERT INTO test SELECT x FROM generate_series(1, 1000) t(x);

By clearing the kernel buffers with echo 3 > /proc/sys/vm/drop_caches and restarting the server, we can see reads from the storage device (via log_statement_stats) and reads from the kernel cache (via explain (buffers true)):

EXPLAIN (ANALYZE true, BUFFERS true, TIMING false, COSTS false)
SELECT COUNT(*) FROM test;
LOG:  QUERY STATISTICS
DETAIL:  ! system usage stats:
!       0.000000 s user, 0.000000 s system, 0.019259 s elapsed
!       [0.008000 s user, 0.000000 s system total]
!       1712/0 [5840/280] filesystem blocks in/out
!       7/94 [21/586] page faults/reclaims, 0 [0] swaps
!       0 [0] signals rcvd, 0/0 [0/0] messages rcvd/sent
!       29/1 [73/3] voluntary/involuntary context switches
                    QUERY PLAN
---------------------------------------------------
 Aggregate (actual rows=1 loops=1)
   Buffers: shared read=5
   ->  Seq Scan on test (actual rows=1000 loops=1)
         Buffers: shared read=5
 Planning time: 3.673 ms
 Execution time: 2.221 ms
 
EXPLAIN (ANALYZE true, BUFFERS true, TIMING false, COSTS false)
SELECT COUNT(*) FROM test;
LOG:  QUERY STATISTICS
DETAIL:  ! system usage stats:
!       0.000000 s user, 0.000000 s system, 0.000408 s elapsed
!       [0.008000 s user, 0.000000 s system total]
!       0/0 [5840/288] filesystem blocks in/out
!       0/3 [21/591] page faults/reclaims, 0 [0] swaps
!       0 [0] signals rcvd, 0/0 [0/0] messages rcvd/sent
!       0/0 [74/3] voluntary/involuntary context switches
                    QUERY PLAN
---------------------------------------------------
 Aggregate (actual rows=1 loops=1)
   Buffers: shared hit=5
   ->  Seq Scan on test (actual rows=1000 loops=1)
         Buffers: shared hit=5
 Planning time: 0.027 ms
 Execution time: 0.250 ms

Notice the first count reads 1712 512-byte blocks from storage and 5 8kB blocks from the kernel cache into the shared buffer cache. (The unrecorded reads are probably system table information.) The second query shows no I/O from the storage device, and a hit on 5 8kB shared buffer pages.

As you might know, Postgres executes queries in stages. Fortunately, you can get rusage statistics for each stage with log_parser_stats, log_planner_stats, and log_executor_stats. (explain (analyze true) also shows planning and execution durations, which you can see above.)

As you can see, you can access a wealth of information about Postgres and how it is using the cpus, storage, and caches. These have a large impact on how quickly queries execute.


Performance Performance Goalposts

Monday, February 13, 2017

I often get the question, "How do I get good performance out of Postgres?", and sometimes I am kind of miffed because I get the feeling that the people asking expect a short answer. Those who are familiar with Postgres know there is no short answer, and it really isn't Postgres's fault because there isn't a simple answer to that question for any relational database.

I guess it comes down to the complexity of relational systems. There is the sql parser, the optimizer, and background workers. No matter how much auto-tuning we do, there are still going to be things that administrators must adjust for optimal performance. And then there is the hardware, which is stressed by all relational systems. Proper tuning of Postgres must happen at all three levels:

  • Hardware
  • Database parameter configuration
  • Sql (both data modeling and queries)

Unless all three are tuned, you can't say the system is optimized. I have already covered hardware tuning in my Database Hardware Selection Guidelines presentation. Database parameter configuration is a tough one because every site has different requirements, though there are a few common tuning cases. Tuning at the sql level involves proper schema and index setup, plus monitoring with something like log_min_duration_statement, pg_stat_statements, or pgBadger. Once you identify a problem sql query, you often have to do detective work with explain.

None of this is easy, but hey, if it were easy everyone could do it. The good news is that, when done well, database performance tuning can yield big dividends.


NoSQL Postgres Gives Developers More Options with NoSQL

Saturday, February 11, 2017

I just co-authored a blog for ibm's Linux on Power developer site. While there is similar content in my YeSQL slides, the ibm blog more concisely describes the four options available to application developers, from pure relational to pure NoSQL, and the options in between possible with Postgres.


Object bundles Creating Extensions in Schemas

Thursday, February 2, 2017

Postgres extensions are a powerful way to add functionality to Postgres and dozens of extensions are available. However, it can be difficult to manage them. By default, extensions are loaded into the first existing schema in search_path (usually the public schema) e.g.:

CREATE EXTENSION pgcrypto;
\df
                                        List of functions
 Schema |         Name          | Result data type |        Argument data types         |  Type
--------+-----------------------+------------------+------------------------------------+--------
 public | armor                 | text             | bytea                              | normal
 public | armor                 | text             | bytea, text[], text[]              | normal
…

This loads 36 functions into the public schema. This can be difficult to manage, particularly if multiple extensions are loaded into the same database.

One solution is to place each extension into its own schema. This can be accomplished by setting search_path before loading the extension, or using the with schema clause in the create extension command:

CREATE SCHEMA pgcrypto;
 
CREATE EXTENSION pgcrypto WITH SCHEMA pgcrypto;
 
\df pgcrypto.*
                                         List of functions
  Schema  |         Name          | Result data type |        Argument data types         |  Type
----------+-----------------------+------------------+------------------------------------+--------
 pgcrypto | armor                 | text             | bytea                              | normal
 pgcrypto | armor                 | text             | bytea, text[], text[]              | normal
…

Users can then set their search_path whenever they want to use the extension, or prefix the function call with the schema name, e.g. select pgcrypto.armor('abc').
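
For example, a minimal sketch continuing the pgcrypto example above:

SET search_path = pgcrypto, public;
SELECT armor('abc');

-- or, without changing search_path
SELECT pgcrypto.armor('abc');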


Administration Beware of Antivirus Software with Postgres

Monday, January 30, 2017

Antivirus software helps to protect your computer from external attack. Unfortunately, it can cause problems with Postgres. Antivirus software constantly scans for files and behavior that suggests attacks, but Postgres can easily get caught in that net.

Postgres regularly creates heap, index, and wal files, and these files are binary. Depending on what data is written, these files can contain patterns that are mistaken for viruses, causing the files to be removed from Postgres-controlled directories and quarantined. Obviously, this causes bad things to happen to Postgres. Postgres sites using anti-virus software should disable anti-virus scans of directories controlled by Postgres.


News Updated Blog Categories, Images

Saturday, January 28, 2017

When the number of my blog entries surpassed 400, I realized that the categories I was using to group entries for easier reference were too coarse. Therefore, I have added new blog categories to make future reference easier. Also, I have dozens of new blog entries ready to post in the coming months.

Related to restructuring, I have updated many of the stock images in my presentations.


Performance Two More Interviews

Wednesday, January 25, 2017

I did two hour-long audio interviews recently with the group Three Devs and a Maybe. The first covers performance and the second covers database concurrency.


Thoughts Moscow Interview

Monday, January 23, 2017

I did a 12-minute interview in Moscow in November and a recording is now available online. The first four minutes cover Russian-specific aspects of Postgres, and the remainder discusses Postgres's general direction, challenges, and its place in the database ecosystem.


Security Using Ssl Certificates

Tuesday, January 17, 2017

Having covered ssl certificate creation and the use of certificate authorities (ca), I would like to put it all together and show how certificates and certificate authorities work to ensure trusted Postgres communication.

I have created a diagram showing server, client, and certificate authority certificates. None of these certificates is secret, e.g. the server sends its ssl certificate to the client, and vice versa. In the diagram, the server and client use the same certificate authority certificate. (Intermediate certificate authorities could also be used.)

When the client connects, the server sends its certificate to the client. The client uses the public key in its certificate authority certificate to verify that the server certificate was signed by its trusted certificate authority (the red line). It then uses the public key in the server certificate to encrypt a secret key that is sent to the server. Only a server with the matching private key can reply to generate a session key. It is not the possession of the server certificate that proves the server's identity but the possession of the private key that matches the public key stored in the server's certificate. The same is true for client certificates used for client host and user authentication (the blue line).

Hopefully this diagram helped you see how the same certificate authority certificate on the server and client allows for identity verification. Interestingly, identity verification is not required for two systems to communicate in an eavesdrop-proof manner, but if you can't be sure who you are communicating with, it isn't very useful.


Security Creating Ssl Certificates

Thursday, January 12, 2017

Having covered the choice of certificate authorities, I want to explain the internals of creating server certificates in Postgres. The instructions are already in the Postgres documentation.

When using these instructions for creating a certificate signing request (csr), two files are created:

  • certificate signing request file with extension req
  • key file, containing public and private server keys, with extension pem

(It is also possible to use an existing key file.) You can view the contents of the csr using openssl, e.g.:

$ openssl req -in server.req -text
Certificate Request:
    Data:
        Version: 0 (0x0)
        Subject: … CN=momjian.us …
        Subject Public Key Info:
            Public Key Algorithm: rsaEncryption
                Public-Key: (2048 bit)
                Modulus:
                    00:b2:ea:53:a6:5e:0a:bd:dd:59:08:52:0d:bb:77:
                    …
                Exponent: 65537 (0x10001)
        Attributes:
            a0:00
    Signature Algorithm: sha256WithRSAEncryption
…
-----BEGIN CERTIFICATE REQUEST-----
…
-----END CERTIFICATE REQUEST-----

The first argument to the openssl command-line tool is always a subcommand, in this case req. (The openssl manual pages are split based on the subcommand, so man req shows the openssl arguments for the req subcommand.)

The common name (cn) shown above is potentially checked by the client. The encoded section ("certificate request") can be suppressed with -noout. The file server.req already contains this information in text format because -text was specified during its creation.

The server public and private keys stored in privkey.pem can also be viewed:

$ openssl pkey -in privkey.pem -text
-----BEGIN PRIVATE KEY-----
MIIEvwIBADANBgkqhkiG9w0BAQEFAASCBKkwggSlAgEAAoIBAQCy6lOmXgq93VkI
…
-----END PRIVATE KEY-----
Private-Key: (2048 bit)
modulus:
    00:b2:ea:53:a6:5e:0a:bd:dd:59:08:52:0d:bb:77:
    …
publicExponent: 65537 (0x10001)
    …
privateExponent:
    …
prime1:
    …
prime2:
    …
exponent1:
    …
exponent2:
    …
coefficient:
    …

All of this can be very complex so I have created a diagram which illustrates what is happening. At the top-left is the server key generated by openssl req. This command also creates the certificate signing request (the csr, top-right) which contains:

  • Certificate signing information in X.509 format, e.g. common name (cn)
  • Public server key to be used by clients wishing to authenticate the server's identity
  • The above signed by the server's private key to prove that the server owner supplied this information

The certificate signing request (csr) can't be used as a certificate because it is unsigned. For self-signed certificates, the server's private key is used to sign the csr. (These are the instructions given in the Postgres documentation.) A more powerful option, as I mentioned in my previous blog post, is to use a local certificate authority. In my diagram, a certificate authority (bottom-left) uses its private key to sign the csr and create a certificate (bottom-right).

Once the server has a certificate, any client who has the signer of the server's certificate in their ~/.postgresql/root.crt file can authenticate the server certificate by using the sslmode connection parameter verify-ca. They can also verify that the certificate was created for the specific server's host name by checking the common name (cn) using verify-full. (Clients can record revoked certificates in ~/.postgresql/root.crl.)
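
For example, a client connection that verifies both the certificate chain and the server host name might look like this (the host and database names are placeholders):

$ psql "host=db.example.com dbname=postgres sslmode=verify-full"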

Hopefully you can make better use of ssl server certificates now that you understand how they are created, inspected, and verified.


Security Ssl Certificates and Certificate Authorities

Monday, January 9, 2017

When setting up ssl in Postgres, you can't just enable ssl. You must also install a signed certificate on the server.

The first step is to create a certificate signing request (csr) file that contains the host name of the database server. Once created, there are three ways to sign a csr to create a certificate:

  • If the certificate is to be self-signed, use the key created by the certificate signing request to create a certificate.
  • If using a local certificate authority, sign the csr file with the local certificate authority's key.
  • If using a public certificate authority, have it sign the csr.

The use of public certificate authorities doesn't make sense for most databases because it allows third parties to create trusted certificates. Their only reasonable use is if you wish to allow public certificate authorities to independently issue certificates that you wish to trust. This is necessary for browsers because they often connect to unaffiliated websites where trust must be established by a third party. (Browsers include a list of public certificate authorities who can issue website certificates it trusts.)

Issuing certificates to clients as well enables additional features, e.g. client host and user authentication via client certificates. These features require that the server and client certificates be signed by the same certificate authority.

Ssl setup in Postgres can be complicated, but it offers a set of security and usability features that is hard to match.


Administration Use Kill -9 Only in Emergencies

Friday, January 6, 2017

During normal server shutdown, sessions are disconnected, dirty shared buffers and pending write-ahead log (wal) records are flushed to durable storage, and a clean shutdown record is written to pg_control. During the next server start, pg_control is checked, and if the previous shutdown was clean, startup can ignore the wal and start immediately.

Unfortunately, a clean shutdown can take some time, and impatient database administrators might get into the habit of using kill -9 or pg_ctl stop -m immediate to quicken the shutdown. While this does have the intended effect, and you will not lose any committed transactions, it greatly slows down the next database startup because all wal generated since the last completed checkpoint must be replayed. You can identify an unclean shutdown by looking in the server logs for these two ominous lines:

LOG:  database system was interrupted; last known up at 2016-10-25 12:17:28 EDT
LOG:  database system was not properly shut down; automatic recovery in progress

Such crash shutdowns can also happen if the computer crashes, the operating system crashes, or if a proper database shutdown script isn't triggered on computer shutdown.

So, in summary, administrators should avoid Postgres crash shutdowns unless shutdown time is more precious than startup time. Every time those log lines appear, Postgres is doing more work than would have been necessary if the previous shutdown had been clean.
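
If you need to shut down quickly but cleanly, a fast shutdown disconnects sessions and rolls back their open transactions but still writes the shutdown checkpoint, e.g. (the data directory path is a placeholder):

$ pg_ctl stop -D /usr/local/pgsql/data -m fast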


Administration Controlling Autovacuum

Tuesday, January 3, 2017

Unlike other database systems, Postgres makes the cleanup process visible and tunable to users. Autovacuum performs recycling of old rows and updates optimizer statistics. It appears in ps command output, the pg_stat_activity system view, and optionally in the server logs via log_autovacuum_min_duration.

Postgres also allows fine-grained control over the autovacuum cleanup process. Occasionally users find that cleanup is slowing the system down, and rather than modifying the behavior of autovacuum, they decide to turn it off via the autovacuum setting.

However, turning off autovacuum can cause problems. Initially the system will run faster since there is no cleanup overhead, but after a while old rows will clog up user tables and indexes, leading to increasing slowness. Once that happens, you can turn on autovacuum again, and it will recycle the old rows and free up space, but there will be much unused space that can't be reused quickly, or perhaps ever.

Turning off autovacuum is kind of like someone trying to save time by not changing the engine oil in their car. They are correct that for a while their car will be in the shop less often, but ultimately the car will be in the shop for a long time, or will stop working. Therefore, don't turn off autovacuum.

Another problem is that while autovacuum is turned off, optimizer statistics are not updated, perhaps causing slower execution plans. Fortunately, restarting autovacuum does fully fix that problem.

Now that we know that turning off autovacuum is a bad idea, what are the options for administrators who want to reduce the overhead of autovacuum? Well, first, there are many autovacuum tuning parameters that allow autovacuum activity to happen less often, or consume fewer resources while running.

A more aggressive, and perhaps creative, approach is to change when autovacuum runs. Most systems have busy times and idle times. Sometimes this can be determined by the time of the day or the day of the week, e.g. Sundays, 0200-0500. In other cases it can be determined by the system's load average or number of active sql sessions in pg_stat_activity.

No matter how it is determined, wouldn't it make sense for autovacuum to run during these idle periods, and not run during busy periods? Well, it's not quite that simple. You don't really want to force autovacuum to run during idle times if there is no useful activity to perform, nor do you want it to never run during busy times in case there is a massive delete that requires cleanup or a large data change that requires updated optimizer statistics.

What you want is to encourage autovacuum to run during idle periods, and discourage it from running during busy times. Fortunately, that is easy to do, especially with the alter system command added in Postgres 9.4.

Let's assume you have your own method to determine busy and idle times. When an idle time starts, simply execute:

ALTER SYSTEM SET autovacuum_vacuum_scale_factor = 0.1;
ALTER SYSTEM SET autovacuum_analyze_scale_factor = 0.05;
 
SELECT pg_reload_conf();
 pg_reload_conf
----------------
 t

Of course, if you have already customized these settings in postgresql.conf, you can instead use the following in psql to reduce the current value of autovacuum_vacuum_scale_factor by half:

SELECT current_setting('autovacuum_vacuum_scale_factor')::float8 / 2 AS scale
\gset
ALTER SYSTEM SET autovacuum_vacuum_scale_factor = :scale;

You might also want to reduce autovacuum_freeze_max_age and autovacuum_multixact_freeze_max_age during this period. Then, at the end of the idle period, run this:

ALTER SYSTEM RESET autovacuum_vacuum_scale_factor;
ALTER SYSTEM RESET autovacuum_analyze_scale_factor;
 
SELECT pg_reload_conf();
 pg_reload_conf
----------------
 t

This will set the values back to their postgresql.conf or default values. Pretty simple, huh? Yet effective. A more sophisticated approach would be to proportionally decrease these settings based on the load on the system.

Let's walk through how this works. Suppose we don't change any settings during idle times and use the default autovacuum_vacuum_scale_factor of 20%. If a table has 19.5% expired rows at midnight, autovacuum will not run. When the percentage of expired rows reaches 20% at 11am, autovacuum will run and remove the expired rows. Now suppose that autovacuum_vacuum_scale_factor is reduced by half at midnight — autovacuum will run and all tables will have less than 10% expired rows by the time the idle period ends. Now, during the busy time autovacuum will only run if a table has increased from 10% expired rows to 20%.

It would be nice if this proportional behavior could be added to autovacuum but I am not sure how to implement that in a clean way.
