Bruce Momjian

Postgres Blog


This blog is about my work on the Postgres open source database, and is published on Planet PostgreSQL. PgLife allows monitoring of all Postgres community activity.




Multi-host Sharding Update from Asia

Tuesday, December 12, 2017

It has been a year since my last blog post about sharding. There are many sharding improvements in Postgres 10, and this PGConf.Asia talk from ntt staff covers the advances. The slides go into great detail about what has been accomplished, and what remains. This whiteboard image from the PGConf.Asia Unconference gives a single view of the necessary pieces.

What I wrote about a year ago felt like a far-distant goal — now it feels like we are almost there. Yes, it will take years to fully complete this feature, like all our complex features, but it feels like we are very near to proof-of-concept and production deployments, at least for certain workloads. Postgres pulls off these amazing feats regularly, and you would think I would get used to it and not be surprised, but I still am.

View or Post Comments

Thoughts Postgres vs PostgreSQL

Friday, November 24, 2017

I have been with the project long enough to remember how the project got the name "PostgreSQL". In 1996, we inherited the name "Postgres95" from the Berkeley team's remaining member Jolly Chen. Obviously the Postgres95 name wasn't going to last long-term, so a new name had to be chosen. We could revert to the original Berkeley name "Postgres", or go with a name that more closely identified us as an sql database and call it "PostgreSQL". To add complexity to the discussion, Berkeley Postgres (before Postgres95) used the quel query language, so there were valid arguments that we needed to advertise that the database now used sql. After heated debate, "PostgreSQL" was chosen, and "Postgres" was later accepted as a valid alias.

I was always in the "Postgres" camp, but over time I have learned to appreciate the arguments of those who prefer "PostgreSQL". The two strong arguments for "PostgreSQL" are:

  • Visually better than "Postgres"
  • Identifies it to new people as an sql database

The strong argument for "Postgres" is that "Postgres" is simpler to say, while "PostgreSQL" is complex and has several verbal options, e.g., "Postgres-Q-L," "Postgres Sequel," "Postgre Sequel."

What has really cemented my appreciation for both names is that the companies I have worked for have cycled through the two names because they saw value in each one. What that tells me is that both names have unique value and that we have to do our best to leverage the strengths of both.

Also, from the further obscure closet, here is the first email suggesting that Postgres use an elephant as its mascot:

but if you want an animal-based logo, how about some sort of elephant? After all, as the Agatha Christie title read, elephants can remember ...

View or Post Comments

Hardware Huge Pages

Wednesday, November 22, 2017

Postgres 9.4 added the server variable huge_pages. The Postgres documentation does a good job explaining huge pages, "The use of huge pages results in smaller page tables and less CPU time spent on memory management, increasing performance."

Specifically, all modern operating systems use virtual memory to map process-visible memory to physical ram. Typically, the default granularity of virtual pages is four kilobytes. For a process using one megabyte of virtual memory, that is 256 page table entries mapping virtual addresses to physical addresses. For a process using one gigabyte of virtual memory, that is 256k virtual page table entries! While modern cpus have translation lookaside buffers (tlb) to speed virtual/physical memory mapping, there is still a lot of overhead for processes that use large memory spaces. Huge pages allow for less granular virtual pages, meaning fewer page table entries and fewer lookups. The x86-64 cpu architecture supports 2MB and 1GB-sized huge pages.

Unfortunately Postgres only supports huge pages on Linux, and to make use of huge pages, you must enable huge pages in the operating system. The Postgres documentation goes into detail on how to do this. You can see how much memory is allocated for huge pages on Linux by looking in /proc/meminfo:

AnonHugePages:         0 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB

The last line indicates that huge pages are 2MB in size. The first line indicates the anonymous or transparent huge pages in use in your kernel. Transparent huge pages can cause significant overhead when a background kernel thread tries to rearrange memory to coalesce it into 2MB chunks. It is recommended that transparent huge pages be disabled when using Postgres, and only explicit huge pages be used. Fortunately, most modern Linux kernels have transparent huge pages disabled by default.
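
Once the operating system has huge pages reserved, you can check from psql whether Postgres is configured to use them. This is a minimal check assuming the default setting, where "try" means Postgres attempts to use huge pages at startup and silently falls back to normal pages if the allocation fails:

SHOW huge_pages;
 huge_pages
------------
 try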

View or Post Comments

Data modeling Getting Object Creation Time

Tuesday, November 21, 2017

The Postgres community is often asked to provide automatic tracking of the creation and last modification times for objects, e.g., tables, functions. File systems track file last modification times, and some record creation time, so the logic is why can't Postgres do this? (Unix's inode modification time does not represent the creation time.)

For object creation-time tracking, we have to consider when creation time should be set. It is obvious that the creation time should be set when the object is first created, but there are less obvious cases to decide as well.

For object modification, things become more complicated, because there are several layers of modification to consider.

Tracking all of these times is reasonable. However, there is significant code and performance overhead in doing it, and a given user is probably only interested in a few of these. A detailed email by Tom Lane also explains that even if we tracked all of this, it only records the change time, not what was changed.

A more holistic solution is auditing, which would contain change information as well as tracking of multiple change times. Effort is probably better spent adding auditing to Postgres than adding fine-grained tracking of creation and modification times. Users who want to track such times are best served writing custom event triggers.
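
As a minimal sketch of that last approach — the log table and function names here are hypothetical — an event trigger can record the time of each ddl command:

CREATE TABLE ddl_log (
        object_identity TEXT,
        command_tag     TEXT,
        logged_at       TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP
);
 
CREATE FUNCTION log_ddl() RETURNS event_trigger AS $$
DECLARE
        r RECORD;
BEGIN
        FOR r IN SELECT * FROM pg_event_trigger_ddl_commands()
        LOOP
                INSERT INTO ddl_log (object_identity, command_tag)
                VALUES (r.object_identity, r.command_tag);
        END LOOP;
END;
$$ LANGUAGE plpgsql;
 
CREATE EVENT TRIGGER track_ddl ON ddl_command_end
        EXECUTE PROCEDURE log_ddl();

Even this records only the change time, not what was changed, which is why full auditing remains the more complete answer.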

View or Post Comments

Press Interview in China

Monday, November 13, 2017

I previously mentioned my visit to Hangzhou, China. A video interview from that trip is now available, as well as my presentation in English and Chinese (starts at 153 minutes).

The most interesting part of my visit to Shenzhen, China was an evening meeting with 30 Postgres community members discussing how to increase Postgres adoption in China.

View or Post Comments

Data modeling Data in the Database vs the File System

Monday, November 6, 2017

An age-old question is whether it is better to put data in a database or a file system. Of course, the answer is "it depends," but let's look at what it depends on.

First, Postgres stores its data in the file system, so the file system must be good for something. Postgres also stores some of its configuration files in the file system, e.g., pg_hba.conf, so it isn't simply true that once you have a database, everything is better stored in the database.

It usually comes down to evaluating database positives vs. negatives — first the database storage positives:

  • Do you need to view the data in multiple ways?
  • Is synchronizing database data and file system data going to be difficult?
  • Do you need multi-object commit synchronization?
  • Do applications need a single, guaranteed-consistent view of their data?
  • Is a file system API unreasonable or inaccessible to clients?

and these database storage negatives:

  • Are the objects very large, meaning that using bytea or large objects is too cumbersome?
  • Is the data so small that the per-row 28-byte overhead is too large?
  • Is direct file system access useful?

It is often a question of object size. At the multi-megabyte size, having multiple copies of the data in the database while a query is executing is not a problem, and the auto-synchronization of data in the database is a great benefit. At the hundreds of megabytes and larger, the overhead of moving around multiple copies of the object in the database can become a performance problem. Of course, these problems only happen if you access the large data object in a query — if you don't query the bytea column or access the large object in the query, there is no overhead except storage. Updates of rows containing large objects also don't suffer a performance penalty if the update doesn't modify the large object. (The same is true for any long value, e.g., character strings, json.)

Also, if the object is not a binary blob to the database but has a structure the database can understand then storing it in the database has other advantages. For example, doing full text search or json lookups on data can be very beneficial.
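
For example, here is a hedged sketch using a hypothetical document table — both json containment lookups and full text search operate directly on data the database understands:

CREATE TABLE document (id SERIAL PRIMARY KEY, body JSONB);
 
-- json lookup
SELECT id FROM document WHERE body @> '{"author": "Sal"}';
 
-- full text search on one json field
SELECT id FROM document
WHERE to_tsvector('english', body->>'abstract') @@ to_tsquery('english', 'database');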

Some of the same questions come up when deciding whether some of your data would be best stored in a different database technology and using foreign data wrappers to unify data access. Again, the answer is still "it depends," but hopefully this helps you judge the "depends" better.

View or Post Comments

Conference Why Attend Conferences?

Friday, November 3, 2017

I have attended 323 Postgres events in my career. While I have enjoyed almost all of them, many had different focuses, so I thought I would share my experiences. First, there are a variety of conference types:

  1. Vendor conferences: often in big cities, which focus on company-produced products
  2. Business conferences: also often in big cities, which focus on business challenges and discussions, often with high attendance prices
  3. Community conferences: led by people who care about open-source software and focus on software knowledge transfer
  4. Hobbyist conferences: often in smaller cities, which focus on interpersonal relationship building with technology as a catalyst, often free

It would be nice if I could say which conference types are good or bad, but that isn't possible. Each conference targets an audience whose needs it seeks to fulfill. Let's look at the needs that each fulfills:

  1. Vendor conferences: If you are new to a technology and need people to help you navigate purchase options, these conferences are for you.
  2. Business conferences: If you are frequently challenged to make business decisions, but feel you have no one to share options with or brainstorm, this type of conference can give you a framework to help you make your next complex business decision.
  3. Community conferences: If you spend significant time solving technological problems, you can gain great insight and new approaches by attending this type of conference.
  4. Hobbyist conferences: If you are looking for emotional connections to people who share similar interests, this type of conference can be personally rewarding.

Ideally everyone would go to conferences which match their interests, but what happens when they don't match? Here are some examples:

  1. Vendor conferences: "Wow, this is boring. The booth staff don't even know about the technology they are selling. When will this be over?"
  2. Business conferences: "People are very passionate about the problems they are trying to solve. I am glad I don't have these problems — they seem unsolvable."
  3. Community conferences: "These people really care about the minutia of the software. When are they going to get a life?"
  4. Hobbyist conferences: "Does this end with everyone sitting in a circle and roasting marshmallows over a cpu fan?"

Ninety-five percent of Postgres conferences are community conferences (#3). They have some of the other aspects, but that is not the focus. Open source vendor conferences, e.g., LinuxWorld, used to be popular but are almost extinct. They were really focused on vendors, and when attendees realized this, and vendors didn't get the desired sales opportunities, demand collapsed. Postgres has few business-focused conferences — Postgres Vision is an example. Hobbyist conferences still exist, and probably always will, though they are, by necessity, small.

For me, conferences allow me to hear from and talk to people knowledgeable in a field I want to study. These interactions convey information I can't get from reading books and articles. The best interactions are a fire hose of information that I can absorb and probe. Such interactions give me information it would take me weeks or months to learn.

This Slashdot discussion covers some of the value in attending conferences. Hopefully this blog post and that discussion will help you navigate the many conferences offered and help you get the best value from them.

View or Post Comments

Sql Blocking Ddl

Wednesday, November 1, 2017

Postgres is advertised as "reading never blocks writing and writing never blocks reading." While this is true for selects and dml (insert, update, delete), it isn't true for ddl.

Ddl, like alter table, can block reads and writes. Postgres has continued to reduce locking requirements for alter and other ddl commands, and will continue to improve things. However, ultimately there is some ddl that will always require, at least briefly, the blocking of reads and writes.
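
One practical mitigation, sketched here with a hypothetical accounts table, is to set lock_timeout before running ddl so the command gives up rather than queuing behind long-running queries and blocking every query that arrives after it:

SET lock_timeout = '2s';
 
-- fails with "canceling statement due to lock timeout" instead of waiting indefinitely
ALTER TABLE accounts ADD COLUMN note TEXT;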

View or Post Comments

Conference Speaking in China

Tuesday, October 17, 2017

I had the pleasure of speaking at Alibaba's Computing Conference last week in Hangzhou, China. I gave two presentations. The first covered Postgres 10 features (images, video). The second was a generic talk about the future of open source databases (images, video in English and Chinese (starts at 153 minutes)).

This week I am attending a community Postgres conference in Shenzhen, China. Postgres users here have a strong desire to increase Postgres adoption in China in the coming years.

View or Post Comments

Sql Using Xmin in Queries

Wednesday, October 4, 2017

You might be aware that Postgres uses invisible columns to track concurrent row access and modifications. My mvcc talk covers much of this.

Someone at a conference mentioned they were using the invisible xmin column to perform updates. I was initially skeptical of this approach, but once he explained the purpose, it made sense, e.g.:

CREATE TABLE mvcc_test (x INTEGER PRIMARY KEY, name TEXT);
 
INSERT INTO mvcc_test VALUES (1, 'Sal');
 
SELECT * FROM mvcc_test;
 x | name
---+------
 1 | Sal
 
SELECT xmin, * FROM mvcc_test;
 xmin | x | name
------+---+------
  715 | 1 | Sal
 
UPDATE mvcc_test SET name = 'Sally' WHERE x = 1 AND xmin = 715;
 
SELECT xmin, * FROM mvcc_test;
 xmin | x | name
------+---+-------
  716 | 1 | Sally

Why would you do this? Normally this would be done using select ... for update. However, what if you want to do the update without locking the row between select and update, and are willing to discard the update if the row has already been updated by another session? This is a case where using xmin in updates makes sense.
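
The application detects that it lost the race by checking the command's row count. Continuing the example above, if another session had already updated the row, its xmin would no longer be 715 and the statement would report zero rows:

UPDATE mvcc_test SET name = 'Sally' WHERE x = 1 AND xmin = 715;
UPDATE 0

The application can then re-read the row, pick up the new xmin, and decide whether to retry or discard its change.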

View or Post Comments

Sql Percent_rank vs. Cume_dist

Monday, October 2, 2017

During research for my Postgres Window Magic talk, I studied the unusual behavior of percent_rank and cume_dist (cumulative distribution). The Postgres documentation wasn't helpful. I finally came up with this paragraph to add to the Postgres 10 docs:

cume_dist computes the fraction of partition rows that are less than or equal to the current row and its peers, while percent_rank computes the fraction of partition rows that are less than the current row, assuming the current row does not exist in the partition.

There are some key phrases in there. First, cume_dist computes the percentage of rows that are "less than or equal," while percent_rank computes a similar percentage that are "less than" the current row, except it also assumes the current row isn't in the partition.

That last issue of not counting the current row struck me as odd. I talked to some statisticians about this and did some web searches. The best I can understand is that this special percent_rank behavior is designed to handle cases where there is a single maximum value in the partition. For example:

SELECT x, (PERCENT_RANK() OVER w)::numeric(10, 2),
       (CUME_DIST() OVER w)::numeric(10, 2)
FROM generate_series(1, 5) AS f(x)
WINDOW w AS (ORDER BY x);
 x | percent_rank | cume_dist
---+--------------+-----------
 1 |         0.00 |      0.20
 2 |         0.25 |      0.40
 3 |         0.50 |      0.60
 4 |         0.75 |      0.80
 5 |         1.00 |      1.00

In this example, what are the ideal ratios for the last line? Cume_dist is easy since it is "less than or equal" to the current row, so that is clearly 1.00. What about percent_rank, which is "less than?" Without the "assume the current row isn't in the partition" exclusion, it would show 0.80. That would be a silly answer since all the rows are less than the last row, except the last row itself. I think this is why that exclusion was added.

This issue becomes even more complex when there are many duplicates. For example, imagine this data set:

WITH cte (x) AS (
        SELECT 0
        UNION ALL
        SELECT 1 FROM generate_series(1, 5)
        UNION ALL
        SELECT 2
)
SELECT x, (PERCENT_RANK() OVER w)::numeric(10, 2),
       (CUME_DIST() OVER w)::numeric(10, 2)
FROM cte
WINDOW w AS (ORDER BY x);
 x | percent_rank | cume_dist
---+--------------+-----------
 0 |         0.00 |      0.14
 1 |         0.17 |      0.86
 1 |         0.17 |      0.86
 1 |         0.17 |      0.86
 1 |         0.17 |      0.86
 1 |         0.17 |      0.86
 2 |         1.00 |      1.00

Frankly, neither percent_rank nor cume_dist look like they produce reasonable results. If I got a score of 1, is saying that a ratio of 0.17 rows are less than mine, or that 0.86 are equal or less than mine, useful? Imagine this was an sat test and many people got 650. Wouldn't it be reasonable to count half of the people who had matching scores less than or equal to mine, which would compute 0.50, rather than including all or none of the matching scores?

Right now, there is no way in Postgres to do that, but I think it would be a nice thing to add. I assume a server-side function could be written to do this. Computing the average of percent_rank and cume_dist would yield a similar result.
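
As a quick sketch of that last idea, averaging the two built-in functions over the same data set gives roughly the "count half of the ties" behavior described above:

WITH cte (x) AS (
        SELECT 0
        UNION ALL
        SELECT 1 FROM generate_series(1, 5)
        UNION ALL
        SELECT 2
)
SELECT x, ((PERCENT_RANK() OVER w + CUME_DIST() OVER w) / 2)::numeric(10, 2) AS mid_rank
FROM cte
WINDOW w AS (ORDER BY x);
 x | mid_rank
---+----------
 0 |     0.07
 1 |     0.51
 1 |     0.51
 1 |     0.51
 1 |     0.51
 1 |     0.51
 2 |     1.00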

View or Post Comments

Time Storing the Original Time Zone

Wednesday, September 27, 2017

In a previous blog entry I suggested storing the original time zone offset in a separate column if clients need to know the stored time in the original time zone. There is some more complexity to this issue that I would like to cover.

First, when I suggested using select extract(timezone from current_timestamp), I assumed the user was also storing the current_timestamp value in the database. If they were storing a past or future timestamp in the database, they would need to use that value in extract, instead of current_timestamp.

Second, as was pointed out by Miha Vrhovnik in a blog comment, things get more complicated if a future time is stored in the database and the future time zone rules change after the data is stored. You might think this concern is only theoretical, but Russia made such changes in 2014 and 2016.

To get the proper behavior, you have to ask yourself, if you are storing a future timestamp, do you want to retain the same wall-clock time? If you were recording a future doctor's appointment, odds are you would want to adjust the stored value to reflect the same time of day (wall-clock time). If you were recording a future astronomical event, you would want to keep the same instant in time, even if the visible time changes. The default timestamp with time zone behavior is to retain the same instant in time, i.e., astronomical behavior.

To retain the same future wall-clock time after a time zone rule change, you would need to store the timestamp using timestamp without time zone and store the time zone name in a separate column. You would then need to combine the timestamp without time zone and the time zone name in each query to compute the instant in time based on the current time zone rules. Any indexes that did such computations would also need to be reindexed when time zone rules change.

This method allows the data to adjust to future time zone rule changes by computing the time zone offset on demand, rather than being locked in the time zone rules which were current at the time the data was entered, e.g.:

-- controls the OUTPUT time zone of AT TIME ZONE when passed a WITHOUT TIME ZONE value
SET TIME ZONE 'Europe/Moscow';
 
-- AT TIME ZONE specifies the time zone for the literal value
SELECT TIMESTAMP WITHOUT TIME ZONE '2012-03-08 09:00:00' AT TIME ZONE 'Europe/Moscow';
        timezone
------------------------
 2012-03-08 09:00:00+04
 
SELECT TIMESTAMP WITHOUT TIME ZONE '2022-03-08 09:00:00' AT TIME ZONE 'Europe/Moscow';
        timezone
------------------------
 2022-03-08 09:00:00+03

Basically, if the future time zone rules change, the output time of day would be the same, but the instant in time compared to utc would change, e.g., the +03. Of course, if all events are in the same time zone, and you don't want "astronomical behavior," then there is no need for time zone adjustments. Miha Vrhovnik is right that using time zones can be more trouble than they are worth, especially if all entries are in the same time zone and future time zone changes are a possibility.
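
Here is a concrete sketch of the schema described above, using a hypothetical appointment table — the instant in time is computed at query time from the stored wall-clock time and zone name:

CREATE TABLE appointment (
        scheduled_at TIMESTAMP WITHOUT TIME ZONE,
        tz_name      TEXT
);
 
INSERT INTO appointment VALUES ('2022-03-08 09:00:00', 'Europe/Moscow');
 
SELECT scheduled_at AT TIME ZONE tz_name AS instant
FROM appointment;

If the zone's rules change again, the computed instant changes with them, while the stored wall-clock time stays the same.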

View or Post Comments

Sql PL/pgSQL's Good Sql Alignment

Monday, September 25, 2017

I recently wrote about PL/pgSQL's good alignment with the scripting needs of the sql language. There have been several email threads in the past few months exploring this in much greater detail so I wanted to share them.

The first email thread defines "impedance mismatch" as a way of measuring how closely a language's features and syntax fit the requirements of server-side functions. It cautions against always choosing familiar languages because sometimes using a language with a lower "impedance mismatch", even if its syntax is unfamiliar, is better in the long run. It also has a long discussion about when server-side logic is wise.

The second email thread talks about the mechanics of using server-side languages, and mentions PL/Perl and PL/v8 as good for compute-heavy server-side functions.

The final thread is a wish-list of how to improve PL/pgSQL. While there are many requests, they center on cases where PL/pgSQL is overly-verbose or the defaults are non-optimal. (I was actually happy to see that the scope of requested improvements was so limited.)

I will close with a particularly-glowing post about the community's accomplishments. As someone who sees complaints and bug reports all day, it is good to remember how much goes well in our community and development processes.

View or Post Comments

Sql Identifier Case Sensitivity

Wednesday, September 20, 2017

I last blogged about Postgres's handling of case sensitivity in 2012. A recent email thread requested that case sensitivity be configurable in Postgres.

Certainly, allowing case sensitivity to be configurable would help those porting from other databases. However, Postgres tools and libraries operate inside other applications, so if there was a case sensitivity setting, they couldn't change it. They would need to support the case sensitivity set by the client, meaning that every identifier would have to be double-quoted. It was concluded that this seems to be a bigger negative than the portability enhancement is a positive.
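
For reference, the current behavior — unquoted identifiers are folded to lower case, and double quotes preserve case exactly:

CREATE TABLE MyTable (MyCol INTEGER);   -- creates table "mytable"
 
SELECT mycol FROM mytable;              -- works
 
SELECT "MyCol" FROM "MyTable";          -- fails: relation "MyTable" does not exist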

View or Post Comments

Performance Vectorize Surprise

Monday, September 18, 2017

The Postgres hackers list is a steady stream of great ideas and discussion, but occasionally something comes along that really makes you sit back and think, "Wow, where did that come from?" Such was a February 2017 email from Konstantin Knizhnik presenting a proof-of-concept vectorization optimization for the executor.

In May Andres Freund presented a plan for speeding up the executor using Just In Time Compilation (jit) and llvm. This work is pending for Postgres 11. (In fact, it was almost committed to Postgres 10 on April Fool's Day.)

Konstantin's work adds vectorization to the executor, which can be revolutionary. Once Andres's work is in, we can research how to make the executor even faster using vectorization. This would open up Postgres to an entirely new class of big-data applications.

View or Post Comments

Indexing Index Everything

Friday, September 15, 2017

When setting up a database schema, indexing is always a consideration. While Postgres supports traditional btree indexes for most data types, they can be quite heavy, often requiring a significant percentage of the table size for index storage.

There are two options for schemas that need to index a large percentage of columns. The first is brin indexes, which allow for small, low-overhead indexes that can be created on many columns. One downside of brin indexes is that they must be defined on each column that needs one.

A more generic solution is to place all data in a jsonb column and create a gin index. The gin index indexes every json key and value, and only stores the keys and values once in the index. This is a great "index everything" option for administrators that aren't sure of their indexing needs.
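
A minimal sketch of the "index everything" approach, using a hypothetical events table — a single gin index covers every key and value in the jsonb column:

CREATE TABLE events (id BIGSERIAL PRIMARY KEY, attrs JSONB);
 
CREATE INDEX events_attrs_idx ON events USING GIN (attrs);
 
-- containment queries can use the gin index
SELECT * FROM events WHERE attrs @> '{"status": "error"}';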

View or Post Comments

Optimizer Future Optimizer Enhancements

Wednesday, September 13, 2017

The Postgres optimizer has improved dramatically over the years. However, there are still a few areas where it can be improved.

First, it would be nice if users could be informed when an increase of default_statistics_target is wise. Second, while multivariate statistics will be added in Postgres 10, there are still other multivariate optimizations to add.
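
For illustration, here are both knobs on a hypothetical orders table — a raised per-column statistics target, and a Postgres 10 multivariate statistics object on correlated columns:

ALTER TABLE orders ALTER COLUMN customer_id SET STATISTICS 1000;
ANALYZE orders;
 
CREATE STATISTICS orders_city_zip (dependencies) ON city, zip FROM orders;
ANALYZE orders;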

A more sophisticated improvement, which bypasses the problem of inaccurate statistics, would be to have the executor send feedback on selectivity found during query execution to the optimizer to improve the future query plans.

View or Post Comments

Performance Partitioning in Postgres 10 and Beyond

Monday, September 11, 2017

Postgres 10 is adding partitioning syntax to Postgres. This ends the previous Frankenstein-style partitioning setup of having to configure child tables, check constraints, and triggers or rules.
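
For reference, a minimal sketch of the new syntax, using a hypothetical measurement table:

CREATE TABLE measurement (
        logdate  DATE NOT NULL,
        peaktemp INTEGER
) PARTITION BY RANGE (logdate);
 
CREATE TABLE measurement_2017 PARTITION OF measurement
        FOR VALUES FROM ('2017-01-01') TO ('2018-01-01');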

You would think that the partitioning feature is now complete. Oh, no! It seems there are now more partitioning enhancement requests than we had before adding the syntax. I knew there would be new optimizations once there was a canonical representation of partitions in the backend, but I didn't expect this feature-request growth.

Ottawa's PGCon conference had a talk about partitioning, but the major partition feature list was created during its unconference. The good news is that Amit Langote recorded all the requests. I expect it will be a few years until we can say we are done improving partitioning in Postgres.

View or Post Comments

Multi-host A Hierarchical Read-Scaling Architecture

Monday, September 4, 2017

Postgres has a well-deserved reputation for flexibility. You can often combine two features to get something very powerful. For example, it was recognized several years ago that combining Pgpool and streaming replication creates a powerful, easy-to-administer read-scaling solution.

When asked about creating a read-scaling solution for multi-terabyte systems, I came up with an enhanced option. The problem with using streaming replication for read scaling is that you need a full copy of the entire database cluster on each standby. For multi-terabyte clusters, that requires a lot of storage. Also, odds are you don't access the cluster's data uniformly — some data is current and frequently accessed, some is archival and rarely accessed.

An interesting solution to this problem is to combine four Postgres features:

  • Pgpool
  • Streaming replication
  • Partitioning
  • Foreign data wrappers

First, use Pgpool and add a partitioned table on the primary, where some of the partitions are local and some are foreign data wrapper references. The local partitions contain current data and are copied to the top-level streaming standbys. Archival data is placed on a secondary set of streaming-replication servers and accessed via foreign data wrappers from the top-level primary and its standbys. For read scaling, you might need many top-level streaming replicas with current data, but only a few streaming replicas to handle archival-data requests.
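
Here is a rough sketch of the primary's table layout. It assumes Postgres 10, a postgres_fdw server named archive_server pointing at the archival replication set, and hypothetical table names; note that in Postgres 10 foreign tables can be partitions, but rows cannot be routed into them on insert:

CREATE TABLE events (
        id       BIGINT,
        happened DATE NOT NULL
) PARTITION BY RANGE (happened);
 
-- current data, stored locally and streamed to the top-level standbys
CREATE TABLE events_current PARTITION OF events
        FOR VALUES FROM ('2017-01-01') TO ('2018-01-01');
 
-- archival data, living on the secondary replication set
CREATE FOREIGN TABLE events_archive PARTITION OF events
        FOR VALUES FROM ('2000-01-01') TO ('2017-01-01')
        SERVER archive_server;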

This solution uses four Postgres tools to provide flexible read scaling for large data sets. It is also transparent to applications so administrators can rearrange the data without modifying applications.

View or Post Comments

Multi-host Materialized Views and Foreign Data Wrappers

Friday, September 1, 2017

You might know that Postgres supports materialized views and foreign data wrappers (fdw). Briefly, materialized views allow for queries to be materialized and refreshed on demand. Foreign data wrappers allow data to be pulled from foreign data sources, like Nosql stores and other Postgres servers.

What you might not have considered is that materialized views and foreign data wrappers can be used together. Materialized views speed data access by summarizing data so it isn't necessary to query the base tables that are referenced by materialized views. Foreign data wrappers, because they are remote, can be slow to access. By combining the two features, you can get fast access to remote data.

Let's see this in action! First, let's set up the foreign table:

CREATE DATABASE fdw_test;
\connect fdw_test;
CREATE TABLE world (greeting TEXT);
\connect test
 
CREATE EXTENSION postgres_fdw;
CREATE SERVER postgres_fdw_test FOREIGN DATA WRAPPER postgres_fdw
OPTIONS (host 'localhost', dbname 'fdw_test');
 
CREATE USER MAPPING FOR PUBLIC SERVER postgres_fdw_test
OPTIONS (password '');
 
CREATE FOREIGN TABLE other_world (greeting TEXT)
SERVER postgres_fdw_test
OPTIONS (table_name 'world');
 
\det
          List of foreign tables
 Schema |    Table    |      Server
--------+-------------+-------------------
 public | other_world | postgres_fdw_test

populate it with some data:

INSERT INTO other_world
SELECT *
FROM generate_series(1, 100000);

and create a materialized view on the foreign table:

CREATE MATERIALIZED VIEW mat_view (first_letter, count) AS 
        SELECT left(greeting, 1), COUNT(*)
        FROM other_world
        GROUP BY left(greeting, 1);

Now we can compare select times for foreign tables and materialized views:

\timing
 
SELECT left(greeting, 1) AS first_letter, COUNT(*)
FROM other_world
GROUP BY left(greeting, 1);
 first_letter | count
--------------+-------
 1            | 11112
 2            | 11111
 3            | 11111
 4            | 11111
 5            | 11111
 6            | 11111
 7            | 11111
 8            | 11111
 9            | 11111
 
Time: 354.571 ms
 
SELECT * FROM mat_view;
 first_letter | count
--------------+-------
 1            | 11112
 2            | 11111
 3            | 11111
 4            | 11111
 5            | 11111
 6            | 11111
 7            | 11111
 8            | 11111
 9            | 11111
 
Time: 0.783 ms

The materialized view is much faster, but also a contrived example. It is interesting that refreshing the materialized view takes a similar time to selecting from the foreign table:

REFRESH MATERIALIZED VIEW mat_view;
Time: 364.889 ms

The above output is from Postgres 9.6. Thanks to this improvement in Postgres 10:

Push aggregates to foreign data wrapper servers, where possible (Jeevan Chalke, Ashutosh Bapat)

This reduces the amount of data that must be passed from the foreign data wrapper server, and offloads aggregate computation from the requesting server. The postgres_fdw fdw is able to perform this optimization. There are also improvements in pushing down joins involving extensions.

Postgres 10 performs foreign table aggregate selects faster than 9.6, but still slower than using materialized views:

SELECT left(greeting, 1) AS first_letter, COUNT(*)
FROM other_world
GROUP BY left(greeting, 1);
 first_letter | count
--------------+-------
 1            | 11112
 2            | 11111
 3            | 11111
 4            | 11111
 5            | 11111
 6            | 11111
 7            | 11111
 8            | 11111
 9            | 11111
 
Time: 55.052 ms

You don't even need to use aggregates in materialized views — you can just copy the contents of a remote table into a materialized view for faster access, and refresh the materialized view occasionally (though logical replication in Postgres 10 does this better):

CREATE MATERIALIZED VIEW mat_view2  AS 
        SELECT *
        FROM other_world;

Now we can run a performance test on the foreign table and its local copy:

\o /dev/null
 
SELECT *
FROM other_world;
Time: 317.428 ms
 
SELECT * FROM mat_view2;
Time: 34.861 ms

In summary, materialized views and foreign data wrappers are two features that work well together. Materialized views allow remote data to be cached locally, either entire tables or aggregate summarizations. The cache can be refreshed using refresh materialized view. Also, Postgres 10 speeds up aggregate queries on foreign tables.

View or Post Comments

Source code A New Pgindent

Wednesday, August 30, 2017

I blogged about pgindent in 2011. After much discussion, an improved BSD indent binary has been created by Piotr Stefaniak and Tom Lane to fix various problems that were impossible to fix using the old Perl post-processing method. The Postgres 10 tree has been reformatted with the new tool.

View or Post Comments

Source code PG-C

Monday, August 28, 2017

Like many open-source projects, Postgres is written in the C programming language. However, with the code base being 31 years old, a lot of specialized C infrastructure has been added over the years to simplify server programming, e.g., memory management, caching, system catalog access, tuple access, error handling.

While this infrastructure helps experienced developers be productive, it often confuses people studying the Postgres source code because you have to understand the infrastructure to understand the code. Robert Haas makes this point clear in an email post:

A system like PostgreSQL is almost a language of its own; we don't really code for PostgreSQL in C, but in "PG-C". Learning the PG-specific idioms is arguably more work than learning C itself …

View or Post Comments

Administration Tuning Autovacuum

Friday, August 25, 2017

Autovacuum simplifies Postgres administration by automating the cleanup of updated (expired) and deleted rows and rows created by aborted transactions. It also updates optimizer statistics. It uses information gathered by the statistics collector to launch vacuum and analyze operations at the appropriate time. Autovacuum can respond much more promptly than an administrator.

However, there are cases where autovacuum isn't optimal, and for such cases there are two options. First, autovacuum can be controlled at the global level by settings in postgresql.conf. A lesser-known option is controlling autovacuum parameters at the table level via create/alter table storage options. The latter is obviously more fine-grained and handles cases where the global options are too coarse. And, of course, these settings can all be adjusted to be more aggressive during idle periods.
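
A minimal sketch of the per-table approach, making autovacuum more aggressive on a hypothetical busy table:

ALTER TABLE pgbench_accounts SET (
        autovacuum_vacuum_scale_factor = 0.01,
        autovacuum_vacuum_cost_delay = 5
);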

View or Post Comments

Memory Limiting Memory to Avoid the Oom

Wednesday, August 23, 2017

Linux uses an out-of-memory killer (oom), which is designed to kill processes:

If your memory is exhaustively used up by the processes to the extent which can possibly threaten the stability of the system, then the oom killer comes into picture. It is the task of the oom killer to kill the processes until enough memory is freed for the smooth functioning of the rest of the process.

While killing processes is never good, it is better than having the system halt due to memory exhaustion. Sometimes the oom kills Postgres, and that isn't a good thing either. This email thread explains how to use ulimit to cause Postgres sessions that consume a lot of memory to fail due to excessive memory requests. This avoids having them continue and be killed by the oom killer, which causes the entire database server to restart. The Postgres documentation also explains the behavior of the oom killer.

View or Post Comments

Administration Compressed Tablespaces

Monday, August 21, 2017

It is common knowledge that tablespaces allow you to place frequently-accessed data on fast storage, and archive-level data on slower storage. This is even mentioned in our documentation.

One additional aspect I had never considered is that you can store archive-level data on slower and compressed file systems. Most file systems support some kind of compression. While it doesn't make sense for frequently-accessed data, it seems perfect for archive-level data.
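
A short sketch, assuming a hypothetical mount point backed by a compressed file system and an archive-level table:

CREATE TABLESPACE archive LOCATION '/mnt/compressed/pgarchive';
 
ALTER TABLE orders_2010 SET TABLESPACE archive;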

View or Post Comments

Administration Using Docker with Postgres

Wednesday, August 16, 2017

Docker uses Linux containers (lxc) to allow application deployment in a pre-configured environment. Multiple such isolated environments can run on top of the same operating system. Docker is ideal for rapid, reproducible deployments.

How that relates to database deployments is an open question. The full power of Docker is that everything is in the container and the container can be easily destroyed and recreated. Because databases require a persistent state, a hybrid approach is necessary. Typical Docker database deployments use Docker for database installation and setup, and persistent storage for persistent state. This email thread explains the benefits of Docker for databases:

  • Isolation between environments
  • Deployment of tested and verified Docker images
  • Allows developers to use the same Docker images as production

and explains how to add persistent state:

  • Run one container on one node
  • Use bind mounts
  • Use --net=host

Hopefully this helps users evaluate Docker for future Postgres deployments.

View or Post Comments

Multi-host Session State Failover

Monday, August 14, 2017

On the server side, high availability means having the ability to fail over quickly to standby hardware, hopefully with no data loss. Failover behavior on the client side is more nuanced. For example, when failover happens, what happens to connected clients? If no connection pooler is being used, clients connected to the failed machine will need to reconnect to the new server to continue their database work. Failover procedures should guarantee that all connections to the failed server are terminated and that no new connections happen. (Reconnecting to the failed server could produce incorrect results and lost changes.) If a client is connected to a standby that is promoted to primary, existing client connections and new connections are read/write.

Clients connect to the new primary via operating-system-specific methods, usually either virtual IP addresses (vip, good blog entry) or dns entries with a short time to live (ttl). This is normally accomplished using dedicated high-availability or clustering software. Postgres 10 will also allow multiple host names to be tried by clients.

For clients using a connection pooler, things are even more complicated. Logically, you would think that, since clients didn't connect directly to the failed server, they should be able to continue their queries in the same session uninterrupted. Generally, this is not the case.

First, the client might have been running a query during the failure — therefore, a query error code needs to be returned to the client so the client can retry the query (assuming query retry logic is built into the application). Second, the session might have been in the middle of a transaction block, meaning it has to resubmit all the queries that were part of the transaction (again, assuming the client has such retry logic). Third, several server configuration parameters might have been modified at the session level. The client might be depending on these settings for proper operation, so they would need to be re-modified in the new session. Fourth, there is additional session state that might need to be recreated. You can get an idea of possible session state modifications by reviewing the discard manual page.

If this sounds like a boat-load of challenges to allow clients using a connection pooler to continue uninterrupted, it is. None of the Postgres pooling tools allows uninterrupted failover. Pgpool automatically disconnects all clients and allows them to reconnect. (In pgpool 3.6+, clients connected to still-running standby servers are not affected.) PgBouncer does not detect failover.

To implement even a limited case where failover appears uninterrupted to clients, connection poolers would have to record session state changes, client queries, and even client results because many queries depend on the results of previous queries, e.g., select ... for update. Even if all this tracking was performed, changed query results would make uninterrupted behavior impossible. Given all this complexity, it isn't surprising that the standard approach to failover is for clients to be programmed to handle such cases by reconnecting, resetting their session state, and re-running any queries that future queries depend on.

View or Post Comments

Backup and replication Logical Replication

Friday, August 11, 2017

With the addition of logical replication in Postgres 10, we get a whole new set of replication capabilities. First, instead of having to replicate an entire cluster, you can replicate specific tables using logical replication. With this granularity, you can broadcast a single table to multiple Postgres databases, or aggregate tables from multiple servers on a single server. This provides new data management opportunities.
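
A minimal sketch of the new publish/subscribe syntax, with a hypothetical customers table and connection string:

-- on the publishing server
CREATE PUBLICATION mypub FOR TABLE customers;
 
-- on the subscribing server
CREATE SUBSCRIPTION mysub
        CONNECTION 'host=primary dbname=mydb'
        PUBLICATION mypub;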

Another big advantage of logical replication is migrating between major Postgres versions. If both major Postgres versions support logical replication, you can set up logical replication between them and then switch over to the new major-version Postgres server with only seconds of downtime. It also allows you to downgrade back to the old major version and replay any changes made on the upgraded server, assuming logical replication is still working properly.

Quicker upgrade switching and the ability to downgrade in case of problems have been frequent feature requests that pg_upgrade has been unable to fulfill. For users who need this, setting up logical replication for major version upgrades will certainly be worth it.

View or Post Comments

Backup and replication Wal Archive: Local or Remote?

Wednesday, August 9, 2017

For wal archiving, e.g., archive_command, you are going to need to store your wal files somewhere, and, depending on how often you take base backups, it might be very large.
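
For reference, here is a minimal local-archiving setup patterned on the documentation's cp example, with a hypothetical archive path (changing archive_mode requires a server restart):

ALTER SYSTEM SET archive_mode = on;
ALTER SYSTEM SET archive_command = 'test ! -f /mnt/wal_archive/%f && cp %p /mnt/wal_archive/%f';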

Most sites that require high availability have both a local standby in the same data center as the primary, and a remote standby in case of data center failure. This brings up the question of where to store the wal archive files. If you store them in the local data center, you get fast recovery because the files are quickly accessible, but if the entire data center goes down, you can't access them from the remote standby, which is now probably the new primary. If you store your wal archive files remotely, it is difficult to quickly transfer the many files needed to perform point-in-time recovery.

My guess is that most sites assume that they are only going to be using wal archive files for local point-in-time recovery because if you are running on your backup data center, doing point-in-time recovery is probably not something you are likely to do soon — you probably only want the most current data, which is already on the standby. However, this is something to consider, because if the wal archive is lost you will need to take a new base backup soon so you can do point-in-time recovery in the future.

View or Post Comments

Backup and replication The Hot Standby Tradeoff

Monday, August 7, 2017

When the ability to run queries on standby servers (called hot_standby) was added to Postgres, the community was well aware that there were going to be tradeoffs between replaying wal cleanup records on the standby and canceling standby queries that relied on records that were about to be cleaned up. The community added max_standby_streaming_delay and max_standby_archive_delay to control this tradeoff. To completely eliminate this tradeoff by delaying cleanup on the primary, hot_standby_feedback was added.

So, in summary, there is no cost-free way to have the primary and standby stay in sync. The cost will be either:

  1. Standby queries canceled due to the replay of wal cleanup records
  2. Stale standby data caused by the delay of wal replay due to cleanup records
  3. Delay of cleanup on the primary

The default is a mix of numbers 1 and 2, i.e., to wait for 30 seconds before canceling queries that conflict with about-to-be-applied wal records.

Unfortunately, there is always a cost in running hot standby queries, but at least with Postgres, you can control it. Fortunately, you can also monitor replay delay using pg_stat_replication.
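
A brief sketch of the knobs involved, set on the standby (both can be changed with a configuration reload):

ALTER SYSTEM SET max_standby_streaming_delay = '30s';  -- costs 1 and 2: cancel conflicting queries after 30 seconds
ALTER SYSTEM SET hot_standby_feedback = on;            -- cost 3: push cleanup delay back to the primary
SELECT pg_reload_conf();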

View or Post Comments

Backup and replication Sent, Write, Flush, Replay?

Monday, July 31, 2017

In streaming replication, wal is streamed from the primary to the standby. For a variety of reasons, including max_standby_streaming_delay, there can be a significant delay for a commit on the primary to appear in read-only queries running on the standby. The pg_stat_replication view shows the possible delays:

\d pg_stat_replication
          View "pg_catalog.pg_stat_replication"
      Column      |           Type           | Modifiers
------------------+--------------------------+-----------
 pid              | integer                  |
 usesysid         | oid                      |
 usename          | name                     |
 application_name | text                     |
 client_addr      | inet                     |
 client_hostname  | text                     |
 client_port      | integer                  |
 backend_start    | timestamp with time zone |
 backend_xmin     | xid                      |
 state            | text                     |
 sent_location    | pg_lsn                   |
 write_location   | pg_lsn                   |
 flush_location   | pg_lsn                   |
 replay_location  | pg_lsn                   |
 sync_priority    | integer                  |
 sync_state       | text                     |

These columns clearly show the typical order of wal from primary to replay:

  • wal is sent from the primary to the standby ("sent")
  • wal is written to the standby's file system ("write")
  • wal is flushed to the standby's durable storage ("flushed")
  • wal is replayed and is then visible to standby read-only queries ("replay")
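
In releases before Postgres 10 you can compute the byte lag of each standby yourself; here is a sketch against the 9.x column names shown above:

SELECT application_name,
       pg_xlog_location_diff(sent_location, replay_location) AS replay_lag_bytes
FROM pg_stat_replication;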

Postgres 10 will add interval delay information to the view:

\d pg_stat_replication
                    View "pg_catalog.pg_stat_replication"
      Column      |           Type           | Collation | Nullable | Default
------------------+--------------------------+-----------+----------+---------
 ...
 write_lag        | interval                 |           |          |
 flush_lag        | interval                 |           |          |
 replay_lag       | interval                 |           |          |
 ...

View or Post Comments

Press Recent Interview

Tuesday, July 25, 2017

I was interviewed on July 14 during the PGConf Local: Philly 2017, and a 25-minute recording was just published.

View or Post Comments

Sql The Decline of Hstore

Tuesday, July 11, 2017

Hstore was added to Postgres in 8.2 (2006). I didn't appreciate the purpose of hstore when it was first introduced. I knew it was a per-field key/value store, but not much else. People would come up to me at conferences and say "Hstore is the best," but I could only reply with "Yeah." Only later did someone explain to me the usefulness of being able to store unstructured data in a database.

Of course, with the popularity of NoSQL, everyone now understands the value of storing unstructured data. They probably also know that being able to store structured (relational) and unstructured data in the same database, like Postgres, is a big win.

Unfortunately, hstore, the Postgres extension that started it, hasn't kept up. When jsonb was added in Postgres 9.4 (2014) it superseded the capabilities of hstore. Jsonb is a key/value store, but allows for hierarchical storage too, and uses the popular json syntax. Some of the performance improvements made to the jsonb data type could not be backpatched to hstore, due to compatibility requirements.

At this point, unless you have a backward-compatible need for hstore, you should be using jsonb. Existing users of hstore would probably benefit from moving to jsonb, when convenient.
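
Migration is usually a single column rewrite; here is a sketch assuming a hypothetical items table whose attrs column is hstore, using the hstore extension's hstore_to_jsonb() function:

ALTER TABLE items
        ALTER COLUMN attrs TYPE JSONB
        USING hstore_to_jsonb(attrs);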

View or Post Comments

Conference Postgres Party

Monday, July 10, 2017

In conjunction with PGConf Local: Philly, I will be hosting a Postgres Party at my home near Philadelphia this Friday, July 14. You don't have to attend the conference to attend the party.

View or Post Comments

Sql Playing With IPv6

Wednesday, July 5, 2017

Now that everyone is using IPv6, it might be time to start playing with it. Postgres has had full IPv6 support for years, so Postgres is a good place to start, particularly with IPv6-aware data types.

Since IPv6 addresses are 128-bits instead of IPv4's 32-bits, they can be quite long, e.g., 2001:0db8:85a3:0000:0000:8a2e:0000:7334. As you can see, it is made up of eight quad-hex segments, separated by colons. To shorten the text representation, leading zeros in any quad-hex segment can be removed, though an all-zero quad still requires a zero. In addition, the longest string of all-zero quads can be abbreviated with double colons. This can be illustrated in Postgres:

SELECT '2001:0db8:85a3:0000:0000:8a2e:0000:7334'::inet;
             inet
------------------------------
 2001:db8:85a3::8a2e:0:7334

In the output, 0db8 became db8, and :0000:0000: became ::, and the final 0000 became 0. An address with many leading zeros, e.g., localhost (0000:0000:0000:0000:0000:0000:0000:0001), gets dramatically shortened using these rules:

SELECT '0000:0000:0000:0000:0000:0000:0000:0001'::inet;
 inet
------
 ::1

This highlights perhaps the most confusing aspect of IPv6 addresses — you can't just visually compare two IPv6 addresses to check for equality, like you can for IPv4. You must use the IPv6 rules for comparisons.

Use of colons is strictly for IPv6 addresses, i.e., 1::127 represents a 128-bit IPv6 value, not a 32-bit IPv4 one, as illustrated by the IP family() function:

SELECT family('1::127');
 family
--------
      6
 
SELECT family('1.0.0.127');
 family
--------
      4
 
SELECT '1::127'::inet = '1.0.0.127'::inet;
 ?column?
----------
 f

Postgres is a fun, interactive way to play with the IPv6 address rules, because we are all going to have to learn them eventually.

View or Post Comments

Time Odd Month Arithmetic

Wednesday, June 21, 2017

You might be aware of the interval data type, which allows mathematical operations on date, time, and timestamp values. This can lead to odd behavior, but this email posting showed me a new oddity when dealing with months containing a different number of days. First, let's summarize how many days are in the first five months of 2017:

2017-01     31
2017-02     28
2017-03     31
2017-04     30
2017-05     31

Let's add four months to the last day of January 2017 in two different ways:

SELECT '2017-01-31'::date + '4 month'::interval;
      ?column?
---------------------
 2017-05-31 00:00:00
 
SELECT '2017-01-31'::date + ('1 month'::interval + '3 month'::interval);
      ?column?
---------------------
 2017-05-31 00:00:00

Looks good — the results are the same, and they make sense. But look what happens when the calculation takes a stop-over in February (operations are performed left to right):

SELECT '2014-01-31'::date + '1 month'::interval + '3 month'::interval;
      ?column?
---------------------
 2014-05-28 00:00:00

and a stop-over in April:

SELECT '2014-01-31'::date + '3 month'::interval + '1 month'::interval;
      ?column?
---------------------
 2014-05-30 00:00:00

It seems that once the calculation lands on a lower-numbered day of a month, because the month doesn't have as many days as the previous stop, it never advances past that day number, even if later months have more days. Let's be more overt about this using parentheses:

SELECT ('2014-01-31'::date + '1 month'::interval) + '3 month'::interval;
      ?column?
---------------------
 2014-05-28 00:00:00
 
SELECT '2014-01-31'::date + ('1 month'::interval + '3 month'::interval);
      ?column?
---------------------
 2014-05-31 00:00:00
 
SELECT ('2014-01-31'::date + '3 month'::interval) + '1 month'::interval;
      ?column?
---------------------
 2014-05-30 00:00:00

All three queries produce different results. This shows that addition with such values is neither associative (queries one and two) nor commutative (queries one and three). Real number addition is both associative and commutative, so it is confusing that addition with intervals is not for some values, though it is for others. (Other interval operations have similar issues, e.g., subtraction.)

While the results are surprising, it is unclear how Postgres could do any better without adding some very complex logic — so, developers beware.

View or Post Comments

Time Use With Time Zone

Monday, June 19, 2017

If you often use the timestamp data type, you might not be making full use of it. In these queries:

CREATE TABLE tztest (x TIMESTAMP);
 
INSERT INTO tztest VALUES (CURRENT_TIMESTAMP);
 
SELECT * FROM tztest;
             x
----------------------------
 2016-10-25 18:49:20.220891
 
SHOW timezone;
  TimeZone
------------
 US/Eastern
 
SET timezone = 'Asia/Tokyo';
 
SELECT * FROM tztest;
             x
----------------------------
 2016-10-25 18:49:20.220891

Notice that the time has not changed even though the session time zone has changed. A timestamp data type specification defaults to timestamp without time zone, to match the sql specification. Contrast the above output with using timestamp with time zone:

CREATE TABLE tztest2 (x TIMESTAMP WITH TIME ZONE);
 
INSERT INTO tztest2 VALUES (CURRENT_TIMESTAMP);
 
SELECT * FROM tztest2;
               x
-------------------------------
 2016-10-25 18:57:04.096335-04
 
SHOW timezone;
  TimeZone
------------
 US/Eastern
 
SET timezone = 'Asia/Tokyo';
 
SELECT * FROM tztest2;
               x
-------------------------------
 2016-10-26 07:57:04.096335+09

Notice that when the session time zone changed, the date and hours were adjusted and the suffix time zone offset changed. This allows users from multiple time zones to insert into the same table and for viewers to see all rows in their local time zone.

Timestamp with time zone is implemented by adjusting all values to utc, meaning there is no storage of the original time zone name or its offset. If you need to record the original time zone offset, you can store the output of SELECT EXTRACT(timezone FROM CURRENT_TIMESTAMP) in a separate column.
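
A short sketch of storing that offset, using a hypothetical event table — extract(timezone from ...) returns the offset from utc in seconds:

CREATE TABLE event (
        happened   TIMESTAMP WITH TIME ZONE,
        utc_offset INTEGER
);
 
INSERT INTO event
VALUES (CURRENT_TIMESTAMP, EXTRACT(timezone FROM CURRENT_TIMESTAMP));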

View or Post Comments

Sql PL/Java Adoption

Friday, June 16, 2017

PL/Java has been around since 2005, but it has regularly struggled to gain users. Unfortunately, many of its problems are specific to the Java language and hamper its adoption.

First, there are a limited number of people who know both Java and the Postgres backend code. Fortunately Chapman Flack has recently resumed PL/Java development.

Second, there is the deployment complexity of binding PL/Java to a Java Runtime Environment (jre) in an easily-deployed way. This 2014 email thread discusses the problems of jre packaging and installation.

Third, PL/Java functions must be compiled into jar files and installed on the server, similar to server-side C functions. While this isn't a huge hurdle, it does hamper PL/Java when competing against interpreted languages like PL/pgSQL and PL/Perl.

Fourth, every stored procedure language has to compete with PL/pgSQL, which has good alignment with the scripting needs of the sql language.

However, the problems go deeper. Because PL/Java isn't interpreted, it must be compiled, but compiling doesn't produce a binary that can be run on the cpu. Instead, a Java Virtual Machine (jvm) is required and it must be installed in a place that can easily receive function arguments and return results. An early attempt was pl/j, which uses a single jvm process — each backend connects to the jvm process via ipc and runs in its own thread. Unfortunately ipc overhead, context switching, and security concerns doomed this attempt.

PL/Java takes a different approach by placing a jvm in each database backend process and using the Java Native Interface (jni) to pass in function arguments and return results. While this reduces communication overhead, it causes backend processes to use much more memory because each backend has its own jvm, and each jvm must be started the first time a PL/Java function is called. (Connection pooling can reduce the startup overhead.) The bottom line is that stored procedures have short run-times, and the heavy nature of Java isn't a good match.

Ultimately, with all these headwinds against PL/Java, it is hard for it to get traction. A lot of people like the idea of PL/Java because they can reuse their Java skills and for portability with other databases that support Java stored procedures. The problem is that once they hit these headwinds, they start looking around. They realize that other languages are a better match for sql, that there isn't that much code reuse from the client-side, and they switch to another server-side language. I am sure there are some things that only Java can do well, but those use cases are not common in server-side functions. (Contrast that with pl/r, which definitely does things no other server-side language can do.)

In summary, the slow adoption of PL/Java isn't an accident, but the result of multiple challenges. If PL/Java is to get more popular, these headwinds must be addressed.

View or Post Comments

Client When To Use Server-Side Logic

Wednesday, June 14, 2017

The use of server-side logic, particularly stored procedures, has been a highly contentious topic among database professionals for decades. The question has always been what amount of logic should be encoded in the database vs. in client applications or application servers.

Beyond using the database as a simple data store, there are three levels in which logic can be added to Postgres:

  1. Database constraints, e.g., unique, check, foreign keys
  2. Triggers
  3. Manually-called stored procedures, i.e., functions not called as triggers

Let's look at each of these in turn:

  1. Database constraints are often seen as optional by database application developers. They don't really do anything except prevent invalid data from being entered into the database. The overhead of performing these checks is often seen as negative, particularly foreign key checks. (Default clauses that call stored procedures are similar to triggers; the serial data type uses this.)
  2. Triggers are stored procedures assigned to tables that are executed automatically during insert, update, delete, or truncate commands. They can be triggered before the command runs, usually for complex constraint checks or to modify incoming data, e.g., capitalization. When triggered after the command, they perform post-command operations, such as adding entries to a log table.
  3. Manually-called stored procedures are functions called usually in where clauses or in the target list of select queries. Stored procedures, including ones used as triggers, can be written in many languages.
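As a minimal sketch of the first two levels (the table and function names are purely illustrative), a check constraint rejects bad data at insert time, while a trigger automatically logs each insert:

CREATE TABLE orders (id SERIAL PRIMARY KEY, amount NUMERIC CHECK (amount > 0));
 
CREATE TABLE orders_log (order_id INTEGER,
                         logged_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP);
 
-- trigger function that performs a post-command operation (level 2)
CREATE FUNCTION log_order() RETURNS trigger AS $$
BEGIN
        INSERT INTO orders_log (order_id) VALUES (NEW.id);
        RETURN NEW;
END;
$$ LANGUAGE plpgsql;
 
CREATE TRIGGER orders_audit
        AFTER INSERT ON orders
        FOR EACH ROW EXECUTE PROCEDURE log_order();

An insert with a non-positive amount is rejected by the check constraint, and every successful insert is recorded in orders_log without any application involvement.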

Now, on to the question of when to use these features. Some Postgres users use none of these features for reasons of performance and portability. Others use all of them, and require applications to call stored procedures to perform tasks, rather than issuing sql statements. This allows database administrators to control every aspect of database access. For most users, the best solution is something in the middle, and figuring out when to use what can be tricky.

For database constraints, it is often possible to perform constraint checks in the application, rather than in the database. While this is possible for most check constraints, it is hard to do for unique constraints because multi-user applications rarely share state except inside the database. If database constraints are not used, applications and manually-issued sql queries must perform the checks, and any mistakes must be detected later and corrected. If multiple applications access the database, they must perform the checks in the same way — this is particularly difficult if they are written in different languages. Application upgrades also require constraint synchronization in multi-application environments.

Triggers can be avoided by having applications perform the checks and queries that triggers would have performed. Performing data checks application-side has the same downsides as avoiding database constraints. The additional application queries needed when triggers are avoided can also lead to slower performance due to network round-trip delays.

Avoiding manually-called stored procedures requires all logic to be in the application. This can lead to serious slowdowns because if a function cannot be used in a where clause, the entire data set must be transferred to the client application where filtering can be done.

Ultimately, the decision of when to use server-side logic revolves around efficiency — efficiency of hardware utilization, efficiency of development, and efficiency of data management. Your use of server-side logic will be dependent on which area of efficiency is most important to you. This email thread outlines many of the tradeoffs seen by Postgres users.

Update: This email explains some of the operational reasons to avoid server-side logic. 2018-09-05

View or Post Comments

NoSQL Which NoSQL Database For New Project?

Monday, June 12, 2017

Oh, how I love the title of this 2014 Slashdot request, "Which NoSQL Database For New Project?" The user already knows the type of clients (iPhones and Android phones) and the middleware (Php/Symfony or Ruby/Rails) and then comes the zinger, "I would like to start with a NoSQL solution for scaling …" In addition to the question of whether Ruby on Rails is a scalable solution, the replies are illuminating, and even funny, e.g., "I'm working on a new independent project. It will soon become the new Facebook, and I'll be billionaire next quarter. The only problem is that I don't know which luxury yacht to buy with all this money."

OK, now on to the serious replies, which are many and seem to come from seasoned Sql and NoSql veterans. They all fall under the heading of premature optimization and scaling:

* NoSQL solutions can be ridiculously fast and scale beautifully over billions of rows. Under a billion rows, though, and they're just different from normal databases in various arguably-broken ways. By the time you need a NoSQL database, you'll be successful enough to have a well-organized team to manage the transition to a different backend. For a new project, use a rdbms, and enjoy the ample documentation and resources available.

* However, like a lot of other posters, I'm very sceptical that NoSQL is the place to start. SQL databases can do a LOT for you, are very robust and can scale very considerably. As your requirements grow you might find yourself wanting things like indexes, transactions, referential integrity, the ability to manually inspect and edit data using SQL and the ability to store and access more complex structures. You're likely to give yourself a lot of pain if you go straight for NoSQL, and even if you DO need to scale later combining existing SQL and new NoSQL data stores can be a useful way to go.

* So many developers start with the phrase "I need NoSQL so I can scale" and almost all of them are wrong. The chances are your project will never ever ever scale to the kind of size where the NoSQL design decision will win. Its far more likely that NoSQL design choice will cause far more problems (performance etc), than the theoretical scaling issues. … NoSQL does not guarantee scaling, in many cases it scales worse than an SQL based solution. Workout what your scaling problems will be for your proposed application and workout when they will become a problem and will you ever reach that scale. Being on a bandwagon can be fun, but you would be in a better place if you really think through any potential scaling issues. NoSQL might be the right choice but in many places I've seen it in use it was the wrong choice, and it was chosen base on one developers faith that NoSQL scales better rather than think through the scaling issues.

* Given 3 trillion users your options are pretty much limited to horizontal scaling, no SQL etc. but most people never get that far with their applications and in that case, storing the data in a noSQL database and then getting actionable information out of it (which is the hardest part IMO) is a lot of effort spent for something much cheaper and easier done with an rdbms.

* "Why not" is because the cost/benefit analysis is not in NoSQL's favor. NoSQL's downsides are a steeper learning curve (to do it right), fewer support tools, and a more specialized skill set. Its primary benefits don't apply to you. You don't need ridiculously fast writes, you don't need schema flexibility, and you don't need to run complex queries on previously-unknown keys.

* It's a mistake to think that "NoSQL" is a silver bullet for scalability. You can scale just fine using MySQL (FlockDB) or Postresgl if you know what you're doing. On the other, if you don't know what you're doing, NoSQL may create problems where you didn't have them.

* Databases don't scale for people who don't understand SQL, don't understand data normalization, indexing and want to use them as flat files. Unfortunately, a way too common anti-pattern :(

* So default answer to "Which NoSQL database should I use?" is always "Don't use NoSQL."

There were many positive comments about Postgres, both from a relational database perspective and touting its scalability and NoSQL-like features. One poster even wrote about their accounting system using Mongo that they are porting to Postgres.

View or Post Comments

Thoughts You Don't Need Every Feature of Your Previous Database

Friday, June 9, 2017

In a February blog post I talked about the mismatch between what people expect from Postgres in terms of hints, and what exists. In this blog post I would like to cover the more general case of when people should expect feature parity with their previous database, and when such expectations are unreasonable.

First, imagine if every database had the features of every other database — that would be great for compatibility but terrible for usability. You would have so many ways of doing something, with slightly different behaviors, that development and administration would be much more difficult. Of course, Perl has its "There's more than one way to do it," but at least that allows tradeoffs between clarity and conciseness, and allows different programming styles. Having all features in every database would have few of those benefits. Also consider that some features are mutually exclusive, so this would be impossible. Therefore, we have to make tradeoffs in the features each database system supports.

Let's think of Postgres compatibility at three levels: sql, tuning, and monitoring. At the sql level, you should expect parity between Postgres and your previous database. The syntax might be different, but all the capabilities should be similar, e.g., if your application used save-points with the previous database, Postgres should support that, and in almost all cases, it does.

For tuning, parity is more murky. The internals of database systems differ greatly, so the tuning requirements will differ — and let's be frank — the less tuning you have to do to attain good performance, the better, so having a tuning item missing in Postgres might be a good thing. Postgres might have good performance without requiring that tuning knob. Postgres might require tuning that wasn't required in your previous database system — that is a negative. So, if Postgres doesn't have a tuning knob you had in your previous database, that might be good (Postgres auto-tunes it), or it might be bad (you can't adjust Postgres to achieve good performance for your workload). (I previously covered the tradeoffs of adding performance settings.)

Monitoring has a similar murky outlook. Sometimes monitoring is required to meet organizational needs, e.g., what are the currently running queries. For most organizational monitoring needs, Postgres has the required features, whether it is log_line_prefix, pg_stat_statements, or external tools like pgBadger or PGAudit. Often, Postgres tooling isn't as polished or as easy to use as tools in more-established database systems, but the tools exist, and are often more flexible.

Just like tuning, sometimes monitoring that was required in your previous database system isn't required in Postgres. For example, Postgres's streaming replication is rock-solid, so there is no need to monitor for streaming replication corruption. Similarly, many organizations built elaborate monitoring of their previous database systems to check for optimizer plan and statistics changes. These are often to avoid optimizer bugs in their previous systems that caused problems. However, you can't assume this kind of monitoring is necessary for Postgres. Postgres isn't bug-free, but it might require different monitoring than what was required in the previous system.

In summary, Postgres is different from your previous database system. Hopefully it supports all the necessary features your applications require, but it will not exactly match your previous tuning and monitoring requirements. It is necessary to approach Postgres with an open mind to make the best use of your new database system.

Update: The value of software simplicity 2022-06-12

View or Post Comments

Internals Raw Device Optimization

Wednesday, June 7, 2017

Historically, most relational database systems supported raw devices, i.e., the ability to write data directly to the storage subsystem to avoid file system overhead. Raw device support was added to databases 20–30 years ago when cpus were much slower, and file systems were much less efficient.

Modern recommendations discourage the use of raw devices unless every last bit of performance is required. Raw devices improve performance by perhaps a few percentage points, but are difficult to administer and resize. The other problem is that raw devices have to be configured when the database cluster is initialized, meaning you can't easily remove raw devices or add them later. This makes it difficult to know if raw devices would even help your workload.

Postgres has never supported raw devices, and probably never will. Modern file systems are so good, and the flexibility of file system storage so powerful, that raw devices just don't make sense for modern Postgres.

Update: This email gives more details about the history of raw device database usage. 2021-03-29

View or Post Comments

Internals Double Buffering Blues

Monday, June 5, 2017

Postgres does all heap and index page operations in a shared memory area called shared buffers. Data is read into shared buffers by requesting 8k blocks from the kernel, which will either satisfy reads from the kernel cache or retrieve them from storage devices. Writes from shared buffers are sent to the kernel, which eventually writes them to permanent storage. The write-ahead log (wal) allows writes to be performed asynchronously; this is illustrated in this presentation.

New users are often surprised that Postgres uses the kernel cache for reads and writes, but there are advantages. While the shared buffer size is fixed at server start, the kernel cache is resized based on the amount of unused memory in the system. This blog post explains how to determine the size of the Linux kernel cache.

Many database systems use direct I/O to read and write data, which bypasses the kernel cache. This has two advantages:

  • Avoids the double-copy of data from storage to the kernel cache and then from the kernel cache to shared buffers
  • Avoids double buffering (storing) of data in the kernel cache and shared buffers

However direct I/O has some disadvantages:

  • Prevents the kernel from reordering reads and writes to optimize performance
  • Does not allow free memory to be used as kernel cache

In summary, direct I/O would improve performance in cases where shared buffers is sized properly, but it would dramatically decrease performance for workloads where shared buffers is missized. This email report confirms this analysis. Postgres does use direct I/O for wal writes (if supported by the operating system) because wal must be flushed immediately to storage and is read only during crash recovery, so kernel buffering is useless. (It is also read by the walsender and archive_command.)

Someday Postgres might support an option to use direct I/O for data files, but the downsides make it unlikely it would be enabled by default.

View or Post Comments

Hardware Tablespaces on Transient Storage

Friday, June 2, 2017

Postgres has a flexible tablespace feature that allows you to place data in any filesystem accessible from the database server, whether it is directly attached storage (das), on a nas or san, a directory in /tmp, or on a ramdisk in memory.

You might be tempted to put data in any of these places, with the idea that if one of the storage areas goes away, either through hardware failure or server reboot, Postgres will continue to function. Unfortunately, that is not the case. If a tablespace disappears or is erased, Postgres will have problems because system tables will now point to nonexistent files. Regular maintenance operations, like autovacuum, will throw errors. The wal might also still contain references to the missing files, preventing crash recovery from completing.

Perhaps someday Postgres will have the concept of transient tablespaces, but right now it doesn't, so only create tablespaces on durable storage. Even temp_tablespaces (used to store temporary objects) cannot be placed on transient storage.

View or Post Comments

Hardware Ssds and effective_io_concurrency

Friday, May 19, 2017

Effective_io_concurrency controls how many concurrent requests can be made by bitmap heap scans. The default is 1, which means no concurrency. For magnetic disks, this can be increased, perhaps to 8, but not much higher because of the physical limitations of magnetic disks. However, for ssds, this can be increased dramatically, perhaps in the hundreds.

This is documented in Postgres 9.6, as well as the ability to set effective_io_concurrency at the tablespace level. I have updated my Database Hardware Selection Guidelines presentation to mention this too.
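For example, assuming a tablespace named ssd_space that lives on ssd storage, the setting could be raised like this:

-- ssd_space is an illustrative tablespace name
ALTER TABLESPACE ssd_space SET (effective_io_concurrency = 256);
 
-- or adjust it just for the current session
SET effective_io_concurrency = 256;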

View or Post Comments

Presentation Reorganized Presentations

Wednesday, May 17, 2017

I have reorganized my twenty active presentations to be easier to find. I was getting confused, so I assume others were as well. The new categories seem clearer and more closely match the categories I use for Postgres blog entries. I have reordered items within categories. I have also indicated more presentations that can be given in pairs.

Also, this web page now defaults to using https/tls.

View or Post Comments

Hardware Raid Controllers and Ssds

Monday, May 15, 2017

Disk array controllers gained popularity in a world of magnetic disks by providing:

  • Hardware raid support
  • Large read/write cache
  • Write reordering
  • Instant durable storage of writes using a bbu

With the popularity of ssds, particularly those with bbus (which are the only kind that should be used with Postgres), the usefulness of raid controllers is now unclear:

  • Modern cpus are very fast and can easily do raid in software
  • Modern servers have tons of ram
  • Write reordering is unnecessary for ssds, which have good random write performance
  • Bbus on ssds make raid bbus redundant

And raid controllers introduce problems of their own.

There is no question that there was a time that raid controllers were necessary for good database performance, but that period might be over for direct-attached storage (das).

View or Post Comments

Presentation Postgres Window Magic

Friday, May 12, 2017

Having given the first presentation of my new window function talk, Postgres Window Magic, I have now put the slides online.

View or Post Comments

Presentation Postgres 10 Features Presentation

Wednesday, May 10, 2017

I did my first presentation covering Postgres 10 features, so my slides are now online.

View or Post Comments

Release notes Draft of Postgres 10 Release Notes

Wednesday, April 26, 2017

I have completed the draft version of the Postgres 10 release notes. It consists of 180 items, and I think you will be impressed with the substance of the improvements. The release notes will be continually updated until the final release, which is expected in September or October of this year. (I am speaking tomorrow about Postgres 10 in Boston.)

Fyi, future major versions of Postgres will be identified by a single number, e.g., Postgres 11 will be released in 2018. This is mentioned in the release notes as well.

View or Post Comments

Hardware Das vs San vs Nas

Monday, April 24, 2017

Modern systems offer several storage options and databases are very sensitive to the I/O characteristics of these options. The simplest is direct-attached storage (das), e.g., sata, sas. It is the simplest because it uses a dedicated connection between the server and storage.

Another storage option is a storage area network (san), e.g., fibre channel, iscsi. Like das, it offers a sequential set of blocks upon which a local file system can be built.

A third option is network-attached storage (nas), e.g., nfs. While nas is networked storage like san, it offers a remote file system to the server instead of remote storage blocks. (The fact that the san/nas acronym letters are just reversed only adds to the confusion. I remember it as (block) "storage" being the first letter of san.)

So, which is best for you? Well, das is the simplest topology because all storage is connected to a single server, and it is also often the fastest. Sans add flexibility by replacing the direct connection with a network, e.g., fcp, tcp/ip. This allows a larger number of devices to be attached than das, and if a server fails, another server can access the device's block storage and restart.

Nas is quite different because it exports a file system to the server. While nas-exported file systems can be mounted by multiple servers simultaneously, that doesn't help Postgres because only a single server can safely access the Postgres data directory. Nas usually includes sophisticated management tooling, backup capabilities, and caching layers. Because multiple servers usually access a nas, contention can be a problem, though ease of administration can be a big benefit for large organizations.

View or Post Comments

Performance External Poolers

Friday, April 21, 2017

Connection poolers have two advantages:

  • They reduce the number of active connections, allowing the server to operate more efficiently, without the overhead of managing many idle connections.
  • They speed startup by allowing previously started sessions to be used as soon as a connection request arrives.

Postgres has two popular external connection poolers, PgBouncer and pgpool. External poolers have several advantages compared to poolers built into the database:

  • They can be placed closer to applications, perhaps on application servers.
  • They allow the pooling load to be handled by a dedicated server.
  • If the database server goes down, poolers can be easily re-pointed to new database servers.

But external connection poolers have disadvantages over internal ones:

  • Authentication not integrated with the database server
  • Additional network delay going through a pooler

While external poolers have some significant advantages, internal poolers have a few of their own, so someday Postgres might decide to implement a simple connection pooler internally.

View or Post Comments

Multi-host Switchover vs. Failover

Wednesday, April 19, 2017

With streaming replication, Postgres allows sophisticated setups of primary and standby servers. There are two ways to promote a standby to be the new primary. A switchover is when the change happens in a planned way:

  1. All clients are disconnected from the primary to prevent writes
  2. A sufficient delay allows the final write-ahead log (wal) records to be transferred to all standbys (also performed by step 3)
  3. The primary is cleanly shut down
  4. The standby is promoted to be the primary

A failover happens when the steps above can't be performed, usually because the primary has failed in some catastrophic way. The major difficulty with failover is the possibility that some of the final database changes contained in the wal are not transferred to standbys, unless synchronous_standby_names was used. When a standby is promoted to primary after a failover, the final missing wal records can cause problems:

  1. Some transactions on the old primary that were acknowledged to clients might be lost
  2. If the old master needs to be reconnected as a standby without reimaging, it might be necessary to use pg_rewind
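When deciding whether a standby is safe to promote, it helps to check how much wal it has received and replayed; using the function names as they exist through Postgres 9.6, a quick check looks like this:

-- run on the standby; true means it is still in recovery (still a standby)
SELECT pg_is_in_recovery();
 
-- wal received from the primary vs. wal already replayed
SELECT pg_last_xlog_receive_location(), pg_last_xlog_replay_location();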

Make sure you practice both methods of promoting a standby so, when you have to do the promotion in production, you are ready.

View or Post Comments

Backup and replication Incremental File System Backup

Monday, April 17, 2017

When using continuous archiving, you must restore a file system backup before replaying the wal. If the file system backup was taken long ago, wal replay might take a long time. One way to avoid this is to take file system backups frequently.

Another option is to perform an incremental file system backup that can be laid over the original file system backup, then replay wal over that. This reduces restore time because you only need to replay wal from the start of the incremental backup, not the start of the full backup. This also reduces the amount of wal that must be retained.

However, Postgres doesn't natively support incremental backup. The best you can do is to use a tool like pgBackRest or Barman that supports incremental backup at the file level. The only problem is that database files can be as large as one gigabyte, so the granularity of the incremental backup isn't great. Ideally solutions will be developed that do page-level (8k) incremental backups, which would be much smaller. The trick is finding an efficient way to record which 8k pages have been changed since the last file system backup.

View or Post Comments

Internals What Are Checkpoints?

Friday, April 14, 2017

You probably have heard the term "checkpoint" before, or seen it mentioned in the postgresql.conf file. A checkpoint is a usually-invisible cleanup feature present in most database systems, but it is useful to know what it does.

This diagram illustrates checkpoints. At the top are three Postgres database sessions. Each session reads and writes to the shared buffer cache. Every modification to shared buffers also causes a change record to be written to the write-ahead log (wal, blog entry). Over time the wal would grow unbounded in size if it were not trimmed occasionally — that is what checkpoints do.

A checkpoint writes previously-dirtied shared buffers to durable storage over a period of several minutes, at which point the wal representing those writes is no longer needed for crash recovery. (Hopefully continuous archiving and streaming replication have also processed those wal files.) Therefore, the old wal can then be removed or recycled.

This diagram illustrates the process. In the diagram, 1 marks three dirty buffers in the shared buffer cache at the start of the checkpoint. During the checkpoint, additional buffers are dirtied, 2, and the wal pointer is advanced. At the end of the checkpoint all dirty 1 buffers have been written to durable storage and the old wal files are moved to the end, to be reused.

This all happens automatically, though it can be tuned and monitored. Fortunately, Postgres doesn't have any pathological behavior related to checkpoints, so most administrators never have to think about it.
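If you do want to keep an eye on it, pg_stat_bgwriter is a simple place to start because it counts timed vs. requested checkpoints and the buffers they wrote:

SELECT checkpoints_timed, checkpoints_req, buffers_checkpoint
FROM pg_stat_bgwriter;

A high ratio of requested to timed checkpoints usually means max_wal_size is too small for the write load.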

View or Post Comments

Backup and replication Wal Archive Management

Wednesday, April 12, 2017

The write-ahead log (wal) file format changes with every major release. Also, initdb, which is required to install the new major version, starts wal file numbers at 000000010000000000000001.

For these reasons, if you are archiving wal, it is wise to use the major version number in the name of the wal archive directory, e.g., /archive/pgsql/9.6. This avoids the problem of wal from an old Postgres major version conflicting with wal files from a new major version.
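For example, assuming the per-version directory already exists, the archiving settings might look something like this (the copy command mirrors the documentation's example; adjust the path and command for your environment):

ALTER SYSTEM SET archive_mode = on;
ALTER SYSTEM SET archive_command =
        'test ! -f /archive/pgsql/9.6/%f && cp %p /archive/pgsql/9.6/%f';

(Changing archive_mode requires a server restart, while archive_command only needs a reload.)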

View or Post Comments

Internals Inside the Wal

Monday, April 10, 2017

The write-ahead log (wal) does many things: it drives crash recovery, continuous archiving and point-in-time recovery, streaming replication, and logical decoding.

It is great that a single facility is used for so much. Fortunately, it is possible to look inside of the wal. This is particularly useful for setting recovery_target_xid.

The binary portion of the wal can be viewed using pg_xlogdump. This is the output of transaction 558, which was an insert into a table with one index:

rmgr: Heap        len (rec/tot):  3/  3760, tx:  558, lsn: 0/01510D60,
      prev 0/01510D28, desc: INSERT off 102, —
 
rmgr: Btree       len (rec/tot):  2/  2133, tx:  558, lsn: 0/01511C10,
      prev 0/01510D60, desc: INSERT_LEAF off 1, —
 
rmgr: Transaction len (rec/tot): 20/    46, tx:  558, lsn: 0/01512480, 
      prev 0/01511C10, desc: COMMIT 2016-12-26 —

You can see the insert into the heap, the index insert, and the transaction commit record. When viewing pg_xlogdump output, remember that while transactions are assigned in start order, higher-numbered shorter transactions can commit before lower-numbered longer transactions (see this slide). For example, transaction 32 commits before transaction 30 because transaction 30 runs much longer than 32. (This can make choosing recovery_target_xid particularly tricky.)

To generate more detailed wal activity, set wal_level to logical and use test_decoding to view logical wal information:

SELECT *
FROM pg_logical_slot_get_changes(
         'test_slot', NULL, NULL, 'include-xids', '0');
 location  | xid |                   data
-----------+-----+-------------------------------------------
 0/1510D60 | 558 | BEGIN
 0/1510D60 | 558 | table public.test: INSERT: x[integer]:100
 0/15124B0 | 558 | COMMIT
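(The slot queried above must be created first, presumably with something like the following, which assumes the test_decoding output plugin from contrib is installed:)

SELECT * FROM pg_create_logical_replication_slot('test_slot', 'test_decoding');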

This information will be used for logical replication in Postgres 10.

View or Post Comments

Sql Postgres Keywords

Friday, April 7, 2017

You might be aware that the sql standard reserves certain identifiers that cannot be used for naming user objects. Postgres follows that standard, with slight modifications. For example, you cannot create a table called all:

CREATE TABLE all (x INTEGER);
ERROR:  syntax error at or near "all"
LINE 1: CREATE TABLE all (x INTEGER);
                     ^

It is actually very easy to find what identifiers are reserved because they are listed in the documentation. More interestingly, they are also accessible via the sql function pg_get_keywords():

SELECT *
FROM pg_get_keywords()
ORDER BY 1;
       word        | catcode |                   catdesc
-------------------+---------+----------------------------------------------
 abort             | U       | unreserved
 absolute          | U       | unreserved
 access            | U       | unreserved
 action            | U       | unreserved
…

You can also do statistical analysis on it:

SELECT catcode, COUNT(*)
FROM pg_get_keywords()
GROUP BY catcode
ORDER BY 2 DESC;
 
 catcode | count
---------+-------
 U       |   276
 R       |    77
 C       |    48
 T       |    23

You can even join pg_get_keywords() to system tables to identify the use of system keywords by user objects, e.g., a check for column names:

SELECT nspname, relname, attname
FROM pg_namespace JOIN pg_class ON (pg_namespace.oid = relnamespace)
     JOIN pg_attribute ON (attrelid = pg_class.oid)
     JOIN pg_get_keywords() ON (word = attname)
WHERE nspname NOT IN ('pg_catalog', 'information_schema');

Read the documentation to understand the types of system keywords, i.e., using them is not always a problem.

View or Post Comments

Sql Casting to Data Types

Wednesday, April 5, 2017

There are four ways to cast a value to another data type:

SELECT int4 '123' + 1;
 ?column?
----------
      124
 
SELECT CAST('123' AS int4);
 int4
------
  123
 
SELECT '123'::int4 + 1;
 ?column?
----------
      124
 
SELECT int4('123') + 1;
 ?column?
----------
      124

The first two are sql standard, the third one is a Postgres-ism that is often convenient, and the final one relies on the existence of named functions to do the conversion. Some of the more complex data type specifications have shortcuts, e.g., timestamp with time zone can use ::timestamptz.

View or Post Comments

Data modeling Allowing Only One Null

Monday, April 3, 2017

While the sql standard allows multiple nulls in a unique column, and that is how Postgres behaves, some database systems (e.g., ms sql) allow only a single null in such cases. Users migrating from other database systems sometimes want to emulate this behavior in Postgres. Fortunately, this can be done. First, let me show the default Postgres behavior:

CREATE TABLE nulltest (x INTEGER UNIQUE);
 
INSERT INTO nulltest VALUES (NULL);
INSERT INTO nulltest VALUES (NULL);

A column can be constrained to a single-null value by creating a partial expression index that indexes only null values (the partial part), and uses is null to store true in the unique index (the expression part):

DELETE FROM nulltest;
 
CREATE UNIQUE INDEX i_nulltest ON nulltest ((x IS NULL)) WHERE x IS NULL;
 
INSERT INTO nulltest VALUES (NULL);
INSERT INTO nulltest VALUES (NULL);
ERROR:  duplicate key value violates unique constraint "i_nulltest"
DETAIL:  Key ((x IS NULL))=(t) already exists.

This method can also be used to constrain a column to allow only a single null for each non-null composite indexed value:

CREATE TABLE nulltest2 (x INTEGER, y INTEGER);
 
CREATE UNIQUE INDEX i_nulltest2 ON nulltest2 (x, (y IS NULL)) WHERE y IS NULL;
 
INSERT INTO nulltest2 VALUES (1, NULL);
INSERT INTO nulltest2 VALUES (2, NULL);
INSERT INTO nulltest2 VALUES (2, NULL);
ERROR:  duplicate key value violates unique constraint "i_nulltest2"
DETAIL:  Key (x, (y IS NULL))=(2, t) already exists.
 
INSERT INTO nulltest2 VALUES (2, 3);
INSERT INTO nulltest2 VALUES (2, 3);

The i_nulltest2 index allows only one y null value for each x value. This can actually be useful in certain data models. This illustrates how expression and partial index features can be combined for some interesting effects.

View or Post Comments

Data modeling Use All Your Constraints

Monday, March 27, 2017

Many database administrators use databases as simple data stores. However, relational systems can do much more, with advanced querying, analysis, and transaction control capabilities. Another area that is often overlooked is constraints. Constraints allow new and updated data to be checked against defined rules, and changes that would violate those rules are rejected.

Constraints are odd in that they don't do anything if the data is consistent — they are more like an insurance policy against invalid data being entered into the database. If constraints are missing, there often are no initial problems, but over time erroneous or unexpected data gets in, causing problems with applications and reporting.

Do yourself a favor the next time you create a table — take the insurance and create useful check, not null, default, unique, primary key, and foreign key constraints. If your tables are already created, you can use alter table to add constraints to existing tables.
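For example, something like this, where the table, column, and constraint names are only illustrative:

ALTER TABLE orders
        ADD CONSTRAINT orders_amount_positive CHECK (amount > 0);
 
ALTER TABLE orders
        ADD CONSTRAINT orders_customer_fk
        FOREIGN KEY (customer_id) REFERENCES customers (id);

Existing rows are checked when the constraint is added, unless you add it as not valid and validate it later.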

View or Post Comments

Sql Referencing Select Column Aliases in Where Clauses

Friday, March 24, 2017

Ever wonder why using select column aliases in where clauses fails?

SELECT random() AS confusion
FROM generate_series(1,10)
WHERE confusion > 0.5;
 
ERROR:  column "confusion" does not exist
LINE 3: WHERE confusion > 0.5;

It is because the order in which the clauses of a select are evaluated is specified by the sql standard, and it isn't top to bottom. Tom Lane's excellent email post goes into the details. The thread also mentions the unfortunate fact that users of union often really want the behavior of union all.
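The usual workaround is to compute the alias in a subquery and filter in the outer query:

SELECT *
FROM (SELECT random() AS confusion
      FROM generate_series(1,10)) AS ss
WHERE confusion > 0.5;

Because random() is volatile, Postgres does not flatten the subquery, so each row's value is computed only once and compared consistently.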

View or Post Comments

Sql Character Sets, Encodings, and Collations

Wednesday, March 22, 2017

Understanding how character sets, encodings, and collations work together can be confusing. I would like to explain them in this blog post:

Character Sets

Postgres databases can be initialized to support a variety of character sets. A character set is a full set of the characters or glyphs that can be represented. Popular character sets are ascii (127 characters), Latin1 (ISO8859-1, 255 characters), and Unicode (128k+ characters).

Encodings

Encoding is the way a character set is stored. For single-byte character sets like ascii and Latin1, there is only a single way to encode the character set, i.e., as a single byte. For more complex multi-byte character sets there can be several ways to encode the character set. For example, Unicode can be encoded as UTF-8 (8-bit granularity), UTF-16 (16-bit), or UTF-32 (32-bit). (Postgres only supports UTF-8 for Unicode, server-side.) Asian languages also often support multiple encodings for a single character set, e.g., Big5, GB2312, Shift-JIS, EUC-JP.

Collations

Collations specify the order of characters in a character set. Again, for single-byte character sets, there is usually only one possible collation, which uses encoded byte values to provide the ordering, though this can lead to odd orderings, e.g., Z (uppercase) ordered before a (lowercase), or z ordered before á. For a complex character set like Unicode, the user can often select the desired collation.

It is possible to illustrate different collations of the same character set using Unicode:

SHOW lc_collate;
 lc_collate
-------------
 en_US.UTF-8
 
SELECT * FROM (VALUES ('a'), ('_b'), ('_C')) AS f(x) ORDER BY x;
 x
----
 a
 _b
 _C
 
-- the same, with the collation explicitly stated
SELECT * FROM (VALUES ('a'), ('_b'), ('_C')) AS f(x) ORDER BY x COLLATE "en_US.utf8";
 x
----
 a
 _b
 _C
 
-- with a collation based on byte values, matching ASCII ordering
SELECT * FROM (VALUES ('a'), ('_b'), ('_C')) AS f(x) ORDER BY x COLLATE "C";
 x
----
 _C
 _b
 a

You can see in the first two queries that collation en_US ignores case in letter comparisons, and ignores underscores. The last query, using C collation, reverses the order because the byte values for uppercase letters (C) are less than lowercase letters (b), and the underscore character is lower than lowercase letters (a).

This can be seen at the operating system level too:

$ echo $LANG
en_US.UTF-8
 
$ (echo 'a'; echo '_b'; echo '_C') | sort
a
_b
_C
 
$ (echo 'a'; echo '_b'; echo '_C') | LANG="en_US.UTF-8" sort
a
_b
_C
 
$ (echo 'a'; echo '_b'; echo '_C') | LANG="C" sort
_C
_b
a

The handling of whitespace can also be controlled by sql-level collations:

SELECT * FROM (VALUES (' '), ('a'), (' b')) AS f(x) ORDER BY x COLLATE "en_US.utf8";
 x
----
 
 a
  b
 
SELECT * FROM (VALUES (' '), ('a'), (' b')) AS f(x) ORDER BY x COLLATE "C";
 x
----
 
  b
 a

In the first query, while a single space is ordered before a, it does not cause space-b to order before a; it does in the C collation. Things get more complicated with letters with accents, tildes, and pictographic languages.

Postgres relies on the operating system locale to support character sets, encodings, and collations. On Linux, you can see the list of supported locales via locale -a. While initdb sets the default locale for the cluster based on environment variables, the collation can be overridden in a variety of ways, including per column in CREATE TABLE.
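For example, a column-level override at table creation time (collate_test is an illustrative name):

CREATE TABLE collate_test (x TEXT COLLATE "C");
 
INSERT INTO collate_test VALUES ('a'), ('_b'), ('_C');
 
-- sorted using the column's "C" collation, not the database default
SELECT * FROM collate_test ORDER BY x;
 x
----
 _C
 _b
 a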

View or Post Comments

Sql Outer Joins and Where Clauses

Monday, March 20, 2017

Postgres supports both traditional join syntax, which uses the where clause to specify joined columns, and ansi join syntax, that uses the word join in the from clause. While both syntaxes can be used for inner joins, only the ansi join syntax supports outer joins in Postgres.

Because column restrictions like col = 4 are not related to joins, you would think that restrictions have the same effect whether they appear in the where clause or as part of a join clause, e.g., a outer join b on a.x = b.x and col = 4. However, this is not always true. Restrictions in the join clause are processed during joins, while where clause restrictions are processed after joins.

This is only significant in outer joins (and cross joins) because columns from unjoined rows are manufactured by outer joins. Here is an example:

CREATE TABLE test1 (x INTEGER);
CREATE TABLE test2 (x INTEGER);
 
INSERT INTO test1 VALUES (1), (2), (3);
INSERT INTO test2 VALUES (1), (2);
 
SELECT * FROM test1 LEFT JOIN test2 ON test1.x = test2.x;
 x |   x
---+--------
 1 |      1
 2 |      2
 3 |  (null)
 
SELECT * FROM test1 LEFT JOIN test2 ON test1.x = test2.x AND test2.x <= 2;
 x |   x
---+--------
 1 |      1
 2 |      2
 3 |  (null)
 
SELECT * FROM test1 LEFT JOIN test2 ON test1.x = test2.x WHERE test2.x <= 2;
 x | x
---+---
 1 | 1
 2 | 2

As you can see, the first select performs the outer join with no column restrictions. The second select returns the same result because the column restriction matches all rows in test2 (before the outer join happens). The final select operates on the result of the join and since null <= 2 returns null, which behaves like false, the third output row is suppressed.

What is even more interesting is seeing how a column restriction can cause a query that would normally be fully joined to return unjoined column values:

INSERT INTO test2 VALUES (3);
 
SELECT * FROM test1 LEFT JOIN test2 ON test1.x = test2.x;
 x | x
---+---
 1 | 1
 2 | 2
 3 | 3
 
SELECT * FROM test1 LEFT JOIN test2 ON test1.x = test2.x AND test2.x <= 2;
 x |   x
---+--------
 1 |      1
 2 |      2
 3 | (null)
 
SELECT * FROM test1 LEFT JOIN test2 ON test1.x = test2.x WHERE test2.x <= 2;
 x | x
---+---
 1 | 1
 2 | 2

In the second select, the test2.x value of 3 is excluded from the join, resulting in an outer-join value of null for the column.

In summary, be careful of the placement of restriction clauses on inner-side tables in outer-join queries and clearly determine whether you want restrictions to happen at join time or post-join.

View or Post Comments

Community 'You need to set a tone where everyone in the community feels valued'

Friday, March 17, 2017

As open source gains popularity in the enterprise, there is increased study of open source communities and how they function. Those studying such things often ask about Postgres because of its unusually healthy community and recent successes.

While I was in India in February, I was interviewed by an open source magazine; an excerpt from that interview is now online. It covers open source leadership, encouraging new developers, and healthy software ecosystems.

View or Post Comments

Internals Column Storage Internals

Wednesday, March 15, 2017

Postgres uses native cpu alignment to store values in a row. This allows blocks to be copied unchanged from disk into shared buffers and accessed as local variables, as outlined in this presentation.

This can be illustrated by pg_column_size(). First, an empty row size:

SELECT pg_column_size(row());
 pg_column_size
----------------
             24

Predictably, 2-byte, 4-byte, and 8-byte integers increase the length by their respective sizes:

SELECT pg_column_size(row(1::int2));
 pg_column_size
----------------
             26
 
SELECT pg_column_size(row(1::int4));
 pg_column_size
----------------
             28
 
SELECT pg_column_size(row(1::int8));
 pg_column_size
----------------
             32
 

You can also see alignment effects — the first query has no alignment padding, while the second query has a two-byte padding because the row length is unchanged:

SELECT pg_column_size(row(1::int2, 1::int2, 1::int4));
 pg_column_size
----------------
             32
 
SELECT pg_column_size(row(1::int2, 1::int4));
 pg_column_size
----------------
             32

You can see the alignment requirements for each data type by querying the system catalogs:

SELECT typalign, typname FROM pg_type ORDER BY 1, 2;
 typalign | typname
----------+---------
 c        | bool
 c        | char
 c        | cstring
 c        | name
 c        | unknown
 c        | uuid
 d        | _box
 d        | _circle
 d        | _float8
…

The Postgres source code file src/include/catalog/pg_type.h documents the meaning of the letters:

'c' = CHAR alignment, ie no alignment needed.
's' = SHORT alignment (2 bytes on most machines).
'i' = INT alignment (4 bytes on most machines).
'd' = DOUBLE alignment (8 bytes on many machines, but by no means all).

It is possible to define table columns in an order that minimizes padding. Someday Postgres might do this automatically.
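You can see the potential savings by comparing the same two columns in different orders; on a typical 64-bit machine, putting the int2 first forces six bytes of padding before the int8, while the reverse order needs none:

SELECT pg_column_size(row(1::int2, 1::int8));
 pg_column_size
----------------
             40
 
SELECT pg_column_size(row(1::int8, 1::int2));
 pg_column_size
----------------
             34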

The 24-byte row header includes an 8-bit mask to record null values. You can see below that the 8-bit mask is sufficient for eight nulls, but the ninth null requires the null bit mask to be expanded, with additional alignment:

SELECT pg_column_size(row(NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL));
 pg_column_size
----------------
             24
 
SELECT pg_column_size(row(NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL));
 pg_column_size
----------------
             32

These things all happen automatically, but sometimes it is interesting to see them working.

View or Post Comments

Monitoring Getting the Most Out of application_name

Monday, March 13, 2017

Application_name might be one of those Postgres settings that you have seen in various places but never understood its purpose. It is true that setting application_name doesn't change the behavior of Postgres (with one small exception), but it is very useful for monitoring. Its value appears in pg_stat_activity and can be prepended to every log line with log_line_prefix.

Applications can set application_name when they connect as part of the connection string. Psql, pg_dump, and other tools set application_name. (They actually set fallback_application_name so application_name can be overridden by users.)

A more interesting use of application_name is to change it while the application is running. pgAdmin updates the application_name when a user changes screens so administrators can know exactly what screen users are on. While you can't change application_name while a query is running, you can change it between queries, which allows you to monitor the progress of long-running batch jobs, e.g.:

SELECT application_name
FROM pg_stat_activity
WHERE pid = pg_backend_pid();
 application_name
------------------
 psql
 
SET application_name = 'demo';
 
SELECT application_name
FROM pg_stat_activity
WHERE pid = pg_backend_pid();
 application_name
------------------
 demo

Some people might complain about the overhead of a separate query to update application_name. There are two solutions. One solution is to send two queries in a single string, e.g., set application_name = 'demo2'; select 100. Some libraries like libpq support this, but psql sends these as two separate queries — you have to enable log_statement to see how queries are sent to the server.

Another approach is to bundle the setting of application_name inside the query:

SELECT 100 FROM set_config('application_name', 'demo3', false);
 
SELECT application_name
FROM pg_stat_activity
WHERE pid = pg_backend_pid();
 application_name
------------------
 demo3

Unfortunately, there is no way to guarantee that set_config() will be run first, because the optimizer is allowed to evaluate from-clause expressions in any order; e.g., in the first query below set_config() happens to run first, while in the second query it runs second:

SELECT 100 FROM set_config('application_name', 'demo4', false), pg_sleep(5);
 
SELECT 100 FROM pg_sleep(5), set_config('application_name', 'demo5', false);

As you can see, application_name gives you a way to monitor what Postgres is doing, not only at connection time but during the life of the connection.

View or Post Comments

Monitoring Dynamic Logging Control

Friday, March 10, 2017

Postgres has many logging options, e.g., log_statement, log_min_duration_statement, and log_line_prefix. These values can be set at various levels, which I have already covered. One level that has particular flexibility is postgresql.conf. Settings in this file affect all running sessions, unless the values are overridden at lower levels.

When debugging, it is often hard to know which log settings to enable. One approach is to enable all settings beforehand, but that can generate lots of log traffic and be hard to interpret. Ideally you can enable just the settings you need at the time you need them — this is where postgresql.conf comes in. With postgresql.conf or alter system, you can change any logging settings you want and then signal a reload by either sending a sighup signal to the server, running "pg_ctl reload", or executing "SELECT pg_reload_conf()". One nice thing is that after a reload, all running sessions receive the new settings when their transactions complete. This allows logging settings to be enabled and disabled as needed.
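For example, to temporarily capture slow queries, you might do something like this, and undo it the same way when finished:

ALTER SYSTEM SET log_min_duration_statement = '5s';
SELECT pg_reload_conf();
 
-- when debugging is done
ALTER SYSTEM RESET log_min_duration_statement;
SELECT pg_reload_conf();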

View or Post Comments

Monitoring Postgres Alerts

Wednesday, March 8, 2017

It is cool to be able to stare at colorful graphs to see what Postgres is doing, but sometimes you just want to set something up, walk away, and be informed when there is a problem. That is what check_postgres and tail_n_mail are designed to do.

check_postgres is a script designed to be run from cron or a monitoring tool like Nagios. It reports on areas in the database that need administrator attention. tail_n_mail has a similar purpose, but monitors the Postgres log files for important messages.

Few people stare at their mobile phones waiting for something to happen — they configure their phones to notify them when something important happens. Why not do that for your Postgres clusters? That's what check_postgres and tail_n_mail are meant to do.

View or Post Comments

Monitoring Wait Event Reporting

Tuesday, February 28, 2017

Postgres often lacks the sophisticated reporting of more established enterprise relational database systems. Sometimes that is because Postgres doesn't require as much monitoring and tuning, but there are legitimate monitoring needs which Postgres doesn't support.

Thanks to the wait_event_type and wait_event columns added to the pg_stat_activity view in Postgres 9.6, it is possible to find which parts of the system are causing query delays. In Postgres 9.6 wait_event_type can have four values:

  • LWLockNamed
  • LWLockTranche
  • Lock
  • BufferPin

and this table lists all the possible wait_event column values, grouped by wait_event_type. Below is typical pg_stat_activity output during pgbench:

SELECT pid, wait_event_type, wait_event
FROM pg_stat_activity;
  pid  | wait_event_type |  wait_event   
-------+-----------------+---------------
 14154 | LWLockNamed     | WALWriteLock  
  8923 |                 |               
 14155 | LWLockNamed     | WALWriteLock  
 14158 | LWLockNamed     | WALWriteLock  
 14160 | LWLockNamed     | WALWriteLock  
 14161 | Lock            | transactionid 
 14163 | LWLockNamed     | WALWriteLock  
 14165 | LWLockNamed     | WALWriteLock  
 14166 | LWLockNamed     | WALWriteLock  
 14167 | Lock            | tuple         
 …

Postgres 10 will have even more monitoring capabilities, with at least four new wait event types. One nice thing about this feature is that it has almost no overhead, so it is enabled by default. Unfortunately, there is no efficient way to measure event duration except by periodic sampling of event states.
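A simple approach is to sample and aggregate the view on an interval, e.g.:

SELECT wait_event_type, wait_event, COUNT(*)
FROM pg_stat_activity
WHERE wait_event IS NOT NULL
GROUP BY 1, 2
ORDER BY 3 DESC;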

This excellent email post from Robert Haas statistically analyzes wait events to show how different queries have different wait behaviors, e.g., heavy wal writes, unlogged tables, all data in shared buffers. The results match what you would expect from these kinds of workloads, but it is much more detailed than you would get from guessing.

These new pg_stat_activity columns give us a new window into performance. Before this, we had to guess what the bottleneck was, and we were very good at it. This new instrumentation gives us very accurate statistics on where we are losing performance. I expect this feature to yield major performance improvements in the years to come.

View or Post Comments

Indexing Creating Descending Indexes

Monday, February 27, 2017

You might have noticed that create index allows you to specify asc or desc for each column being indexed, and you might have wondered why. Asc is the default, so there is no need to use it. Desc actually does have a use, but it isn't obvious.

Postgres can access indexes forward or backward, so there is no need to specify ascending/descending if all columns are ascending or descending. The use case for specifying the index order is for multi-column indexes with mixed ordering, assuming queries also use the same mixed ordering. For example, let's create a two-column table with 1000 rows, and an index in all-ascending order:

CREATE TABLE test(x INTEGER, y INTEGER);
INSERT INTO test
        SELECT x, x FROM generate_series(1, 1000) AS f(x);
 
CREATE INDEX i_test ON test (x, y);

Postgres has no problem using the index if the query is all ascending or all descending, but it can't be used with mixed-column ordered queries:

EXPLAIN (COSTS off) SELECT * FROM test ORDER BY x, y;
              QUERY PLAN
--------------------------------------
 Index Only Scan using i_test on test
 
EXPLAIN (COSTS off) SELECT * FROM test ORDER BY x DESC, y DESC;
                  QUERY PLAN
-----------------------------------------------
 Index Only Scan Backward using i_test on test
 
EXPLAIN (COSTS off) SELECT * FROM test ORDER BY x, y DESC;
       QUERY PLAN
------------------------
 Sort
   Sort Key: x, y DESC
   ->  Seq Scan on test

Creating an index that matches the mixed ordering of the query allows the index to be used:

CREATE INDEX i_test2 ON test (x, y DESC);
 
EXPLAIN (COSTS off) SELECT * FROM test ORDER BY x, y DESC;
              QUERY PLAN
---------------------------------------
 Index Only Scan using i_test2 on test

The mixed-ordered index can even be used for backward scans:

EXPLAIN (COSTS off) SELECT * FROM test ORDER BY x DESC, y;
                   QUERY PLAN
------------------------------------------------
 Index Only Scan Backward using i_test2 on test

I am not sure how many people use mixed-ordered queries that would benefit from indexes, but I wanted to show how this feature can be useful.

View or Post Comments

Indexing Index Order Does Matter

Friday, February 24, 2017

Postgres has supported multi-column indexes since 1997, e.g., create index i_test on test (a, b, c). It can easily use an index if the supplied columns are all at the front of the index, e.g., a and b in the previous index, but it can also use the index if some of the indexed column values are not supplied, e.g., columns a and c in the previous index. It does this by looking up a in the index, then looking through the index for matches of c, ignoring values of b, e.g.:

CREATE TABLE test (a INT, b INT, c INT);
INSERT INTO test
        SELECT x, x, x FROM generate_series(1, 100000) AS f(x);
CREATE INDEX i_test ON test(a, b, c);
ANALYZE test;
 
EXPLAIN (COSTS false)
SELECT * FROM test WHERE a = 1 AND c = 1;
              QUERY PLAN
--------------------------------------
 Index Only Scan using i_test on test
   Index Cond: ((a = 1) AND (c = 1))

Of course, if possible, you should always put the most commonly-supplied columns first in the index because skipping columns during index scans (called "index skip scans") is expensive.

However, if you are supplying all the column values referenced in the index, I assumed it didn't matter what order the columns were specified in the index, but Robert Haas recently mentioned this is not always true. For example, if a has many duplicate values, and b has mostly unique values, having a at the start of the index is suboptimal — better to use b first, which will more effectively narrow the search space when looking for matches on a.

A more specific example of this is range queries. In that case, if the range test of one column is less restrictive than the equality test of another, it would be better for the equality test column to be first in an index. For example, in the queries below, the first query uses an index because it is very restrictive on the first column, the second query uses an index because it is moderately restrictive on the first column, while the third does not use an index because it is effectively unrestrictive on the first column:

EXPLAIN SELECT * FROM test WHERE a = 1 AND b >= 1 AND b <= 100000;
                               QUERY PLAN
-------------------------------------------------------------------------
 Index Only Scan using i_test on test  (cost=0.42..8.44 rows=1 width=12)
   Index Cond: ((a = 1) AND (b >= 1) AND (b <= 100000))
 
EXPLAIN SELECT * FROM test WHERE a >= 1 AND a <= 50000 AND b = 1;
                                 QUERY PLAN
----------------------------------------------------------------------------
 Index Only Scan using i_test on test  (cost=0.42..1404.10 rows=1 width=12)
   Index Cond: ((a >= 1) AND (a <= 50000) AND (b = 1))
 
EXPLAIN SELECT * FROM test WHERE a >= 1 AND a <= 100000 AND b = 1;
                       QUERY PLAN
--------------------------------------------------------
 Seq Scan on test  (cost=0.00..2291.00 rows=1 width=12)
   Filter: ((a >= 1) AND (a <= 100000) AND (b = 1))

Notice the increasing costs, even though all queries match one indexed row.

Obviously, in cases where you are not specifying all indexed columns in every query, you should put the most frequently referenced columns first in the index to avoid the overhead of index skip scans. However, for cases where most indexed columns are going to be supplied in queries, placing the most restrictive columns first in indexes is a win.

View or Post Comments

Indexing Expression Index Statistics and Joins

Wednesday, February 22, 2017

In my previous blog post, I showed how statistics generated on expression indexes can be used to produce more accurate row counts, and potentially better plans. While I did show more accurate row counts via explain, I did not show changed query plans. I plan to do so in this blog post. First, the setup:

CREATE TABLE test1 AS
        SELECT * FROM generate_series(1, 100000) AS f(x);
CREATE TABLE test2 AS
        SELECT * FROM generate_series(1, 2) AS f(x);
 
ANALYZE test1;
ANALYZE test2;

then a join query using modulus with no expression index:

EXPLAIN SELECT test1.x
FROM test1, test2
WHERE test1.x % 2 = 1 AND test1.x = test2.x;
                           QUERY PLAN
-----------------------------------------------------------------
 Nested Loop  (cost=0.00..1959.02 rows=1 width=4)
   Join Filter: (test1.x = test2.x)
   ->  Seq Scan on test1  (cost=0.00..1943.00 rows=500 width=4)
         Filter: ((x % 2) = 1)
   ->  Materialize  (cost=0.00..1.03 rows=2 width=4)
         ->  Seq Scan on test2  (cost=0.00..1.02 rows=2 width=4)

A nested loop join is used, which is suboptimal because the row count for test1 is one hundred times too small. With proper statistics on the modulus operation on test1.x, a more efficient hash join is used:

CREATE INDEX i_test1 ON test1((x % 2));
ANALYZE test1;
ANALYZE test2;
 
EXPLAIN SELECT test1.x
FROM test1, test2
WHERE test1.x % 2 = 1 AND test1.x = test2.x;
                            QUERY PLAN
------------------------------------------------------------------
 Hash Join  (cost=1.04..2132.29 rows=1 width=4)
   Hash Cond: (test1.x = test2.x)
   ->  Seq Scan on test1  (cost=0.00..1943.00 rows=50197 width=4)
         Filter: ((x % 2) = 1)
   ->  Hash  (cost=1.02..1.02 rows=2 width=4)
         ->  Seq Scan on test2  (cost=0.00..1.02 rows=2 width=4)

Notice the test1 row count is now much more accurate, and that analyzing the base table also analyzes the expression index. The total cost is now slightly higher (2132.29 vs. 1959.02), but that is not because the hash join is more expensive. Rather, it is because the nested loop misestimated how many rows it would need to process because it didn't know the selectivity of the modulus operation.

One thing I learned in researching this blog post is how much the optimizer "loves" hash joins. If test2 has three or more rows, or if test1 has ten times more rows and parallelism is enabled, a hash join is used even without expression index statistics. Hash joins are very robust despite misestimation so they are favored by the optimizer. The takeaway is that the creation of expression indexes for statistical purposes is recommended only if testing shows they actually improve query plans, i.e., improving explain row counts alone has little benefit.
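
If you want to see this yourself, here is a rough sketch that reuses the tables above, with the expression index dropped and a third row added to test2:

DROP INDEX i_test1;
 
INSERT INTO test2 VALUES (3);
ANALYZE test2;
 
EXPLAIN SELECT test1.x
FROM test1, test2
WHERE test1.x % 2 = 1 AND test1.x = test2.x;

With the third row in test2, the plan should switch to a hash join even though the test1 row count is still misestimated.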

View or Post Comments

Indexing Statistics on Expression Indexes

Monday, February 20, 2017

Most people know that Postgres allows the creation of indexes on expressions. This is helpful if you need index lookups of expressions used in where clauses.

However, there is another benefit to expression indexes, and that is optimizer statistics. Not only do expression indexes allow rapid lookups of matching expressions, but they also provide optimizer statistics, which improve row estimates and hence query plans. Here is an example:

CREATE TABLE test (x INTEGER);
INSERT INTO test SELECT x FROM generate_series(1, 100) AS t(x);
 
ANALYZE test;
 
SELECT COUNT(*) FROM test WHERE x % 2 = 1;
 count
-------
    50
 
EXPLAIN SELECT * FROM test WHERE x % 2 = 1;
                     QUERY PLAN
----------------------------------------------------
 Seq Scan on test  (cost=0.00..2.50 rows=1 width=4)
   Filter: ((x % 2) = 1)

The optimizer doesn't know the selectivity of the modulus operator, so it initially assumes only one row is returned. Once an expression index is created and analyze statistics generated, the optimizer knows exactly how many rows will be returned:

CREATE INDEX i_test ON test ((x % 2));
 
ANALYZE test;
 
EXPLAIN SELECT * FROM test WHERE x % 2 = 1;
                     QUERY PLAN
-----------------------------------------------------
 Seq Scan on test  (cost=0.00..2.50 rows=50 width=4)
   Filter: ((x % 2) = 1)

Interestingly, the optimizer used expression index statistics, even though the expression index itself was not used. In the example above, the modulus operator is not selective enough to make the index useful, but expression statistics would be useful for more complex queries, e.g., with joins. This method can also be used to create statistics on functions.

It is also possible to create an expression index that generates cross-column statistics, the benefits of which I mentioned in an earlier blog post. For example, this expression index would supply accurate statistics for state/city combinations, but queries would need to use the exact concatenation construction:

CREATE INDEX i_customer_state_city ON customer ((state || '|' || city));
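
For example, a hypothetical query that would benefit from those statistics, because its where clause uses the same expression:

EXPLAIN SELECT * FROM customer
WHERE state || '|' || city = 'NY|New York City';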

It would be nice if there were a way to create expression statistics without the overhead of creating and maintaining indexes.

View or Post Comments

Optimizer Still No Query Hints?

Friday, February 17, 2017

So, it is 2017 and Postgres still doesn't support query hints like other relational databases? Yep, it's true, and probably will be forever since "'Oracle-style' optimizer hints" is listed in the "Features We Do Not Want" section of the Postgres todo list. A wiki page outlines the reasons for this.

While this seems perfectly logical to people who have used Postgres for years, it strikes new users as rigid and extreme. There are several reasons for this divergence.

First, what new users don't realize is that there are ways to control the optimizer, just not with inline query hints. The most useful ways are changing planner constants and other planner options, which allow you to tailor the optimizer to your specific hardware and query types. This is a more holistic approach to optimizer tuning because, if set properly, these settings improve all queries, unlike "use this index"-style hints, which must be added to every query.

There are also several crude enable settings that can be changed at the session level to prevent the optimizer from considering certain executor capabilities, like join and scan types. Unfortunately, these affect the entire query, i.e., you can't turn off sequential scans for just one table. They also don't allow the specification of which index to use. These settings are really designed for debugging.
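
For example, a minimal sketch of steering a single session away from sequential scans (the table name is hypothetical):

SET enable_seqscan = false;
 
EXPLAIN SELECT * FROM test WHERE x = 1;
 
RESET enable_seqscan;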

In the same vein, an area that often causes suboptimal plans is queries that reference several columns from the same table where the column values are correlated, e.g., a row with a US state value of 'NY' is more likely to have a city value of 'New York City' than a row with a state value of 'AZ'. While inline query hints might fix this case for constants, they can't help when variables are passed into the query, as with parameterized queries. A patch being considered for Postgres 10, multivariate statistics, would benefit all queries by collecting statistics on column combinations in addition to the single-column statistics already collected. This, again, is an example of a holistic approach to improving the optimizer.

There are two additional ways to control single queries that are often used in production. Common table expressions allow a query to be broken up into stages which are executed sequentially because each with clause acts as an optimization fence. Secondly, the use of offset 0 prevents subqueries from being moved into the outer query.
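
As a rough sketch of both techniques (table and column names are hypothetical):

-- each WITH clause is planned and executed as its own stage
WITH odd_rows AS (
        SELECT * FROM test WHERE x % 2 = 1
)
SELECT * FROM odd_rows WHERE x < 100;
 
-- OFFSET 0 keeps the subquery from being pulled up into the outer query
SELECT *
FROM (SELECT * FROM test WHERE x % 2 = 1 OFFSET 0) AS odd_rows
WHERE x < 100;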

I expect future improvements to address other optimizer shortcomings. One area where inline query hints would really help is fixing queries in an emergency, when you don't have time to research the cause. Postgres currently doesn't have an easy solution for that.

View or Post Comments

Monitoring Going Deep on Stats

Wednesday, February 15, 2017

Postgres includes lots of monitoring tools that allow you to see what is happening, but there are a few settings that really go into detail, e.g., log_statement_stats:

SET client_min_messages = log;
SET log_statement_stats = true;
 
SELECT 100;
LOG:  QUERY STATISTICS
DETAIL:  ! system usage stats:
1 !       0.000000 s user, 0.000000 s system, 0.000201 s elapsed
2 !       [0.000000 s user, 0.000000 s system total]
3 !       0/0 [0/0] filesystem blocks in/out
4 !       0/14 [0/363] page faults/reclaims, 0 [0] swaps
5 !       0 [0] signals rcvd, 0/0 [0/0] messages rcvd/sent
6 !       0/0 [3/0] voluntary/involuntary context switches
 ?column?
----------
      100

What do all those numbers mean? Well, the Linux getrusage() manual page has a clue — these are kernel resource usage statistics. The first two lines show the user and kernel ("system") cpu time used, as well as elapsed time. (The numbers in brackets on line #2 and following are totals for this process.) Line #3 shows I/O from/to the storage device (not from the kernel cache). Line #4 covers memory pages faulted into the process address space and reclaimed. Line #5 shows signal and ipc message activity. Line #6 shows process context switches.

Let's look at some more interesting queries by populating a new table:

CREATE TABLE test (x INTEGER);
INSERT INTO test SELECT x FROM generate_series(1, 1000) t(x);

By clearing the kernel buffers with echo 3 > /proc/sys/vm/drop_caches and restarting the server, we can see reads from the storage device (via log_statement_stats) and reads from the kernel cache (via explain (buffers true)):

EXPLAIN (ANALYZE true, BUFFERS true, TIMING false, COSTS false)
SELECT COUNT(*) FROM test;
LOG:  QUERY STATISTICS
DETAIL:  ! system usage stats:
!       0.000000 s user, 0.000000 s system, 0.019259 s elapsed
!       [0.008000 s user, 0.000000 s system total]
!       1712/0 [5840/280] filesystem blocks in/out
!       7/94 [21/586] page faults/reclaims, 0 [0] swaps
!       0 [0] signals rcvd, 0/0 [0/0] messages rcvd/sent
!       29/1 [73/3] voluntary/involuntary context switches
                    QUERY PLAN
---------------------------------------------------
 Aggregate (actual rows=1 loops=1)
   Buffers: shared read=5
   ->  Seq Scan on test (actual rows=1000 loops=1)
         Buffers: shared read=5
 Planning time: 3.673 ms
 Execution time: 2.221 ms
 
EXPLAIN (ANALYZE true, BUFFERS true, TIMING false, COSTS false)
SELECT COUNT(*) FROM test;
LOG:  QUERY STATISTICS
DETAIL:  ! system usage stats:
!       0.000000 s user, 0.000000 s system, 0.000408 s elapsed
!       [0.008000 s user, 0.000000 s system total]
!       0/0 [5840/288] filesystem blocks in/out
!       0/3 [21/591] page faults/reclaims, 0 [0] swaps
!       0 [0] signals rcvd, 0/0 [0/0] messages rcvd/sent
!       0/0 [74/3] voluntary/involuntary context switches
                    QUERY PLAN
---------------------------------------------------
 Aggregate (actual rows=1 loops=1)
   Buffers: shared hit=5
   ->  Seq Scan on test (actual rows=1000 loops=1)
         Buffers: shared hit=5
 Planning time: 0.027 ms
 Execution time: 0.250 ms

Notice the first count reads 1712 512-byte blocks from storage and 5 8kB blocks from the kernel cache into the shared buffer cache. (The unrecorded reads are probably system table information.) The second query shows no I/O from the storage device, and a hit on 5 8kB shared buffer pages.

As you might know, Postgres executes queries in stages. Fortunately, you can get rusage statistics for each stage with log_parser_stats, log_planner_stats, and log_executor_stats. (explain (analyze true) also shows planning and execution durations, which you can see above.)
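
For example, here is a minimal sketch of enabling only the executor statistics for the test table above (the per-stage settings cannot be combined with log_statement_stats, so that must be turned off first):

SET client_min_messages = log;
SET log_statement_stats = false;
SET log_executor_stats = true;
 
SELECT COUNT(*) FROM test;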

As you can see, you can access a wealth of information about Postgres and how it is using the cpus, storage, and caches. These have a large impact on how quickly queries execute.

View or Post Comments

Performance Performance Goalposts

Monday, February 13, 2017

I often get the question, "How do I get good performance out of Postgres?", and sometimes I am kind of miffed because I get the feeling the asker expects a short answer. Those who are familiar with Postgres know there is no short answer, and it really isn't Postgres's fault because there isn't a simple answer to that question for any relational database.

I guess it comes down to the complexity of relational systems. There is the sql parser, the optimizer, and background workers. No matter how much auto-tuning we do, there are still going to be things that administrators must adjust for optimal performance. And then there is the hardware, which is stressed by all relational systems. Proper tuning of Postgres must happen at all three levels:

  • Hardware
  • Database parameter configuration
  • Sql (both data modeling and queries)

Unless all three are tuned, you can't say the system is optimized. I have already covered hardware tuning in my Database Hardware Selection Guidelines presentation. Database parameter configuration is a tough one because every site has different requirements, though there are a few common tuning cases. Tuning at the sql level involves proper schema and index setup, plus monitoring with something like log_min_duration_statement, pg_stat_statements, or pgBadger. Once you identify a problem sql query, you often have to do detective work with explain.

None of this is easy, but hey, if it were easy everyone could do it. The good news is that, when done well, database performance tuning can yield big dividends.

View or Post Comments

NoSQL Postgres Gives Developers More Options with NoSQL

Saturday, February 11, 2017

I just co-authored a blog for ibm's Linux on Power developer site. While there is similar content in my YeSQL slides, the ibm blog more concisely describes the four options available to application developers, from pure relational to pure NoSQL, and the options in between possible with Postgres.

View or Post Comments

Object bundles Creating Extensions in Schemas

Thursday, February 2, 2017

Postgres extensions are a powerful way to add functionality to Postgres, and dozens of extensions are available. However, it can be difficult to manage them. By default, extensions are loaded into the first existing schema in search_path (usually the public schema), e.g.:

CREATE EXTENSION pgcrypto;
\df
                                        List of functions
 Schema |         Name          | Result data type |        Argument data types         |  Type
--------+-----------------------+------------------+------------------------------------+--------
 public | armor                 | text             | bytea                              | normal
 public | armor                 | text             | bytea, text[], text[]              | normal
…

This loads 36 functions into the public schema. This can be difficult to manage, particularly if multiple extensions are loaded into the same database.

One solution is to place each extension into its own schema. This can be accomplished by setting search_path before loading the extension, or using the with schema clause in the create extension command:

CREATE SCHEMA pgcrypto;
 
CREATE EXTENSION pgcrypto WITH SCHEMA pgcrypto;
 
\df pgcrypto.*
                                         List of functions
  Schema  |         Name          | Result data type |        Argument data types         |  Type
----------+-----------------------+------------------+------------------------------------+--------
 pgcrypto | armor                 | text             | bytea                              | normal
 pgcrypto | armor                 | text             | bytea, text[], text[]              | normal
…

Users can then set their search_path whenever they want to use the extension, or prefix the function call with the schema name, e.g., select pgcrypto.armor('abc').
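
For example, a short sketch of both calling styles:

SET search_path = pgcrypto, public;
SELECT armor('abc');
 
-- or, without changing search_path
SELECT pgcrypto.armor('abc');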

View or Post Comments

Administration Beware of Antivirus Software with Postgres

Monday, January 30, 2017

Antivirus software helps to protect your computer from external attack. Unfortunately, it can cause problems with Postgres. Antivirus software constantly scans for files and behavior that suggests attacks, but Postgres can easily get caught in that net.

Postgres regularly creates heap, index, and wal files, and these files are binary. Depending on what data is written, these files can contain patterns that are mistaken for viruses, causing the files to be removed from Postgres-controlled directories and quarantined. Obviously, this causes bad things to happen to Postgres. Postgres sites using antivirus software should disable antivirus scans of directories controlled by Postgres.

View or Post Comments

News Updated Blog Categories, Images

Saturday, January 28, 2017

When the number of my blog entries surpassed 400, I realized that the categories I was using to group entries for easier reference were too coarse. Therefore, I have added new blog categories to make future reference easier. Also, I have dozens of new blog entries ready to post in the coming months.

Related to restructuring, I have updated many of the stock images in my presentations.

View or Post Comments

Performance Two More Interviews

Wednesday, January 25, 2017

I did two hour-long audio interviews recently with the group Three Devs and a Maybe. The first covers performance and the second covers database concurrency.

View or Post Comments

Press Moscow Interview

Monday, January 23, 2017

I did a 12-minute interview in Moscow in November and a recording is now available online. The first four minutes cover Russian-specific aspects of Postgres, and the remainder discusses Postgres's general direction, challenges, and its place in the database ecosystem.

View or Post Comments

Certificate Using Ssl Certificates

Tuesday, January 17, 2017

Having covered ssl certificate creation and the use of certificate authorities (ca), I would like to put it all together and show how certificates and certificate authorities work to ensure trusted Postgres communication.

I have created a diagram showing server, client, and certificate authority certificates. None of these certificates is secret, e.g., the server sends its ssl certificate to the client, and vice versa. In the diagram, the server and client use the same certificate authority certificate. (Intermediate certificate authorities could also be used.)

When the client connects, the server sends its certificate to the client. The client uses the public key in its certificate authority certificate to verify that the server certificate was signed by its trusted certificate authority (the red line). It then uses the public key in the server certificate to encrypt a secret key that is sent to the server. Only a server with the matching private key can reply to generate a session key. It is not the possession of the server certificate that proves the server's identity but the possession of the private key that matches the public key stored in the server's certificate. The same is true for client certificates used for client host and user authentication (the blue line).

Hopefully this diagram helped you see how the same certificate authority certificate on the server and client allows for identity verification. Interestingly, identity verification is not required for two systems to communicate in an eavesdrop-proof manner, but if you can't be sure who you are communicating with, it isn't very useful.

View or Post Comments

Certificate Creating Ssl Certificates

Thursday, January 12, 2017

Having covered the choice of certificate authorities, I want to explain the internals of creating server certificates in Postgres. The instructions are already in the Postgres documentation.

When using these instructions for creating a certificate signing request (csr), two files are created:

  • certificate signing request file with extension req
  • key file, containing public and private server keys, with extension pem

(It is also possible to use an existing key file.) You can view the contents of the csr using openssl, e.g.:

$ openssl req -in server.req -text
Certificate Request:
    Data:
        Version: 0 (0x0)
        Subject: … CN=momjian.us …
        Subject Public Key Info:
            Public Key Algorithm: rsaEncryption
                Public-Key: (2048 bit)
                Modulus:
                    00:b2:ea:53:a6:5e:0a:bd:dd:59:08:52:0d:bb:77:
                    …
                Exponent: 65537 (0x10001)
        Attributes:
            a0:00
    Signature Algorithm: sha256WithRSAEncryption
…
-----BEGIN CERTIFICATE REQUEST-----
…
-----END CERTIFICATE REQUEST-----

The first argument to the openssl command-line tool is always a subcommand, in this case req. (The openssl manual pages are split by subcommand, so man req shows the openssl arguments for the req subcommand.)

The common name (cn) shown above is potentially checked by the client. The encoded section ("certificate request") can be suppressed with -noout. The file server.req already contains this information in text format because -text was specified during its creation.

The server public and private keys stored in privkey.pem can also be viewed:

$ openssl pkey -in privkey.pem -text
-----BEGIN PRIVATE KEY-----
MIIEvwIBADANBgkqhkiG9w0BAQEFAASCBKkwggSlAgEAAoIBAQCy6lOmXgq93VkI
…
-----END PRIVATE KEY-----
Private-Key: (2048 bit)
modulus:
    00:b2:ea:53:a6:5e:0a:bd:dd:59:08:52:0d:bb:77:
    …
publicExponent: 65537 (0x10001)
    …
privateExponent:
    …
prime1:
    …
prime2:
    …
exponent1:
    …
exponent2:
    …
coefficient:
    …

All of this can be very complex so I have created a diagram which illustrates what is happening. At the top-left is the server key generated by openssl req. This command also creates the certificate signing request (the csr, top-right) which contains:

  • Certificate signing information in X.509 format, e.g., common name (cn)
  • Public server key to be used by clients wishing to authenticate the server's identity
  • The above signed by the server's private key to prove that the server owner supplied this information

The certificate signing request (csr) can't be used as a certificate because it is unsigned. For self-signed certificates, the server's private key is used to sign the csr. (These are the instructions given in the Postgres documentation.) A more powerful option, as I mentioned in my previous blog post, is to use a local certificate authority. In my diagram, a certificate authority (bottom-left) uses its private key to sign the csr and create a certificate (bottom-right).

Once the server has a certificate, any client who has the signer of the server's certificate in their ~/.postgresql/root.crt file can authenticate the server certificate by using the sslmode connection parameter verify-ca. They can also verify that the certificate was created for the specific server's host name by checking the common name (cn) using verify-full. (Clients can record revoked certificates in ~/.postgresql/root.crl.)

Hopefully you can make better use of ssl server certificates now that you understand how they are created, inspected, and verified.

View or Post Comments

Certificate Ssl Certificates and Certificate Authorities

Monday, January 9, 2017

When setting up ssl in Postgres, you can't just enable ssl. You must also install a signed certificate on the server.

The first step is to create a certificate signing request (csr) file that contains the host name of the database server. Once created, there are three ways to sign a csr to create a certificate:

  • Self-signed: use the key created by the certificate signing request to sign its own certificate
  • Local certificate authority: sign the csr file with the local certificate authority's key
  • Public certificate authority: have a public certificate authority sign the csr

The use of public certificate authorities doesn't make sense for most databases because it allows third parties to create trusted certificates. Their only reasonable use is if you wish to allow public certificate authorities to independently issue certificates that you wish to trust. This is necessary for browsers because they often connect to unaffiliated websites where trust must be established by a third party. (Browsers include a list of public certificate authorities who can issue website certificates it trusts.)

Issuing certificates to clients as well enables additional features, such as using client certificates for client host and user authentication. These features require that the server and client certificates be signed by the same certificate authority.

Ssl setup in Postgres can be complicated, but it offers a unique set of security and usability features.

View or Post Comments

Administration Use Kill -9 Only in Emergencies

Friday, January 6, 2017

During normal server shutdown, sessions are disconnected, dirty shared buffers and pending write-ahead log (wal) records are flushed to durable storage, and a clean shutdown record is written to pg_control. During the next server start, pg_control is checked, and if the previous shutdown was clean, startup can ignore the wal and start immediately.

Unfortunately, a clean shutdown can take some time, and impatient database administrators might get into the habit of using kill -9 or pg_ctl -m immediate to quicken the shutdown. While this does have the intended effect, and you will not lose any committed transactions, it greatly slows down the next database startup because all wal generated since the last completed checkpoint must be replayed. You can identify an unclean shutdown by looking at the server logs for these two ominous lines:

LOG:  database system was interrupted; last known up at 2016-10-25 12:17:28 EDT
LOG:  database system was not properly shut down; automatic recovery in progress

Such crash shutdowns can also happen if the computer crashes, the operating system crashes, or if a proper database shutdown script isn't triggered on computer shutdown.

So, in summary, administrators should avoid Postgres crash shutdowns unless shutdown time is more precious than startup time. Every time those log lines appear, Postgres is doing more work than would have been necessary if the previous shutdown had been clean.

View or Post Comments

Administration Controlling Autovacuum

Tuesday, January 3, 2017

Unlike other database systems, Postgres makes the cleanup process visible and tunable to users. Autovacuum performs recycling of old rows and updates optimizer statistics. It appears in ps command output, the pg_stat_activity system view, and optionally in the server logs via log_autovacuum_min_duration.

Postgres also allows fine-grained control over the autovacuum cleanup process. Occasionally users find that cleanup is slowing the system down, and rather than modifying the behavior of autovacuum, they decide to turn it off via the autovacuum setting.

However, turning off autovacuum can cause problems. Initially the system will run faster since there is no cleanup overhead, but after a while old rows will clog up user tables and indexes, leading to increasing slowness. Once that happens, you can turn on autovacuum again, and it will recycle the old rows and free up space, but there will be much unused space that can't be reused quickly, or perhaps ever.

Turning off autovacuum is kind of like someone trying to save time by not changing the engine oil in their car. They are correct that for a while their car will be in the shop less often, but ultimately the car will be in the shop for a long time, or will stop working. Therefore, don't turn off autovacuum.

Another problem is that while autovacuum is turned off, optimizer statistics are not updated, which can lead to slower execution plans. Fortunately, restarting autovacuum fully fixes that problem.

Now that we know that turning off autovacuum is a bad idea, what are the options if administrators want to reduce its overhead? Well, first, there are many autovacuum tuning parameters that allow autovacuum activity to happen less often, or consume fewer resources while running.

A more aggressive, and perhaps creative, approach is to change when autovacuum runs. Most systems have busy times and idle times. Sometimes this can be determined by the time of the day or the day of the week, e.g., Sundays, 0200-0500. In other cases it can be determined by the system's load average or number of active sql sessions in pg_stat_activity.
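
For example, a hypothetical scheduling script might treat the system as idle when a query like this returns a small number:

SELECT COUNT(*) AS active_sessions
FROM pg_stat_activity
WHERE state = 'active'
      AND pid <> pg_backend_pid();   -- ignore this monitoring session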

No matter how it is determined, wouldn't it make sense for autovacuum to run during these idle periods, and not run during busy periods? Well, it's not quite that simple. You don't really want to force autovacuum to run during idle times if there is no useful activity to perform, nor do you want it to never run during busy times in case there is a massive delete that requires cleanup or a large data change that requires updated optimizer statistics.

What you want is to encourage autovacuum to run during idle periods, and discourage it from running during busy times. Fortunately, that is easy to do, especially with the alter system command added in Postgres 9.4.

Let's assume you have your own method to determine busy and idle times. When an idle time starts, simply execute:

ALTER SYSTEM SET autovacuum_vacuum_scale_factor = 0.1;
ALTER SYSTEM SET autovacuum_analyze_scale_factor = 0.05;
 
SELECT pg_reload_conf();
 pg_reload_conf
----------------
 t

Of course, if you have changed these settings in postgresql.conf, you can instead use this in psql to reduce the current value of autovacuum_vacuum_scale_factor by half:

SELECT current_setting('autovacuum_vacuum_scale_factor')::float8 / 2 AS scale
\gset
ALTER SYSTEM SET autovacuum_vacuum_scale_factor = :scale;

You might also want to reduce autovacuum_freeze_max_age and autovacuum_multixact_freeze_max_age during this period. Then, at the end of the idle period, run this:

ALTER SYSTEM RESET autovacuum_vacuum_scale_factor;
ALTER SYSTEM RESET autovacuum_analyze_scale_factor;
 
SELECT pg_reload_conf();
 pg_reload_conf
----------------
 t

This sets the values back to what is configured in postgresql.conf (or the defaults). Pretty simple, huh? Yet effective. A more sophisticated approach would be to proportionally decrease these settings based on the load on the system.

Let's walk through how this works. Suppose we don't change any settings during idle times and use the default autovacuum_vacuum_scale_factor of 20%. If a table has 19.5% expired rows at midnight, autovacuum will not run. When the percentage of expired rows reaches 20% at 11am, autovacuum will run and remove the expired rows. Now suppose that autovacuum_vacuum_scale_factor is reduced by half at midnight — autovacuum will run and all tables will have less than 10% expired rows by the time the idle period ends. Now, during the busy time autovacuum will only run if a table has increased from 10% expired rows to 20%.

It would be nice if this proportional behavior could be added to autovacuum but I am not sure how to implement that in a clean way.

View or Post Comments