Parallel Query Report from the PGCon Developer MeetingThursday, May 17, 2012
As part of yesterday's PGCon Developer Meeting, I hosted a discussion about adding resource parallelism to Postgres, which I blogged about previously. Josh Berkus has kindly summarized the results of that discussion. I am hoping to help track and motivate progress in this area in the coming months; it is a multi-year project to complete.
Template1 Me!Wednesday, May 16, 2012
You might have heard of the template1 database before, or seen it in the output of pg_dumpall:
REVOKE ALL ON DATABASE template1 FROM PUBLIC; …
It is one of those databases, like template0, that sits in the background and appears in psql \l output. While you can go for years never knowing about template1, it does have a useful purpose.
As its name suggests, it is the default "template" used for creating new databases. This means, if you modify template1, every new database created after the modification will contain the changes you made. So, if you need a table, schema, function, server-side language, or extension in every database, add it to template1 and you will have it in every newly-created database. The Postgres documentation covers this in detail.
I mentioned template1 is the default template database, but you can use other databases as templates for newly-created databases. Create database has a template option that allows you to specify a different database to copy from. For example, to create a copy of the production database called demo, have everyone disconnect from the production database, and use production as the template to create demo. The demo database can then be modified without affecting production. Create database does a file-level copy, so all the I/O is sequential and much faster than a dump/restore of the database.
As a side node, template0's purpose is often mysterious too, particularly because you can't even connect to it:
$ psql template0 psql: FATAL: database "template0" is not currently accepting connections
There must be something special in there because it's inaccessible! Well, sorry to kill the mystery, but template0 is used solely by pg_dumpall to properly dump out the customizations made to template1. If template0 did not exist, it would be impossible for pg_dumpall to determine the customizations made to template1 and dump those out appropriately.
Postgres 9.2 Draft Release Notes PublishedThursday, May 10, 2012
I have completed the Postgres 9.2 release notes I started seven days ago. Writing the release notes is always a taxing experience. I have to basically dedicate an entire week to the process of digesting 1100 commit messages to produce 3000 lines of SGML text. Once I am done though, it is rewarding to see the finished product. It is like working on a painting all year, and spending a frustrating week framing it and hanging it on the all — once it is hung, you stand and admire it, and forget much of the sweat it took to produce. I am sure many community members feel the same.
Curious how the 9.2 release item count compares to previous major releases? Here are the results:
| Release | Items |
| 9.2 | 241 |
| 9.1 | 203 |
| 9.0 | 237 |
| 8.4 | 314 |
| 8.3 | 214 |
| 8.2 | 215 |
| 8.1 | 174 |
| 8.0 | 230 |
| 7.4 | 263 |
Of course, this is just the first draft of the release notes; the 9.2 count will change regularly until the final release.
Postgres Drinking GameMonday, May 7, 2012
My children have traveled to many conferences with me, and have heard many webcasts and training calls at home. I guess, after hearing about Postgres so often, they pick up certain phrases of interest, and the big one for them is "shared buffers". Anytime someone uses those words, they start howling and make up some funny sentence using the words.
There are a variety of beer drinking games that trigger drinking when a
word is said — perhaps "shared buffers" is enough on its own to make a drinking game.
The Effectiveness of effective_cache_sizeFriday, May 4, 2012
Having reported the methods for finding the size of the kernel cache on Linux, I wish to highlight the importance of the postgresql.conf setting effective_cache_size.
Unlike other memory settings that control how memory is allocated, effective_cache_size tells the optimizer how much cache is present in the kernel. This is important for determining how expensive large index scans will be. The optimizer knows the size of shared_buffers, but not the kernel cache size, which affects the probability of expensive disk access.
The kernel cache size changes frequently, so run free during a period of normal system load and use that value to set effective_cache_size. The value doesn't have to be perfect, but just a rough estimate of how much kernel memory is acting as secondary cache for the shared buffers.
Starting on 9.2 Release NotesThursday, May 3, 2012
As in previous years, I have started working on major release notes, this time for Postgres 9.2. I start with this command:
$ src/tools/git_changelog --since '2011-06-11 00:00:00 GMT' --master-only \ > --oldest-first --details-after
which generates 11k lines of output, and eventually reduce that to 3k lines of SGML-marked-up release notes. I have outlined the steps in the past, though git has improved the speed of researching items.
Measuring Free Memory and Kernel Cache Size on LinuxWednesday, May 2, 2012
Measuring Linux free memory and kernel cache size can be somewhat tricky. You might look at /proc/meminfo for the answer (commas added):
# cat /proc/meminfo MemTotal: 24,736,604 kB MemFree: 3,805,392 kB Buffers: 743,016 kB Cached: 18,188,208 kB ...
Unfortunately, this brings up two more questions: Why is "MemFree" so small on this idle system, and what is the difference between "Buffers" and "Cached"? Another way of displaying this information is using the Linux free command:
# free -m
total used free shared buffers cached
Mem: 24156 20351 3805 0 743 18188
-/+ buffers/cache: 1419 22737
Swap: 6219 0 6219
Addressing the first question, we see the same 3.8GB for free memory displayed on the first line under "free". This display is so often misinterpreted that it necessitated the creation of an explanatory website.
The Mem line output by free shows memory from the kernel perspective. The free value of 3.8GB represents kernel memory not allocated for any purpose, i.e. free, from the kernel's perspective. The bulk of the memory, 18GB, is shown as used for caching ("cached"). The second line, confusingly labeled "-/+ buffers/cache", represents memory from the process perspective. It is labeled that way because the "buffers" and "cached" have been removed from the "used" column and added to the "free" column, i.e. -/+. From the process perspective that memory is immediately available because it represents read cache that can be discarded as soon as memory is needed by processes. This labeled output summarizes the columns (details):
# free -m
total used free shared buffers cached
Mem: A+B+C+D B+C+D A B C
-/+ buffers/cache: D A+B+C
(The "Swap" line has been removed from this and later free outputs.)
"A" represents kernel free memory, and "D" represents memory used by processes. You can see that "B+C" is added to the "used" column in line 1, and added to the "free" column in line 2. "total" represents all memory, less memory used for basic kernel operation. In fact, "total" always equals "used" + "free" for each line:
# free -m
total used free shared buffers cached
Mem: A+B,C+D A B
-/+ buffers/cache: C D
Here is actual free output:
# free -m
total used free shared buffers cached
Mem: 24156 20351 3805 0 743 18188
-/+ buffers/cache: 1419 22737
Notice that the italicized values on line 1 equal the bold value on line 2. Similarly, these italicized value equal the bold value:
# free -m
total used free shared buffers cached
Mem: 24156 20351 3805 0 743 18188
-/+ buffers/cache: 1419 22737
While free is confusing, its layout does suggest the meaning of various columns. Open source has often debated what "free software"
means, and it seem the Linux kernel also has multiple definitions of the word "free".
The second question regards the difference between "buffers" and "cache". (Josh Berkus recently blogged about a kernel bug that prevented most available ram from being used as cache.) The definitive answer for modern kernels comes from this source code comment:
Buffers: Relatively temporary storage for raw disk blocks
shouldn't get tremendously large (20MB or so)
Cached: in-memory cache for files read from the disk (the
pagecache). Doesn't include SwapCached
The buffers and caches can be cleared using these commands (at least in Linux 2.6.16 and later kernels):
To free pagecache:
echo 1 > /proc/sys/vm/drop_caches
To free dentries and inodes:
echo 2 > /proc/sys/vm/drop_caches
To free pagecache, dentries and inodes:
echo 3 > /proc/sys/vm/drop_caches
With the last command run, the free command shows almost nothing cached, and the first two entries in the "free" column are almost identical, as expected:
# free -m
total used free shared buffers cached
Mem: 24156 899 23257 0 0 91
-/+ buffers/cache: 807 23349
Swap: 6219 1 6218
Hopefully this explanation helps people understand the various perspectives of "free" reported by the Linux kernel and an easy way to find the size of the Linux kernel cache.
Caching LevelsMonday, April 30, 2012
There are several levels of caching used in a typical server — here they are, in order of increasing distance from the cpu:
| 1. | Cpu cache |
| 2. | Random-Access Memory (ram) |
| 3. | Storage controller cache |
| 4. | Storage device (disk) cache |
All of these are a fixed size and set at hardware installation time, i.e. you can't move cache from one level to another. For Postgres, there is flexibility in how #2, random-access memory, is allocated, and this provides a never-ending opportunity for administrators to optimize their systems. The three ram allocation possibilities are:
The last item, kernel cache, isn't really an allocation but rather is based on the ram remaining from the previous two allocations. Some database systems avoid the kernel cache by directly writing to storage (direct i/o or raw disk access). While Postgres does allow direct i/o for writing of the Write-Ahead Log (wal) (via wal_sync_method), it does not support direct i/o or raw device control of data buffers. The reason for this is that Postgres tries to be operating system-neutral, and allows the kernel to do read-ahead and write combining. This is often more efficient, particularly if the storage subsystem has both database and non-database activity. In such cases, only the kernel knows about all the storage activity, and can better optimize the workload. This also helps Postgres run well in virtualized environments. The big downside of using the kernel cache is that it often contains data already in the shared buffer cache.
Future blog entries will explore the various configuration possibilities of allocating random-access memory (ram).
View or Post Comments
Upcoming EventsSaturday, April 28, 2012
During the next two months I will be attending events in the following cities: New York City, Ottawa, Charlotte (North Carolina), and
Boston — the details are on my website. I will also be doing training in many of
these locations. And I spoke in Philadelphia this week — I
guess this is what the New Postgres Era looks like.
Why Use Multiple Schemas?Friday, April 27, 2012
I mentioned that Postgres supports multiple clusters, databases, and schemas, but an open question is why use multiple schemas, rather than placing everything in the "public" schema? By default, search_path places everything in the public schema (assuming a schema matching the current user name does not exist):
test=> SHOW search_path; search_path ---------------- "$user",public (1 row)
Of course, search_path can be set at various levels, e.g. user, database, session.
So, why use more than the public schema?
When Informix introduced schemas in the mid-1990's, I didn't understand schemas and why all the tables now were prefixed with "public", so I never used this useful feature. Hopefully this blog post helps explain the usefulness of schemas and why ignoring schemas, as I did in the past, is not always a good idea.
View or Post Comments
Take a Dip in the PoolerWednesday, April 25, 2012
I recently mentioned the use of connection poolers to reduce the overhead of server-side language initialization, but I believe the topic warrants fuller coverage.
Aside from the ability to reduce language initialization overhead, the two major advantages of connection pooling are:
While these advantages tend to get lumped together in people's minds, they actually address two different issues. The first, reducing session startup time, reduces the delay in executing the first query of a session, and reduces process creation overhead on the server. It is particularly useful for databases with many short-lived sessions, and for operating systems, i.e. Windows, that do not use fork (details about fork).
The second advantage, reducing session management overhead, improves overall performance by reducing the server overhead involved in managing many sessions, i.e. if many of those sessions are idle, it is possible to speed up the active sessions by reducing the total number of database sessions connected to the server. This is obviously advantageous when there are many sessions, but most are inactive, i.e. the database is more efficient executing 20 active sessions than running 20 active sessions and managing 400 inactive sessions.
Postgres has two popular connection poolers, pgpool-II and PgBouncer. The names are a little confusing, because pgpool-II does a lot more than pooling. (Josh Berkus covered this issue recently.) Connection poolers are also built into many application-building tools, like PHP and Hibernate. Whichever one you use, they all provide the benefits listed above.
One tricky limitation is that database sessions can only be pooled per-database, meaning that if you connect to many different databases in a unpredictable pattern, a connection pooler will be less useful.
Postgres does not provide a built-in connection pooler. and that is by design — external connection poolers can be placed closer to the client, or on inactive servers, and they can be used to redirect client connections to new servers after a fail-over — a built-in connection pooler would have none of these advantages.
View or Post Comments
When to Use Multiple Clusters, Databases, or Schemas?Monday, April 23, 2012
I previously explained that Postgres allows multiple databases per cluster. The outstanding question might be, when should I use multiple clusters, multiple databases, or multiple schemas? The following table outlines the advantages of the various container types:
| Feature | Cluster | Database | Schema |
| Isolated Server Start/Stop | ✓ | ||
| Connection Control | ✓ | ✓ | |
| Private System Tables | ✓ | ✓ | |
| Private Plug-Ins | ✓ | ✓ | |
| Isolated Administration | ✓ | ||
| Shared Administration | ✓ | ✓ | |
| Isolated Resource Usage | ✓ | ||
| Shared Resource Usage(1) | ✓ | ✓ | |
| Data Isolation(2) | ✓ | ✓ | |
| Cross-Container Queries | ✓ |
(1) A large number of data containers increases the usefulness of resource sharing, e.g. shared_buffers. Resource sharing includes log shipping and streaming replication sharing.
(2) User and database names, being global objects, are visible in all databases. It is impossible to query across databases, except via an external database session, e.g. dblink. Schema permissions allow data access control, but pg_class still shows all tables defined in the database.
Hopefully this chart helps users choose the proper container for their data needs.
View or Post Comments
Avoiding Logical Dump PitfallsWednesday, April 18, 2012
Postgres has three backup methods: logical, file system, and continuous archiving. The last two rely on copying the file system (with the possible assistance of the write-ahead log). The first method, logical backups, is performed via pg_dump or pg_dumpall, and has two pitfalls I would like to mention.
First, many people think that running pg_dumpall and running pg_dump on each database individually are equivalent — they are not. While pg_dumpall does backup each database individually, it also dumps out global objects that exist outside any specific database, e.g. roles, tablespaces. If you prefer to use pg_dump to dump each database individually, be sure to run pg_dumpall -globals-only as well. Failure to restore global objects before per-database objects causes all restored data to be owned by the restore user and be in a single tablespace.
The second pitfall is to believe that a logical dump of the database has captured all cluster modifications. While all the database objects have been dumped, configuration file changes are not included in a logical dump and must be backed up separately, e.g. postgresql.conf, pg_hba.conf. This is why many wise administrators and packaging systems move these configuration files out of the the data directory and into a directory that is regularly backed up by the operating system, e.g. /etc. With the configuration files moved, the data directory can be skipped entirely during operating system-level backups.
So, if you perform logical backups, make sure you backup your global objects and configuration files as part of your backup process.
View or Post Comments
Multiple Databases and Global ObjectsMonday, April 16, 2012
The Postgres system is setup differently than some other database systems, and while long-time Postgres users consider its layout natural, people coming from other database are often initially confused. Therefore, it is helpful to explicitly mention some of those differences.
First, Postgres allows multiple databases in a single cluster (which is controlled by a single postmaster). This often confuses Oracle users because in Oracle you can only have one database in a cluster. This also confuses MySQL users because they have multiple "databases", but these databases share a single system catalog and allow cross-"database" queries. Effectively, they are "schemas", rather than databases. Microsoft SQL and DB2 have a similar facility to Postgres, with multiple databases in a cluster, and multiple schemas in a database.
As part of this multi-database, multi-schema capability, there are some things that don't make sense to store in per-database tables, e.g. roles, tablespaces. These are called global objects and a system catalog query can show them to us:
SELECT relname
FROM pg_class JOIN pg_tablespace ON (reltablespace = pg_tablespace.oid)
WHERE relkind = 'r' AND -- only tables
spcname = 'pg_global'
ORDER BY 1;
relname
--------------------
pg_auth_members
pg_authid
pg_database
pg_db_role_setting
pg_pltemplate
pg_shdepend
pg_shdescription
pg_shseclabel
pg_tablespace
(9 rows)
Another way to see global objects is with pg_dumpall --globals-only.
Looking at the list, it seems natural that they are not per-database — you certainly wouldn't want the list of available databases stored in a single database. Role (user and group) information is also cluster-wide. Tablespaces are something that exists as part of the file system, so it makes sense for them to be global too.
I hope this helps users who might have been confused by the way Postgres does things.
View or Post Comments
Selecting All Columns and TOASTFriday, April 13, 2012
I have already blogged about toast storage (and TOAST queries), but I would like to highlight one aspect that is easily overlooked by application programmers. There is an age-old debate over whether using select * is good programming practice — many feel that wildcarding all columns of a table makes applications more brittle when columns are added or removed, while others feel that specifying all columns in an application is more error-prone. If a table has only a few columns, specifying them is pretty easy, but when the number of columns is large, specification of column names can be cumbersome. Of course, having wide rows in tables has its own issues, but I don't want to get into that here — what I do want to cover is its affect on toast'ed values.
As I mentioned in my previous blog post, long values are not stored in the main row, but in toast tables that have additional access overhead. This is great when you only need the short values from a row, which is typical, but what happens when you use select *? Well, if you actually need all the column values, using select * or specifying the column name explicitly makes no difference, but many applications use select * when they only need some of the column values, and this is where toast becomes an issue. If you don't need to see some columns, and those columns might be long and hence toast'ed, it is wise to avoid selecting them in a query — this will avoid toast table access and speed up queries significantly.
With toast, some values are more expensive to access than others, so avoid unnecessarily accessing columns that might contain toast'ed values. (Observation from Stephen Frost.)
View or Post Comments
Don't Ignore shared_preload_librariesWednesday, April 11, 2012
You might have seen shared_preload_libraries in the postgresql.conf configuration file, but its usefulness might have been unclear — I would like to explain its purpose here.
Unlike traditional database systems, Postgres is an object-relation database system, meaning it relies heavily on plug-in objects to enable flexible behavior, and many of these objects are supplied as shared object libraries (or dynamically-loadable libraries (DDL) on Windows). Examples of external objects include the PL/pgSQL server-side language, the pgcrypto cryptographic library, and the PostGIS geographic information system. These are all implemented as shared libraries that are dynamically loaded into the database server when accessed. The process of loading a library into a running executable is called dynamic loading and is the way most modern operating systems access libraries (the non-dynamic method is called "static linking").
However, this dynamic loading is not free — it takes time to load a dynamic library the first time it is accessed. The library must be mapped into the process address space, and its symbol table read, and this must happen before any functions in the dynamic library can be executed. This is where shared_preload_libraries is useful — on operating systems where child processes are forked rather than created from an executable (all but Windows), shared_preload_libraries loads the shared library into the parent of all child processes (the postmaster). This causes all child processes, i.e. new sessions, to inherit an address space that already contains the needed shared library, speeding up calls to any functions in that library. (Fork is illustrated in my Inside PostgreSQL Shared Memory presentation, slide 12.)
In one test, I found that setting shared_preload_libraries = '$libdir/plpgsql' in postgresql.conf caused a simple plpgsql function:
CREATE FUNCTION preload_test() RETURNS void AS E'BEGIN\nEND;' LANGUAGE plpgsql;
executed with these commands 10,000 times in the same psql session:
\connect test SELECT preload_test();
reduced execution time significantly. Of course, this is an atypical test and most workloads will not see as significant a gain. Here is a summary of my various tests:
----- S e c o n d s ----- % Net Change
Library No Preload With Preload % Change from chr('a')
-----------------------------------------------------------------------
chr('a') 28.0 28.2 +1 n/a
pgcrypto(1) 32.8 30.9 -6 -40
plpgsql 34.7 30.5 -12 -63
plperl(2) 295.0 240.3 -19 -20
1 query: armor('a')
2 function body: 'return;'
PL/Perl, being a more complex library, showed a more significant improvement, and its high overhead suggests the advantage of using pooled connections so the Perl interpreter doesn't have to be started for every new connection. Notice that loading unneeded libraries can have a slight overhead, so use shared_preload_libraries discriminately.
In conclusion, testing shows that shared_preload_libraries can improve system performance (except on Windows) if your server heavily uses shared libraries.
View or Post Comments
Three Ways to Install PostgresTuesday, April 10, 2012
There are three ways to install Postgres, and they are all listed on the Download menu on the Postgres website. While the web authors did their best to explain the differences between the various installation methods, I thought it would be helpful to more fully explain their advantaged and disadvantages. The three methods are:
The New Postgres EraMonday, April 9, 2012
Having attended several conferences recently, I saw confirmation of my previous observation that Postgres is poised for a new wave of adoption. The last time I saw such an upturn in adoption was with the release of Postgres 8.0 in 2005, which included a native port of Postgres to Windows. You can see the increase in the volume of postings to the Postgres jobs email list. (The spike in January of 2008 was Sun buying MySQL.)
And that's not all — Robert Haas's recent blog post about Postgres scaling linearly to 64-cores in upcoming Postgres 9.2 means that, by the end of the year, Postgres will be a major contender on high-end hardware. We have always done well on small to medium-sized servers, but we are now poised to compete heavily on the high-end.
Postgres adoption is probably five years behind Linux's adoption. This video explains the increased popularity and development model used by Linux, which is similar to Postgres.
View or Post Comments
Software Patents: Why the Status Quo Continues?Friday, March 30, 2012
Most software developers agree that software patents are hopelessly broad and that violation is almost unavoidable for any non-trivial software project. I won't even go into examples of software patents causing industry chaos — new high-profile patent attacks appear every month. The risk of software patents to the software industry has been likened to a mine field, and for good reason — it is impossible to guess when software patents will create a crisis.
The Postgres team has been circumspect about software patents — we address them when someone brings up a possible violation (as happened in 2005 with a patent on queues), and we try to avoid patented ideas in our code, as you can see from our recent discussion about compression algorithms. Several years ago, one of the Postgres support companies analyzed our source code for possible patent violations, and reported we were clean, but given the "mine field" unpredictability of patents, it is hard to trust the report's accuracy.
If most people believe the software patent system is causing more harm than good, you would think it would be corrected? Well, it has been this way for at least a decade, and doesn't seem to be getting any better. In fact, the number of patent attacks, particularly by competitors, has increased. Even companies that aren't excited about patents (e.g. Google) are having to file or purchase patents as a way of defending themselves from competitors.
Why no improvement? This blog entry suggests a reason:
As a result, there's a deep and persistent rift between the community of computer programmers, who are overwhelmingly hostile to software patents, and patent lawyers who seem mystified by all the outrage. The job of a patent lawyer gives him a systematically skewed understanding of how the patent system affects the software industry.
Hopefully, someday, these two groups will get together and communicate in a way that finally improves software patents. Perhaps, if the number of software patents attacks continues to increase, their will be so much disruption in the software industry that reform will be demanded by everyone.
Still think the patent system can continue unchanged? Listen to the upsetting audio report, When Patents Attack! (transcript, summary), about people attacked by patent trolls. This is a different problem from competitors attacking with patents.
Postgres has been lucky to avoid attacks so far, and we are always vigilant, but in the current system, we can never relax because, in a mine field, you never know where the mines are buried.
Update: Good summary of the current patent situation
View or Post Comments
Report from PGDay AustinWednesday, March 28, 2012
We just completed PGDay Austin, and it was a great event. It was a single-track conference in a serene hotel — this gave the event a polish and cohesion that I have rarely felt at Postgres events. Everyone knew where they should be, nothing felt rushed, and the talks were all of high quality. I wouldn't be surprised to see double the attendance at next year's conference.
The most exciting talk for me was Will Leinweber's Schemaless SQL (down arrow key advances slides). The talk started by explaining why developers like the document storage model, and Will explained his experienced moving from document storage databases to Postgres. Will found our hstore extension to be ideal for storing the type of unstructured data he used to store in document databases. Hstore gave him the best of both relational and document database worlds.
What really excited me was the combination of our JSON data type coming in Postgres 9.2 and the use of the Javascript stored procedure language PLv8 — combining these gives you a data type (JSON) that has a stored procedure language designed to operate on that data type (Javascript). The linkage of data type and stored procedure language is something new for Postgres and suggests a powerful new direction for application programmers. The tools are not ready for production use yet, but I can see data type & language combinations being powerful.
One big highlight of the conference was the attendance of the SQL guru Joe Celko. Joe lives in Austin and kindly attended, asked questions during the talks, and even went out with us for drinks after the conference. Magnus Hagander, Greg Smith, and I are flying to Virginia tomorrow to attend PG Corridor Day DC, and then PGDay NYC on Monday.
View or Post Comments
The Importance of High-Quality DocumentationTuesday, March 27, 2012
Documentation quality can make or break an open source project. Don't believe me? Ever used poor documentation? How did it make you feel about your continued use of the software?
Just as Postgres's high code quality doesn't happen by accident, neither does its documentation quality. Our community is regularly updating and improving our documentation — since 1998, our documentation has averaged 1.3 commits per day. Don't believe me? See for yourself — that's 6913 documentation improvements over 14 years.
New features often require documentation improvements, but that's not all. We are constantly on the lookout for ways to improve our documentation. Here is an email report (with patch) from last Thursday indicating our documentation about numeric storage requirements was incorrect, and here is the commit on the same day fixing it. We are also on the lookout for people who are confused by our documentation, e.g. this commit clarifies behavior described in this bug report.
There are also more structured efforts to improve our documentation. A few years ago I started a project to systematically improve our documentation, which has led to significant stylistic improvements. PGCon, the undisputed Postgres hackers conference, this year is sponsoring a documentation improvement day as part of their conference.
So, as you can see, it does take work to make high-quality documentation. You can help by reporting anything that is unclear, needs improvement, or incorrect — the community will do its job of making sure your report gets the attention it deserves, so we can all continue to be proud, not only of the Postgres code, but of its documentation.
View or Post Comments
Postgres DocumentationFriday, March 23, 2012
The Postgres documentation is extensive — no question about that. In fact it is so extensive (2704 US Letter pages in PDF for Postgres 9.1.3) that it has discouraged the publication of commercial Postgres books.
The Postgres documentation is available in two popular formats, HTML and PDF. The HTML format is most prominently displayed on our web site (9.1 version). Thom Brown masterfully improved the layout of our HTML documentation to be more esthetically pleasing about 18 months ago, and we made page navigation improvements about six months ago.
The PDF version is great for screen viewing and printing. If you are using a PDF viewer that is only showing you the PDF pages, you are
missing out — sequentially scanning 2704 pages looking for something is frustrating. To make the PDF usable, you have to enable
the left side-pane in your PDF viewer, and choose index view. For example,
this is the page-only view,
this is the thumbnail view, and
this is the index view. The index view allows you to see all the
chapters and sections present in the PDF, and you can click on a title to see the referenced text. You can also open
index chapters to see sections and subsections. Without the index view,
you easily get lost in PDF files; with an index view, PDFs become better than physical printed pages. And if you prefer printed pages,
you can always give the PDF file to a local printer like
Staples and have them print the Postgres manuals for less
than USD $100. However, make sure you tell them it is acceptable to use multiple bindings — most printers are unable to
single-bind books this large.
Why You Should Care About LibpqWednesday, March 21, 2012
Why should you care about libpq? It sounds scary — not as scary as "PostgreSQL", but still scary.
Libpq is the C interface library for Postgres. Your reaction, "But I don't use C, I use
Python, PHP, etc., so libpq isn't useful to me?" Hold on! Don't overlook libpq so quickly.
Libpq provides a C interface for clients communicating with Postgres servers over a network (and via Unix domain sockets). It provides authentication handling, query submission, and query result processing, as well as other utility operations.
Why is it important for people programming in non-C languages? Well, the Perl, Python, Ruby, etc. interface to Postgres using an interface library (e.g. DBD::Pg, PyGreSQL) and those libraries need to communicate with Postgres servers too. It is inefficient for each library to maintain its own code to do network communication with Postgres, so they often use libpq internally. In fact, the only two popular client interface libraries that don't fully use libpq are the Postgres JDBC driver (because it is a Type 4 driver, and hence written in pure Java), and the Postgres ODBC driver, which mostly uses libpq but has its own network code to handle special ODBC requirements; all other interface libraries only use libpq.
So, what good is that to me, you might ask. Well, libpq has many features, covered in 70 pages of the Postgres manual. From those manual pages, you can learn more about libpq features that might be exposed by your interface library, like connection parameters and SSL support. Some features are enabled in all interfaces, like the use of environment variables and client password storage. What is nice about the last two libpq features is that they can be used without changing the application, so they are useful to users and developers.
So, as you can see, libpq is quite useful, even if you are not programming in C. Hopefully this blog entry has given you a few ideas of how libpq can make your programming life easier and more productive.
View or Post Comments
Open Source LeadershipMonday, March 19, 2012
Open source leadership might sound like an oxymoron — how can you have leadership when there is no organizational structure and no direct compensation for work performed? Well, let me ask a different question — how can a group of mostly volunteers create software that rivals that created by billion-dollar companies?
So, combining these questions, is it possible for volunteers to create great software without leadership? No, of course not, but with leadership, it is certainly possible, as Postgres proves. So, where is the leadership? How does it work, and how effective is it?
Certainly, this is a baffling question, not only to outsiders, but often to people who have worked in open source community for years. You might have a clue, but it is very difficult to articulate. For me, it was unlocked by hearing John Maxwell's Five Levels of Leadership (book). I have heard John several times at annual Chick-fil-A Leadercast conferences (next event, May 4), but last year he talked about leadership in a way I had never heard before — his ideas are captured in this video.
The lowest of John's five levels, position, does not exist in pure open source communities — even core status or committer rights have little bearing on leadership capabilities. What is striking about the five levels is that open source communities have to operate purely on the other four leadership levels, and while these levels are harder to accomplish, they produce teams that are much more productive and efficient. In fact, in the second part of his presentation, he states that the ability to lead volunteers is the true test of a leader.
The bottom line is that only great companies operate at the highest levels of leadership — our community is forced to operate at those levels, and that is part of the magic that makes us great.
View or Post Comments
Pg_Upgrade and Optimizer StatisticsFriday, March 16, 2012
My recent blog entry about pg_upgrade improvements in Postgres 9.2 prompted a community discussion about the time-consuming requirement of generating optimizer statistics once pg_upgrade completes. This requirement makes pg_upgrade less useful because some queries run slower while optimizer statistics are missing. The agreed solution was to have pg_upgrade create a script that generates increasingly more-accurate statistics — this allows the new cluster to generate minimal statistics in minutes, with full statistics taking perhaps an hour or more to complete.
Fortunately, the script created automatically by pg_upgrade in Postgres 9.2 works just fine in earlier versions of Postgres, so I am supplying links (Unix, Windows) for users who are currently using pg_upgrade. (Those upgrading from Postgres 8.3 will need to change the final "--analyze-only" to "--analyze"; pg_upgrade automatically handles this when it creates the script.) I hope this helps, and I am interested in feedback about the script's effectiveness.
View or Post Comments
Pg_Upgrade Improvements Coming in Postgres 9.2Monday, March 12, 2012
Only minor improvements are planned for pg_upgrade in Postgres 9.2, and that is a good thing — it indicates that the pg_upgrade code has reached maturity and is now considered a reliable way to perform rapid Postgres major upgrades. No additions are (yet) required to allow pg_upgrade to support all the new Postgres 9.2 features. A few features planned for the future are being designed to allow pg_upgrade to easily handle them.
The major improvement just
added to Postgres 9.2 is
improved logging, which will allow easier diagnosis of pg_upgrade failures. On a related note, I got a report from a happy tester that
pg_upgrade upgraded their 2.5 terrabyte database to Postgres 9.1 in seven minutes, using link mode.
Postgres Getting Increased AttentionFriday, March 9, 2012
Recently three people told me that they started seeing increased interest in Postgres 2-3 months ago. I have seen increased interest in
Postgres at conferences over the years, but this was described as different. Rather than interest from developers and open source people,
who are the usual conference attendees, the recent increase has been described as organizations wishing to switch large parts of their
infrastructure to Postgres. This has caused a huge spike in the demand for Postgres services and employment. A good example is a recent
blog post (with
video) of two Microsoft SQL Server DBAs expressing their excitement about Postgres. This
is good news for the project, and foreshadows much excitement to come.
Report from Montreal's ConFoo ConferenceThursday, March 1, 2012
I am bursting with news from Montreal's ConFoo conference, which has about 600 attendees. This is the first time I am attending, and I am told the first time there are Postgres-specific talks. I gave two Postgres talks yesterday (as I mentioned), and Ilia Alshanetsky gave a great Introduction to PostgreSQL talk (slides). The talk was smoothly presented and well received. I have seen many presentations about specific Postgres features, but few that give a general introduction to Postgres, particularly with an evangelical focus; in fact, the last great one I saw was in 2005.
There are some amazing talks at this conference, and the percentage of talks that are helpful to me is very high. I saw JVM Internals for Dummies, which explained how the OpenJDK HotSpot optimizer chooses to inline function calls and convert byte code to CPU instructions by using run-time profiling. This is similar to the discussion the Postgres community had about inlining some of our sort code for Postgres 9.2, except the JVM makes the decision at run-time using profiling, while we have to make the decision at compile time because we use C.
Today, there was an excellent talk about geolocation (slides) which had good things to say about Postgres and PostGIS, as well as PostGIS users OpenStreetMap and MapBox. There was also a good scaling talk (slides) about using cloud services, and Postgres got a good mention there, though the uneven performance of Amazon's Elastic Block Store was highlighted.
And, finally, the hotel is quite elegant.
Because of the French influence here, the meals, breads, pastries, and desserts are fantastic. Also, from my hotel room, I have a
picturesque view of a family of ducks living in the snow-covered hotel courtyard, enjoying a heated pond and small shelter
(image)--- it certainly is relaxing to watch — what a great idea.
CTE Presentation at ConFoo; Slides Now OnlineWednesday, February 29, 2012
Seven months ago, I wrote the presentation Programming the SQL Way with Common Table Expressions. Today, I was finally able to present it at the Montreal ConFoo conference, so I have placed the slides online.
My motivation for writing the talk is that, while I have seen several Postgres common table expression (CTE) talks, they started at too
advanced a level for me to understand how all the CTE parts worked together. My talk starts at a basic level and adds features slowly
until complex CTE queries are shown. The audience seemed to like the talk and had good questions, so
I will take that as a good sign.
Limiting Postgres ConnectionsFriday, February 24, 2012
During recent training, I realized there five ways to control who can access Postgres, and it isn't totally clear in the documentation how they fit together. There are basically a series of gates that users must pass through to connect to a Postgres database. I thought I would list them here, in the order they are applied:
Once all of these gates have been passed, authorization is now controlled by the GRANT and REVOKE commands.
View or Post Comments
Staying Close to HomeTuesday, February 21, 2012
Today I realized that I have nine events completed or scheduled for 2012, and
they are all in the eastern part of North America (except for Austin, Texas). This is exciting because I am
on track to cover the same number of conferences as last year, but without the
crazy travel schedule.
Not that I don't love going to distant lands, but it is
also nice to stay close to home for a while. This might become typical as the density of Postgres users increases.
Part 1 of Virtualizing Postgres Now OnlineFriday, February 17, 2012
I got many requests to hear my Virtualizing Postgres presentation, so I have placed the first part of my talk online. Part 1 is a virtualization primer. It took a lot of digging to find these details, and I have included many URLs for people who want more information. Unfortunately, this part doesn't cover many Postgres-specific aspects, which will be covered in parts 2 and 3.
A few people have asked where I am presenting this talk; I have no idea, but perhaps it will be chosen by the organizers of one of the upcoming USA one-day conferences.
View or Post Comments
The USA Is Planning a Triple PlayThursday, February 16, 2012
The United States is planning three one-day Postgres conferences in a one-week span:
I am excited to be attending all three events, along with Greg Smith and Magnus Hagander. I will also be doing training in Austin and New York City. (I am already doing training next week in the Reston area.)
View or Post Comments
Sorting Performance Improvement for Postgres 9.2Thursday, February 16, 2012
As part of the work on Postgres 9.2, yesterday Robert Haas applied a much-discussed (December, January, February) patch by Peter Geoghegan to improve sort performance. Having recently blogged about how work_mem affects sort performance, I thought I would re-run my tests to see the improvement:
Temporary Storage | Old | New -----------------------+--------+------- magnetic | 68 | 67 magnetic (writeback) | 50 | 48 SSD | 36 | 32 tmpfs | 30 | 27 none (high work_mem) | 12 | 9.5
As you can see, the new patch shaves about 3 seconds off of my test. That isn't a huge win for sorts that have to hit the disk, but for in-memory sorts (the last row), it yields a 20% improvement, which closely matches the ~25% reported by testers of the patch. This is one of the many sort improvements coming in Postgres 9.2.
View or Post Comments
Virtualizing PostgresThursday, February 9, 2012
Postgres is an ideal database to run in a virtual environment or public/private cloud — one reason is that Postgres relies heavily on the operating system, rather than using features like raw devices. Second, its license is obviously very flexible for virtual deployments.
I am often asked about running Postgres in virtual environments, and I usually answer that it runs just fine — and it does. However, I am starting to realize that I am not answering the more complex questions of which visualization technology to choose, and what is the performance and reliability impact of virtualization.
I have started writing a talk, Virtualizing Postgres, which will explore this. I had great trouble finding details of exactly how major virtualization approaches differ, how CPU-accelerated virtualization works, and how additional file system layers affect performance, and how these affect Postgres. I hope to share these details, and more, in my talk.
Once I present the talk at a conference or webcast, I will add the presentation to my website.
View or Post Comments
Let's See work_memThursday, February 2, 2012
Having shown memory allocation (and deallocation) in my previous blog post, I would like to show work_mem in action.
First, I ran the following database session using a ~6k RPM drive:
psql <<END SET log_temp_files = 0; SET client_min_messages = log; BEGIN WORK; DECLARE c1 CURSOR FOR SELECT random() FROM generate_series(1, 10000000) ORDER BY 1; \timing FETCH 1 FROM c1; \timing COMMIT; END
and got this output (I added commas to the allocation size, for clarity):
Time: 68238.315 ms LOG: temporary file: path "base/pgsql_tmp/pgsql_tmp11852.1", size 220,241,920 LOG: temporary file: path "base/pgsql_tmp/pgsql_tmp11852.0", size 140,000,000
As you can see, it uses 360MB for temporary sort files, and takes 68 seconds. I tried changing the magnetic disk's ext3 file system from the default data=ordered to data=writeback and saw the time drop to 50 seconds. (data=writeback is the recommended mount option for Postgres file systems). Running the same on an Intel 320 SSD took 36 seconds.
This server has 24GB of RAM, mostly unused, so I was surprised that there was a difference between magnetic and solid-state drives because there was no need to force anything to disk, except for file system crash recovery. This suggests that creating a symbolic link from PGDATA/base/pgsql_tmp to a memory-based file system (tmpfs) might be wise; I did that and the test took 30 seconds.
But, of course, another solution would be to increase work_mem so no temporary files are unnecessary, as in this test:
psql <<END SET log_temp_files = 0; SET client_min_messages = log; SET work_mem = '1GB'; BEGIN WORK; DECLARE c1 CURSOR FOR SELECT random() FROM generate_series(1, 10000000) ORDER BY 1; \timing FETCH 1 FROM c1; \timing COMMIT; END
which completes in 12 seconds:
Time: 12091.462 ms
Notice there is no report of temporary files being created.
So looking at the times:
Temporary Storage | Seconds -----------------------+-------- magnetic | 68 magnetic (writeback) | 50 SSD | 36 tmpfs | 30 none (high work_mem) | 12
it is clear that:
tmpfs is probably much slower than the high work_mem test because of the extra comparisons necessary to perform the sort in one-megabyte batches (the default work_mem setting is '1MB').
Frankly, I am confused why a system with so much free RAM (and free kernel cache) was delayed by file system writes. I realize the data eventually must be written, but why would it delay these queries? I wrote a small test program to write 360MB but it consistency took only 0.8 seconds. Sorry, but it looks like I am going to have to conclude this blog entry with an unanswered question.
View or Post Comments
Postgres Memory SurprisesWednesday, February 1, 2012
In my previous blog entry, I analyzed how various tools (ps and smem) report memory usage. In summary:
With these issues understood, let's look at a running Postgres cluster:
---------- ps --------- -------- smem -------- TRS DRS VSZ RSS USS PSS RSS CMD 5171 61288 66460 7952 5184 5667 7952 postmaster -i 5171 61288 66460 1280 136 296 1280 writer process 5171 61288 66460 1212 144 276 1212 wal writer process 5171 62128 67300 2348 784 1103 2348 autovacuum launcher process 5171 21112 26284 1224 196 329 1224 stats collector process 5171 62380 67552 4484 2560 3025 4484 postgres postgres [backend]
The first thing that stands out is the high DRS and VSZ fields, representing the large amount of shared memory allocated by the parent and shared by the children — 32MB of that is shared buffers, another 9MB is other System V shared memory allocation (as reported by ipcs), 8MB is for postmaster-accessed local memory, and the remainder is probably for shared libraries that are in the virtual address space but not accessed. It is also clear that the stats collector process is not attached to shared memory.
What is also interesting is that very little of that memory appears in the smem columns — this is because little of it has been accessed — it is mostly just in the virtual address space.
For my first memory demonstration, I wanted to show how work_mem affects memory allocation. For this, I needed a query that would allocate a lot of memory, but not return a lot of data, so I settled on this query:
SELECT random() FROM generate_series(1, 10000000) ORDER BY 1 LIMIT 1;
However, that doesn't allocate much memory at all, because of this optimization added in Postgres 8.3:
ORDER BY ... LIMIT can be done without sorting (Greg Stark)This is done by sequentially scanning the table and tracking just the top N candidate rows, rather than performing a full sort of the entire table. This is useful when there is no matching index and the LIMIT is not large.
So, then, I had to place a WHERE clause between the LIMIT and the ORDER BY to disable the optimization:
SELECT * FROM (SELECT random() FROM generate_series(1, 10000000) ORDER BY 1) AS f(x) WHERE x < 0 LIMIT 1;
(But, of course, with that WHERE clause, the LIMIT is unnecessary.)
Great — I am monitoring process memory usage and can see the memory growing — but wait, the query is done and memory usage is back to its original values — it isn't supposed to do that. In all the operating systems I have previously worked with, allocated memory isn't returned to the operating system, and the virtual address space certainly doesn't shrink. I quick web search turns up confirming evidence — but wait, there is a mention of large allocations, and studying the Debian malloc manual page explains the behavior:
Normally, malloc() allocates memory from the heap, and adjusts the size of the heap as required, using sbrk(2). When allocating blocks of memory larger than MMAP_THRESHOLD bytes, the glibc malloc() implementation allocates the memory as a private anonymous mapping using mmap(2). MMAP_THRESHOLD is 128 kB by default, but is adjustable using mallopt(3). Allocations performed using mmap(2) are unaffected by the RLIMIT_DATA resource limit (see getrlimit(2)).
So how do I show the memory allocated by work_mem if it disappears when the query completes? I decided to use a cursor, and therefore a simpler query:
SET work_mem = '1GB'; BEGIN WORK; DECLARE c1 CURSOR FOR SELECT random() FROM generate_series(1, 10000000) ORDER BY 1; FETCH 1 FROM c1; \! (ps -o pid,trs,drs,vsz,rss,cmd -U postgres; smem -U postgres) | sort COMMIT; \! (ps -o pid,trs,drs,vsz,rss,cmd -U postgres; smem -U postgres) | sort
That shows sizable memory allocation while the query is running:
5171 61288 66460 7960 5080 5617 7960 postmaster -i 5171 61288 66460 1352 136 332 1352 writer process 5171 61288 66460 1212 144 277 1212 wal writer process 5171 62128 67300 2380 808 1126 2380 autovacuum launcher process 5171 21112 26284 1224 196 330 1224 stats collector process 5171 1685256 1690428 1411712 1409716 1410203 1411712 postgres postgres
and a return to normal memory usage once the transaction commits and the cursor is closed:
5171 61288 66460 7960 5076 5615 7960 postmaster -i 5171 61288 66460 1356 136 334 1356 writer process 5171 61288 66460 1212 144 277 1212 wal writer process 5171 62128 67300 2380 808 1126 2380 autovacuum launcher process 5171 21112 26284 1224 196 330 1224 stats collector process 5171 63044 68216 5236 3240 3727 5236 postgres postgres
I thought I was going to educate people about Postgres memory usage, but I ended up learning quite a bit myself; I hope you did too.
View or Post Comments
Revisiting Memory ReportingMonday, January 30, 2012
Memory is very important to databases — much more so than for typical applications (presentation). Unfortunately, because memory allocation is so complex, it is often hard to figure out how physical RAM is being used. There are several reasons for the complexity:
Robert Haas's excellent blog post highlighted much uncertainty about how to analyze memory usage for specific processes, especially Postgres. I commented on his blog, as did others, and now have a much clearer idea of how to study memory usage. A blog post by Chris Siebenmann directly addresses some of my and Robert's questions, and suggests smem as a way to analyze memory, especially the sharing of memory. It was interesting to learn that smem was designed specifically to address the problems Robert outlined (2007, 2009).
To learn more, I could have studied the source code of how my particular operating system (Debian 6.0) reports memory usage, but that probably would not have helped others running different operating systems, or even a different version of the same operating system. Therefore, I wrote a tool, pg_memalloc.c, which allocates memory in various ways and reports the memory allocation values from ps and smem:
$ pg_memalloc -h
Usage: pg_memalloc [-12fhsw] [mb]
-1 allocate memory in one process
-2 allocate memory in two processes (parent, child)
-f fork a child process
-h help (or --help)
-s use shared memory, rather than malloc()
-w wait indefinitely after allocations
mb allocate megabyte of memory (default=1)
(It should compile on most Unix-like operating systems.) Below is a chart showing the output for various argument combinations:
---------- ps ---------- ---------- smem ---------
Line Shared Fork Access PID TRS DRS VSZ RSS Swap USS PSS RSS Notes ARGS
1 1659 4 3711 3716 456 0 84 86 456 no memory 0
2 1673 4 4739 4744 476 0 92 94 476 default
3 * 1686 4 4735 4740 452 0 88 90 452 -s
4 * 1698 4 4739 4744 480 0 36 66 480 parent -f
5 * 1699 4 4739 4744 168 0 36 64 168 child -f
6 * * 1710 4 4735 4740 456 0 24 58 456 parent -f -s
7 * 1711 4 4735 4740 160 0 24 56 160 child -f -s
8 1 1722 4 4739 4744 1504 0 1116 1118 1504 -1
9 * 1 1733 4 4735 4740 1480 0 1112 1114 1480 -s -1
10 * 1 1745 4 4739 4744 1508 0 1064 1092 1508 parent -f -1
11 * 1 1746 4 4739 4744 168 0 40 66 168 child -f -1
12 * * 1 1818 4 4735 4740 1480 0 1052 1084 1480 parent -f -s -1
13 * * 1 1819 4 4735 4740 160 0 28 58 160 child -f -s -1
14 * 2 1780 4 4739 4744 1508 0 1064 1092 1508 parent -f -2
15 * 2 1781 4 4739 4744 1188 0 1060 1086 1188 child -f -2
16 * * 2 1849 4 4735 4740 1480 0 28 572 1480 parent -f -s -2
17 * * 2 1850 4 4735 4740 1184 0 28 570 1184 child -f -s -2
Line 1 represents the default output for zero allocated memory, so it is our baseline. Line 2 represents the default allocation of one megabyte. Notice that the data size (DRS) and virtual memory size (VSZ) increase by one megabyte, and in fact change very little for subsequent arguments. Similarly, the code/text size (TRS) and Swap remain unchanged.
The interesting values are the process-specific memory allocation (USS), proportional memory allocation (PSS), and resident size (RSS), which are all shown by smem.
(FYI, though TRS and DRS suggest resident size ("RS"), they are really reporting virtual size for specific memory segments; this can be proven because TRS + DRS is very close to VSZ, and clearly larger than RSS.)
Line 3 shows that shared memory in a single process has little effect on the reported memory usage.
Line 4 is the first case where subprocess is involved; the interesting change here is that the process-specific memory allocation (USS) is lower for the parent and child because the combined proportional memory allocation (PSS) is greater. This shows that fact that the parent and child are sharing memory pages. This sharing is happening as part of fork(), rather than using shared memory. Keep in mind none of the one megabyte of allocated memory has been accessed yet, so it doesn't show up in these columns. These numbers are really reflecting the executable and its dependent shared libraries. The child is also quite low because the parent has done much of the necessary memory access during startup.
Line 8 is the first time we actually access the one megabyte of allocated memory, and the numbers clearly reflect that. This highlights that you really need to access memory for it to show up, i.e. allocating it and not accessing it does not cause it to use physical RAM, though it does affect the virtual memory size.
Line 9 shows that using shared memory instead has little effect on the numbers.
Lines 10 and 11 are where things really get interesting — we are creating a child process, but accessing memory only in the parent. The parent process takes the memory hit, and the child process looks similar to line 5 where memory was not accessed. Lines 12 and 13 again show that allocating shared memory has little effect.
Lines 14 and 15 show that a parent and child both accessing one megabyte of memory each bear the full weight of the access.
Lines 16 and 17 are perhaps the most interesting — because shared memory is involved, there is only one copy of the one megabyte in RAM, and the two processes split the cost of the access in PSS, e.g. 572 and 570.
The output basically confirms what Chris Siebenmann reported. In a follow-up posting, I plan to show these same fields for a running Postgres server.
View or Post Comments
NoSQL Databases as the New Scripting LanguageFriday, January 27, 2012
This video presentation by Dwight Merriman (MongoDB) at OSCON Data 2011 explores the role of NoSQL databases. He makes some very interesting points:
What I found most interesting about his presentation was the comparison to dynamically-typed languages. When dynamically-typed scripting languages started to become popular years ago, there was a sense that they were only for trivial applications, compared to compiled language that were for serious work. As CPUs became faster, scripting languages were increasingly used for production applications, particularly web applications that are modified frequently. There might be a similar pattern now in the (improper?) dismissal of NoSQL for serious applications.
Frankly, I think the biggest limitation of NoSQL is the need to do more complex processing in application code because the powerful SQL language is not available. Of course, the lack of transaction semantics and joins are also serious limitations for many workloads. Having NoSQL map to JSON has benefits if your web application uses JSON, but it hard to replace the powerful SQL features (e.g. GROUP BY) in an application. (Postgres is working on a native JSON storage data type.) Even performance on single node isn't a reason to use NoSQL since Postgres's UNLOGGED tables gives similar performance.
So, the big benefit for NoSQL is horizontal, read-write scaling and dynamic schemas. The big question for users considering NoSQL is whether these features are worth the loss of traditional relational database capabilities. Certainly, for some users, it is worth it — I think the question many relational database users are wondering is which current relational use-cases are better with NoSQL.
I hope this blog entry has helped answer that question. Josh Berkus has an excellent feature article that goes into the complexities of NoSQL database selection, and Gavin Roy has an interesting presentation on the topic.
Update: This article warns about using NoSQL in the wrong places.
View or Post Comments
Scalability What?Thursday, January 26, 2012
The term "scalability" gets injected into database conversations often, but scalability discussions often end without a clear consensus. I believe this is because "scalability", like "replication", actually means different things to different people.
There are two dimensions to database scalability; the first dimension specifies if additional servers are involved:
Think of "vertical" as the server getting larger (taller), and horizontal as having more servers on the floor (wider). The second dimension specifies the type of workload to be scaled:
Keep in mind, the issue is not whether the server workload is read-only or read-write, but rather whether you need to scale writes as well as reads. There are tools (e.g. pgpool-II, see below) that can split out read-only queries for scaling while allowing read-write queries to go to a single master server.
With these terms defined, let's see how they are handled by Postgres. Vertical scalability is handled very well by Postgres for both read-only and read-write workloads, and improvements are coming in Postgres 9.2. I think the only area we don't scale well in is using multiple CPUs for a single query, and I have already blogged about that limitation.
Horizontal scaling is more complex because the server interconnect speed, even using InfiniBand, is much slower than interprocess communication via shared memory. For read-only workloads, a combination of pgpool-II and streaming replication allows for simple and powerful read-only load balancing across multiple servers, e.g. horizontal scaling.
What is hard to do is read-write horizontal scaling, because writes to different servers can conflict with each other, and communication between servers is relatively slow. One possible solution is Bucardo, which does not prevent conflicts but rather provides a mechanism for conflicting committed transactions to be ignored. The Postgres-XC team is working on a more sophisticated read-write, horizontal scaling system that allows conflicts to be detected before commit, but the software is not production-ready.
So, the next time you talk about scalability, it might be helpful to explain exactly what type of scaling you need — it will help to make the discussion a lot more fruitful.
Update: One other alternative for horizontal, read-write scaling is sharding, such as with PL/Proxy (details). This basically avoids write conflicts by spreading data across multiple servers. 2012-01-26
View or Post Comments
Increasing Database ReliabilityWednesday, January 25, 2012
While database software can be the cause of outages, for Postgres, it is often not the software but the hardware that causes failures — and storage is often the failing component. Magnetic disk is one of the few moving parts on a computer, and hence prone to breakage, and solid-state drives (SSDs) have a finite write limit.
While waiting for storage to start making loud noises or fail is an option, a better option is to use some type of monitoring that warns of storage failure before it occurs, e.g. enter SMART. SMART is a system developed by storage vendors that allows the operating system to query diagnostics on the drive and warn of unusual storage behavior before failure occurs. While read/write failures are reported by the kernel, SMART parameters often warn of danger before failure occurs. Below is the SMART output from a Western Digital (WDC) WD20EARX magnetic disk drive:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE 1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 0 3 Spin_Up_Time 0x0027 174 174 021 Pre-fail Always - 6300 4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 11 5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0 7 Seek_Error_Rate 0x002e 100 253 000 Old_age Always - 0 9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 145 10 Spin_Retry_Count 0x0032 100 253 000 Old_age Always - 0 11 Calibration_Retry_Count 0x0032 100 253 000 Old_age Always - 0 12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 9 192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age Always - 8 193 Load_Cycle_Count 0x0032 200 200 000 Old_age Always - 543 194 Temperature_Celsius 0x0022 127 119 000 Old_age Always - 23 196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always - 0 197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 0 198 Offline_Uncorrectable 0x0030 100 253 000 Old_age Offline - 0 199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 0 200 Multi_Zone_Error_Rate 0x0008 100 253 000 Old_age Offline - 0
(These charts were generated using smartmon.) The VALUE column ranges from 0-200, with higher values being better. The RAW_VALUE column is more difficult to interpret because in some cases high is good, and in others, it is bad (e.g. temperature). The next chart is from an Intel 320 Series SSD:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE 3 Spin_Up_Time 0x0020 100 100 000 Old_age Offline - 0 4 Start_Stop_Count 0x0030 100 100 000 Old_age Offline - 0 5 Reallocated_Sector_Ct 0x0032 100 100 000 Old_age Always - 0 9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 1120 12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 38 170 Unknown_Attribute 0x0033 100 100 010 Pre-fail Always - 0 171 Unknown_Attribute 0x0032 100 100 000 Old_age Always - 0 172 Unknown_Attribute 0x0032 100 100 000 Old_age Always - 0 183 Runtime_Bad_Block 0x0030 100 100 000 Old_age Offline - 0 184 End-to-End_Error 0x0032 100 100 090 Old_age Always - 0 187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0 192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 19 199 UDMA_CRC_Error_Count 0x0030 100 100 000 Old_age Offline - 0 225 Load_Cycle_Count 0x0032 100 100 000 Old_age Always - 177828 226 Load-in_Time 0x0032 100 100 000 Old_age Always - 210631 227 Torq-amp_Count 0x0032 100 100 000 Old_age Always - 0 228 Power-off_Retract_Count 0x0032 100 100 000 Old_age Always - 4261545 232 Available_Reservd_Space 0x0033 100 100 010 Pre-fail Always - 0 233 Media_Wearout_Indicator 0x0032 100 100 000 Old_age Always - 0 241 Total_LBAs_Written 0x0032 100 100 000 Old_age Always - 177828 242 Total_LBAs_Read 0x0032 100 100 000 Old_age Always - 96042
Notice there is no temperature value (probably unnecessary), but it has additional values like Media_Wearout_Indicator, which indicates the wear level of the memory chips. It shows a wear value of "100", indicating no measurable wear.
SMART also records all errors generated by the drive, and allows drive self-tests to be run, independent of the operating system. This is helpful in isolating error causes.
While you could manually check the SMART values of your drives periodically, there are tools, e.g. smartd, that can automatically query drives and warn administrators of potential problems, e.g. every 30 minutes. System administrators that want to maximize uptime should always use such available tools to get failure warnings before they happen.
View or Post Comments
The Most Important Postgres CPU InstructionTuesday, January 24, 2012
Postgres consists of roughly 1.1 million lines of C code, which is compiled into an executable with millions of CPU instructions. Of the many CPU machine-language instructions in the Postgres server executable, which one is the most important? That might seem like an odd question, and one that is hard to answer, but I think I know the answer.
You might wonder, "If Postgres is written in C, how would we find the most important machine-language instruction?" Well, there is a trick to that. Postgres is not completely written in C. There is a very small file (1000 lines) with C code that adds specific assembly-language CPU instructions into the executable. This file is called s_lock.h. It is an include file that is referenced in various parts of the server code that allows very fast locking operations. The C language doesn't supply fast-locking infrastructure, so Postgres is required to supply its own locking instructions for all twelve supported CPU architectures. (Operating system kernels do supply locking instructions, but they are much too slow to be used for Postgres.)
The specific locking operation supported by the CPU is called test-and-set. It allows a value to be stored in a memory location and, at the same time, the old value to be returned. This allows for two sessions to simultaneously request a lock and for only one session to successfully acquire the lock. Spinlocks and other more complex locks are built upon test-and-set; see my Inside PostgreSQL Shared Memory presentation for more details. this.
I have copied a small part of s_lock.h to show this important instruction on x86 architectures:
1 static __inline__ int
2 tas(volatile slock_t lock)
3 {
4 register slock_t _res = 1;
5
6 __asm__ __volatile__(
7 " cmpb $0,%1 \n"
8 " jne 1f \n"
9 " lock \n"
10 " xchgb %0,%1 \n"
11 "1: \n"
12 : "+q"(_res), "+m"(lock)
13 :
14 : "memory", "cc");
15 return (int) _res;
16 }
The first line identifies the function as returning an integer ("int"), and "__inline__" causes the the function body to likely be placed at the function reference site, rather than generating a function call and storing the function body someplace else. Line 2 names the function as "tas" (for Test-And-Set), and indicates it takes an "slock_t" memory address ("lock"). Line 4 defines a local variable "_res" which will hold our initial lock value (1), and the function return value. Line 6 starts a block of assembly-language code in GCC, with the code on lines lines 7-11. Line 12 indicates that the assembly language uses two input/output parameters: "_res" (%0) and "*lock" (%1). Lines 7 and 8 check to see if the lock is already non-zero, indicating the lock is already held. If so, it jumps to the end at line 11 ("1:"). Line 9 locks the memory address referenced by the next instruction. Line 10 is the big one, the most important assembly language instruction in the Postgres server executable: "xchgb". It does a test-and-set or exchange of the "_res" (local register) and "*lock" (shared memory) values, swapping them in an atomic manner. Line 14 indicates that memory and the condition code register are modified by the assembly language. Line 15 returns zero for success, and one for failure, which can occur if the test on line 7 or the exchange on line 10 found the lock was already held.
So, there is it is — the most important CPU instruction, at last for x86 CPUs. If you look through s_lock.h, you will find similar test-and-set instructions all the other supported CPU architectures.
View or Post Comments
More Lessons From My Server MigrationMonday, January 23, 2012
The new server is 2-10 times faster than my old 2003 server, but that 10x speedup is only possible for applications that:
Other observations:
Hopefully I can translate some of these lessons into Postgres improvements in the coming years. With this new server, and my recent
upgrade to fiber-based Internet, my home feels like a data center.
New ServerFriday, January 20, 2012
A few weeks ago, I finally replaced my eight-year-old home server. The age of my server, and its operating system, (BSD/OS, last officially updated in 2002) were a frequent source of amusement among Postgres community members. The new server is:
Super Micro 7046A-T 4U Tower Workstation 2 x Intel Xeon E5620 2.4GHz Quad-Core Processors 24GB Crucial Dual-Rank PC3-10600 DDR3 SDRAM 160GB Intel 320 Series SSD Drive 4 x Western Digital Caviar Green 2TB Hard Drives
The server replacement took about 100 hours of my time over four months, from specification, testing, and configuration. (Many thanks to Greg Smith for helping select and test the server.) Getting Postgres working on the new Debian server was trivial — even the historically difficult SGML documentation build process was easy. The server required a lot of configuration because it has twenty years of Unix customizations that do family calendaring, contacts, home automation, telephone, typesetting, web serving, email, and family media serving.
I learned a few things related to Postgres adoption in the process. We all have friends who are using databases other than Postgres. They often would like to use Postgres, but need a reason to switch. Sometimes cost, features, or performance can motivate such a switch, but without those, it is hard to recommend a time-consuming migration to Postgres. However, eventually, all hardware and software has to be upgraded, and when that time comes, I looked to friends with experience for advice. People considering upgrading their databases will do the same — we just need to be ready to give advice when the time comes.
View or Post Comments
Toast QueriesThursday, January 19, 2012
As a followup to my previous blog entry, I want to show queries that allow users to analyze toast tables. First, we find the toast details about the test heap table:
SELECT oid, relname, reltoastrelid, reltoastidxid FROM pg_class where relname = 'test'; oid | relname | reltoastrelid | reltoastidxid -------+---------+---------------+--------------- 17172 | test | 17175 | 0
Notice pg_class has two toast-related fields, reltoastrelid and reltoastidxid. This heap table references the toast table 17175:
SELECT oid, relname, reltoastrelid, reltoastidxid FROM pg_class where oid = 17175; oid | relname | reltoastrelid | reltoastidxid -------+----------------+---------------+--------------- 17175 | pg_toast_17172 | 0 | 17177
Notice the toast table name contains the oid of the heap table (17172). The toast table references its index, 17177:
SELECT oid, relname, reltoastrelid, reltoastidxid FROM pg_class where oid = 17177; oid | relname | reltoastrelid | reltoastidxid -------+----------------------+---------------+--------------- 17177 | pg_toast_17172_index | 0 | 0
You can actually look inside the toast table:
\d+ pg_toast.pg_toast_17172;
TOAST table "pg_toast.pg_toast_17172"
Column | Type | Storage
------------+---------+---------
chunk_id | oid | plain
chunk_seq | integer | plain
chunk_data | bytea | plain
SELECT * FROM pg_toast.pg_toast_17172;
chunk_id | chunk_seq | chunk_data
----------+-----------+--------------------------------------------------------------------
17206 | 0 |\x40420f00fe410f01ff0f01ff0f01ff0f01ff0f01ff0f01ffff0f01ff0f01ff0f.
| |.01ff0f01ff0f01ff0f01ff0f01ffff0f01ff0f01ff0f01ff0f01ff0f01ff0f01f.
| |.f0f01ff0f01ffff0f01ff0f01ff0f01ff0f01ff0f01ff0f01ff0f01ffff0f01ff.
…
Of course, this is the internal, compressed representation. The chunk_id is the value stored in the main heap table to reference the proper long toast value. The data is stored in the chunk_data field in 8k chunks, tracked by the chunk_seq.
Using ALTER TABLE and doing another insert, it is possible to see the values in their uncompressed, hex format:
ALTER TABLE test ALTER COLUMN y SET STORAGE EXTERNAL;
INSERT INTO test SELECT 1, string_agg(x, '')
FROM (
SELECT 'A'::text
FROM generate_series(1, 1000000)
) AS f(x);
SELECT * FROM pg_toast.pg_toast_17172;
— insert more 'A' data
…
chunk_id | chunk_seq | chunk_data
----------+-----------+---------------------------------------------------------------------
17207 | 0 |\x414141414141414141414141414141414141414141414141414141414141414141.
| |.4141414141414141414141414141414141414141414141414141414141414141414.
| |.1414141414141414141414141414141414141414141414141414141414141414141.
41 is the hex code for A, so with compression turned off, the toast table clearly shows the storage of the one-million A's.
View or Post Comments
Toast-y GoodnessTuesday, January 17, 2012
Many things are better toasted: cheese sandwiches, nuts, marshmallows. Even some of your Postgres data is better toasted — let me explain.
Postgres typically uses an eight-kilobyte block size — you can verify this by running pg_controldata:
Database block size: 8192
If that is true, how does Postgres support a field limit of one gigabyte, as mentioned in the Postgres FAQ?
What is the maximum size for a row, a table, and a database? … Maximum size for a field? 1 GB
One solution would be to allow rows to span multiple 8k blocks, but this would introduce source code complexity and performance problems. A better solution, implemented in Postgres since 2001, is TOAST (The Oversized-Attribute Storage Technique). Toast uses a backup table to store a row's long values. Its goal is to move enough row values into toast tables so the row length is less than about 2 kilobytes. You can read the details in tuptoaster.c.
Toast tables are actually hard to find — they don't show up when looking at the table via psql, even in verbose mode:
\d+ test
Table "public.test"
Column | Type | Modifiers | Storage | Stats target | Description
--------+---------+-----------+----------+--------------+-------------
x | integer | | plain | |
y | text | | extended | | Has OIDs: no
The only hint of their existence is the Storage column. Column x is a short, fix-length column, and hence is not considered for toast storage (marked plain). Column y is a "TEXT" column and can store very long values and hence is considered for toast storage. The possible storage values are listed in the TOAST documentation section and in the ALTER TABLE manual page. The options control whether data is compressed, and whether it is considered for toast storage. I use the word "considered" here because values are not unconditionally forced into compression or toast storage — they are only considered for such storage if the row length is long enough.
The toast system has several advantages:
You can see the effect of toast in the following queries:
INSERT INTO test SELECT 1, string_agg(x, '')
FROM (
SELECT 'A'::text
FROM generate_series(1, 1000000)
) AS f(x);
SELECT pg_table_size('test');
pg_table_size
---------------
65536
Notice that though one-million characters were inserted into the table, compression reduced that to 64k, made up of eight mostly-empty 8k pages: one heap page, two toast pages, three toast free space map (fsm) pages, two toast index pages. We can use pg_relation_size() to see each piece separately:
SELECT pg_relation_size('test');
pg_relation_size
------------------
8192
SELECT pg_relation_size('pg_toast.pg_toast_17172');
pg_relation_size
------------------
16384
SELECT pg_relation_size('pg_toast.pg_toast_17172', 'fsm');
pg_relation_size
------------------
24576
SELECT pg_relation_size('pg_toast.pg_toast_17172_index');
pg_relation_size
------------------
16384
Inserting random characters, which can't be easily compressed, does show the table size increasing by one megabyte, with most of that storage being consumed by the toast table, as expected:
<PRE> INSERT INTO test SELECT 1, string_agg(x, '') FROM (SELECT chr(ascii('A') + (random() * 25)::integer) FROM generate_series(1,1000000) ) AS f(x);
SELECT pg_table_size('test'); pg_table_size View or Post Comments
1105920
SELECT pg_relation_size('pg_toast.pg_toast_17172'); pg_relation_size View or Post Comments
1040384 </PRE>
As you can see, toast works transparently. In fact, it works so well that the Postgres community hasn't even found it necessary to provide tooling to study toast behavior. I hope this blog entry has provided interesting details on how Postgres uses toast to store long values transparently.
View or Post Comments
Coming to Boston Monday, January 16, 2012
Speaking of presentations, I get to use my updated presentations this Thursday when I speak to the Boston PostgreSQL Users Group. Then, on Friday, I get to use more of them when I do training at EnterpriseDB's headquarters.
View or Post Comments
Presentations UpdatedMonday, January 16, 2012
As part of my server upgrade, I migrated to a newer version of LyX and LaTeX, but most significantly, to a more modern and powerful LaTeX document class, Beamer. All 1300 slides in my presentations have been updated. If you see something that needs improvement, no matter how minor, please let please know via email, chat, or blog comment.
View or Post Comments