This blog is about my work on the Postgres open source database, and is also published on Planet PostgreSQL.
Online status: away
Unread Postgres emails:
7317
Email graphs:
incoming,
outgoing, unread, commits, conferences
Let's See work_mem
Thursday, February 2, 2012
Having shown memory allocation (and deallocation) in my previous blog post, I would like to show work_mem in action.
First, I ran the following database session using a ~6k RPM drive:
psql <<END SET log_temp_files = 0; SET client_min_messages = log; BEGIN WORK; DECLARE c1 CURSOR FOR SELECT random() FROM generate_series(1, 10000000) ORDER BY 1; \timing FETCH 1 FROM c1; \timing COMMIT; END
and got this output (I added commas to the allocation size, for clarity):
Time: 68238.315 ms LOG: temporary file: path "base/pgsql_tmp/pgsql_tmp11852.1", size 220,241,920 LOG: temporary file: path "base/pgsql_tmp/pgsql_tmp11852.0", size 140,000,000
As you can see, it uses 360MB for temporary sort files, and takes 68 seconds. I tried changing the magnetic disk's ext3 file system from the default data=ordered to data=writeback and saw the time drop to 50 seconds. (data=writeback is the recommended mount option for Postgres file systems). Running the same on an Intel 320 SSD took 36 seconds.
This server has 24GB of RAM, mostly unused, so I was surprised that there was a difference between magnetic and solid-state drives because there was no need to force anything to disk, except for file system crash recovery. This suggests that creating a symbolic link from PGDATA/base/pgsql_tmp to a memory-based file system (tmpfs) might be wise; I did that and the test took 30 seconds.
But, of course, another solution would be to increase work_mem so no temporary files are unnecessary, as in this test:
psql <<END SET log_temp_files = 0; SET client_min_messages = log; SET work_mem = '1GB'; BEGIN WORK; DECLARE c1 CURSOR FOR SELECT random() FROM generate_series(1, 10000000) ORDER BY 1; \timing FETCH 1 FROM c1; \timing COMMIT; END
which completes in 12 seconds:
Time: 12091.462 ms
Notice there is no report of temporary files being created.
So looking at the times:
Temporary Storage | Seconds -----------------------+-------- magnetic | 68 magnetic (writeback) | 50 SSD | 36 tmpfs | 30 none (high work_mem) | 12
it is clear that:
- data=writeback helps when using temporary files
- SSDs are faster, similar to tmpfs
- Using no temporary storage is even better
tmpfs is probably much slower than the high work_mem test because of the extra comparisons necessary to perform the sort in one-megabyte batches (the default work_mem setting is '1MB').
Frankly, I am confused why a system with so much free RAM (and free kernel cache) was delayed by file system writes. I realize the data eventually must be written, but why would it delay these queries? I wrote a small test program to write 360MB but it consistency took only 0.8 seconds. Sorry, but it looks like I am going to have to conclude this blog entry with an unanswered question.
Postgres Memory Surprises
Wednesday, February 1, 2012
In my previous blog entry, I analyzed how various tools (ps and smem) report memory usage. In summary:
- ps columns TRS, DRS, and VSZ report virtual address space allocated, not actual RAM allocated.
- smem's USS reports a process's private (unshared) memory allocated.
- smem's PSS is a sum of process's private memory allocated and a proportional amount of shared memory (both System V shared memory, like Postgres's shared_buffers, and shared libraries).
- RSS shows actual RAM allocated, private and shared.
With these issues understood, let's look at a running Postgres cluster:
---------- ps --------- -------- smem -------- TRS DRS VSZ RSS USS PSS RSS CMD 5171 61288 66460 7952 5184 5667 7952 postmaster -i 5171 61288 66460 1280 136 296 1280 writer process 5171 61288 66460 1212 144 276 1212 wal writer process 5171 62128 67300 2348 784 1103 2348 autovacuum launcher process 5171 21112 26284 1224 196 329 1224 stats collector process 5171 62380 67552 4484 2560 3025 4484 postgres postgres [backend]
The first thing that stands out is the high DRS and VSZ fields, representing the large amount of shared memory allocated by the parent and shared by the children — 32MB of that is shared buffers, another 9MB is other System V shared memory allocation (as reported by ipcs), 8MB is for postmaster-accessed local memory, and the remainder is probably for shared libraries that are in the virtual address space but not accessed. It is also clear that the stats collector process is not attached to shared memory.
What is also interesting is that very little of that memory appears in the smem columns — this is because little of it has been accessed — it is mostly just in the virtual address space.
For my first memory demonstration, I wanted to show how work_mem affects memory allocation. For this, I needed a query that would allocate a lot of memory, but not return a lot of data, so I settled on this query:
SELECT random() FROM generate_series(1, 10000000) ORDER BY 1 LIMIT 1;
However, that doesn't allocate much memory at all, because of this optimization added in Postgres 8.3:
ORDER BY ... LIMIT can be done without sorting (Greg Stark)This is done by sequentially scanning the table and tracking just the top N candidate rows, rather than performing a full sort of the entire table. This is useful when there is no matching index and the LIMIT is not large.
So, then, I had to place a WHERE clause between the LIMIT and the ORDER BY to disable the optimization:
SELECT * FROM (SELECT random() FROM generate_series(1, 10000000) ORDER BY 1) AS f(x) WHERE x < 0 LIMIT 1;
(But, of course, with that WHERE clause, the LIMIT is unnecessary.)
Great — I am monitoring process memory usage and can see the memory growing — but wait, the query is done and memory usage is back to its original values — it isn't supposed to do that. In all the operating systems I have previously worked with, allocated memory isn't returned to the operating system, and the virtual address space certainly doesn't shrink. I quick web search turns up confirming evidence — but wait, there is a mention of large allocations, and studying the Debian malloc manual page explains the behavior:
Normally, malloc() allocates memory from the heap, and adjusts the size of the heap as required, using sbrk(2). When allocating blocks of memory larger than MMAP_THRESHOLD bytes, the glibc malloc() implementation allocates the memory as a private anonymous mapping using mmap(2). MMAP_THRESHOLD is 128 kB by default, but is adjustable using mallopt(3). Allocations performed using mmap(2) are unaffected by the RLIMIT_DATA resource limit (see getrlimit(2)).
So how do I show the memory allocated by work_mem if it disappears when the query completes? I decided to use a cursor, and therefore a simpler query:
SET work_mem = '1GB'; BEGIN WORK; DECLARE c1 CURSOR FOR SELECT random() FROM generate_series(1, 10000000) ORDER BY 1; FETCH 1 FROM c1; \! (ps -o pid,trs,drs,vsz,rss,cmd -U postgres; smem -U postgres) | sort COMMIT; \! (ps -o pid,trs,drs,vsz,rss,cmd -U postgres; smem -U postgres) | sort
That shows sizable memory allocation while the query is running:
5171 61288 66460 7960 5080 5617 7960 postmaster -i 5171 61288 66460 1352 136 332 1352 writer process 5171 61288 66460 1212 144 277 1212 wal writer process 5171 62128 67300 2380 808 1126 2380 autovacuum launcher process 5171 21112 26284 1224 196 330 1224 stats collector process 5171 1685256 1690428 1411712 1409716 1410203 1411712 postgres postgres
and a return to normal memory usage once the transaction commits and the cursor is closed:
5171 61288 66460 7960 5076 5615 7960 postmaster -i 5171 61288 66460 1356 136 334 1356 writer process 5171 61288 66460 1212 144 277 1212 wal writer process 5171 62128 67300 2380 808 1126 2380 autovacuum launcher process 5171 21112 26284 1224 196 330 1224 stats collector process 5171 63044 68216 5236 3240 3727 5236 postgres postgres
I thought I was going to educate people about Postgres memory usage, but I ended up learning quite a bit myself; I hope you did too.
Revisiting Memory Reporting
Monday, January 30, 2012
Memory is very important to databases — much more so than for typical applications (presentation). Unfortunately, because memory allocation is so complex, it is often hard to figure out how physical RAM is being used. There are several reasons for the complexity:
- Virtual Memory: CPUs in virtual memory mode don't access RAM directly, but rather through page tables.
- Segmentation: Memory is allocated in specific segments: text(code), data, and stack.
- Sharing: physical RAM is often shared by multiple processes, either in read-only mode (program instructions), shared mode (read/write of share memory), or copy-on-write (create a new copy on write; used by fork).
Robert Haas's excellent blog post highlighted much uncertainty about how to analyze memory usage for specific processes, especially Postgres. I commented on his blog, as did others, and now have a much clearer idea of how to study memory usage. A blog post by Chris Siebenmann directly addresses some of my and Robert's questions, and suggests smem as a way to analyze memory, especially the sharing of memory. It was interesting to learn that smem was designed specifically to address the problems Robert outlined (2007, 2009).
To learn more, I could have studied the source code of how my particular operating system (Debian 6.0) reports memory usage, but that probably would not have helped others running different operating systems, or even a different version of the same operating system. Therefore, I wrote a tool, pg_memalloc.c, which allocates memory in various ways and reports the memory allocation values from ps and smem:
$ pg_memalloc -h
Usage: pg_memalloc [-12fhsw] [mb]
-1 allocate memory in one process
-2 allocate memory in two processes (parent, child)
-f fork a child process
-h help (or --help)
-s use shared memory, rather than malloc()
-w wait indefinitely after allocations
mb allocate megabyte of memory (default=1)
(It should compile on most Unix-like operating systems.) Below is a chart showing the output for various argument combinations:
---------- ps ---------- ---------- smem ---------
Line Shared Fork Access PID TRS DRS VSZ RSS Swap USS PSS RSS Notes ARGS
1 1659 4 3711 3716 456 0 84 86 456 no memory 0
2 1673 4 4739 4744 476 0 92 94 476 default
3 * 1686 4 4735 4740 452 0 88 90 452 -s
4 * 1698 4 4739 4744 480 0 36 66 480 parent -f
5 * 1699 4 4739 4744 168 0 36 64 168 child -f
6 * * 1710 4 4735 4740 456 0 24 58 456 parent -f -s
7 * 1711 4 4735 4740 160 0 24 56 160 child -f -s
8 1 1722 4 4739 4744 1504 0 1116 1118 1504 -1
9 * 1 1733 4 4735 4740 1480 0 1112 1114 1480 -s -1
10 * 1 1745 4 4739 4744 1508 0 1064 1092 1508 parent -f -1
11 * 1 1746 4 4739 4744 168 0 40 66 168 child -f -1
12 * * 1 1818 4 4735 4740 1480 0 1052 1084 1480 parent -f -s -1
13 * * 1 1819 4 4735 4740 160 0 28 58 160 child -f -s -1
14 * 2 1780 4 4739 4744 1508 0 1064 1092 1508 parent -f -2
15 * 2 1781 4 4739 4744 1188 0 1060 1086 1188 child -f -2
16 * * 2 1849 4 4735 4740 1480 0 28 572 1480 parent -f -s -2
17 * * 2 1850 4 4735 4740 1184 0 28 570 1184 child -f -s -2
Line 1 represents the default output for zero allocated memory, so it is our baseline. Line 2 represents the default allocation of one megabyte. Notice that the data size (DRS) and virtual memory size (VSZ) increase by one megabyte, and in fact change very little for subsequent arguments. Similarly, the code/text size (TRS) and Swap remain unchanged.
The interesting values are the process-specific memory allocation (USS), proportional memory allocation (PSS), and resident size (RSS), which are all shown by smem.
(FYI, though TRS and DRS suggest resident size ("RS"), they are really reporting virtual size for specific memory segments; this can be proven because TRS + DRS is very close to VSZ, and clearly larger than RSS.)
Line 3 shows that shared memory in a single process has little effect on the reported memory usage.
Line 4 is the first case where subprocess is involved; the interesting change here is that the process-specific memory allocation (USS) is lower for the parent and child because the combined proportional memory allocation (PSS) is greater. This shows that fact that the parent and child are sharing memory pages. This sharing is happening as part of fork(), rather than using shared memory. Keep in mind none of the one megabyte of allocated memory has been accessed yet, so it doesn't show up in these columns. These numbers are really reflecting the executable and its dependent shared libraries. The child is also quite low because the parent has done much of the necessary memory access during startup.
Line 8 is the first time we actually access the one megabyte of allocated memory, and the numbers clearly reflect that. This highlights that you really need to access memory for it to show up, i.e. allocating it and not accessing it does not cause it to use physical RAM, though it does affect the virtual memory size.
Line 9 shows that using shared memory instead has little effect on the numbers.
Lines 10 and 11 are where things really get interesting — we are creating a child process, but accessing memory only in the parent. The parent process takes the memory hit, and the child process looks similar to line 5 where memory was not accessed. Lines 12 and 13 again show that allocating shared memory has little effect.
Lines 14 and 15 show that a parent and child both accessing one megabyte of memory each bear the full weight of the access.
Lines 16 and 17 are perhaps the most interesting — because shared memory is involved, there is only one copy of the one megabyte in RAM, and the two processes split the cost of the access in PSS, e.g. 572 and 570.
The output basically confirms what Chris Siebenmann reported. In a follow-up posting, I plan to show these same fields for a running Postgres server.
NoSQL Databases as the New Scripting Language
Friday, January 27, 2012
This video presentation by Dwight Merriman (MongoDB) at OSCON Data 2011 explores the role of NoSQL databases. He makes some very interesting points:
- NoSQL was created to allow horizontal, read-write scaling
- NoSQL uses a dynamic schema
- NoSQL is similar to dynamically-typed languages, e.g. Perl, PHP
- NoSQL has serious limitations compared to relational data storage, e.g. reporting
What I found most interesting about his presentation was the comparison to dynamically-typed languages. When dynamically-typed scripting languages started to become popular years ago, there was a sense that they were only for trivial applications, compared to compiled language that were for serious work. As CPUs became faster, scripting languages were increasingly used for production applications, particularly web applications that are modified frequently. There might be a similar pattern now in the (improper?) dismissal of NoSQL for serious applications.
Frankly, I think the biggest limitation of NoSQL is the need to do more complex processing in application code because the powerful SQL language is not available. Of course, the lack of transaction semantics and joins are also serious limitations for many workloads. Having NoSQL map to JSON has benefits if your web application uses JSON, but it hard to replace the powerful SQL features (e.g. GROUP BY) in an application. (Postgres is working on a native JSON storage data type.) Even performance on single node isn't a reason to use NoSQL since Postgres's UNLOGGED tables gives similar performance.
So, the big benefit for NoSQL is horizontal, read-write scaling and dynamic schemas. The big question for users considering NoSQL is whether these features are worth the loss of traditional relational database capabilities. Certainly, for some users, it is worth it — I think the question many relational database users are wondering is which current relational use-cases are better with NoSQL.
I hope this blog entry has helped answer that question. Josh Berkus has an excellent feature article that goes into the complexities of NoSQL database selection, and Gavin Roy has an interesting presentation on the topic.
Scalability What?
Thursday, January 26, 2012
The term "scalability" gets injected into database conversations often, but scalability discussions often end without a clear consensus. I believe this is because "scalability", like "replication", actually means different things to different people.
There are two dimensions to database scalability; the first dimension specifies if additional servers are involved:
- vertical: The hardware capabilities of a single database server are increased to increase throughput
- horizontal: The number of a database servers is increased to increase throughput
Think of "vertical" as the server getting larger (taller), and horizontal as having more servers on the floor (wider). The second dimension specifies the type of workload to be scaled:
- read-only: Workload to be scaled is only reads
- read-write: Workload to be scaled includes reads and writes
Keep in mind, the issue is not whether the server workload is read-only or read-write, but rather whether you need to scale writes as well as reads. There are tools (e.g. pgpool-II, see below) that can split out read-only queries for scaling while allowing read-write queries to go to a single master server.
With these terms defined, let's see how they are handled by Postgres. Vertical scalability is handled very well by Postgres for both read-only and read-write workloads, and improvements are coming in Postgres 9.2. I think the only area we don't scale well in is using multiple CPUs for a single query, and I have already blogged about that limitation.
Horizontal scaling is more complex because the server interconnect speed, even using InfiniBand, is much slower than interprocess communication via shared memory. For read-only workloads, a combination of pgpool-II and streaming replication allows for simple and powerful read-only load balancing across multiple servers, e.g. horizontal scaling.
What is hard to do is read-write horizontal scaling, because writes to different servers can conflict with each other, and communication between servers is relatively slow. One possible solution is Bucardo, which does not prevent conflicts but rather provides a mechanism for conflicting committed transactions to be ignored. The Postgres-XC team is working on a more sophisticated read-write, horizontal scaling system that allows conflicts to be detected before commit, but the software is not production-ready.
So, the next time you talk about scalability, it might be helpful to explain exactly what type of scaling you need — it will help to make the discussion a lot more fruitful.
Update: One other alternative for horizontal, read-write scaling is sharding, such as with PL/Proxy (details). This basically avoids write conflicts by spreading data across multiple servers. 2012-01-26
Increasing Database Reliability
Wednesday, January 25, 2012
While database software can be the cause of outages, for Postgres, it is often not the software but the hardware that causes failures — and storage is often the failing component. Magnetic disk is one of the few moving parts on a computer, and hence prone to breakage, and solid-state drives (SSDs) have a finite write limit.
While waiting for storage to start making loud noises or fail is an option, a better option is to use some type of monitoring that warns of storage failure before it occurs, e.g. enter SMART. SMART is a system developed by storage vendors that allows the operating system to query diagnostics on the drive and warn of unusual storage behavior before failure occurs. While read/write failures are reported by the kernel, SMART parameters often warn of danger before failure occurs. Below is the SMART output from a Western Digital (WDC) WD20EARX magnetic disk drive:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE 1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 0 3 Spin_Up_Time 0x0027 174 174 021 Pre-fail Always - 6300 4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 11 5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0 7 Seek_Error_Rate 0x002e 100 253 000 Old_age Always - 0 9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 145 10 Spin_Retry_Count 0x0032 100 253 000 Old_age Always - 0 11 Calibration_Retry_Count 0x0032 100 253 000 Old_age Always - 0 12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 9 192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age Always - 8 193 Load_Cycle_Count 0x0032 200 200 000 Old_age Always - 543 194 Temperature_Celsius 0x0022 127 119 000 Old_age Always - 23 196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always - 0 197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 0 198 Offline_Uncorrectable 0x0030 100 253 000 Old_age Offline - 0 199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 0 200 Multi_Zone_Error_Rate 0x0008 100 253 000 Old_age Offline - 0
(These charts were generated using smartmon.) The VALUE column ranges from 0-200, with higher values being better. The RAW_VALUE column is more difficult to interpret because in some cases high is good, and in others, it is bad (e.g. temperature). The next chart is from an Intel 320 Series SSD:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE 3 Spin_Up_Time 0x0020 100 100 000 Old_age Offline - 0 4 Start_Stop_Count 0x0030 100 100 000 Old_age Offline - 0 5 Reallocated_Sector_Ct 0x0032 100 100 000 Old_age Always - 0 9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 1120 12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 38 170 Unknown_Attribute 0x0033 100 100 010 Pre-fail Always - 0 171 Unknown_Attribute 0x0032 100 100 000 Old_age Always - 0 172 Unknown_Attribute 0x0032 100 100 000 Old_age Always - 0 183 Runtime_Bad_Block 0x0030 100 100 000 Old_age Offline - 0 184 End-to-End_Error 0x0032 100 100 090 Old_age Always - 0 187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0 192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 19 199 UDMA_CRC_Error_Count 0x0030 100 100 000 Old_age Offline - 0 225 Load_Cycle_Count 0x0032 100 100 000 Old_age Always - 177828 226 Load-in_Time 0x0032 100 100 000 Old_age Always - 210631 227 Torq-amp_Count 0x0032 100 100 000 Old_age Always - 0 228 Power-off_Retract_Count 0x0032 100 100 000 Old_age Always - 4261545 232 Available_Reservd_Space 0x0033 100 100 010 Pre-fail Always - 0 233 Media_Wearout_Indicator 0x0032 100 100 000 Old_age Always - 0 241 Total_LBAs_Written 0x0032 100 100 000 Old_age Always - 177828 242 Total_LBAs_Read 0x0032 100 100 000 Old_age Always - 96042
Notice there is no temperature value (probably unnecessary), but it has additional values like Media_Wearout_Indicator, which indicates the wear level of the memory chips. It shows a wear value of "100", indicating no measurable wear.
SMART also records all errors generated by the drive, and allows drive self-tests to be run, independent of the operating system. This is helpful in isolating error causes.
While you could manually check the SMART values of your drives periodically, there are tools, e.g. smartd, that can automatically query drives and warn administrators of potential problems, e.g. every 30 minutes. System administrators that want to maximize uptime should always use such available tools to get failure warnings before they happen.
The Most Important Postgres CPU Instruction
Tuesday, January 24, 2012
Postgres consists of roughly 1.1 million lines of C code, which is compiled into an executable with millions of CPU instructions. Of the many CPU machine-language instructions in the Postgres server executable, which one is the most important? That might seem like an odd question, and one that is hard to answer, but I think I know the answer.
You might wonder, "If Postgres is written in C, how would we find the most important machine-language instruction?" Well, there is a trick to that. Postgres is not completely written in C. There is a very small file (1000 lines) with C code that adds specific assembly-language CPU instructions into the executable. This file is called s_lock.h. It is an include file that is referenced in various parts of the server code that allows very fast locking operations. The C language doesn't supply fast-locking infrastructure, so Postgres is required to supply its own locking instructions for all twelve supported CPU architectures. (Operating system kernels do supply locking instructions, but they are much too slow to be used for Postgres.)
The specific locking operation supported by the CPU is called test-and-set. It allows a value to be stored in a memory location and, at the same time, the old value to be returned. This allows for two sessions to simultaneously request a lock and for only one session to successfully acquire the lock. Spinlocks and other more complex locks are built upon test-and-set; see my Inside PostgreSQL Shared Memory presentation for more details. this.
I have copied a small part of s_lock.h to show this important instruction on x86 architectures:
1 static __inline__ int
2 tas(volatile slock_t lock)
3 {
4 register slock_t _res = 1;
5
6 __asm__ __volatile__(
7 " cmpb $0,%1 \n"
8 " jne 1f \n"
9 " lock \n"
10 " xchgb %0,%1 \n"
11 "1: \n"
12 : "+q"(_res), "+m"(lock)
13 :
14 : "memory", "cc");
15 return (int) _res;
16 }
The first line identifies the function as returning an integer ("int"), and "__inline__" causes the the function body to likely be placed at the function reference site, rather than generating a function call and storing the function body someplace else. Line 2 names the function as "tas" (for Test-And-Set), and indicates it takes an "slock_t" memory address ("lock"). Line 4 defines a local variable "_res" which will hold our initial lock value (1), and the function return value. Line 6 starts a block of assembly-language code in GCC, with the code on lines lines 7-11. Line 12 indicates that the assembly language uses two input/output parameters: "_res" (%0) and "*lock" (%1). Lines 7 and 8 check to see if the lock is already non-zero, indicating the lock is already held. If so, it jumps to the end at line 11 ("1:"). Line 9 locks the memory address referenced by the next instruction. Line 10 is the big one, the most important assembly language instruction in the Postgres server executable: "xchgb". It does a test-and-set or exchange of the "_res" (local register) and "*lock" (shared memory) values, swapping them in an atomic manner. Line 14 indicates that memory and the condition code register are modified by the assembly language. Line 15 returns zero for success, and one for failure, which can occur if the test on line 7 or the exchange on line 10 found the lock was already held.
So, there is it is — the most important CPU instruction, at last for x86 CPUs. If you look through s_lock.h, you will find similar test-and-set instructions all the other supported CPU architectures.
More Lessons From My Server Migration
Monday, January 23, 2012
The new server is 2-10 times faster than my old 2003 server, but that 10x speedup is only possible for applications that:
- Do lots of random I/O, thanks to the SSDs. Postgres already supports tablespace-specific random_page_cost settings, but it would be interesting to see if there are cases that can be optimized for low random pages costs. This is probably not an immediate requirement because the in-memory algorithms already assume a low random page cost.
- Can be highly parallelized. See my previous blog entry regarding parallelism. The 16 virtual cores in this server certainly offer more parallelism opportunities than my old two-core system.
Other observations:
- It takes serious money to do the job right, roughly USD $4k — hopefully increased productivity and reliability will pay back this investment.
- I actually started the upgrade two years ago by adjusting my scripts to be more portable; this made the migration go much smoother. The same method can be used for migrations to Postgres by rewriting SQL queries to be more portable before the migration. Reliable hardware is often the best way to ensure Postgres reliability.
- My hot-swappable SATA-2 drive bays allow for a flexible hard-drive-based backup solution (no more magnetic tapes). File system snapshots allow similar backups for Postgres tablespaces, but it would be good if this were more flexible. It would also be cool if you could move a drive containing Postgres tablespaces from one server to another (perhaps after freezing the rows).
Hopefully I can translate some of these lessons into Postgres improvements in the coming years. With this new server, and my recent
upgrade to fiber-based Internet, my home feels like a data center.
New Server
Friday, January 20, 2012
A few weeks ago, I finally replaced my eight-year-old home server. The age of my server, and its operating system, (BSD/OS, last officially updated in 2002) were a frequent source of amusement among Postgres community members. The new server is:
Super Micro 7046A-T 4U Tower Workstation 2 x Intel Xeon E5620 2.4GHz Quad-Core Processors Crucial 24GB Dual-Rank PC3-10600 DDR3 SDRAM Intel 160GB 320 Series SSD Drive 4 x Western Digital Caviar Green 2TB Hard Drives
The server replacement took about 100 hours of my time over four months, from specification, testing, and configuration. (Many thanks to Greg Smith for helping select and test the server.) Getting Postgres working on the new Debian server was trivial — even the historically difficult SGML documentation build process was easy. The server required a lot of configuration because it has twenty years of Unix customizations that do family calendaring, contacts, home automation, telephone, typesetting, web serving, email, and family media serving.
I learned a few things related to Postgres adoption in the process. We all have friends who are using databases other than Postgres. They often would like to use Postgres, but need a reason to switch. Sometimes cost, features, or performance can motivate such a switch, but without those, it is hard to recommend a time-consuming migration to Postgres. However, eventually, all hardware and software has to be upgraded, and when that time comes, I looked to friends with experience for advice. People considering upgrading their databases will do the same — we just need to be ready to give advice when the time comes.
View CommentsTOAST Queries
Thursday, January 19, 2012
As a followup to my previous blog entry, I want to show queries that allow users to analyze TOAST tables. First, we find the TOAST details about the test heap table:
SELECT oid, relname, reltoastrelid, reltoastidxid FROM pg_class where relname = 'test'; oid | relname | reltoastrelid | reltoastidxid -------+---------+---------------+--------------- 17172 | test | 17175 | 0
Notice pg_class has two TOAST-related fields, reltoastrelid and reltoastidxid. This heap table references the TOAST table 17175:
SELECT oid, relname, reltoastrelid, reltoastidxid FROM pg_class where oid = 17175; oid | relname | reltoastrelid | reltoastidxid -------+----------------+---------------+--------------- 17175 | pg_toast_17172 | 0 | 17177
Notice the TOAST table name contains the oid of the heap table (17172). The TOAST table references its index, 17177:
SELECT oid, relname, reltoastrelid, reltoastidxid FROM pg_class where oid = 17177; oid | relname | reltoastrelid | reltoastidxid -------+----------------------+---------------+--------------- 17177 | pg_toast_17172_index | 0 | 0
You can actually look inside the TOAST table:
\d+ pg_toast.pg_toast_17172;
TOAST table "pg_toast.pg_toast_17172"
Column | Type | Storage
------------+---------+---------
chunk_id | oid | plain
chunk_seq | integer | plain
chunk_data | bytea | plain
SELECT * FROM pg_toast.pg_toast_17172;
chunk_id | chunk_seq | chunk_data
----------+-----------+--------------------------------------------------------------------
17206 | 0 |\x40420f00fe410f01ff0f01ff0f01ff0f01ff0f01ff0f01ffff0f01ff0f01ff0f.
| |.01ff0f01ff0f01ff0f01ff0f01ffff0f01ff0f01ff0f01ff0f01ff0f01ff0f01f.
| |.f0f01ff0f01ffff0f01ff0f01ff0f01ff0f01ff0f01ff0f01ff0f01ffff0f01ff.
…
Of course, this is the internal, compressed representation. The chunk_id is the value stored in the main heap table to reference the proper long TOAST value. The data is stored in the chunk_data field in 8k chunks, tracked by the chunk_seq.
Using ALTER TABLE and doing another insert, it is possible to see the values in their uncompressed, hex format:
ALTER TABLE test ALTER COLUMN y SET STORAGE EXTERNAL;
INSERT INTO test SELECT 1, string_agg(x, '')
FROM (
SELECT 'A'::text
FROM generate_series(1, 1000000)
) AS f(x);
SELECT * FROM pg_toast.pg_toast_17172;
--- insert more 'A' data
…
chunk_id | chunk_seq | chunk_data
----------+-----------+---------------------------------------------------------------------
17207 | 0 |\x414141414141414141414141414141414141414141414141414141414141414141.
| |.4141414141414141414141414141414141414141414141414141414141414141414.
| |.1414141414141414141414141414141414141414141414141414141414141414141.
41 is the hex code for A, so with compression turned off, the TOAST table clearly shows the storage of the one-million A's.
View CommentsTOAST-y Goodness
Tuesday, January 17, 2012
Many things are better toasted: cheese sandwiches, nuts, marshmallows. Even some of your Postgres data is better toasted — let me explain.
Postgres typically uses an eight-kilobyte block size — you can verify this by running pg_controldata:
Database block size: 8192
If that is true, how does Postgres support a field limit of one gigabyte, as mentioned in the Postgres FAQ?
What is the maximum size for a row, a table, and a database? … Maximum size for a field? 1 GB
One solution would be to allow rows to span multiple 8k blocks, but this would introduce source code complexity and performance problems. A better solution, implemented in Postgres since 2001, is TOAST (The Oversized-Attribute Storage Technique). TOAST uses a backup table to store a row's long values. Its goal is to move enough row values into TOAST tables so the row length is less than about 2 kilobytes. You can read the details in tuptoaster.c.
TOAST tables are actually hard to find — they don't show up when looking at the table via psql, even in verbose mode:
\d+ test
Table "public.test"
Column | Type | Modifiers | Storage | Stats target | Description
--------+---------+-----------+----------+--------------+-------------
x | integer | | plain | |
y | text | | extended | | Has OIDs: no
The only hint of their existence is the Storage column. Column x is a short, fix-length column, and hence is not considered for TOAST storage (marked plain). Column y is a "TEXT" column and can store very long values and hence is considered for TOAST storage. The possible storage values are listed in the TOAST documentation section and in the ALTER TABLE manual page. The options control whether data is compressed, and whether it is considered for TOAST storage. I use the word "considered" here because values are not unconditionally forced into compression or TOAST storage — they are only considered for such storage if the row length is long enough.
The TOAST system has several advantages:
- Storage of a value in a backup/TOAST table happens only when the row is long
- Compression is attempted before moving a values to the TOAST table
- Sequential scans that do not access TOAST columns do not need to read or process these long data values.
- Updates under MVCC, which create a new copy of the updated row, can share the same TOAST table pointers unless the TOASTed column values are changed.
You can see the effect of TOAST in the following queries:
INSERT INTO test SELECT 1, string_agg(x, '')
FROM (
SELECT 'A'::text
FROM generate_series(1, 1000000)
) AS f(x);
SELECT pg_table_size('test');
pg_table_size
---------------
65536
Notice that though one-million characters were inserted into the table, compression reduced that to 64k, made up of eight mostly-empty 8k pages: one heap page, two TOAST pages, three TOAST free space map (fsm) pages, two TOAST index pages. We can use pg_relation_size() to see each piece separately:
SELECT pg_relation_size('test');
pg_relation_size
------------------
8192
SELECT pg_relation_size('pg_toast.pg_toast_17172');
pg_relation_size
------------------
16384
SELECT pg_relation_size('pg_toast.pg_toast_17172', 'fsm');
pg_relation_size
------------------
24576
SELECT pg_relation_size('pg_toast.pg_toast_17172_index');
pg_relation_size
------------------
16384
Inserting random characters, which can't be easily compressed, does show the table size increasing by one megabyte, with most of that storage being consumed by the TOAST table, as expected:
INSERT INTO test SELECT 1, string_agg(x, '')
FROM (
SELECT chr(ascii('A') + (random() * 26)::integer)
FROM generate_series(1,1000000)
) AS f(x);
SELECT pg_table_size('test');
pg_table_size
---------------
1105920
SELECT pg_relation_size('pg_toast.pg_toast_17172');
pg_relation_size
------------------
1040384
As you can see, TOAST works transparently. In fact, it works so well that the Postgres community hasn't even found it necessary to provide tooling to study TOAST behavior. I hope this blog entry has provided interesting details on how Postgres uses TOAST to store long values transparently.
View CommentsComing to Boston
Monday, January 16, 2012
Speaking of presentations, I get to use my updated presentations this Thursday when I speak to the Boston PostgreSQL Users Group. Then, on Friday, I get to use more of them when I do training at EnterpriseDB's headquarters.
View CommentsPresentations Updated
Monday, January 16, 2012
As part of my server upgrade, I migrated to a newer version of LyX and LaTeX, but most significantly, to a more modern and powerful LaTeX document class, Beamer. All 1300 slides in my presentations have been updated. If you see something that needs improvement, no matter how minor, please let please know via email, chat, or blog comment.
View Comments
Postgres Blog
RSS