This blog is about my work on the Postgres open source database, and is also published on Planet PostgreSQL.
Online status:
Unread Postgres emails:
Email graphs:
incoming,
outgoing,
unread,
commits
(details), events
PgLife Averages Thirty Active Users
Monday, May 13, 2013
A week after the release of PgLife, the site is averaging thirty active users. (I define an active user as an ip address that has viewed the site for at least five minutes during the past hour.) I consider that a success. Since the release of PgLife, I have increased the content update interval and added an About page explaining the site's purpose, which also includes the active user count.
The site uses ajax, Perl, Procmail rules, and Apache to collect and deliver dynamic content. Recent improvements in the Postgres mailing list archive feature set have made linking to emails much simpler.
PgLife was my first attempt at a dynamic website, and I learned a few things. First, I learned the value of having an alert file that can force a browser reload to push fixes and improvements to the browser. Second, I used the same file to allow pushing of news alerts to users, e.g. 9.3 Beta1. Third, I learned the importance of controlling browser and server caching and revalidation when using dynamic content.
I have always wanted a dashboard that shows current Postgres activity, and PgLife meets that need for me, and thankfully others. I am very fortunate that much of the Postgres activity is recorded in a way that makes external processing easy, and I know some people are working to make more content available, which will eventually be added to PgLife.
Postgres Blog
RSS
Monday, May 6, 2013
New Postgres users are often confused by the variety of activities in the Postgres community. Even us old-timers can be overwhelmed in monitoring the Postgres community. To meet these two needs, I have created a website called PgLife which dynamically displays the current activity in each of the community areas: email, development, blogging, media, news, events, and irc chat. I am hopeful people will find this useful in understanding the Postgres community, and in increasing their involvement.
Post a CommentFriday, May 3, 2013
I have completed the 9.3 release notes so they are ready for beta release. They will be regularly updated until 9.3 final.
Post a CommentFriday, April 12, 2013
With our final 9.3 commit-fest almost complete, and few 9.3 open items, I have been asked to start on the 9.3 release notes. I have blogged about this process in the past (1, 2), so I will not bore you with the details. I expect to be done in a week.
Post a CommentThursday, April 11, 2013
Postgres supports all the sql-standard constraint types, e.g. unique, check. These work fine at the server level:
While this clearly prevents invalid data from being entered into the database, it isn't very helpful in providing client-level feedback to the user. One improvement is to set the name of the constraint:
This returns a more meaningful client error string in the constraint name. (Fyi, Postgres does not record not null constraint names.) It is possible for clients to parse the error string and return the constraint name as error feedback to the user. This has a few disadvantages:
This will be much easier in Postgres 9.3 with the addition of the constraint name to error message output. Here is the output in psql:
I often tout the ability to write server-side code in the same language as database clients. This eases moving code between client and server. What is more interesting is code sharing, where you use similar code on the client and server. Will Leinweber has an interesting example where he uses plv8 to create JavaScript validation code that can be run in the browser and in the database.
Hopefully this blog entry has given you some ideas of how to harness server-side constraints in client applications, and how to share validation code between client and server. None of these approaches is perfect, but hopefully they will ease the burden of maintaining duplicate client and server validation checks.
Post a CommentWednesday, April 10, 2013
Sql has never been good at handling dynamic columns — it can easily process tables with a fixed number of columns, but dynamic column handling has always been a challenge. (Dynamic columns are often used to implement user-defined fields.) One approach has been to define columns for every possible value that might be stored in a row; this leads to sparse tables containing mostly null values. Because Postgres stores null values in a single bit, there is little overhead to this approach. However, it is hard to call these dynamic columns since every new column requires an alter table ... add column command.
Another approach is to use entity/attribute/value layout, where each value is stored in its own row. This obviously increases storage requirements, has performance challenges, and adds complexity to sql queries.
Fortunately, Postgres has various options to efficiently handle dynamic columns. The simplest solution is arrays which allow multiple values of the same data type to be stored in a single field. This assumes that the values are independent, or that their ordinal position in the array is sufficiently meaningful.
A more complex approach is to associate a key with each value. (Values internally are strings, rather than bound to a specific data type.) Hstore is a very popular key/value store for Postgres, and has index lookup support for arbitrary key/value pairs. (It does not allow hierarchical storage, i.e. key/values inside a key).
Xml and more recently json allow hierarchical storage and more naturally match client application languages. Unfortunately, none of the Postgres native index types understand the structure of json or xml data. (Json and xml documents are stored internally as single text strings.) To index json, it is necessary to create an expression index on a plv8 function (which understands json). Andrew Dunstan has a good blog entry using a plv8 function (jmember) to index specific json keys. Postgres 9.3 will add functions (blog, blog, blog, blog) to access json values directly — this will eliminate the need to use plv8 in simple cases.
For arbitrary json key lookups, it is possible to use a Postgres full text search index to filter out most non-matching rows and use a json-aware function to compare the remaining rows — this avoids the requirement of creating an index per lookup key.
Xml has similar capabilities to json — it can store hierarchical data, and can use full text search to filter matching rows. While many server-side languages understand xml, e.g. PL/Java, PL/Perl, most people use the built-in Postgres xml functions in expression indexes.
As you can see, Postgres offers a variety of options to store dynamic columns while maintaining a fixed-column table format that is easily accessed via sql — this gives you the best of both worlds when storing dynamic data.
Post a CommentTuesday, April 9, 2013
In December, I reported that a bug had been discovered in how pg_upgrade handled invalid indexes, and that the next release of pg_upgrade would throw an error if invalid indexes were found; that release happened in February. In an easily-overlooked development, last week's security release changed pg_dump so it now skips dumping invalid indexes. This allows the most recent version of pg_upgrade to upgrade clusters with invalid indexes, rather than throwing an error.
Post a CommentMonday, April 8, 2013
You probably heard about the important Postgres security releases (FAQ) last week, but maybe you thought it was mostly a theoretical threat. Well, the Schemaverse game author left a non-upgraded server open to the Internet to see how long it would take to be infected — turns out it took a sobering 21 hours. Hopefully this gets everyone's attention on the need to upgrade.
View or Post CommentsMonday, April 8, 2013
Having covered expression indexes, I want to cover another index feature — partial indexes. Partial indexes allow you to create an index on part of a table. Why would you want to do that?
People usually don't use expression or partial indexes initially, but there are certainly some use-cases that can only be solved with these index features.
View or Post CommentsSaturday, April 6, 2013
Having covered virtual columns recently, I wanted to cover one additional type of stored value, expression indexes. Unlike virtual columns, which are computed on each access, expression indexes are stored in index files, allowing rapid access. Let's look at some examples, building on the customer table and fullname function created in my previous blog entry:
The first create index command simply creates a copy of the column lastname in a btree-ordered index file. The second example concatenates firstname and lastname fields and stores the result in a btree-ordered index file; it requires double-parentheses around the expression. The last example stores the output of the function fullname in an index file.
The ability to store the output of expressions and functions allows rapid access for queries that are more complex than simple column comparisons. For example,
Explain allows us to see the index being used:
Notice that all three use the i_customer_fullname index. The sql fullname function was inlined when the index was created, so it was expanded to (((firstname || ' '::text) || lastname); the i_customer_concat index is simply unnecessary. Storing Soundex values in expression indexes is also useful.
Basically, you have three options for creating auto-generated columns:
Of course, only the last one, expression indexes, allows rapid access to computed values. The example below uses sql to find the input value that yields the supplied factorial output value, e.g. what value generates a factorial of 120:
Interestingly, Postgres computes optimizer statistics on expression indexes, even though the expressions do not exist in any table:
Expressions indexes are just another tool available to Postgres database users — they allow rapid access to rows based on table column expressions and functions.
View or Post CommentsThursday, April 4, 2013
Operators are usually easily distinguished from functions because they use punctuation, e.g. -, *, /, rather than words with letters and numbers. There are two popular types of operators, unary and binary. Unary operators take one argument, e.g. -3 (minus), !false (not). Binary operators take two arguments, one on each side of the operator, e.g. 2 + 3 (addition), 'a' || 'b' (concatenation). (C-based languages often also support the ternary operator ?:.)
Perhaps surprisingly, in Postgres comparisons are also operators, e.g. a = 3 (equality), to_tsvector('hello bye') @@ 'hello' (full text search). Comparison operators return true, false, or null, and their return values are processed by surrounding keywords or clauses, e.g. where, and. For example, to process where col1 = 10 + 1 and col2 > 5, 10 + 1 is processed with +, col1 = 11 is then processed with =, col2 > 5 is processed with >, the results processed by and, and its return value is passed to the query processor via where. Just as you can type \dfS to see all defined functions in psql, and \dTS to see all defined data types, \doS shows all defined operators.
You might think of operators as distinct from functions, but internally operator behavior is bound directly to functions. For example, factorial contains both operator and function definitions. The factorial operator, !, is actually the only Postgres-supplied left-associative unary operator:
It is also implemented as a function, factorial:
Notice both reference numeric_fac. Where is that defined? It is defined as C-language function, and is called by both ! and factorial(). Create operator allows users to create their own operators based upon already-defined functions, even user-defined functions. While user-defined operators are not as common as user-defined functions, they are frequently used by Postgres extensions. For example, hstore defines 14 operator combinations.
Just like functions and data types, operators are "laid on top" of the sql language in Postgres. So, they next time you call an operator, realize there is a lot going on under the hood, and you can change its behavior.
View or Post CommentsWednesday, April 3, 2013
My 2013 blog entries through March 25 have been translated into Spanish by Maria Ramos of webhostinghub.com. It is unclear how frequently it will be updated; I will report back if I get more information.
View or Post CommentsTuesday, April 2, 2013
There was some nifty function-calling syntax used in my previous blog post about virtual columns, so I now want to cover function calling syntax in detail. Traditionally, functions are called in sql using a procedural syntax, e.g. function(argument):
With Postgres, you can also pass sql composite types in a procedural manner. Composite types represent data rows or records of data values. The following query works because the function was defined in my previous blog post to accept a composite type as an argument, i.e. create function fullname(customer), where customer is a relation name, and hence a composite type:
You can also append .*:
Where it really gets interesting is when we use an alternative function call syntax with composite types that allows the function call to appear as a normal column — this is part of the original object-relational design from Berkeley. It follows a more object-oriented approach, where object variables (data storage) and object methods (functions) are accessed using a similar syntax, i.e. fullname appears as a storage column, but is actually a function name:
This is the syntax we used for virtual columns — hopefully they make more sense now. It is also possible to dynamically create composite types and use this calling syntax. To create composite types, you use parentheses with either a comma inside the parentheses, or a period after the closing parenthesis:
The first example does not match the criteria for composite creation — notice the output does not show parentheses, meaning it is not a composite type. The second does create a composite type by including a comma. The third example illustrates calling the upper function using a dynamically-created composite type. (Row can also be used to create composite types, though the function calling syntax doesn't seem to work with them.)
One disadvantage of the composite syntax is that, because it was developed before schemas (namespaces), there is no way to specify the function's schema:
Only search_path is used to locate such functions.
You can also output a data row in composite format by surrounding the relation name with parentheses:
In summary, there are two function calling conventions — a procedural one, which everyone knows, and a less well-known composite method. The composite method can use either a composite type, like a relation name, or a dynamically-created type, which requires parentheses. The composite method could probably benefit from syntax flexibility improvements in future Postgres releases.
View or Post CommentsMonday, April 1, 2013
Databases store data in table columns, but sometimes it is useful to create non-storage columns for code clarity. For example, you might need to access a combination of columns frequently and don't want to repeatedly perform the combining operation in your application. In this case, a column can be created that acts as a virtual read-only column. In the example below, a fullname function is created that combines the firstname and lastname columns, and the fullname function is then accessed as a virtual column:
You can even create column aliases in this way:
Only functions with a table name argument can be called in this way, i.e. as table_name.function_name. Because sql functions are often inlined before execution, their overhead is minimal — notice that the explain plan for the real and virtual column references are identical:
Such functions can also be created in other server-side languages, like PL/pgSQL:
Of course, much of this can already be done with views:
Views are nice in that they do not require creation of a function to use them, but they do require applications to reference the view name rather than the base table name. (Of course, ddl can be added to make views updatable.) Virtual columns are nice because they are only active when the virtual column is referenced on the base table.
View or Post CommentsFriday, March 29, 2013
Building upon my previous blog post, where I declared Postgres was "good enough", I want to explore what "good enough" means. I will use an aircraft design analogy. (You might remember me exploring a Postgres/aircraft parallel before.)
In aircraft design, everything is a tradeoff between weight, cost, reliability, and safety. Let's look at safety — what would be the safest material to use in constructing an airplane? One logical selection would be to make it from the same material as the flight data recorder, i.e. the black box. It is designed to withstand the most serious crashes, so obviously, it would be the safest.
Of course, the problem is that an aircraft made with material of the same strength as a flight data recorder would be so heavy it would never get off the ground. A material must be chosen that is safe enough, and light enough so the plane can fly. Obviously, measuring best in only one dimension leads to impractical problems in other areas.
Software has similar issues — cheapest, fastest, most flexible, most featureful, most reliable, least maintenance — these are all possible measures for evaluating software, but the best solution strikes a balance in each of these areas. The fastest software might be custom-produced and written in assembler, but its cost and rigidity would make it a poor choice. Databases have to balance these measure to produce useful software for many users. For example, for some sites, NoSQL has sufficient reliability, and for others, SQLite has enough features. (I have already explored the tradeoff between configurability and maintenance overhead.)
Postgres is always working to extend its attractiveness in every major category. Sometimes we can improve one category without diminishing another, but at other times, this is not possible, and our community must choose what will be most attractive for our current and future users. These tradeoffs are never easy, but thanks to the close communication among our developers and users, we usually triangulate on an ideal outcome. Sometimes we can allow users to choose tradeoffs, e.g. between performance and durability.
So Postgres is good enough, in so many ways. It probably isn't superior in any of these categories (except perhaps cost and flexibility), but its combination of low cost, flexibility, speed, features, reliability, and minimal maintenance makes it a star performer. Our users think its combination is the best, and you might think so too.
View or Post CommentsThursday, March 28, 2013
With the increased attention Postgres is getting, many ask whether Postgres has the same features as Oracle, MS SQL, or DB2. Of course, the answer is that Postgres doesn't have all their features, and never will, but it is also true that those other databases will never have all of Postgres's features. The better question is whether Postgres has all the features you need for a successful deployment, and the answer to that is usually an enthusiastic "yes".
This gets into the larger question of when something is good enough. Good enough sounds negative, like you are giving up something to settle for a partial solution. However, good enough is often the best option, as the negative aspects of an ideal solution (cost, overhead, rigidity) often make good enough best. In response to Oracle's announcement of new hardware, this Slashdot comment captures how great good enough can be:
So, in the future, if you hear someone say Postgres is good enough for their needs, don't think of them as settling for less — think of them as having chosen Postgres as the best solution they could find, because in so many situations, Postgres is the best.
View or Post CommentsTuesday, March 26, 2013
With the all the excitement surrounding Postgres, I have heard speculation about why Postgres has become such a big success. Many point to our exemplary community structure, while others point to our technical rigor. One item of particular scrutiny has been our extensibility — I was asked about this twice at the recent New York City PGDay.
Postgres (named as post-Ingres) was designed to be extensible, meaning not only can users create server-side functions, but also custom data types, operators, index types, aggregates, and languages. This extensibility is used by our contrib extensions, PostGIS, and many others. Someone said it was like Postgres developed the sql language, then laid the data types on top. Will Leinweber, in his nyc presentation, described Postgres as a "data platform rather than a database", and his use of hstore, json, and plv8 illustrated the point well.
Extensibility certainly isn't the only reason for Postgres's success, but it is part of a larger set of ideal circumstances that allows Postgres to thrive in an environment where many other relational databases are struggling.
View or Post CommentsMonday, March 25, 2013
I just returned from PGDay NYC. The conference had an intensity that I have not seen in other conferences recently. Last year's PGDay NYC conference was in a large facility and felt dominated by people considering Postgres for serious deployments. This year's event was in underground adjacent rooms, and the crowd seemed completely sold on Postgres. The compactness of the venue and the enthusiasm of the crowd made for a unique energy. One of the rooms even had couches and cushioned chairs — I don't think I have ever listened to a technical presentation from a couch — hard to beat.
The event was attended by four core team members and distant speakers like Joshua Drake, David Fetter, and Will Leinweber. Will's presentation was particularly exciting for me and I will be blogging about it soon.
View or Post CommentsFriday, March 8, 2013
Postgres is very good at adding features built upon existing features. This process was used when table partitioning was added in Postgres 8.1 in 2005. It was built upon three existing features:
Though constraint_exclusion (which controls partition selection) has received some usability improvements since 2005, the basic feature has remained unchanged. Unfortunately, the feature as implemented has several limitations which still need to be addressed:
There is a wiki related to table partitioning that covers some of this in detail.
Postgres partitioning was good enough to foster its wide deployment, but the limitations of building partitioning upon existing features is clear. It is perhaps time to develop more partition-specific infrastructure to allow easier-to-use and more scalable partitioning in Postgres.
One downside is that any partitioning improvements that address these limitations would probably be less flexible than our current partitioning capabilities — in a way, it is our flexibility that has led to the performance and user-overhead requirements of our current setup. It is probably the challenge of reducing partitioning's limitations while maintaining its flexibility that has prevented improvements in this area.
View or Post CommentsThursday, March 7, 2013
Last month I visited India for EnterpriseDB and talked to many government and corporate groups. We don't have many independent Postgres community members in India, so I was surprised by the serious interest in Postgres and the number of large deployments under consideration.
The issue of porting applications to Postgres was discussed. Normally, when I think of porting to Postgres, I think of company-created applications that just require a proper Postgres interface library and simple sql query modifications — and our community maintains porting guides for many databases, e.g. Oracle, MySQL, MSSQL, other. In addition, most open source application languages are rapidly moving to standardize on Postgres, so I thought most of the big hurdles for migration were solved.
However, there is one area that still has significant barriers for Postgres migration, and that is middleware, particularly proprietary middleware. If your application is written to use a middleware api, and that middleware doesn't support Postgres, there isn't much possibility of porting to Postgres unless you change your middleware software, and changing middleware software involves not just modifying queries, but rearchitecting the entire application to use a different middleware api.
At this point, almost every application language and all open source middleware has Postgres support, so hopefully more proprietary middleware vendors will add Postgres support soon.
View or Post CommentsTuesday, March 5, 2013
A few months ago I wrote an article about how NoSQL databases are challenging relational databases, and it has recently been published. My article's goal was to explain that while NoSQL databases have advantages over relational databases, there are many disadvantages that are often overlooked. I also explained how relational databases, particularly Postgres, are adding features to adapt to NoSQL workloads.
View or Post CommentsThursday, January 24, 2013
In December of 2011, I blogged about the increasing need for parallelism in the Postgres backend. (Client applications have always been able to do parallelism with subprocesses, and with threads since 2003).
Thirteen months later, I have added two parallel code paths (1, 2) to pg_upgrade. I have learned a few things in the process:
Using pg_upgrade as an example, parallelism can yield a 10x performance improvement, and it only took me a few weeks to accomplish. However, to get 10x improvement, you have to have multiple large databases, and be using multiple tablespaces — others will see more moderate gains. Fortunately, I have also improved pg_upgrade performance by 5x even without parallelism.
Anyway, with those lessons learned, I am ready to start mapping out a parallelism strategy for the Postgres server. I have created a wiki that outlines the benefits, challenges, and specific opportunities for parallelism in the backend. Feel free to add items or make adjustments.
View or Post CommentsWednesday, January 23, 2013
I hope you have found my blog series about nulls useful. I was inspired to write this series by Jeff Davis's 2009 blog entry on the topic, in which he states:
The Wikipedia article about nulls is quite thorough, including criticisms. This C. J. Date book excerpt also has a good summary of possible null traps to avoid.
The sql standards committee had to balance convenience and consistency in designing null behavior. Unfortunately, they didn't achieve 100% success in either area, but achieved reasonable success in both.
Here is a summary of all the blog entries in this series:
- Nulls Make Things Easier?
- Explicit and Implicit Null Generation
- The Non-Value of Nulls
- The Three-Valued Logic of Nulls
- The Not In Null Trap
- Comparing Nulls
- Ordering and Indexing Nulls
- Nulls and Aggregates
- Mapping Nulls to Strings
- Nulls in Arrays and Row Expressions
- Null Summary
View or Post CommentsMonday, January 21, 2013
Null behavior as part of an array or row expression has some unusual behaviors that it is best just to memorize:
A null cast to an integer array is null, while an empty array or an array containing a null is not null.
Empty row expressions or row expressions containing only nulls are null, while a row expression containing a mix of nulls and non-nulls is neither null nor not null. Rows containing only non-nulls are not null, of course. Basically, all of a row's values must have the same null status to return true for null checks.
Another odd behavior is using a query returning zero rows in the target list:
Though the table has no rows, using it in a select target list returns a null. This is because a select with no from clause is assumed to return one row. (This was recently discussed on a Postgres email list.)
View or Post CommentsFriday, January 18, 2013
Instagram has been public about their large-scale use of Postgres. They recently posted a blog entry mentioning five helpful tips about using Postgres.
View or Post CommentsFriday, January 18, 2013
Sql specifies functions to map null values to and from non-null values. Coalesce returns the first passed non-null value:
Notice that coalesce can return a value of any data type. It is very useful for specifying a special output string to represent null values. The second to last use of coalesce (using ||) shows how coalesce can be used to concatenate values that might be null. The last use of coalesce changes the unusual behavior I mentioned earlier about sum() returning null when no values are processed.
The function nullif returns a null if two parameters are equal — the inverse of coalesce:
The last example converts null to 'n/a', then back to null with nullif.
View or Post CommentsWednesday, January 16, 2013
Aggregates are also a regular source of confusion with nulls, both in how aggregates handle nulls on input, and when aggregates return nulls. Let's look at the input case first:
There are two things to notice here:
Let's look at cases where aggregates return nulls, specifically where aggregates have no rows to process:
count(*) and count(x) have no trouble returning zero rows. Min(), max(), and avg() are probably returning the best value they can for no rows. You might argue that avg() should return *0*, but that would technically be computing 0 / 0, which is indeterminate in mathematics. The real problem here is sum() — the sum of zero rows is clearly zero, but the sql standards committee defined it as returning null, probably to be consistent with the other non-count aggregates. Of course, with sum() returning null for no rows, avg() returning null makes perfect sense. And, because count(x) and non-count aggregates ignore nulls, processing only null values also yields null outputs:
Group by groups nulls together, just like order by:
The zero in the last row is interesting because even though there are two matching null rows shown by count(*), count(x) ignores nulls, so it returns zero. The other null aggregate return values in that row are not because the nulls are being processed, but rather because no non-null values are seen.
View or Post CommentsMonday, January 14, 2013
Having looked at the null comparisons, let's see how they behave in ordering and indexing. For ordering purposes, in a bow to practicality, nulls are considered to be equal to each other in terms of grouping, because, hey, if they were not, ordering them wouldn't be very useful.
Nulls can even be specified as coming first, rather than last:
For unique indexes, nulls are considered not equal to each other, and multiple null can be stored in a unique index:
In Postgres, nulls are stored in indexes and is null is an indexable operation. Of course, for null index lookups to be useful, there must be a small percentage of null values in the indexed column.
View or Post CommentsFriday, January 11, 2013
I just covered how comparisons with nulls return more nulls, so how do you test for null? There are actually two clauses designed specifically to compare nulls.
The first is the is null comparison clause. It is a clause rather than an operator because I mentioned earlier that operators on nulls return nulls. Here is how it works:
You can see that an is null comparison on a null returns true, which allows the where clause to evaluate the comparison as true. is not null allows null values to be excluded.
Another null comparison clause is is distinct from:
is distinct from basically performs null comparisons in a two-valued logic system. For non-null inputs, it operates like a not-equal comparison. This allows for cleaner handling of the not-equals row comparison we did in an earlier blog entry:
This returns all rows where x does not equal 30, or is null; in fact, this could have been rewritten as where x <> 30 or x is null. is not distinct from is the opposite (an equality comparison), though it sounds like a double-negative — think of it as the same as:
However, since = 30 already excludes nulls, it doesn't seem as useful, but it does allow you to do equality comparisons which considers nulls as equal:
This is particularly useful when joining two columns that might contain nulls.
View or Post CommentsWednesday, January 9, 2013
Pg_Upgrade's copies or links user data and index files from the old to the new cluster, and of course this can be slow, especially the default copy mode. Previous releases of pg_upgrade did the copying/linking serially.
Building on my recent pg_upgrade --jobs mode addition, I have just applied a patch to allow each tablespace to be copied/linked in parallel. Here are the performance results with two tablespaces on different magnetic disks, in copy mode:
The test machine has 24gb of ram, and you can see the patch has little benefit until the kernel cache starts to fill at 16gb.
This is my last planned pg_upgrade speed improvement. While performance tuning isn't done until the execution time is negative
, I will need to find new ideas to improve pg_upgrade performance further. In summary,
Postgres 9.3 will include a pg_upgrade optimized for database clusters using tablespaces and with many tables and databases.
View or Post CommentsMonday, January 7, 2013
As a corollary to three-valued logic expressed in the previous blog, not in comparisons are a regular source of surprise. Consider this comparison of one against various values:
The first query has an obvious result, and looking at the second query, you can now understand why it would return null because of 1 <> null. Where the surprises come is when you aren't looking at explicit nulls but rather a subquery generating a null:
Combining multi-value comparison and nulls, here is an example with the subqueries returning two rows:
The first two queries makes sense because there are matches. The last two queries have no matches, so they expand to:
The first case is clear that 'a' does not equal 'b' or null. What really trips people up is the second case, where 'a' <> 'b' and 'a' <> null returns null, rather than true. The proper solution is never to allow not in subqueries to return nulls, usually by excluding them in the subquery where clause.
View or Post CommentsFriday, January 4, 2013
Nulls exist in a three-valued logic system, which frankly sounds scary. A two-valued logic system, where things are either true or false, is very easy to understand. Three-valued logic sounds like something is wrong, and in an sense something is wrong because it injects uncertainty (null) into a logic system which normally only handles certainty. This is like the flat-nose pliers I mentioned earlier — ideally we could do everything with two-valued logic, but there are cases where three-valued logic, like pliers, is necessary, and we have to be able to deal with such cases:
The first three examples make sense if you assume that null means unknown — we don't know if the unknown value equals '1' or '', and we don't know if the two null (unknown) values are equal. The last one is harder to explain — clearly any numeric is larger if one is added to it, but we don't know if that null value might mean not applicable rather than unknown, and we don't even know if it is a numeric which supports addition, so again, returning null makes sense. Also, null is used for unknown/unassigned values all over the system, so there is no guarantee that the two nulls came from the same place and represent the same value. Assuming too much about the meaning of a null value is usually where problems start.
If the above comparisons return null, how do they behave in query comparisons?
The where clause returns the single row when passed true, but false and null return nothing. It has been suggested that the where clause should be called the sure clause because it only returns values it knows for sure are true. Null behaves like not true in boolean operations, so true or null returns true, while true and null returns null, which behaves like false when it reaches the where clause. Keep in mind that not true is not the same as false, e.g. not null is null, not true. Again, these are not surprising results, but they do suggest more complex queries that could cause surprises:
Two-valued logic suggests that x is either positive, zero, or negative, but three-valued logic allows null values which do not fit into any numeric range; strings and dates have similar behavior. In fact, all data types can be assigned null values, and all operator comparisons with null values return null. These examples are even more bizarre:
The first example kind of makes sense because we don't know if the null represents 10, but the second example would (two-valued) logically have x equaling 10 or not equaling 10, and you would think one of them is true. Null just doesn't work that way, and again shows that making assumptions about null leads to trouble. Don't assume you can split a relation into two parts without considering nulls.
If you stay with non-null values, you never see this, but once a null is involved, the three-valued behavior becomes very clear. This reinforces the advice that it is wise to limit the use of nulls in your schema, e.g by using not null, and to be very careful when nulls are involved.
View or Post CommentsWednesday, January 2, 2013
Having covered why nulls exist and how they can be created explicitly and implicitly, let's look at how nulls behave by looking at some specific examples. I previously showed that you can specify null explicitly, but what happens when nulls interact with non-nulls:
Above, you can see that null can interact with any data type, and it doesn't have a data type of its own. If you substitute the idea of unknown for null, the answers even make sense, e.g. unknown + 1 is unknown. While the above queries seem reasonable, you can start to see how "surprises" happen when the null is stored in a column:
You can see the increment did nothing to the value in the last row. Add a where clause, and things really get complicated — I will cover that next.
View or Post Comments