Thursday, September 16, 2010

Generating ID sequences for Postgres using JPA annotations

This started out as yet another too-long blog post, so I'm going to refactor it by cutting to the chase:
@MappedSuperclass
public abstract class PostgresDomainObject {
    @Id
    @GeneratedValue(strategy=GenerationType.IDENTITY)
    @Column(insertable=false, updatable=false,
            columnDefinition="BigSerial not null")
    public long id;
    ...
}
The BigSerial data type in Postgres is shorthand for the following:

CREATE SEQUENCE tablename_colname_seq;
CREATE TABLE tablename (
    colname bigint NOT NULL DEFAULT nextval('tablename_colname_seq')
);
ALTER SEQUENCE tablename_colname_seq OWNED BY tablename.colname;

The key to making this work is GenerationType.IDENTITY. The insertable and updatable attributes don't seem to actually do anything (at least not without IDENTITY); I've left them in as an indication of how the column behaves. But using IDENTITY as the generation type causes Hibernate to omit the id column from its insert statements, which means the column's default, nextval() on the table's sequence, will get used.
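To make the mechanics concrete, here's a sketch of the same pattern at the plain JDBC level: the insert never mentions the id column, so the BigSerial default fills it in, and the driver hands the generated value back. This is just an illustration of the behavior, not what Hibernate literally executes, and the "customer" table and "name" column are hypothetical:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class IdentityInsertSketch {

    // Insert a row without listing the id column, so the column default
    // (nextval on the table's dedicated sequence) supplies the value,
    // then read the generated key back from the driver.
    public static long insertAndReturnId(Connection conn, String name)
            throws SQLException {
        String sql = "INSERT INTO customer (name) VALUES (?)";
        try (PreparedStatement ps =
                 conn.prepareStatement(sql, Statement.RETURN_GENERATED_KEYS)) {
            ps.setString(1, name);
            ps.executeUpdate();
            try (ResultSet keys = ps.getGeneratedKeys()) {
                keys.next();
                // With a customer(id bigserial, name text) layout, the
                // first generated-key column is the new id.
                return keys.getLong(1);
            }
        }
    }
}
```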

The real beauty of this is that classes that extend PostgresDomainObject and declare @Entity will each get their own dedicated sequence. If I had declared the column as a bigint and used GenerationType.AUTO, Hibernate would have created a single sequence, called "hibernate_sequence", and that sequence would have been used by every table. If I had declared the base class as @Entity and used the Hibernate/JPA table inheritance model (an atrocity I may rant about in some future post), I would again have been limited to a single shared sequence.
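A concrete subclass needs nothing beyond @Entity to pick this up. In this sketch, Customer and its name field are hypothetical; Postgres will create a dedicated customer_id_seq for the table's id column:

```java
import javax.persistence.Entity;
import javax.persistence.Table;

// Hypothetical entity: by extending the mapped superclass it inherits
// the BigSerial-backed id column, and Postgres backs that column with
// its own customer_id_seq rather than a shared hibernate_sequence.
@Entity
@Table(name = "customer")
public class Customer extends PostgresDomainObject {

    private String name;

    public String getName() { return name; }
    public void setName(String name) { this.name = name; }
}
```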

There are two good reasons for wanting to have a dedicated sequence for each table instead of a single shared one:

1) Performance - having a sequence generator dedicated to each table helps to prevent contention when there are multiple concurrent requests for new sequence numbers

2) Maintainability - sequences are often allocated in batches (say, 50 at a time) and some errors can cause an entire batch to be skipped. If there is a recurring error on a busy system, it's even possible to exhaust the sequence. It's also possible to exhaust the sequence on a very large table that has a lot of turnover. In those cases, unlikely as they may be, it's a lot easier to track down the problem and fix it if each table has its own dedicated sequence.

If the database in question is a simple, low-volume data store, none of this matters much. But the systems I care about are at the other end of the spectrum - either high-volume OLTP systems or data warehouses that have tables with millions of rows and high levels of concurrent access.