Thursday, September 16, 2010

Generating ID sequences for Postgres using JPA annotations

This started out as yet another too-long blog post, so I'm going to refactor it by cutting to the chase:
@MappedSuperclass
public abstract class PostgresDomainObject {
@Id
@GeneratedValue(strategy=GenerationType.IDENTITY)
@Column(insertable=false, updatable=false,
columnDefinition="BigSerial not null")
public long id;
...
}
The BigSerial data type in Postgres is shorthand for the following:

CREATE SEQUENCE tablename_colname_seq;
CREATE TABLE tablename (
colname integer NOT NULL DEFAULT nextval('tablename_colname_seq')
);
ALTER SEQUENCE tablename_colname_seq OWNED BY tablename.colname

The key to making this work is GenerationType.IDENTITY. The insertable and updatable attributes don't seem to actually work - I've left them in as a indication of how the column behaves, but they don't seem to have any effect (at least not without IDENTITY). But using IDENTITY as the generation type causes Hibernate to not include the id column in insert statements, meaning that the default will get used.

The really beauty of this is that classes that extend PostgresDomainObject and declare @Entity will get their own dedicated sequence. If I had declared the column as a bigint and used GenerationType.AUTO, it would have created a single sequence, called "hibernate_sequence", and that sequence would have been used by every table. If I had declared the base class as @Entity and used the Hibernate/JPA table inheritance model (an atrocity I may rant about in some future post), I would again have been limited to using a single shared sequence.

There are two good reasons for wanting to have a dedicated sequence for each table instead of a single shared one:

1) Performance - having sequence generators dedicated to each to table helps to prevents contention when there are multiple concurrent requests for new sequence numbers

2) Maintainability - sequences are often allocated in batches (say, 50 at a time) and some errors can cause the entire batch to be skipped. If there is a recurring error on a busy system, it's even possible to exhaust the sequence. It's also possible to exhaust the sequence on a very large table that has a lot of turnover. In those cases, unlikely as they may be, it's a lot easier to track down the problem and fix it if it's a dedicated sequence.

If the database in question is a simple, low-volume data store, none of this matters much. But the systems I care about are at the other end of the spectrum - either high-volume OLTP systems or data warehouses that have tables with millions of rows and high levels of concurrent access.

Monday, August 30, 2010

Innovation in the MySql market

The previous post might have made you think I don't care much for MySql, and you'd be right. But there is some interesting innovation going on in the MySql world - it's just not in MySql itself.

Drizzle looks like it could be to MySql what FireFox was to Mozilla - a cleaned-up, stripped-down version that improves all the good stuff and leaves the dreck behind. Here's a terrific talk by Brian Aker (former architect for MySql and one of the movers behind Drizzle) that made me want to go out and try Drizzle in spite of my aversion to MySql.

Infobright is one of the new breed analytic DBMS's. It's build on MySql, but uses its own column-oriented storage engine, data compression, and optimizer. It looks interesting, and I see that one of their technical advisors is Roger Bodamer, who I worked for at OuterBay Technologies - a startup that was bought by HP in 2006. Roger is also the SVP of Engineering at 10gen, the company behind MongoDB - so it looks like he's sitting on both sides of the new SQL/NoSql fence. Smart guy.

Big Changes in the Data Warehouse Space

It's been about 2 1/2 years since I looked really seriously at the data warehouse/BI space. At the time, things seemed very stagnant - there were the big 3 or 4 vendors (Oracle, IBM, Microsoft, maybe Teradata) and a host of smaller players who either were getting gobbled up (Hyperion, Business Objects, Cognos, etc.) or spiraling downward (Sybase, MicroStrategy).

In fact, this seemed to be true of the DBMS space as a whole. A market where the trendiest and most disruptive technology is MySql is not an interesting market.

Turns out that wasn't the case at all. Lots of interesting things were bubbling beneath the surface then and have been emerging ever since. The big news in the software development world is NoSQL/MapReduce/Hadoop, because it's been about a decade since Object-Oriented databases failed and so it's time for another round of "relational databases suck and here's what's going to replace them". (Actually, I like the whole NoSQL thing. But it's not a replacement for DBMS's - instead, it's a big part of the solution for a new class of data storage and management problems. But I digress).

What I think is more interesting is the new breed of analytic DBMS engines - Vertica, GreenPlum, AsterData, ParAccel, VoltDB, InfoBright, Netezza - and the attempts by the established vendors to keep up (Oracle Exadata, and Microsoft's acquisition of DATAllegro and subsequent release of SQL Server 2008 R2 Parallel Datawarehouse). Weave in threads like the re-birth of the database appliance, the emergence of Solid-State Disk as a mainstream technology, and the adoption/adaptation of MapReduce by some of the above vendors, and now we've got something really interesting.

What's interesting about the technology of the new breed is the way different architectural trends are being synthesized together. Massively Parallel Processing (MPP) and shared-nothing architectures are nothing new in the database world, but here they've been strongly influenced by the architecture of scaling out on commodity hardware that owes more to Google and the other big web sites. Columnar storage, compression, and tiered storage schemes aren't new ideas either, but combining them together, especially in the context of a database appliance and/or the cloud is a big step forward.

The other really interesting development is a new approach towards analytics, known as "MAD" - Magnetic, Agile, and Deep. I won't pretend that I understand what it really means yet, other than to note that this is another area of software development where dissatisfaction with the status quo has resulted in a new wave of people and products who have adapted an Agile world-view.

I think EMC's acquisition of GreenPlum is an inflection point that, along with the success of Vertica and Aster Data (and the large installations that all three of them now have), legitimizes the new breed to the mainstream datawarehouse customer.

As an aside, it's worth noting that a lot of the interesting stuff going on can be traced either directly or indirectly to Mike Stonebraker. Vertica and VoltDB are his latest projects, GreenPlum and AsterData are built on top of Postgres, ParAccel also uses some Postgres technology, and DATAllegro was originally built on top of Ingres.