Just what do you need an RDBMS for, anyway?

If you are in the business of writing anything like an "enterprise application" (and these days, that's pretty much every programmer who's target platform doesn't come sealed in epoxy) then you store your data in an RDBMS.

And if you're at all up to speed with developments in the art of programming over the past thirty-odd years you write the application "logic", middle tier, call it what you will, in an object-oriented language (or, at least, something that looks quite a lot like one from a sufficient distance). And so you spend a deal of your time dealing with a thing called the "Object-Relational Impedance Mismatch".

There's money to be made from this. Oracle recently purchased what was originally a Smalltalk O-R mapping tool, Toplink. Like most of its ilk it allows programmers to maintain (by mostly manual processes) a huge pile of XML relating a bunch of tables in a database to a bunch of classes in a codebase. I focus on Toplink because that's where my experience with the topic largely lies, not through any great love or hatred of the product. Well, anyway, the aim of these tools is "transparent persistence", whereby modifications to the mapped objects make their way to the database automagically. (*) So, there's this arbitrary graph of objects that gets mapped onto this strict table-based model, via hierarchical text-based metadata. Ooo-kay.

"Translucent" Persistence

Except that the object graph isn't arbitrary. And the "transparency" is of a rather murky kind. "translucent persistence" might be a better term. At the time that you write the classes from the object graph will be instantiated the mapping tool will impose all sorts of constraints on what you can do, and how you do it. This is very far from the promise of orthogonal persistence. What happened to JSR 20, that's what I want to know.

For instance, if you want to get hold of an object that (you have to know) happens to be backed by the underlying database then what you do is obtain an entry point to the data, and search over it for the object you want. Unlike the navigation that you might expect to do in an object graph, instead you build...queries. Like this:
ExpressionBuilder builder = new ExpressionBuilder();
Expression expression =
Person p =
(Person) aSession.acquireUnitOfWork().readObject(
Yup, that's pretty transparent alright. Can't tell that there's anything other than plain old in-memory objects there. For Java fans: isn't it especially wonderful how the programmer had to mention the name of the class three times? So that it will be true, presumably.

Building these query objects can be a pain, though, if they get all complicated--which they do, always. I know one team that ended rolling its own library of query-building classes. Had a load of methods on them with names like "select" and "orderBy" and so forth...

Why do programmers put up with this? A lot of the time, habit. Which is to say, not thinking terribly hard about what it is they do that adds value.

To be fair, the modern RDBMS is a pretty amazing thing: high performance in time and space, robust, reliable, available...it's just that a lot of application programmers are a bit confused, I believe, about where the benefit of using an RDBMS lies.

If only an RDBMS were Realational

While it's true that what we have today in the widely used RDBMSs is a pale reflection of what the relational model is capable of, they are nevertheless closely related. So let's take a look at the ur-text of relational theory, Codd's paper of 1970 (back when "data base" was still two words). What's of interest about it is that it doesn't talk about persistence much at all. There is mention in passing that the data is "stored" in the "data bank" but the focus throughout is on the structure of the data, how that structure can be preserved in the face of updates to data and to what we would now call the schema, and how users of the data can be isolated from the details of how the data is held and changed.

Flowing from this are two principle advantages of using an RDBMS to store data:
  • ease of framing ad-hoc queries for decision support
  • robustness in the face of concurrent updates from multiple sources
That the data put into the database yesterday is still there today (even though the server was powered off overnight) is secondary. Immensely useful, but secondary. If we adopt a stance whereby we focus on the A, C, and I in ACID, and soft-pedal the D, some more interesting thinking becomes possible.

What if the data "never" changed?

If you are indeed in the business of business of writing anything like an enterprise app (and if you aren't, how come you read this far?), ask yourself this: how much of the code data source layer (or whatever you call it) of your stack could go away if the data it supplies to layers above it never changed? By "never", of course I mean "not during the lifetime of an application instance".

Now, the RDBMS crowd already distinguish between OLTP and OLAP, (and a very interesting debate that is, too--no really, it is) and the endpoint of moving in the OLAP direction is the data warehouse. And lot of regular RDBMS practice, predating the warehouse, allows for a highly optimised "read-only" route into the data involving a bunch of denormalized views and such. It's a failure mode to "denormalize for performance" even before the schema is built, never mind used for long enough to identify the frequent access patterns that will benefit from that denormalization, but when some user is going to need the same report updated time and again: go for it.

Data "corner shop"

Data warehouses are generally expected to be 1) gigantic, 2) concerned with transactional data that accumulates at a high rate (leading to 1) and 3) aimed at sophisticated reporting (the "analytical" bit of OLAP). The smaller volume, more task specific version is the ''data mart''.

Lets say that we are building a system where the data needed to satisfy the wants of an end-user changed over periods much longer that the duration of a user's typical interaction with the system. Why, then, would we worry about all that aSession.acquireUnitOfWork().readObject( blah blah blah stuff? We would have no need to do that for each user session. If the data changed slowly enough, there's be no need to do it for each instantiation of the (web) application itself. We can imagine the "transactions" getting consolidated in the warehouse (and condensed in the mart) might be things like "add a new item to the catalogue", or "change the price of this item". Happens on the timescale of days--or weeks. And furthermore, these changes might be known about, and actioned, some time in advance of being turned on for a particular application. But we have this idea avilable that access to the data that involves rapidly changing entities doens't have to go by the same route, or be maintained in the same way as that which changes slowly.

Phew! So, rolling all that together you can start to think about doing what the team who re-implemented SQL in Java whom I mentioned earlier did. Or, rather, what one of the more capable programmers did in his spare time because he was fed up with all that junk. Every now and again (say, once every 24 hours) run a bunch of queries (that's your data mart) over the data and pull out all the currently useful bits and write them to a file in one simple encoding or another. Then, when the application gets bounced, it reads that data and builds its in-memory object graph and uses that to serve user requests. Call out to the DB for really volatile stuff, but how much of that do you really, really have? Really?

Do this and I'll bet you find, as that team did when the databaseless version of their app was deployed, that it goes faster and scales better. Oh, and by isolating the application (in their case, completely) from a database they were able to very easily pull out a chunk of functionality to sell as a stand-alone application and/or plugin to other systems and enabled a whole new product line for their business. Nice.

This sort of decision should be the work of architects, but too often wouldn't even come up for disucssion as an option. Why is that?

(*) No approbation of the rather sinister ESR or his dubious schemes should be construed from this reference to the Jargon File

1 comment:

keithb said...

Tim Bray dips a toe in the no-db water at http://www.tbray.org/ongoing/When/200x/2006/07/17/No-Databases