[GEEKERY] Separation of concerns when scaling a database...

topic posted Wed, December 6, 2006 - 8:17 AM by  Brian
We have been hard at work since our management change on a couple of different fronts, both business and technical. Mercifully, I have been on the technical. The work we have been doing is basically a thorough audit of the many software layers we have put in place over the last few years, figuring out which we should keep, which we should update, and which we should dump.

One thing that we didn't do in a consistent fashion across the board was our persistence layer. We started with Torque, an Object-Relational Mapping tool that has pretty much run its course. Torque couldn't do everything, so we had to throw some JDBC in there too. Then we began to move to Hibernate, a more modern and functional ORM tool. Add to this the fact that our database connection management is also done in different ways in different places, and it's pretty easy to see that one thing we need to invest time in is reworking our persistence layer.

In order to scale to the size we want to, as well as to accommodate future features and business models that we want to support on our platform, the persistence layer needs to scale. This means we need to partition things vertically across many different data segments, and we need to partition horizontally across many partitions within a segment. Ideally we would do all this in a way that is completely hidden from the application business logic. Below is a stream of consciousness that I spewed a few days ago about this problem that I wanted to put up here mostly just so that I have a log of what I was thinking about...

===

...There is also a question of how cross-segment collections would be handled. If the idiom would call for something like member.addProfile(p), do I want to support adding objects to cross-segment collections? Or would it be better not to expose the addProfile() method on the net.tribe.beans interface and instead require creating the Profile bean first, then adding the Member object to that bean? My gut says that since cross-segment collections could result in selecting single objects across many partitions, we don't want to make the retrieval of that collection an "under-the-sheets" operation. That is in contrast to the same-segment collection (which presumes same-partition, incidentally), where Hibernate will just be able to do business in the normal way.
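To make that choice a bit more concrete, here is a rough sketch of what the second option might look like. Everything in it is hypothetical: Member, Profile and ProfileFinder are made-up stand-ins, not the real net.tribe.beans interfaces, and the finder uses an in-memory list where a real one would query the profile segment.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical bean; deliberately has no addProfile()/getProfiles(),
// because profiles may live in a different segment than the member.
final class Member {
    private final long id;
    Member(long id) { this.id = id; }
    long getId() { return id; }
}

// Hypothetical bean; the association is set from the Profile side,
// so only the profile's own record needs to change.
final class Profile {
    private final String name;
    private Member member;
    Profile(String name) { this.name = name; }
    String getName() { return name; }
    Member getMember() { return member; }
    void setMember(Member member) { this.member = member; }
}

// Explicit cross-segment lookup. The list is an in-memory stand-in
// (an assumption) for whatever storage backs the profile segment.
final class ProfileFinder {
    private final List<Profile> stored = new ArrayList<>();

    void save(Profile p) { stored.add(p); }

    // The caller has to ask for this on purpose; it never happens
    // as a side effect of touching a Member.
    List<Profile> findByMember(Member m) {
        List<Profile> result = new ArrayList<>();
        for (Profile p : stored) {
            Member owner = p.getMember();
            if (owner != null && owner.getId() == m.getId()) {
                result.add(p);
            }
        }
        return result;
    }
}
```

The point of the shape is only that the potentially expensive fetch shows up as a visible call (findByMember) rather than as a lazily loaded collection on Member.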

This actually brings up an interesting point about the need for a layer like this as opposed to letting the database or some other layer do all the data segmentation work. If you put in a 3rd party tool to do things like retrieve collections that may exist across many partitions, something is going to have to figure out which partitions to go to for that information. This means that the virtualization layer may need to go to every partition it knows about to make that query, which may result in some pretty heavy lifting in that layer, along with some extra work on each of the partitions that really didn't need to happen at all. By putting some smarts in our persistence layer, we are able to dumb down these requirements and include more simplifying assumptions about how to recover data sets across many different partitions.
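A small sketch of the difference, under some loud assumptions: PartitionRouter is a made-up class, and "member id modulo partition count" is just the simplest placement rule I could pick for illustration, not a decision about how we would actually place data.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical router; the modulo placement rule is an illustrative assumption.
final class PartitionRouter {
    private final int partitionCount;

    PartitionRouter(int partitionCount) {
        if (partitionCount <= 0) {
            throw new IllegalArgumentException("partitionCount must be positive");
        }
        this.partitionCount = partitionCount;
    }

    // Application-aware path: the persistence layer knows the data hangs off
    // a member, so the member id pins the query to exactly one partition.
    List<Integer> partitionsForMember(long memberId) {
        List<Integer> targets = new ArrayList<>();
        targets.add((int) Math.floorMod(memberId, (long) partitionCount));
        return targets;
    }

    // Application-agnostic path: a generic layer that cannot derive the
    // placement key has to ask every partition it knows about.
    List<Integer> partitionsForUnknownKey() {
        List<Integer> targets = new ArrayList<>();
        for (int i = 0; i < partitionCount; i++) {
            targets.add(i);
        }
        return targets;
    }
}
```

With eight partitions, the first method touches one of them and the second touches all eight, which is the "extra work that didn't need to happen" in a nutshell.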

Issues such as load balancing and failover are probably good candidates for a third party (application agnostic) system, since those systems will provide the "hard part" - i.e. failure detection and automatic redirection of requests to the hot backup. Load balancing across replicated instances is also something that could be accomplished through a third party (or even by a hardware solution like setting up a virtual server on our f5) because again, there is nothing particular about the application that a load balancing system could use to its advantage. Rather it is more an issue of the load balancing software (or hardware) knowing how to figure out the optimal target for a given request.

So when considering a scalable solution for the application to database interface, there are a few different concerns that we can separate out:

* vertical partitioning scheme: How do we split up tables onto different machines? This is a very application-specific concern and needs to take into account things like how quickly the tables will grow as well as how the application domain views those objects, even up to the UI level. There could be data isolation concerns bound into this as well.
* horizontal partitioning scheme: We also need to ensure that we are able to split up the same table into different partitions so that its growth across reasonably priced hardware is boundless. This type of work may be offloaded to a third party product depending on the requirements. I.e. if the partitioning algorithm is simple then the RDBMS could probably handle it. This is the case if you are viewing your db as a single object and it just needs to be able to grow to be very large. If, however, you or your customers require some level of data isolation (perhaps, say, for reporting or pricing needs) then the application will need to be more involved.
* load balancing: How do I spread queries across different readers? Clearly something the application doesn't need to know about.
* failover: How do I account for the loss of a node? Clearly something the application doesn't need to know about.
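
For the two concerns that do belong to the application, here is a minimal sketch of how they might compose: a vertical map from entity type to segment, then a horizontal choice of partition inside that segment. SegmentMap, DataLocation, the segment names and the modulo rule are all assumptions made up for the example. Load balancing and failover are left out on purpose, since the idea is that they sit below this layer.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical value type: where a given object lives.
final class DataLocation {
    final String segment;
    final int partition;
    DataLocation(String segment, int partition) {
        this.segment = segment;
        this.partition = partition;
    }
}

// Hypothetical lookup combining both partitioning schemes.
final class SegmentMap {
    private final Map<String, String> segmentByEntity = new HashMap<>();
    private final Map<String, Integer> partitionsBySegment = new HashMap<>();

    // Horizontal concern: how many partitions a segment is split into.
    void defineSegment(String segment, int partitions) {
        if (partitions <= 0) {
            throw new IllegalArgumentException("partitions must be positive");
        }
        partitionsBySegment.put(segment, partitions);
    }

    // Vertical concern: which segment an entity type belongs to.
    void assign(String entity, String segment) {
        if (!partitionsBySegment.containsKey(segment)) {
            throw new IllegalArgumentException("unknown segment: " + segment);
        }
        segmentByEntity.put(entity, segment);
    }

    // Resolve segment first, then partition within it (modulo is an assumption).
    DataLocation locate(String entity, long key) {
        String segment = segmentByEntity.get(entity);
        if (segment == null) {
            throw new IllegalArgumentException("unmapped entity: " + entity);
        }
        int count = partitionsBySegment.get(segment);
        return new DataLocation(segment, (int) Math.floorMod(key, (long) count));
    }
}
```

Entities assigned to the same segment and keyed the same way land on the same partition, which is the same-segment, same-partition case where the ORM can work normally.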
posted by:
Brian
SF Bay Area