Wednesday, January 21, 2009

Implementing Sharding in the Database

Over the past few weeks (years really) there has been some discussion on sharding. Instead of discussing when sharding is required, as there are good discussions on this already, I want to discuss how I would like to have sharding implemented in the database.

I want the database to handle sharding automatically, and where it can't be automatic, I want the database to help as much as it can.  Just like I want my business logic in a language ideally suited to it, and not stored procs (generally, there are always exceptions); I want all my physical persistence to be handled by the place that already does most of it, the database.  Having the database handle some of the physical persistence and the object relational layer handle the sharding logic isn’t ideal to me, and not just because the current object relational layers don’t have all the sharding functionality I want.  So here is what I want my database to do.  

1) I want to automatically route simple queries to the one shard that has the relevant data.  Hibernate shards already does this; I just want it implemented in the database.  The MPP (massively parallel processing) version of DB2 can also do this.

2) I need the ability to reshard data when a rebalancing of the shards is required, without downtime.  This is similar in concept to an index rebuild being done while the database is still up, which the expensive databases have implemented.  This is more complex than an index rebuild as multiple tables will need to be resharded before the resharding is complete. DB2 has functionality to reshard, but it requires down time last time I checked (which was a few years ago) .

3) Finally, and this is a superset of #1, I want to be able to run one SQL statement across multiple shards at the same time.  The database will send the SQL to all the relevant shards, run as much of the SQL on that shard as possible, using something like the function shipping that DB2 MPP has, and then aggregate the data from the various shards into one result set.  All I need to do this is execute one SQL statement.  Stated differently, all the shards will look like one database.  This won’t be easy to implement, and will probably be limited in functionality at first, but this long term goal is why implementing sharding in the database makes sense. 

This might sound a bit like an advertisement for the MPP version of DB2, but if it were open source I would at strongly consider implementing it for applications that need database sharding. Given that that is not likely to happen, I hope a reliable open source database, like MySQL, or Drizzle, or PostgreSQL, starts implementing this reasonably soon.

This sharding functionality, for the types of problems I’m seeing, is more important than any other new MySQL functionality, including better scaling on multi-core systems. With all above functionality, I can reasonably easily have multiple shards on one machine. Not that I don’t want much better multi-core CPU scalability, I do, as more shards are still harder to manage than less shards, it just isn’t as important as better sharding functionality.  I absolutely need the ability to shard; I don't absolutely need the ability to effectively use every CPU on a server with one shard. 

On another note, I’ll be giving a presentation on table partitioning at the MySQL conference.  This won’t be a basic presentation on how to implement partitioning, instead I’ll be highlighting scenarios where partitioning will degrade performance and cases where it improves performance.  I’ll show examples where, in spite of degraded performance for some tasks, it still might make sense to implement it.  An analogy here would be indexes, where adding an index can speed up some selects at the expense of slowing down inserts and some updates.