Friday, July 15, 2011

To shard, or not to shard: that is the question

When dealing with very high volumes of data, one usually needs to decide how the system is going to scale when the data grows beyond a reasonable size. In my current project we’re working with tables that reach 1B records.  I’m talking about MySQL 5.5.

After initial load-testing and profiling we had our first “oh, shit” moment. When tables are in hundreds of millions of records, regular “laws of physics” don’t always apply. So, we desperately started looking at ways to alleviate these problems.

Clearly, the most scalable method is sharding. Sharding is usually implemented at the application level in such a way that the application knows on which node a particular piece of data resides. A common example is placing users 1-1M on shard 1, users 1M-2M on shard 2, etc. This method allows for theoretically unlimited amounts of data.

Now, the bad part. Implementing sharding is pretty complex and time consuming. Most importantly, sharding forces breaking up the data by certain criteria (like users_id). All queries that specify the user are OK, because we know where to go. But, queries that don’t specify user IDs must be executed on ALL shards. That is a major point of pain. Of course, you can come up with a smart parallel execution strategy, but then you’re effectively entering a realm of programming of distributed databases.

There are some ready alternatives: ScaleDB and Gizzard are examples of NewSQL, and are layers that reside between the application and mulitiple MySQL nodes, and know where to execute the queries. Then, of course, there is a plethora of NoSQL solutions that are capable of distributing the data: Cassandra, MongoDB, etc. In our case, increased complexity associated with introducing new systems into the project, would outweigh benefits of these systems.

So, what’s a desperate developer to do in such a case? Well, in some cases it’s possible to fall back on time-tested ways of reducing table size by means of archival. Archival does not mean that the archived data is not accessible. It might just be accessed from another table. Also, sometimes it helps to partition the table on MySQL level. While it’s far short of the benefits that complete sharding would yield, it does help with performance of queries with partitioned column.

The conclusion is that while sharding is very powerful and scalable technique for dealing with large data volumes, it involves a lot of complexity and side-effects. A careful examination of alternatives is in order. Be it NewSQL, NoSQL, or plain OldSQL (archival, DB-level partitioning), it’s worthwhile to carefully analyze tradeoffs of each solution. 


2 comments:

  1. Great post. One point to add is that there are off-the-shelf solutions for sharding - which give you sharding value without the hassle. ScaleBase offers such a solution (disclosure - I'm working there)

    ReplyDelete
  2. Thanks for providing such useful information. I really appreciate your professional approach.
    garment retailing software

    ReplyDelete