Monthly Archives: May 2011

How redis can ruin your day, and what you can do to fix it

Over the past few years, Redis has become one of the internet’s more popular RAM-based NoSQL datastores, owing largely to its ease of deployment, the abundance of client libraries available in a multiplicity of flavors (we use ezmobius’s redis-rb gem), and perhaps most importantly, the flexibility of its data structures. Compared to something like memcached, a cache key in redis can correspond to a single value (string or integer), a list (an array of values), a set (an ordered or unordered group of non-repeating values), or a hash (a set of N named fields, each storing a separate value).

For many of you, none of that is necessarily news, and even if it is, the internet abounds with redis how-to’s and introductions, so instead of rewriting what’s already been written, I’d like to share with you what we’ve learned about what I’d call “the dark-side” of Redis, the side that you only get to see after the two of you have had a few too many drinks at a hotel bar, and things start to get real weird, real fast. Here at Miso we’ve been using Redis long enough to have had at least a few of these awkward moments with it, and although they’ve never been uncomfortable enough to make us consider replacing it altogether, they have been major points of frustration at times. This post is my attempt to provide a first-hand account of redis’s sordid underbelly, in the hopes that you may be able to avoid some of the issues we’ve grappled with (and continue to) over the last year.

Where’s my memory?

One of the most confounding aspects of redis for the beginner may be the unpredictable and at times incomprehensible relationship between the memory footprint of redis-server and the actual amount of data being stored.  This was originally the impetus behind most of the high-level analysis we performed; we were perpetually running out of RAM on our caching server, but we knew (according to this script) that we were only storing a couple of gigabytes of values across all of our redis databases.  Sure, we expected redis to use a little extra memory to take care of metadata like key expiries, and other stuff, but we consistently saw redis-server using up to 5-10x as much memory as we would expect intuitively.

To understand the issue better, I began running a series of tests designed to examine how redis allocates memory given various datasets.  The idea was to populate redis with a bunch of records containing random data, using both strings and hashes (these were the only data structures that we were interested in using), and then measure the memory footprint in relation to the total amount of “stuff” (characters) that we saved. At the outset we were most interested in discovering what parameters/configuration yielded the most efficient storage performance (our metric was bytes/character – a value I’ll refer to as ‘overhead’).  Below are three graphs, comparing the total number of records, key size, and value size to overhead:

Certain patterns leap out almost immediately; for instance, just about any way you slice it, the smaller the total amount of data being stored, the larger the overhead. Conversely, the ‘overhead’ just about always decreases asymptotically toward 1 byte/character as the amount of data being stored increases. This makes perfect sense, as there is a “base footprint” that redis requires no matter what, and as the dataset grows, there is a more well-defined relationship between the actual amount of data contained in redis and the memory it consumes.

We can also infer (with some help from the redis documentation) that more “continuous” data is stored more efficiently.  For instance, if we need to store 2 million characters, it is more efficient to store it this way:

1000 records * ( 100 characters per key + 1900 characters per value)

than this way:

10,000 records * ( 100 characters per key + 100 characters per value)

This is all pretty consistent with the recommendations in the redis documentation.
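As a quick sanity check on the arithmetic, here is a minimal Ruby sketch of the two layouts and of the ‘overhead’ metric defined earlier. The helper names `total_characters` and `overhead` are made up for this illustration; they are not part of redis or redis-rb, and no memory measurement is implied.

```ruby
# Total characters stored for a dataset of uniformly sized records.
def total_characters(records, key_chars, value_chars)
  records * (key_chars + value_chars)
end

# The 'overhead' metric used in this post: bytes of memory per character stored.
# memory_bytes would come from a real measurement of redis-server.
def overhead(memory_bytes, characters)
  memory_bytes.to_f / characters
end

few_large  = total_characters(1_000, 100, 1_900)
many_small = total_characters(10_000, 100, 100)

puts few_large   # 2000000
puts many_small  # 2000000
```

Both layouts hold the same 2 million characters; the difference in footprint comes entirely from the per-record cost redis pays for each of the ten times as many records in the second layout.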

To Hash or not to Hash

It also became clear from our tests that with randomized keys and values, hashes have slightly higher overhead than strings, and once again this makes sense, as hashes contain more “metadata” (information about how the data is structured), and that comes at a cost.

This seems contrary to the information provided by the redis documentation (see “Use Hashes when Possible”), which suggests that for recent versions of redis (2.2 and higher), hashes are far more efficient than strings, but keep in mind that we have been generating completely random data  (essentially noise) up until this point, and noise is, by definition, incompressible.

The story changes quite a bit when you have a non-noisy  set of cache keys that can be considered compressible, for example:

    user:1:last_signin => "Last Thursday"
    user:1:favorite_color => "Blue"
    user:1:name => "Justin"

In these cases, it’s obvious to anyone familiar with hashes that the same data could be structured like this in something like JSON:

user:1 => {name: "Justin", favorite_color: "Blue", last_signin: "Last Thursday"}

This format obviates the need to repeat “user:1” for each value being stored, in theory reducing the amount of overall data redis needs to record.

To test this hypothesis, I generated data for 100,000 users, each with 5 fields holding randomized strings of 10 characters, using hashes first:

(user:1 = {:field0 => "dugf4dfgv3", :field1 => "oiw2335hnb"….})

then flat strings, with the field embedded in the key (user:1:field:0 = “).  The hashified example had a memory footprint of 21 MB, compared to 61 MB for the flattened data, a savings of about 2/3.  The same test with 33,000 records and 10 fields produced 10 MB of data when hashes were used as opposed to 41 MB for the flattened data, once again a very significant reduction.  The lesson to take away from all of this is to use hashes whenever it makes sense.  If you are creating multiple records for values that all correspond to a single object (a user in our example), a hash is probably the better alternative.  If you have a significant amount of data (more than 10,000 records), you will absolutely reduce the amount of memory used by redis.
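To make the two layouts concrete, here is a small plain-Ruby sketch that regroups flat keys into one hash per object and counts the characters spent on key names in each layout. The helper `group_into_hashes` is hypothetical, and it assumes the field name is everything after the last colon. With a client such as redis-rb, each grouped entry would then be written as a redis hash (for example via `hmset`) rather than as several separate strings.

```ruby
# Regroup flat "object:id:field" keys into one hash per object.
# Assumption: the field name is everything after the last colon.
def group_into_hashes(flat)
  flat.each_with_object({}) do |(key, value), grouped|
    object_key, _, field = key.rpartition(":")
    (grouped[object_key] ||= {})[field] = value
  end
end

flat = {
  "user:1:last_signin"    => "Last Thursday",
  "user:1:favorite_color" => "Blue",
  "user:1:name"           => "Justin"
}

grouped = group_into_hashes(flat)

# Characters spent on key and field names in each layout.
flat_key_chars    = flat.keys.sum(&:length)
grouped_key_chars = grouped.sum { |key, fields| key.length + fields.keys.sum(&:length) }

puts flat_key_chars     # 50
puts grouped_key_chars  # 35
```

The character count is only a rough proxy. The savings measured above also depend on how redis encodes small hashes internally, which this sketch does not model.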

Is that it?

No, it most certainly isn’t.  There are a few other little gotchas that we’ve encountered along the way, some of which we still have no explanation for.  For instance, if your instance of redis-server is using a significant amount of the total memory available on your machine (we’ll say greater than 70%), you need to be VERY vigilant, as we have experienced huge leaps in memory consumption for seemingly no reason.  This weekend, within a span of 5 seconds, redis decided it wanted another 200 MB of memory without a single record being added to any of our databases.  The same thing happened twice more over the course of 24 hours, culminating in a whopping 20% size increase with no discernible cause, and that’s pretty significant for 4 GB of data.  We are still at a loss to explain what happened during this period.  If you plan on using redis in production, plan on having monit, god or something similar in place to keep an eye on it, just in case it decides it wants to be sneaky while you aren’t paying attention.

It’s also a smart idea to make frequent use of the redis-cli tool that ships with redis, to view the output of:

redis-cli info

This command will show you the amount of memory redis has allocated for your data (used_memory), the amount of memory the operating system reports the redis-server process as using (used_memory_rss), and the ratio of the two as an attendant fragmentation ratio.  It will look something like this:

used_memory:41825152
used_memory_human:39.89M
used_memory_rss:68186112
mem_fragmentation_ratio:1.63

The mem_fragmentation_ratio gives you an idea of how wasteful redis is being.  In many cases, despite how ugly it may sound, a simple restart will free up most of the memory that redis no longer needs, but hasn’t had a chance to deallocate yet.
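If you want to keep an eye on this from a script, the output is easy to parse. The following is a minimal sketch, assuming the “key:value” line format shown above; the helper names `parse_info` and `fragmentation_ratio` are made up for this example, and in practice the text would come from running `redis-cli info` rather than from a hard-coded sample.

```ruby
# Parse the "key:value" lines printed by `redis-cli info` into a Hash.
# Lines without a colon (blank lines, section headers) are skipped.
def parse_info(text)
  text.each_line.each_with_object({}) do |line, stats|
    key, value = line.strip.split(":", 2)
    stats[key] = value if key && value
  end
end

# Ratio of memory reported by the OS (RSS) to memory allocated by redis.
def fragmentation_ratio(stats)
  (stats["used_memory_rss"].to_f / stats["used_memory"].to_f).round(2)
end

sample = <<~INFO
  used_memory:41825152
  used_memory_human:39.89M
  used_memory_rss:68186112
  mem_fragmentation_ratio:1.63
INFO

stats = parse_info(sample)
puts fragmentation_ratio(stats)  # 1.63
```

Recomputing the ratio from the two raw figures reproduces the value redis reports, which makes it clear what the number means: here the process occupies about 1.63 times the memory redis has actually allocated.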

Another optimization-related note to bear in mind is that there are specific settings in the redis.conf file which you can use to tell redis how big you believe your hashes, lists or sets will be.  Redis will theoretically use these values to further optimize the storage of your data, saving even more space.  We haven’t found them to provide that much utility in our preliminary tests, but that doesn’t mean they offer none, and the documentation suggests that this configuration can actually be quite effective in reducing redis’s memory footprint.

Beyond that, there isn’t much else you can do once your data starts to become unmanageable, aside from dramatically rethinking the way in which you cache. Originally, redis offered a solution for datasets that were simply ALWAYS going to be too large for memory: redis virtual memory, which would write infrequently used values to disk and only store the most important, frequently accessed records in memory.  This turned out to be a bit of a flop, in that for our dataset, it took up to 30 minutes for redis to start up with the virtual memory enabled.  The creator of redis, antirez, is attempting to roll out a superior replacement to redis virtual memory, the redis diskstore in version 2.4, which should be in beta sometime later this year.

Until then, we are stuck with the somewhat scary proposition that redis will continue to outgrow our hardware (as it has in the past), and in that case, our options are either to optimize even further, buy more hardware, or drop it altogether.  Our best advice is to be as smart as possible from the beginning about how you use redis, and never make the assumption that because redis is so fast and lightweight for smallish datasets (100,000-1 million records), that it will continue to be for 10 million or more records.  Antirez himself states that redis was written in such a way that it is left to the developer to decide how he/she wants data to be stored:

But the Redis Way is that the user must understand how things work so that he is able to pick the best compromise, and to understand how the system will behave exactly.

- antirez

At a certain point it will start to become prohibitively annoying to make sweeping changes to your app while simultaneously modifying all of your historical data to comply with those changes, so perhaps the most important thing to remember is that you should never expect redis to magically solve your caching problems for you.

If you’re using to_json, you’re doing it wrong

At Miso, we have been very busy in the last few months building out a large number of public APIs for our Developer Platform. In a short time, we have already seen early versions of applications built on our platform for Chrome, Windows Mobile 7, Blackberry, Playbook and XBMC, among others. It has been very exciting to see the community embrace our platform and leverage our data to power additional services or bring our service to a new group of users. In this post, we will discuss how we started out building our APIs using Rails and ‘to_json’, why we became frustrated with that approach and how we ended up building our own library for API generation.

Our public APIs are designed to be unsurprising and intuitive for a developer. We chose OAuth 1.0a (soon to support OAuth 2) because this is already familiar to developers and there is rich library support across languages for this authentication strategy. The endpoints are for the most part RESTful, with GET retrieving and POST / DELETE used to modify the data associated with a user. We also tried to design our API responses as simply as possible, giving every attribute a readable name, keeping node hierarchies relatively flat and not including unnecessary information from our database schema. While these may seem like good design goals when building a public API, you might be surprised at how difficult they can be to achieve using Rails and the baked-in ‘to_json’ serialization that the framework provides.

Rails ‘to_json’ API generation

Let’s start by discussing the canonical approach Rails provides for generating APIs. The idea is to have the JSON and XML responses be deeply tied to the model schemas with one-to-one mappings in most cases between the database columns and the api output. This is great if you are working on internal APIs and you wish to dump the data directly into a response but less great for well-designed public APIs. Let’s look at how rendering with this to_json approach works in practice:

# app/controllers/posts_controller.rb
#...
  respond_to :json, :xml
  def index
    @posts = Post.all
    respond_with(@posts)
  end
#...

Now, when this controller action is invoked, the @posts collection automatically has the ‘to_json’ method called on it, which converts the ActiveRecord models seamlessly to JSON output. Default options can also be set in the model itself by overriding the as_json method:

# app/models/post.rb
class Post
  def as_json(options={})
    super(options.merge(:methods => [...], :only => [...], :include => [...]))
  end
end

This would then render the associated JSON response based on the options specified. In the simplest cases this would be all you need and the Rails way works without a problem. As long as your database schema is deeply coupled with the API output then all is well. As mentioned, there are a few options for customizing the to_json output:

  • only – Only show column names in the output as specified in this list
  • except – Show all column names except the ones specified in this list
  • methods – Include the results of these methods (called without any arguments) as nodes in the output
  • include – Add child nodes (potentially nested) based on associations within the object

Alright, to recap: There is a model which contains a specific database schema and some methods. There are limited options to transforming that into JSON as described above by passing a hash of options. These options can be passed in as defaults with the ‘as_json’ method in the model itself or through the controller action.
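To make the recap concrete, here is a toy, framework-free imitation of how the only, except and methods options shape the output. This is a simplified sketch and not Rails’ actual implementation: `ToyPost` and `title_length` are made-up names, the record is just a Hash of column values, and the include option is left out entirely.

```ruby
# A toy stand-in for a model: a Hash of column values plus ordinary methods.
class ToyPost
  attr_reader :attributes

  def initialize(attributes)
    @attributes = attributes
  end

  # An example method that could be exposed through :methods.
  def title_length
    attributes[:title].length
  end

  # Simplified imitation of the :only, :except and :methods options.
  def as_json(options = {})
    result =
      if options[:only]
        attributes.select { |name, _| Array(options[:only]).include?(name) }
      elsif options[:except]
        attributes.reject { |name, _| Array(options[:except]).include?(name) }
      else
        attributes.dup
      end
    # Each listed method is called with no arguments and added as a node.
    Array(options[:methods]).each { |name| result[name] = public_send(name) }
    result
  end
end

post = ToyPost.new(:id => 1, :title => "Hello", :body => "First post")

post.as_json(:only => [:title])
# => { :title => "Hello" }
post.as_json(:except => [:id], :methods => [:title_length])
# => { :title => "Hello", :body => "First post", :title_length => 5 }
```

Note that every knob here is either a column filter or a zero-argument method call, which is exactly the limitation the rest of this post runs into.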

The unravelling of ‘to_json’ begins

Well, wait just a minute: what if you want different output options in different API responses? What if you want to override the ‘as_json’ defaults? Not too bad, you can just do:

# app/controllers/posts_controller.rb
#...
  respond_to :json, :xml
  def index
    @posts = Post.all
    respond_with(@posts) do |format|
      format.json { render :json => @posts.to_json(:include => [...], :methods => [...]) }
    end
  end
#...

Using those settings you can change the JSON output on a per action basis. This system as described is a good high level overview of the Rails JSON generation approach. This can work in very simple and naive applications, but it should not be hard to imagine how this approach can be restrictive as well as verbose. Suppose I want to render a JSON output with only a few columns, also adding several methods and including nested options for multiple associations? That might look something like this:

# app/controllers/posts_controller.rb
#...
  respond_to :json, :xml
  def index
    @posts = Post.all
    respond_with(@posts) do |format|
      format.json { render :json => @posts.to_json(
         :only => [:title, :body, :created_at, :tags, :category],
         :include => {
            :likes => { :only => [:created_at], :include => [:author] },
            :comments => { :only => [:created_at, :body], :include => [:author] },
            :user => { :only => [:first_name, :last_name], :methods => [:full_name] }
         },
         :methods => [:likes_count, :comments_count])
      }
    end
  end
#...

This action code is already starting to smell a bit funny even here. This is a lot of bulk in the controller and quite redundant. This doesn’t even really seem like this belongs in the controller. This is more of a view or template concern discussing the details of a particular JSON representation. You could move that all to the model inside a method, but that actually makes things harder to follow. Already this method of generating JSON doesn’t feel quite right and begins to break down.

You may think this example is a contrived case or poor API design, but consider that there’s actually not that much going on here. This type of response is commonplace in almost any public API you will see on the web. In fact, it is actually much simpler than many in the wild. Compare the above to the Instagram API.

More Frustrations with API Generation

The issues above were just the beginning of the problems we ran up against using the ‘to_json’ method, because that approach is interested in ‘serializing’ a database object while we are interested in creating a relevant representation for our public API platform. Serializing the object so directly just didn’t quite fit what we were trying to do.

The easiest way to demonstrate the limitations that frustrated us is to show relevant examples. Let’s start with a simple idea. In our system we have ‘posts’ and we have the idea of ‘liking a post’. In our API we want to return if the authenticated user ‘liked’ a particular post in the feed. Something like:

[ { post : { title : "...", liked_by_user : true } }, ...]

Notice the node ‘liked_by_user’ which contains whether or not a user has liked the given post. Assuming we have this method in the model:

class Post
  # Returns true if given user has liked the post.
  # @post.liked_by_user?(@user) => true
  def liked_by_user?(user)
    self.likes.exists?(:user_id => user.id)
  end
end

We simply want to get this boolean value into the API response with the node name ‘liked_by_user’. How would we do this in ‘to_json’? How do we pass an argument to a method? After doing some research, it was apparent that this was not particularly easy or intuitive. It would be nice to have a simple way to pass multiple arguments to a method in the model without jumping through hoops.
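For contrast, here is what the desired node looks like when the representation is built by hand in plain Ruby. This is only a sketch of the goal, not a ‘to_json’ recipe: `Like`, `Viewer`, `FeedPost` and `post_representation` are stand-ins invented for this example, and the ActiveRecord query is replaced with a simple in-memory check.

```ruby
Like   = Struct.new(:user_id)
Viewer = Struct.new(:id)

FeedPost = Struct.new(:title, :likes) do
  # In-memory stand-in for the ActiveRecord query shown above.
  def liked_by_user?(user)
    likes.any? { |like| like.user_id == user.id }
  end
end

# Build the node by hand so the current user can be passed as an argument.
def post_representation(post, current_user)
  { :post => { :title => post.title,
               :liked_by_user => post.liked_by_user?(current_user) } }
end

alice = Viewer.new(1)
bob   = Viewer.new(2)
post  = FeedPost.new("Hello", [Like.new(1)])

post_representation(post, alice)
# => { :post => { :title => "Hello", :liked_by_user => true } }
post_representation(post, bob)
# => { :post => { :title => "Hello", :liked_by_user => false } }
```

The point is that the value depends on who is asking, so it cannot be expressed as a zero-argument entry in the methods list.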

Let’s move onto another example. Suppose we want to change the ‘user’ association to be aliased as an ‘author’ node in the output. Let’s say we have:

class Post
  belongs_to :user
end

and we want to have the output be:

[ { post : { title : "...", author : { first_name : "...", last_name : "..." } } }, ...]

What if I just need a minor change to the value of an attribute before inserting it into the JSON? What if I need a custom node in the JSON that is not needed in the model directly? What if I want to include the value of a method only if a condition on the record is met? What if I want to reduce duplication and render a JSON hash as a child of the parent response? What if I want to glue a couple of attributes from the user to the post? Change the model and fill it with this display logic every time? Fill our controllers with complicated JSON display options? Work around the problems by fighting with ‘to_json’ and/or monkeypatching it?
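To see why these questions add up quickly, here is a hand-rolled plain-Ruby sketch that covers three of them at once: aliasing the user association as author, gluing a child attribute onto the parent, and adding a node only when a condition holds. The names `Author`, `AuthoredPost`, `authored_post_json` and the `published` flag are all hypothetical and exist only for this illustration.

```ruby
Author       = Struct.new(:first_name, :last_name)
AuthoredPost = Struct.new(:title, :user, :published)

# Hand-built representation of a single post.
def authored_post_json(post)
  node = {
    :title => post.title,
    # The `user` association is exposed under the name `author`.
    :author => { :first_name => post.user.first_name,
                 :last_name  => post.user.last_name },
    # An attribute "glued" from the child onto the parent node.
    :author_last_name => post.user.last_name
  }
  # A node that is only present when a condition on the record is met.
  node[:published] = true if post.published
  { :post => node }
end

draft = AuthoredPost.new("Hello", Author.new("Justin", "Example"), false)
authored_post_json(draft)
# => { :post => { :title => "Hello",
#                 :author => { :first_name => "Justin", :last_name => "Example" },
#                 :author_last_name => "Example" } }
```

This works, but the code has to live somewhere, and neither the model nor the controller is a comfortable home for it.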

Perhaps a better approach

As we ran into these issues and many more while designing and implementing our public APIs, we butted our heads against ‘to_json’ again and again. Often we wanted the attributes defined in the schema to be renamed or modified for the representation. We wanted to omit attributes, or include them only if a condition was met. We wanted to handle polymorphic associations in a clean and easy way, and we wanted to keep a flat hierarchy by ‘gluing’ attributes from the child to the parent.

Furthermore, the model and/or controller was getting filled up with tons of JSON-specific details that had nothing to do with model or business logic. These JSON responses and verbose declarations didn’t seem to belong in the model or the controller at all and were cluttering up our code. True to MVC, these details of the response seemed much more appropriate in a view of some kind. This idea of storing the JSON in a view sparked an experiment. Why not just generate the JSON in a template and move all of the display details out of the model and the controller? What if the API could be crafted easily in the view where a JSON representation belongs?

We agreed that implementing APIs in a view made the most sense both conceptually and practically. The next question becomes what templating language to use to generate these APIs? Forming the XML or JSON manually in ‘erb’ seemed verbose and error-prone. Using builder seemed silly since we wanted to build APIs that work primarily in JSON. Indeed, none of the default templating languages we grab for seemed to fit. We didn’t want to painstakingly handcraft nodes manually, we just wanted a simple way to declare how our APIs should look that afforded us the flexibility we needed.

We investigated a wealth of different libraries that seemed to fit the bill, from tequila, to json_builder, to argonaut and many more attempts to solve this problem. Clearly we weren’t the only ones that had experienced the pain of ‘to_json’. Perusing the READMEs of any of these libraries quickly revealed people as fed up with the limitations as we had become. The problem was that every option we could find didn’t work for one reason or another. Either the syntax was awkward, the libraries weren’t maintained, there were too many bugs, or the templates became verbose and difficult to manage. After reviewing the available options, we decided to try and design our own library for creating APIs, one that would solve all the problems we had encountered thus far.

The Ruby API Builder Language

We embarked on a thought experiment before building the library. What were our frustrations with existing libraries and tools? Where do we want the JSON options to live? How did we want to specify them? What options did we want to have? What language or syntax should we use to define the output? How do we keep the options DRY and intuitive?

Early on we decided we wanted the JSON output to be defined in the views. Logic that belonged in the models would stay there where it belonged, but this was rarely the case. Most of the options were simply crafting the JSON response and clearly belonged in a template. So that meant a file living in the views folder within Rails. We also decided we didn’t want to learn a new language and that Ruby was as good an API builder as any. Why not just leverage a simple Ruby DSL to build our APIs? Why not support inheritance and partials for our APIs? Why not allow the same template to describe both the JSON and XML responses for our API?

From these design questions and several days of work, the RABL gem was born. We started using this approach and fell in love with it immediately. All of a sudden, generating APIs was easy and intuitive. Even the most complex or custom API output was very simple and maintainable through the use of inheritance, partials and custom nodes. All of this was kept neatly tucked away in a view template where it belonged without requiring any extra code in the models or worse the controller actions.
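To give a feel for the general idea of declaring a representation in a small Ruby DSL, here is a toy sketch in plain Ruby. It is emphatically not RABL’s implementation and should not be read as documentation of its syntax (see the README for that); the `Representation` class, the `Writer` and `Article` structs and the `summary` node are all invented for this illustration.

```ruby
# A toy template DSL: a representation is declared once, then rendered
# for any object that responds to the named methods.
class Representation
  def initialize(&template)
    @steps = []
    instance_eval(&template)
  end

  # Copy the named attributes straight from the object.
  def attributes(*names)
    @steps << lambda { |object, out| names.each { |n| out[n] = object.public_send(n) } }
  end

  # Add a custom node whose value is computed by the block.
  def node(name, &block)
    @steps << lambda { |object, out| out[name] = block.call(object) }
  end

  # Render an association under a (possibly different) node name.
  def child(mapping, &template)
    association, name = mapping.first
    nested = Representation.new(&template)
    @steps << lambda { |object, out| out[name] = nested.render(object.public_send(association)) }
  end

  # Apply every declared step to the object and return the resulting Hash.
  def render(object)
    @steps.each_with_object({}) { |step, out| step.call(object, out) }
  end
end

Writer  = Struct.new(:first_name, :last_name)
Article = Struct.new(:title, :body, :user)

article_view = Representation.new do
  attributes :title
  node(:summary) { |article| article.body[0, 10] }
  child(:user => :author) { attributes :first_name, :last_name }
end

article = Article.new("Hello", "A longer body of text", Writer.new("Justin", "Example"))
article_view.render(article)
# => { :title => "Hello", :summary => "A longer b",
#      :author => { :first_name => "Justin", :last_name => "Example" } }
```

Even in this tiny form, renaming an association and adding a computed node become one-line declarations that live alongside each other, instead of options scattered between the model and the controller.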

Stay Tuned

Since we built RABL, we have gotten excellent feedback from the community. We have deployed all of our public APIs in production using RABL and we couldn’t be happier. Please check out the README and let us know what you think! We would love to hear your experiences with building APIs on Rails or Sinatra. This post is a setup for a thorough step-by-step tutorial we plan to publish soon on generating clean JSON and XML APIs in Rails 3 using RABL.