Adventures in Scaling, Part 2: PostgreSQL

Several months ago, I wrote a post about REE Garbage Collection Tuning with the intent of kicking off a series dedicated to different approaches and methods applied at Miso in order to scale our service. This time around I wanted to focus on how to set up PostgreSQL on a dedicated server instance. In addition, I will cover how to tweak the configuration settings as a first pass toward optimizing database performance and explain which parameters are the most important.

Why PostgreSQL?

Before I begin covering these topics, I want to briefly touch on why our application (and this tutorial) is centered on PostgreSQL and not one of the many other RDBMS or NoSQL alternatives available for persistence in current web development. From the multitude of “general purpose” database persistence options available, the most common choices for a startup in my experience tend to be MySQL, Drizzle, PostgreSQL and MongoDB.

Each of the options above has pros and cons, and as is usually the case in technology, there is no one-size-fits-all solution. In the past, I have traditionally used MySQL to do the bulk of database persistence for Rails apps. This choice was largely because of familiarity as well as the clear MySQL favoritism in the early years of the Rails community. Though the full explanation of why is outside the scope of this post, suffice it to say I won’t be choosing MySQL again when starting a new project. My claim, unsubstantiated in this post, is that there is nothing significant MySQL provides over PostgreSQL, and yet there are many pitfalls and downsides.

If you are interested in the subject, I recommend you read a few posts and draw your own conclusions. To be fair, Drizzle looks like an interesting alternative to MySQL and/or Postgres. Having never used that database, I would be curious to hear how it compares to PostgreSQL. We are big fans of MongoDB at Miso and we store several types of data for our services within collections. However, for historical and practical reasons, we did not want to dedicate the time to convert our primary dataset as the benefit at our current level was not significant enough to warrant the time involved. In a future post, I would love to delve deeper into our Polyglot Persistence strategy and why we opted to use particular technologies over alternatives.

Setting up PostgreSQL

With that explanation out of the way, let’s turn our attention to setting up PostgreSQL on a dedicated database server. In this tutorial, we will be installing PostgreSQL 9.0 on an Ubuntu machine. You may need to adapt these steps depending on your platform and setup.

One of the first questions you run across when setting up a dedicated PostgreSQL server is “How much RAM should the instance have?”. I would take a look at the total size of your database (or the expected size of your database in the near future). If at all possible, the ideal RAM for your instance would allow for the entire database to be placed in memory. At small scales, this should be possible and will mean major performance increases for obvious reasons. The size I recommend for a starting server is typically between 2-8GB of RAM. Furthermore, I would recommend against using a dedicated database with less than 1GB of RAM if you can help it. If you have a large dataset and need replication or sharding from the start, then I would recommend putting down this guide and buying the PostgreSQL High Performance book right now. For the purposes of the rest of this guide, I am going to assume an 8GB instance was selected.
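
If you already have a PostgreSQL database running somewhere and are unsure how large it actually is, psql can tell you directly; a quick check looks like this (the database and table names below are placeholders):

psql -d postgres
> SELECT pg_size_pretty(pg_database_size('your_database'));
> SELECT pg_size_pretty(pg_total_relation_size('your_largest_table'));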

Now that we picked the size of our instance, let’s actually start the install. Unfortunately for us, the Ubuntu 10.04 apt repositories only have PostgreSQL 8.4; the postgresql-9.0 package won’t be added until Ubuntu Natty. For this reason, unless you are on that release, you must use an alternate repository for this installation. The “python-software-properties” package provides the add-apt-repository command and makes adding repositories easier. If you haven’t yet, install that first:

sudo apt-get install python-software-properties

Next, let’s add the repository containing PostgreSQL 9.0:

sudo add-apt-repository ppa:pitti/postgresql
sudo apt-get update

Now, we need to install the database, the contrib tools, and several supporting libraries:

sudo apt-get install postgresql-9.0 postgresql-contrib-9.0
sudo apt-get install postgresql-server-dev-9.0 libpq-dev libpq5

Installing all of these packages now will save you the pain later of trying to track down why certain things won’t work. Another good step is to symlink the archival cleanup tool to /usr/bin for use when you need to enable replication with WAL:

sudo ln -s /usr/lib/postgresql/9.0/bin/pg_archivecleanup /usr/bin/

At this time, you should also recreate the cluster to ensure proper UTF-8 encoding for all databases:

sudo su postgres
pg_dropcluster --stop 9.0 main
pg_createcluster --start -e UTF-8 9.0 main
exit
sudo /etc/init.d/postgresql restart

You can now check the status of the database using:

sudo /etc/init.d/postgresql status

Once this has all been done, you can find the configuration directory in /etc/postgresql/9.0/main and the data directory in /var/lib/postgresql/9.0/main.

Another useful tool is pg_top, which allows you to view the status of your PostgreSQL processes along with the currently executing queries:

wget http://pgfoundry.org/frs/download.php/1781/pg_top-3.6.2.tar.gz
tar -zxpvf pg_top-3.6.2.tar.gz
cd pg_top-3.6.2
./configure
make
sudo make install

PostgreSQL and all related utilities should now be properly installed. From here, you can begin creating databases and users with the psql command:

psql -d postgres
> CREATE USER deploy WITH PASSWORD 'jw8s0F4';
> CREATE ROLE admin SUPERUSER;
> CREATE DATABASE name;
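
For instance, a common follow-up is to make the new user the owner of the new database, or to grant it privileges (the names here are the hypothetical ones from above):

> ALTER DATABASE name OWNER TO deploy;
> GRANT ALL PRIVILEGES ON DATABASE name TO deploy;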

Of course, there are a lot of commands you can run with this console. You are encouraged to read other sources for more information.

PostgreSQL Tuning

Now that we have successfully installed PostgreSQL, the next thing to do is to tune the configuration parameters for solid performance. Of course, the best way to do this is to measure and profile your application and set these based on your own needs. All these settings will ultimately vary based on the needs of your particular application. Nonetheless, here is a guide intended to get you started with settings that work “well enough”. From here, these parameters can be tweaked to your heart’s content to find the sweet spot for your individual use case.

Fortunately for us, PostgreSQL guru Gregory Smith has created a tool to make our lives a great deal simpler. This tool is called pgtune and I encourage you to run it on your server as soon as possible after setup. The settings it recommends should be your baseline configuration values unless you know otherwise. First, let’s download the pgtune tool:

cd ~
wget http://pgfoundry.org/frs/download.php/2449/pgtune-0.9.3.tar.gz
tar -zxvf pgtune-0.9.3.tar.gz

Once the utility has been extracted, simply execute the script with the proper options and the recommended configuration values will be output:

cd pgtune-0.9.3
./pgtune -i /etc/postgresql/9.0/main/postgresql.conf -o ~/postgresql.conf.pgtune --type Web

This will generate all the recommended values, tailored to your server, in the file ~/postgresql.conf.pgtune. Simply view this file and note the settings at the bottom:

cat ~/postgresql.conf.pgtune
# Look at the bottom for the relevant parameter values

Take the settings and append them to your actual configuration file located at /etc/postgresql/9.0/main/postgresql.conf. To be on the safe side, you should also update the kernel.shmmax property, which is the maximum size of a shared memory segment in bytes. This is particularly necessary for large values of shared_buffers and other shared-memory parameters:

  sudo sysctl -w kernel.shmmax=26843545600
  sudo nano /etc/sysctl.conf
    # Append the following line:
    kernel.shmmax=26843545600
  sudo sysctl -p /etc/sysctl.conf

Once this has been updated and you have saved the modified settings, restart your cluster for the settings to take effect (changes to shared_buffers and other shared memory settings require a full restart, not just a reload):

pg_ctlcluster 9.0 main restart

Now that we have set up these tuned parameters, let’s take a deeper look at the most important parameters you can tweak to improve your database performance. The list below is not a comprehensive guide, but in most cases these parameters will serve as the first values to experiment with after running pgtune. A sample configuration excerpt pulling the example values together follows the list.

  • work_mem – Specifies the amount of memory to be used by internal sort operations and hash tables before switching to temporary disk files. The total memory used could be many times the value of work_mem; it is necessary to keep this fact in mind when choosing the value. Sort operations are used for ORDER BY, DISTINCT, and merge joins. A recommended starting point for this value is total available memory / (2 * max_connections). On an 8GB system, this could be set to 40MB.
  • effective_cache_size – Sets the planner’s assumption about the effective size of the disk cache that is available to a single query. This is factored into estimates of the cost of using an index; a higher value makes it more likely index scans will be used, a lower value makes it more likely sequential scans will be used. The recommended range for this value is between 60%-80% of your total available RAM. On an 8GB system, this could be set to 5120MB.
  • shared_buffers – Sets the amount of memory the database server uses for shared memory buffers. The default is typically 32 megabytes and must be at least 128 kilobytes. The recommended range for this value is between 20%-30% of the total available RAM. On an 8GB system, this could be set to 2048MB.
  • wal_buffers – The amount of memory used in shared memory for WAL data. The default is 64 kilobytes (64kB). The setting need only be large enough to hold the amount of WAL data generated by one typical transaction. The recommended range for this value is between 2-16MB. On an 8GB system, this could be set to 12MB.
  • maintenance_work_mem – Specifies the maximum amount of memory to be used in maintenance operations, such as VACUUM, CREATE INDEX, and ALTER TABLE ADD FOREIGN KEY. Larger settings may improve performance for vacuuming and for restoring database dumps. The recommended range for this value is 50MB per GB of the total available RAM. On an 8GB system, this could be set to 400MB.
  • max_connections – Determines the maximum number of concurrent connections to the database server. The default is typically 100 connections and this parameter can only be set at server start. The recommended range for this value is between 100-200.
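
To put the example numbers above in one place, a hypothetical excerpt of /etc/postgresql/9.0/main/postgresql.conf for an 8GB web instance might look like the following; treat these as starting points to measure against rather than definitive values:

# Example starting values for a hypothetical 8GB instance
max_connections = 100
shared_buffers = 2048MB
effective_cache_size = 5120MB
work_mem = 40MB
maintenance_work_mem = 400MB
wal_buffers = 12MB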

For further reading, I would recommend checking out other articles and, of course, Gregory Smith’s book PostgreSQL High Performance.

Wrapping Up

Hopefully you have found the guide above helpful. The intent was to provide an easy way for people to get started using PostgreSQL in their Rails applications today. If you are starting a new Rails application, I strongly encourage you to make your own informed decision on which persistence engine to use. Research the issue for yourself, but I would urge you to at least consider giving PostgreSQL an honest try. Coming from MySQL, I can assure you we did not miss a single thing about it when we migrated.

This post is the first of many that will deal with PostgreSQL and provide details on how to get this excellent RDBMS to work for your application. In a future post, I would like to cover how to setup “hot standby” replication with archival logs (WAL). That guide will take you step by step through creating a master database and several read-only slaves as well as then utilizing those slaves from within your Rails application. We also ported from MySQL to PostgreSQL, so another future post will detail how to best migrate your database as well as pitfalls to avoid.

Building a Platform API on Rails

Introduction

In my last post, I discussed the failings of to_json in Rails and how we approached writing our public APIs using RABL and JSON Templates in the View. Since that post, we have continued expanding our Developer API and adding new applications to our Application Gallery.

The approach of using views to intelligently craft our JSON APIs has continued to prove useful again and again as we strive to create clean and simple APIs for developers to consume. We have used the as_json approach on other projects for years in the past and I can say that I have not missed that strategy even once since we moved to RABL. To be clear, if your APIs and data models are dead simple and only for internal or private use and a one-to-one mapping suits your needs, then RABL may be overkill. RABL really excels once you find yourself needing to craft flexible and more complex APIs rich with data and associated information.

The last post was all about explaining the reasons for abandoning to_json and the driving forces behind our creation of RABL. This post is all about how to actually use Rails 3 and RABL in a pseudo-real-world way to create a public API for your application. In other words, how to empower an “application” to become a developer’s platform.

Authentication

The first thing to determine when creating a platform is the authentication mechanism that developers will use when interacting with your API. If your API is read-only and/or easily cached on a global basis, then perhaps no authentication is necessary. In most cases though, your API will require authentication of some kind. These days OAuth has become a standard and using anything else is probably a bad idea.

The good news is that we are well past the point where you must implement an OAuth Provider strategy from scratch in your application. Nevertheless, I would strongly recommend you become familiar with the OAuth Authentication Protocol before trying to create a platform. Starting with this in mind, your next task is to pick an OAuth gem for Ruby. Searching Github, you can see there are many libraries to choose from.

The best one to start with is Rails OAuth Plugin, which provides a drop-in solution for both consuming and providing APIs in your Rails 3 applications. The documentation is fairly decent and, coupled with the introduction blog post, I think reiterating the steps to set this up would be unnecessary. Follow the steps and generate an OAuth Provider system into your Rails 3 application. This is what will enable the “OAuth Dance”, where developers generate a request token, have a user authorize it, and then exchange it for an access token. Once the developer has an access token, they can use it to interact with your APIs. The easiest way for developers to consume your API is to use a “consumer” OAuth library such as Ruby OAuth or the equivalent in the client language.
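
As a rough sketch of the consumer side, a signed request with the Ruby OAuth gem might look something like this (the keys, tokens and host below are placeholders):

# consumer_example.rb
require 'oauth'

# Keys issued to the developer when they register their application
consumer = OAuth::Consumer.new("CONSUMER_KEY", "CONSUMER_SECRET",
                               :site => "https://yourplatform.example.com")

# Access token obtained for a given user at the end of the OAuth dance
access_token = OAuth::AccessToken.new(consumer, "ACCESS_TOKEN", "ACCESS_SECRET")

# Requests made through the access token are signed automatically
response = access_token.get("/users/1.json")
puts response.body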

Building your API

Now that we have authentication taken care of in our application, the next step is to create APIs for developers to interact with on our platform. For this, we can re-use our existing controllers or create new ones. You likely already have HTML representations of your application for users to interact with. These views are declared in an “index.html.erb” file or similar and describe the display for a user in the browser. When building APIs, the simplest way to think about them is as “just another view representation” that lives alongside your HTML templates.

To demonstrate how easy building an API is, let’s consider a sample application. This application is a simple blogging engine that has users and posts. Users can create posts and then their friends can read their streams. For now, we just want this to be a blogging platform and we will leave the clients up to developers to build using our APIs. First, let’s use RABL by declaring the gem in our Gemfile:

# Gemfile
gem 'rabl'

Next, we need to install the new gem into our gemset:

$ bundle install

Next, let’s generate the User and Post tables that are going to be used in this example application:

$ rails g model User first_name:string last_name:string age:integer
$ rails g model Post title:string body:text user_id:integer
$ rake db:migrate db:test:prepare

Nothing too crazy, just defining some models for users and posts with the minimum attributes needed. Let’s also set up the model associations:

# app/models/user.rb
class User < ActiveRecord::Base
  has_many :posts
end

# app/models/post.rb
class Post < ActiveRecord::Base
  belongs_to :user
end

Great! So we now have a user model and a post model. A user can create many posts, and they can be retrieved easily using the Rails has_many association. Now, let’s expose the user information for a particular user in our API. First, let’s generate the controller for Users:

$ rails g controller Users show

and of course setup the standard routes for the controller in routes.rb:

# config/routes.rb
SampleApiDemo::Application.routes.draw do
  resources :users
end

Next, let’s protect our controller so that the data can only be accessed when a user has authenticated, and set up our simple “show” method for accessing a user’s data:

# app/controllers/users_controller.rb
class UsersController < ApplicationController
  before_filter :oauth_required

  respond_to :json, :xml
  def show
    @user = User.find_by_id(params[:id])
  end
end

As you can see, the controller is very lean here. We set up a before_filter from the oauth-plugin to require OAuth authentication and then we define a “show” action which retrieves the user record for the given user id. We also declare that the action can respond to both JSON and XML formats. At this point, you might wonder a few things.

First, “Why include XML up there if we are building a JSON API?” The short answer is because RABL gives this to us for free. The same RABL declarations actually power both our XML and JSON APIs by default with no extra effort! The next question might be, “How does the action know what to render?” This is where the power of the view template really shines. There is no need to declare any response handling logic in the action because Rails will automatically detect and render the associated view once the template is defined. But what does the template that powers our XML and JSON APIs look like? Let’s see below:

# app/views/users/show.rabl
object @user

# Declare the properties to include
attributes :first_name, :last_name

# Alias 'age' to 'years_old'
attributes :age => :years_old

# Include a custom node with full_name for user
node :full_name do |user|
  [user.first_name, user.last_name].join(" ")
end

# Include a custom node related to if the user can drink
node :can_drink do |user|
  user.age >= 21
end

The RABL template above is annotated, hopefully explaining each type of declaration fairly clearly. RABL is an all-Ruby DSL for defining your APIs. Once the template is set up, testing out our new API endpoint is fairly simple. Probably the easiest way is to temporarily comment out the “oauth_required” line or disable it in development. Then you can simply visit the endpoint in a browser or using curl:

// rails c
// User.create(...)
// rails s -p 3001
// http://localhost:3001/users/1.json
{
  user: {
    first_name: "Bob",
    last_name: "Hope",
    years_old: 92,
    full_name: "Bob Hope",
    can_drink: true
  }
}
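
If you prefer the command line, the same request via curl (again assuming OAuth is temporarily disabled in development) is simply:

curl http://localhost:3001/users/1.json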

With that we have a fully functional “show” action for a user. RABL also lets us easily reuse templates. Let’s say we want to create an “index” action that lists all our users:

# app/views/users/index.rabl
object @users

# Reuse the show template definition
extends "users/show"

# Let's add an "id" attribute for the index action
attributes :id

That would allow us to have an index action by re-using the show template and applying it to the entire collection:

// rails s -p 3001
// http://localhost:3001/users.json
[
  {
    user: {
      id : 1,
      first_name: "Bob",
      last_name: "Hope",
      years_old: 92,
      full_name: "Bob Hope",
      can_drink: true
    }
  },
  {
    user: {
      id: 2,
      first_name: "Alex",
      last_name: "Trebec",
      years_old: 102,
      full_name: "Alex Trebec",
      can_drink: true
    }
  }
]

Now that we have an “index” and “show” for users, let’s move on to the Post endpoints. First, let’s generate a PostsController:

$ rails g controller Posts index

and append the posts resource routes:

# config/routes.rb
SampleApiDemo::Application.routes.draw do
  resources :users, :only => [:show, :index]
  resources :posts, :only => [:index]
end

Now, we can fill in the index action by retrieving all posts into a posts collection:

# app/controllers/posts_controller.rb
class PostsController < ApplicationController
  before_filter :login_or_oauth_required     
  respond_to :html, :json, :xml      

  def index          
    @posts = Post.all      
  end  
end

and again we define the RABL template that serves as the definition for the XML and JSON output:

# app/views/posts/index.rabl  

# Declare the data source  
collection @posts  

# Declare attributes to display  
attributes :title, :body  

# Add custom node to declare if the post is recent 
node :is_recent do |post|      
  post.created_at > 1.week.ago
end

# Include user as child node, reusing the User 'show' template
child :user do
  extends "users/show"
end

Here we introduce a few new concepts. In RABL, there is support for “custom” nodes as well as for “child” association nodes which can also be defined in RABL or reuse existing templates as shown above. Let’s check out the result:

[
  {
    post: {
      title: "Being really old",
      body: "Let me tell you a story",
      is_recent: false,
      user: {
        id : 52,
        first_name: "Bob",
        last_name: "Hope",
        years_old: 92,
        full_name: "Bob Hope",
        can_drink: true
      }
    }
  },
  { ...more posts... }
]

Here you can see we have a full posts index with nested user association information as well as a custom node. This is where the power of JSON templates lies: the incredible flexibility and simplicity that this approach affords you. Renaming an object or attribute is as simple as:

# app/views/posts/index.rabl

# Every post is now contained in an "article" root
collection @posts => :articles

# The attribute "first_name" is now "firstName" instead
attribute :first_name => :firstName

Another benefit is the ability to use template inheritance, which affords seamless reusability. In addition, the partial system allows you to embed a RABL template within another to reduce duplication.
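
As a quick sketch of what that reuse looks like in practice, a child node can either extend an existing template or render it as a partial (the template below is hypothetical, but the paths match the examples above):

# app/views/posts/show.rabl (hypothetical)
object @post
attributes :title, :body

# Reuse the user template for the post's author via a child node
child :user do
  extends "users/show"
end

# Or embed the same template anywhere as a partial
node :author do |post|
  partial("users/show", :object => post.user)
end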

Wrapping Up

For a full rundown of the functionality and DSL of RABL, I encourage you to check out the README and the Wiki. In addition, I would like to point out excellent resources by other users of RABL and encourage you to read teohm’s post as well as Nick Rowe’s overview of RABL. Please post any other resources in the comments!

My intention with this post was to provide a solid introductory guide for creating an API and a data platform easily in Rails. If there are topics that you would be interested in reading about more in-depth related to API design or creation or building out a platform, please let us know and we will consider it for a future post.

If you’re using to_json, you’re doing it wrong

At Miso, we have been very busy in the last few months building out a large number of public APIs for our Developer Platform. In a short time, we have already seen early versions of applications built on our platform for Chrome, Windows Mobile 7, Blackberry, Playbook, XBMC among others. This has been very exciting to see the community embrace our platform and leverage our data to power additional services or bring our service to a new group of users. In this post, we will discuss how we started out building our APIs using Rails and ‘to_json’, why we became frustrated with that approach and how we ended up building our own library for API generation.

Our public APIs are designed to be unsurprising and intuitive for a developer. We chose OAuth 1.0a (soon to support OAuth 2) because this is already familiar to developers and there is rich library support across languages for this authentication strategy. The endpoints are for the most part RESTful, with GET retrieving data and POST / DELETE used to modify the data associated with a user. We also tried to design our API responses as simply as possible, giving every attribute a readable name, keeping node hierarchies relatively flat and not including unnecessary information from our database schema. While these may seem like obvious design goals when building a public API, you might be surprised at how difficult they can be to achieve using Rails and the baked-in ‘to_json’ serialization the framework provides.

Rails ‘to_json’ API generation

Let’s start by discussing the canonical approach Rails provides for generating APIs. The idea is to have the JSON and XML responses be deeply tied to the model schemas with one-to-one mappings in most cases between the database columns and the api output. This is great if you are working on internal APIs and you wish to dump the data directly into a response but less great for well-designed public APIs. Let’s look at how rendering with this to_json approach works in practice:

# app/controllers/posts_controller.rb
#...
  respond_to :json, :xml
  def index
    @posts = Post.all
    respond_with(@posts)
  end
#...

Now, when this controller action is invoked, ‘to_json’ is automatically called on the @posts collection, which takes each ActiveRecord model and converts it seamlessly to JSON output. The model can also optionally be given default serialization parameters by overriding the as_json method with options:

# app/models/post.rb
class Post
  def as_json(options={})
    super(options.merge(:methods => [...], :only => [...], :include => [...]))
  end
end

This would then render the associated JSON response based on the options specified. In the simplest cases this would be all you need, and the Rails way works without a problem. As long as your database schema is deeply coupled with the API output, all is well. As mentioned, there are a few options for customizing the to_json output:

  • only – Only show column names in the output as specified in this list
  • except – Show all column names except the ones specified in this list
  • methods – Include these methods nodes (without any arguments) as nodes in the output
  • include – Add child nodes (potentially nested) based on associations within the object

Alright, to recap: there is a model, which has a specific database schema and some methods. There are limited options for transforming that into JSON, as described above, by passing a hash of options. These options can be set as defaults by overriding the ‘as_json’ method in the model itself, or passed in through the controller action.

The unravelling of ‘to_json’ begins

Well, wait just a minute: what if, in different API responses, you want to include different output options? What if you want to override the ‘as_json’ defaults? Not too bad; you can just do:

# app/controllers/posts_controller.rb
#...
  respond_to :json, :xml
  def index
    @posts = Post.all
    respond_with(@posts) do |format|
      format.json { render :json => @posts.to_json(:include => [...], :methods => [...]) }
    end
  end
#...

Using those settings, you can change the JSON output on a per-action basis. That covers the Rails JSON generation approach at a high level. This can work in very simple applications, but it should not be hard to imagine how this approach becomes restrictive as well as verbose. Suppose I want to render JSON output with only a few columns, while also adding several methods and including nested options for multiple associations. That might look something like this:

# app/controllers/posts_controller.rb
#...
  respond_to :json, :xml
  def index
    @posts = Post.all
    respond_with(@posts) do |format|
      format.json { render :json => @posts.to_json(
         :only => [:title, :body, :created_at, :tags, :category],
         :include => {
            :likes    => { :only => [:created_at], :include => [:author] },
            :comments => { :only => [:created_at, :body], :include => [:author] },
            :user     => { :only => [:first_name, :last_name], :methods => [:full_name] }
         },
         :methods => [:likes_count, :comments_count])
      }
    end
  end
#...

This action code is already starting to smell a bit funny. It is a lot of bulk in the controller, and it is quite redundant. It doesn’t even really seem like it belongs in the controller; this is more of a view or template concern, describing the details of a particular JSON representation. You could move all of that into a method on the model, but that actually makes things harder to follow. Already this method of generating JSON doesn’t feel quite right and begins to break down.

You may think this example is a contrived case or poor API design, but consider that there’s actually not that much going on here. This type of response is commonplace in almost any public API you will see on the web. In fact, it is actually much simpler than many in the wild. Compare the above to the Instagram API.

More Frustrations with API Generation

The issues above were just the beginning of what we ran up against using ‘to_json’, because that approach is interested in ‘serializing’ a database object, while we are interested in crafting a relevant representation for our public API platform. Serializing the object so directly just didn’t fit what we were trying to do.

The easiest way to demonstrate the limitations that frustrated us is to show relevant examples. Let’s start with a simple idea. In our system we have ‘posts’ and we have the idea of ‘liking a post’. In our API we want to return if the authenticated user ‘liked’ a particular post in the feed. Something like:

[ { post : { title : "...", liked_by_user : true } }, ...]

Notice the node ‘liked_by_user’ which contains whether or not a user has liked the given post. Assuming we have this method in the model:

class Post
  # Returns true if given user has liked the post.
  # @post.liked_by_user?(@user) => true
  def liked_by_user?(user)
    self.likes.exists?(:user_id => user.id)
  end
end

We simply want to get this boolean value into the API response with the node name ‘liked_by_user’. How would we do this in ‘to_json’? How do we pass an argument to a method? After doing some research, it was apparent that this was not particularly easy or intuitive. It would be nice to have a simple way to pass multiple arguments to a method in the model without jumping through hoops.
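
As an illustration of the hoops involved, one possible workaround is to smuggle the user in through the serialization options and override as_json yourself. This is only a sketch of the kind of code it takes, not a recommendation:

# app/models/post.rb
class Post < ActiveRecord::Base
  def as_json(options = {})
    json = super(options)
    if options[:user]
      # Depending on include_root_in_json, the attributes may be nested under a "post" root
      (json["post"] || json)["liked_by_user"] = liked_by_user?(options[:user])
    end
    json
  end
end

# app/controllers/posts_controller.rb (inside the action)
# render :json => @posts.to_json(:user => current_user)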

Let’s move onto another example. Suppose we want to change the ‘user’ association to be aliased as an ‘author’ node in the output. Let’s say we have:

class Post
  belongs_to :user
end

and we want to have the output be:

[ { post : { title : "...", author : { first_name : "...", last_name : "..." } } }, ...]

What if I just need a minor change to the value of an attribute before inserting it into the JSON? What if I need a custom node in the JSON that is not needed in the model directly? What if I want to include the value of a method only if a condition on the record is met? What if I want to reduce duplication and render a JSON hash as a child of the parent response? What if I want to glue a couple of attributes from the user to the post? Change the model and fill it with this display logic every time? Fill our controllers with complicated JSON display options? Work around the problems by fighting with ‘to_json’ and/or monkeypatching it?

Perhaps a better approach

As we came up against these issues and many more while designing and implementing our public APIs, we butted our heads against ‘to_json’ again and again. Often we wanted attributes defined in the schema to be renamed or modified for the representation; we wanted to omit attributes, or include them only when a condition was met; we wanted to handle polymorphic associations in a clean and easy way; we wanted to keep a flat hierarchy by ‘gluing’ attributes from the child to the parent.

Furthermore, the model and/or controller was getting filled up with tons of JSON-specific details that had nothing to do with model or business logic. These JSON responses and verbose declarations didn’t seem to belong in the model or the controller at all and were cluttering up our code. True to MVC, these details of the response seemed much more appropriate in a view of some kind. This idea of storing the JSON in a view sparked an experiment: why not just generate the JSON in a template and move all of the display details out of the model and the controller? What if the API could be crafted easily in the view, where a JSON representation belongs?

We agreed that implementing APIs in a view made the most sense both conceptually and practically. The next question became which templating language to use to generate these APIs. Forming the XML or JSON manually in ‘erb’ seemed verbose and error-prone. Using Builder seemed silly since we wanted to build APIs that work primarily in JSON. Indeed, none of the default templating languages we normally reach for seemed to fit. We didn’t want to painstakingly handcraft nodes manually; we just wanted a simple way to declare how our APIs should look that afforded us the flexibility we needed.

We investigated a wealth of libraries that seemed to fit the bill, from tequila, to json_builder, to argonaut, and many more attempts to solve this problem. Clearly we weren’t the only ones who had experienced the pain of ‘to_json’. Perusing the READMEs of any of these libraries quickly revealed people fed up with the same limitations we had run into. The problem was that every option we could find fell short for one reason or another: either the syntax was awkward, the library wasn’t maintained, there were too many bugs, or the templates became verbose and difficult to manage. After reviewing the available options, we decided to try to design our own library for creating APIs, one that would solve all the problems we had encountered thus far.

The Ruby API Builder Language

We embarked on a thought experiment before building the library. What were our frustrations with existing libraries and tools? Where do we want the JSON options to live? How did we want to specify them? What options did we want to have? What language or syntax should we use to define the output? How do we keep the options DRY and intuitive?

Early on we decided we wanted the JSON output to be defined in the views. Logic that belongs in the models would stay there, but most of the options were simply crafting the JSON response, which clearly belongs in a template. So that meant a file living in the views folder within Rails. We also decided we didn’t want to learn a new language, and that Ruby was as good an API builder as any. Why not just leverage a simple Ruby DSL to build our APIs? Why not support inheritance and partials for our APIs? Why not allow the same template to describe both the JSON and XML responses for our API?

From these design questions and several days of work, the RABL gem was born. We started using this approach and fell in love with it immediately. All of a sudden, generating APIs was easy and intuitive. Even the most complex or custom API output was simple and maintainable through the use of inheritance, partials and custom nodes. All of this was kept neatly tucked away in a view template where it belonged, without requiring any extra code in the models or, worse, the controller actions.
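
To make that concrete, here is a small hypothetical RABL template showing the kinds of things that were painful with ‘to_json’; the liked_by_user?, published? and current_user helpers are assumed to exist in the application:

# app/views/posts/show.rabl (hypothetical)
object @post
attributes :title, :body

# Pass arguments to a model method via a custom node
node(:liked_by_user) { |post| post.liked_by_user?(current_user) }

# Only include a node when a condition on the record is met
node(:published_at, :if => lambda { |post| post.published? }) { |post| post.published_at }

# Glue attributes from the child user onto the parent post
glue :user do
  attributes :first_name => :author_first_name
end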

Stay Tuned

Since we built RABL, we have gotten excellent feedback from the community. We have deployed all of our public APIs in production using RABL and we couldn’t be happier. Please check out the README and let us know what you think! We would love to hear your experiences with building APIs on Rails or Sinatra. This post is a setup for a thorough step-by-step tutorial we plan to publish soon on generating clean JSON and XML APIs in Rails 3 using RABL.

Adventures in Scaling, Part 1: Using REE

The engineering team at Miso has been quite busy the last few weeks hacking on new features for the site, as well as on experimental ideas regarding the ‘future of social TV’. We are also busy fleshing out the Developer Platform that allows others to build applications using our API and to embed widgets displaying a user’s latest watching activity. [Post on building an OAuth REST API is forthcoming...] As always, we are very focused on infrastructure and stability as our user base grows.

There are a lot of important lessons we have been learning as we continue to scale our application. An ongoing theme of this blog will be to detail a variety of topics relating to our adventures in scaling our various web services as our traffic grows. We decided to start simple and explain in this first post how developers hosting a Rails application on Ruby 1.8.7 should be using Ruby Enterprise Edition to take advantage of its performance tuning capabilities.

Ruby Enterprise Edition is very simply an improved version of the 1.8.7 MRI Ruby Runtime. The improvements include a copy-on-write friendly garbage collector, an improved memory allocator, ability to debug and tune garbage collection, and various thread bug fixes and performance improvements. In short, if you are using 1.8.7 in your Rails or Rack application anyways, there is really no reason not to switch to REE. We have been using it in production for months and the benefits are worth the switch.

First, let’s get REE installed on your servers. For this tutorial, we will assume you are running Ubuntu 32-bit in production (if not then certain details might be slightly different):

wget http://rubyenterpriseedition.googlecode.com/files/ruby-enterprise_1.8.7-2011.01_i386_ubuntu10.04.deb
sudo dpkg -i ruby-enterprise_1.8.7-2011.01_i386_ubuntu10.04.deb

This will install the REE package onto your Ubuntu system under the /usr/local/ directory by default. Don’t worry, it can happily co-exist with your existing Ruby installation. Next, you need to reconfigure your web server to use this version of Ruby. If you are using Passenger, for instance, install the Passenger gem under REE, run the Nginx or Apache module installer, and then update your web server configuration to use it.
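
For example, with Passenger and Nginx the steps are roughly the following; the paths assume the default REE install location under /usr/local, so adjust them for your setup:

# Install the Passenger gem under REE and build its Nginx module
sudo /usr/local/bin/gem install passenger
sudo /usr/local/bin/passenger-install-nginx-module
# Then point passenger_ruby in your Nginx config at /usr/local/bin/ruby
# (or at the tuned wrapper script described later in this post)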

REE comes with the ability to tune performance by tweaking the garbage collector. This can have a significant impact on your application and is worth the extra effort. We have seen up to a 20-30% increase in Ruby performance simply by fine-tuning these parameters. To tune Ruby, we need to create a wrapper script that sets the appropriate variables and then launches ruby.

There are many different recommended settings for the variables and these do depend on your application. Let’s take a look at the various adjustable options for the garbage collector. Each has a different effect on the performance of your server:

RUBY_HEAP_MIN_SLOTS

The first option has to do with the initial number of heap slots Ruby will allocate upon startup. This affects memory usage, because the larger the heap size, the more initial memory required. However, most Ruby applications need much more memory than the default allocation provides. Increasing this value decreases your application’s startup time and increases throughput. The default is 10000 slots, but the recommended range is between 500000-1250000. For most applications I have tested, the sweet spot is roughly 800000.

RUBY_HEAP_SLOTS_INCREMENT

This option is the number of additional heap slots that will be allocated whenever Ruby is forced to allocate new heap slots for the first time. The default value is 10000, but this is lower than most Rails applications need. The recommended range is between 100000-300000, which means Ruby will grow its heap much faster, allowing for better throughput and faster response times depending on your application. Our recommended setting is 250000.

RUBY_HEAP_SLOTS_GROWTH_FACTOR

This option is the multiplier Ruby uses to calculate the number of new heap slots to allocate the next time it needs more. The default is 1.8, but if you raise the slots increment to a much higher value as recommended above, this should be changed to 1, because each allocation is already sized appropriately and there is no need for exponential growth of the slots.

RUBY_GC_MALLOC_LIMIT

This option is the amount of data structures that can be allocated before a garbage collection occurs. This value is really important because the garbage collector in Ruby can be very slow, and minimizing how often it runs can significantly increase performance. The default is 8000000, but recommended values range from 30000000-80000000, which allows many more structures to be created before a collection is triggered. This means more memory consumption in exchange for less frequent sweeping, which can translate to significant performance gains.

RUBY_HEAP_FREE_MIN

This option is the number of heap slots that should be free after a garbage collection has executed; if fewer slots than this are free after a sweep, Ruby allocates more heap space. The default value is 4096, but a much higher value can be used in conjunction with the other settings above. A value more to the tune of 100000 is more suitable, which means less frequent allocation, but each allocation will be a much larger number of slots as defined above. This translates to higher performance in most cases.

Bringing it all together

These changes significantly impact the way Ruby manages memory and performs garbage collection. Now, Ruby will start with enough memory to hold the application in memory from the initial launch. Normally, Ruby starts with far too little memory for a production web application. The memory is increased linearly as more is required rather than the default exponential growth. Garbage collection also happens far less frequently during the execution of your application. The downside is higher peak memory usage but the upside is significant performance gains.

In our case, we ended up using settings very similar to Github and Twitter. We are going to show those settings below, but feel free to research and tweak based on your own analysis of an individual application’s needs.

Let’s create a wrapper for the tweaked ruby settings and save it to /usr/local/bin/tuned_ruby:

#!/bin/bash
export RUBY_HEAP_MIN_SLOTS=800000
export RUBY_HEAP_FREE_MIN=100000
export RUBY_HEAP_SLOTS_INCREMENT=300000
export RUBY_HEAP_SLOTS_GROWTH_FACTOR=1
export RUBY_GC_MALLOC_LIMIT=79000000
exec "/usr/local/bin/ruby" "$@"

Then let’s set the appropriate permissions:

sudo chmod a+x /usr/local/bin/tuned_ruby

Once that tweaked ruby is executable, change your configuration so that the web server uses this new version of ruby. For instance in Passenger 3 for Nginx (our deployment tool of choice), the change would look like:

# /etc/nginx/conf/includes/passenger.conf
# ...
passenger_ruby /usr/local/bin/tuned_ruby;
# ...

Then restart your web server for the changes to take effect; in our case:

sudo god restart nginx

Now your application will be using the new tuned REE ruby runtime and will likely be much more memory efficient and have decreased response times for users. Miso uses a variety of profiling tools to gauge the performance of our application (post forthcoming) but I feel it is important to at least mention that after changing these parameters and using REE, profiling to measure the actual impact is essential.

This is only the first blog post of this series. We will be releasing another one soon about database optimization (for Postgres or MySQL) and how to tune your database for better performance as you scale.

Resources

For additional resources about this topic be sure to check out:

Easy Monitoring of Varnish with Munin

If you’re looking for a reverse proxy server, Varnish is an excellent choice. It’s fast, and it’s used by Facebook and Twitter, as well as plenty of others. For most sites, it can be used effectively pretty much out of the box with minimal tuning.

Like many decently-sized Rails apps, we leverage a lot of open source code. Dozens of gems and plugins, a variety of cloud services, Varnish and Nginx for caching and load balancing, and various persistence solutions. The point is, as our app usage has grown over the last year, we’ve had our share of stressful, on-the-fly debugging while our app was down. That’s not the best time to learn about all the fun nuances and interactions of your technology stack.

It’s a good idea to know what your services are doing and the key metrics to watch, so you’re better prepared when you hit those inevitable scaling pain points. New Relic has been tremendously useful for monitoring and debugging our database and Rails app. The rest of this post goes over some key metrics for Varnish and setting up Munin to monitor them.

Optimizing and Inspecting Varnish

Unless your application has an extremely high volume of traffic, you likely won’t have to optimize Varnish itself (e.g., cache sizes, thread pool settings, etc). Most of the work will be in verifying that your resources have appropriate HTTP caching parameters (Expires/max-age and ETag/Last-Modified). You’re most of the way there if you do the following:

  • Run Varnish on a 64-bit machine. It’ll run on a 32-bit machine, but it likes the virtual address space of a 64-bit machine. Also, Varnish’s test suites are only run on 64-bit distributions.
  • Normalize the hostname. e.g., www.website.com => website.com, to avoid caching the same resource multiple times. Details here.
  • Unset cookies for any resource that should be cacheable. Details here. (A minimal VCL sketch covering these last two points follows this list.)
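
Here is that minimal VCL sketch; the hostname and file extensions are placeholders, so adapt them to your own application:

sub vcl_recv {
  # Normalize the Host header so www and non-www don't create duplicate cache entries
  if (req.http.host ~ "^www\.example\.com$") {
    set req.http.host = "example.com";
  }

  # Strip cookies from static assets so Varnish will cache them
  if (req.url ~ "\.(css|js|png|gif|jpg|jpeg)$") {
    remove req.http.Cookie;
  }
}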

Varnish includes a variety of command line tools to inspect what Varnish is doing. SSH into the server running Varnish, and let’s take a look.

Inspecting an individual resource

First, let’s look at how Varnish handles an individual resource. On a client machine, point a web browser to a resource cached by Varnish. On the server, type:

$ varnishlog -c -o ReqStart <IP address of client machine>

The output of this command will be communication between the client machine and Varnish. In another SSH terminal, type:

$ varnishlog -b -o TxHeader <IP address of client machine>

The output of this command will be communication between Varnish and a backend server (i.e., an origin server, the actual application). Try reloading the resource in the browser. If it is cached correctly, you shouldn’t see any communication between Varnish and any backend servers. If you do see something printed there, inspect the HTTP caching headers and verify they are correct.

Varnish statistics

Now that we’ve seen that Varnish is working for an individual resource, let’s see how it’s doing overall. In your SSH session, type:

$ varnishstat

The most important metrics to note here are the hitrate and the uptime. Varnish has a parent process whose only function is to monitor and restart a child process. If Varnish is restarting itself frequently, that’s something to be investigated by looking at its output in /var/log/syslog.

Other than that, check out Varnishstat For Dummies for a good overview.

It’s great that we can check on Varnish fairly easily, but the key is to automate this process; otherwise, it can be very difficult to detect warning patterns early. Also, it’s not realistic to have a huge, manual, pre-flight checklist to check on the health of all your services. Enter Munin…

Get Started with Munin in 15 minutes

Munin is a monitoring tool with a plug-in framework. Munin nodes periodically report back to a Munin server. The Munin server collects the data and generates an HTML page with graphs. The default install of Munin contains a plug-in for reporting Varnish statistics, which generates a variety of graphs.

Installing Munin

If you’re installing Munin on an Ubuntu machine (or any distribution that uses apt), use the commands below. For other platforms, see the installation instructions here.

For every server you want to monitor, type:

$ sudo apt-get install munin-node

Designate a server to collect the data. The server can also be a Munin node. On the server, type:

$ sudo apt-get install munin

Configuring Munin

For each node, open the configuration file at /etc/munin/munin-node.conf. Add the IP address of the Munin server.

allow ^xxx.xxx.xxx.xxx$

After you modify the configuration file, restart the Munin node by typing:

$ sudo service munin-node restart

For the server, open the configuration file at /etc/munin/munin.conf. Add each node that you want to monitor.

[Domain;serverA]
  address xxx.xxx.xxx.xxx
  use_node_name yes

Choose any value you like for Domain and serverA above; the names are purely for organization. When the Munin server was installed, it also installed a cron job that runs every 5 minutes and collects data from each node. After editing the configuration file, wait 5 minutes for the charts to be generated. If you’re impatient, type:

$ sudo -u munin /usr/bin/munin-cron

View Munin Graphs

If you have lighttpd or Apache, point it at /var/cache/munin/www. If the charts have been generated properly, there should be an index.html file in that directory.

Troubleshooting Munin

If the Munin charts aren’t being generated:

  • Make sure that the directories listed in /etc/munin/munin.conf exist and have appropriate permissions for the munin user.
  • Try manually executing munin-cron and see if there is any error output.
  • Look at /var/log/syslog for any Munin-related errors.

Conclusion

That’s it! Varnish is optimized and working correctly, and Munin is reporting the important stats so you can sleep easy at night. Enjoy!

Additional Resources

Web caching references

Caching Tutorial – Excellent overview of web caching by Mark Nottingham.
Things Caches Do – Overview of reverse proxy caches like Varnish and Rack-Cache.
HTTP 1.1 Caching Specification – Official HTTP 1.1 Caching Specification.

Varnish references

A Varnish Crash Course For Aspiring Sysadmins
Varnishstat for Dummies
Varnish Best Practices
Achieving a High Hitrate

Munin references

Munin Tutorial