Monthly Archives: July 2011

Adventures in Scaling, Part 2: PostgreSQL

Several months ago, I wrote a post about REE Garbage Collection Tuning with the intent of kicking off a series dedicated to different approaches and methods applied at Miso in order to scale our service. This time around I wanted to focus on how to setup PostgreSQL on a dedicated server instance. In addition, I will cover how to tweak the configuration settings as a first-pass towards optimizing database performance and explain which parameters are the most important.

Why PostgreSQL?

Before I begin covering these topics, I want to briefly touch on why our application (and this tutorial) is centered on PostgreSQL and not one of the many other RDBMS or NoSQL alternatives available for persistence in current web development. From the multitude of “general purpose” database persistence options available, the most common choices for a startup in my experience tend to be MySQL, Drizzle, PostgreSQL and MongoDB.

Each of these options above have pros and cons and there is no one size fits all solution as is usually the case in technology. In the past, I have traditionally used MySQL to do the bulk of database persistence for Rails apps. This choice was largely because of familiarity as well as the clear MySQL favoritism in the early years of the Rails community. Though the full explanation of why is outside the scope of this post, suffice to say I won’t be choosing to use MySQL again in the future when starting a new project. My claim, unsubstantiated in this post, is that there is nothing significant MySQL provides over PostgreSQL and yet there are many pitfalls and downsides.

If you are interested in the subject, I recommend you read a few posts and draw your own conclusions. To be fair, Drizzle looks like an interesting alternative to MySQL and/or Postgres. Having never used that database, I would be curious to hear how it compares to PostgreSQL. We are big fans of MongoDB at Miso and we store several types of data for our services within collections. However, for historical and practical reasons, we did not want to dedicate the time to convert our primary dataset as the benefit at our current level was not significant enough to warrant the time involved. In a future post, I would love to delve deeper into our Polyglot Persistence strategy and why we opted to use particular technologies over alternatives.

Setting up PostgreSQL

With that explanation out of the way, let’s turn our attention to setting up PostgreSQL on a dedicated database server. In this tutorial, we will be installing Postgres 9.X on an Ubuntu machine. You may need to tweak this for your specific needs and platform depending on your specific setup.

One of the first questions you run across when setting up a dedicated PostgreSQL server is “How much RAM should the instance have?”. I would take a look at the total size of your database (or the expected size of your database in the near future). If at all possible, the ideal RAM for your instance would allow for the entire database to be placed in memory. At small scales, this should be possible and will mean major performance increases for obvious reasons. The size I recommend for a starting server is typically between 2-8GB of RAM. Furthermore, I would recommend against using a dedicated database with less than 1GB of RAM if you can help it. If you have a large dataset and need replication or sharding from the start, then I would recommend putting down this guide and buying the PostgreSQL High Performance book right now. For the purposes of the rest of this guide, I am going to assume an 8GB instance was selected.

Now that we picked the size of our instance, let’s actually start the install. Unfortunately for us, Ubuntu 10.04 apt repositories only have 8.4. The postgresql-9.0 package won’t be added until Ubuntu Natty. For this reason, unless you are using that version, you must use alternate repositories for this installation. There is an excellent utility called “python-software-properties” to make installing repositories easier. If you haven’t yet, install that first:

sudo apt-get install python-software-properties

Next, let’s add the repository containing PostgreSQL 9.0:

sudo add-apt-repository ppa:pitti/postgresql
sudo apt-get update

Now, we need to install the database, the contrib tools, and several supporting libraries:

sudo apt-get install postgresql-9.0 postgresql-contrib-9.0
sudo apt-get install postgresql-server-dev-9.0 libpq-dev libpq5

Installing all of these packages now will save you the pain later of trying to track down why certain things won’t work. Another good step is to symlink the archival cleanup tool to /usr/bin for use when you need to enable replication with WAL:

sudo ln -s /usr/lib/postgresql/9.0/bin/pg_archivecleanup /usr/bin/

At this time, you should also recreate the cluster to ensure proper UTF-8 encoding for all databases:

su postgres
pg_dropcluster --stop 9.0 main
pg_createcluster --start -e UTF-8 9.0 main
sudo /etc/init.d/postgresql restart

You can now check the status of the database using:

sudo /etc/init.d/postgresql status

Once this has all been done, you can find the configuration directory in /etc/postgresql/9.0/main and the data directory in /var/lib/postgresql/9.0/main.

Another useful tool is pg_top, which allows you to view the status of your PostgreSQL processes along with the currently executing queries:

sudo wget http://pgfoundry.org/frs/download.php/1781/pg_top-3.6.2.tar.gz
tar -zxpvf pg_top-3.6.2.tar.gz
cd pg_top-3.6.2
./configure
make

PostgreSQL and all related utilities should now be properly installed. From here, you can begin creating databases and users with the psql command:

psql -d postgres
> CREATE USER deploy WITH PASSWORD 'jw8s0F4';
> CREATE ROLE admin SUPERUSER
> CREATE DATABASE name;

Of course, there are a lot of commands you can run with this console. You are encouraged to read other sources for more information.

PostgreSQL Tuning

Now that we have successfully installed PostgreSQL, the next thing to do is to tune the configuration parameters for solid performance. Of course, the best way to do this is to measure and profile your application and set these based on your own needs. All these settings will ultimately vary based on the needs of your particular application. Nonetheless, here is a guide intended to get you started with settings that work “well enough”. From here, these parameters can be tweaked to your hearts content to find the sweet spot for your individual use case.

Fortunately for us, PostgreSQL guru Gregory Smith has created a tool to make our lives a great deal simpler. This tool is called pgtune and I encourage you to run this on your server as quickly as possible after setup. The settings this recommends should be your baseline configuration values unless you know otherwise. First, let’s download the pg_tune tool:

cd ~
wget http://pgfoundry.org/frs/download.php/2449/pgtune-0.9.3.tar.gz
tar -zxvf pgtune-0.9.3.tar.gz

Once the utility has been extracted, simply execute the binary with the proper options and recommended configuration values will be output:

cd pgtune-0.9.3
./pgtune -i /etc/postgresql/9.0/main/postgresql.conf -o ~/postgresql.conf.pgtune --type Web

This will generate all the recommended values tailored custom to your server in a file ~/postgresql.conf.pgtune. Simply view this file and note the settings at the bottom:

cat ~/postgresql.conf.pgtune
# Look at the bottom for the relevant parameter values

Take the settings and append them to your actual configuration file located at /etc/postgresql/9.0/main/postgresql.conf. To be on the safe side, you should also update the kernel shmmax property which is the maximum size of shared memory segment in bytes. This is particularly necessary for large values of effective_cache_size and other parameters:

  sysctl -w kernel.shmmax=26843545600
  sudo nano /etc/sysctl.conf
    # Append the following line:
    kernel.shmmax=26843545600
  sudo sysctl -p /etc/sysctl.conf

Once this has been updated and you have saved the modified settings, restart your cluster for the settings to take affect:

pg_ctlcluster 9.0 main reload

Now that we have setup these tuned parameters, let’s take a look deeper and delve into the most important parameters you can tweak to improve your database performance. The list below is not a comprehensive guide, but in most cases these parameters will serve as the first values to experiment with after using pg_tune.

  • work_mem – Specifies the amount of memory to be used by internal sort operations and hash tables before switching to temporary disk files. The total memory used could be many times the value of work_mem; it is necessary to keep this fact in mind when choosing the value. Sort operations are used for ORDER BY, DISTINCT, and merge joins. The recommended range for this value is the total available memory / (2 * max_connections). On an 8GB system, this could be set to 40MB.
  • effective_cache_size – Sets the planner’s assumption about the effective size of the disk cache that is available to a single query. This is factored into estimates of the cost of using an index; a higher value makes it more likely index scans will be used, a lower value makes it more likely sequential scans will be used. The recommended range for this value is between 60%-80% of your total available RAM. On an 8GB system, this could be set to 5120MB.
  • shared_buffers – Sets the amount of memory the database server uses for shared memory buffers. The default is typically 32 megabytes and must be at least 128 kilobytes. The recommended range for this value is between 20%-30% of the total available RAM. On an 8GB system, this could be set to 2048MB.
  • wal_buffers – The amount of memory used in shared memory for WAL data. The default is 64 kilobytes (64kB). The setting need only be large enough to hold the amount of WAL data generated by one typical transaction. The recommended range for this value is between 2-16MB. On an 8GB system, this could be set to 12MB.
  • maintenance_work_mem – Specifies the maximum amount of memory to be used in maintenance operations, such as VACUUM, CREATE INDEX, and ALTER TABLE ADD FOREIGN KEY. Larger settings may improve performance for vacuuming and for restoring database dumps. The recommended range for this value is 50MB per GB of the total available RAM. On an 8GB system, this could be set to 400MB.
  • max_connections – Determines the maximum number of concurrent connections to the database server. The default is typically 100 connections and this parameter can only be set at server start. The recommended range for this value is between 100-200.

Beyond this for further reading, I would recommend checking out other articles, and of course Gregory Smith’s book PostgreSQL High Performance.

Wrapping Up

Hopefully you have found the guide above helpful. The intent was to provide an easy way for people to get started using PostgreSQL in their Rails applications today. If you are starting a new Rails application, I strongly encourage you to make your own informed decision on which persistence engine to use. Research the issue for yourself but I would urge you to at least consider giving PostgreSQL an honest try. Coming from using MySQL prior, I can assure you we did not miss a single thing about MySQL when we migrated.

This post is the first of many that will deal with PostgreSQL and provide details on how to get this excellent RDBMS to work for your application. In a future post, I would like to cover how to setup “hot standby” replication with archival logs (WAL). That guide will take you step by step through creating a master database and several read-only slaves as well as then utilizing those slaves from within your Rails application. We also ported from MySQL to PostgreSQL, so another future post will detail how to best migrate your database as well as pitfalls to avoid.

Persist your data in YAML files instead of SQL database. Wait what?

Why YAML Record?

At Miso, we occasionally ran into situations where we wanted to persist simple data to a text (or yaml) file. Examples include landing page email forms, contact forms, feedback forms, about us “team” pages, etc. In these situations, we wanted a way to persist data to YAML and easily view the results in a text file but also manage the data through forms and controllers as any other data. To achieve this we have created a YAML-backed persistence engine called YAML Record that allows access to YAML based data using a familiar ActiveRecord API.

If you have a small amount of data to persist which changes infrequently but that you do need to update from time to time, you should investigate using YAML Record. In other cases where you have frequently updated information, several query requirements or a medium/large set of data, a SQL or NoSQL solution would probably be a better fit. Using YAML for persistence doesn’t work in every case so use this approach wisely.

YamlRecord is a standalone, simple and lightweight way to map your data from a yaml file, and can easily be integrated along with ActiveRecord and we try to provide the same APIs. On the other hand if you’re using DataMapper, I would recommend checking out dm-yaml-adapter which allows for similar functionality.

How do we use this at Miso?

We have several use cases for YamlRecord. The first one is to persist our team members profiles on our about page. Using YAML Record, we can easily add, update or remove any of this information with a RESTful architecture and standard controllers. We even implemented full DragonFly support with YAML Record so we can link up to each team member with a picture. I’ll probably cover this functionality on my next blog post.

In another case, we use YAML Record for our FAQ. From an relational data perspective, storing this questions requires only a single table without any relationships to other tables or any indexes. For cases like this, a SQL solution and the overhead of creating a migration and a table doesn’t always make sense. We definitely could use a SQL table, but having one with 10 rows and only 3 columns seems almost silly.

------------------------------------
|  id  | question | answer         |
------------------------------------
|  1   | foo?     | blablabla      |
------------------------------------
|  2   | bar?     | blablaaaaa     |
------------------------------------
|  3   | L33t?    | Y3s            |
------------------------------------
|
...
|
------------------------------------
|  10  | n00b?    | n0!            |
------------------------------------

I would rather have a YAML file which looks like the example below. The main upside of using a YAML file is Rails is going to load entire file in memory once, that way your application doesn’t need to go back and forth to read your file each time.

---
- :id: 08cc5f757b26903b9e7b6fcc3a3fbe
  :question: foo?
  :answer: blablabla
- :id: 4f2e73f896d0443c2b66f57cbdb2bb
  :question: bar?
  :answer: blablaaaaa
- :id: 5197b037524a1629d27c69f7aa9891
  :question: L33t?
  :answer: Y3s
- :id: f77c91577af4bf9b1e994e1b5b9ecd
  :question: n00b?
  :answer: n0!

Using the same APIs as ActiveRecord

Here’s a quick sneak peek on what you can achieve with YamlRecord. Given a class HotGirl:

class HotGirl < YamlRecord::Base
  # Declare your properties
  properties :name, :sex_appeal_rate, :age

  # Declare source file path (config/hot_girls.yml)
  source Rails.root.join("config/hot_girls")
end

Here’s some of actions you can do:

# Create your instance as on active record
@hg = HotGirl.new(:name => "Jennifer Anniston", :sex_appeal_rate => 75, :age => 42) # @hg
@hg.save # => true

# Alternatively
@hg2 = HotGirl.create(:name => "Scarlett Johansson", :sex_appeal_rate => 92, :age => 26)

# You can get your items easily
HotGirl.all # => [@megan_fox, @jessica_alba ...]

# Update the record attributes
@megan_fox.update_attributes(:sex_appeal_rate => 100) # => true

# Destroy it
@miss_usa_2011.destroy => true

We tried to provide all essential APIs ActiveRecord provides and more are coming. If you want to know about the available APIs, you can find it on the README or Rdoc

Feedback?

We would love to hear your feedback on this concept to persist data to a YAML file in certain simple cases and about how we can improve this library in the future. The source code is available on Github and we welcome all patches and pull requests. Feel free to share your thoughts about it.

Conclusion

We really believe this concept of persisting data in YAML file makes a lot of sense for us in particular scenarios. Building this was actually a challenge for me as it was my first gem. I want to thank Nathan for his support building the gem and I really learned a lot in the process. After all, I guess we did it for fun and as an experiment to see how this would work.

We also found these 2 gems in same vein as YAML Record, which are worthwhile to check out:

Hybrid (Native + Web) Mobile App Development • Part 2: Maintaining EJS templates, and Bridging Interactions

Welcome to part 2 of this multi-part series of blog posts where we venture into the world of hybrid frameworks. This is where we get into the juicy stuff, if you are new to this, I suggest reading part 1 to understand the motivation behind this approach.

Big Picture Stuff

In a nutshell, what we’re trying to do here is mash a JSON response from a RESTful API call with a ejs (similar to rail’s erb) template to form the html to be rendered by a UIWebview. Simple, until you start asking questions such as how to an element to transition to a different state? How do you AJAX style interactions? How is template/asset syncing managed? These are all critical questions to ask of a hybrid framework. So before we dive into it, let’s make a short list of what a hybrid framework should be capable of before diving deeper into each feature separately.

A hybrid framework should:

* Maintain template syncing from a third party resource

* Handle bridging interactions between the web and native views

* Manage a RESTful API SDK

Maintain Template Syncing

At Miso, our mobile web templates are served off an Amazon S3 server. The basic strategy we employed is to download the web templates and cache it on disk, and when you open up the app in the future it would update it’s local templates with the latest and greatest by poking S3.

To accomplish this we used ASIHTTPRequest, a popular web request wrapper, to maintain web requests we send to the S3 server. This means asynchronous requests, cache responses, and a ASINetworkQueue that lets you handle multiple web requests.

There were a few nuggets we threw in worth mentioning here that made this strategy more efficient and scalable. We maintain a mainfest file on the server that has the list of assets the mobile device needs to download. The file would look something like this:

homepage.ejs?12398789
javascript/application.js?9437392
images/logo.png?1283740

We keep a versioning number by appending ?<some numbers> after each file. This is generated automatically every time we make updates to the web templates. Having this file allows us to scale well when we add more files, because all we have to download is the manifest file on startup.We then leverage ASIHTTPRequest’s cachedResponse flag and iterate through each file and check with s3 to ensure that this file hasn’t changed since we last fetched it.

To give a more concrete view of how exactly we maintain template syncing from a web request stand point, here’s a code snippet of an implementation of the delegate method where a ASIHTTPRequest completes showing the logic explained earlier:

- (void)requestFinished:(ASIHTTPRequest *)request {
    NSString *responseString = [[[NSString alloc] initWithData:[request responseData] encoding:NSUTF8StringEncoding] autorelease];

    // If the returned file is the manifest file, iterate through it and send web requests for each asset
    if ([[[request url] description] hasSuffix:@"manifest.mf"]) {
        // Separate assets and shove them into an array
        NSArray *assets = [responseString componentsSeparatedByCharactersInSet:[NSCharacterSet newlineCharacterSet]];

        for (NSString *asset in assets) {
            // Get rid of trailing white spaces in assets list
            if ([[asset stringByTrimmingCharactersInSet:[NSCharacterSet whitespaceCharacterSet]] isEqualToString:@""])
                continue;

            // Add each asset in ASINetworkQueue
            [self addUrl:asset queue:_queue];
        }
    } else {
        // If it's asset file, templates on all views need to reload unless the response is from cache
        _needsReload = _needsReload || ![request didUseCachedResponse];
    }

    NSString *file = [[request url] relativePath];
    NSString *fqFilePath = [_localBaseUrl stringByAppendingPathComponent:file];

    // If the file doesn't exist on disk, or the response is not from cache, then it's a updated file and we should save it
    if (![request didUseCachedResponse] || ![[NSFileManager defaultManager] fileExistsAtPath:fqFilePath]) {
        [self saveLayoutFile:[[request url] relativePath] data:[request responseData]];
        _needsReload = YES;
    }
}

I want to mention one last thing before I wrap up this section, and that is how we handle updating templates across different version of our app. Let’s consider the situation where you intend to release a new version of your app complete with the latest web templates. You realize that there are backwards compatibility issues of the new templates with an older version of the app. If you were to update the current web templates it’d break everyone on an older version of the app. Ooops! How we work around that is to maintain separate buckets of web templates for different versions of our app. It’s not always necessary, but something good to keep in mind.

Bridging Interactions between Web and Native views

Let’s elaborate on what this means via examples:

* User clicks on a link in a webview, and it pushes another UIViewController onto the navigation stack

* User posts a comment via a native controller, and we want that to fire off an AJAX request and render on to the web view

How Miso achieves this is by defining certain protocols to conform to when generating links within our EJS templates. That way, when a load request is captured by the UIWebView we can then route these requests to different parts of the app.

- (BOOL)webView:(UIWebView *)webView shouldStartLoadWithRequest:(NSURLRequest *)request
 navigationType:(UIWebViewNavigationType)navigationType {
    NSString *url = [[request URL] description];

    // Check for protocol type and determine routing
    if (navigationType == UIWebViewNavigationTypeOther) {
        return YES;
    } else if ([url hasPrefix:@"miso://ajax"]) {
        [_ajaxController fireAjaxRequest:url];
        return NO;
    } else {
        // RoutesController delegates native app actions given a miso://<controller>/<action>?<params>
        [[RoutesController instance] processRoute:url viewController:_vc webview:self];
        return NO;
    }
}

The code snippet above is a stripped down version of what we use currently, but the basic concept is the same. We delegate routing by implementing the shouldStartLoadWithRequest delegate method for UIWebViews. So in the case where we capture a miso:// protocol we pass it along to other classes to process.

In the next part, we will start venturing further down the rabbit hole and look into the concepts behind RoutesController and AjaxController. Stay tuned!

Miso Hackathon I

Miso (and I) had our first Hackathon last week, and both of us survived! Our goal: conceive of an idea and build it within 3 days. We built Listify – an iPhone app that let’s you create, browse, and upvote lists of TV show and movies. Also with accompanying web landing pages.

A Little Bit of Context

At Miso, we’ve been working on making fun things for people watching TV, starting with check-ins, but experimenting with synchronized trivia, exclusive content, live Q & A. However, over the last three months, we’ve been focusing on ideas that connect friends around what they watch.

Why a Hackathon?

Entrepreneurs seem to be drinking heavily from the “Lean Startup” and “customer development” kool-aid.  One basic tenet is that product ideas/assumptions can be proven in days, not weeks or months.  Common tactics include applying MacGyver-like strategies to challenge product assumptions.  For example, create a splash page (this was our first attempt – http://playredzone.herokuapp.com) for a product before writing any code.

Generally speaking, it can be challenging to apply their methodologies consistently, but Hackathons certainly seem to intrinsically embody the spirit of Lean Startup.  Also, as most engineers will understand, creating something new from end to end is a very rewarding experience.

On to the Main Event

As we were deciding what to build for the Hackathon, we set the following constraints:

  • Should be consistent with the “connect friends around what they watch” theme.
  • Should be built on the Miso API (i.e., not as a feature in our main product).
  • We have to want to use it.
  • Must be finished in 3 days.

After some happy hour brainstorming, we came up with the basic motivation and mechanics behind Listify.

The need: People need help discovering what movies and TV shows to watch.

The underlying psychologies:

  • People want to express themselves. Lists are a structured and easy way to do that.
  • People take pride in their sense of taste. The band they “discovered”, the new (or old) indie film, are all worthy Facebook posts.
  • People enjoy communities of like-minded people. They’re hard to create, but when it works, it’s unbeatable.

Implementation

Over the course of a few hours, we defined a minimum (and a reach) feature set, then designed and sketched an iPhone application, with accompanying web landing pages for the lists. We also spec’d the RESTful API that would support the iPhone app.

Over the next two (long) days, Josh and I built the iPhone app, while Nathan and Nico implemented the API (which utilized the Miso API, RABL, and Padrino) and the web splash pages. Nico served double duty as our art director.  Aside from some idle discussion that a little thought and investment in library development could make building iPhone apps that are fairly straightforward views of a RESTful API much faster, the implementation went without much incident.

While we’ve been happily using the app in beta for a week or so, we hope to release the app to the iPhone App Store after about a day of cleanup work, as soon as we find the time.

Conclusion

Overall, putting aside all the business rationalizations for doing a company hackathon, building something that we were only imagining a few days before was a ton of fun and a testament to why we’re engineers in the first place.  I’ll be sure to update this blog post when we actually submit to the App Store.