
Vendor – Bringing Bundler to iOS

Using and sharing iOS libraries is tragically difficult given the maturity of the framework, the number of active developers, and the size of the open source community.  Compare this with Rails, which has RubyGems and Bundler.  It’s no surprise that Rails/Ruby has a resource like The Ruby Toolbox, while iOS development has…nothing?

Are iOS developers fundamentally less collaborative than Rails developers?  When developing on Rails, if I ever find myself developing anything remotely reusable, I can almost always be certain that there is a gem for it (probably with an obnoxiously clever name).

I don’t think the spirit of collaboration is lacking in the iOS developer community; rather, there are a few fundamental challenges with iOS library development:

  • Libraries are hard to integrate into a project.  Granted, it’s not *that* hard to follow a brief set of instructions, but why can’t this process be more streamlined?
  • No versioning standard.  Once a library is integrated into a project, there is no standard way of capturing which version of the library was used.
  • No dependency specification standard (this is a big problem).  Why does facebook-ios-sdk embed its own JSON library when there are better ones available?  So many libraries come embedded with common libraries that we all use – and worse than that, who knows what version they’re using!  Not only can this lead to duplicate symbols, but library developers essentially have to start from scratch instead of leveraging other existing libraries.

Of course, coming up with a naming, versioning, and dependency standard for iOS libraries and convincing everyone to adopt it is a daunting task.  One possible approach is to follow the example of Homebrew, a popular package manager for OS X.  Homebrew turned installing and updating packages on OS X into a simple process.  Instead of convincing everyone to comply with some standard, Homebrew maintains a set of formulas that describe commonly used packages.  These formulas allow Homebrew to automate the installation process as well as enforce dependencies.  This works well for Homebrew, although it puts the burden of maintaining package specifications in one place, rather than distributing it as Ruby gems do.

There seems to be a need for some type of solution here.  When we write Rails libraries, we write the readme first to help us understand what problem we’re trying to solve (Readme Driven Development).  Below is the readme for an iOS packaging system called Vendor.

Vendor – an iOS library management system

Vendor makes the process of using and managing libraries in iOS easy.  Vendor leverages the Xcode Workspaces feature introduced with Xcode 4 and is modeled after Bundler. Vendor streamlines the installation and update process for dependent libraries.  It also tracks versions and manages dependencies between libraries.

Step 1) Specify dependencies

Specify your dependencies in a Vendors file in your project’s root.

source "https://github.com/bazaarlabs/vendor"
lib "facebook-ios-sdk"  # Formula specified at source above
lib "three20"
lib "asi-http-request", :git => "https://github.com/pokeb/asi-http-request.git"
lib "JSONKit", :git => "https://github.com/johnezang/JSONKit.git"
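Because the Vendors file is plain Ruby, a tool could read it by evaluating it against an object that defines `source` and `lib`. The sketch below is only an illustration of that idea: the `VendorsFile` class, its `Library` struct, and the `parse` method are hypothetical names and not part of any released Vendor code.

```ruby
# Hypothetical sketch: evaluate a Vendors file as a small Ruby DSL.
# VendorsFile, Library, and parse are illustrative names only.
class VendorsFile
  Library = Struct.new(:name, :options)

  attr_reader :sources, :libraries

  def initialize
    @sources = []
    @libraries = []
  end

  # Evaluate the text of a Vendors file and return the populated object.
  def self.parse(text)
    file = new
    file.instance_eval(text)
    file
  end

  # `source "url"` registers a formula repository.
  def source(url)
    @sources << url
  end

  # `lib "name", :git => "url"` registers a dependency and its options.
  def lib(name, options = {})
    @libraries << Library.new(name, options)
  end
end

vendors = VendorsFile.parse(<<~VENDORS)
  source "https://github.com/bazaarlabs/vendor"
  lib "facebook-ios-sdk"
  lib "JSONKit", :git => "https://github.com/johnezang/JSONKit.git"
VENDORS

# Libraries without a :git option fall back to a formula from the source.
vendors.libraries.each do |library|
  origin = library.options[:git] || "formula from #{vendors.sources.first}"
  puts "#{library.name}: #{origin}"
end
```

Evaluating the file with `instance_eval` keeps the syntax identical to the listing above, at the cost of running arbitrary Ruby from the project, which is the same trade-off Bundler makes with a Gemfile.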

Step 2) Install dependencies

vendor install
git add Vendors.lock

Installing a vendor library gets the latest version of the code, and adds the Xcode project to the workspace.  As part of the installation process, the library is set up as a dependency of the main project, header search paths are modified, and required frameworks are added.  The installed version of the library is captured in the Vendors.lock file.

After a fresh checkout of a project from source control, the Xcode workspace may contain links to projects that don’t exist in the file system because vendor projects are not checked into source control. Run `vendor install` to restore the vendor projects.

Other commands

# Updating all dependencies will update all libraries to their latest versions.
vendor update
# Specifying the dependency will cause only the single library to be updated.
vendor update facebook-ios-sdk

Adding a library formula

If a library has no framework dependencies, requires no additional compiler/linker flags, and has an Xcode project, it doesn’t require a Vendor formula. An example is JSONKit, which may be specified as below. However, if another Vendor library requires JSONKit, JSONKit must have a Vendor formula.

lib "JSONKit", :git => "https://github.com/johnezang/JSONKit.git"

If the library requires frameworks or has dependencies on other Vendor libraries, it must have a Vendor formula.  As with Homebrew, a Vendor formula is some declarative Ruby code that is open source and centrally managed.

An example Vendor formula might look like:

require 'formula'

class Three20 < Formula
  url "https://github.com/facebook/three20"
  libraries "libThree20.a"
  frameworks "CoreAnimation"
  header_path "three20/Build/Products/three20"
  linker_flags "ObjC", "all_load"
  vendors "JSONKit"
end
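For a formula like this to load, the `Formula` base class has to supply `url`, `libraries`, and the other keywords as class-level methods. Here is one minimal way that could be done. It is a sketch under assumptions: the `setting` helper and the `spec` hash are invented for illustration, and no real Vendor implementation is implied.

```ruby
# Hypothetical sketch of a Formula base class for the declarative syntax.
class Formula
  # Define one class-level method per keyword; each call records its
  # arguments in the spec hash of whichever subclass invoked it.
  def self.setting(*names)
    names.each do |name|
      define_singleton_method(name) do |*values|
        spec[name] = values
      end
    end
  end

  # Per-class storage, so formulas do not share state with each other.
  def self.spec
    @spec ||= {}
  end

  setting :url, :libraries, :frameworks, :header_path, :linker_flags, :vendors
end

class Three20 < Formula
  url "https://github.com/facebook/three20"
  libraries "libThree20.a"
  frameworks "CoreAnimation"
  linker_flags "ObjC", "all_load"
  vendors "JSONKit"
end

# An installer could walk the recorded values, for example to resolve
# the other Vendor libraries this formula depends on.
puts Three20.spec[:vendors].inspect
```

Storing every keyword as an array keeps single-value and multi-value settings uniform, so the installer never has to check which shape it received.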

Conclusion

Using iOS libraries is way harder than it should be, which has negatively impacted the growth of the open source community for iOS.  Even if I were only developing libraries for myself, I would still want some kind of packaging system to help me manage the code.  Vendor essentially streamlines the flow described by Jonas Williams here.  Unfortunately, programmatically managing Xcode projects isn’t supported natively, but people have implemented various solutions, such as Three20, Victor Costan, and XCS.

Open Questions

  • Is there an existing solution for this?
  • Would this be a useful gem for your iOS development?
  • Why hasn’t anyone built something like this already? Impossible to build?

Adventures in Scaling, Part 2: PostgreSQL

Several months ago, I wrote a post about REE Garbage Collection Tuning with the intent of kicking off a series dedicated to different approaches and methods applied at Miso in order to scale our service. This time around I want to focus on how to set up PostgreSQL on a dedicated server instance. In addition, I will cover how to tweak the configuration settings as a first pass toward optimizing database performance and explain which parameters are the most important.

Why PostgreSQL?

Before I begin covering these topics, I want to briefly touch on why our application (and this tutorial) is centered on PostgreSQL and not one of the many other RDBMS or NoSQL alternatives available for persistence in current web development. From the multitude of “general purpose” database persistence options available, the most common choices for a startup in my experience tend to be MySQL, Drizzle, PostgreSQL and MongoDB.

Each of these options has pros and cons, and as is usually the case in technology, there is no one-size-fits-all solution. In the past, I have traditionally used MySQL to do the bulk of database persistence for Rails apps. This choice was largely because of familiarity as well as the clear MySQL favoritism in the early years of the Rails community. Though the full explanation of why is outside the scope of this post, suffice it to say I won’t be choosing MySQL again when starting a new project. My claim, unsubstantiated in this post, is that there is nothing significant MySQL provides over PostgreSQL and yet there are many pitfalls and downsides.

If you are interested in the subject, I recommend you read a few posts and draw your own conclusions. To be fair, Drizzle looks like an interesting alternative to MySQL and/or Postgres. Having never used that database, I would be curious to hear how it compares to PostgreSQL. We are big fans of MongoDB at Miso and we store several types of data for our services within collections. However, for historical and practical reasons, we did not want to dedicate the time to convert our primary dataset as the benefit at our current level was not significant enough to warrant the time involved. In a future post, I would love to delve deeper into our Polyglot Persistence strategy and why we opted to use particular technologies over alternatives.

Setting up PostgreSQL

With that explanation out of the way, let’s turn our attention to setting up PostgreSQL on a dedicated database server. In this tutorial, we will be installing Postgres 9.X on an Ubuntu machine. You may need to adjust these steps for your own platform and setup.

One of the first questions you run across when setting up a dedicated PostgreSQL server is “How much RAM should the instance have?”. I would take a look at the total size of your database (or the expected size of your database in the near future). If at all possible, the ideal RAM for your instance would allow for the entire database to be placed in memory. At small scales, this should be possible and will mean major performance increases for obvious reasons. The size I recommend for a starting server is typically between 2-8GB of RAM. Furthermore, I would recommend against using a dedicated database with less than 1GB of RAM if you can help it. If you have a large dataset and need replication or sharding from the start, then I would recommend putting down this guide and buying the PostgreSQL High Performance book right now. For the purposes of the rest of this guide, I am going to assume an 8GB instance was selected.

Now that we picked the size of our instance, let’s actually start the install. Unfortunately for us, Ubuntu 10.04 apt repositories only have 8.4. The postgresql-9.0 package won’t be added until Ubuntu Natty. For this reason, unless you are using that version, you must use alternate repositories for this installation. There is an excellent utility called “python-software-properties” to make installing repositories easier. If you haven’t yet, install that first:

sudo apt-get install python-software-properties

Next, let’s add the repository containing PostgreSQL 9.0:

sudo add-apt-repository ppa:pitti/postgresql
sudo apt-get update

Now, we need to install the database, the contrib tools, and several supporting libraries:

sudo apt-get install postgresql-9.0 postgresql-contrib-9.0
sudo apt-get install postgresql-server-dev-9.0 libpq-dev libpq5

Installing all of these packages now will save you the pain later of trying to track down why certain things won’t work. Another good step is to symlink the archival cleanup tool to /usr/bin for use when you need to enable replication with WAL:

sudo ln -s /usr/lib/postgresql/9.0/bin/pg_archivecleanup /usr/bin/

At this time, you should also recreate the cluster to ensure proper UTF-8 encoding for all databases:

sudo pg_dropcluster --stop 9.0 main
sudo pg_createcluster --start -e UTF-8 9.0 main
sudo /etc/init.d/postgresql restart

You can now check the status of the database using:

sudo /etc/init.d/postgresql status

Once this has all been done, you can find the configuration directory in /etc/postgresql/9.0/main and the data directory in /var/lib/postgresql/9.0/main.

Another useful tool is pg_top, which allows you to view the status of your PostgreSQL processes along with the currently executing queries:

wget http://pgfoundry.org/frs/download.php/1781/pg_top-3.6.2.tar.gz
tar -zxpvf pg_top-3.6.2.tar.gz
cd pg_top-3.6.2
./configure
make
sudo make install

PostgreSQL and all related utilities should now be properly installed. From here, you can begin creating databases and users with the psql command:

psql -d postgres
> CREATE USER deploy WITH PASSWORD 'jw8s0F4';
> CREATE ROLE admin SUPERUSER;
> CREATE DATABASE name;

Of course, there are a lot of commands you can run with this console. You are encouraged to read other sources for more information.

PostgreSQL Tuning

Now that we have successfully installed PostgreSQL, the next thing to do is to tune the configuration parameters for solid performance. Of course, the best way to do this is to measure and profile your application and set these based on your own needs. All these settings will ultimately vary based on the needs of your particular application. Nonetheless, here is a guide intended to get you started with settings that work “well enough”. From here, these parameters can be tweaked to your heart’s content to find the sweet spot for your individual use case.

Fortunately for us, PostgreSQL guru Gregory Smith has created a tool to make our lives a great deal simpler. This tool is called pgtune and I encourage you to run this on your server as quickly as possible after setup. The settings this recommends should be your baseline configuration values unless you know otherwise. First, let’s download the pgtune tool:

cd ~
wget http://pgfoundry.org/frs/download.php/2449/pgtune-0.9.3.tar.gz
tar -zxvf pgtune-0.9.3.tar.gz

Once the utility has been extracted, simply execute the binary with the proper options and recommended configuration values will be output:

cd pgtune-0.9.3
./pgtune -i /etc/postgresql/9.0/main/postgresql.conf -o ~/postgresql.conf.pgtune --type Web

This will generate recommended values tailored to your server in the file ~/postgresql.conf.pgtune. Simply view this file and note the settings at the bottom:

cat ~/postgresql.conf.pgtune
# Look at the bottom for the relevant parameter values

Take the settings and append them to your actual configuration file located at /etc/postgresql/9.0/main/postgresql.conf. To be on the safe side, you should also update the kernel shmmax property, which is the maximum size of a shared memory segment in bytes. This is particularly necessary for large values of shared_buffers and other parameters that consume shared memory:

  sudo sysctl -w kernel.shmmax=26843545600
  sudo nano /etc/sysctl.conf
    # Append the following line:
    kernel.shmmax=26843545600
  sudo sysctl -p /etc/sysctl.conf

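Once this has been updated and you have saved the modified settings, restart your cluster for the settings to take effect (a reload is not enough, since parameters such as shared_buffers and max_connections are only read at server start):

sudo pg_ctlcluster 9.0 main restart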

Now that we have set up these tuned parameters, let’s take a deeper look at the most important parameters you can tweak to improve your database performance. The list below is not a comprehensive guide, but in most cases these parameters will serve as the first values to experiment with after using pgtune.

  • work_mem – Specifies the amount of memory to be used by internal sort operations and hash tables before switching to temporary disk files. The total memory used could be many times the value of work_mem; it is necessary to keep this fact in mind when choosing the value. Sort operations are used for ORDER BY, DISTINCT, and merge joins. The recommended range for this value is the total available memory / (2 * max_connections). On an 8GB system, this could be set to 40MB.
  • effective_cache_size – Sets the planner’s assumption about the effective size of the disk cache that is available to a single query. This is factored into estimates of the cost of using an index; a higher value makes it more likely index scans will be used, a lower value makes it more likely sequential scans will be used. The recommended range for this value is between 60%-80% of your total available RAM. On an 8GB system, this could be set to 5120MB.
  • shared_buffers – Sets the amount of memory the database server uses for shared memory buffers. The default is typically 32 megabytes and must be at least 128 kilobytes. The recommended range for this value is between 20%-30% of the total available RAM. On an 8GB system, this could be set to 2048MB.
  • wal_buffers – The amount of memory used in shared memory for WAL data. The default is 64 kilobytes (64kB). The setting need only be large enough to hold the amount of WAL data generated by one typical transaction. The recommended range for this value is between 2-16MB. On an 8GB system, this could be set to 12MB.
  • maintenance_work_mem – Specifies the maximum amount of memory to be used in maintenance operations, such as VACUUM, CREATE INDEX, and ALTER TABLE ADD FOREIGN KEY. Larger settings may improve performance for vacuuming and for restoring database dumps. The recommended range for this value is 50MB per GB of the total available RAM. On an 8GB system, this could be set to 400MB.
  • max_connections – Determines the maximum number of concurrent connections to the database server. The default is typically 100 connections and this parameter can only be set at server start. The recommended range for this value is between 100-200.
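The rules of thumb in the list above are simple arithmetic, so they can be written down as a small calculator. The sketch below only restates those ratios as starting points. It is not pgtune’s algorithm, and the `suggested_settings` method and its key names are made up for this example.

```ruby
# Sketch of the rules of thumb listed above; values are in megabytes.
# suggested_settings and its keys are hypothetical names.
def suggested_settings(total_ram_mb, max_connections: 100)
  {
    # total available memory / (2 * max_connections)
    work_mem_mb: total_ram_mb / (2 * max_connections),
    # between 60% and 80% of total RAM
    effective_cache_size_mb: (total_ram_mb * 60 / 100)..(total_ram_mb * 80 / 100),
    # between 20% and 30% of total RAM
    shared_buffers_mb: (total_ram_mb * 20 / 100)..(total_ram_mb * 30 / 100),
    # 50MB per GB of total RAM
    maintenance_work_mem_mb: 50 * total_ram_mb / 1024
  }
end

# The 8GB instance assumed throughout this guide.
settings = suggested_settings(8 * 1024)
settings.each { |name, value| puts "#{name}: #{value}" }
```

For the 8GB instance this gives a work_mem of 40MB and a maintenance_work_mem of 400MB, and the example values of 5120MB and 2048MB quoted in the list fall inside the computed ranges. Treat the output as a baseline to measure against, not a final configuration.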

Beyond this for further reading, I would recommend checking out other articles, and of course Gregory Smith’s book PostgreSQL High Performance.

Wrapping Up

Hopefully you have found the guide above helpful. The intent was to provide an easy way for people to get started using PostgreSQL in their Rails applications today. If you are starting a new Rails application, I strongly encourage you to make your own informed decision on which persistence engine to use. Research the issue for yourself but I would urge you to at least consider giving PostgreSQL an honest try. Coming from using MySQL prior, I can assure you we did not miss a single thing about MySQL when we migrated.

This post is the first of many that will deal with PostgreSQL and provide details on how to get this excellent RDBMS to work for your application. In a future post, I would like to cover how to set up “hot standby” replication with archival logs (WAL). That guide will take you step by step through creating a master database and several read-only slaves as well as then utilizing those slaves from within your Rails application. We also ported from MySQL to PostgreSQL, so another future post will detail how to best migrate your database as well as pitfalls to avoid.

Persist your data in YAML files instead of a SQL database. Wait, what?

Why YAML Record?

At Miso, we occasionally ran into situations where we wanted to persist simple data to a text (or yaml) file. Examples include landing page email forms, contact forms, feedback forms, about us “team” pages, etc. In these situations, we wanted a way to persist data to YAML and easily view the results in a text file but also manage the data through forms and controllers as any other data. To achieve this we have created a YAML-backed persistence engine called YAML Record that allows access to YAML based data using a familiar ActiveRecord API.

If you have a small amount of data to persist which changes infrequently but that you do need to update from time to time, you should investigate using YAML Record. In other cases where you have frequently updated information, several query requirements or a medium/large set of data, a SQL or NoSQL solution would probably be a better fit. Using YAML for persistence doesn’t work in every case so use this approach wisely.

YamlRecord is a standalone, simple and lightweight way to map your data from a YAML file. It can easily be integrated alongside ActiveRecord, and we try to provide the same APIs. On the other hand, if you’re using DataMapper, I would recommend checking out dm-yaml-adapter, which allows for similar functionality.

How do we use this at Miso?

We have several use cases for YamlRecord. The first one is to persist our team members’ profiles on our about page. Using YAML Record, we can easily add, update or remove any of this information with a RESTful architecture and standard controllers. We even implemented full DragonFly support with YAML Record so we can link each team member to a picture. I’ll probably cover this functionality in my next blog post.

In another case, we use YAML Record for our FAQ. From a relational data perspective, storing these questions requires only a single table without any relationships to other tables or any indexes. For cases like this, a SQL solution and the overhead of creating a migration and a table don’t always make sense. We definitely could use a SQL table, but having one with 10 rows and only 3 columns seems almost silly.

------------------------------------
|  id  | question | answer         |
------------------------------------
|  1   | foo?     | blablabla      |
------------------------------------
|  2   | bar?     | blablaaaaa     |
------------------------------------
|  3   | L33t?    | Y3s            |
------------------------------------
|  ... | ...      | ...            |
------------------------------------
|  10  | n00b?    | n0!            |
------------------------------------

I would rather have a YAML file which looks like the example below. The main upside of using a YAML file is that Rails loads the entire file into memory once, so your application doesn’t need to go back and re-read the file each time.

---
- :id: 08cc5f757b26903b9e7b6fcc3a3fbe
  :question: foo?
  :answer: blablabla
- :id: 4f2e73f896d0443c2b66f57cbdb2bb
  :question: bar?
  :answer: blablaaaaa
- :id: 5197b037524a1629d27c69f7aa9891
  :question: L33t?
  :answer: Y3s
- :id: f77c91577af4bf9b1e994e1b5b9ecd
  :question: n00b?
  :answer: n0!
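To make the “load once, keep in memory” idea concrete, here is a minimal sketch of a YAML-backed store built only on Ruby’s standard library. It is not YAML Record’s implementation. The `FaqStore` class and its `all`, `find`, and `create` methods are hypothetical, and the ids are shortened placeholders.

```ruby
require "yaml"
require "tempfile"

# Hypothetical sketch of a tiny YAML-backed store (not YAML Record itself).
class FaqStore
  def initialize(path)
    @path = path
    @records = nil
  end

  # Read and parse the file on first access, then serve from memory.
  # Symbol is permitted because the keys are written as :id, :question, ...
  def all
    @records ||= YAML.safe_load(File.read(@path), permitted_classes: [Symbol]) || []
  end

  def find(id)
    all.find { |record| record[:id] == id }
  end

  # Append a record and write the whole collection back to disk.
  def create(attributes)
    all << attributes
    File.write(@path, YAML.dump(all))
    attributes
  end
end

# Use a temporary file so the example does not touch a real config file.
file = Tempfile.new(["faq", ".yml"])
path = file.path
file.close

store = FaqStore.new(path)
store.create(:id => "a1", :question => "foo?", :answer => "blablabla")
store.create(:id => "b2", :question => "bar?", :answer => "blablaaaaa")

# A fresh store reads the same records back from disk.
reloaded = FaqStore.new(path)
puts reloaded.find("b2")[:answer]
```

Rewriting the whole file on every change is only reasonable because the data set is small and rarely updated, which is exactly the situation described above.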

Using the same APIs as ActiveRecord

Here’s a quick sneak peek at what you can achieve with YamlRecord. Given a class HotGirl:

class HotGirl < YamlRecord::Base
  # Declare your properties
  properties :name, :sex_appeal_rate, :age

  # Declare source file path (config/hot_girls.yml)
  source Rails.root.join("config/hot_girls")
end

Here are some of the actions you can perform:

# Create your instance as on active record
@hg = HotGirl.new(:name => "Jennifer Aniston", :sex_appeal_rate => 75, :age => 42)
@hg.save # => true

# Alternatively
@hg2 = HotGirl.create(:name => "Scarlett Johansson", :sex_appeal_rate => 92, :age => 26)

# You can get your items easily
HotGirl.all # => [@megan_fox, @jessica_alba ...]

# Update the record attributes
@megan_fox.update_attributes(:sex_appeal_rate => 100) # => true

# Destroy it
@miss_usa_2011.destroy # => true

We tried to provide all the essential APIs that ActiveRecord provides, and more are coming. If you want to know about the available APIs, you can find them in the README or RDoc.

Feedback?

We would love to hear your feedback on this concept to persist data to a YAML file in certain simple cases and about how we can improve this library in the future. The source code is available on Github and we welcome all patches and pull requests. Feel free to share your thoughts about it.

Conclusion

We really believe this concept of persisting data in YAML file makes a lot of sense for us in particular scenarios. Building this was actually a challenge for me as it was my first gem. I want to thank Nathan for his support building the gem and I really learned a lot in the process. After all, I guess we did it for fun and as an experiment to see how this would work.

We also found these 2 gems in same vein as YAML Record, which are worthwhile to check out:

Hybrid (Native + Web) Mobile App Development • Part 2: Maintaining EJS templates, and Bridging Interactions

Welcome to part 2 of this multi-part series of blog posts where we venture into the world of hybrid frameworks. This is where we get into the juicy stuff. If you are new to this, I suggest reading part 1 to understand the motivation behind this approach.

Big Picture Stuff

In a nutshell, what we’re trying to do here is mash a JSON response from a RESTful API call with an EJS (similar to Rails’ ERB) template to form the HTML to be rendered by a UIWebView. Simple, until you start asking questions such as: How do you get an element to transition to a different state? How do you handle AJAX-style interactions? How is template/asset syncing managed? These are all critical questions to ask of a hybrid framework. So before we dive in, let’s make a short list of what a hybrid framework should be capable of, and then look at each feature separately.

A hybrid framework should:

* Maintain template syncing from a third party resource

* Handle bridging interactions between the web and native views

* Manage a RESTful API SDK

Maintain Template Syncing

At Miso, our mobile web templates are served off an Amazon S3 server. The basic strategy we employed is to download the web templates and cache them on disk; when you open the app in the future, it updates its local templates with the latest and greatest by poking S3.

To accomplish this we used ASIHTTPRequest, a popular web request wrapper, to manage the web requests we send to the S3 server. This gives us asynchronous requests, cached responses, and an ASINetworkQueue that lets you handle multiple web requests.

We threw in a few nuggets worth mentioning here that made this strategy more efficient and scalable. We maintain a manifest file on the server that has the list of assets the mobile device needs to download. The file would look something like this:

homepage.ejs?12398789
javascript/application.js?9437392
images/logo.png?1283740

We keep a version number by appending ?<some numbers> after each file. This is generated automatically every time we make updates to the web templates. Having this file allows us to scale well when we add more files, because all we have to download on startup is the manifest file. We then leverage ASIHTTPRequest’s cachedResponse flag, iterating through each file and checking with S3 to ensure that the file hasn’t changed since we last fetched it.
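To illustrate the manifest format itself, here is a small Ruby sketch that splits each line into a path and a version stamp, then works out which assets differ from a locally recorded set. Note that this is a different technique from the one we use in the app, which relies on ASIHTTPRequest’s HTTP caching. The `parse_manifest` and `stale_assets` helpers and the cached values are invented for this example.

```ruby
# Hypothetical sketch: parse "path?version" manifest lines.
def parse_manifest(text)
  text.each_line.each_with_object({}) do |line, assets|
    entry = line.strip
    next if entry.empty? # skip blank or whitespace-only lines
    path, version = entry.split("?", 2)
    assets[path] = version
  end
end

# Return the paths whose version differs from what is cached locally.
def stale_assets(manifest, cached_versions)
  manifest.reject { |path, version| cached_versions[path] == version }.keys
end

manifest = parse_manifest(<<~MANIFEST)
  homepage.ejs?12398789
  javascript/application.js?9437392

  images/logo.png?1283740
MANIFEST

# Assumed local state: one file is current, one is outdated, one is missing.
cached = {
  "homepage.ejs" => "12398789",
  "javascript/application.js" => "1111111"
}

puts stale_assets(manifest, cached).inspect
```

Comparing stamps locally would avoid one conditional request per asset, while the cache-based approach lets the server remain the single source of truth about what has changed.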

To give a more concrete view of how exactly we maintain template syncing from a web request standpoint, here’s a code snippet implementing the delegate method called when an ASIHTTPRequest completes, showing the logic explained earlier:

- (void)requestFinished:(ASIHTTPRequest *)request {
    NSString *responseString = [[[NSString alloc] initWithData:[request responseData] encoding:NSUTF8StringEncoding] autorelease];

    // If the returned file is the manifest file, iterate through it and send web requests for each asset
    if ([[[request url] description] hasSuffix:@"manifest.mf"]) {
        // Separate assets and shove them into an array
        NSArray *assets = [responseString componentsSeparatedByCharactersInSet:[NSCharacterSet newlineCharacterSet]];

        for (NSString *asset in assets) {
            // Get rid of trailing white spaces in assets list
            if ([[asset stringByTrimmingCharactersInSet:[NSCharacterSet whitespaceCharacterSet]] isEqualToString:@""])
                continue;

            // Add each asset in ASINetworkQueue
            [self addUrl:asset queue:_queue];
        }
    } else {
        // If it's an asset file, templates on all views need to reload unless the response is from cache
        _needsReload = _needsReload || ![request didUseCachedResponse];
    }

    NSString *file = [[request url] relativePath];
    NSString *fqFilePath = [_localBaseUrl stringByAppendingPathComponent:file];

    // If the file doesn't exist on disk, or the response is not from cache, then it's an updated file and we should save it
    if (![request didUseCachedResponse] || ![[NSFileManager defaultManager] fileExistsAtPath:fqFilePath]) {
        [self saveLayoutFile:[[request url] relativePath] data:[request responseData]];
        _needsReload = YES;
    }
}

I want to mention one last thing before I wrap up this section, and that is how we handle updating templates across different versions of our app. Let’s consider the situation where you intend to release a new version of your app complete with the latest web templates. You realize that the new templates have backwards compatibility issues with an older version of the app. If you were to update the current web templates, it would break the app for everyone on an older version. Oops! We work around that by maintaining separate buckets of web templates for different versions of our app. It’s not always necessary, but it is something good to keep in mind.

Bridging Interactions between Web and Native views

Let’s elaborate on what this means via examples:

* User clicks on a link in a webview, and it pushes another UIViewController onto the navigation stack

* User posts a comment via a native controller, and we want that to fire off an AJAX request and render on to the web view

How Miso achieves this is by defining certain protocols to conform to when generating links within our EJS templates. That way, when a load request is captured by the UIWebView we can then route these requests to different parts of the app.

- (BOOL)webView:(UIWebView *)webView shouldStartLoadWithRequest:(NSURLRequest *)request
 navigationType:(UIWebViewNavigationType)navigationType {
    NSString *url = [[request URL] description];

    // Check for protocol type and determine routing
    if (navigationType == UIWebViewNavigationTypeOther) {
        return YES;
    } else if ([url hasPrefix:@"miso://ajax"]) {
        [_ajaxController fireAjaxRequest:url];
        return NO;
    } else {
        // RoutesController delegates native app actions given a miso://<controller>/<action>?<params>
        [[RoutesController instance] processRoute:url viewController:_vc webview:self];
        return NO;
    }
}

The code snippet above is a stripped down version of what we use currently, but the basic concept is the same. We delegate routing by implementing the shouldStartLoadWithRequest delegate method for UIWebViews. So in the case where we capture a miso:// protocol we pass it along to other classes to process.
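The miso://<controller>/<action>?<params> convention is easy to see in isolation. The Ruby sketch below breaks such a URL into its parts using the standard `uri` library. It is an illustration only: the `Route` struct, the `parse_route` helper, and the example controller and action names are hypothetical, and our actual RoutesController is written in Objective-C.

```ruby
require "uri"

# Hypothetical sketch: split a miso:// link into controller, action, params.
Route = Struct.new(:controller, :action, :params)

def parse_route(url)
  uri = URI.parse(url)
  # Anything that is not a miso:// link should load in the web view as usual.
  return nil unless uri.scheme == "miso"

  # The host slot carries the controller, the path carries the action.
  params = URI.decode_www_form(uri.query || "").to_h
  Route.new(uri.host, uri.path.sub(%r{\A/}, ""), params)
end

# Example link with made-up controller, action, and parameter names.
route = parse_route("miso://shows/checkin?id=42&title=Lost")
puts "#{route.controller}##{route.action} #{route.params.inspect}"
```

Reusing standard URL syntax means templates can emit ordinary anchor tags, and the native side needs no custom parser beyond a scheme check.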

In the next part, we will start venturing further down the rabbit hole and look into the concepts behind RoutesController and AjaxController. Stay tuned!

Miso Hackathon I

Miso (and I) had our first Hackathon last week, and both of us survived! Our goal: conceive of an idea and build it within 3 days. We built Listify – an iPhone app that lets you create, browse, and upvote lists of TV shows and movies, along with accompanying web landing pages.

A Little Bit of Context

At Miso, we’ve been working on making fun things for people watching TV, starting with check-ins, but experimenting with synchronized trivia, exclusive content, live Q & A. However, over the last three months, we’ve been focusing on ideas that connect friends around what they watch.

Why a Hackathon?

Entrepreneurs seem to be drinking heavily from the “Lean Startup” and “customer development” kool-aid.  One basic tenet is that product ideas/assumptions can be proven in days, not weeks or months.  Common tactics include applying MacGyver-like strategies to challenge product assumptions.  For example, create a splash page (this was our first attempt – http://playredzone.herokuapp.com) for a product before writing any code.

Generally speaking, it can be challenging to apply their methodologies consistently, but Hackathons certainly seem to intrinsically embody the spirit of Lean Startup.  Also, as most engineers will understand, creating something new from end to end is a very rewarding experience.

On to the Main Event

As we were deciding what to build for the Hackathon, we set the following constraints:

  • Should be consistent with the “connect friends around what they watch” theme.
  • Should be built on the Miso API (i.e., not as a feature in our main product).
  • We have to want to use it.
  • Must be finished in 3 days.

After some happy hour brainstorming, we came up with the basic motivation and mechanics behind Listify.

The need: People need help discovering what movies and TV shows to watch.

The underlying psychologies:

  • People want to express themselves. Lists are a structured and easy way to do that.
  • People take pride in their sense of taste. The band they “discovered”, the new (or old) indie film, are all worthy Facebook posts.
  • People enjoy communities of like-minded people. They’re hard to create, but when it works, it’s unbeatable.

Implementation

Over the course of a few hours, we defined a minimum (and a reach) feature set, then designed and sketched an iPhone application, with accompanying web landing pages for the lists. We also spec’d the RESTful API that would support the iPhone app.

Over the next two (long) days, Josh and I built the iPhone app, while Nathan and Nico implemented the API (which utilized the Miso API, RABL, and Padrino) and the web splash pages. Nico served double duty as our art director.  Aside from some idle discussion that a little thought and investment in library development could make building iPhone apps that are fairly straightforward views of a RESTful API much faster, the implementation went without much incident.

While we’ve been happily using the app in beta for a week or so, we hope to release the app to the iPhone App Store after about a day of cleanup work, as soon as we find the time.

Conclusion

Overall, putting aside all the business rationalizations for doing a company hackathon, building something that we were only imagining a few days before was a ton of fun and a testament to why we’re engineers in the first place.  I’ll be sure to update this blog post when we actually submit to the App Store.

Startups or corporations? Why new graduates should pick startups

After reading these blog posts on why new graduates should or shouldn’t join a startup, I feel I should share my insights based on my past experiences as well as my experience at Miso. First, let me put that statement in context: I have only just started my career and I haven’t graduated yet. However, I’ve already worked for more than 1 year (full time and part time) as a Tech Op at Société Générale, one of the biggest banks in Europe, and all of my past internships have been in startups…

After reading the post on why we shouldn’t join a startup, there are 2 points on which I agree with its author:

  • Startups are not for everyone, especially tech startups. Not everyone is a “geek” or gets excited by new features on Twitter; that’s a fact.
  • Both posts quoted Steve Blank: “It’s your curiosity and enthusiasm that will get you noticed and make your life interesting.” That’s fairly obvious: you will enjoy your work if you’re passionate about it.

However, I disagree with most of his other ideas…

1. Passion is the ultimate competitive advantage.

Yes, you can work with passion in both startups and corporations. The main difference is that in a startup you are sure to work with people who enjoy what they’re doing, which in my past experience is not always the case in corporations.

His main argument is that large corporations “can afford to have highly specialized jobs filled by people who only focus on one particular component”. OK, I can understand that, but to me, staying closed inside your own skill area is like communitarianism: you can’t learn and you can’t expand beyond your field. At Miso, even though my official position is Rails Developer, I have sometimes had to sketch out designs for new features. Basically, my point is that yes, in startups you have to assume many roles, but that is necessary: it teaches you to think globally and not to focus only on your code.

2. Leverage

Having a big company’s name on your resume always gives you some credibility, that’s true. But I would rather have on my resume an unknown startup that failed and be able to explain why it failed. On the other hand, if you have worked for a successful startup, having been part of the small team looks significantly more impressive.

3. “Startup years are like dog years”

I’ve been working at Miso for 2 months already and it feels like a year. It’s an intense internship: I’m learning so much, not only in Ruby but also on the product side and in community management… I’m not claiming I’ll be a guru in these domains by the end, but it’s definitely broadening my culture and, above all, helping me understand why I’m coding these new features.

4. More Money

If your first motivation is making money, don’t work for a startup… I turned down an offer with a better salary to accept the Miso one because I felt it was, and still is, the right place to learn. I don’t know if Miso will be the next big hit in the valley, but I don’t care if we fail; again, the most important thing is to understand why you failed.

“If you make meaning, you’ll probably make money”- Guy Kawasaki. I really believe this quote applies to employees, not only to companies: I really need to know why I’m building a feature and for what purpose. At Miso, the purpose is simple; we want to “Make TV more fun, change the way how people are watching TV”. During my time at Société Générale, I didn’t even know what the company actually did. I worked in a subsidiary called Société Générale Securities Services and to this day, I still don’t know what it does. The worst part was that none of my colleagues was able to explain it to me either. The only thing that mattered was to keep the servers running; beyond that, I was never encouraged to understand the larger picture.

5. Relationships

It seems that in a corporation you can build a network of “hundreds or thousands of coworkers”. If that’s the case, it’s a bit curious to build this network only within the company. I believe in openness, and at Miso the engineering team tries to stay close to the San Francisco Ruby community. We attend meetups, host Ruby workshops and generally try to be involved. There are 2 goals behind this: promoting what we’re doing and also promoting Ruby. Most of all, it’s about sharing and improving our culture. You could learn, for example, what the best practices are in hot new startups just by making a connection. This gives you some insight into what you’re doing. It’s a win-win situation to be involved outside of the narrow focus of your company.

Relationships with your colleagues in a startup are not the same as in a corporation. The day-to-day interactions build more than a simple colleague relationship. I’m not saying your colleagues will become your best friends, but I definitely know I will stay in touch with my Miso colleagues after my adventure here. And I still hang out with my lovely pariSoma friends as well.

To conclude, for these reasons I definitely know that I don’t want to work for corporations. On the other hand, some people need a “safe” job because their situation requires it, and I can understand that too. But that doesn’t mean startups can’t work with “corporations”; we need them and they need us. We work with many larger organizations here, such as Fox, Logo, Showtime, etc. The most important thing, I believe, is that working for a fast-paced startup is worth trying because you will rapidly grow your experience and learn about things that you never expected.

Building a Platform API on Rails

Introduction

In my last post, I discussed the failings of to_json in Rails and how we approached writing our public APIs using RABL and JSON Templates in the View. Since that post, we have continued expanding our Developer API and adding new applications to our Application Gallery.

The approach of using views to intelligently craft our JSON APIs has continued to prove useful again and again as we strive to create clean and simple APIs for developers to consume. We have used the as_json approach on other projects for years in the past and I can say that I have not missed that strategy even once since we moved to RABL. To be clear, if your APIs and data models are dead simple and only for internal or private use and a one-to-one mapping suits your needs, then RABL may be overkill. RABL really excels once you find yourself needing to craft flexible and more complex APIs rich with data and associated information.

The last post was all about explaining the reasons for abandoning to_json and the driving forces behind our creation of RABL. This post is all about how to actually use Rails 3 and RABL in a pseudo real-world way to create a public API for your application. In other words, how to empower an “application” to become a developer’s platform.

Authentication

The first thing to determine when creating a platform is the authentication mechanism that developers will use when interacting with your API. If your API is read-only and/or easily cached on a global basis, then perhaps no authentication is necessary. In most cases though, your API will require authentication of some kind. These days OAuth has become a standard and using anything else is probably a bad idea.

The good news is that we are well past the point where you must implement an OAuth Provider strategy from scratch in your application. Nevertheless, I would strongly recommend you become familiar with the OAuth Authentication Protocol before trying to create a platform. Starting with this in mind, your next task is to pick an OAuth gem for Ruby. Searching Github, you can see there are many libraries to choose from.

The best one to start with is Rails OAuth Plugin, which will provide a drop-in solution both to consuming and providing APIs in your Rails 3 applications. The documentation is fairly decent and, coupled with the introduction blog post, I think reiterating the steps to set this up would be unnecessary. Follow the steps and generate an OAuth Provider system into your Rails 3 application. This is what will enable the “OAuth Dance” where developers can generate a request token and sign with that token for a user to retrieve an access token. Once the developer gets an access token, they can use that to interact with your APIs. The easiest way for developers to consume your API is to use a “consumer” OAuth library such as Ruby OAuth or equivalent in the client language.
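
To make the “sign with that token” step a little more concrete, here is a minimal sketch of how an OAuth 1.0 HMAC-SHA1 signature is typically computed, using only the Ruby standard library. It is not the oauth-plugin’s own code; the helper names, the credentials and the endpoint are all hypothetical, and a real consumer library handles this (plus nonces and timestamps) for you.

```ruby
require "openssl"

# Percent-encode everything except the unreserved characters (RFC 5849).
def oauth_escape(value)
  value.to_s.gsub(/[^A-Za-z0-9\-._~]/) do |char|
    char.bytes.map { |byte| format("%%%02X", byte) }.join
  end
end

# Signature base string: METHOD & encoded URL & encoded, sorted parameters.
def signature_base_string(http_method, url, params)
  normalized = params.map { |key, value| [oauth_escape(key), oauth_escape(value)] }
                     .sort
                     .map { |pair| pair.join("=") }
                     .join("&")
  [http_method.upcase, oauth_escape(url), oauth_escape(normalized)].join("&")
end

# HMAC-SHA1 over the base string; the key joins consumer and token secrets.
def hmac_sha1_signature(base_string, consumer_secret, token_secret = "")
  key = "#{oauth_escape(consumer_secret)}&#{oauth_escape(token_secret)}"
  digest = OpenSSL::HMAC.digest(OpenSSL::Digest.new("sha1"), key, base_string)
  [digest].pack("m0") # strict Base64, no line breaks
end

# Hypothetical credentials and endpoint, for illustration only.
params = {
  "oauth_consumer_key"     => "demo-consumer-key",
  "oauth_token"            => "demo-access-token",
  "oauth_signature_method" => "HMAC-SHA1",
  "oauth_timestamp"        => "1300000000",
  "oauth_nonce"            => "abc123",
  "oauth_version"          => "1.0"
}
base = signature_base_string("get", "http://localhost:3001/users/1.json", params)
signature = hmac_sha1_signature(base, "demo-consumer-secret", "demo-token-secret")

puts base
puts signature
```

The provider recomputes the same signature from the request it receives and the secrets it has stored, which is why neither secret ever needs to travel with the request.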

Building your API

Now that we have authentication taken care of in our application, the next step is to create APIs for developers to interact with on our platform. For this, we can re-use our existing controllers or create new ones. You already likely have html representations of your application for users to interact with. These views are declared in a “index.html.erb” file or similar and describe the display for a user in the browser. When building APIs, the simplest way to think about them is to think of the JSON API as “just another view representation” that lives alongside your HTML templates.

To demonstrate how easy building an API is, let’s consider a sample application. This application is a simple blogging engine that has users and posts. Users can create posts and then their friends can read their streams. For now, we just want this to be a blogging platform and we will leave the clients up to developers to build using our APIs. First, let’s use RABL by declaring the gem in our Gemfile:

# Gemfile
gem 'rabl'

Next, we need to install the new gem into our gemset:

$ bundle install

Next, let’s generate the User and Post tables that are going to be used in this example application:

$ rails g model User first_name:string last_name:string age:integer
$ rails g model Post title:string body:text user_id:integer
$ rake db:migrate db:test:prepare

Nothing too crazy, just defining some models for user and posts with the minimum attributes needed. Let’s also setup model associations:

# app/models/user.rb
class User < ActiveRecord::Base
  has_many :posts
end

# app/models/post.rb
class Post < ActiveRecord::Base
  belongs_to :user
end

Great! So we now have a user model and a post model. The user can create many posts and they can be retrieved easily using the excellent rails has_many association. Now, let’s expose the user information for a particular user in our API. First, let’s generate the controller for Users:

$ rails g controller Users show

and of course setup the standard routes for the controller in routes.rb:

# config/routes.rb
SampleApiDemo::Application.routes.draw do
  resources :users
end

Next, let’s protect our controller so that the data can only be accessed when a user has authenticated, and setup our simple “show” method for accessing a user’s data:

# app/controllers/users_controller.rb
class UsersController < ApplicationController
  before_filter :oauth_required

  respond_to :json, :xml
  def show
    @user = User.find_by_id(params[:id])
  end
end

As you can see the controller is very lean here. We setup a before_filter from the oauth-plugin to require oauth authentication and then we define a “show” action which retrieves the user record given the associated user id. We also declare that the action can respond to both json and xml formats. At this point, you might wonder a few things.

First, “Why include XML up there if we are building a JSON API?” The short answer is because RABL gives this to us for free. The same RABL declarations actually power both our XML and JSON apis by default with no extra effort! The next question might be, “How does the action know what to render?”. This is where the power of the view template really shines. There is no need to declare any response handling logic in the action because Rails will automatically detect and render the associated view once the template is defined. But what does the template that powers our XML and JSON APIs look like? Let’s see below:

# app/views/users/show.rabl
object @user

# Declare the properties to include
attributes :first_name, :last_name

# Alias 'age' to 'years_old'
attributes :age => :years_old

# Include a custom node with full_name for user
node :full_name do |user|
  [user.first_name, user.last_name].join(" ")
end

# Include a custom node related to if the user can drink
node :can_drink do |user|
  user.age >= 21
end

The RABL template is all annotated above hopefully explaining fairly clearly each type of declaration. RABL is an all-ruby DSL for defining your APIs. Once the template is setup, testing out our new API endpoint is fairly simple. Probably the easiest way is to temporarily comment out the “oauth_required” line or disable in development. Then you can simply visit the endpoint in a browser or using curl:

// rails c
// User.create(...)
// rails s -p 3001
// http://localhost:3001/users/1.json
{
  user: {
    first_name: "Bob",
    last_name: "Hope",
    years_old: 92,
    full_name: "Bob Hope",
    can_drink: true
  }
}
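
For readers who have not used a templating DSL for JSON before, it may help to see roughly what the show template amounts to in plain Ruby. The following is only a sketch of the mapping, not how RABL works internally; the Struct stands in for the ActiveRecord model and user_representation is a hypothetical helper name.

```ruby
require "json"

# Stand-in for the ActiveRecord model; only the fields the template reads.
User = Struct.new(:first_name, :last_name, :age)

# Mirror each declaration in show.rabl as a key in a plain hash.
def user_representation(user)
  {
    :user => {
      :first_name => user.first_name,                             # attributes :first_name
      :last_name  => user.last_name,                              # attributes :last_name
      :years_old  => user.age,                                    # attributes :age => :years_old
      :full_name  => [user.first_name, user.last_name].join(" "), # node :full_name
      :can_drink  => user.age >= 21                               # node :can_drink
    }
  }
end

bob = User.new("Bob", "Hope", 92)
puts JSON.pretty_generate(user_representation(bob))
```

The appeal of the template is that this mapping lives in the view layer, next to the HTML templates, instead of inside the model or the controller.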

With that we have a fully functional “show” action for a user. RABL also lets us easily reuse templates. Let’s say we want to create an “index” action that lists all our users:

# app/views/users/index.rabl
object @users

# Reuse the show template definition
extends "users/show"

# Let's add an "id" resource for the index action
attributes :id

That would allow us to have an index action by re-using the show template and applying it to the entire collection:

// rails s -p 3001
// http://localhost:3001/users.json
[
  {
    user: {
      id : 1,
      first_name: "Bob",
      last_name: "Hope",
      years_old: 92,
      full_name: "Bob Hope",
      can_drink: true
    }
  },
  {
    user: {
      id: 2,
      first_name: "Alex",
      last_name: "Trebec",
      years_old: 102,
      full_name: "Alex Trebec",
      can_drink: true
    }
  }
]

Now that we have an “index” and “show” for users, let’s move onto the Post endpoints. First, let’s generate a PostsController:

$ rails g controller Posts index

and append the posts resource routes:

# config/routes.rb
SampleApiDemo::Application.routes.draw do
  resources :users, :only => [:show, :index]
  resources :posts, :only => [:index]
end

Now, we can fill in the index action by retrieving all posts into a posts collection:

# app/controllers/posts_controller.rb
class PostsController < ApplicationController
  before_filter :login_or_oauth_required
  respond_to :html, :json, :xml

  def index
    @posts = Post.all
  end
end

and again we define the RABL template that serves as the definition for the XML and JSON output:

# app/views/posts/index.rabl

# Declare the data source
collection @posts

# Declare attributes to display
attributes :title, :body

# Add custom node to declare if the post is recent
node :is_recent do |post|
  post.created_at > 1.week.ago
end

# Include user as child node, reusing the User 'show' template
child :user do
  extends "users/show"
end

Here we introduce a few new concepts. In RABL, there is support for “custom” nodes as well as for “child” association nodes which can also be defined in RABL or reuse existing templates as shown above. Let’s check out the result:

[
  {
    post: {
      title: "Being really old",
      body: "Let me tell you a story",
      is_recent: false,
      user: {
        id : 52,
        first_name: "Bob",
        last_name: "Hope",
        years_old: 92,
        full_name: "Bob Hope",
        can_drink: true
      }
    }
  },
  { ...more posts... }
]
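
The nesting can also be pictured in plain Ruby. This sketch assumes nothing about RABL’s internals: the Structs stand in for the models, user_node and post_node are hypothetical helpers, ONE_WEEK replaces ActiveSupport’s 1.week, and the second post is invented so that both values of is_recent show up.

```ruby
require "json"

# Stand-ins for the ActiveRecord models.
User = Struct.new(:first_name, :last_name, :age)
Post = Struct.new(:title, :body, :created_at, :user)

ONE_WEEK = 7 * 24 * 60 * 60 # seconds; plain-Ruby stand-in for 1.week

# Plays the role of the reused "users/show" template.
def user_node(user)
  {
    :first_name => user.first_name,
    :last_name  => user.last_name,
    :years_old  => user.age,
    :full_name  => [user.first_name, user.last_name].join(" "),
    :can_drink  => user.age >= 21
  }
end

# Plays the role of posts/index.rabl for a single post.
def post_node(post, now = Time.now)
  {
    :post => {
      :title     => post.title,
      :body      => post.body,
      :is_recent => post.created_at > now - ONE_WEEK, # custom node
      :user      => user_node(post.user)              # child node
    }
  }
end

now = Time.now
bob = User.new("Bob", "Hope", 92)
posts = [
  Post.new("Being really old", "Let me tell you a story", now - 30 * 24 * 60 * 60, bob),
  Post.new("Still here", "A newer story", now - 60, bob) # hypothetical recent post
]
result = posts.map { |post| post_node(post, now) }
puts JSON.pretty_generate(result)
```

Because the user node is produced by a single helper, any change to how a user is represented shows up everywhere a user is embedded, which is the same benefit the extends declaration gives the templates.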

Here you can see we have a full posts index with nested user association information and a custom node defined as well. This is where the power of JSON Templates lies: in the incredible flexibility and simplicity that this approach affords you. Renaming an object or attribute is as simple as:

# app/views/posts/index.rabl

# Every post is now contained in an "article" root
collection @posts => :articles

# The attribute "first_name" is now "firstName" instead
attribute :first_name => :firstName

Another benefit is the ability to use template inheritance, which affords seamless reusability. In addition, the partial system allows you to embed a RABL template within another to reduce duplication.

Wrapping Up

For a full rundown of the functionality and DSL of RABL, I encourage you to check out the README and the Wiki. In addition, I would like to point out excellent resources by other users of RABL and encourage you to read teohm’s post as well as Nick Rowe’s overview of RABL. Please post any other resources in the comments!

My intention with this post was to provide a solid introductory guide for creating an API and a data platform easily in Rails. If there are topics that you would be interested in reading about more in-depth related to API design or creation or building out a platform, please let us know and we will consider it for a future post.

Hybrid (Native + Web) Mobile App Development • Part 1: The Motivation.

In the Beginning

Miso’s most popular platform, the iPhone app, was initially built with iOS’s native framework. Aside from occasional REST API calls to the web server for data, the entire user experience was delivered through native UI elements provided to us by the iOS SDK. Everything was great! Native apps are fast and performant, and ours did its job. As the iPhone app gained popularity and traction with the community, the natural next move was to bring the Miso experience to other mobile platforms. Thus, the Android and iPad versions of Miso were born, driving even more users to our service.

And then?

While native apps certainly have their advantages, launching features across multiple platforms was terribly time consuming. Eventually, we found that unless the size of our engineering team increased significantly, the native app approach just wasn’t going to scale well in the long run.

At the time of writing this article, there are 3 popular approaches to mobile app development: native, web, and hybrid. In this article, we will briefly discuss the pros and cons of each method, what Miso eventually chose, and the motivation behind it. (We chose hybrid!)

Native

Despite the complaints over the maintenance costs and scalability issues mentioned earlier, there are definitely use cases where you should strongly consider using the native SDK. Angry Birds would not have worked if it didn’t leverage hardware accelerated graphics. So any app that needs access to device hardware, is performance-centric, or has a very rich UI with bouncing and flashing buttons and labels should stick with the native approach.

For Miso, those strengths weren’t high in our list of priorities. We wanted multi-platform feature releases to be iterative, fast and scalable.

Web

The web approach touts strengths that seem to be what Miso needed. No need to submit to an app store, just deploy it on the web at any time and any mobile device with a web browser would be able to get the latest and greatest! Not to mention, you get all the Javascript/CSS goodies all for free!

We liked that. It was the opposite of the native approach. We can get away from painstakingly getting our pretty designs to work across multiple platforms, and release features and bug fixes with a push of a button almost instantly.

However, with web apps, going anywhere in the app requires firing off an HTTP request and getting a response before you can render anything. It’s slow, clunky, and simply unsuitable for the user experience we wanted to deliver. We couldn’t leverage local caching of pages or data. We would have zero access to the iOS framework for nifty push notifications, gestures, or even the occasional use of native UI elements.

Hybrid

Miso engineers are a greedy bunch. We wanted the best of both worlds: the fast, snappy feel of a native app coupled with the ease of styling and quick releases of web apps. So we embarked on a long journey of slowly porting native views of Miso’s iPhone app to web views.

It took us months to build a framework we are satisfied with, and we learned a ton on the way to achieving that goal. The Miso app you see now (3.0.x) has web views supporting a majority of our layouts by combining JSON responses from REST API requests with EJS templates. In this latest release (3.0.3), we’ve also introduced local disk caching of javascript/css and web templates to optimize performance, and added support for AJAX requests to make web views even prettier and more user friendly.

At this time, we are happy with our solution as it gives us the ability to tweak the look and feel of our app without submitting a release to the app store. It also opens up the path to eventually integrating these web views into Miso apps across other mobile platforms, ultimately allowing us to release the latest Miso features to all of our users in a timely fashion.

Next Time

While many other articles talk about the hybrid approach, few have attempted to educate in building a solid framework to support this idea.

In the subsequent articles, I will cover the concepts and designs of our web view framework and things we’ve learned along the way that may help should you choose this hybrid approach. Feel free to leave any comments and suggestions on topics you’d like me to cover!

Part 2

Low hanging fruit for Ruby performance optimization in Rails

Our goal: we currently spend about 150-180 ms in Ruby for an average web request, and we think it’s reasonable to improve that to be around 100 ms.

Two of the most expensive operations in Ruby are: object allocation and garbage collection. Unfortunately, in Ruby (unlike Java), garbage collection is synchronous and can have a major impact on the performance of requests. If you’ve ever noticed that a partial template rendering occasionally randomly takes a couple of seconds, you’re probably observing a request triggering garbage collection.

The good news: it’s easy to quickly (< 10 min) see how much your app is impacted from garbage collection. You’re likely to improve your performance by 20-30% just by tuning your garbage collection parameters.

If you have a production Rails app, and you’re even remotely interested in performance, I’m going to assume you’re using the excellent New Relic service.  If you’re using REE or a version of Ruby with the Railsbench GC patches, it’s easy to turn on garbage collection stats that will be visualized by New Relic.

You’ll get pretty garbage collection charts in New Relic by adding:

GC.enable_stats

somewhere in your initialization.  After enabling garbage collection stats in our application, we can see that approximately 20% of our Ruby time is spent in garbage collection, which implies that there’s also a not-so-insignificant portion of time spent in object allocation.
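
If you are not on REE or a Railsbench-patched Ruby, a rough version of the same measurement can be taken with GC::Profiler, which ships with stock MRI 1.9 and later. The sketch below is an assumption about how one might do that, not what we run in production; gc_share is a hypothetical helper and the workload is just a stand-in that allocates a lot of short-lived strings.

```ruby
# Report how much of a block's wall-clock time was spent in garbage collection.
def gc_share
  GC::Profiler.clear
  GC::Profiler.enable
  runs_before = GC.count
  started = Time.now
  yield
  elapsed = Time.now - started
  gc_time = GC::Profiler.total_time # seconds spent in GC while profiling
  {
    :elapsed  => elapsed,
    :gc_time  => gc_time,
    :gc_runs  => GC.count - runs_before,
    :gc_share => elapsed > 0 ? gc_time / elapsed : 0.0
  }
ensure
  GC::Profiler.disable
end

# Stand-in workload: allocate many small objects, as a template render might.
stats = gc_share do
  200_000.times.map { |i| "row #{i}" }
end

puts format("GC runs: %d, GC time: %.4fs, share of elapsed: %.1f%%",
            stats[:gc_runs], stats[:gc_time], stats[:gc_share] * 100)
```

The numbers will vary from run to run and from one Ruby version to another, so treat them as a quick sanity check rather than a replacement for production monitoring.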

What’s next

We last tuned our Ruby garbage collection parameters about 5 months ago and, after an initial performance boost, we’ve seen the application time spent in Ruby creep back up.  To try to bring the response time back down, our next steps are to:

  • Consider taking another pass at garbage collection parameter tuning.  Since we’ve already taken one pass at this, I’m not sure if we’ll be as impactful the second time around, but we’ll see.
  • Identify the slowest controller actions via New Relic and profile them using ruby-prof or perftools.

Performance tuning using ruby-prof is likely going to vary a lot depending on the code, but if we find techniques that might apply more broadly, we’ll be sure to blog about it here.

How redis can ruin your day, and what you can do to fix it

Over the past few years, Redis has become one of the internet’s more popular NoSQL, RAM based datastores, owing largely to its ease of deployment, the abundance of libraries/interfaces, available in a multiplicity of flavors  (we use ezmobius’s redis-rb gem), and perhaps most importantly, the flexibility of its data structures.  Compared to something like memcached, a cache key in redis can correspond to a single value (string or integer), a list (an array of values), a set (an ordered or unordered group of non-repeating values), or a hash (a set of N named fields, each storing a separate value).

For many of you, none of that is necessarily news, and even if it is, the internet abounds with redis how-to’s and introductions, so instead of rewriting what’s already been written, I’d like to share with you what we’ve learned about what I’d call “the dark-side” of Redis, the side that you only get to see after the two of you have had a few too many drinks at a hotel bar, and things start to get real weird, real fast.  Here at Miso we’ve been using Redis long enough to have had at least a few of these awkward moments with it, and although they’ve never been uncomfortable enough to make us consider replacing it altogether, they have been major points of frustration at times.  This post is my attempt to provide a first-hand account of redis’s sordid underbelly, in the hopes that you may be able to avoid some of the issues we’ve grappled with (and continue to) over the last year.

Where’s my memory?

One of the most confounding aspects of redis for the beginner may be the unpredictable and at times incomprehensible relationship between the memory footprint of redis-server and the actual amount of data being stored.  This was originally the impetus behind most of the high-level analysis we performed; we were perpetually running out of RAM on our caching server, but we knew (according to this script) that we were only storing a couple of gigabytes of values across all of our redis databases.  Sure, we expected redis to use a little extra memory to take care of metadata like key expiries, and other stuff, but we consistently saw redis-server using up to 5-10x as much memory as we would expect intuitively.

To understand the issue better, I began running a series of tests designed to examine how redis allocates memory given various datasets.  The idea was to populate redis with a bunch of records containing random data, using both strings and hashes (these were the only data structures that we were interested in using), and then measure the memory footprint in relation to the total amount of “stuff” (characters) that we saved. At the outset we were most interested in discovering what parameters/configuration yielded the most efficient storage performance (our metric was bytes/character – a value I’ll refer to as ‘overhead’).  Below are three graphs, comparing the total number of records, key size, and value size to overhead:

Certain patterns leap out almost immediately; for instance, just about any way you slice it, the smaller the total amount of data being stored, the larger the overhead.  Conversely, the ‘overhead’ just about always decreases asymptotically toward 1 byte/character as the amount of data being stored increases.  This makes perfect sense, as there is a “base footprint” that redis requires no matter what, and as the dataset grows, there is a more well-defined relationship between the actual amount of data contained in redis and the memory it consumes.

We can also infer (with some help from the redis documentation) that more “continuous” data is stored more efficiently.  For instance, if we need to store 2 million characters, it is more efficient to store it this way:

1000 records * ( 100 characters per key + 1900 characters per value)

than this way:

10,000 records * ( 100 characters per key + 100 characters per value)

This is all pretty consistent with the recommendations in the redis documentation.
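
The arithmetic behind that comparison can be sketched in a few lines of Ruby. Both layouts hold the same 2 million characters; what differs is the number of records. The per-record cost used here is a HYPOTHETICAL constant chosen only to show the direction of the effect, and the sketch assumes 1 byte per character; neither is a measurement of redis.

```ruby
# Total characters stored by a layout of identical records.
def total_characters(records, key_chars, value_chars)
  records * (key_chars + value_chars)
end

# The 'overhead' metric used above: bytes of memory per character stored.
def overhead(memory_bytes, characters)
  memory_bytes.to_f / characters
end

# HYPOTHETICAL fixed bookkeeping cost per record (not a measured redis value).
PER_RECORD_BYTES = 64

# Estimated overhead if each character costs 1 byte plus a fixed cost per record.
def estimated_overhead(records, key_chars, value_chars)
  chars = total_characters(records, key_chars, value_chars)
  overhead(chars + records * PER_RECORD_BYTES, chars)
end

few_large  = estimated_overhead(1_000, 100, 1_900)
many_small = estimated_overhead(10_000, 100, 100)

puts "characters stored: #{total_characters(1_000, 100, 1_900)} vs #{total_characters(10_000, 100, 100)}"
puts format("estimated overhead: %.3f vs %.3f bytes/character", few_large, many_small)
```

Whatever the real per-record cost is, it is paid ten times as often in the second layout, which is why fewer, larger records come out ahead.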

To Hash or not to Hash

It also became clear from our tests that with randomized keys and values, hashes have slightly higher overhead than strings, and once again this makes sense, as hashes contain more “metadata” (information about how the data is structured), and that comes at a cost.

This seems contrary to the information provided by the redis documentation (see “Use Hashes when Possible”), which suggests that for recent versions of redis (2.2 and higher), hashes are far more efficient than strings, but keep in mind that we have been generating completely random data  (essentially noise) up until this point, and noise is, by definition, incompressible.

The story changes quite a bit when you have a non-noisy  set of cache keys that can be considered compressible, for example:

    user:1:last_signin => "Last Thursday"
    user:1:favorite_color => "Blue"
    user:1:name => "Justin"

In these cases, it’s obvious to anyone familiar with hashes that the same data could be structured like this in something like JSON:

user:1 => {name: "Justin", favorite_color: "Blue", last_signin: "Last Thursday"}

This format obviates the need to repeat “user:1” for each value being stored, in theory reducing the amount of overall data redis needs to record.

To test this hypothesis, I generated data for 100,000 users, each with 5 fields holding randomized strings of 10 characters, using hashes first:

(user:1 = {:field0 => "dugf4dfgv3", :field1 => "oiw2335hnb", ...})

then flat strings, with the field embedded in the key (user:1:field:0 = "...").  The hashified example had a memory footprint of 21 MB, compared to 61 MB for the flattened data – a savings of about 2/3.  The same test with 33,000 records and 10 fields produced 10 MB of data when hashes were used as opposed to 41 MB for the flattened data, once again a very significant reduction.  The lesson to take away from all of this is to use hashes whenever it makes sense.  If you are creating multiple records for values that all correspond to a single object (a user in our example), a hash is probably the better alternative.  If you have a significant amount of data (more than 10,000 records), you will absolutely reduce the amount of memory used by redis.
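
As a rough illustration of the two layouts, here is a sketch that stores the user:1 example both ways. To keep it runnable without a server, it writes to a small in-memory stand-in that exposes only set and hset, the two redis-rb calls involved; FakeRedis, store_flat and store_hash are hypothetical names, and the character counts below say nothing about redis’s real memory use, only about how much key text each layout repeats.

```ruby
# In-memory stand-in exposing the two redis-rb calls used below (set, hset).
class FakeRedis
  attr_reader :strings, :hashes

  def initialize
    @strings = {}
    @hashes = Hash.new { |store, key| store[key] = {} }
  end

  def set(key, value)
    @strings[key] = value
  end

  def hset(key, field, value)
    @hashes[key][field] = value
  end
end

# Flat layout: one string key per field, the "user:<id>" prefix repeated each time.
def store_flat(redis, user_id, fields)
  fields.each { |name, value| redis.set("user:#{user_id}:#{name}", value) }
end

# Hash layout: one key per user, one hash field per attribute.
def store_hash(redis, user_id, fields)
  fields.each { |name, value| redis.hset("user:#{user_id}", name, value) }
end

fields = {
  "name"           => "Justin",
  "favorite_color" => "Blue",
  "last_signin"    => "Last Thursday"
}

redis = FakeRedis.new
store_flat(redis, 1, fields)
store_hash(redis, 1, fields)

flat_key_chars = redis.strings.keys.sum(&:length)
hash_key_chars = redis.hashes.sum { |key, hash| key.length + hash.keys.sum(&:length) }

puts "flat layout: #{redis.strings.size} keys, #{flat_key_chars} characters of key text"
puts "hash layout: #{redis.hashes.size} key, #{hash_key_chars} characters of key and field text"
```

With a real server you would pass a redis-rb client instead of FakeRedis and compare the used_memory figure before and after each load.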

Is that it?

No, it most certainly isn’t.  There are a few other little gotchas that we’ve encountered along the way, some of which we still have no explanation for.  For instance, if your instance of redis-server is using a significant amount of the total memory available on your machine (we’ll say greater than 70%), you need to be VERY vigilant, as we have experienced huge leaps in memory consumption for seemingly no reason.  For instance, this weekend within a span of 5 seconds, redis decided it wanted another 200 MB of memory without a single record being added to any of our databases.  The same thing happened twice more over the course of 24 hours, culminating in a whopping 20% size increase with no discernible cause – and that’s pretty significant for 4 GB of data.  We are still at a loss to explain what happened during this period.   If you plan on using redis in production, plan on having monit, god or something similar in place to keep an eye on it, just in case it decides it wants to be sneaky while you aren’t paying attention.

It’s also a smart idea to make frequent use of the redis-cli tool that ships with redis, to view the output of:

redis-cli info

This command will provide you with information about the ACTUAL amount of memory being consumed by your data (used_memory), along with the total amount of memory the operating system reports redis-server as using (used_memory_rss), and an attendant fragmentation ratio.  It will look something like this:

used_memory:41825152
used_memory_human:39.89M
used_memory_rss:68186112
mem_fragmentation_ratio:1.63

The mem_fragmentation_ratio gives you an idea of how wasteful redis is being.  In many cases, despite how ugly it may sound, a simple restart will free up most of the memory that redis no longer needs, but hasn’t had a chance to deallocate yet.
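The ratio is simply used_memory_rss divided by used_memory, which you can verify against the sample output above.  Here is a small sketch that parses the text printed by redis-cli info and recomputes it; the helper names are ours, and the sample is pasted in as a string so the snippet runs without a server.

```ruby
# Sample copied from the `redis-cli info` output shown above.
INFO_SAMPLE = <<-TEXT
used_memory:41825152
used_memory_human:39.89M
used_memory_rss:68186112
mem_fragmentation_ratio:1.63
TEXT

# Parse "key:value" lines into a Hash of strings,
# skipping blank lines and "#" section headers.
def parse_info(text)
  text.each_line.each_with_object({}) do |line, stats|
    line = line.strip
    next if line.empty? || line.start_with?("#")
    key, value = line.split(":", 2)
    stats[key] = value
  end
end

# Memory as seen by the OS divided by memory redis believes it allocated.
def fragmentation_ratio(stats)
  stats["used_memory_rss"].to_f / stats["used_memory"].to_f
end

stats = parse_info(INFO_SAMPLE)
puts format("fragmentation ratio: %.2f", fragmentation_ratio(stats))
# prints "fragmentation ratio: 1.63"
```

In practice you would feed parse_info the real output of the command instead of the pasted sample.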

Another optimization-related note to bear in mind is that there are specific settings in the redis.conf file which you can use to tell redis how big you believe your hashes, lists or sets will be.  Redis will theoretically use these values to further optimize the storage of your data, saving even more space.  We haven’t found them to provide that much utility in our preliminary tests, but that doesn’t mean they offer none, and the documentation suggests that this configuration can actually be quite effective in reducing redis’s memory footprint.
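For reference, the settings in question look roughly like the fragment below.  The directive names are the ones we recall from the redis.conf shipped with the 2.2 series (later releases renamed some of them, so check the redis.conf for your version), and the numbers are purely illustrative rather than recommendations.

```
# Hashes with at most this many entries, whose largest value is at most
# this many bytes, are stored in a compact encoding.
hash-max-zipmap-entries 512
hash-max-zipmap-value 64

# The same idea for lists.
list-max-ziplist-entries 512
list-max-ziplist-value 64

# Sets made up only of integers stay compact up to this many members.
set-max-intset-entries 512
```

A hash, list or set that grows past these limits is converted to the regular, more memory-hungry encoding.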

Beyond that, there isn’t much else you can do once your data starts to become unmanageable, aside from dramatically rethinking the way in which you cache.  Originally, redis offered a solution for datasets that were simply ALWAYS going to be too large for memory: redis virtual memory, which would write infrequently used values to disk and keep only the most important, frequently accessed records in memory.  This turned out to be a bit of a flop, in that for our dataset, it took up to 30 minutes for redis to start up with virtual memory enabled.  The creator of redis, antirez, is attempting to roll out a superior replacement to redis virtual memory, the redis diskstore in version 2.4, which should be in beta sometime later this year.

Until then, we are stuck with the somewhat scary proposition that redis will continue to outgrow our hardware (as it has in the past), and in that case, our options are to optimize even further, buy more hardware, or drop it altogether.  Our best advice is to be as smart as possible from the beginning about how you use redis, and never assume that because redis is so fast and lightweight for smallish datasets (100,000 to 1 million records), it will continue to be for 10 million or more records.  Antirez himself states that redis was written in such a way that it is left to the developer to decide how he/she wants data to be stored:

But the Redis Way is that the user must understand how things work so that he is able to pick the best compromise, and to understand how the system will behave exactly.

- antirez

At a certain point it will start to become prohibitively annoying to make sweeping changes to your app while simultaneously modifying all of your historical data to comply with those changes, so perhaps the most important thing to remember is that you should never expect redis to magically solve your caching problems for you.