Data Migrations for NoSQL with Curator

The NoSQL movement has brought us a wave of new data stores beyond the traditional relational databases. These data stores come with their own tradeoffs, but they provide some incredible benefits. At Braintree, we are moving in the direction of using Riak as our next-generation data store. We love its focus on scalability and availability. Servers can fail without causing any downtime, and we can add more capacity by simply adding more servers to the cluster.

One great feature of relational databases, however, is the consistent shape of the data. If you have a people table, you know every row has the same columns. Some fields might be null, but there won't be any surprises. Furthermore, if you want to rename or modify a column, it's a simple operation. In PostgreSQL and other databases, a rename is nearly instantaneous. We lose this ability with Riak and most NoSQL databases. We can easily add attributes (columns), but we cannot easily rename them or change the data within each document (row).

Since our apps are always evolving at Braintree, we needed a way for our data to keep up with our code. Our solution is something we're calling lazy data migrations, and we've built it into our repository and model framework, curator. You can read more about curator on our blog in the post "Untangle Domain and Persistence Logic with Curator".

The problem

Say we have a collection of people in Riak. This is analogous to a people table in a relational database. When we first built the app, we added fields for first_name and last_name:

person = Person.new(:first_name => "Joe", :last_name => "Smith")

Some time has passed, our app has data, and we now realize that names are a pain. What do we do with middle names? What about people with multiple first or last names? We want to simplify the system and collect only a name. We no longer care about a separate first and last name. The problem is that we have a ton of data in the old format. How do we handle old records that have first_name and last_name when, going forward, we want just name?

In a relational database, we would simply write a database migration that looks like:

ALTER TABLE people ADD COLUMN name VARCHAR;
UPDATE people SET name = first_name || ' ' || last_name;
ALTER TABLE people DROP COLUMN first_name, DROP COLUMN last_name;

This migration might take a while to run, but once it's done, we know that all data has been migrated. We can then change all of our code to only deal with name, knowing we no longer have first_name or last_name.

In a NoSQL database like Riak, we cannot simply change the schema. We have to come up with a different solution. Here are the steps we went through before arriving at the solution that made its way into curator:

Solution attempt 1: Scattered conditionals

The first solution is to make the Person class smart enough to handle both cases.

class Person
  attr_accessor :first_name, :last_name, :name
end

We can populate whatever fields we get back from the data store. Then, when we want to do something with the name, we have to use code like:

if person.name
  puts "Name is #{person.name}"
else
  puts "Name is #{person.first_name} #{person.last_name}"
end

The problem with this approach is that we have to use branching code like this whenever we want to use the name. It quickly gets messy.

Solution attempt 2: Gathered conditionals

The second solution is to move this logic to the place where we read the Person out of the data store:

attributes = fetch_from_riak
if attributes[:name]
  person = Person.new(:name => attributes[:name])
else
  person = Person.new(:name => "#{attributes[:first_name]} #{attributes[:last_name]}")
end

Now, we only have to do it once and we can change our Person class to only know about name.

This solution works well, but what happens a year down the road when we've made lots of data changes to many different models? We don't want a bunch of conditionals all over our persistence code.

Our solution: Lazy data migrations

We turned the idea from solution 2 into a migration (similar to ActiveRecord migrations). Each migration targets a given collection at a given version. A migration looks like this:

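class ConsolidateName < Curator::Migration
  def migrate(attributes)
    # Pull the old fields out of the attributes hash.
    first_name = attributes.delete(:first_name)
    last_name = attributes.delete(:last_name)
    # Return the remaining attributes plus the combined name.
    attributes.merge(:name => "#{first_name} #{last_name}")
  end
end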

This migration is stored in db/migrate/people/0001_consolidate_name.rb. We've also added the concept of a version to each Model. By default, models start at version 0. When a model is read from the Repository, its attributes are run through every migration whose version (taken from the filename) is greater than the model's stored version:

person = PersonRepository.find_by_key("person_id")
person.version #=> 1
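
To make the read path concrete, here is a minimal sketch in plain Ruby of how stored attributes might be run through pending migrations. It does not use curator: the `apply_pending` helper, the hash-based migration entries, and the `:version` key are assumptions made for illustration, not curator's API.

```ruby
# Hypothetical helper: runs every migration newer than the stored version,
# oldest first, and stamps the result with the last version applied.
def apply_pending(attributes, migrations)
  stored_version = attributes.fetch(:version, 0)
  pending = migrations.select { |m| m[:version] > stored_version }
  pending.sort_by { |m| m[:version] }.reduce(attributes) do |attrs, m|
    m[:migrate].call(attrs).merge(:version => m[:version])
  end
end

# Illustrative migration entry that builds a new hash instead of
# modifying the one it receives.
consolidate_name = {
  :version => 1,
  :migrate => lambda do |attrs|
    rest = attrs.reject { |key, _| [:first_name, :last_name].include?(key) }
    rest.merge(:name => "#{attrs[:first_name]} #{attrs[:last_name]}")
  end
}

old_record = {:first_name => "Joe", :last_name => "Smith"}  # implicit version 0
new_record = {:name => "Jane Doe", :version => 1}           # already current

migrated  = apply_pending(old_record, [consolidate_name])
untouched = apply_pending(new_record, [consolidate_name])
```

In this sketch a record with no stored version is treated as version 0, so it picks up migration 1, while a record already at version 1 passes through unchanged.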

Now, the migration logic is isolated from the rest of the application. The rest of the app can safely assume that all Person objects have only a name:

class Person
  current_version 1
  attr_accessor :name
end

We mark the Person class with current_version 1 to signify that new instances start at version 1, since they have a name attribute rather than first_name/last_name.
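
For readers curious how a class-level declaration like this could work, here is a rough sketch in plain Ruby. The `Versioned` module and the `Account` class are hypothetical and written only for this example; this is not curator's implementation.

```ruby
# Hypothetical mixin, not curator's implementation.
module Versioned
  def self.included(base)
    base.extend(ClassMethods)
  end

  module ClassMethods
    # Called with a number, declares the class's version; called without
    # one, returns the declared version (0 if none was declared).
    def current_version(number = nil)
      @current_version = number unless number.nil?
      @current_version || 0
    end
  end

  attr_writer :version

  # An instance reports its class's current version until one is assigned.
  def version
    @version || self.class.current_version
  end
end

class Person
  include Versioned
  current_version 1
  attr_accessor :name
end

# Hypothetical model that never declares a version.
class Account
  include Versioned
end

new_person = Person.new
loaded_person = Person.new
loaded_person.version = 0  # e.g. assigned from stored data
```

Under these assumptions, a freshly built Person starts at version 1, a model with no declaration starts at 0, and a version assigned from stored data takes precedence over the class default.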

These migrations run when models are read, so they are lazy. Data migrates as it's used, and the stored copy is updated when the model is saved. This means that, unlike the all-at-once relational migration above, the website can be up and serving requests while the data is migrated.

If you want to force the data to migrate (and not wait for all data to be used), you can simply find the models that haven't been migrated and save them. The version attribute is indexed by default:

PersonRepository.find_by_version(0).each do |person|
  PersonRepository.save(person)
end
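
The same read-and-save loop can be sketched without Riak or curator. The example below uses a plain hash as a stand-in for the data store; the `store` contents and the `consolidate_name` helper are assumptions made for illustration.

```ruby
# In-memory stand-in for a collection of stored documents, keyed by id.
store = {
  "a" => {:first_name => "Joe", :last_name => "Smith"},              # no version, treated as 0
  "b" => {:name => "Jane Doe", :version => 1},                       # already migrated
  "c" => {:first_name => "Ann", :last_name => "Lee", :version => 0}
}
current_version = 1

# Hypothetical migration step: returns a new hash at version 1.
def consolidate_name(attributes)
  rest = attributes.reject { |key, _| [:first_name, :last_name].include?(key) }
  rest.merge(:name => "#{attributes[:first_name]} #{attributes[:last_name]}",
             :version => 1)
end

# Find every document below the current version, migrate it, and write it back.
stale_keys = store.select { |_, attrs| attrs.fetch(:version, 0) < current_version }.keys
stale_keys.each { |key| store[key] = consolidate_name(store[key]) }

remaining = store.count { |_, attrs| attrs.fetch(:version, 0) < current_version }
```

Once the loop finishes, no document in the sketch is left below the current version, which is the state a forced migration is meant to reach.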

Testing

Unlike ActiveRecord migrations, curator migrations have no side effects. They simply accept a hash and return a new hash. This makes them easy to call from a unit test:

require 'spec_helper'
require 'db/migrate/people/0001_consolidate_name'

describe ConsolidateName do
  describe "migrate" do
    it "concatenates first_name and last_name" do
      attributes = {:first_name => "Joe", :last_name => "Smith"}
      ConsolidateName.new(1).migrate(attributes)[:name].should == "Joe Smith"
    end
  end
end

Limitations

Curator migrations are lazy, so at any given time you might have documents with different versions in the data store. This is not normally a problem since the migrations will run as soon as the objects are read. However, if you add a migration that changes an indexed field, you cannot rely on that index to return all of the correct values until you migrate them all. In this case, you might want to force migration by reading and saving all of the documents.
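
The index limitation is easier to see with a small example. This sketch models a secondary index on name as a plain Ruby hash built from the stored documents; `build_name_index` and the sample documents are hypothetical and do not reflect how Riak maintains its indexes.

```ruby
documents = {
  "a" => {:first_name => "Joe", :last_name => "Smith"},  # not yet migrated
  "b" => {:name => "Joe Smith", :version => 1}           # migrated
}

# Hypothetical secondary index: maps each stored :name value to document keys.
# Documents that have no :name field yet cannot appear in it.
def build_name_index(documents)
  index = Hash.new { |hash, name| hash[name] = [] }
  documents.each do |key, attrs|
    index[attrs[:name]] << key if attrs.key?(:name)
  end
  index
end

# Before the old document is migrated, the lookup misses it.
before = build_name_index(documents)["Joe Smith"]

# Simulate reading document "a" through the migration and saving it.
documents["a"] = {:name => "Joe Smith", :version => 1}
after = build_name_index(documents)["Joe Smith"]
```

Here `before` contains only the migrated document, while `after` contains both, which is why an index on a newly introduced or changed field is only trustworthy once every document has been migrated and saved.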

Next Steps

You can see these migrations in action in the curator_rails_example project.

Let us know what you think about lazy data migrations in curator. Feel free to open issues on GitHub, submit pull requests, and help us make it better.

***
Paul Gross is a Lead Developer at Braintree. He previously worked at ThoughtWorks, a global IT consultancy, building custom software in diverse languages, including Java, .NET, Python, and Ruby.