How We Built the Software that Processes Billions in Payments

With our announcement of the $34 million Accel investment, I wanted to share the things that got us here.

It all starts with people: the most important ingredient in building great software.

Next are the practices: testing, pair programming, collaboration, and agile development.

And with the software itself, there's a huge premium on features like High Availability and Quality of Service in the payments space. We've spent a lot of time coming up with some amazing solutions in these areas.

People

Building a greenfield payment gateway is challenging. Conventional advice when building software is to release early with a minimum viable product to start gathering feedback and interacting with users. But you can't exactly slap a beta label on a payment gateway and then ask for forgiveness if bugs impact your customers' ability to collect payments and run their business. And of course, security and high availability are of paramount importance.

We built most of the first version with only two pairs (four developers total, pair programming). We were fortunate to have an incredibly talented team. This quote from Paul Graham comes to mind:

Steve Jobs once said that the success or failure of a startup depends on the first ten employees. I agree. If anything, it's more like the first five. Being small is not, in itself, what makes startups kick butt, but rather that small groups can be select. You don't want small in the sense of a village, but small in the sense of an all-star team.

We have a small team of all-stars here at Braintree. It's amazing what we've been able to accomplish with so few people. In conversations I've had with others about us, people are always shocked to find out that we only have 7 developers, especially as the same team is also responsible for production operations.

Practices

There are a few key principles that we apply to all of the problems we tackle; these are the things we regard as the most important ingredients for building good software.

Testing: testing is at the forefront of our development philosophy. We never need to check our code coverage to know that it's at 100%: with disciplined TDD, no line of code will be written without a test. We don't have a QA team. That might be terrifying when you consider the type of software that we're building, but we're confident that our automated testing is thorough and will catch any regression bugs. We use continuous integration to test every version of every client library against our gateway.

Pairing and collaboration: we pair program to write all of our software. We work on Mac Pros with two keyboards and two monitors. We work in an open team room; no cubicles or private offices. Communication is key to our process, and we don't want to hinder it with walls.

Agile: Agile development methodologies mean different things to different people. For us, the most important part of Agile is doing what works best for the team. We have a story card wall and release a few times a week. We keep the team in sync with daily standups and have a retrospective once a month to discuss things that are going well and opportunities for improvement. We're pragmatic, not dogmatic. Although we have strong opinions, we're never afraid to try new things to see if they work and reconsider our positions if the situation warrants it.

Polyglot: Although most of our software is written in Ruby, we don't confine ourselves to a single programming language. We believe in using the best tool for the job while maintaining a slight bias toward the tools the team knows best. We've written infrastructure components in Python, and we build client libraries for integrating with Braintree in Ruby, Python, Node.js, PHP, Java, and .NET.

Software

Our software is like an iceberg: most of the interesting work that we do is beneath the surface, and only a fraction is exposed.

High Availability: Customers never stop buying things online, so as a payment gateway, we can't go down for maintenance. This makes a whole host of what would normally be routine tasks much more complex. Today I'm proud to say that after putting significant thought and effort into this problem, we are able to perform all our maintenance without downtime. We can deploy new versions of our software, make database schema changes, or even rotate our primary database server, all without failing to respond to a single request. We can accomplish this because we gave ourselves the ability to suspend our traffic: incoming requests are held rather than rejected, which gives us a window of a few seconds to make changes before letting the requests through. To make this happen, we built a custom HTTP server and application dispatching infrastructure around Python's Tornado and Redis. It was a great project to work on, and it's been a big win for us.
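The suspend-and-release idea can be sketched in a few lines. This is only an illustration built on Python's standard asyncio module, not the Tornado and Redis dispatcher described above; the TrafficGate class, its methods, and the demo function are hypothetical names invented for the example.

```python
import asyncio


class TrafficGate:
    """Holds incoming requests while traffic is suspended (hypothetical helper)."""

    def __init__(self) -> None:
        self._open = asyncio.Event()
        self._open.set()  # traffic flows by default

    def suspend(self) -> None:
        """Start holding new requests instead of dispatching them."""
        self._open.clear()

    def resume(self) -> None:
        """Release every held request."""
        self._open.set()

    async def handle(self, request_id: int, log: list) -> None:
        await self._open.wait()  # parks here for the length of a suspension
        log.append(f"handled {request_id}")


async def demo() -> list:
    gate = TrafficGate()
    log: list = []
    gate.suspend()
    # Requests that arrive during the window are held, not rejected.
    tasks = [asyncio.create_task(gate.handle(i, log)) for i in range(3)]
    await asyncio.sleep(0.01)  # stand-in for the few-second maintenance window
    log.append("maintenance done")  # e.g. a deploy or a database switch
    gate.resume()
    await asyncio.gather(*tasks)
    return log


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

In this sketch every request issued during the window completes after the maintenance step, so callers see a short delay rather than an error. A real dispatcher would also need to bound how long requests may be held.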

Quality of Service: Like all web-based services, we need to ensure that a single customer cannot over-utilize resources and affect performance for everyone else. It's common with rate limiting to simply block requests once a certain threshold is exceeded. Of course, none of our customers would be too happy if they lost a legitimate payment because one of their devs kicked off a barrage of API search calls that exceeded a hard quota. So we needed a more sophisticated algorithm. With some work on our custom server, HTTP requests are no longer processed in FIFO order. Instead, we schedule the requests to handle rate limiting and fair share queueing. If one customer slams us, others aren't affected, and that customer is only slowed down, not blocked.
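To make the difference from a single FIFO queue concrete, here is a minimal sketch of fair share queueing, assuming the simplest possible policy: one queue per customer, served round-robin. The FairShareScheduler class and its method names are hypothetical, and the sketch leaves out rate limiting and anything else the production scheduler does.

```python
from collections import OrderedDict, deque


class FairShareScheduler:
    """Serves customers round-robin instead of one global FIFO (illustrative)."""

    def __init__(self) -> None:
        # Maps customer id -> that customer's pending requests, in turn order.
        self._queues: "OrderedDict[str, deque]" = OrderedDict()

    def enqueue(self, customer: str, request: str) -> None:
        self._queues.setdefault(customer, deque()).append(request)

    def next_request(self):
        """Return (customer, request) for the next turn, or None when idle."""
        if not self._queues:
            return None
        customer, queue = self._queues.popitem(last=False)  # customer at the front
        request = queue.popleft()
        if queue:
            self._queues[customer] = queue  # rejoin at the back of the line
        return customer, request


if __name__ == "__main__":
    scheduler = FairShareScheduler()
    for i in range(1, 5):
        scheduler.enqueue("busy-customer", f"search-{i}")  # a burst of API calls
    scheduler.enqueue("other-customer", "payment-1")  # arrives after the burst

    item = scheduler.next_request()
    while item is not None:
        print(item)
        item = scheduler.next_request()
```

Under a plain FIFO, payment-1 would wait behind all four search calls. Here it is served second, and the busy customer's remaining requests are delayed rather than dropped.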

What We're Working on Now

DIY Availability Zones: High availability also means preparing for worst-case scenarios. Our current DR plan requires DNS failover to the IP address at our alternate data center. We're working on connecting our DR data center to our production environment with fiber, and multihoming them so that both are live and accepting traffic using BGP routing. It will be like having multiple EC2 availability zones, but with bare metal. When this goes live, we'll be able to unbalance an entire data center as easily as we can unbalance an app server.

Scaling: many web applications have significantly more read requests than write requests. But payment gateways are unique in that the majority of our traffic is writes -- we get more POST requests than GET requests. This means that much of the standard scaling wisdom (caching, read-slaves, etc.) can't be applied out-of-the-box. We'd like to get our architecture as close to the dyno-slider as possible, where the only thing we have to do is buy more hardware as our load increases. There are plenty of interesting pivots to be made in our software to get us there.

Product features: our users are developers, so many of the features that we work on are very technical. On our short-term road map we have: push notifications / webhooks, email customization using Liquid templates, asynchronous data exports, an AJAX/JSONP API, and more features that will improve the life of any developer integrating payments into his or her application. In some ways, we are our own target audience, so it's exciting to be able to plan and build features that we know we'd love to have.

The Perks

We spend 10% of our time on open dev days, where everybody is free to work on whatever project they're interested in, regardless of whether it's relevant to Braintree. Last week, for instance, we used Node and ZeroMQ to build a mesh chat protocol in a bring-your-own-implementation mini contest. We think the best people should have the best compensation, which is why we offer generous salaries and 401k matching.

We're looking for people who are interested in building amazing software to transform the payments industry. If you're interested, read more about careers with Braintree and Venmo.

***
Dan Manges was formerly the CTO and Software Development Lead at Braintree.
