Our upgrade path was split into five stages:
- Cruft removal / Dependency updates
- Installing rails_xss plugin
- Updating Rails
- Opt-in releases
- Full release
Portrait of the Systems Team while upgrading Rails
It's worth noting that while this was our expected path all along, we didn't initially expect to upgrade Ruby as part of this project. (More on that later.)
Before any upgrade work began, we wanted to clear out the cobwebs to make upgrading easier. This meant deleting unused code, refactoring our test suite to be more uniform, creating new abstractions to consolidate duplicated code (particularly around ActionMailer and TimeZone), and upgrading dependencies across the board.
By attacking components piece by piece and doing many small pull requests and deploys, we made steady progress. Beyond general cruft cleaning, we focused on two specific goals: transitioning our plugins to gems, and upgrading the gems that would otherwise block a Rails upgrade.
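As a sketch, the plugin-to-gem moves mostly amounted to Gemfile changes like the following (the gem names here are hypothetical examples, not our actual dependency list):

```ruby
# Gemfile (sketch) -- each vendor/plugins directory becomes a versioned gem.
source 'https://rubygems.org'

# Formerly vendor/plugins/will_paginate, now a pinned, upgradeable gem:
gem 'will_paginate', '~> 3.0'

# In-house plugins can be pulled from a git repo while being extracted:
# gem 'chorus_widgets', git: 'https://github.com/example/chorus_widgets.git'
```

Once everything is a gem, `bundle outdated` makes it much easier to see which dependencies will fight you during the Rails upgrade.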
Depending on the size of your application, this kind of prerequisite may not be as important, but for an application as old as Chorus, it really helped us trim the fat. Every crufty component we removed made later work easier, whether we were fixing test failures or doing mass replacements of Ruby/Rails APIs.
Project flamethrower: The eternal quest to destroy cruft in Chorus
Most importantly, though, only one developer (hat tip Skip Baney!) worked on this phase, but his dedication to the cause got our entire team in the mindset of removing and refactoring cruft wherever possible. Developers on totally different projects joined in on the fun, and the product team as a whole has learned the great joy of deleting code.
Rails 3.x switched its ERB template engine to erubis, which escapes HTML by default - a change in behavior from Rails 2. This new default helps protect against XSS attacks, but it also meant we had to check every page type in Chorus to make sure things rendered properly. We automated some 'double escaping' detection, but the majority of this work was simply creating a spreadsheet of all the page types in Chorus and scheduling a day for a group of developers to swarm on checking and fixing them.
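A minimal illustration of the behavior change, using `CGI.escapeHTML` from the Ruby standard library to stand in for what erubis does to every `<%= %>`:

```ruby
require 'cgi'

comment = %(<script>alert("xss")</script>)

# Rails 2-style ERB: interpolation is raw unless you call h() yourself.
rails2_output = "<p>#{comment}</p>"

# Rails 3 / erubis: every <%= %> is escaped unless the string is html_safe
# (or wrapped in raw()), so the markup above is neutralized by default.
rails3_output = "<p>#{CGI.escapeHTML(comment)}</p>"

# "Double escaping" is the opposite failure mode: escaping an already-escaped
# string turns &lt; into &amp;lt; and shows raw entities to readers. This is
# the kind of output our automated detection looked for.
double_escaped = CGI.escapeHTML(rails3_output)

puts rails2_output   # script tag survives -- an XSS vector
puts rails3_output   # &lt;script&gt;... -- inert text
```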
This work was then merged into master and deployed. We had to clean up bugs in various nooks and crannies as they were reported, but being able to isolate this chunk of work from the Rails 3 upgrade as a whole was a big, important step in the process.
To do the actual Rails gem upgrade, we blocked off a week for four developers to work together and get as far as we could. We created a punchlist-style FogBugz task list separated into chunked-up areas and worked through one item at a time: upgrading gems, converting to the new routing syntax, rewriting deprecated ActiveRecord calls, and going through the rails_upgrade checklist.
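The flavor of those mechanical rewrites, sketched with a hypothetical `Story` model (not actual Chorus code):

```ruby
# config/routes.rb -- Rails 2 routing:
#   map.connect 'stories/:id', :controller => 'stories', :action => 'show'
# becomes the Rails 3 router DSL:
match 'stories/:id' => 'stories#show'

# Deprecated Rails 2 finders:
#   Story.find(:all, :conditions => { :published => true }, :order => 'id DESC')
# become chainable Rails 3 relations:
Story.where(published: true).order('id DESC')
```

Individually trivial, but multiplied across an application the age of Chorus, these rewrites made up a large share of the week.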
The beginning of this process was stressful because we were making massive sets of changes without even being able to boot the application, let alone run tests. Once we could boot, we focused on unit tests, and once those passed, on functional and integration tests. The time invested in our test suite during the first phase paid off greatly here.
At this point, we did our Rails upgrades in steps - going from 2.3 to 3.0 to 3.1 and finally 3.2. We stopped at 3.2 because it was the last version to support Ruby 1.8, which was still the version we were planning on releasing with. For this phase (as well as the next one), all work was done in a branch that we constantly merged master into. Merge conflicts were common but usually easy enough to sort out.
Our collective prior experiences taught us that Rails upgrades require very extensive QA. The last major update took weeks of testing, and that was when Chorus only ran SB Nation. Given the current size and scope of Chorus, it was logistically impossible to do a "well, make sure everything works" attempt, so we schemed for an alternative.
The approach we landed on was largely inspired by a blog post by Envato on upgrading Rails on a large production system. We needed something that would allow us to test our work against real traffic without disturbing the multiple master deploys we do every day to production.
We have the fortune of a healthy provision of application servers, so we removed one from our main load balancer pool and deployed our upgrade branch to it, creating a single 'Rails 3 Beta' server. We then created a special landing page that assigned an "opt-in" cookie across all of our relevant domains, and set up a rule on our load balancer to direct any request containing that cookie to the beta server.
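A sketch of the opt-in mechanics (the cookie name and domain list are hypothetical, not our actual values): the landing page drops one cookie per Chorus domain, and the load balancer routes any request carrying that cookie to the beta box.

```ruby
# Hypothetical cookie name and domain list -- illustrative only.
OPT_IN_COOKIE = 'rails3_beta'
CHORUS_DOMAINS = %w[sbnation.com theverge.com]

# One Set-Cookie header per domain; scoping to ".domain" covers subdomains,
# so the LB rule ("if the request has rails3_beta=1, send it to the beta
# server") applies everywhere a reader browses.
def opt_in_headers
  CHORUS_DOMAINS.map do |domain|
    "#{OPT_IN_COOKIE}=1; Domain=.#{domain}; Path=/"
  end
end

opt_in_headers.each { |header| puts header }
```

Opting out is just deleting the cookie, which made it easy for testers to bail back to the stable servers if they hit something broken.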
Since this beta server was connected to the production database, beta testers could read and create content right along with everyone else. This was essential to getting people to actually use the beta and find bugs. In the past, we've set up totally separate environments to try to get real users to beta test, but because the content isn't fresh and testers aren't doing actual work there, those environments are usually quickly forgotten.
It is worth noting that while the beta environment shared a database server, we did assign a separate memcache key prefix for the beta server due to marshaling incompatibilities with ActiveRecord objects between versions.
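In Rails config terms, that isolation is a one-line namespace change on the beta server (the server address and prefix below are illustrative):

```ruby
# config/environments/production.rb on the Rails 3 beta server (sketch).
# A distinct namespace keeps Rails 3 from reading cache entries that
# Rails 2 marshaled (and vice versa), since the serialized ActiveRecord
# object formats are incompatible between versions.
config.cache_store = :mem_cache_store, 'cache1.example:11211', { namespace: 'chorus-r3' }
```

Both servers still share the same memcache cluster; the prefix just partitions the keyspace so neither version trips over the other's marshaled objects.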
With this beta opt-in system in place, the developers working on the upgrade opted themselves in, and we started experimenting with temporarily putting the beta server back into the general load balancer pool. While this was a very useful way to flush out exceptions (side note: we use Sentry for exception tracking, and highly recommend it!), it was at this point that we started to have serious performance concerns.
Chorus's New Relic web transaction graph for Ruby 1.8 and Rails 2.3
Chorus's New Relic web transaction graph for Ruby 1.8 and Rails 3.2
With requests taking about four times as long to process, we quickly ran into Unicorn queue backups and 502 spikes. We went through several rounds of tuning GC settings and experimenting with out-of-band GC, but nothing made a big enough dent. On a whim, we time-boxed a couple of days to spin up another branch and try upgrading to Ruby 2.1, to see if that would help. That upgrade was actually significantly easier than the Rails one, since tests caught just about all of the problems.
Around this point, we outlined a final deploy plan and started working with our top-notch operations team to figure out the best path for upgrading Ruby on our servers. We landed on slightly customized/recompiled Brightbox Ruby 2.1 packages, which made it easy to upload the binaries beforehand and use update-alternatives when we wanted to make the switch.
Once our beta server was running 2.1, we put it back in the general pool to kick the tires a bit and were much happier with the results:
Chorus's New Relic web transaction graph for Ruby 2.1 and Rails 3.2
Now that we were confident in performance and general exception cases, we started phasing more people into the beta. We started with everyone on the product team, and then invited our editorial teams one by one. Getting detailed bug reports from these Chorus power users was extremely useful - they noticed not only show-stopping errors, but also subtle issues like widgets that didn't render and queries that returned results out of order.
By the end of this process, we had about 120 people opted in to the beta and collected/fixed 50 bugs from user feedback. All of this gave us a healthy degree of confidence for the full deploy.
The biggest diff ever deployed to Chorus at once
We planned our full release for April 23rd at 3AM - our lowest traffic point of the day. We did a lot of extra work to make sure we could release without interrupting service, but still wanted to be as safe as possible and deploy off hours. In preparation, the Ruby 2.1 packages we created were uploaded to all our servers, but not yet installed. We took half of our application servers out of the load balancer at a time, installed the Ruby packages, deployed the new code, did some smoke checks, and then put them back in the pool. Naturally, some new exceptions cropped up and new bugs were found, but on the whole, explosions were averted and the process went according to plan - and a fresh set of eyes at the start of the business day greatly helped in fighting fires.
For all the horror stories about updating a large application, things went pretty well. Having buy-in from editorial teams and management to dedicate multiple months of developer time meant we could tackle this project properly, and to all of us that write code in Chorus, it's been well worth it - boot times are faster, graphs are more stable, and the world is our oyster... until the next major Ruby or Rails update. lol/sob
Any questions about more specific parts of our upgrade process? Let us know in the comments! Or if you are a developer who has no fear of refactoring large, constantly evolving code bases - Vox Product is hiring full stack engineers.