Upgrading ReSci from Rails 3.2 to 5.1

We did it! After 6+ months of diligent planning and focused execution, we’re celebrating a successful upgrade of our main application from Rails 3.2 to 5.1! In this article we discuss our motivations for the upgrade, how we approached it methodically, and the lessons we learned along the way.


Background:

At ReSci, many of us have been big Ruby and Rails users for 10+ years (I reminisce about reading the iconic skateboard book and installing Rails v1.2.6). We use Ruby on Rails extensively across many of our micro-services. Since we launched ReSci in 2012, we have been heads-down developing new features. Regrettably, we let one of our main applications linger too long on Rails v3.2. As this application and its test suite grew in size and age, our team faced increasing challenges.


Motivation for Upgrading Ruby and Rails:

  • Streamline our dev environment
    • We used the Zeus pre-loader (it was great 5 years ago!), which would randomly fail.
    • Specs were slowww as molasses.
    • Newer libraries were incompatible with Rails 3.2.
    • Much time was wasted just getting our environment to work normally.
  • Faster performance from newer Ruby and Rails versions
  • Speed up our deploys (we benchmarked our CI pipeline / specs going from 5 hrs to 3 hrs)
  • We faced EOL versions and had to fork certain gems to upgrade them
  • Security (granted, we took precautions to back-port security patches)
  • Engineer happiness – fighting with the old environment was not FUN!


Much of our affinity for Ruby and Rails lies in their elegance and efficient workflow. They’re fun to work with! Instead of an easy, streamlined development process, we found ourselves wasting excessive amounts of time wrangling with our environment, waiting for specs, and digging into random failures. This was precious time taken away from developing features. “UPGRADE” was on everyone’s mind for a long time, but always got de-prioritized. It was daunting! Finally, we did the cost-benefit analysis and made it a priority.


Development Approach:

The Rails upgrade project actually began as an internal “side-project” — an effort by the team to upgrade slowly in the background. A couple of engineers did POCs and kicked off a “rails5-upgrade” branch. Whenever the team had spare cycles, we would commit to this branch. Our goal was simply “get all specs to pass!” As momentum grew, we held a few “upgrade-a-thons” where much of the team would get together to make progress on the upgrade (and play games!). Finally, we gained enough traction to make this an official priority and make the final push to production.


Our high-level approach was to create an upgrade branch that we would slowly get to 100% working, while daily development continued on the master branch. The obvious downside to this approach was the need to periodically keep the upgrade branch up to date with master. In retrospect, there may have been better approaches to consider (see GitHub’s engineering blog – they did a similar upgrade, approached it differently, and launched around the same time).
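One such alternative is the “dual boot” pattern: rather than a long-lived upgrade branch, a single branch conditionally loads either Rails version. A minimal sketch of the idea (the RAILS5 environment variable is our own invention):

```ruby
# Gemfile (hypothetical dual-boot sketch; the RAILS5 switch is invented)
source "https://rubygems.org"

if ENV["RAILS5"]
  gem "rails", "~> 5.1"
else
  gem "rails", "~> 3.2"
end
```

Paired with a second lockfile (e.g. pointing Bundler at it via the BUNDLE_GEMFILE environment variable), CI can exercise both versions on every commit — trading the merge overhead of a separate branch for conditional code paths in the app.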


One important decision we made early on was to have zero database migrations or data migrations in our upgrade branch. This was to ensure that we could run both systems concurrently, and we could deploy the Rails 5 environment to a subset of our servers. In addition, this made any potential roll-backs trivial.
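A rule like this is easy to enforce mechanically. A hypothetical CI check (the snapshot file name and helper below are invented for illustration) could fail the upgrade-branch build whenever db/migrate gains a file that master doesn’t have:

```ruby
# Hypothetical guard: compare the upgrade branch's migrations against a
# checked-in snapshot of master's db/migrate listing.
def extra_migrations(branch_migrations, master_migrations)
  # Anything present on the branch but absent from master is a new migration.
  branch_migrations - master_migrations
end

if __FILE__ == $PROGRAM_NAME && File.exist?("db/master_migrations.txt")
  master = File.readlines("db/master_migrations.txt", chomp: true)
  branch = Dir.glob("db/migrate/*.rb").map { |f| File.basename(f) }
  extras = extra_migrations(branch, master)
  abort "Upgrade branch must not add migrations: #{extras.join(', ')}" unless extras.empty?
end
```

Run as part of the upgrade branch’s CI, a check along these lines turns the “no migrations” rule from a convention into something the build enforces.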


In order to tackle such a big project, we carefully split our project into a few big milestones:


  • Milestone 1:
    • Get all specs to pass (unit tests, integration tests, end-to-end tests).
      With nearly 10,000 specs in our suite, we knew that getting all specs to pass would be a good start. It would require a lot of effort, but give us a high level of confidence.
    • Get the front-end application to load
      This proved to be non-trivial, as some front-end libraries needed to be refactored, removed, etc.
    • Rip out old tech-debt
      We were very happy that we could do this as an added bonus. We actually removed ~70k lines of code!
    • Document lists of follow-ups
      Deprecations, non-backwards-compatible changes, etc. Keeping this organized was super important to our execution.
    • Maintain weekly merges of master
  • Milestone 2:
    • Compile AND complete a large series of smoke tests
      We went through our entire application, and created a very thorough set of smoke tests that we wanted to see passing. We split these amongst the team, so that we could get through this big list.
    • Document any bugs (for follow-up)
      Any bugs or defects were logged separately, to be addressed in a later step.
    • Maintain weekly merges of master
  • Milestone 3:
    • Complete all items on the bug list
    • Full QA cycle of application
    • Maintain weekly merges of master
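For the list of deprecation follow-ups, Rails can do part of the bookkeeping itself: routing deprecation warnings to the logger produces a greppable record of everything that needs attention. A sketch using the standard configuration option:

```ruby
# config/environments/development.rb
Rails.application.configure do
  # Send ActiveSupport deprecation warnings to Rails.logger rather than
  # stderr, so they can be collected and triaged after a test run.
  config.active_support.deprecation = :log
end
```

Setting the same option to `:raise` in the test environment makes deprecated calls fail loudly, which can help keep new ones from creeping in while the follow-up list is being worked down.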


Deployment:

When it came time for deployment, our approach was to split deployment into a few steps, from low-risk to higher-risk. We did a one-week code-freeze while we thoroughly tested the entire application in lower environments. In production, the goal was to roll out Rails 5 servers alongside our Rails 3.2 servers, with limited scope on which background jobs and front-end requests they would process.


Deployment Week Schedule:

  • Monday:
    • Background servers: Launch one Rails 5 background machine watching a limited set of low-risk background Resque queues. Any jobs that encountered errors could be easily retried on the Rails 3.2 machines.
    • Web servers: Launch a single Rails 5 web machine on our load balancer, accessible only internally, and begin testing routes.
  • Tuesday-Wednesday:
    • Background servers: Gradually increase the number of queues that the Rails 5 servers were watching, until 100% of queues were covered. Any issues that came up could be addressed immediately.
  • Thursday:
    • Web servers: Launch a single Rails 5 machine to production during off-hours and monitor traffic.
    • Once everything was verified and looked good (background + web), we did the full switch-over, and continued monitoring.
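Resque workers take their queue list from the QUEUES environment variable, so staging the rollout this way is largely a matter of how each box’s worker processes are launched. A sketch of the launch commands (the queue names are invented):

```shell
# Monday: the Rails 5 box watches only low-risk queues (names are invented)
QUEUES=low_risk_reports,low_risk_exports bundle exec rake resque:work

# Tuesday-Wednesday: widen the list queue by queue, until all are covered
QUEUES='*' bundle exec rake resque:work
```

Because jobs wait in Redis until a worker picks them up, anything that failed on a Rails 5 worker could simply be retried and land on a Rails 3.2 worker instead.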


Our Learnings:

  • Make it a priority: In order to accomplish this upgrade successfully, it was important that everybody (product managers, engineers, DevOps, QA, other teams, etc.) was on the same page. List out the advantages of the upgrade, discuss, convince, and don’t discount engineer happiness!
  • Don’t rush: Take a methodical approach. Write out your plan and any timelines so it can be visualized by everybody! It takes time, but the time you invest in the process and preparation will pay off in making it a smooth transition. It will also help keep everyone accountable.
  • Deploy in chunks that you can easily roll back: No matter how much you test in lower environments, production is always a different beast. If you deploy small chunks (background vs. web servers, limited queues, limited traffic), you minimize the risk and can easily roll back anything that does not work as expected.
  • Don’t let your application get too large: This is hard to do, especially once it’s already “too large.” Explore micro-services when it makes sense. The smaller the service, the more realistic any upgrade will be.
  • Keep it contained: You’ll be tempted to refactor, address tech-debt, re-write. Some of it will make sense, but you need to have a realistic, balanced view on this. If your goal is to do a big Ruby / Rails upgrade, you don’t want to change functionality unnecessarily. The more refactoring you do, the harder it will be to review and merge quickly and frequently.
  • Good test coverage is key. The more confident you are in your test suite, the easier any upgrades will be.
  • Upgrade regularly to avoid a huge upgrade!


Looking Forward:

Over the last 6 years at ReSci, we have learned many invaluable lessons in scaling large Rails applications. We have learned the importance of quality code, proper design, refactoring, maintaining good specs, and building resilient, self-healing services. One of my favorite quotes is from Tim Jenkins of SendGrid. In a blog post about scaling at SendGrid, he said, “once you have millions of messages going through the system every minute, the ‘one in a million’ error cases start happening very regularly.” We see this first-hand every day at ReSci as well.


We are committed to continued investment in our application and to paying off tech-debt regularly. At the same time, we will continue extracting components into individual, scalable services of their own.


Right decision? YES! We’re absolutely happy with our decision to make this upgrade happen. I’m proud to be working alongside a team that was able to execute on this vision and, moreover, a team that shares such a strong dedication to long-term quality, reflection, and continued improvement.