Going Live With The Platform Migration


Cutting over to the new PaaS went about as smoothly as we could have hoped, but that doesn’t mean nothing went wrong. Here is everything that happened from T-12 hours through the hours after launch:

Before Launch

  • Connecting our data warehouse to the new database kept failing at the SSH step because the public key had been shared over a certain app that truncated it. This took a while to figure out.
  • After that, granting the data warehouse’s Postgres user access to the right database instance took a little longer than anticipated because our new RDBMS had multiple databases and I was using the wrong one.
  • While limiting access to the bastion host (database proxy) we had set up for the data warehouse, we accidentally misconfigured an access rule, which caused a hiccup in the warehouse sync.
  • The production Rails app in the new PaaS was still pointing to our test database instance (see the connection check sketch after this list). 🤦
  • We forgot to increase the number of web instances to 2.
  • We forgot to connect the APM (Application Performance Monitoring) integration to Kubernetes. This took longer than anticipated because Kubernetes patches had to be applied.
  • The link to a release in our error reporting tool, used by our integration platform, was built from the Git commit SHA instead of the app version number.
  • The database sync in Bucardo hit a snag on a particular row in one of our tables that violated a constraint. The PaaS’s support team removed this table from the sync, which resolved the issue. We chose to do this because the table is a cache we could populate manually and quickly (see the constraint check sketch after this list).
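
Two of the pre-launch issues above came down to something pointing at the wrong database. A sanity check along these lines, run from a Rails console, would have caught it earlier; this is a minimal sketch assuming Rails 6.1+, where connection_db_config is available:

    # Confirm which database and host the app is actually connected to.
    config = ActiveRecord::Base.connection_db_config
    puts "database: #{config.database}, host: #{config.host}"

    # Fail loudly if this still looks like the test instance.
    raise "Still pointing at the test database!" if config.database.to_s.include?("test")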

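As for the Bucardo snag, here is a rough sketch of the kind of check that finds offending rows before re-adding a table to a sync, using a uniqueness constraint as the example; the table and column names are placeholders, not our real schema:

    # Find duplicate keys that would violate a uniqueness constraint.
    duplicates = ActiveRecord::Base.connection.select_all(<<~SQL)
      SELECT lookup_key, COUNT(*) AS occurrences
      FROM cache_entries
      GROUP BY lookup_key
      HAVING COUNT(*) > 1
    SQL
    puts duplicates.to_a
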
After Launch

  • We misconfigured the load balancer for a moment, causing some jobs to land in the new PaaS prematurely. We dumped the Redis state into a JSON file to back it up (see the Redis snapshot sketch after this list); otherwise, that data could have been wiped when we pushed the Redis data from our previous provider into our new Redis instance.
  • Using two origin pools with the load balancer did not go as expected: putting the API into maintenance mode on the previous provider caused that origin pool to be marked unhealthy, so the load balancer directed all traffic to the new PaaS. 🥴
  • The environment variable that set the environment in our error reporting tool was set to staging instead of production, a remnant from testing (see the initializer sketch after this list).
  • We realized we had forgotten to test PDF receipt generation, and it turned out not to work without additional dependencies in our Docker image. We decided to resolve this the next morning.
  • Multiple ActiveRecord models were not being returned from the API because we had not migrated the Rails cache, and it had to be warmed up manually (see the cache-warming sketch after this list).
  • We realized we had not codified the changes to the number of instances or RAM for some of the processes.
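
For the Redis backup, one way to snapshot Redis state to a JSON file, roughly along the lines described above, is to walk every key with the redis gem and serialize it by type; the key pattern, output path, and REDIS_URL variable name are illustrative, not our actual setup:

    require "redis"
    require "json"

    # REDIS_URL is an assumed environment variable name.
    redis = Redis.new(url: ENV.fetch("REDIS_URL"))
    snapshot = {}

    # Walk every key and serialize it according to its type.
    redis.scan_each(match: "*") do |key|
      snapshot[key] =
        case redis.type(key)
        when "string" then redis.get(key)
        when "list"   then redis.lrange(key, 0, -1)
        when "hash"   then redis.hgetall(key)
        when "set"    then redis.smembers(key)
        when "zset"   then redis.zrange(key, 0, -1, with_scores: true)
        end
    end

    File.write("redis_backup.json", JSON.pretty_generate(snapshot))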

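The staging-vs-production mix-up was a one-line configuration problem. Purely for illustration, here is what the relevant initializer might look like with a Sentry-style SDK (our actual error reporting tool is redacted, and the APP_ENV variable name is an assumption):

    # config/initializers/error_reporting.rb (illustrative)
    Sentry.init do |config|
      # This must come from the production environment, not a value
      # left over from staging or testing.
      config.environment = ENV.fetch("APP_ENV", "production")
    end

Warming a cold Rails cache manually is essentially a matter of re-populating entries for the records the API serves from cache. A minimal sketch; the Widget model and the cache key format are hypothetical:

    # Re-populate cache entries for records the API serves from cache.
    Widget.find_each do |widget|
      Rails.cache.fetch("models/widgets/#{widget.id}") { widget.as_json }
    end
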
All of that seems like a lot, but given what could have gone wrong, it didn’t feel like much. Thankfully, we caught most of these things before we went live, and the issues that occurred after launch were relatively minor.

Note: I have purposefully redacted any concrete information about the specific platform and our infrastructure for security purposes.