Maintenance update: Mastodon 4.1 and new region


On February 16th, we updated the instance to run Mastodon 4.1 and moved it to the AWS us-west-1 datacenter in Northern California. The maintenance window took longer than anticipated due to problems executing the latest CloudFormation templates.

Timeline

All events are documented in Pacific time.

  • Feb 16 21:15: Shut down services in the cluster and started a snapshot of the database.
  • Feb 16 22:00 (approx): Failures occurring while executing the new CloudFormation template.
  • Feb 16 23:43: The previous deployment appears to be unrepairable. Migrating server data to a new deployment.
  • Feb 17 03:30: Currently importing the data from the previous site into the new database. Next step is to run the web server and see if the auto schema update works, after which I can start moving the media to the new S3 bucket.
  • Feb 17 04:00: Sidekiq, Web, and Streaming services appear to be running with the migrated database. DNS is not working.
  • Feb 17 06:30: The DNS problem resolved itself while I napped. Testing now. Login and browsing work, account info and text content -- anything in the database -- is available, and new posts are arriving as expected.  Images are broken, however.  I'm continuing to copy the media files from the old S3 bucket to the new one and expect this to fix the problem.
  • Feb 17 21:55: Media copy job is still in progress.  I've fixed the problem with broken avatar images.
  • Feb 18 08:00: Media copy job complete.

Narrative

"So Paul, how did the Mastodon server upgrade go last night?"

🙄

I've done a few major service upgrades in my time and have learned a few things. First, always write out the steps you're going to execute during the outage. It's too easy to miss something, even if you know the process inside and out. Chances are that you're going to be making changes at a time when you would normally be asleep, so you are not going to be sharp. The most important preparation, though, is to get the system into a state where no changes can be made, then BACK UP EVERYTHING. Losing data means losing the confidence your customers and members have in you.  It was a good thing I followed both of those tenets on Thursday night.

At about 9:15 pm, I decided to get started.  I shut down the web, Sidekiq, and streaming services and watched the database until there were no more connections.  Once I was sure that nobody was reading or writing data, I took a database snapshot and temporarily shut down the database just to be absolutely sure no data would be changed while I modified other services.  With the data taken care of, I uploaded the latest CloudFormation template and started the update process.  This template contains the service configuration for everything needed to run Mastodon on AWS in a scalable, secure way. It is broken down into sub-templates for specific parts of the system, like the web service, load balancer, database, and so on. The first challenge I encountered was that the sub-template for updating the networking configuration did not complete successfully. It was attempting to change a subnet in the virtual private cloud (VPC) used by several systems to communicate with each other and the Internet. I believed that CloudFormation could re-create resources if they didn't exist, so I removed the subnets and reran the template. That appeared to unblock the subnet update, but the next sub-template failed on a different part of the VPC configuration. I continued to remove configurations to unblock the update but it seemed like every time I did this, something else would fail.
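
For reference, kicking off the update itself is the simple part: hand CloudFormation the new template and wait for the stack to converge or roll back. Below is a rough boto3 sketch of that step; the stack name and template URL are placeholders, not the actual values from the Mastodon templates.

    # Rough sketch of starting the stack update with boto3.
    # "mastodon" and the template URL are placeholders, not my real values.
    import boto3

    cf = boto3.client("cloudformation", region_name="us-east-2")

    cf.update_stack(
        StackName="mastodon",
        TemplateURL="https://example-bucket.s3.amazonaws.com/mastodon.yaml",
        Capabilities=["CAPABILITY_IAM", "CAPABILITY_NAMED_IAM"],
    )

    # Block until the update finishes; a failed sub-template surfaces
    # here as a waiter error after the rollback completes.
    cf.get_waiter("stack_update_complete").wait(StackName="mastodon")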

By this time, it was 11:00 pm, and I was ready to call off the upgrade and regroup.  I returned to the screen to restart the database and saw an error message that hadn't been there before: "Invalid networking configuration". A quick web search revealed that the database is permanently associated with a VPC, and by deleting the VPC, I had put the database in a state where it could no longer be started up again. No worries, because I had the snapshot, right? Yes, but about half of the services that make up the site were still up and running.  If I deleted the CloudFormation stack to clean up those services, it would probably remove the S3 bucket where 1.6 million member images and media files live.  Backing up those files to another bucket was the way to go.  I created a new bucket and started the copy process.  It seemed like the files were moving at a glacial pace.  The copy was going to take many hours, if not days.  I posted an update to the main page of hub.montereybay.social to let folks know what was going on, then sat back to think.
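
For anyone curious, the backup was conceptually just a bucket-to-bucket copy of every object. The boto3 sketch below shows the idea, with made-up bucket names; at 1.6 million objects, aws s3 sync or S3 Batch Operations is a much better fit than a single-threaded loop like this.

    # Sketch of a bucket-to-bucket backup with boto3. Bucket names are
    # placeholders; for millions of objects, prefer aws s3 sync or
    # S3 Batch Operations over a single-threaded loop.
    import boto3

    s3 = boto3.client("s3")
    SRC = "montereybay-media-old"      # hypothetical source bucket
    DST = "montereybay-media-backup"   # hypothetical backup bucket

    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=SRC):
        for obj in page.get("Contents", []):
            s3.copy_object(
                Bucket=DST,
                Key=obj["Key"],
                CopySource={"Bucket": SRC, "Key": obj["Key"]},
            )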

Here's where we take a slight digression.  AWS has datacenters all around the world.  One of the basic best practices when deploying applications on AWS is to use the datacenter closest to where your customers are in order to reduce network latency.  When I built out the instance in November of 2022, I wanted to install it in the us-west-1 datacenter in Northern California.  I ran into a problem with the load balancer configuration in that datacenter that I couldn't resolve without some deeper investigation and work. I shrugged my shoulders, switched to the Ohio datacenter, and installed what would become MontereyBay.social.  I always wanted to get the instance moved back to California, though.  Latency wasn't really going to be a problem, but there were a couple of other arguments for moving (and separating) the services that make up that instance.  First, I have my personal blog and a couple of web sites in the same AWS account. That makes it hard to separate the costs of the Mastodon server from the costs of those other services.  Second, we do need another technical admin for the site, and it will be a lot of work to set up permissions so that person can just work on the Mastodon-related things in the account. It's possible, but not easy or scalable. What I needed was an AWS Organization.  Earlier that week, basically on a whim, I had set up an Organization, created an account, transferred the DNS and certificate management to that account, and moved those resources to the us-west-1 region.
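
If you're wondering what that setup involves: creating the Organization and a dedicated member account boils down to a few API calls. The boto3 sketch below shows roughly what I mean; the account name and email are invented for the example.

    # Rough sketch of creating an Organization and a dedicated member
    # account for the instance. The email and account name are placeholders.
    import time
    import boto3

    org = boto3.client("organizations")

    # Create the Organization (raises an error if one already exists).
    org.create_organization(FeatureSet="ALL")

    # Request a new member account; account creation is asynchronous.
    req = org.create_account(
        Email="aws-mastodon@example.com",
        AccountName="montereybay-social",
    )
    status_id = req["CreateAccountStatus"]["Id"]

    # Poll until the request completes.
    while True:
        status = org.describe_create_account_status(
            CreateAccountRequestId=status_id
        )["CreateAccountStatus"]
        if status["State"] != "IN_PROGRESS":
            break
        time.sleep(5)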

Back to our story. It was past 1 am, I'd been working on the upgrade for hours, and didn't have a path to upgrade or restore the existing deployment. What were my assets? A "clean" account located in us-west-1, a backup of the instance database, and my determination to be up and running by morning.  I decided that the time to move to California had come.  I started by running the latest CloudFormation templates in us-west-1 under the new account. Success! The load balancer problem didn't recur and the templates finished without any other issues.  The next step was to figure out how to get the database snapshot transferred to the new account in the us-west-1 datacenter.   AWS provides tools for sharing snapshots across accounts, though I found the actual process to be a bit convoluted. I had to create a new cryptographic key, share it across accounts, copy the encrypted snapshot with the new key, share the copy, copy the shared snapshot to the new account, and restore it to a brand new database instance.
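
For the record, the dance looks roughly like the sketch below. All of the identifiers, key ARNs, and account numbers are placeholders; the first two calls run in the old account, the rest in the new one.

    # Sketch of sharing an encrypted RDS snapshot with another account and
    # region. Identifiers, ARNs, and account numbers are all placeholders.
    import boto3

    rds_old = boto3.client("rds", region_name="us-east-2")

    # 1. Re-encrypt the snapshot with a KMS key whose policy grants the
    #    new account access.
    rds_old.copy_db_snapshot(
        SourceDBSnapshotIdentifier="mastodon-pre-upgrade",
        TargetDBSnapshotIdentifier="mastodon-pre-upgrade-shared",
        KmsKeyId="arn:aws:kms:us-east-2:111111111111:key/REPLACE-ME",
    )

    # 2. Share the re-encrypted copy with the new account.
    rds_old.modify_db_snapshot_attribute(
        DBSnapshotIdentifier="mastodon-pre-upgrade-shared",
        AttributeName="restore",
        ValuesToAdd=["222222222222"],
    )

    # 3. In the new account: copy the shared snapshot into us-west-1,
    #    wait for it, then restore a fresh database instance from it.
    rds_new = boto3.client("rds", region_name="us-west-1")
    rds_new.copy_db_snapshot(
        SourceDBSnapshotIdentifier="arn:aws:rds:us-east-2:111111111111:snapshot:mastodon-pre-upgrade-shared",
        TargetDBSnapshotIdentifier="mastodon-pre-upgrade-local",
        KmsKeyId="arn:aws:kms:us-west-1:222222222222:key/ALSO-REPLACE-ME",
        SourceRegion="us-east-2",
    )
    rds_new.get_waiter("db_snapshot_available").wait(
        DBSnapshotIdentifier="mastodon-pre-upgrade-local"
    )
    rds_new.restore_db_instance_from_db_snapshot(
        DBInstanceIdentifier="mastodon-db",
        DBSnapshotIdentifier="mastodon-pre-upgrade-local",
    )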

3:30 am. My next step was to figure out how to move the member data from the restored database to the new database.  I strongly suspected that the database created with the Mastodon 4.1 scripts had a different schema than the restored database, which was from a Mastodon 4.0.1 installation.  The safest way to proceed would be to delete everything in the new database and do a clean copy from the restored database, then let the web application run its automated schema update. Enter my long-time friends, pg_dump and pg_restore. I stopped the web, Sidekiq, and streaming services – remember the tenet about not running updates while the data can be modified? – then created a new EC2 instance in the VPC used by the databases and installed Postgres.  Next, I dumped the restored database to a file, connected to the 4.1 database, dropped it, and re-created it from the file.  Using these old tools was actually comforting.  I restarted the web service and was relieved to see the log message indicating that the schema had been updated.  Sidekiq and streaming started up with no issues as well.  I opened up a browser, typed in https://montereybay.social/, and saw... "Safari cannot find server." Wha? I'd transferred the hosted zone and domain registration to us-west-1 days ago, and it had been working fine right before I started the update.  I poked around us-east-2 and us-west-1 and found a duplicate hosted zone. Shaking my fist at CloudFormation (which it didn't really deserve), I deleted the duplicate zone, copied the DNS servers from the remaining One True Zone over to the domain registration, checked the SSL certificate to make sure it was still valid, and tried again.  Same error.  Ok, I thought, DNS takes some time to update.  Thirty minutes later, I was still getting the same host not found error.  At that point, it was 4 am, and I was out of ideas.  I updated hub.montereybay.social again with the progress, not realizing that the same DNS problems that were preventing me from loading the site were also affecting hub.
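
In case it helps anyone doing a similar migration, the dump-and-reload amounts to something like the sketch below, run from that temporary EC2 instance. The hostnames, database name, and user are placeholders for whatever your deployment uses, with credentials supplied via ~/.pgpass or PGPASSWORD.

    # Sketch of the dump-and-reload, run from a temporary EC2 instance in
    # the database VPC. Hostnames, database name, and user are placeholders.
    import subprocess

    OLD = "old-db.example.internal"   # database restored from the snapshot
    NEW = "new-db.example.internal"   # fresh database from the 4.1 templates
    DB, USER = "mastodon_production", "mastodon"

    # Dump the restored database in custom format.
    subprocess.run(
        ["pg_dump", "-Fc", "-h", OLD, "-U", USER, "-f", "mastodon.dump", DB],
        check=True,
    )

    # Drop the freshly created 4.1 database and recreate it empty...
    subprocess.run(["dropdb", "-h", NEW, "-U", USER, DB], check=True)
    subprocess.run(["createdb", "-h", NEW, "-U", USER, DB], check=True)

    # ...then load the dump; Mastodon's web service applies the schema
    # migrations on its next start.
    subprocess.run(
        ["pg_restore", "-h", NEW, "-U", USER, "-d", DB, "mastodon.dump"],
        check=True,
    )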

I woke up about 90 minutes later.  When I refreshed my web browser, I was relieved to see the login page for MontereyBay.social.  I logged in and saw that my account was there, as were all of my posts, followers, and all the rest.  Images were broken, but I expected that.  I started the process of copying media from the old S3 bucket in us-east-2 to the new bucket in us-west-1.  I was also happy to see new posts with images coming into both the Local and Federated feeds.  We were back, in a sort-of-functional way.

In the new version of the templates, the maintainer had added a CloudFront cache to reduce the time to serve static images. I looked up the URL for my own broken avatar image and was able to find the file in the new S3 bucket. This made me nervous. If the file was there, why wasn't it being served by CloudFront? The first thing to check was the permissions on the S3 bucket. The bucket policy clearly had CloudFront listed. I removed and recreated the policy just in case there was a typo, but no luck. While I was redoing the permissions, I noticed that CloudFront was configured to use an older method of authenticating with S3. Switching to the new auth method fixed the problem.
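
The old and new methods here are, I believe, CloudFront's legacy origin access identity versus the newer origin access control. With origin access control, the bucket policy grants read access to the CloudFront service principal, scoped to a single distribution. A sketch of that policy is below, with made-up bucket, account, and distribution IDs.

    # Sketch of an S3 bucket policy for CloudFront origin access control
    # (OAC). Bucket name, account ID, and distribution ID are placeholders.
    import json
    import boto3

    policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowCloudFrontOAC",
                "Effect": "Allow",
                "Principal": {"Service": "cloudfront.amazonaws.com"},
                "Action": "s3:GetObject",
                "Resource": "arn:aws:s3:::montereybay-media-new/*",
                "Condition": {
                    "StringEquals": {
                        "AWS:SourceArn": (
                            "arn:aws:cloudfront::222222222222:"
                            "distribution/EDFDVBD6EXAMPLE"
                        )
                    }
                },
            }
        ],
    }

    boto3.client("s3").put_bucket_policy(
        Bucket="montereybay-media-new", Policy=json.dumps(policy)
    )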

My "one hour upgrade" turned into a 12-hour, all-nighter maintenance and migration task. I learned a lot about several AWS services, though. This experience should help when I get around to taking my Certified Solutions Architect exam. I'm also happy that the site is now hosted as close as possible to our members, and that we now have the latest features (editing alt text after posting FTW!). I'm even kind of relieved to see that I can still make it through a tough outage at my age. Look out, kids – Grampa has still got the moves! 

Things that went well

  • Setting up the Organization and new account in us-west-1 ahead of time reduced the delay in getting the site back up and running.

Things that could be improved

  • Deleting resources to unblock the CF template run was the wrong approach. The template should be the source of truth for the full configuration of the instance, and it should be runnable at any time; see the drift-check sketch after this list.
  • The web interface for copying between S3 buckets tended to freeze up if my laptop went to sleep.
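
One habit that would support the first point is running a CloudFormation drift check before any maintenance window, to confirm that the live resources and the template still agree. A quick boto3 sketch, with the stack name assumed, looks like this:

    # Sketch of a pre-maintenance drift check with boto3; the stack name
    # is a placeholder. A DRIFTED result means the template and the live
    # configuration have diverged.
    import time
    import boto3

    cf = boto3.client("cloudformation", region_name="us-west-1")

    detection_id = cf.detect_stack_drift(StackName="mastodon")["StackDriftDetectionId"]

    # Drift detection runs asynchronously; poll until it finishes.
    while True:
        status = cf.describe_stack_drift_detection_status(
            StackDriftDetectionId=detection_id
        )
        if status["DetectionStatus"] != "DETECTION_IN_PROGRESS":
            break
        time.sleep(5)

    print(status["StackDriftStatus"])  # IN_SYNC or DRIFTED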

Resulting Actions