Dumping and Restoring a Postgres DB


As part of the platform migration project we wanted to stand up a POC in one of the new PaaS providers. This effort involved standing up a new RDS database which the POC application would connect to. To boost our confidence in the POC, we also wanted to run regressions against the POC to determine if there were any problems with it, most likely around the application’s configuration or infrastructure.

In an effort to simplify regression testing (and speed things along), we decided to keep the POC pointed to our old Heroku Postgres instance so that we wouldn’t have to worry about getting all the correct data in the new RDS instance.

Unfortunately, testing with the Heroku Postgres instance (and Heroku Redis instance, for that matter) resulted in some major latency. We were seeing Redis round-trip times of over 6 seconds, and on some requests the time spent talking to the database was in the ballpark of 10 seconds. What this meant for regression testing was that some of the requests would time out. It was so bad that we could not reliably do any testing.

Why was the latency so high? The POC was making requests to the Heroku Postgres and Redis instances, which live in a different AWS Region than the application. While that explains some of the latency, we’re still not sure why it was quite so bad.

As a next step, we decided to point the POC to the RDS and ElastiCache instances we had configured, which were in the same AWS Region as the application. Unfortunately, we didn’t have enough data in that database to continue testing that day.

We thought an acceptable path forward would be to just dump the staging database from Heroku and restore the data into the RDS instance, so that is what I did today.

Dumping the database seemed straightforward enough: the Heroku Postgres instance is publicly accessible, so I should be able to just use the pg_dump tool to get what we need.

Here is the command I used:

pg_dump --no-owner -h <host_ip> -U <username> --data-only <db_name> > total.dump

So I ran the command. WCGW?

About 10 seconds in, I realized I had never checked how big our staging database is. If the database were sufficiently large, pg_dump was not going to cut it and could negatively impact performance. It also pulls all the data over the network down to my local machine, which could take a long time for a big database.

Luckily, our staging database only takes up about 1.5GB. Therefore, nothing bad happened in staging 😉 But it did take a few minutes to get all the data. I periodically checked the file size to make sure data was still being retrieved and written to the file.

ls -la total.dump
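In hindsight, I could have checked the size up front before kicking off the dump. Postgres has a built-in for this; a quick one-liner along these lines reports the on-disk size of the database you’re connected to:

SELECT pg_size_pretty(pg_database_size(current_database()));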

While this was running, I could see some output from pg_dump warning me about circular foreign-key constraints on some tables. I’ve seen this before, so I wasn’t concerned.

Now that I had the data, I needed to push it up to the RDS instance. This, too, would be simple, I thought. We can use psql to ingest the dump file like so:

psql -h <host_ip> -p <port> -U <username> -d <db_name> < total.dump

Before I did that though, I wanted to make sure the new database was empty. The obvious thing to reach for was:

bin/rails db:drop db:create db:schema:load

But the application had active connections to the database and I didn’t want to compete with it. So I ended up using Postico to just truncate all the tables.
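For the record, the same truncation can be scripted from psql instead of clicking through a GUI. Here’s a rough sketch that builds a single TRUNCATE statement covering every table in the public schema (you run the generated statement yourself afterwards; CASCADE takes care of the FK dependencies and RESTART IDENTITY resets the sequences):

SELECT 'TRUNCATE TABLE '
       || string_agg(format('%I.%I', schemaname, tablename), ', ')
       || ' RESTART IDENTITY CASCADE;'
FROM   pg_tables
WHERE  schemaname = 'public';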

Then all I had to do was run the psql command! WCGW?

As psql tried to ingest the data, I started seeing errors crop up about Foreign Key constraints being violated. I wasn’t sure what this was going to mean until after psql had finished, though. Upon finishing, I discovered that the dump file loads each table’s data with a single COPY statement, which is what psql < filename ends up running. If anything in a COPY fails, none of that table’s rows are inserted, so it effectively skips that entire table’s worth of data. Every table whose foreign key constraints were violated while loading ended up empty, and unfortunately for me, that was most of the tables we needed to load.
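For what it’s worth, psql has a couple of flags that make this failure mode much harder to miss. Something along these lines would stop at the first error and wrap the whole restore in one transaction, so it’s all-or-nothing:

psql -v ON_ERROR_STOP=1 --single-transaction -h <host_ip> -p <port> -U <username> -d <db_name> < total.dump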

Another thing: the restore itself took quite a while to run. Was there any way I could speed it up?

My first inclination was to downsize the amount of data being dumped and restored. Do we really need 1.5GB worth of data? I ran some queries in our Heroku Postgres staging database and realized most of the data is sitting in our _versions tables, which we don’t need for this. I knew we could filter which tables get dumped using -t, so I looked at the pg_dump help and found that -T does just the opposite: it excludes matching tables from the dump. Perfect.
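The exclusion flag slots right into the original pg_dump command; roughly like this (the *_versions pattern and the slim.dump filename are just illustrative):

pg_dump --no-owner -h <host_ip> -U <username> --data-only -T '*_versions' <db_name> > slim.dump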

This time, the new dump file was just 500MB instead of 1.5GB!

Now, what to do about the FK constraints?

It turns out Postgres enforces FK constraints using triggers. You can disable all triggers on a table like so:

ALTER TABLE table_name DISABLE TRIGGER all;

Woo! Let’s try it:

ERROR:  permission denied: "RI_ConstraintTrigger_a_74070" is a system trigger

🙅 You can only disable system triggers (which are what enforce FK constraints) if you are a superuser.

psql> \du
---- results showed I was, in fact, not a superuser
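(The same check works as plain SQL, if you’d rather not use the meta-command:)

SELECT rolname, rolsuper FROM pg_roles WHERE rolname = current_user;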

Hmm, unfortunately the user we use is not a superuser. How do I make my user a superuser then? 🤔

psql> ALTER USER username WITH SUPERUSER;

🙅 Turns out, AWS does not let anyone elevate their permissions to superuser on RDS since they manage the instances themselves.

Out of desperation, I figured I could just manually drop and re-add all the foreign key constraints. That sounds like a pain, but it was a path forward:

SELECT conrelid::regclass AS table_name, 
       conname AS foreign_key, 
       pg_get_constraintdef(oid) 
FROM   pg_constraint 
WHERE  contype = 'f' 
AND    connamespace = 'public'::regnamespace   
ORDER  BY conrelid::regclass::text, contype DESC;

The above query will list all foreign key constraints on the public schema which is just what I needed. There were over 100 foreign key constraints 😫

So I dumped these into a spreadsheet and added some columns so that each row became a valid SQL statement to drop the FK constraint 😬
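If spreadsheets aren’t your thing, the same statements can be generated straight from the catalog. A sketch that builds both the DROP and the matching re-ADD for each foreign key in one pass:

SELECT format('ALTER TABLE %s DROP CONSTRAINT %I;',
              conrelid::regclass, conname) AS drop_stmt,
       format('ALTER TABLE %s ADD CONSTRAINT %I %s;',
              conrelid::regclass, conname, pg_get_constraintdef(oid)) AS add_stmt
FROM   pg_constraint
WHERE  contype = 'f'
AND    connamespace = 'public'::regnamespace;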

Then I pasted the spreadsheet’s output into Postico and voilà! All the FK constraints were dropped!

Now I could finally try having psql ingest the new and improved slim dump file and it worked!

Mostly…

Some tables still didn’t make it 🤔

Looking at the output, I saw an error about a column not existing. D’oh! The git branch we’re using for the POC had gone stale and didn’t have the latest schema migrations. A new column had been added to a table on staging, so psql complained that the column didn’t exist in the new database.

I edited the dump file to remove that column and its corresponding data and tried again.

Re-running the whole dump was safe because every table is expected to have a Primary Key constraint. For any table that had already been loaded, the very first row psql tried to ingest would blow up by violating the PK constraint, so that table’s COPY would fail and leave the existing data untouched. Only the still-empty tables would actually take new data.

Now I was done and could call it a day!

If you thought “wait a minute…”, then congratulations for making it this far and paying attention. I didn’t re-add the foreign key constraints!!

This was simple though: I just flipped what I did in the spreadsheet and made a list of all the FK constraints to add back. One very nice thing about the query above is the call to pg_get_constraintdef, a Postgres function that returns the definition of the constraint. That let me add the FK constraints back with ease!
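To give a sense of what pg_get_constraintdef hands back, each re-add statement ended up in roughly this shape (the table and column names here are made up):

ALTER TABLE orders ADD CONSTRAINT orders_user_id_fkey FOREIGN KEY (user_id) REFERENCES users(id);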

I added them back and we were able to run regressions with no problems!