How and when should I actually migrate my database?
For a long time, across different projects throughout my career, I’ve seen database migrations happen during application startup. These migrations usually run as part of a post-deployment hook, just before the new deployment receives any production traffic. But… what’s happening in this short (or sometimes longer) timeframe after the migration is done and before the new app boots, while the old app is still active?
- What if you’re doing a slow rollout? Won’t the old instances fail because the database changed?
- What if your new code has an issue and you have to do an immediate rollback? Won’t the old version now also fail until the database is rolled back? Wouldn’t that result in a big disruption of service?
- What if your migration takes a long time? Will it take a long time until your new code is deployed as well?
I had so many questions, but the root one was: “do I have to worry about this at all?” The answer was yes, because I saw issues and incidents happen where those questions suddenly became reality.
And over time, I built up an understanding and found the answers to those questions. I also realized how this seemingly dangerous process can actually be pretty chill. So let’s do a quick dive into when (and how) to seamlessly migrate your database.
How a small change turns into a stressful experience
A popular example is how to rename a database column. Let’s take a look at what happens if you just rename a column in a single migration with a simple ALTER TABLE RENAME COLUMN SQL statement.
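As a minimal sketch, assuming Postgres and a hypothetical users table:

-- single-step rename: any running code that still uses the old name breaks immediately
ALTER TABLE users RENAME COLUMN username TO display_name;

Here’s how that plays out: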
- You change the schema locally, rename the column, and trigger a deployment to production.
- At some point during deployment, the migration runs.
- The migration will lock the whole table: no reads or writes to it for a short amount of time. In the case of a simple rename, this is a quick operation.
- The migration was successful, but the deployment is not fully done yet: the old app is still active and receives traffic. The old app will now start to throw errors because it’s trying to access a column that no longer exists. Right now, some of your users will see errors popping up.
- You sit there, your butt cheeks clenched, and you wait until hopefully all new instances have booted and receive 100% of the production traffic.
- All looks good, the new version is active. You feel relieved, and your butt cheeks go back into a relaxed state.
- 10 minutes later, you’re alerted to an incident: there’s a high error rate. Your change introduced a critical bug. Luckily, you are prepared for this, because you can roll back to the previous version within seconds with one simple click.
- But your old code uses the old column name, and the rollback did not revert your migrations. The rolled-back version still throws errors. Maybe it refuses to boot at all.
- Shit. What now? Are you prepared to run a script to revert the database migration? Do you push a new commit and wait for multiple minutes until it’s deployed?
Of course, it doesn’t always have to result in an incident. But there’s a window of time where users will see errors, and it’s definitely a riskier situation. So what can we do instead?
Be backwards compatible… multiple times
To prevent all those issues, we have to divide our migrations into multiple backwards compatible changes. Let’s continue with the example of renaming a column, and how we would divide this into multiple steps:
- Add a new column to the schema and run the migration. Don’t migrate the data yet; just update the schema. Your code will still reference the old column. This new column will be empty and should have no default value, so you’re not locking the table. Simply don’t touch the old column, and you won’t break anything. (There’s an SQL sketch of the schema and data steps after this list.)
- Migrate the data separately. Do it in batches to avoid high database load and long-held locks. Depending on your setup, this is likely not even a part of your automated deployment pipeline, but instead just a script that you execute manually. I’ll talk more about those manual tasks later in this article.
- Update and deploy your code to reference the new column. Depending on what data you’re migrating, you’ll likely want to read from and write to both columns.
- You should now be in a state where both columns are always in sync. Be sure to do a sanity check to ensure that’s the case.
- Now remove any references in your code to the old column and deploy it. Don’t delete the old column yet; just stop accessing it.
- After the new version is fully rolled out to production, the old column is not used anymore, and you can finally drop it from the schema. You can wait a few days before deleting it, to widen the window of backwards compatible changes.
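Here’s what the schema and data steps could look like in SQL. This is a sketch, assuming Postgres and the hypothetical users table from before; the dual reads and writes live in application code and aren’t shown:

-- add the new column, nullable and without a default,
-- so the table isn’t rewritten and is only locked briefly
ALTER TABLE users ADD COLUMN display_name text;

-- backfill in batches to keep load and lock times low;
-- run this repeatedly until it updates 0 rows
UPDATE users
SET display_name = username
WHERE id IN (
  SELECT id FROM users
  WHERE display_name IS NULL
  LIMIT 1000
);

-- sanity check that both columns are in sync (should return 0)
SELECT count(*) FROM users
WHERE display_name IS DISTINCT FROM username;

-- finally, once no deployed code references the old column, drop it
ALTER TABLE users DROP COLUMN username;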
Because we’re doing a bunch of backwards compatible changes, if we fuck up at any stage, we can try again and no one will notice. And we can also do instant rollbacks to a previous version in an emergency. The longer we take to do the last step, the more room we give ourselves to roll back.
Now that we know how and why we safely migrate our database, when and where should we perform these steps?
Treat it as an async process in multiple one-off tasks
You can do all of these steps in an async manner. And that’s what I think is the beauty of it. Those steps don’t have to happen as part of your automated deployment pipeline, and you don’t even have to stress about getting it done quickly during a single workday. You can take your time!
In this setup, you wouldn’t even need to migrate the schema during a deployment. It can totally happen during normal operations. And if you don’t want to block the deployment pipeline (and potentially any urgent fixes) with a long-running migration, it maybe shouldn’t happen as part of a regular deployment at all.
But it’s also good practice to not raw-dog SQL into your production database from your local machine.
Prepare your steps as separate scripts, and let your colleagues review them. This isn’t about code style or maintainability; in the end, they’re just one-off scripts. But your colleagues may catch typos, or spots where you’d accidentally cause high database load.
And then have a way to execute these scripts inside your production infrastructure: for example, a Kubernetes Job that can connect to your database, or any equivalent in your setup that can reach the database but isn’t the production service handling real traffic. That way, you don’t have to connect to your database directly from your local machine or from a deployment machine like GitHub Actions.
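If you’re on Kubernetes, such a one-off Job could look roughly like this. All names here are made up, and I’m assuming the database credentials live in a Secret:

apiVersion: batch/v1
kind: Job
metadata:
  name: backfill-display-name
spec:
  backoffLimit: 0 # don’t auto-retry a failed migration
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: migrate
          image: registry.example.com/myapp-migrate:latest
          command: ["node", "scripts/backfill-display-name.js"]
          env:
            - name: DATABASE_URL
              valueFrom:
                secretKeyRef:
                  name: myapp-db
                  key: url

You apply it with kubectl, watch the logs, and delete the Job once it’s done.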
Compiled apps, CLI dependencies, and migrations
I’m working a lot with dockerized TypeScript apps. The final docker image only contains bundled code, and dependencies that are referenced in the bundled code. But migrations aren’t part of the bundle, because they aren’t part of my running app. And a CLI to run migrations, like drizzle-kit, also isn’t part of the final bundle!
Knowing that I want to run migrations async in my production environment, I was wondering for some time how the actual fuck I’m supposed to run them within my production docker image, because it contains neither my migrations nor the drizzle CLI. I can’t just start that image in a one-off job and execute my tasks. So, should I install drizzle-kit in my production image? Copy all migrations? But that would increase the size of my docker image, and I want to keep it small.
The solution to this is quite simple: just have a separate image. You could, for example, build and run a small Dockerfile as a one-off job:
FROM node:22-slim
WORKDIR /app
# pnpm isn’t in the base image; corepack ships with Node and provides it
RUN corepack enable
COPY package.json pnpm-lock.yaml ./
RUN pnpm install --frozen-lockfile
# drizzle-kit reads the DB credentials and the migrations folder from its config
COPY drizzle.config.ts ./
COPY ./migrations ./migrations
CMD ["npx", "drizzle-kit", "migrate"]
That’s what I’m doing, with a small tweak: I’m inlining this Dockerfile inside a docker-compose file, as a separate service called “migrate”. I can just start that service on production, it builds itself, runs the migrations, and then exits.
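As a sketch, using Compose’s dockerfile_inline and assuming the database URL comes from the environment:

services:
  migrate:
    build:
      context: .
      dockerfile_inline: |
        FROM node:22-slim
        WORKDIR /app
        RUN corepack enable
        COPY package.json pnpm-lock.yaml ./
        RUN pnpm install --frozen-lockfile
        COPY drizzle.config.ts ./
        COPY ./migrations ./migrations
        CMD ["npx", "drizzle-kit", "migrate"]
    environment:
      DATABASE_URL: ${DATABASE_URL}

A docker compose run --build --rm migrate then builds the image, runs the migrations, and cleans up after itself.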
Or you could even not care about migration files at all, use the schema in your code as the source of truth, and use something like drizzle-kit push:
FROM node:22-slim
WORKDIR /app
RUN corepack enable
COPY package.json pnpm-lock.yaml ./
RUN pnpm install --frozen-lockfile
# copy the whole project so drizzle-kit can read the schema as the source of truth
COPY . .
# --force auto-accepts all statements, including potentially destructive ones
CMD ["npx", "drizzle-kit", "push", "--force"]
Of course: the trade-offs!
As with everything, there are trade-offs, and you don’t need to follow this strict approach all the time. Some projects simply don’t have those requirements, and in some projects a few seconds of downtime may be perfectly okay.
Do you need to do all of this in your side project? Probably not.
Would it be okay if your production docker image were a few MB larger and you executed the migrations, including data migrations, during app boot? It could be.
But will you still learn something from practicing it? Absolutely.