Version control for source code is used by most development teams today. It gives you a history of all the changes you made, you can use it to share your code with others and you can work on an experimental feature without polluting the stable product. These are just some of the reasons why version control software is considered indispensable today. But where does our database fit in all this? How can we share the changes we want to make to the database with our colleagues, and eventually with our production environment? Database version control is something that most developers have to deal with regularly, yet only a few have actually thought about what solution might be best for them. Most people have a solution that sort of works for them, but when you ask them about the subject they are pretty convinced that there must be some better way to manage database changes, they're just not entirely sure what that solution is – but the silver bullet must be out there somewhere, right?
This article will give you a bit more insight into the theory behind database version control, and gives you several suggestions on how you could deal with the problem. We will look at simple solutions and also take a look at some of the tools available to help.
Writing your own script
A common starting point is to have a custom script to apply database patches (sometimes also called "deltas" or "migrations") to your database. This approach is widely used and works quite well for a lot of people. Most of the available tools are enhanced versions of this approach and as a result they are subject to the same limitations, as we will see later! The idea behind this approach is simple: we store patch files in a directory in our project. These patch files are checked into version control alongside our source code.
Each of these patch files contains some SQL detailing a database change. So, a patch file might look something like this:
Using this approach, whenever a developer makes a change to the database schema they must add the SQL patch file to this directory describing the change. The patch file is then checked in to version control together with the source code. When anyone updates a copy of the code, they will receive patch files as well as source code changes, and these patches need to be applied to the database to bring it to the version expected by the code.
In order to automate this using a script, our script basically needs to do two things
- Run database patches on the database
- Remember which one we executed last, so next time we know which ones to execute
To keep track of which patch file we executed last, we could simply update a value in our database, first checking the current patch level and then running any later scripts. The code steps (this is psuedo code) would look something like this:
This approach works well and is much better than making manual or undocumented database changes. We could create a simple script that updates our project from source control and then runs any patches automatically, and use this to update our projects on both development and production platforms.
As simple as this solution might look, similar solutions work quite well for many applications and these make a great starting point for implementing database version control in your own projects.
Of course you hope to never get into this situation, but sometimes you might want roll back a production update. This can for example happen when the application was insufficiently tested, or when the application behaves or performs differently on the production machines than expected. Rolling back your source code is usually quite simple: you can just change the symlink to the directory containing the previous version, or use your version control software to update the code base to an earlier revision. For databases though, reverting is a bit more complicated.
One possible approach is to maintain undo files. This would be a file that does exactly the opposite of what your patch file did, and can be used to revert an update whenever something went wrong. For example, to roll back the example patch file shown above, our script would look something like this:
Bear in mind though that some database operations are not reversible and in those situations this approach is less useful. Imagine a patch file that drops a column from a table and an undo file that creates the column again. Yes, your database schema would now be the same again, but the contents of the column are gone.
One of the main things I found out about database version control is that different projects require different approaches, and while I do not make use of undo files myself, other people rely on them and use them very frequently. Consider the options; does your project require rollback scripts? If you think so, include these in your process. They might one day save your life (or at least your deployment).
Write your patch first
No matter what solution you use, it is always a good idea to first create your patch and then run it on your own database using your usual script. This approach ensures that the script actually works, and the version number in your database will match the executed patches.
Typing your own patches instead of copy/pasting them from the phpMyAdmin output might be a bit more cumbersome, but it does encourage you to think about your schema changes thoroughly before actually applying them.
I found it useful to have the first patch (patch-001.sql, in the example above) to create the initial schema for the database – created at the point where database patching was implemented. If you have an existing project and you want to start using database patches, then patch-001.sql is simply a dump of the database schema as it is at that moment. Now whenever the schema needs changing, add a patch. This way you can always rebuild your database structure quickly and easily dropping the existing one and then, using your script to run all your patches.
Another advantage of this approach is that the database patches directory in your version control repository will always describe the correct database structure. Since code and database structure are both under source control, we always know which database changes go with which code version.
A great addition to this approach is to maintain an SQL file that inserts dummy contents into your database. Insert some users, news articles, whatever data your application uses. This makes it even easier to just drop your database and replace it with a new one with data known to be sane - especially useful when you end up with lots of test data from debugging a particular issue!
Where is my silver bullet?
You might think that there would be a better solution, to compare databases and synchronise any differences between them, for example. However, if you imagine for a moment that you want to create a tool to do just that. It all works fine when adding or removing stuff, but it gets tricky when renaming tables or columns. The tool would not be able to tell whether a column was renamed, or a column was dropped and another column was added (warning: the existing tools that do compare databases always go for the "drop and add" solution if you renamed something). Since such a tool wouldn't be able to tell the difference it would have to guess, and therefore may not work correctly in all situations.
So database version control is not as simple as making two schemas equal to each other; it is about remembering all changes you did in development and then repeating exactly those steps at all other locations. Exactly how you document those steps is not really important. So far we have used SQL, but as we will see later on, other tools use languages like XML or even PHP. No matter what language you use, you will always have to document all the steps you did, in order to repeat exactly those steps somewhere else. Documenting those steps is the part that developers consider to be cumbersome and painful, but it is absolutely required if you want to do database version control in a reliable way.
Until now we have been naming our patch files as patch-001.sql, patch-002.sql and so on. This works fine while development is on one branch and everyone updates and commits frequently. However, with multiple branches you might have to reconsider this approach.
Imagine a branch containing patch-003 and patch-004. Meanwhile, development continued on trunk and that also contains patch-003 and patch-004. When we want to merge the branch back to trunk we have a problem since the files will collide. There are several solutions for this, commonly we either keep separate sequences for the separate branches until we merge them again, or we name patch files uniquely (or in a manner likely to be unique), even when we merge branches.
In the first case, we would put the patch files for our branch into a separate directory, or prefix our patches, naming them 'trunk-001.sql' and 'branchA-001.sql'. By changing our script so it can deal with two sequences, we can merge the branch to trunk or back without any conflicts, and run patches from both branches. This works well during development, but in the end the same problem exists: a lot of renaming and reordering of the patch files is needed. Basically this "solution" is a way of postponing the problem, not solving it.
The second solution entails giving your patch files unique names to avoid a conflict. For example, use the current time in your filename (e.g. patch-201012202028.sql), or include the current SVN revision in your filename. When you use this approach, be aware that it is not enough to simply record the last executed patch number. After all, when branches are merged, you will be adding patches "from the past" into the sequence. Our script would need to remember a list of all executed patches, rather than just one number. This way, merging back and forward and finally joining all patches together wouldn't be a problem.
These solutions reduce (or even eliminate) the chance on a conflict on file system level, but on database level there still could be some serious problems. Imagine somebody renaming a column in the branch, and somebody else changing the properties of the column in the trunk. Now all of a sudden, the order in which the patches are executed is important.
When it comes to branch merging, human intervention is often needed. When you merge branches, look at every single one of your patches and see where they fit in the target sequence. In what order should they be executed? The pain of merging can be reduced by using some of these tricks, but there may always problems you should be aware of.
There are a lot of tools out there that can help you with database version control, and using existing software instead of writing your own script can save a lot of time. In this chapter I will mention a few tools; rather than giving lots of detail on how to use them, I will instead show some of the differences, the pitfalls, the advantages and the downsides of the various solutions. Links are included with each section so you can read more about any of the tools which interest you.
Phing + DBDeploy
Phing is a PHP-based build system, similar to Apache Ant. This means you can use it to automate a number of tasks; for example you could write a Phing script that runs 'svn up', then changes the permissions on certain folders and finally executes your database patches. You could use Phing to call your own script, but you could also use DBDeploy, a database version control tool that comes with Phing when you download it.
When you use DBDeploy you provide it with the connection parameters (host/database/username/password etc.) and point it to the folder containing your patch files. Your patch files are SQL files and they are named using a number, followed by a description (for example: 33-create-user-table.sql). When you now run your Phing task, DB deploy will check for any new patch files and will generate an SQL file containing all the SQL statements to update the database and to update DB Deploy's own meta data table. You can then have Phing execute the generated SQL file by calling your commandline SQL client, using the generated SQL file as input.
The tool is really simple and provides basically the functionality of the simple script shown earlier in this article. Getting started with Phing and DBDeploy is pretty easy, and since it's written in PHP you probably don't have to install any other software to run it.
If you would like a ready-made solution, and you want to have something not too complicated that works, you should definitely check out these tools. To get started I recommend you take a look at this blog post, by Dave Marshall, where he explains how to get started using Phing and DBDeploy.
A more advanced tool which is also available is LiquiBase, which comes with more options than DBDeploy. You can, for example, compare two databases and have a patch file generated to make the one equal to the other (but, as mentioned before, beware when you have renamed any columns), or generate a patch file from an existing database schema. The patches are written in XML, meaning the LiquiBase knows the context of your patches and can actually figure out the undo-version of the patch file by itself. This tool is really nifty, and having all patches in one big XML file means that merging patches is similar to merging source code. You might still encounter some problems, but Liquibase does everything possible to keep the pain to an absolute minimum.
Liquibase is the most complete tool I have seen so far for database patching, however there are some downsides. The patches are written in XML, so expect to be learning a new language to create patches (although it does support SQL snippets in the XML which could be useful). The biggest downside for PHP developers is that it is written in Java, so we need to learn how to work with a new platform to take advantage of this tool.
If you will be dealing with lots of branches and database migrations and you are willing to invest some time into getting familiar with the tool, in exchange for one of the most complete solutions out there, LiquiBase is something you should certainly take a look at. The documentation on the LiquiBase website is very good and extensive, and should contain all the information you need to get going. They also have some good training videos for those who prefer to learn that way.
Depending on the DBMS used you will probably have to install a separate JDBC connector and tell LiquiBase where this connector can be found. Jérôme Renard wrote a good blog post on how to do this using MacPorts, but I'm sure the yum or apt-get version of this howto is quite similar.
Doctrine is an ORM (Object-Relational Mapping) tool, and it comes with a migrations script built in. Patches are PHP classes with an up() and down() method, describing how to apply and roll back the patch respectively. Most of doctrine migration's functionality is quite similar to the tools above; you can run patches, roll them back, and so on.
The unique feature of this tool is that it lets you generate patches by comparing a YAML file, describing your database schema, with the actual database. This means that whenever you need to add a column, you can simply edit the file, then run the included script generate-migrations-diff, and your patches will be generated automatically. We are still comparing schemas however, so be aware of the pitfalls when renaming columns as described earlier.
A Doctrine Migrations YAML file describing your database can look something like this: (example taken from the Doctrine project)
Maintaining a separate YAML file (and pulling the entire Doctrine project in) is quite cumbersome if you don't use the rest of Doctrine's functionality, however it is a really complete tool with a lot of features so I highly recommend it for existing Doctrine users.
Forging the silver bullet
Whether you write your own script, or use an existing tool, managing database versioning is a hard problem. Executing patches is not the real problem; this is more or less taken care of. However, the writing of the patches is still a time consuming task, requiring us to write, generate or merge patches ourselves. Last year I spoke to a lot of people at conferences about this subject, and I heard about some fantastic tools that are being built. New versions of the tools above, which solve many current problems, new approaches that were still too secret to tell - and even a tool monitoring the MySQL binary log on the development machine to detect schema changes and generate patches automatically. Hopefully we will see more and improved tools coming out in the future, but for now all we can do is build the best bullet we can - even if it's not silver!
How are you handling database versioning in your project? Using one of the tools above? If you know a really awesome tool that should have been in this list, please add a comment and explain why it should not be forgotten!
- Video of my talk "Database Version Control Without Pain" at PHPNW10
- Corresponding slides of my talk "Database Version Control Without Pain" at PHPNW10
- DB Deploy
- Blogpost introducing the Akrabat DB Schema Manager: a database version control solution for Zend Framework
- Keep our PostgreSQL databases syncronized with PHP, a blogpost by Gonzalo Ayuso about a tool he made to keep multiple PostgreSQL databases synchronized