Content Based Backup Perl Script

So for a couple of months now I've been messing around with and tuning a content based backup script in Perl. I think it's ready for a wider audience so here we go. To cut the chase, see the downloads section near the bottom of the page for the latest a version of the scripts and other files. Continue reading if you want the methodology and other details.

Latest software release date is: Tue Sep 20 18:50:31 EDT 2011

Background

So after losing a whole project due to a failed hard disk a number of years back, I've been very aggressive of automated backup systems. All of my systems have mirrored hard drives in them these days and my critical systems get backed up nightly with offsite syncing monthly.

The mirrored drives protect me against a disk failure but they don't save me from a water pipe bursting over my server hence the offsite backup. They also don't save me from an errant file remove or fsck "fixing" a filesystem by removing a corrupt directory entry. When going back looking for some school source that I hadn't thought about in years, I discovered that a whole tree bad been missing from my source repository for months that I was luckily able to recover from an old archive.

These challenges detail the necessity of a couple types of backups and archives:

Short term backups for fat finger moves or system crashes.
Longer term archives for errant removes of non critical files that aren't missed for months (maybe years).
Mirroring critical data to survive a drive fault.
Offsite backup to survive a disaster (fire, water, asteroid).

Backup Schedule

My old solution has been to do incremental tar files nightly with full backups weekly. I then sync the backups with offsite storage which I try to do semi-monthly. This has worked well for me and saved my ass a number of times. Whether it is recovering from a problem that just happened or finding older revs of html files.

To give the most flexibility to keep historical archives and short-term backups, I wanted to keep daily incremental backups for 2 weeks while full backups are kept each week for a month, each month for a quarter, each quarter for a year, each six months for two years, and each year perpetually.

Problems With Old System

Even though I was aggressive at weeding out older incrementals and spacing out the full backups I was keeping, I was still finding that my backup drive (now 250gb) was having a harder and harder time storing enough full backups to make me happy -- especially since each full backup was pushing 20gb.

I was also worried that I had backed up files that I now wanted to exclude because of sensitive information. I am on a couple chat groups where some buddies and I solve the worlds problems daily which often include language that the Powers-That-Be might consider revolutionary (gasp). Nothing crazy but I didn't want to be the one who got others in trouble. So there are certain logs and records that I don't backup these days but didn't want to take the file to unpack, remove, and re-tar from 200gb of backup files every time I found something sensitive in a backup.

But the most frustrating part about my old system was that I was backing up files multiple times. Especially with large log files which I try to roll up by year, I estimated that a large percentage of files were backed up multiple times -- some files maybe 10 times. Turns out that after converting to my new system that over half of my long term backup storage was of duplicate files.

Iron Mountain

So I started working recently for Iron Mountain. Our digital division has been working on PC and server remote backups for some time now. They have excellent services and provide remote, fully distributed, mirrored computer backup solutions for small and large organizations. Very impressive technology.

Iron Mountain's systems organize the backups by individual file since (for example) 90+% of the backups of all Windows XP systems are common files. The concept of per-file backups has been done before but it was the first time I had been exposed to the concept and it started me thinking of how to improve my current application. Yes, I know, I am re-inventing the wheel with this and others have done this before. But backups are important and I had particular requirements that I didn't see solved anywhere else.

Content Based Backup System

So I reorganized my backup scripts to treat each file separately instead of a tar-ball. Any file can be a part of any number of backups. The files are stored compressed in a directory hierarchy by their digest signature (default is SHA256) and organized using a SQL database. There is one database row per file per backup so if a file exists in 10 different backups, it has 10 different rows in the file table.

The Perl backup script walks your filesystem using the File::Find package and backs up each file into the content directory. If a previous version of the same file exists then all it needs to do is add a line for it in the file database table. Other Perl dependencies are:

DBI
DBD::Pg (or DBD::mysql)
Digest::SHA2
PerlIO::gzip
Unix::Mknod

Please note that the scripts depend on Digest::SHA2 not Digest::SHA256 which is broken. The scripts also depend on these more standard Perl packages:

Digest::base
Fcntl
File::Copy
File::Find
File::stat
Getopt::Long
IO::Handle
POSIX

Content Directory Hierarchy

All of the standard files are compressed into a directory hierarchy. The directory is specified to all of the scripts with the -c flag. When the file is backed up, its digest is first calculated and looked up in the database to see if it exists already. If it does then an entry is written into the files database table -- nothing is written to the content directory.

If a file doesn't exist then it is compressed into the content directory. To avoid race conditions it is read again, a new digest is calculated, and it is compressed into a file. Once this is done, the compressed file is moved into the directory hierarchy based on the new digest. If the file's digest is:

6e422c7c93303f213fcf5597668c443b56599977c75a03e4b957ec4db58dde86

then it will live in the content path:

6e/42/2c7c93303f213fcf5597668c443b56599977c75a03e4b957ec4db58dde86.gz

I chose 2 levels of directories since with 475,000 files, I see a maximum of ~20 files in the bottom directories (6-7 average). More levels would have been overkill with most being unpopulated and less levels would have resulted in 1000s of files in the lowest directory which would have been less optimal for most filesystems.

NOTE: Only standard files are stored in the content hierarchy. Special files which include directories, devices, zero-length files, and symbolic links are recorded in the files database table.

Database

I use PostgreSQL database for my backup system but my SQL and scripts should work fine with MySQL after a little tweaking (feel free to kick me back a patch). For my backup database, I run a Postgres instance on the backup drive on a different port under my UID. This allows me to run my main Postgres database servicing all of my webaps on my main drive and my database for the backup files on a the backup drive without getting them confused. There are 2 tables in my backup database:

"backups" Table

The backups table records each backup instance. It assigns the backup a unique ID with the help of a sequence variable "backup_id_seq". This ID is used in the "files" table to assign a file to a specific backup.

The backups table records when the backup was created, it's ID, what machine name, what collection of files is being backed up (/usr, /, ...), whether it is a full or incremental backup (according to the backup.pl '-i' flag), and some statistics about numbers of new files, directories, duplicate files, and compressed sizes. It also has a comment field in case you need to better describe a backup.

"files" Table

The files table contains an entry for each file stored in each backup. If a file (with the same exact content digest) has been backed up 20 times over the years, it will have one entry in the content hierarchy but 20 entries in the files table.

The files table records the type of file (see below), full path, backup ID, linkpath (if a symlink), size, modes, modification times, node information (if a device), owner, group, and content digest (for normal files only). There are 5 different types of files:

normal file backed up into the content directory
standard directory entry
symbolic link recorded using the linkpath field in the table
device node recorded with major and minor information
zero length file which is not backed up into the content directory

Backing Up

More details about the backup.pl script can be found at the top of the script file. The normal usage of the backup script is something like:

backup.pl -c /backup/host/CONTENT /

This will backup the / filesystem. You will probably want to run it as root so that it can read all of the files. It will copy the files from these filesystems into the /backup/host/CONTENT directory organized by digest signature and it will add to the backups SQL database. So if your / contained:

/foo         (file containing 'hello there')
/bar -> baz  (bar is a symbolic link to baz)
/baz/        (directory)
/baz/foo2    (file containing 'wow, lookie')
/baz/foo3    (0 length file)

The /backup/host/CONTENT directory would contain 2 entries for the files 'foo' and 'foo2':

12/99/8c017066eb0d2a70b94e6ed3192985855ce390f321bbdb832022888bd251.gz
6e/42/2c7c93303f213fcf5597668c443b56599977c75a03e4b957ec4db58dde86.gz

The 'files' database table would contain 5 entries for the various files in /.

By using the '-i' flag to the backup.pl script, it will only backup files that have a modification time (via stat) newer than the last full backup time. This "incremental" backup should actually take the same space in the content hierarchy as "full" backups except it runs a lot faster because it doesn't have to calculate digests on every file. Also, incremental backups only write entries into the files table for new files. For this reason I do incremental backups nightly and full backups weekly.

NOTE: by default the backup script will not cross device boundaries unless the '-x' flag is specified. This allows you to backup /usr without backing up /usr/var for example.

Restoring

The normal usage of the restore.pl script is something like:

restore.pl -c /backup/host/CONTENT %var/log/syslog%

This will check the files table in the backup database and files all files which match the SQL like pattern '%var/log/syslog%' and will restore them to the current directory. If it matches the following files and directories:

/etc/syslogd.txt
/usr/var/log/syslog.txt
/usr/var/log/syslog.txt~
/usr/local/var/log/syslog/auth.txt

Then it will create the etc/, usr/var/log/, and usr/local/var/log/syslog/ directories in the current directory, and will restore the files into the directories. It will try to recover the owner group and proper modes of the files if the caller has the permissions to do so. You will probably want to run restore as root.

Verifying the Backups/Database

There are two scripts which are used to verify the state of the backup system. The check_content.pl script walks the content directory hierarchy and verifies that all files have a valid entry in the database. It can also help reclaim disk space as detailed below. By using the '-s' flag, the script will also uncompress and verify all file signatures which takes 2-10 times longer so should probably be done on a monthly basis.

The check_db.pl script is used to verify the backup database. It walks all of the entries in the files table and verifies that they are sane and that they refer to a file in the content directory if applicable.

Cleaning Up

So every once and a while or automatically based on scripts, you will need to go through your list of backups and pair them down. I remove all incremental backups older than two weeks with the following SQL:

delete from backups where not "full" and created < NOW() - INTERVAL '2 weeks';

After you remove entry(s) from the backups table, you will need to go through the files tables and remove all of the corresponding entries:

delete from files where backup not in (select id from backups);

This will, however, not free up any space in your content directory. Aside from testing the content directory the check_content.pl script also notices if a file is not in the files table and will either move it to an orphaned directory for examination (-m) or unlink it (-U). Notice that if a file entry is removed from the files table other backups may still refer to it so you may not reclaim any space.

Importing Current Backups

The easiest way to import current backups is to restore a backup into a directory and then back it up using the backup.pl script:

# mkdir oldbackup
# cd oldbackup
# tar -xzf ../usr_20010423.tar.gz
# backup -I /usr -c /backup/host/CONTENT  -w 04/23/2001 .

This takes a backup of the /usr filesystem and extracts it into a temporary directory. The backup script then backs up the current '.' directory while specifying '/usr' as the initial (-I) part of the path. If your backups are of /usr/* then the '-I' is not necessary. The '-w 04/23/2001' in the above line sets the date of the backup and overrides the default which is to take the current date/time.

Real-Life Statistics

So to compare, I am running *BSD on a single processor, single core, AMD Athlon XP 1.1ghz. I have a mirrored pair of 80gb drives that I am backing up to a 250gb drive -- all 7200rpm, SATA/150 drives.
So before I converted to this system my tar files backups were 177gb and I had to hold off doing a full 20gb backup which would have filled it up. With the new system, my backup's reside in ~70gb. This is an overall savings of 60% with more backups. The overall storage in the content directory hierarchy is 52gb with the rest being filesystem overhead and databases.
New full backups of what I've deamed as "critical" files take 1.2gb each to take a snapshot of 19gb of 62,000 files (93% compression). A full backup of my 30+gb Unix /usr partition just took 5 hours. It backed up 260,000 files of which only 3,400 were new content. The 3,400 new files added 3.7gb to the content directory which compressed down to 2.5g.
I just cleaned out 13 older incremental backups which corresponded to ~52,000 file entries. After running the check_content.pl script, this removed 1,700 files and 8.2gb of space from the content directory. This really helps when your incremental backups contain a lot of logs, database files, and other frequently updated files.
The backup database takes up ~1gb in size and I dump it often to 130mb compressed files. My files database table has ~2 million rows which correspond to ~475,000 files in my content directory hierarchy. Each file is therefore backed up on average more than 4 times.
The check_db.pl script takes 2.5 hours to check all 2.2 million rows in the file database. It makes sure that all of the permissions are good, that the corresponding file (if any) exists, and other checks.
The check_content.pl script takes about an hour to walk through all 475,000 files and verify their entries in the database. It takes significantly longer (maybe longer than 24 hours) to open and verify the digests on all 50gb of file contents, so I do it monthly.

Downloads

backup256.tgz -- Tar file of all of the below files. Latest software release date is: Tue Sep 20 18:50:31 EDT 2011
backup.pl -- The backup script itself.
check_content.pl -- Script to check the content hierarchy and look for orphaned files or errors.
check_db.pl -- Script to check the database and look for errors.
restore.pl -- Restore script to restore files from the backup.
sql_schema.txt -- Postgres SQL statements to create the database. Should work on MySQL with a few tweaks.
sql_useful.txt -- Some useful SQL statements that I use on occasion.

Free Spam Protection Android ORM Simple Java Zip JMX using HTTP Great Eggnog Recipe Eero Model Comparison