Incremental snapshot backup under Linux

Making backups regularly is necessary. It is also annoying: repetitive and no fun. So let's automate it, that's what machines are good for. The goal is to have a script run regularly by cron which does everything necessary. But what is "everything necessary"? Of course, having a copy of all data at some safe place in case the unthinkable happens. But sometimes it also comes in handy to be able to go back in time: "Yesterday it worked, what is it that I changed since then?" What the thing should do is thus: keep a copy of everything on a safe remote machine, keep a series of snapshots so that earlier states can be recovered, and do all that without the storage growing with the number of snapshots.

What might look self-contradictory is actually pretty easy to achieve with the help of rsync, hard links and sshfs. The main idea I took from here.

Using hard links for backup rotation

A file in a Unix file system consists of two parts: the data itself and a link. The link is what a directory entry provides: a name that points to the file's inode, which in turn holds the access and permission information and the location of the data. There can be more than one link pointing to the same file data. When copying a file with the command

$ cp -l <original> <copy>
only the link is being copied, not the data itself. Typing now
$ rm <original>
removes only the corresponding link without removing the data; <copy> still points to it. The data carries a link counter and is deleted only when the link counter drops to zero. (Such hard links are not to be confused with so-called soft or symbolic links, which contain a path name pointing to another link and are used as a kind of indirection or aliasing.) This mechanism is what we are going to use in combination with the rsync tool. Assume that we have a full backup in a directory named backup.0. By copying this directory to a new one using
$ cp -al backup.0 backup.1
we create a tree of directories in backup.1 which contains hard links to the same data as backup.0, but the actual data is stored only once. The command line switch -a ("archive") preserves all permission and access time information. The next incoming backup is compared against backup.0; whatever has changed is deleted there and replaced with the changed version. Deleting (more precisely, unlinking) first is necessary because otherwise the data linked to would be overwritten in place, thereby also changing the content of backup.1 and all older backups, which is of course not what we want. By unlinking first, the link in backup.1 continues to point to the old data while the link in backup.0 points to the new file content. Backups older than backup.1 can simply be rotated using mv
rm -rf backup.<oldest>
mv backup.<oldest-1> backup.<oldest>
mv backup.<oldest-2> backup.<oldest-1>
    ...
mv backup.1 backup.2
cp -al backup.0 backup.1
The -rf flags on the rm command are needed because we want to remove the corresponding directory tree recursively and without being asked. The mv command performs a simple renaming of the link in question, which here is the name of the backup directory.
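
To see the link counting described above in action, here is a quick throwaway demo on the command line (the file names are arbitrary):

$ echo "hello" > original
$ cp -l original copy       # second hard link, no data copied
$ stat -c %h original       # link count is now 2
2
$ rm original               # removes one link, the data stays
$ cat copy
hello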

rsync

The main work of the backup system is done with rsync, a Unix tool for synchronising files and directory trees over the network. It compares original and backup, transfers only the changes and also keeps track of file permissions. Its man page explains all the options used here.
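
To give an idea of the core rsync call before the full scripts below, here is a minimal sketch; the angle-bracket names and the backup location are placeholders. With --link-dest, files that have not changed since the previous snapshot are hard-linked against it instead of being copied again:

$ rsync --archive --delete --link-dest=<backup location>/backup.1 \
    <directory to back up>/ <backup location>/backup.0/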

Mounting a remote file system via sshfs

Although rsync does a pretty good job, some of its useful options seem to be available only locally, i.e. on the host where it is called. In particular, preserving file access control seems to work properly only when rsync pulls data, not when it pushes; at least I couldn't make it work the other way round. The solution is to mount the remote backup tree in the local file system using sshfs, a file system implementation that uses ssh for the network traffic (like scp or sftp). This avoids the security issues that come with nfs. The price, however, is a writable mounted backup directory during the backup process, and of course the increased network traffic. An alternative would be to place the backup scripts on the remote machine and let rsync log into the machine to be backed up using public key authentication. Which is the better choice depends on policy: which machine is used how, and who has access to what. I use a Linux host on the net to back up the files on my desktop machine, to which only I personally have access. I therefore assume the private keys to be more secure on this machine than on a server somewhere out there.
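
In its simplest form, mounting and unmounting the remote tree looks as follows (the placeholders are the same as used throughout this text; the full invocation with identity file and id mapping options appears in the scripts further down):

$ sshfs <backup user>@<backup host>:<backup dir> <mount point>
  ... read from or write to the mounted tree ...
$ fusermount -u <mount point>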

ssh public key authentication

To log into the remote backup machine, or rather to get access to the backup tree residing there, we are going to use ssh public key authentication. To this end, a separate backup account has to be created on the remote backup machine. Password login for that user, whether over the network by ssh or on the console, is disabled to minimise security issues; access works only with the key.

Preparing the account

I assume here that both the local and the remote machine run Debian Linux. For other Linux distributions the administration may be somewhat different, but this text should give enough of an idea to figure out how to do it elsewhere. On the remote backup machine, use

$ adduser --home <backup directory> <backup user name>
to create the user account used for the backup procedure. To make sure it went through, try logging into the backup machine with the newly created user id. The backups could of course also be stored under an already existing user account, but keeping things separated is always a good idea. On systems which copy some generic skeleton files into the new home directory--to generously provide what these days is considered a workable environment--you may safely move all of that to the WOM (write-only memory, /dev/null in Unix flavour). Next, on the local machine run
$ ssh-keygen -t rsa -f <keyfile>
$ chmod 400 <keyfile>
$ ssh-copy-id -i <keyfile> <backup user>@<backup host>
to generate a private-public ssh key pair and copy the public key to the remote machine. <keyfile> should point to the file where the private key shall be stored; it will be created by ssh-keygen. With public key authentication, ssh does not prompt for a password, which would make running the backup scripts from cron impossible; instead it reads the private key from the key file and logs into the remote machine without any prompting. For this to work, a copy of the public key has to be placed on the remote machine, which is what the third line does. As a last step, password login into the backup account can be disabled by replacing the encrypted password in /etc/shadow on the remote machine with an asterisk. On my backup machine the corresponding line looks like
backup-user:*:16902:0:99999:7:::
The private key file should be made as inaccessible as possible. Anyone who can read the content of this file can log into the backup machine as the backup user and mess around as they like. A good choice is to make it readable only by the user id under which the backup scripts are going to run and inaccessible to anyone else (mode 400, as set by the chmod above).
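
Before wiring anything into cron it is worth checking that the key-based login works without any prompting; a quick test, using the placeholders from above, is

$ ssh -i <keyfile> <backup user>@<backup host> true && echo "key login works"

If this prints the message without asking for a password or passphrase, cron will be able to use the key as well.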

Setting up the scripts

The backup scripts reside entirely on the local machine. To keep things clean I have separated variables and scripts. The configuration file looks like this

RM=/bin/rm
MV=/bin/mv
MKDIR=/bin/mkdir
CP=/bin/cp
RSYNC=/usr/bin/rsync
SSHFS=/usr/bin/sshfs

LOCAL_DIR=<directory tree to be backed up>
BACKUP_HOST=<host on which the backups reside>
BACKUP_DIR=<directory in which the backups reside>
BACKUP_USER=<user id on remote machine>
SSH_PRIVATE_KEY_FILE=<key file on local machine>
MOUNT_POINT=<local directory where to mount remote tree>
ID_FILE_NAME=<name of id file>
EXCLUDE_FILE=<name of file containing exclude patterns>

The first lines define the binaries. I prefer to call them with absolute paths to eliminate problems with environment settings and the PATH variable; defining them as variables just saves some typing in the scripts later on. Most of the other variables should be self-explanatory. The id file is an empty file placed manually at the root of the remote backup tree; it serves as a check that the sshfs mount has actually worked. Writing backups to the unmounted local mount directory would i) fill up the local disk and ii) cause problems when later mounting the remote file system over a non-empty local directory. Note that the mount point must not reside inside the directory tree to be backed up. rsync can exclude files using patterns similar to those of tar. A collection of names and patterns may be given in a file, one name or pattern per line (the details are in 'man rsync' under the filter rules).
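
An exclude file might, for instance, look like this; the patterns below are merely examples of things one typically does not want in a backup, not a recommendation:

.cache/
*.tmp
*~
Trash/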

Hourly Backup

The main work is done by a script which cron runs on an hourly basis. It uses rsync to compare the most recent backup with the present state and copies what has changed. Before doing so, the backup rotation is carried out. A file containing a time stamp is written to the newly created backup; it has to be unlinked first to avoid propagating the new time stamp to the older backups. Everything after mounting the remote directory is placed in a while loop which checks at its top whether the remote directory is properly mounted and whether another backup rotation is going on; the loop is really just a workaround for the goto keyword that bash lacks and is left with break once the work is done or a check fails. The script files have permission 700 (read-write-execute for the owner, nothing for anybody else), the parameter file has 600 (not executable, just as a reminder that there's nothing in it to be executed on its own).

#!/bin/bash

. /<path to parameter file>/backup-params.sh

RSYNC_OPTIONS="--verbose --progress --stats --compress --recursive --times \
  --omit-dir-times --perms --links --numeric-ids --delete \
  --exclude-from=$EXCLUDE_FILE"

LOCAL_BACKUP_PATH=$MOUNT_POINT/hourly

if [ -d "$LOCAL_BACKUP_PATH/hourly.1" ]; then
  RSYNC_OPTIONS="$RSYNC_OPTIONS --link-dest=$LOCAL_BACKUP_PATH/hourly.1"
fi;

# mount backup directory on remote host
$SSHFS -o idmap=user -o IdentityFile=$SSH_PRIVATE_KEY_FILE \
  $BACKUP_USER@$BACKUP_HOST:$BACKUP_DIR $LOCAL_BACKUP_PATH

while true; do
  # check for proper mounting
  if [ ! -e $LOCAL_BACKUP_PATH/$ID_FILE_NAME ]; then
    echo "backup dir not mounted";
    break
  fi

  # no other backup going on?
  if find $MOUNT_POINT/daily -mindepth 1 -maxdepth 1 | read; then
    echo "backup daily dir busy";
    break
  fi

  if find $MOUNT_POINT/weekly -mindepth 1 -maxdepth 1 | read; then
    echo "backup weekly dir busy";
    break
  fi

  if find $MOUNT_POINT/monthly -mindepth 1 -maxdepth 1 | read; then
    echo "backup monthly dir busy";
    break
  fi

  if [ -d "$LOCAL_BACKUP_PATH/hourly.23" ]; then
    $RM -rf $LOCAL_BACKUP_PATH/hourly.23;
  fi;

  for i in `seq 22 -1 1`; do
    if [ -d "$LOCAL_BACKUP_PATH/hourly.$i" ]; then
      $MV $LOCAL_BACKUP_PATH/hourly.$i $LOCAL_BACKUP_PATH/hourly.$((i+1));
    fi;
  done

  if [ -d "$LOCAL_BACKUP_PATH/hourly.0" ]; then
    $CP -al $LOCAL_BACKUP_PATH/hourly.0 $LOCAL_BACKUP_PATH/hourly.1
    if [ -e "$LOCAL_BACKUP_PATH/hourly.0/timestamp" ]; then
      $RM $LOCAL_BACKUP_PATH/hourly.0/timestamp
    fi
  else
    $MKDIR $LOCAL_BACKUP_PATH/hourly.0
  fi

  date > $LOCAL_BACKUP_PATH/hourly.0/timestamp

  $RSYNC $RSYNC_OPTIONS $LOCAL_DIR $LOCAL_BACKUP_PATH/hourly.0

  # work done, leave the while/goto loop
  break
done

fusermount -u $LOCAL_BACKUP_PATH

Note that through the option 'idmap=user' the user id of the user running the script is translated into the user id of the backup user on the remote machine (and vice versa when reading the backup via sshfs for restoring). Other user ids are not translated by this option; that would have to be set up with translation files (see man sshfs). The scripts presented here preserve ownership and access rights for the owner and executor of the scripts without caring about other users, on the assumption that it is this person's own data that is being backed up.
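
If data of several users had to be backed up, the id translation could be done with such map files, roughly along these lines (whether the idmap=file, uidfile and gidfile options are available depends on the sshfs version; check 'man sshfs' for the exact option names):

$ sshfs -o idmap=file -o uidfile=<uid map file> -o gidfile=<gid map file> \
    <backup user>@<backup host>:<backup dir> <mount point>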

Daily Backup

The backup on a daily or less frequent basis doesn't need to call rsync. Hard link copies are enough. For each frequency, daily, weekly, monthly or whatever, I have set up a separate script which basically repeats the rotation part of the main script.

#!/bin/bash

. /<path to parameter file>/backup-params.sh

LOCAL_BACKUP_PATH=$MOUNT_POINT/daily

$SSHFS -o idmap=user -o IdentityFile=$SSH_PRIVATE_KEY_FILE \
       $BACKUP_USER@$BACKUP_HOST:$BACKUP_DIR $LOCAL_BACKUP_PATH

if [ -d "$LOCAL_BACKUP_PATH/daily.7" ]; then
  $RM -rf $LOCAL_BACKUP_PATH/daily.7;
fi;

for i in `seq 6 -1 0`; do
  if [ -d "$LOCAL_BACKUP_PATH/daily.$i" ]; then
    $MV $LOCAL_BACKUP_PATH/daily.$i $LOCAL_BACKUP_PATH/daily.$((i+1));
  fi;
done

if [ -d "$LOCAL_BACKUP_PATH/hourly.0" ]; then
  $CP -al $LOCAL_BACKUP_PATH/hourly.0 $LOCAL_BACKUP_PATH/daily.0;
fi;

fusermount -u $LOCAL_BACKUP_PATH
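
For completeness, a weekly variant might look like the sketch below. That it keeps five weekly snapshots and takes its copy from daily.0 are assumptions of mine; adjust both to taste, and the monthly script follows the same pattern.

#!/bin/bash

. /<path to parameter file>/backup-params.sh

LOCAL_BACKUP_PATH=$MOUNT_POINT/weekly

$SSHFS -o idmap=user -o IdentityFile=$SSH_PRIVATE_KEY_FILE \
       $BACKUP_USER@$BACKUP_HOST:$BACKUP_DIR $LOCAL_BACKUP_PATH

# drop the oldest weekly snapshot
if [ -d "$LOCAL_BACKUP_PATH/weekly.4" ]; then
  $RM -rf $LOCAL_BACKUP_PATH/weekly.4;
fi;

# shift the remaining weekly snapshots by one
for i in `seq 3 -1 0`; do
  if [ -d "$LOCAL_BACKUP_PATH/weekly.$i" ]; then
    $MV $LOCAL_BACKUP_PATH/weekly.$i $LOCAL_BACKUP_PATH/weekly.$((i+1));
  fi;
done

# a hard-link copy of the most recent daily snapshot becomes weekly.0
if [ -d "$LOCAL_BACKUP_PATH/daily.0" ]; then
  $CP -al $LOCAL_BACKUP_PATH/daily.0 $LOCAL_BACKUP_PATH/weekly.0;
fi;

fusermount -u $LOCAL_BACKUP_PATH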

Setting up cron

The remainder is pretty easy. By running

$ crontab -e
an editor is started with the crontab file of the user invoking the command. Mine contains the following

# m h  dom mon dow   command
2 6-22 * * * /<path to backup script>/backup-hourly > /dev/null 2>&1
2 0 * * * /<path to backup script>/backup-daily > /dev/null 2>&1
2 2 * * 1 /<path to backup script>/backup-weekly > /dev/null 2>&1
2 4 1 * * /<path to backup script>/backup-monthly > /dev/null 2>&1

The first non-comment line means: at minute 02 of every hour between 6am and 10pm, on every day of the month, every month and every day of the week, run the hourly backup script and throw its text output away, along with any error messages. The other scripts run during the night on a daily, weekly and monthly basis, respectively.

Without the redirection, the output of the script is sent as email to the user under whose id the script is running. That's convenient while setting things up, checking for errors and making sure everything works fine, but once that's done you don't want your backup system filling up your mail account with unnecessary stuff. If some kind of logging seems sensible, you may instead redirect the output into a file to which the user id running the backup script has write access; best is probably to append a time stamp to that file followed by the actual backup output.
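
A crontab entry along the following lines, for instance, appends a time stamp followed by the output of the daily script to a log file (the log file location is a placeholder):

2 0 * * * { date; /<path to backup script>/backup-daily; } >> <log file> 2>&1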

The bottom line

A full-scale rotating incremental backup with minimal disk space consumption in 47 lines of shell code for the main backup, 16 lines per extra rotation and one crontab line per rotation, and that's it. Pretty short, huh? Of course, all the convenient GUI stuff of solutions like Apple's Time Machine is simply absent here. Getting a file back from the backup machine can be done with sshfs as well, but you don't do that every day, right? So there's much less time to be saved by spending hours on making that step quick and convenient too. Backing up files of different ownership is not properly taken care of; however, when backing up to a local file system such as an external hard disk, preserving ownership works out of the box with the present scripts.

By the way, what to do when something has been lost? How does one get the data back? Type

sshfs -o idmap=user -o ro -o IdentityFile=<keyfile> <backup user>@<backup host>:<backup path> <mount point>
and the backup tree will be visible at the specified mount point (which should be an existing but otherwise empty directory on your local machine). By specifying the option -o ro it will be mounted in read-only mode to minimise the risk of damage.
fusermount -u <mount point>
unmounts the remote backup directory.
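
A lost file can then simply be copied back out of the snapshot of choice, for example (all paths are placeholders, and hourly.3 stands for whichever snapshot contains the wanted version):

cp <mount point>/hourly.3/<path to file> <where it should go>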

Now, have fun with it, but as usual, at your own risk. I cannot give any guarantee that what works in my environment does so too in that of anyone else. But feedback is of course always welcome!
