BACKUPS

When someone asks "how safe is my data?", there are at least two different aspects to what "safe" means: data loss/corruption and data access security. The first issue is discussed here. For security issues, see the Security web page.

Reasons for Loss/Corruption

Data loss/corruption can happen for many reasons, which fall into three categories: user, system, and environment.

The main methods of user-side loss/corruption are:

  1. Accidental file deletion
  2. Accidental file overwrite
  3. Bugs in analysis software

Some major examples of system-side loss/corruption are:

  1. Hard disk failure
  2. Hard disk/RAID controller failures
  3. Operating system (kernel) bugs
  4. Network failures

Some examples of environment-side loss/corruption are:

  1. Fire
  2. Flood
  3. Electric surges

These categories are somewhat interrelated. When a user puts their desktop in a badly ventilated area so that the drives overheat or, even worse, the dust bunnies set the whole system on fire, the failure looks like a system or environment problem but is really a user-category problem.

NOTE ON RAID: RAID only protects against system-side hard disk failure, and only when just one disk has failed. It does not protect against multi-disk failures or any of the other failure modes listed.

RAID IS NOT A REPLACEMENT FOR BACKUP!

Different backup schemes protect against different failure modes to different degrees, as discussed below.

Backup Policy and Methods

There are two classes of data storage at the Martinos Center: central UNIX RAID storage and everything else. The latter is mostly local disk on user desktops, whether plain disk or RAID.

The central RAID volumes are backed up to tape roughly every week (usually every 6-8 days, depending on the volume of data). You can see the list of these volumes at:

http://surfer.nmr.mgh.harvard.edu/cgi-bin/backup/index.html

Your UNIX home directory is on central RAID. Your Windows/Mac files are on your local desktop, except in the rare case where UNIX volumes are mapped on your Windows/Mac system (users who do this should know that they are doing so). There is no central backup of Windows and Mac local disks.

For all desktop volumes, users are responsible for doing backups themselves. One option is to ask us to set up a weekly mirror job. For this, the user needs to supply a volume at least as large as the volume they want backed up. Preferably this volume is on another desktop; at the very least it needs to be on a different, independent disk or RAID in the same machine to protect against disk/RAID failure.

A database of these mirror jobs is set up in the flat file at

/space/sake/5/admin/notes/DesktopBackup.DB

You can 'more' this file to look for your volumes to check if they are being backed up in this way.
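
If paging through the file is tedious, a case-insensitive text search does the same job. The sketch below assumes only that the file is line-oriented plain text; its actual layout is not described here, and `find_entries` and `show_entries` are hypothetical helpers, not part of any existing tooling.

```python
from pathlib import Path

# Path quoted in the text above. The file's internal layout is not documented
# here, so this sketch treats it as plain text with one entry per line (assumption).
DB_PATH = Path("/space/sake/5/admin/notes/DesktopBackup.DB")


def find_entries(lines, needle):
    """Return (line_number, text) pairs for lines that mention needle, ignoring case."""
    wanted = needle.lower()
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if wanted in line.lower()
    ]


def show_entries(needle, db_path=DB_PATH):
    """Print every line of the mirror-job file that mentions needle."""
    with db_path.open() as handle:
        for number, text in find_entries(handle, needle):
            print(f"{number}: {text}")
```

Calling `show_entries("quick")` would list the lines that mention the machine "quick". If nothing is printed, no mirror job is recorded under that name and it is worth asking the IT group.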

Note that when a power outage takes down desktops, the backups will not happen. Normally the backups will resume and catch up the following week, when both machines are back up. For failures of this type, the IT team does not force a by-hand backup; it is the user's responsibility to request a special makeup backup for that week.

Therefore users should check once a week that their desktop backups look "right" and inform the IT group if they do not. In some situations no error message is generated, for instance if both machines are down at the backup time.

Our standard mirror backup scripts log time of completion in:

/space/backup/rsyncs/

So, for instance, to see when a backup script last ran successfully on the machine "quick", type:

cat /space/backup/rsyncs/quick

and look at the last line. Note that this will not tell you if the backup script encountered an error in the middle and only backed up some of your data; you will only see a listing if all of the backups for that machine finished successfully. It also will not tell you which volumes on that machine are backed up, if some are not.
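
The weekly check can be partly automated. The sketch below prints the last log line for a machine and warns if the log looks stale. It rests on two assumptions that are not confirmed by this page: the 8-day threshold (one weekly cycle plus a day of slack), and the use of the log file's modification time as a stand-in for the logged completion time, since the format of the logged line is not given here. All function names are hypothetical.

```python
from datetime import datetime, timedelta
from pathlib import Path

LOG_DIR = Path("/space/backup/rsyncs")  # directory named in the text above
MAX_AGE = timedelta(days=8)             # assumed: one weekly cycle plus a day of slack


def last_entry(lines):
    """Return the last non-blank line, stripped, or None if every line is blank."""
    last = None
    for line in lines:
        if line.strip():
            last = line.strip()
    return last


def is_overdue(last_run, now, max_age=MAX_AGE):
    """True when last_run lies more than max_age before now."""
    return now - last_run > max_age


def check_machine(machine, now=None):
    """Print the newest log entry for machine and flag a log that looks stale."""
    log = LOG_DIR / machine
    with log.open() as handle:
        print("last entry:", last_entry(handle))
    # Assumption: the file's modification time approximates the time of the
    # last successful run; the real timestamp format in the log is unknown here.
    modified = datetime.fromtimestamp(log.stat().st_mtime)
    if is_overdue(modified, now or datetime.now()):
        print(f"warning: {machine} has logged no completed backup in over {MAX_AGE.days} days")
```

As with reading the log by hand, a fresh entry only shows that the script finished for that machine. It says nothing about which volumes were covered.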

Some users have independently bought their own tape drives and tapes. The IT group does not keep track of what users do in this regard.

Backup Method Comparison

The tape backup and mirror job backups give different levels of protection. The tape backup gives more protection than a mirror job, which is why it is more expensive (in both manpower and equipment). This is the primary reason why the central RAID volumes are so much more expensive.

With the mirror job, your backup is anywhere from zero to 7 days old. As long as the most recent backup ran without error, you are well protected against most kinds of system-side and environment-side failures, since you can easily recover to the last backup, which is no more than 7 days old. Of course, a fire or similar catastrophe that takes out both the production volume and the backup volume will result in a complete loss. The more you can geographically separate the two, the better.

Mirror jobs are not very good at dealing with user-side errors, mainly because users often do not recognize those errors right away. Let's say you accidentally delete something on Friday, the mirror job runs on Sunday, and you realize your mistake on Monday. Well, tough luck. The mirror now matches the volume as it was on Sunday, without the file, so there is no recovery at that point.
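
The Friday/Sunday/Monday scenario can be traced with a toy model. This is only an illustration: the volume is a dictionary with made-up file names, and `mirror` assumes the mirror job makes the backup an exact copy of the source, removing files that no longer exist there, which is what the scenario implies.

```python
def mirror(source):
    """Toy mirror job: the backup becomes an exact copy of the source as it is now.

    Assumption: files missing from the source also disappear from the backup.
    """
    return dict(source)


# Hypothetical volume contents: file name -> contents.
volume = {"results.dat": "week of analysis", "notes.txt": "scan log"}

backup = mirror(volume)        # previous Sunday: mirror job runs
del volume["results.dat"]      # Friday: accidental deletion
backup = mirror(volume)        # Sunday: mirror job runs again

# Monday: the file is gone from both the volume and its only backup.
recoverable = "results.dat" in backup
print("results.dat recoverable:", recoverable)
```

Had the mistake been noticed on Saturday, before the Sunday run, the previous mirror would still have held the file.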

This is where tape backups really excel. The central tape backups are retained for about 5 months or more, so you can typically recover a file you accidentally deleted or overwrote up to 5 months ago, from about six different snapshot times. Also, after ten tapes have filled up in our tape robot libraries, they are walked over from bldg 149 to a storage cabinet in bldg 120. This typically means that all backups older than a couple of weeks are stored in a largely different geographical area. Bldg 36 tape backups are stored in bldg 149.

It is true that one could achieve something close to the tape paradigm with mirror jobs by adding multiple backup volumes. For each 64GB data volume you have, you arrange for two or more other 64GB volumes used just for backup of the original, and each week you rotate to a different backup volume. That will certainly be more expensive than tape for achieving three or more months of backup, but for users with VERY critical data, adding one more backup volume should be reasonable. More complex backup software could use the extra volumes for increments instead of another full mirror. If the original volume does not change much from week to week, this certainly would work and could give one the best of both worlds. Such software, though, is expensive and still in its infancy in the industry, but it is something we may do in the future.
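
The cost of the rotation scheme can be estimated with a little arithmetic. The sketch below assumes a simple round-robin rotation with one full mirror per week; the function names and the round-robin rule are illustrative, not a description of any existing script.

```python
import math


def volume_for_week(week, backup_volumes):
    """Pick this week's backup volume by round-robin over the available volumes."""
    return backup_volumes[week % len(backup_volumes)]


def guaranteed_history_days(n_volumes, interval_days=7):
    """Days of history that are always available with n_volumes in rotation.

    Right after a backup the retained copies are 0, 7, ..., 7 * (n - 1) days
    old, so a mistake younger than 7 * (n - 1) days predates some retained copy.
    """
    return (n_volumes - 1) * interval_days


def volumes_needed(history_days, interval_days=7):
    """Smallest number of backup volumes that guarantees history_days of history."""
    return math.ceil(history_days / interval_days) + 1
```

With a single mirror volume the guaranteed history is zero days, which is the Friday/Sunday/Monday problem. Two volumes guarantee one week. Matching roughly three months (90 days) takes `volumes_needed(90)`, which is 14 volumes, or 896GB of backup space for each 64GB data volume, which is why full mirrors are a costly way to imitate tape.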

Desktop Backup History Page

For questions, to arrange a mirror backup, or to ask for a restore from tape backup, contact the IT group.

Contact the Webmaster