BACKUPS
When someone asks "how safe is my data?", there are at least two
different aspects to what "safe" means: data loss/corruption
and data access security. The first issue is discussed here.
For security issues, see the Security
web page.
Reasons for loss/corruption
Data loss/corruption can happen for many reasons which fall into
three categories: user, system and environment.
The main methods of user-side loss/corruption are:
- Accidental file deletion
- Accidental file overwrite
- Bugs in analysis software
Some major examples of system-side loss/corruption are:
- Hard disk failure
- Hard disk/RAID controller failures
- Operating system (kernel) bugs
- Network failures
Some examples of environment-side loss/corruption are:
- Fire
- Flood
- Electric surges
These categories are somewhat interrelated. When a user puts their desktop
in a badly ventilated area so that the drives overheat or, even worse,
the dust bunnies set the whole system on fire, it is really a user
category problem.
NOTE ON RAID: A RAID only protects against system-side hard disk failure,
and only when just one disk has failed. It does not protect
against multi-disk failures or any of the other failure modes listed.
RAID IS NOT A REPLACEMENT FOR BACKUP!
Different backup schemes protect against different failure modes
to different degrees, as discussed below.
Backup Policy and Methods
There are two classes of data storage at the Martinos Center. These
are central UNIX RAID storage and everything else. The latter is mostly
local disk on user desktops, whether plain disk or RAID.
For the central RAID volumes, there is a backup to tape roughly every week
(usually every 6-8 days, depending on the volume of data). You
can see the list of these volumes at:
http://surfer.nmr.mgh.harvard.edu/cgi-bin/backup/index.html
Your UNIX home directory is on central RAID. Your Windows/Mac files
are on your local desktop (except in the rare case, which the users
involved should know about, where UNIX volumes are mapped onto their
Windows/Mac systems). There is no central backup of Windows and Mac
local disks.
For all desktop volumes, the users are responsible for doing backup
themselves. One thing they can do is ask us to set up weekly mirror jobs.
For this, the user needs to supply a volume at least as large as
the volume they want backed up. Preferably, this volume is on another
desktop, but at the very least it needs to be on a different, independent
disk or RAID in the same machine to protect against disk/RAID failure.
A database of these mirror jobs is set up in the flat file at
/space/sake/5/admin/notes/DesktopBackup.DB
You can 'more' this file to look for your volumes to check if they are
being backed up in this way.
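As a minimal sketch of the check described above, the following Python
function scans the flat file for lines that mention a given string, such
as a hostname or volume path. The layout of DesktopBackup.DB is not
described here, so the code assumes only that it is plain text with one
entry per line. The function name find_mirror_jobs is hypothetical.

```python
from pathlib import Path

# Path of the mirror job database as given in the text above.
DB_PATH = "/space/sake/5/admin/notes/DesktopBackup.DB"


def find_mirror_jobs(term, db_path=DB_PATH):
    """Return the lines of the flat file that mention `term` (case-insensitive).

    Assumption: the file is plain text with one entry per line. Its column
    layout is not documented here, so lines are matched as raw text.
    """
    needle = term.lower()
    text = Path(db_path).read_text(errors="replace")
    return [line for line in text.splitlines() if needle in line.lower()]
```

Calling find_mirror_jobs with your desktop's hostname returns the matching
entries, or an empty list if no mirror job mentions it.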
Note that when power outages take down desktops, the
backups will not happen. Normally the backups will resume and catch up
the following week when both machines are back up. For failures of
this type, the IT team does not force a by-hand backup. It is the
user's responsibility to request a special makeup backup for that
week.
Therefore users should check that their desktop backups look
"right" once a week and inform the IT group if they find
otherwise. There are some situations in which no error message is
generated, for instance if both machines are down at the backup
times.
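The weekly check suggested above could be partly automated. The sketch
below is one possible approach, not the IT group's procedure: it lists
files on the production volume that are old enough to have been picked up
by a weekly mirror but are absent (or have a different size) on the backup
volume. It assumes the mirror keeps the same relative directory layout as
the source. The function name and the 8-day threshold are illustrative
choices.

```python
import os
import time


def missing_from_backup(source_root, backup_root, max_age_days=8, now=None):
    """List source files older than `max_age_days` that the backup lacks.

    A file counts as missing if it does not exist under `backup_root` at the
    same relative path, or exists there with a different size.
    Assumption: the mirror preserves the source's relative layout.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    problems = []
    for dirpath, _dirnames, filenames in os.walk(source_root):
        for name in filenames:
            src = os.path.join(dirpath, name)
            try:
                st = os.stat(src)
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if st.st_mtime > cutoff:
                continue  # too new to be expected in a weekly mirror yet
            rel = os.path.relpath(src, source_root)
            dst = os.path.join(backup_root, rel)
            try:
                if os.stat(dst).st_size != st.st_size:
                    problems.append(rel)
            except OSError:
                problems.append(rel)
    return sorted(problems)
```

An empty result does not prove the backup is good, but a non-empty one is
a reason to contact the IT group.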
Some users have independently bought their own tape drives and
tapes. The IT group does not keep track of what users do in this regard.
Also, for the admin group and select other individuals there is a
Retrospect backup system handling backup for their Windows and Mac
desktops. This option is not generally available at the Martinos
Center, so it is not discussed here.
Backup Method Comparison
The tape backups and mirror job backups give different levels of
protection. The tape backup gives more protection than a mirror job,
which is why it is more expensive (both in manpower and equipment).
This is the primary reason why the central RAID volumes are so
much more expensive.
With the mirror job, your backup is anywhere from zero to 7 days old.
As long as there was no error during the most recent backup, you are
well protected against most kinds of system-side and
environment-side loss, since you can easily recover to the last backup,
which is no more than 7 days old. Of course, a fire or similar catastrophe
that takes out both the machine with the production volume and the machine
with the backup volume will result in a complete loss. The more you can
geographically separate the two, the better.
Mirror jobs are not very good at dealing with user-side errors.
This is mainly because
users often do not recognize those errors right away. Let's say you
accidentally delete something on Friday, the mirror job runs on
Sunday, and you realize your mistake on Monday. Well, tough luck. No
recovery at that point.
This is where tape backups really excel. The central tape backups are
retained for about 5 months or more, so you can typically recover a
file you accidentally deleted or overwrote up to 5 months ago, to any of
about six different snapshot times. Also, after ten tapes have filled up in
our tape robot libraries, they are walked over from bldg 149 to a storage
cabinet in bldg 120. This typically means all backups older than a
couple of weeks are stored in a largely different geographical area.
Bldg 36 tape backups are stored in bldg 149.
It is true that one could achieve something close to the tape paradigm
with mirror jobs by adding multiple backup
volumes. For each 64GB data volume you have, you arrange for two or more
other 64GB volumes used just for backup of the original. Each week
you rotate to a different backup volume.
That will certainly be more expensive than tape for achieving
three or more months of backup, but for those users with VERY critical
data, adding one more backup volume should be reasonable. More
complex backup software could be run to use the extra volumes for
increments instead of another full mirror. If the original volume does
not change much from week to week, this certainly would work and could
give one the best of both worlds. Such software, though, is expensive
and still in its infancy in the industry, but it is something we may
adopt in the future.
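The weekly rotation described above amounts to a round-robin over the
backup volumes. The following Python sketch illustrates the idea only.
The function names and the volume labels in the usage note are
hypothetical, and it assumes exactly one backup per week with none skipped.

```python
def backup_volume_for_week(week_index, backup_volumes):
    """Pick the backup volume to overwrite in a given week (round-robin)."""
    if not backup_volumes:
        raise ValueError("at least one backup volume is required")
    return backup_volumes[week_index % len(backup_volumes)]


def snapshot_ages_in_weeks(week_index, n_volumes):
    """Ages (in weeks) of the snapshots held right after this week's backup.

    Assumes one backup per week with none skipped. Before every volume has
    been used once, fewer snapshots exist.
    """
    return list(range(min(n_volumes, week_index + 1)))
```

With three volumes labeled "backup_a", "backup_b" and "backup_c", weeks 0
through 3 write to a, b, c and then a again, and after the first three
weeks there are always snapshots that are 0, 1 and 2 weeks old. In the
Friday/Sunday/Monday example, the deleted file would still be on the two
volumes that Sunday's job did not overwrite.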
Machine for Users to do their Own Tape Backups
There is a machine in South Central called 'parvo' which has several
types of tape drives and a CD/DVD writer. Users are welcome to use parvo
to make select backups on their own, with tapes/DVDs they buy themselves.
For questions or to arrange a mirror backup or to ask for a restore
from tape backup, contact the