a Sun Hosted Business from Administrator Induced Data Loss
by Ron Austin
Editor's intro:- Disk to disk backups
(or snapshots in the cloud)
are supposedly quick and simple. But what happens when your live data and
your backup are simultaneously zapped by a systems administrator's script error?
The true story below relates the cautionary tale of how one Sun customer
recovered their business when their server and database suppliers were unable to
help them in a critical data loss situation.
In March 2003, ActionFront solved a major technical glitch involving
Oracle and saved the client's business!
A data-centre user error caused a major data
loss emergency on a mission-critical system. Had the company in question been
unable to regain access to its server and data, the future of the business
would have been imperiled.
Here is a detailed description of the setting,
the problem and the resolution.
- The main server was a SUN E5500 Server with 10 CPUs running under
- The application itself was based on Oracle Enterprise Edition ver
- The data storage side ran under a Veritas file system with multiple
- 56 of these partitions were used to serve data to a particular application.
- Approximately 260 files resided across these 56 partitions, and while
ranging in size from ½GB to 8GB, most were approximately 2GB.
- 46 of the partitions resided on an EMC Symmetrix storage system, 10
partitions resided on SUN storage systems.
- The EMC Symmetrix was configured as a series of discrete RAID 1 arrays:
18GB drives, each mirrored to a second 18GB drive. Each array was partitioned
into two 9GB segments, yielding the 46 partitions on that server.
- The usable capacity exceeded 600GB in total.
- A new EMC CLARiiON server had been purchased and the plan was to consolidate
all the storage from the 56 partitions onto the CLARiiON server. A migration
process was planned.
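As a rough cross-check of the figures in the list above, the layout can be worked through in a few lines of Python. The variable names are ours, and the 2GB typical file size is the article's own approximation, so the data-volume estimate is only a ballpark.

```python
# Figures quoted in the configuration list above.
symmetrix_partitions = 46   # partitions on the EMC Symmetrix
segment_gb = 9              # each mirrored array was split into 9GB segments
segments_per_array = 2      # two segments per array
file_count = 260            # "approximately 260 files"
typical_file_gb = 2         # "most were approximately 2GB" (approximation)

# Each array contributes two partitions, so 46 partitions imply 23 arrays.
mirrored_arrays = symmetrix_partitions // segments_per_array
# Each array is one 18GB drive mirrored to a second 18GB drive.
physical_drives = mirrored_arrays * 2
# Usable space on the Symmetrix alone.
symmetrix_usable_gb = symmetrix_partitions * segment_gb
# Ballpark volume of application data across all 56 partitions.
approx_data_gb = file_count * typical_file_gb

if __name__ == "__main__":
    print(mirrored_arrays, physical_drives, symmetrix_usable_gb, approx_data_gb)
```

On these numbers the Symmetrix accounts for 414GB of the 600GB-plus total, with the remainder on the SUN storage, and the files at risk amounted to something in the region of 520GB.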
On March 8, 2003, a quiet
Saturday, the systems administrator wrote a script to perform the migration and
then decided to test the script with the actual copy commands "commented-out".
He made a typing error in the copy command, in effect instructing the main data
storage to copy onto itself, and then compounded his mistake by commenting out
the wrong line.
He initiated the test run, which then attempted to copy
each file over itself. Under the Solaris/UNIX file system this overwrote the
file inodes, erasing all file allocation information and truncating each
file to zero length.
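The article does not reproduce the administrator's script, so the following Python sketch is only an analogy for the failure described above, not a reconstruction of it. It shows how a copy that opens its destination for writing before reading the source empties a file that is copied onto itself, and how a simple guard plus an explicit dry-run flag (rather than commenting lines in and out) could refuse the operation. The function names and the file name are hypothetical.

```python
import os
import shutil
import tempfile


def naive_copy(src, dst):
    """Unsafe copy: the destination is opened, and truncated, before reading."""
    with open(dst, "wb") as out:      # "wb" truncates dst to zero length at once
        with open(src, "rb") as inp:  # if src is dst, nothing is left to read
            shutil.copyfileobj(inp, out)


def guarded_copy(src, dst, dry_run=True):
    """Refuse to copy a file onto itself; in dry-run mode only report the action."""
    if os.path.exists(dst) and os.path.samefile(src, dst):
        raise ValueError(f"refusing to copy {src} onto itself")
    if dry_run:
        return f"would copy {src} -> {dst}"
    shutil.copyfile(src, dst)
    return f"copied {src} -> {dst}"


def demo():
    """Try both copies on a throwaway file; return (size_after_naive, refused)."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "datafile.dbf")  # hypothetical file name
        with open(path, "wb") as f:
            f.write(b"x" * 4096)
        naive_copy(path, path)                    # copy the file over itself
        size_after_naive = os.path.getsize(path)
        with open(path, "wb") as f:               # recreate the test file
            f.write(b"x" * 4096)
        try:
            guarded_copy(path, path, dry_run=False)
            refused = False
        except ValueError:
            refused = True
        return size_after_naive, refused


if __name__ == "__main__":
    print(demo())
```

In this sketch the naive self-copy leaves the 4096-byte test file at zero bytes, while the guarded version raises an error before touching it. It does not model the inode-level behaviour of the Solaris/Veritas environment in the story.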
Overwriting directory information, unlike the
actual copying of data, is a very quick process and the damage was done almost
immediately.
Who You Gonna Call? - Ghostbusters? - No... the server wasn't haunted.
The administrator knew immediately that he had a problem.
He started calling for help late that Saturday.
A friendly voice offering practical
help was found on the phone at 2 a.m. Sunday morning.
Each of his vendors
(Veritas, SUN, EMC and Oracle) was sympathetic but
could not offer any help at this stage.
He then started to look for
a 24/7 Data Recovery service that was prepared to come on-site. One of the Data
Recovery market leaders boasts about its Remote Recovery expertise. However,
when contacted, it insisted that the entire storage configuration be
shipped to its facility, which the customer dismissed as impractical.
He found the ActionFront website, where the 24/7 Critical Response service is
promoted along with the standard "Priority Service", which is offered
during extended hours, 6 days per week, and meets the timeline and budget
expectations of most of our customers. The Critical Response Service is for
the select few clients that need an extraordinary level of around-the-clock
service and have sufficient budget resources available to cover the costs.
24/7 contact information for the Critical Response Service is displayed on the
ActionFront website, and the distressed admin reached the ActionFront consultant
on call at 2 a.m. Sunday.
A long discussion of his
circumstances and problem ensued. He made it clear that the system was needed
to carry on day to day transactions. This precluded shipping it out or even
shutting it down completely. On the advice of the ActionFront consultant, he
did the very best thing he could under the circumstances. He unmounted all the
damaged partitions in order to prevent overwriting any actual data. He missed
one of the 56 partitions because a process was running on it during the unmount.
ActionFront responded with a lot more than just good advice:
by noon Sunday a plane ticket was purchased, the traveling recovery kit procured
from the office and a senior technician was on his way to a distant city. He
arrived at the customer site at 9:30 pm and went to work.
Before attempting any recovery activities, his first action was to make a complete copy
of all media involved. The network remained live as the customer carried on "business
as usual" as much as possible. He copied complete images of the affected
drives, segment by segment, to large capacity drives brought along as part of the
recovery kit. The copy process was very slow because it was done over the live
network.
Unhappy with the slow copy process, the ActionFront specialist
analyzed the configuration and parameters he had to work under and soon devised
a way to stream multiple devices simultaneously. This increased the transfer
rate without affecting anything else, and shortened the overall recovery time.
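The article does not say how the technician streamed multiple devices at once. As a minimal sketch of the general idea, assuming each source can be read independently, the Python below images several sources concurrently with a thread pool, opening every source read-only and writing to a separate image file. Ordinary temporary files stand in for the drive segments, and all names here are hypothetical.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

CHUNK = 1024 * 1024  # read in 1 MiB blocks


def image_device(src, dst):
    """Copy src to dst block by block, reading src only; return bytes copied."""
    copied = 0
    with open(src, "rb") as inp, open(dst, "wb") as out:
        while True:
            block = inp.read(CHUNK)
            if not block:
                break
            out.write(block)
            copied += len(block)
    return copied


def image_many(pairs, workers=4):
    """Image several (src, dst) pairs concurrently; return {src: bytes_copied}."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {src: pool.submit(image_device, src, dst) for src, dst in pairs}
        return {src: fut.result() for src, fut in futures.items()}


def read_all(path):
    """Return the full contents of a file (used only to verify the copies)."""
    with open(path, "rb") as f:
        return f.read()


def demo():
    """Image three stand-in 'segments' in parallel and verify the copies."""
    with tempfile.TemporaryDirectory() as tmp:
        pairs = []
        for i in range(3):
            src = os.path.join(tmp, f"segment{i}.raw")
            with open(src, "wb") as f:
                f.write(os.urandom(200_000 + i))
            pairs.append((src, os.path.join(tmp, f"segment{i}.img")))
        sizes = image_many(pairs)
        identical = all(read_all(s) == read_all(d) for s, d in pairs)
        return sorted(sizes.values()), identical


if __name__ == "__main__":
    print(demo())
```

Whether parallel streams help in practice depends on where the bottleneck lies; the gain reported in the story came from analysing that particular configuration.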
Working on the copied versions of the files, the ActionFront
technician repaired and rebuilt the damaged file systems. Fortunately, the
allocation map was intact and this sped up the recovery process at this
stage. (The allocation map could have been recreated had it been wiped out, but
that would have extended the recovery time.)
One partition was not
unmounted at the beginning of the crisis, resulting in overwrite damage to 4 of
the 6 files on this partition. One of these files had its Oracle file header
damaged, and Oracle provided support for the recreation of this damaged header.
Some of the other logical devices that were damaged were also fixed at
this stage and then reintegrated into the database.
This story has
a happy ending...
All was fixed by 6 pm on Friday, March 14,
approximately six days after the crisis began. While the company had some
ability to function during the crisis, the loss of much of its historical data
threatened the profitability, and perhaps the viability, of the company itself.
So why do data storage professionals need 3rd party Data Recovery?
Fault-tolerant data storage systems are
generally reliable and well managed. When device failure, user error and other
problems do cause these systems to fail, it is a rare event, and often the first
time the operator has been faced with these circumstances. It can be beyond the
training and experience of most of the technical community, including vendor
technical support, let alone the unlucky systems administrator. Under the
stress of a data emergency, even the best technicians can make mistakes when in
unfamiliar territory, whereas our data recovery specialists deal with these
situations every day and are well qualified to address the problems.
Advice about dealing with a data loss emergency is available.
ActionFront has published a number of articles with general advice which can help you,
including a Guide - IT Professional Edition and the original
Guide (both of which can be viewed without
log-ins on our sister site STORAGEsearch.com - ed.)
with ActionFront Benefit IT and Data Storage Vendors and VARs
Data loss victims are usually distressed and sometimes they can be angry, even though
the anger is misdirected. A referral to ActionFront, a trusted 3rd party, can:
- Bring a positive resolution to the support call.
- Avoid exposure of trying to provide Do-it-Yourself data recovery advice.
- Help separate a data recovery issue from the expectation that it should be
covered by warranty.
- Calm down the customer and help them understand that their problem is not
- Resolve the customer's problem by recovering their data and making their
systems operational again.
Editor's afterword:- there are no silver bullets which will
protect your data from all data demons. This article clearly shows that relying
on a single type of data protection technology makes you vulnerable to common
mode failure. In this case, the common mode was an installation script, but it
can just as easily be a virus. You should always use more than one type of
backup technology for peace of mind. But it's good to know that there are
companies you can call when all else fails.