SPARC Product Directory - since 1992
from the makers of StorageSearch.com

Recovering a Sun Hosted Business
from Administrator Induced Data Loss

by Ron Austin, ActionFront Data Recovery

Editor's intro:- Disk-to-disk backup (or snapshots in the cloud) is supposedly quick and simple. But what happens when your live data and your backup are simultaneously zapped by a systems administrator's script error?

The true story below is a cautionary tale of how one Sun customer recovered its business when its server and database suppliers were unable to help in a critical data loss situation.
In March 2003, ActionFront solved a major technical glitch involving Sun Microsystems, EMC, Veritas and Oracle and saved the client's business!

A data-centre user error caused a major data loss emergency on a mission-critical system. Had the company been unable to regain access to its server and data, the future of the business would have been imperiled.

Here is a detailed description of the setting, the problem and the resolution.

The Setting
  • The main server was a Sun E5500 server with 10 CPUs running Sun Solaris.
  • The application itself was based on Oracle Enterprise Edition ver 8.1.5
  • The data storage side ran under a Veritas file system with multiple partitions.
  • 56 of these partitions were used to serve data to a particular application.
  • Approximately 260 files resided across these 56 partitions, and while ranging in size from ½GB to 8GB, most were approximately 2GB.
  • 46 of the partitions resided on an EMC Symmetrix storage system; the other 10 resided on Sun storage systems.
  • The EMC Symmetrix was configured as a series of discrete mirrored (RAID 1) arrays: 18GB drives, each mirrored to a second 18GB drive. Each array was partitioned into two 9GB segments, yielding the 46 partitions on that system.
  • The usable capacity exceeded 600GB in total.
  • A new EMC CLARiiON system had been purchased, and the plan was to consolidate all the storage from the 56 partitions onto it. A migration process was planned.
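
As a quick sanity check on the figures in the list above, the arithmetic can be sketched in a few lines of Python. The variable names are illustrative only, and the file total is a rough estimate that assumes every file is close to the 2GB typical size.

```python
# Figures quoted in the list above; names are illustrative.
EMC_PARTITIONS = 46      # partitions on the EMC Symmetrix
SEGMENT_GB = 9           # size of each partition (segment)
DRIVE_GB = 18            # size of each mirrored drive
FILE_COUNT = 260         # approximate number of data files
TYPICAL_FILE_GB = 2      # "most were approximately 2GB"

segments_per_pair = DRIVE_GB // SEGMENT_GB            # 2 segments per mirrored pair
mirrored_pairs = EMC_PARTITIONS // segments_per_pair  # arrays needed for 46 segments
physical_drives = mirrored_pairs * 2                  # each array is two drives
emc_usable_gb = EMC_PARTITIONS * SEGMENT_GB           # usable space on the Symmetrix
rough_data_gb = FILE_COUNT * TYPICAL_FILE_GB          # rough size of the data set

print(mirrored_pairs, physical_drives, emc_usable_gb, rough_data_gb)
```

On these numbers the Symmetrix side works out to 23 mirrored pairs, with the remainder of the capacity sitting on the Sun storage.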

The Problem

On March 8, 2003, a quiet Saturday, the systems administrator wrote a script to perform the migration and decided to test it with the actual copy commands "commented out". He made a typo in the copy command, in effect instructing the main data storage to copy onto itself, and then compounded the mistake by commenting out the wrong line.

He initiated the test run, which then attempted to copy each file over itself. Under the Solaris/UNIX file system this overwrote the file inodes, erasing all file allocation information and truncating each file to zero length.
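
The administrator's script is not reproduced in this article, so the following is only a minimal Python sketch of the general mechanism: a naive copy opens the destination for writing first, which empties it, and if the source is the same file there is nothing left to read. The function names are hypothetical and the demonstration runs on a throwaway temporary file.

```python
import os
import tempfile


def naive_copy(src, dst):
    """Copy src to dst without checking whether they are the same file."""
    with open(dst, "wb") as out:      # opening with "wb" truncates dst at once
        with open(src, "rb") as inp:  # if src is dst, it is already empty here
            out.write(inp.read())


def demo_self_copy():
    """Copy a temporary file onto itself; return its size before and after."""
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(b"x" * 1024)
        before = os.path.getsize(path)
        naive_copy(path, path)
        after = os.path.getsize(path)
    finally:
        os.remove(path)
    return before, after
```

Calling `demo_self_copy()` shows the file shrinking from 1024 bytes to zero, with no error raised along the way.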

Overwriting this file metadata, unlike actually copying data, is a very quick process, and the damage was done almost instantly.
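
One way to make a test run less dependent on commenting out the right lines is to give the script an explicit dry-run mode and have it refuse any copy whose source and destination are the same file. The sketch below is an illustration of that idea only, not the customer's script; `plan_copy` and `migrate` are hypothetical names.

```python
import os
import shutil


def plan_copy(src, dst):
    """Return None if the copy looks safe, else the reason it is not."""
    if not os.path.exists(src):
        return "source does not exist"
    if os.path.abspath(src) == os.path.abspath(dst):
        return "source and destination are the same path"
    if os.path.exists(dst) and os.path.samefile(src, dst):
        return "source and destination are the same file"
    return None


def migrate(pairs, dry_run=True):
    """Process (src, dst) pairs; with dry_run=True nothing is written."""
    actions = []
    for src, dst in pairs:
        problem = plan_copy(src, dst)
        if problem:
            actions.append(("REFUSED", src, dst, problem))
        elif dry_run:
            actions.append(("WOULD COPY", src, dst, ""))
        else:
            shutil.copy2(src, dst)
            actions.append(("COPIED", src, dst, ""))
    return actions
```

Because the dry run is the default and shares the same checks as the real run, a typo that points a file at itself shows up as a `REFUSED` entry in the plan instead of as a truncated file.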

Who You Gonna Call? - Ghostbusters? - No... the server wasn't haunted.

The administrator knew immediately that he had a huge problem.

He started calling for help late on the Saturday.

A friendly voice offering practical help was found on the phone at 2 a.m. Sunday.

Each of his vendors (Veritas, Sun, EMC and Oracle) was sympathetic but could not offer any help at this stage.

He then started to look for a 24/7 data recovery service that was prepared to come on-site. One of the data recovery market leaders boasts about its remote recovery expertise. However, when contacted, it insisted that the entire storage configuration be shipped to its premises, which the customer dismissed as impractical.

He soon found the ActionFront website, which promotes the 24/7 Critical Response service alongside the standard "Priority Service". Priority Service is offered during extended hours, 6 days per week, and meets the timeline and budget expectations of most of our customers. The Critical Response Service is for the select few clients that need an extraordinary level of around-the-clock service and have sufficient budget to cover the costs.

The 24/7 contact information for the Critical Response Service is displayed on the ActionFront website, and the distressed admin reached the ActionFront consultant on call at 2 a.m. Sunday.

A long discussion of his circumstances and problem ensued. He made it clear that the system was needed to carry on day-to-day transactions, which precluded shipping it out or even shutting it down completely. On the advice of the ActionFront consultant, he did the very best thing he could under the circumstances: he unmounted all the damaged partitions to prevent any actual data from being overwritten. He missed one of the 56 partitions because it was running a process during the unmount procedure.
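
A simple cross-check after a bulk unmount is to compare the list of partitions that should be offline against the system's mount table. The sketch below assumes a whitespace-separated table whose second field is the mount point, as in Solaris `/etc/mnttab` or Linux `/proc/mounts`; the function name and the sample device and mount-point names are made up for illustration.

```python
def still_mounted(mount_table_text, mount_points):
    """Return the entries of mount_points that still appear in the table."""
    mounted = set()
    for line in mount_table_text.splitlines():
        fields = line.split()
        if len(fields) >= 2:
            mounted.add(fields[1])  # second field is the mount point
    return [mp for mp in mount_points if mp in mounted]


# Hypothetical sample: three data partitions, one of which was missed.
SAMPLE_TABLE = (
    "/dev/dsk/c0t0d0s0\t/\tufs\trw\t1047100000\n"
    "/dev/vx/dsk/datadg/vol17\t/data17\tvxfs\trw\t1047100000\n"
)
print(still_mounted(SAMPLE_TABLE, ["/data16", "/data17", "/data18"]))
```

An empty result means everything on the list is offline; any name that comes back is a partition that is still exposed to writes.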

ActionFront responded with a lot more than just good advice: by noon Sunday a plane ticket had been purchased, the traveling recovery kit had been collected from the office, and a senior technician was on his way to a distant city. He arrived at the customer site at 9:30 pm and went to work.

Before attempting any recovery, he made a complete copy of all media involved. The network remained live as the customer carried on "business as usual" as much as possible. He copied complete images of the affected drives, segment by segment, to large-capacity drives brought along as part of the recovery kit. The copy process was very slow because it was done over the live network.
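
The tools in the recovery kit are not described here, but the principle of imaging before repair can be sketched generically: read the source strictly read-only, copy it in fixed-size chunks to a new image file, and record a checksum so the image can be verified later. `image_device` is a hypothetical name, and the chunk size is an arbitrary choice.

```python
import hashlib


def image_device(src_path, dst_path, chunk_size=1024 * 1024):
    """Copy src to a new image file in chunks; return (bytes_copied, sha256)."""
    digest = hashlib.sha256()
    copied = 0
    # "rb" never modifies the source; "xb" fails if the image already exists,
    # so an earlier image cannot be overwritten by accident.
    with open(src_path, "rb") as src, open(dst_path, "xb") as dst:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                break
            dst.write(chunk)
            digest.update(chunk)
            copied += len(chunk)
    return copied, digest.hexdigest()
```

All later repair work can then be done on the image, leaving the original media untouched if an attempt goes wrong.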

Unhappy with the slow copy process, the ActionFront specialist analyzed the configuration and parameters he had to work under and soon devised a way to stream multiple devices simultaneously. This increased the transfer rate without affecting anything else, and shortened the overall recovery time.
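
How the specialist actually streamed the devices is not detailed in this account. As a generic illustration of copying several sources at once, the sketch below runs independent copy jobs on a small thread pool; `copy_one`, `copy_many` and the worker count are assumptions made for the example.

```python
import shutil
from concurrent.futures import ThreadPoolExecutor


def copy_one(job):
    """Copy a single (src, dst) pair to a new file and return dst."""
    src, dst = job
    with open(src, "rb") as s, open(dst, "xb") as d:
        shutil.copyfileobj(s, d, 1024 * 1024)  # stream in 1 MiB buffers
    return dst


def copy_many(jobs, workers=4):
    """Run several copy jobs concurrently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(copy_one, jobs))
```

Whether this helps in practice depends on where the bottleneck is: it pays off when individual streams are limited by per-device or per-connection throughput, and not when a single shared link is already saturated.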

Working on the copied versions of the files, the ActionFront technician repaired and rebuilt the damaged file systems. Fortunately, the allocation map was intact, which sped up the recovery process at this stage. (The allocation map could have been recreated had it been wiped out, but that would have extended the recovery time.)

One partition was not unmounted at the beginning of the crisis, resulting in overwrite damage to 4 of the 6 files on this partition. One of these files had its Oracle file header damaged, and Oracle provided support for recreating the header.

Some of the other logical devices that were damaged were also fixed at this stage and then reintegrated into the database.

This story has a happy ending...

All was fixed by 6 pm on Friday, March 14, approximately six days after the crisis began. While the company had some ability to function during the crisis, the loss of much of its historical data threatened the profitability, and perhaps the viability, of the company itself.

So why do data storage professionals need 3rd party Data Recovery Services?

Fault-tolerant data storage systems are generally reliable and well managed. When device failure, user error and other problems do cause these systems to fail, it is a rare event, often the first time the operator has faced these circumstances. It can be beyond the training and experience of most of the technical community, including vendor technical support, let alone the unlucky systems administrator. Under the stress of a data emergency, even the best technicians can make mistakes in unfamiliar territory, whereas our data recovery specialists deal with these situations every day and are well qualified to address the problems.

Free advice about dealing with a data loss emergency is available.

ActionFront has published a number of articles with general advice which can help you including:- the Data Emergency Guide - IT Professional Edition and the original Data Emergency Guide (both of which can be viewed by clicking on these links without log-ins on our sister site STORAGEsearch.com - ed.)

Why Alliances with ActionFront Benefit IT and Data Storage Vendors and VARs

Data loss victims are usually distressed and sometimes they can be angry, even though the anger is misdirected. A referral to ActionFront, a trusted 3rd party authority, can:

  • Bring a positive resolution to the support call.
  • Avoid the exposure that comes with trying to provide do-it-yourself data recovery advice.
  • Help separate a data recovery issue from the expectation that it should be covered by warranty.
  • Calm down the customer and help them understand that their problem is not the vendor's fault.
  • Resolve the customer's problem by recovering their data and making their systems operational again.



Editor's afterword:- there are no silver bullets which will protect your data from all data demons. This article clearly shows that relying on a single type of data protection technology makes you vulnerable to common mode failure. In this case, the common mode was an installation script, but it can just as easily be a virus. You should always use more than one type of backup technology for peace of mind. But it's good to know that there are companies you can call when all else fails.



SPARC(R) is a registered trademark of SPARC International, Inc. SPARC PRODUCT DIRECTORY(SM) is a service mark of SPARC International, Inc used under license by ACSL. Products using the SPARC trademarks are based on an architecture developed by Sun Microsystems, Inc.