In the February 2002 issue of this magazine I wrote about disaster
recovery and how to plan an escape route. That article
discussed two steps: identifying elements for Failures &
Risks, and drafting and implementing a Comprehensive Disaster
Policy. In this article I highlight some precautions
one can take to avoid disasters and to cope with failure,
which can affect infrastructure, client, and application
systems. In each of these IT domains, all points of failure and disaster
situations need to be assessed separately. Business continuity
is best ensured if you plan risk management strategies in advance,
plan resources to handle failures and disasters, and set procedures
for recovery. I think disaster and recovery planning should
be an integral part of the design and deployment of any IT system.
A sound disaster recovery/business continuity plan is essential
to protect the well-being of an organization.
by M. D. Agrawal
FAIL-PROOF SYSTEMS AND DISASTER
We should not confuse disaster planning with the design of fail-proof
systems, though the success of each depends on the other.
Both applications and network systems should be architected to the
highest standard, leaving as little scope for failure as possible.
Two key features of fail-proof design are:
a) Use of reliable hardware with built-in fault tolerance
and redundant functionality, together with a network design that has redundancy
both in carriers and in end-systems such as switches and routers
(the QoS approach).
b) Design and deployment of software systems supported
by a balanced architecture, the use of tested tools, and
adherence to known quality standards.
A good practice is to learn from past failures and find solutions in them.
There is no handbook that serves as the ultimate reference for failure
and disaster controls.
DISASTER RECOVERY PLANS
Disaster recovery plans vary greatly, depending on the size and scope
of a company and how critical IT is to its operation.
Business Managers and IT Managers should discuss what kind
of recovery plan is necessary, and which systems and business
units are most crucial to the company. They should decide
on people responsible for declaring a disruptive event and
mitigating its effects. Most importantly, the plan should
establish a process to locate and communicate with employees
after a disruptive event.
METHOD
As per Gartner's recommendations, the best way to get started
is to conduct a BIA (Business Impact Analysis). This will
identify the most crucial systems in the business and the effect
of an outage on the business. The greater the potential impact,
the more money a company should spend to restore systems quickly.
For instance, a stock trading company may decide to pay for
completely redundant systems that will allow it to immediately
start processing trades at another location. On the other
hand, a manufacturing company may decide that it can wait
for 24 hours to resume shipping. A BIA will also help companies
set a restoration sequence to determine which parts of the
business should be restored first.
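The restoration sequence that comes out of a BIA can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `SystemImpact` record, the system names, and the loss and downtime figures are hypothetical, and the ranking rule (least tolerable downtime first, then highest hourly loss) is one reasonable choice rather than a Gartner-prescribed method.

```python
from dataclasses import dataclass


@dataclass
class SystemImpact:
    """One row of a hypothetical BIA worksheet."""
    name: str
    loss_per_hour: float        # estimated business loss per hour of outage
    max_tolerable_hours: float  # how long the business can wait to resume


def restoration_sequence(systems):
    """Order systems for recovery.

    Systems that can tolerate the least downtime come first; ties are
    broken by the higher hourly loss.
    """
    return sorted(systems, key=lambda s: (s.max_tolerable_hours, -s.loss_per_hour))


# Illustrative figures only, not data from any real company.
systems = [
    SystemImpact("shipping", 2_000.0, 24.0),
    SystemImpact("trade processing", 500_000.0, 0.0),
    SystemImpact("email", 1_000.0, 8.0),
]

order = [s.name for s in restoration_sequence(systems)]
print(order)
```

With these sample figures, trade processing is restored first and shipping last, which mirrors the stock trading and manufacturing examples above.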
PLANNING AND TRAINING
You can form a comprehensive policy on risk management practices
that covers the design of the disaster recovery plan and the selection
and deployment of hardware and software elements. Systems for contingency
should be chosen carefully. You can choose
between distributed systems and centralized systems. Sometimes
distributed systems (as in the NASDAQ experience) offer a better
option for recovery.
You can then design and develop the fail-over applications.
Developing fail-proof systems for heterogeneous clients, servers,
and peer systems is a big task. The system administrator should
study the fixes, patches, and applications in detail to counter
the shortcomings and bugs, especially in desktop and server
operating systems.
Training programs for IT staff should also be incorporated
in the design to create awareness of the subject. I have seen
that in most cases, training and awareness on a critical subject
such as ensuring data and network availability are left solely
to the operations staff, and the basic infrastructure that
ensures data backup and fail-over is missing from the initial
plans. This is the wrong approach. It is necessary to include
redundancy and fail-over solutions in any IT system design.
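As a minimal sketch of what designing fail-over into a system can mean, the Python below picks the first healthy site from an ordered list. The site names and the `is_healthy` callback are assumptions made for illustration; a real health check would probe the service over the network and would need timeouts and retries.

```python
from typing import Callable, Optional, Sequence


def pick_active(endpoints: Sequence[str],
                is_healthy: Callable[[str], bool]) -> Optional[str]:
    """Return the first endpoint that passes its health check, or None.

    Endpoints are tried in priority order, so the primary site is used
    whenever it is up and a standby is used only when it is not.
    """
    for endpoint in endpoints:
        try:
            if is_healthy(endpoint):
                return endpoint
        except Exception:
            # A health check that raises is treated as a failed site.
            continue
    return None


# Hypothetical site names and a stubbed status table standing in for real probes.
status = {"primary-dc": False, "standby-dc": True}
active = pick_active(["primary-dc", "standby-dc"], lambda name: status[name])
print(active)
```

Keeping the priority list and the health check separate means the same selection logic can be reused for servers, network links, or whole sites.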
NASDAQ EXPERIENCE
The September 2001 attack on the World Trade Center tested the
contingency plans of American businesses to an unanticipated
degree. Companies that had business continuity plans and contracts
in place with vendors of recovery services were able to continue
business at alternate sites with minimal downtime and loss
of data. The alternate facilities provided by the vendors
performed rather efficiently.
NASDAQ recovered from the attack within six days. This was
possible because of its distributed systems, contingency planning,
and redundant architecture. The task for fresh planning
is how to reduce recovery time further and avoid business
loss.
PREPAREDNESS
We may witness disaster for any reason and in any form. It
may be a natural calamity or initiated by humans. Are we prepared
for it?
Organizations should not neglect to review and update their
business continuity plans and contracts. Only by learning
from the experiences of these business houses can they
add value to the lessons from the attack.
Role of outsourcing in disaster and recovery
Another option, which can turn out to be beneficial, is to
engage an outsourcing agency when a disaster strikes. Considering
the large amount of routine work involved, back-to-back arrangements
for recovery and backup with an outsourcing partner are viewed
as a better option. These will supplement backups kept at ISP
sites.
Options for contingency planning
Standby Computers: You can create standby computer systems
and locate them in air-conditioned computer rooms in trailers.
The equipment can be delivered to the nominated sites in the
event of a disaster. While it may not be practical to make
all systems redundant, do not compromise when it comes to
mission-critical systems. An alternative is the outsourcing
route, especially if there is a very large requirement for
desktops and low-end servers.
Data Management: Efficient data management can provide facilities
that automatically back up computer data to remote vaults
or mirror copies of databases at a recovery center. This
may not provide fail-over for the system but will ensure data
availability.
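The automatic backup described under Data Management might look like the following Python sketch. It assumes the remote vault is reachable as an ordinary directory path, which is a simplification; the file name, the vault location, and the `back_up` helper are hypothetical. The checksum comparison is there because a backup that was never verified may not be restorable.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def back_up(source: Path, vault: Path) -> Path:
    """Copy a file into the vault directory and verify the copy."""
    vault.mkdir(parents=True, exist_ok=True)
    target = vault / source.name
    shutil.copy2(source, target)  # copies contents and file metadata
    if sha256_of(source) != sha256_of(target):
        raise IOError(f"backup of {source} failed verification")
    return target


# Demonstration in a temporary directory; a local folder stands in for the vault.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "orders.db"
    src.write_bytes(b"example data")
    copy = back_up(src, Path(tmp) / "vault")
    verified = copy.read_bytes() == b"example data"

print(verified)
```

As the text notes, a copy like this protects the data but does not by itself provide fail-over for the system that uses it.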
Facilities and utilities: The facilities that support a data center, such as power
and network channels, should be planned along with backup procedures.
The plan should encompass dual power sources and supplies, and
redundant network links from different utility feeds. The
team commissioned to design, install, and test the offsite
recovery of the data center should give its feedback to
the project manager.
LEGATO REPORT
Legato published a report, which appeared in CIO Magazine,
on the pitfalls of IT systems during the September 11 attacks.
I will leave you with some lessons which we can learn from.
The most serious predicaments centered on:
- Lack of documented recovery procedures.
- Configuring replacement hardware without documentation detailing
  the original systems configuration and setups.
- Lack of tape documentation and tracking and inadequate tape
  archive policies.
- Lack of protection for departmental servers.
To deal with these problems, Legato has identified five key
lessons:
Lesson 1:
Document system configurations: Many customers had not adequately
documented their system configurations. This fundamental precaution
takes the form of a simple but highly important document that
includes the configuration addresses and settings required to
make the infrastructure operational. The document should be
stored off-site.
Lesson 2:
Document and archive disaster recovery procedures: It is important
to have a comprehensive, written disaster recovery procedure
in place. The documents should be archived off-site along
with business-critical data.
Lesson 3:
Safeguard, document and track tape media: Based on its work
with 18 customers following the September
11 disaster, Legato estimates that about 30 percent of business
data was lost because it had not been backed up or rotated
off-site quickly enough. Very few customers had documentation recording
the contents of the tapes. Tape contents must be documented
and, again, the documentation stored off-site.
Lesson 4:
Identify and protect all business-critical servers: Restoring
business operations means that the servers hosting business-critical
applications, such as email, ERP, and database servers, have to
be recovered. In addition to basic data protection, bare-metal
recovery solutions allow easier replacement and rebuilding
of critical application servers with minimal effort.
Lesson 5:
Online data protection can ease recovery: The customers who
sailed through the disaster were the ones who had instant
access to online, off-site copies of their production data.
This second or third copy of critical data had been produced
in real time by replication technologies.
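The tape documentation called for in Lesson 3 can start as something very simple. The Python sketch below keeps a catalog of tapes, flags those not yet rotated off-site, and serializes the catalog so that a copy can be stored off-site as well. The `TapeRecord` fields, tape labels, and dates are hypothetical examples, not Legato's format.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class TapeRecord:
    """One entry in a hypothetical tape catalog."""
    label: str
    contents: list    # what was backed up onto this tape
    written_on: str   # ISO date the backup was taken
    offsite: bool     # True once the tape has been rotated off-site


def tapes_at_risk(catalog):
    """Return the labels of tapes that are still on-site."""
    return [tape.label for tape in catalog if not tape.offsite]


# Illustrative entries only.
catalog = [
    TapeRecord("TAPE-001", ["mail server", "file shares"], "2002-03-01", True),
    TapeRecord("TAPE-002", ["ERP database"], "2002-03-04", False),
]

at_risk = tapes_at_risk(catalog)

# The catalog itself should be stored off-site, so keep it in a portable format.
catalog_json = json.dumps([asdict(tape) for tape in catalog], indent=2)
print(at_risk)
```

Running a check like `tapes_at_risk` on a schedule is one way to notice tapes that have not been rotated off-site quickly enough.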
These sorts of investments will help business continuity.
As with safety on the highway, better late than never. There is
a need for accountability. The choice is yours.
M. D. Agrawal is Chief Manager, IS Refinery System,
Bharat Petroleum Corporation Ltd.