Routes to escape

In the February 2002 issue of this magazine I spoke about disaster recovery and how to plan an escape route. In that article I discussed two steps: identifying the elements of failures and risks, and drafting and implementing a comprehensive disaster policy. In this article I am going to highlight some precautions one can take to avoid disasters and to cope with failure, which can affect infrastructure, client, and application systems. In each of these IT domains, all points of failure and disaster situations need to be assessed separately. Business continuity is ensured if you plan risk management strategies in advance, plan resources to handle failures and disasters, and set procedures for recovery. I think disaster and recovery planning should be an integral part of the design and deployment of any IT system.

A sound disaster recovery/business continuity plan is essential to protect the well-being of an organization.
By M. D. Agrawal

We should not confuse disaster planning with the design of fail-proof systems, though the success of each depends on the other. There is a need to architect the best possible systems and leave no scope for failure while designing both applications and networks. Two key features of fail-proof design are:

a) Use of reliable hardware with in-built fault tolerance and redundant functionality, and a network design with redundancy both in carriers and in end-systems like switches and routers (the QoS approach).

b) Design and deployment of Software Systems with support of a balanced architecture design, use of tested tools, and adhering to known quality standards.

A good practice is to learn from past failures and build solutions on those lessons. There is no handbook that is the ultimate reference for failure and disaster controls.

Disaster recovery plans vary greatly, depending on the size and scope of a company and how critical IT is to its operation.

Business Managers and IT Managers should discuss what kind of recovery plan is necessary, and which systems and business units are most crucial to the company. They should decide on people responsible for declaring a disruptive event and mitigating its effects. Most importantly, the plan should establish a process to locate and communicate with employees after a disruptive event.

As per Gartner's recommendations, the best way to get started is to conduct a BIA (Business Impact Analysis). This will identify the most crucial systems in the business and the effect of an outage on the business. The greater the potential impact, the more money a company should spend to restore systems quickly. For instance, a stock trading company may decide to pay for completely redundant systems that will allow it to immediately start processing trades at another location. On the other hand, a manufacturing company may decide that it can wait for 24 hours to resume shipping. A BIA will also help companies set a restoration sequence to determine which parts of the business should be restored first.
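The restoration sequence that a BIA produces can be pictured as a simple ranking exercise. The following is only a minimal sketch in Python, not Gartner's method: the class name, the field names, and all the figures are hypothetical, and it assumes that each system has already been given an estimated hourly loss and a maximum tolerable outage.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SystemImpact:
    """One row of a hypothetical BIA worksheet."""
    name: str
    hourly_loss: float          # assumed loss per hour of outage (illustrative)
    max_tolerable_hours: float  # how long the business can wait for this system


def restoration_sequence(systems):
    """Order systems for recovery.

    Systems that tolerate the least downtime come first; ties are broken
    by the larger hourly loss.
    """
    return sorted(systems, key=lambda s: (s.max_tolerable_hours, -s.hourly_loss))


# Illustrative figures only; they are not taken from any real BIA.
EXAMPLE = [
    SystemImpact("shipping", 2_000.0, 24.0),
    SystemImpact("trade processing", 500_000.0, 0.0),
    SystemImpact("email", 1_000.0, 8.0),
]

if __name__ == "__main__":
    for system in restoration_sequence(EXAMPLE):
        print(system.name)
```

In a real BIA the ranking would also weigh dependencies between systems, which this sketch leaves out.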

You can form a comprehensive policy on risk management practices that covers the design of the disaster recovery plan and the selection and deployment of hardware and software elements. The systems chosen for contingency must be the right ones. You can choose between distributed systems and centralized systems. Sometimes distributed systems offer a better option for recovery, as the NASDAQ experience shows.

You can then design and develop the fail-over applications. Developing fail-proof systems for heterogeneous clients, servers, and peer systems is a big task. The system administrator should study the fixes, patches, and applications in detail to counter the shortcomings and bugs, especially in desktop and server operating systems.

Training programs for IT staff should also be incorporated in the design to create awareness of the subject. I have seen that in most cases, training and awareness on a critical subject like ensuring data and network availability are left solely to the operations staff, and the basic infrastructure that ensures data backup and fail-over is missing from the initial plans. This is the wrong approach. It is necessary to include redundancy and fail-over solutions in any IT system design.

The September 2001 attack on the World Trade Center tested the contingency plans of American businesses to an unanticipated degree. Companies that had business continuity plans and contracts in place with vendors of recovery services were able to continue business at alternate sites with minimum downtime and loss of data. And the alternate facilities provided by the vendors performed rather efficiently.

NASDAQ recovered from the attack within six days. This was possible because of its distributed systems, contingency planning, and redundant architecture. The task now is fresh planning, i.e. how to further reduce the recovery time and avoid business loss.

We may witness disaster for any reason and in any form. It may be a natural calamity or initiated by humans. Are we prepared for it?

Organizations should not neglect to review and update their business continuity plans and contracts. Only if organizations learn from the experiences of these business houses will they be able to add value to the lessons from the attack.

Role of outsourcing in disaster and recovery
Another option, which can turn out to be beneficial, is to engage an outsourcing agency to step in when a disaster occurs. Considering the large amount of routine work involved, back-to-back arrangements for recovery and backup with an outsourcing partner are viewed as a better option. These will supplement backups kept at ISP sites.

Options for contingency planning
Standby Computers: You can create standby computer systems and locate them in air-conditioned computer rooms in trailers. The equipment can be delivered to the nominated sites in the event of a disaster. While it may not be practical to make all systems redundant, do not compromise when it comes to mission critical systems. An alternative is the outsource route, especially if there is a very large requirement for desktops and low-end servers.

Data Management: Efficient data management can provide facilities that automatically back up computer data to remote vaults or mirror copies of databases held at a recovery center. This may not provide fail-over for the system but will ensure data availability.
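A backup is only useful if the copy in the vault matches the original. As a rough illustration, the Python sketch below copies a file to a vault directory and verifies it with a checksum. It assumes the "remote vault" is simply a directory reachable from the host (for example, a mounted network share); the function names are hypothetical and do not come from any backup product.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_and_verify(source, vault_dir):
    """Copy `source` into `vault_dir` and confirm the copy is identical.

    `vault_dir` stands in for the remote vault; here it is assumed to be
    an ordinary directory path.
    """
    source = Path(source)
    vault = Path(vault_dir)
    vault.mkdir(parents=True, exist_ok=True)
    target = vault / source.name
    shutil.copy2(source, target)  # copies the data and the file timestamps
    if sha256_of(source) != sha256_of(target):
        raise IOError(f"backup of {source} failed verification")
    return target
```

A scheduler would call `backup_and_verify` for each file to be protected. Mirroring a live database needs the database's own replication tools and is outside the scope of this sketch.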

Facilities and utilities: The facilities supporting a data center, like power and network channels, should be planned along with backup procedures. The plan should encompass dual power source and supply, and redundant network links from different utility feeds. The team commissioned to design, install, and test the offsite recovery of the data center should give its feedback to the project manager.

Legato published a report, which appeared in CIO Magazine, on the pitfalls of IT systems during the September 11 attacks. I will leave you with some lessons which we can learn from.

The most serious predicaments centered on:

  • Lack of documented recovery procedures.
  • Configuring replacement hardware without documentation detailing the original systems configuration and setups.
  • Lack of tape documentation and tracking and inadequate tape archive policies.
  • Lack of protection for departmental servers.

To deal with these problems, Legato has identified five key lessons:

Lesson 1:
Document system configurations: Many customers had not adequately documented their system configurations. This fundamental precaution takes the form of a simple but highly important document that includes the configuration addresses and settings required to make the infrastructure operational. The document should be stored off-site.
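Such a configuration document need not be elaborate. The Python sketch below shows one possible shape for it, under stated assumptions: the record layout and function names are hypothetical, and the addresses and settings are supplied by the administrator rather than discovered automatically.

```python
import json
import platform
import socket
from datetime import datetime, timezone


def build_config_record(settings):
    """Combine basic host facts with administrator-supplied settings.

    `settings` is a plain dict of whatever is needed to rebuild the
    system (addresses, gateways, volume layouts and so on).
    """
    return {
        "recorded_at": datetime.now(timezone.utc).isoformat(),
        "hostname": socket.gethostname(),
        "os": platform.platform(),
        "settings": dict(settings),
    }


def save_config_record(record, path):
    """Write the record as readable JSON, ready to be stored off-site."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(record, handle, indent=2, sort_keys=True)


if __name__ == "__main__":
    # 192.0.2.x is a range reserved for documentation examples.
    record = build_config_record({"ip_address": "192.0.2.10",
                                  "gateway": "192.0.2.1"})
    save_config_record(record, "config_record.json")
```

The resulting file is only as good as the settings fed into it, and it still has to be copied off-site by some other means.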

Lesson 2:
Document and archive disaster recovery procedures: It is important to have a comprehensive, written disaster recovery procedure in place. The documents should be archived off-site along with business-critical data.

Lesson 3:
Safeguard, document and track tape media: Based on its work with 18 customers following the September 11 disaster, Legato estimates that about 30 percent of business data was lost because it had not been backed up or rotated off-site quickly enough. Very few customers had documentation recording the contents of the tapes. Tape contents must be documented and, again, the documentation stored off-site.
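Tape tracking of this kind can start as a very small catalog. The Python sketch below flags tapes that have no recorded contents or that have stayed on-site too long. The `Tape` fields, the one-day rotation limit, and the sample entries are all assumptions made for illustration; they are not taken from Legato's report.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta
from typing import List, Optional


@dataclass
class Tape:
    """One entry in a hypothetical tape catalog."""
    label: str
    written_on: date
    contents: List[str] = field(default_factory=list)  # what is on the tape
    offsite_on: Optional[date] = None                   # None = still on-site


def tapes_at_risk(tapes, today, max_days_onsite=1):
    """Return labels of tapes that are undocumented or overdue for rotation."""
    limit = timedelta(days=max_days_onsite)
    at_risk = []
    for tape in tapes:
        undocumented = not tape.contents
        overdue = tape.offsite_on is None and (today - tape.written_on) > limit
        if undocumented or overdue:
            at_risk.append(tape.label)
    return at_risk


# Illustrative catalog entries only.
CATALOG = [
    Tape("T1", date(2002, 3, 1), ["payroll database"], date(2002, 3, 2)),
    Tape("T2", date(2002, 3, 1)),                  # contents never recorded
    Tape("T3", date(2002, 3, 1), ["mail store"]),  # never sent off-site
]

if __name__ == "__main__":
    print(tapes_at_risk(CATALOG, today=date(2002, 3, 5)))
```

The catalog itself is documentation, so a copy of it belongs off-site with the tapes.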

Lesson 4:
Identify and protect all business critical servers: Restoring business operations means that the servers that hosted business-critical applications like email, ERP and database servers have to be recovered. In addition to basic data protection, bare-metal recovery solutions allow easier replacement and rebuilding of critical application servers with minimal effort.

Lesson 5:
Online Data Protection Can Ease Recovery: The customers who sailed through the disaster were the ones who had instant access to online, off-site copies of their production data. This second or third copy of critical data had been produced in real-time by replication technologies.

These sorts of investments will help ensure business continuity. As with safety on the highway, better late than never. There is a need for accountability. The choice is yours.

M. D. Agrawal is Chief Manager, IS Refinery System,
Bharat Petroleum Corporation Ltd.


Copyright 2001: Indian Express Group (Mumbai, India). All rights reserved throughout the world. This entire site is compiled in Mumbai by The Business Publications Division of the Indian Express Group of Newspapers. Site managed by BPD