Issue of December 2005 

Disaster Recovery/Business Continuity

Prepared for the worst

Kumar Dawada brings you insights into how the banking and telecom sectors have faced the major disasters that have affected India recently. What lessons have they learnt and what type of business continuity planning have they in place to ensure that their business continues without substantial downtime?

Anything that can go wrong, will
Murphy’s Law

A major disaster like the tsunami in Tamil Nadu, the Mumbai floods or the Kashmir earthquake makes headlines because of its dramatic and sensational nature. However, many low-profile disasters also have far-reaching effects on business organisations, large and small.

Contemporary business lives under the sword of disaster, and the larger the organisation the more it has to lose. Smaller businesses, however, suffer more because they are unable to absorb losses and respond to sudden downtime. This downtime can result from natural or man-made disasters such as earthquakes, fires, floods, tornadoes and terrorist attacks, or from an extended power loss, telecom failure or air-conditioning failure.

Life Under The Sword

DR today means not just restoring data and system access after an unscheduled downtime. It means having in place a well-planned and tested method of anticipating and responding to any eventuality which leads to downtime. The focus has shifted from reactive to proactive planning, and narrows down to preventing downtime from happening at all.

Even at the global level, the focus has shifted from DR, which concentrates only on the IT department, to Business Continuity Planning (BCP), which covers the entire business perspective including the business processes as well as skilled manpower responsible for running the business.

It is universally acknowledged that BCP is a costly and complex process. It needs the co-operation and co-ordination of the entire business organisation. It asks uncomfortable and embarrassing questions about business processes, and requires thorough understanding of all critical business processes and how downtime will affect each process and the organisation as a whole. The worst part is that it does not have any immediate business benefit. However, those who have it in place know from experience that a well-planned BCP pays for its high cost after the first or second use.

Lessons From Hard Knocks

Business organisations have learnt their lesson from the Tamil Nadu tsunami and Mumbai floods. The results are either investments in disaster recovery sites or development of comprehensive BCP. Some organisations have also outsourced DR to ensure business continuity in case of a major disaster.

“It is necessary to have generators and UPS on elevated platforms. The DR site or data centre too must be elevated, otherwise it becomes vulnerable to flooding,” says Sanjay Sharma, Head, IT, IDBI Bank. It is also necessary to build fuel capacity in-house for the generators. There must be a backup for at least 12 hours.

It is also essential to have standby generators in place. Many organisations tried to procure diesel or petrol from petrol pumps during the recent disasters. However, the pumps were not working due to power failure, so a manual hand-pump had to be used, and the fuel had to be carried by hand because the floods had stopped all other transport. This scenario will recur when another major disaster strikes, with the likelihood of even the DR sites going down. Organisations had assumed that during a disaster they would be able to fly in people from other locations to the DR site, but during the Mumbai floods even the airport was non-functional.

The Mumbai floods can be labelled a partial disaster because the flooding took place only during the late afternoon. Most people were already at work, so the DR site was already operating. However, it has now been realised that the DR site has to be working round the clock, every day of the year, to tackle a full-fledged disaster, and that it is necessary to have an active DR site operated by dedicated staff.

Sunil Gupta, Head, Product Management & Business Operations (IDC) of Reliance Infocomm says that despite the heavy rains and power failure of 26/7, the Reliance Infocomm data centre was fully functioning. This is because the level-3 data centre’s roofs were waterproof and prevented water from entering server halls. Sand bags were deployed at key points to prevent water from entering the campus, and de-watering pumps were used to remove water from the power substation outside the data centre building.

Diesel generators were activated to prevent any power outage from the main grids, and enough diesel was stored to run the facility for three days without the need of fuel from outside. The Reliance Infocomm data centre is claimed to be resistant to natural disasters like flood, earthquakes and lightning, and even man-made disasters like power failure and riots.
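The fuel figures quoted above (a 12-hour minimum and a three-day reserve) translate into stored volume with simple arithmetic. The following is a minimal sketch; the burn rate of 50 litres per hour is a hypothetical figure chosen for illustration, not one given by either organisation.

```python
def runtime_hours(fuel_litres: float, burn_litres_per_hour: float) -> float:
    """Hours a generator can run on stored fuel at a constant burn rate."""
    if burn_litres_per_hour <= 0:
        raise ValueError("burn rate must be positive")
    return fuel_litres / burn_litres_per_hour


def fuel_needed(target_hours: float, burn_litres_per_hour: float) -> float:
    """Litres to keep on site to cover target_hours without resupply."""
    return target_hours * burn_litres_per_hour


# Hypothetical generator burning 50 litres of diesel per hour.
BURN_RATE = 50.0

twelve_hour_stock = fuel_needed(12, BURN_RATE)    # 12-hour minimum
three_day_stock = fuel_needed(3 * 24, BURN_RATE)  # three-day reserve

print(twelve_hour_stock, three_day_stock)
```

The same functions can be run in reverse during an audit: measuring the tank and the actual burn rate gives the real runtime to compare against the target.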


IDBI: Revolving around awareness

Sanjay Sharma

While it is good to have a strong BCP and DR site, it is also necessary to create a DR/BC-aware organisation. For this, training is given to people when they join the organisation.

Training covers the infrastructure in place, the solution that has been opted for, the impact of the solution on the business and, most importantly, the impact of the solution not being available on time. The bank’s users are also aware of the functions that will be affected first by a disaster (for instance, anywhere banking or channels), and of the time gap before the DR site takes over from the main data centre. A training programme also exists for senior-level management (such as cluster heads and zonal heads) to make them aware of what they are expected to do to tackle disasters.

IDBI’s DR Tech

IDBI’s primary site is at Mumbai while the DR site is at Chennai. “The Chennai infrastructure will be able to take 100 percent load of the primary data centre in Mumbai,” says Sanjay Sharma, Head of IT at IDBI Bank.

This is significant because if the DR site can take even 10 percent less load than the primary centre, there may be issues as the business volume grows: when disaster strikes, the organisation might not be able to operate from the DR site.
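The capacity argument can be made concrete with a small projection. This is only an illustrative sketch: the load, the capacities and the 2 percent monthly growth rate are hypothetical values, not IDBI figures.

```python
def months_until_shortfall(current_load, dr_capacity, monthly_growth, horizon=600):
    """Return the first month in which projected load exceeds DR capacity.

    Load compounds by monthly_growth each month. Returns None if the
    capacity is not exceeded within `horizon` months.
    """
    load = current_load
    for month in range(horizon + 1):
        if load > dr_capacity:
            return month
        load *= 1 + monthly_growth
    return None


# Hypothetical figures: primary sized for 1000 transactions/sec,
# DR site sized 10 percent lower, current peak load 700, growth 2% a month.
primary_limit = months_until_shortfall(700, 1000, 0.02)
dr_limit = months_until_shortfall(700, 900, 0.02)

print(primary_limit, dr_limit)
```

Under these assumptions the undersized DR site runs out of headroom several months before the primary does, which is the window in which a disaster would leave the organisation unable to operate from DR.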

“We are working with IBM for the new data centre and have the IBM P-590 series put in place at both the DR and primary sites. We use SD-8300 storage boxes as well as lots of other products and tools that will automate the disaster recovery process,” reveals Sharma.

IDBI uses a hot DR concept and 100 percent replication takes place between the Mumbai and Chennai sites. The sites are connected by broadband links, and the replication takes place on a regular basis.

Activation Responsibility

Though the disaster recovery policy is well-defined, the bank faced the issue as to who will officially declare that disaster has struck, and who will take the decision that the DR site has to be activated. People who operate the site may make mistakes, so an automated DR process was put in place; consequently, when an event takes place the DR site is activated.

The option of manual activation is retained. Once it is officially declared, the concerned person has to run certain options and key in the proper commands to activate the DR site.

The bank has a high-speed link between the DR site and the primary site, and an alternate route for each path of the link so that if one link goes offline the other is available. Network symmetry is ensured so that when the primary site goes down the secondary site takes over without any complication from the perspective of network connectivity.

Getting Priorities Straight

DR is a subset of BCP. “You have to describe how business continuity is achieved. Create the DR site, create the information structure and IT platforms. But even though everything may be automated, some intelligent process is required at local branch levels,” says Sharma.

According to him, when you talk about BCP you also talk about technology and business-related issues, and see how you will operate the branch in case of a disaster. Hence, BCP is a combination of technology, non-technology, manual processes and other external dependencies.

Not all functions, products or categories of operations can be treated at par. Since setting up and maintaining DR infrastructure is expensive, the applications and business processes have to be classified into categories like platinum, gold and silver. The RTO and RPO (Recovery Time Objective and Recovery Point Objective) vary depending on the level and type of risk associated with each application.
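A tiering scheme of this kind can be written down as a small table of objectives and checked against drill results. The sketch below is an assumption-laden illustration: the tier names come from the article, but the RTO and RPO values in minutes are hypothetical, as is the `meets_objectives` helper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    name: str
    rto_minutes: int  # maximum tolerated time to restore service
    rpo_minutes: int  # maximum tolerated window of lost data


# Hypothetical objectives; real values come from business impact analysis.
TIERS = {
    "platinum": Tier("platinum", rto_minutes=15, rpo_minutes=5),
    "gold": Tier("gold", rto_minutes=240, rpo_minutes=60),
    "silver": Tier("silver", rto_minutes=1440, rpo_minutes=1440),
}


def meets_objectives(tier, recovery_minutes, data_loss_minutes):
    """True if a measured drill result satisfies both RTO and RPO of the tier."""
    return (recovery_minutes <= tier.rto_minutes
            and data_loss_minutes <= tier.rpo_minutes)


# Example: a drill recovered a gold application in 300 minutes.
print(meets_objectives(TIERS["gold"], 300, 30))
```

Keeping the objectives in one place makes it easy to compare every dry run against the same targets.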

Organisations have to constantly ponder what is the affordable downtime for the process or business function. “In an effective DR and BCP, every function of the business is included so you can’t ignore anything. You have to see what is more critical and pay more attention to that first,” stresses Sharma.

According to him, there will always be gaps in RPO. A zero RPO is prohibitively expensive to achieve; in fact, it may cost even more than the data lost in the disaster. Clearly, the organisation has to strike a balance between cost and RTO/RPO.
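The balance between cost and RPO can be illustrated by comparing the yearly cost of each replication option with the data loss it is expected to leave uncovered. All names and figures below are hypothetical, and the model is deliberately crude: it assumes the worst-case loss (a full RPO window) at every disaster.

```python
def total_annual_cost(solution_cost, rpo_minutes, loss_per_minute, disasters_per_year):
    """Yearly solution cost plus expected yearly cost of data lost within the RPO."""
    return solution_cost + rpo_minutes * loss_per_minute * disasters_per_year


# Hypothetical options: name -> (annual cost, RPO in minutes).
OPTIONS = {
    "synchronous": (5_000_000, 0),
    "15-minute async": (1_500_000, 15),
    "daily tape": (200_000, 1440),
}

LOSS_PER_MINUTE = 10_000   # assumed cost of one minute of lost data
DISASTERS_PER_YEAR = 0.2   # assumed: one disaster every five years

costs = {
    name: total_annual_cost(cost, rpo, LOSS_PER_MINUTE, DISASTERS_PER_YEAR)
    for name, (cost, rpo) in OPTIONS.items()
}
cheapest = min(costs, key=costs.get)

print(costs, cheapest)
```

With these assumed numbers the zero-RPO option costs more than the data loss it prevents, and the middle option comes out cheapest overall.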

The DR Wish list

Sanjay Sharma, IT Head of IDBI Bank, feels that in theory heterogeneous systems talk to each other, but in reality different vendor-based solutions are not compatible. Although some high-end systems provide heterogeneity, it is not always easy.

Sharma’s biggest wish is that heterogeneous systems should talk to each other irrespective of the vendor who makes them, and no matter what application or infrastructure is being used.

The cost of bandwidth should also become more reasonable. Remote manageability of solutions should be more mature. High-end systems have the option of zero downtime, and can replace any part or component without any shutdown. But there are constraints there as well. Further, scripts should be automated and they should work with the RPO and RTO.

C N Ram, Head, IT, HDFC Bank, feels that the software that is used to replicate data must work seamlessly across diverse geographical locations. There must be different software designs put in place to replicate data instantly.

Towards Active DR

Instead of having a passive DR activated only during drills or an actual disaster, it is best to create an active DR site whereby the organisation can run queries from the DR site and balance operations between the DR and primary site. This will result in the optimum utilisation of resources deployed at the DR site, and node balancing between the DR and primary site.

Whenever there is a disaster or crisis, effectiveness of the DR site is tested. Factors include the magnitude of the disaster and whether people can work during it. Inputs can be obtained from the customers or users of the channels to find out the problems they face. Audits are done at regular intervals to know how effectively the DR site works.

“As the new infrastructure for DR between Mumbai and Chennai is ready, we intend to do dry runs at least once a month. The frequency will depend on factors like the impact it has on the business, branch operations and various channels. In reality, no dry run can be with zero downtime,” emphasises Sharma.

Anticipation And Assessment

Every business and process has potential risks. “We are Basel II compliant so we constantly review the operational risks and do methodological risk assessment. It is then quantified and measured in terms of loss of data because of that particular operation,” informs Sharma.

Risk is associated with every process and software. It has to be assessed how much business loss occurs due to downtime and the resulting business impact. It is then translated into the risk factor. Higher business impact risk factors are given higher DR priority and more sophisticated infrastructure is dedicated to them. The lower the business impact, the lower is the risk, so a different type of treatment is given for the DR scenario or BCP plan. This is an elaborate exercise.
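One simple way to turn downtime loss into a DR priority order is to multiply the loss per hour by the expected downtime and sort. This is a sketch of that idea only, not the bank’s actual methodology; the process names and figures are hypothetical.

```python
def rank_by_impact(processes):
    """Sort (name, loss_per_hour, expected_downtime_hours) tuples, highest impact first."""
    return sorted(processes, key=lambda p: p[1] * p[2], reverse=True)


# Hypothetical processes with assumed hourly loss and expected downtime.
processes = [
    ("HR portal", 10_000, 24),         # impact 240,000
    ("Internet banking", 300_000, 6),  # impact 1,800,000
    ("ATM switch", 500_000, 4),        # impact 2,000,000
]

ranking = [name for name, _, _ in rank_by_impact(processes)]
print(ranking)
```

Processes at the top of the ranking would be candidates for the more sophisticated DR infrastructure, and those at the bottom for cheaper treatment.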

Threats from external sources include somebody trying to crack your site and affecting the reputation of the company or the finances of the company. Each threat is converted into a risk. When real disasters strike, in spite of risk and threat assessment, the things that actually happen are usually beyond expectation. That’s why proper BCP and DR plans must be available in each zone to co-ordinate with the user because during this period panic and anxiety levels among users are high.

Currently, everything is managed from a centralised environment and the branch is only a virtual branch. For each activity or business process, the DR planning or BCP can be done, and it is possible to find out what the local impact is. It is also possible to calculate how a branch operation will be impacted.

In Internet banking, external threats can be prevented to some extent by having a DMZ (demilitarised zone), firewalls, regular patch upgrades, etc. External vendors can even carry out ethical hacking to test the organisation’s belief in its infrastructure security.

Internal threats can arise from misuse of rights, employees intending to take company data away with them, misuse of access to obtain information that a person is not supposed to see, or a person having access rights on too many systems. A change management system, once in place, mitigates many of these risks and helps define how internal and external threats are countered.

A BCP must have the following steps—analyse the business, assess the risk, develop the DR and other strategies, develop actual DR and business continuity plan, and finally keep on rehearsing the plan. Having a BCP not only helps reduce financial loss and loss of marketshare, it also protects assets including employees, and reduces or prevents bad publicity.


Fighting Fit

V Babu, DGM, IT, Bank of India, on the DR/BC strategies that Bank of India relies on

According to the Reserve Bank of India’s (RBI) guidelines, it is imperative for every banking and financial institution to have a disaster recovery plan in place. Banks are still in the process of framing a BCP, but they have to be ready to tackle serious business disruptions. V Babu, Deputy GM, IT Department, Bank of India, feels it makes more business sense to outsource their DR and BCP management. Consequently, the organisation is able to focus on its core competencies and provide a better banking experience to its customers.

DR Works

PricewaterhouseCoopers was consulted for DR, and the strategy is updated every year. “The DR site for core banking is at Bangalore, and it is provided for by HP,” says Babu. HP has built and is managing the data centre, disaster recovery site, help desk and call centre for Bank of India.

DR awareness among the employees as well as top-level management staff is necessary. This is achieved through the information security policy as well as the DR set-up of the organisation.

The DR plan is basically implemented in three stages. For the core banking services the DR site is at Bangalore. For non-core banking services there are dual servers and tape backups. A testing cycle is implemented every month to ensure that the data on the tape can be restored properly.

“Auditing the DR site is an ongoing process. The information system audit group goes and audits all branches and the necessary review is done. The reports are then presented to the IT department to rectify any fault in the system,” says Babu.

As the DR plan is based on industry standards, all potential risks and threats are taken into account and assessed. Even during the Mumbai floods of 26/7, the ATMs were functioning properly and most branches were running normally. There was no major issue due to the DR plan and DR awareness among the employees.


The reliance on DR

Sunil Gupta, Head, Product Management and Business Operations (IDC), Reliance Infocomm on the organisation’s DR/BC strategy

“Our DR set-up consists of a highly redundant data centre. The primary site in Mumbai has never experienced a power failure in the last five years. Even the air conditioners and other systems never fail. The data centre is prepared for the eventuality whereby even if all of Mumbai or Maharashtra does not have power for days, it will not be affected,” says Sunil Gupta, Head, Product Management and Business Operations (IDC), Reliance Infocomm. To make a call at the time of disaster is critical for the customer and this is when the real need of a service is felt.

For a telecom provider, the infrastructure hosting critical applications has to be robust. The primary site in Mumbai and the Bangalore DR centre are built to level 3 specifications to withstand earthquakes, floods and lightning. There is no single point of failure at either the primary site or the DR site in Bangalore.

All critical applications have a parallel set-up at Bangalore. As the data in Mumbai is on a SAN, SAN-to-SAN replication is done and if the data is lost or the server is down, it is available at the DR site due to replication using high capacity links.

Classifying Applications For DR

“For critical applications, the replication is done in real-time at Mumbai itself. A three-way DR recovery concept is used. Another DR site or data centre is set up at Mumbai itself. It is used to store any information, function or process for which a delay is not acceptable. Hence a second copy is available at the Mumbai DR site and a third at the DR site in Bangalore,” explains Gupta.

If data, process or application is not critical, then a warm DR recovery concept is implemented. The application is still running at the other end (Bangalore DR site) and data replication occurs in a batch mode every hour or every day as the case may be.

In addition to all this there is also a cold site for storing the data. This is to cover the unlikely scenario where both the primary and the DR site go down. In such a scenario, it is still possible to recover the data from the cold site. Hence, application classification is done depending on the type of data or process and its impact on business, and customers.
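The classification described above can be summarised as a lookup from criticality to replication mode and copies. The sketch below is an illustration under stated assumptions: the function and field names are hypothetical, the cold site is listed without a location because the article gives none, and treating real-time replication as zero data loss is an idealisation.

```python
from enum import Enum


class Criticality(Enum):
    CRITICAL = "critical"
    NON_CRITICAL = "non-critical"


def replication_plan(criticality, batch_interval_hours=1.0):
    """Return replication mode, copies held and worst-case data loss in hours."""
    if criticality is Criticality.CRITICAL:
        return {
            "mode": "real-time",  # three-way set-up described by Gupta
            "copies": ["Mumbai primary", "Mumbai DR", "Bangalore DR", "cold site"],
            "worst_case_loss_hours": 0.0,  # idealised; ignores in-flight writes
        }
    return {
        "mode": "batch",  # warm DR, replicated every hour or every day
        "copies": ["Mumbai primary", "Bangalore DR", "cold site"],
        "worst_case_loss_hours": batch_interval_hours,
    }


hourly = replication_plan(Criticality.NON_CRITICAL, batch_interval_hours=1)
daily = replication_plan(Criticality.NON_CRITICAL, batch_interval_hours=24)
print(hourly["worst_case_loss_hours"], daily["worst_case_loss_hours"])
```

The point of the sketch is that in batch mode the replication interval is also the worst-case window of lost data, which is why the interval is chosen per application.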

The NNOC DR

Being a telecom company, Reliance Infocomm also has a huge network operation centre where people work on a 24X7 basis. The company’s main business is that of a service provider.

To handle situations where the NNOC (National Network Operations Centre) is not operational, a parallel NNOC has been set up at Hyderabad. It can take over operations without any service getting affected.

Infrastructure And DR Operations

At the primary site, all databases are stored on a SAN with a capacity of more than 500 terabytes, which is claimed to be the biggest in India. Storage also uses concepts like BCV (business continuity volumes). Protection is layered: the servers are redundant, they are clustered, the data is kept on the SAN, and a SAN BCV is taken. The data kept on tape is regularly restored on different systems to check whether the restoration succeeds and whether the data is unchanged. The frequency of this tape backup and restoration differs depending on the priority and nature of the data.
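Checking that restored data "is still the same" is commonly done by comparing checksums of the original and the restored copy. The following sketch uses SHA-256 from the Python standard library; it illustrates the general technique and is not a description of the tools Reliance Infocomm uses. The file names and contents are made up for the demonstration.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large backups need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def restore_is_intact(original, restored):
    """True if the restored file is byte-for-byte identical to the original."""
    return sha256_of(original) == sha256_of(restored)


# Demonstration with throwaway files standing in for a backup and two restores.
with tempfile.TemporaryDirectory() as workdir:
    source = Path(workdir) / "source.dat"
    good = Path(workdir) / "restored_ok.dat"
    bad = Path(workdir) / "restored_bad.dat"
    source.write_bytes(b"ledger entries")
    good.write_bytes(b"ledger entries")
    bad.write_bytes(b"ledger entrie5")  # one corrupted byte
    intact_result = restore_is_intact(source, good)
    corrupt_result = restore_is_intact(source, bad)

print(intact_result, corrupt_result)
```

In practice the checksum of the original would be recorded at backup time, so the comparison still works after the source data has changed.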

A Well-Defined BCP

Reliance Infocomm has two certifications—ISO 9000:2000 and BS 7799. There are audits for every aspect of the infrastructure. This includes not just the IT resources but also the processes governing them. Processes are available for audits, mock drills, physical resources, logical assets like backup storage, firewalls and the DR site.

Due to the BS 7799 certification, the organisation has well-defined written processes. The documentation is also comprehensive and covers 50 manuals. For every activity the company follows ITIL processes—be it training, design, implementations, change management, event management, or troubleshooting processes.

Similarly, the organisation’s full-fledged DR plan and security policy have helped it define every possible disaster from the macro as well as the micro perspective. The macro perspective includes natural disasters like earthquakes, floods and lightning, and even man-made ones such as terrorism and riots. The micro perspective covers events like server crashes or system data getting corrupted. Every eventuality is covered by well-defined written procedures and action plans to adopt in worst-case scenarios.

“Though the Mumbai and Bangalore data centres are redundant due to mandatory provision, we do dry runs every week and shut off the power from the grid and run the DR site to ensure that it is working properly,” says Gupta.

Every area and procedure, from customer provisioning to customer support and from call centre activities to network components and the backend applications supporting them, is included in the DR and BC plan.

All services related to service provisioning, service support, operations and customer support are included in the scope of DR, and all run on the hot site or warm site concept. The surrounding administrative functions, which include sales and HR, are also covered by DR. In short, all applications and processes are well defined, categorised according to priority, and given DR treatment accordingly.

DR Infrastructure

The organisation uses an EMC solution for SAN-to-SAN replication. HP OpenView is used for network management to provide a view of the network across the entire country. Tools are used for network redundancy and network DR. There are customer and vendor-specific tools to give access to servers hosted anywhere in the country. For instance, in the event of a disaster in Mumbai, the Bangalore team can access those servers and manage them from there.

DR Awareness

As a part of ISO and BS 7799 certification compliance, each and every person working in the IDC building is trained for fire-fighting, how to respond to disasters and how to provide support.

The DR process is available on the portal and people are trained in the process. The company also has to report every six months to the certification body that people have been trained and retrained on this process. The refresher course is conducted at regular intervals. Surprise mock drills are carried out to check the response time and readiness of the people involved in supporting the DR site.


Banking on BCP

C N Ram, IT Head, HDFC Bank on how the bank deals with crises

C N Ram

C N Ram feels that given the current business scenario, DR has become a part of the overall business continuity planning process. The function of DR is to preserve data and prevent loss of customer and service related data. This can be handled by the IT department. However, during a disaster many other business processes and departments are impacted, so a sound BCP involves processes as well as people who are skilled and capable of taking over and maintaining the DR site on a regular basis.

The primary site of HDFC Bank is in Mumbai, while the DR site is in Chennai, for both systems and operations. The DR site is used optimally through load balancing: the work is split between the primary and DR sites. As a result, people on both sites are familiar with the DR process and can take over and manage during periods of downtime. The DR site is online, and data from the banking system is replicated to it within 15 to 30 minutes, depending on traffic delay in the transmission lines.

Processes Covered

All corporate and retail banking systems are covered in the organisation’s DR and BCP. “We have an ATM switch which is Basel II-compliant. There is an identical system in Chennai. All ATMs are connected to the Chennai DR site in case of failure of the primary site,” says Ram. The entire retail and wholesale banking systems and cash management are also covered in DR. Periodic trials of the ATM switches are conducted every six months.

Though Chennai has its own share of disasters, it is still preferred as a DR site by most organisations. This is because of its good infrastructure, proper vendor support, skilled manpower to maintain the DR site, and inexpensive real estate.

26/7 Mumbai Floods

HDFC had some problems during the Mumbai floods especially with the telecom links that snapped at many places. Only the Reliance telecom network was working. There was also the issue of power failures.

Some ATMs were totally cut off due to flooding and the agencies hired to replenish the cash at the ATMs could not do so. However, the data centre was totally operational.

Ram feels that proper DR and business continuity planning is the key to a successful DR and BCP implementation. The DR and BCP should include all the business processes and people necessary to keep the IT resources working during the disaster and protect the data and system. The company should maintain the business continuity in such a way that the customer feels that the organisation is still in normal operation mode.

 
     
Indian Express - Business Publications Division

Copyright 2001: Indian Express Newspapers (Mumbai) Limited (Mumbai, India). All rights reserved throughout the world. This entire site is compiled in Mumbai by the Business Publications Division (BPD) of the Indian Express Newspapers (Mumbai) Limited. Site managed by BPD.