Disaster Recovery/Business Continuity
Prepared for the worst
Kumar Dawada brings you insights into
how the banking and telecom sectors have faced the major disasters that have
affected India recently. What lessons have they learnt and what type of business
continuity planning have they in place to ensure that their business continues
without substantial downtime?
"Anything that can go wrong, will."
Murphy's Law
Major disasters like the tsunami in Tamil Nadu, the Mumbai floods or the Kashmir
earthquake make headlines due to their dramatic and sensational nature. However,
there are many disasters that are low profile in nature but have far-reaching
effects on business organisations, large and small.
Contemporary business lives under the sword of disaster, and the larger the
organisation the more it has to lose. Smaller businesses suffer more because
they are unable to absorb losses and respond to sudden downtime. This downtime
can be due to anything from natural or man-made disasters like earthquakes,
fire, floods, tornados, terrorist attacks, etc., or it can be an extended power
loss, telecom failure or air-conditioning failure.
Life Under The Sword
DR today means not just restoring data and system access after an unscheduled
downtime. It means having in place a well-planned and tested method of anticipating
and responding to any eventuality which leads to downtime. The focus has shifted
from reactive to proactive planning, and narrows down to preventing downtime
from happening at all.
Even at the global level, the focus has shifted from DR, which concentrates
only on the IT department, to Business Continuity Planning (BCP), which covers
the entire business perspective including the business processes as well as
skilled manpower responsible for running the business.
It is universally acknowledged that BCP is a costly and complex process. It
needs the co-operation and co-ordination of the entire business organisation.
It asks uncomfortable and embarrassing questions about business processes, and
requires thorough understanding of all critical business processes and how downtime
will affect each process and the organisation as a whole. The worst part is
that it does not have any immediate business benefit. However, those who have
it in place know from experience that a well-planned BCP pays for its high cost
after the first or second use.
Lessons From Hard Knocks
Business organisations have learnt their lesson from the Tamil Nadu tsunami
and Mumbai floods. The results are either investments in disaster recovery sites
or development of comprehensive BCP. Some organisations have also outsourced
DR to ensure business continuity in case of a major disaster.
"It is necessary to have generators and UPS on elevated platforms. The
DR site or data centre too must be elevated, otherwise it becomes vulnerable
to flooding," says Sanjay Sharma, Head, IT, IDBI Bank. It is also necessary
to build fuel capacity in-house for the generators. There must be a backup for
at least 12 hours.
It is also essential to have standby generators in place.
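The 12-hour fuel requirement comes down to simple arithmetic: stored fuel divided by the generator's burn rate. The sketch below is a minimal illustration of that check; the tank size and burn rate are hypothetical figures, not numbers reported by any of the organisations in this article.

```python
def autonomy_hours(tank_litres: float, burn_litres_per_hour: float) -> float:
    """Hours a generator can run on the fuel stored on site."""
    if burn_litres_per_hour <= 0:
        raise ValueError("burn rate must be positive")
    return tank_litres / burn_litres_per_hour


def meets_target(tank_litres: float, burn_litres_per_hour: float,
                 target_hours: float = 12.0) -> bool:
    """True when stored fuel covers the target (12 hours is the figure quoted above)."""
    return autonomy_hours(tank_litres, burn_litres_per_hour) >= target_hours


# Hypothetical site: 2,000-litre tank, generator burning 150 litres per hour.
hours = autonomy_hours(2000, 150)
print(f"{hours:.1f} h of autonomy, meets 12 h target: {meets_target(2000, 150)}")
```

Running the same check with a three-day target (72 hours) shows how much more storage a facility would need to match the autonomy Reliance Infocomm describes later in the article.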
Many organisations had tried to procure diesel or petrol from petrol pumps during
the recent disasters. However, the pumps were not working due to power failure,
so a manual hand-pump had to be used and the fuel had to be manually transported
as no other transportation was available because of the floods. This scenario
is likely to repeat when another major disaster strikes, possibly with
even the DR sites going down. Organisations were under the assumption that during
a disaster they would be able to fly in people from other locations to the DR
site, but during the Mumbai floods even the airport was non-functional.
The Mumbai floods can be labelled a partial disaster because the flooding
took place only in the late afternoon. Most people were already
at work and so the DR site was already operating. However, it has now been realised
that the DR site has to be working round the clock, every day of the year, to tackle a full-fledged disaster,
and that it is necessary to have an active DR site operated by dedicated staff.
Sunil Gupta, Head, Product Management & Business Operations (IDC) of Reliance
Infocomm says that despite the heavy rains and power failure of 26/7, the Reliance
Infocomm data centre was fully functioning. This is because the level-3 data
centre's roofs were waterproof and prevented water from entering server
halls. Sand bags were deployed at key points to prevent water from entering
the campus, and de-watering pumps were used to remove water from the power substation
outside the data centre building.
Diesel generators were activated to cover any power outage
on the main grid, and enough diesel was stored to run the facility for three
days without the need of fuel from outside. The Reliance Infocomm data centre
is claimed to be resistant to natural disasters like flood, earthquakes and
lightning, and even man-made disasters like power failure and riots.
IDBI: Revolving around awareness
While it is good to have a strong BCP and DR site, it is also
necessary to create a DR/BC-aware organisation. For this, training is given
to people when they join the organisation.
Training is provided about the infrastructure in place, the solution that has
been opted for, the impact of the solution on the business, and most importantly,
the impact of the solution not being available on time. The bank's users
are also aware of the function that will be first affected by disasters (for
instance, anywhere banking or channels), and what the time gap is before the
DR site takes over from the main data centre. A training programme is also present
for senior-level management (such as cluster heads and zonal heads) to make
them aware of what they are expected to do to tackle disasters.
IDBI's DR Tech
IDBI's primary site is at Mumbai while the DR site is at Chennai. "The
Chennai infrastructure will be able to take 100 percent load of the primary
data centre in Mumbai," says Sanjay Sharma, Head of IT at IDBI Bank.
This is significant because if the DR site takes less load than the primary
centre even by 10 percent, then as the business volume grows there might be
issues; when disaster strikes, the organisation might not be able to operate
from the DR site.
"We are working with IBM for the new data centre and have the IBM P-590
series put in place at both the DR and primary sites. We use SD-8300 storage
boxes as well as lots of other products and tools that will automate the disaster
recovery process," reveals Sharma.
IDBI uses a hot DR concept and 100 percent replication takes place between the
Mumbai and Chennai sites. The sites are connected by broadband links, and the
replication takes place on a regular basis.
Activation Responsibility
Though the disaster recovery policy is well-defined, the bank faced the question
of who would officially declare that disaster has struck, and who would take
the decision that the DR site has to be activated. People who operate the site
may make mistakes, so an automated DR process was put in place; consequently,
when an event takes place the DR site is activated.
The option of manual activation is retained. Once it is officially declared,
the concerned person has to run certain options and key in the proper commands
to activate the DR site.
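The article does not say how the automated process detects an event, so the following is only a sketch under an assumed rule: the DR site is activated after a set number of consecutive missed heartbeats from the primary site, or as soon as a disaster is manually declared. The class name, the threshold of three and the heartbeat mechanism are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class FailoverMonitor:
    """Decides when to activate the DR site (illustrative only)."""
    max_missed: int = 3            # assumed threshold of consecutive missed heartbeats
    missed: int = 0
    manually_declared: bool = False

    def record_heartbeat(self, primary_alive: bool) -> None:
        # A healthy heartbeat resets the count; a missed one increments it.
        self.missed = 0 if primary_alive else self.missed + 1

    def declare_disaster(self) -> None:
        # Manual path: an authorised person officially declares the disaster.
        self.manually_declared = True

    def should_activate_dr(self) -> bool:
        # Either path is sufficient to trigger activation.
        return self.manually_declared or self.missed >= self.max_missed


monitor = FailoverMonitor()
for alive in [True, False, False, False]:
    monitor.record_heartbeat(alive)
print(monitor.should_activate_dr())
```

Requiring several consecutive misses is one way to avoid activating the DR site on a brief network glitch, while the manual path keeps the human declaration described above.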
The bank has a high-speed link between the DR site and the primary site, and
an alternate route for each path of the link so that if one link goes offline
the other is available. Network symmetry is ensured so that when the primary
site goes down the secondary site takes over without any complication from the
perspective of network connectivity.
Getting Priorities Straight
"DR is a subset of BCP. You have to describe how business continuity is
achieved. Create the DR site, create the information structure and IT platforms.
But even though everything may be automated, some intelligent process is required
at local branch levels," says Sharma.
According to him, when you talk about BCP you also talk about technology and
business-related issues, and see how you will operate the branch in case of
a disaster. Hence, BCP is a combination of technology, non-technology, manual
processes and other external dependencies.
Not all functions, products or categories of operations can
be treated at par. The applications and business processes have to be classified
into different categories like platinum, gold and silver since set-up and maintenance
of DR infrastructure is an expensive process. The RTO and RPO (Recovery Time and Recovery Point
Objectives) vary depending on the level and type of risk associated with
each application.
Organisations have to constantly ponder what the affordable downtime is for
each process or business function. "In an effective DR and BCP, every function
of the business is included so you can't ignore anything. You have to see
what is more critical and pay more attention to that first," stresses Sharma.
According to him, there will always be gaps in RPO. It cannot
be zero because achieving that is prohibitively expensive. In fact, the cost may even be more
than the cost of the data loss due to the disaster. Clearly, the organisation
has to strike a balance between cost and RTO/RPO.
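One way to picture the balance between cost and RTO/RPO is to pick, for each application, the cheapest tier whose recovery targets still fit the downtime and data loss the business can tolerate. The sketch below assumes three tiers named after the categories mentioned above; every RTO, RPO and cost figure in it is hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Tier:
    name: str
    rto_minutes: int     # time allowed to restore service
    rpo_minutes: int     # maximum window of lost data
    annual_cost: float   # hypothetical yearly cost, arbitrary units


# Hypothetical figures, most expensive tier first.
TIERS = [
    Tier("platinum", rto_minutes=15, rpo_minutes=5, annual_cost=100.0),
    Tier("gold", rto_minutes=240, rpo_minutes=60, annual_cost=40.0),
    Tier("silver", rto_minutes=1440, rpo_minutes=1440, annual_cost=10.0),
]


def cheapest_tier(max_downtime_minutes: int,
                  max_data_loss_minutes: int) -> Optional[Tier]:
    """Return the cheapest tier whose RTO and RPO fit the tolerances, or None."""
    fitting = [t for t in TIERS
               if t.rto_minutes <= max_downtime_minutes
               and t.rpo_minutes <= max_data_loss_minutes]
    return min(fitting, key=lambda t: t.annual_cost) if fitting else None


print(cheapest_tier(30, 10).name)      # platinum
print(cheapest_tier(480, 120).name)    # gold
print(cheapest_tier(2880, 2880).name)  # silver
print(cheapest_tier(5, 1))             # None: no tier is tight enough
```

The last call mirrors Sharma's point: a near-zero RPO may not be achievable at any price the organisation is willing to pay, so the tolerance itself has to be revisited.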
The DR Wish List
Sanjay Sharma, IT Head of IDBI Bank, feels that
in theory heterogeneous systems talk to each other, but in reality different
vendor-based solutions are not compatible. Although some high-end systems
provide heterogeneity, it is not always easy.
Sharma's biggest wish is that heterogeneous
systems should talk to each other irrespective of the vendor who makes
them, and no matter what application or infrastructure is being used.
The cost of bandwidth should also become more reasonable.
Remote manageability of solutions should be more mature. High-end systems
have the option of zero downtime, and can replace any part or component
without any shutdown. But there are constraints there as well. Further,
scripts should be automated and they should work with the RPO and RTO.
C N Ram, Head, IT, HDFC Bank, feels that the software
that is used to replicate data must work seamlessly across diverse geographical
locations. There must be different software designs put in place to replicate
data instantly.
Towards Active DR
Instead of having a passive DR activated only during drills or an actual disaster,
it is best to create an active DR site whereby the organisation can run queries
from the DR site and balance operations between the DR and primary site. This
will result in the optimum utilisation of resources deployed at the DR site,
and node balancing between the DR and primary site.
Whenever there is a disaster or crisis, effectiveness of the DR site is tested.
Factors include the magnitude of the disaster and whether people can work during
it. Inputs can be obtained from the customers or users of the channels to find
out the problems they face. Audits are done at regular intervals to know how
effectively the DR site works.
"As the new infrastructure for DR between Mumbai and Chennai is ready,
we intend to do dry runs at least once a month. The frequency will depend on
factors like the impact it has on the business, branch operations and various
channels. In reality, no dry run can be with zero downtime," emphasises
Sharma.
Anticipation And Assessment
"Every business and process has potential risks. We are Basel II compliant
so we constantly review the operational risks and do methodological risk assessment.
It is then quantified and measured in terms of loss of data because of that
particular operation," informs Sharma.
Risk is associated with every process and software. It has to be assessed how
much business loss occurs due to downtime and the resulting business impact.
It is then translated into the risk factor. Higher business impact risk factors
are given higher DR priority and more sophisticated infrastructure is dedicated
to them. The lower the business impact, the lower is the risk, so a different
type of treatment is given for the DR scenario or BCP plan. This is an elaborate
exercise.
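The translation from business impact to DR priority can be sketched as a simple ranking. The formula below (hourly loss multiplied by expected downtime and by a yearly likelihood) is an assumed, commonly used way of expressing expected loss, not the bank's actual method, and the process names and figures are hypothetical.

```python
def risk_factor(loss_per_hour: float, expected_downtime_hours: float,
                likelihood: float) -> float:
    """Expected loss: hourly loss x downtime x yearly likelihood (0 to 1)."""
    return loss_per_hour * expected_downtime_hours * likelihood


# Hypothetical processes: (loss per hour, expected downtime in hours, likelihood).
processes = {
    "hr_portal": (500, 24, 0.2),
    "atm_switch": (50_000, 4, 0.2),
    "internet_banking": (20_000, 6, 0.2),
}

# Highest risk factor first: these get DR priority and the better infrastructure.
ranked = sorted(processes,
                key=lambda name: risk_factor(*processes[name]),
                reverse=True)
print(ranked)
```

With these figures the ATM switch ranks first and the HR portal last, which matches the idea that lower business impact earns a lighter DR treatment.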
Threats from external sources include somebody trying to crack your site and
affecting the reputation of the company or the finances of the company. Each
threat is converted into a risk. When real disasters strike, in spite of risk
and threat assessment, the things that actually happen are usually beyond expectation.
That's why proper BCP and DR plans must be available in each zone to co-ordinate
with the user because during this period panic and anxiety levels among users
are high.
Currently, everything is managed from a centralised environment and the branch
is only a virtual branch. For each activity or business process, the DR planning
or BCP can be done, and it is possible to find out what the local impact is.
It is also possible to calculate how a branch operation will be impacted.
External threats to services such as Internet banking can be prevented to some extent by having
a DMZ (demilitarised zone), firewalls, patch upgrades, etc. Even ethical hacking
can be done by external vendors to test the organisation's belief in
its infrastructure security.
Internal threats can be due to misuse of rights, intent of employees to take
away company data with them, misuse of access to obtain information which a
person is not supposed to, or having access rights on too many systems. Once
a change management system is in place, it mitigates many of these risks and
helps define how internal and external threats are countered.
A BCP must have the following steps: analyse the business, assess the risk,
develop the DR and other strategies, develop the actual DR and business continuity
plan, and finally keep rehearsing the plan. Having a BCP not only helps reduce
financial loss and loss of market share, it also protects assets including employees,
and reduces or prevents bad publicity.
Fighting Fit
V Babu, DGM, IT, Bank of India, on the DR/BC strategies
that Bank of India relies on
According to the Reserve Bank of India's (RBI) guidelines, it is imperative
for every banking and financial institution to have a disaster recovery plan
in place. Banks are still in the process of framing a BCP, but they have to
be ready to tackle serious business disruptions. V Babu, Deputy GM, IT Department,
Bank of India, feels it makes more business sense to outsource their DR and
BCP management. Consequently, the organisation is able to focus on its core
competencies and provide a better banking experience to its customers.
DR Works
PricewaterhouseCoopers was consulted for DR, and the strategy is updated every
year. "The DR site for core banking is at Bangalore, and it is provided
for by HP," says Babu. HP has built and is managing the data centre, disaster
recovery site, help desk and call centre for Bank of India.
DR awareness among the employees as well as top-level management staff is necessary.
This is achieved through the information security policy as well as the DR set-up
of the organisation.
The DR plan is basically implemented in three stages. For the core banking services
the DR site is at Bangalore. For non-core banking services there are dual servers
and tape backups. A testing cycle is implemented every month to ensure that
the data on the tape can be restored properly.
"Auditing the DR site is an ongoing process. The information system audit
group goes and audits all branches and the necessary review is done. The reports
are then presented to the IT department to rectify any fault in the system,"
says Babu.
As the DR plan is based on industry standards, all potential
risks and threats are taken into account and assessed. Even during the Mumbai
floods of 26/7, the ATMs were functioning properly and most branches were running
normally. There was no major issue due to the DR plan and DR awareness among
the employees.
The reliance on DR
Sunil Gupta, Head, Product Management and Business
Operations (IDC), Reliance Infocomm on the organisation's DR/BC strategy
"Our DR set-up consists of a highly redundant data centre. The primary
site in Mumbai has never experienced a power failure in the last five years.
Even the air conditioners and other systems never fail. The data centre is prepared
for the eventuality whereby even if all of Mumbai or Maharashtra does not have
power for days, it will not be affected," says Sunil Gupta, Head, Product
Management and Business Operations (IDC), Reliance Infocomm. To make a call
at the time of disaster is critical for the customer and this is when the real
need of a service is felt.
As a telecom provider, infrastructure hosting critical applications has to be
robust. The primary site in Mumbai and the Bangalore DR centre are built with
level 3 specifications to withstand earthquakes, floods and lightning. There
is no single point of failure either at the primary or the DR site in Bangalore.
All critical applications have a parallel set-up at Bangalore. As the data in
Mumbai is on a SAN, SAN-to-SAN replication is done and if the data is lost or
the server is down, it is available at the DR site due to replication using
high capacity links.
Classifying Applications For DR
"For critical applications, the replication is done in real-time at Mumbai
itself. A three-way DR recovery concept is used. Another DR site or data centre
is set up at Mumbai itself. It is used to store any information, function or
process for which a delay is not acceptable. Hence a second copy is available
at the Mumbai DR site and a third at the DR site in Bangalore," explains
Gupta.
If data, process or application is not critical, then a warm DR recovery concept
is implemented. The application is still running at the other end (Bangalore
DR site) and data replication occurs in a batch mode every hour or every day
as the case may be.
In addition to all this there is also a cold site for storing the data. This
is to cover the unlikely scenario where both the primary and the DR site go
down. In such a scenario, it is still possible to recover the data from the
cold site. Hence, application classification is done depending on the type of
data or process and its impact on business and customers.
The NNOC DR
Being a telecom company, Reliance Infocomm also has a huge network operation
centre where people work on a 24x7 basis. The company's main business is
that of a service provider.
To handle situations where the NNOC (National Network Operations Centre) is
not operational, a parallel NNOC has been set up at Hyderabad. It can take over
operations without any service getting affected.
Infrastructure And DR Operations
At the primary site, all databases are stored on a SAN that has a capacity of
more than 500 terabytes, which is claimed to be the biggest in India. In storage
also, concepts like BCV (business continuity volumes) are used. First, there
are redundant servers, then a server cluster; the data is then kept
on the SAN and a SAN BCV is performed. The data kept on tape is constantly restored
on different systems to check whether the restoration is successful and whether
the data is still the same. This process of tape backup and restoration is done
on a regular basis. However, the frequency differs depending on the priority
as well as the nature of the data.
A Well-Defined BCP
Reliance Infocomm has two certifications: ISO 9000:2000 and BS 7799. There
are audits for every aspect of the infrastructure. This includes not just the
IT resources but also the processes governing them. Processes are available
for audits, mock drills, physical resources, logical assets like backup storage,
firewalls and the DR site.
Due to the BS 7799 certification, the organisation has well-defined written
processes. The documentation is also comprehensive and covers 50 manuals. For
every activity the company follows ITIL processes, be it training, design,
implementations, change management, event management, or troubleshooting processes.
Similarly, as the organisation has a full-fledged DR plan
and a security policy, it has helped it define every possible disaster from
the macro perspective as well as from the micro one. The macro perspective includes
natural disasters like earthquakes, floods, lightning and even man-made ones
such as terrorism and riots. The micro level perspective can be server crashes
or system data getting corrupted. Every eventuality is covered with well-defined
written procedures and the action plans to adopt in worst-case scenarios.
"Though the Mumbai and Bangalore data centres are redundant due to mandatory
provision, we do dry runs every week and shut off the power from the grid and
run the DR site to ensure that it is working properly," says Gupta.
Every area and procedure from customer provisioning to customer support, call
centre activities to network components and backend applications supporting
them are included in the DR and BC plan.
All services related to service provisioning, service support, operations and
customer support are included in the scope of DR and all are run on the hot
site or warm site concept. The surrounding administrative functions, which include
sales and HR, are also covered by DR. In short, all applications and processes are
well defined, categorised according to priority and given DR treatment
accordingly.
DR Infrastructure
The organisation uses an EMC solution for SAN-to-SAN replication. HP OpenView
is used for network management to provide a view of the network across the entire
country. Tools are used for network redundancy and network DR. There are customer
and vendor-specific tools to give access to servers hosted anywhere in the country.
For instance, in the event of a disaster in Mumbai, the Bangalore team can access
those servers and manage them from there.
DR Awareness
As a part of ISO and BS 7799 certification compliance, each and every person
working in the IDC building is trained for fire-fighting, how to respond to
disasters and how to provide support.
The DR process is available on the portal and people are trained in the process.
The company also has to report every six months to the certification body that
people have been trained and retrained on this process. The refresher course
is conducted at regular intervals. Surprise mock drills are carried out to check
the response time and readiness of the people involved in supporting the DR
site.
Banking on BCP
C N Ram, IT Head, HDFC Bank on how the bank deals
with crises
C N Ram feels that given the current business scenario, DR has become a part
of the overall business continuity planning process. The function of DR is to
preserve data and prevent loss of customer and service related data. This can
be handled by the IT department. However, during a disaster many other business
processes and departments are impacted, so a sound BCP involves processes as
well as people who are skilled and capable of taking over and maintaining the
DR site on a regular basis.
The primary site of HDFC Bank is in Mumbai, while the DR site is in Chennai, both
for systems and operations. The DR site is put to optimum use by
undertaking load balancing. The work is split between the primary and DR
sites. So, people on both sites are familiar with the DR process and can take over
and manage during periods of downtime. The DR site is online and data from the
banking system is replicated on the DR site within 15 to 30 minutes depending
on traffic delay in the transmission lines.
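A replication delay of 15 to 30 minutes sets the window of data that could be lost if the primary site failed between two replications. The sketch below shows one way to measure that window and compare it with an RPO; the timestamps and the 30-minute RPO are hypothetical, chosen only to match the range quoted above.

```python
from datetime import datetime, timedelta


def data_loss_window(last_replicated: datetime,
                     failure_time: datetime) -> timedelta:
    """Time span of changes not yet copied to the DR site at failure."""
    if failure_time < last_replicated:
        raise ValueError("failure cannot precede the last replication")
    return failure_time - last_replicated


def within_rpo(last_replicated: datetime, failure_time: datetime,
               rpo: timedelta) -> bool:
    """True when the unreplicated window does not exceed the RPO."""
    return data_loss_window(last_replicated, failure_time) <= rpo


# Hypothetical timeline: last successful replication at 14:00, failure at 14:25.
last_sync = datetime(2005, 7, 26, 14, 0)
failure = datetime(2005, 7, 26, 14, 25)
print(data_loss_window(last_sync, failure))                     # 0:25:00
print(within_rpo(last_sync, failure, timedelta(minutes=30)))    # True
print(within_rpo(last_sync, failure, timedelta(minutes=15)))    # False
```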
Processes Covered
All corporate and retail banking systems are covered in the organisation's
DR and BCP. "We have an ATM switch which is Basel II-compliant. There is
an identical system in Chennai. All ATMs are connected to the Chennai DR site
in case of failure of the primary site," says Ram. The entire retail and wholesale
banking systems and cash management are also covered in DR. Periodic trials of the
ATM switches are conducted every six months.
Though Chennai has its own share of disasters, it is still preferred as a DR
site by most organisations. This is because of its good infrastructure, the availability
of proper vendor support and skilled manpower to maintain the
DR site, and inexpensive real estate.
26/7 Mumbai Floods
HDFC had some problems during the Mumbai floods, especially with the telecom
links that snapped at many places. Only the Reliance telecom network was working.
There was also the issue of power failures.
Some ATMs were totally cut off due to flooding and the agencies hired to replenish
the cash at the ATMs could not do so. However, the data centre was totally operational.
Ram feels that proper DR and business continuity planning is the key to a successful
DR and BCP implementation. The DR and BCP should include all the business processes
and people necessary to keep the IT resources working during the disaster and
protect the data and system. The company should maintain the business continuity
in such a way that the customer feels that the organisation is still in normal
operation mode.