Continuity of Business In an MVS Environment
By: Mitchell H. Levine, CISA
Audit Serve, Inc.
The explosion at the World Trade Center on February 26, 1993 has once
again reminded us of the need to have an offsite processing facility along
with a proven plan which effectively restores the system within the
desired business timeframes. However, events like this also necessitate
the re-evaluation of contingency plans to ensure that all potential issues
are covered. That re-evaluation is the basis of this article.
There are many options available for offsite processing, which include:
- hot sites provided by disaster recovery vendors (e.g., SunGard,
  Comdisco, and IBM), which offer offsite processing facilities on a
  first-come, first-served basis
- a dedicated site, whereby a company has an alternate data center which
  contains a duplicate set of hardware that is not used for any daily
  business purposes
- a cold site, which consists of an offsite facility containing an empty
  room that has a raised floor and an air conditioning unit. In this
  situation, arrangements are made with hardware vendors to supply
  equipment within a specific timeframe upon the declaration of a disaster
- divided operations, in which one processing center's operation is
  divided between two separate data centers
The main focus of this article is the planning and execution of an
offsite contingency plan which uses a disaster recovery vendor's hot site.
This article is not intended to provide a checklist of required components
of a contingency plan. It will provide a thorough understanding of what
actually occurs at an offsite processing center and the steps that a
company must perform to ensure a successful restoration of operations at a
disaster recovery vendor's hot site.
Offsite Processing Environment Provided by Disaster Recovery Vendors
A disaster recovery vendor's purpose is to provide offsite processing
services that facilitate the transfer of a company's operation from its
home site to the hot site (i.e., offsite processing center) in the most
time-efficient manner. At the same time, the vendor must also obtain
maximum utilization of its equipment by servicing as many companies as
possible at the same time within the same system.
To achieve the latter, disaster recovery vendors utilize the PR/SM
(Processor Resource/Systems Manager) architecture, which allows LPARs
(i.e., logical partitions) to be defined. This enables multiple companies
to operate concurrently on one machine, but also ensures that each company
is isolated from the others.
The other method that is used to allow multiple companies to operate
within the same system is to use the VM operating system to define
multiple MVS guest operating systems.
Disaster recovery vendors have made improvements in the last year in
terms of providing facilities which reduce the time required to restore
one's system at the hot site. One of these improvements is providing
companies with a floor system, which is IPLed with the minimum system
software required to allow companies to restore their environment.
A floor system typically includes MVS, JES, TSO, and SDSF and is
configured for every piece of equipment contained within the environment.
However, the floor system is only a generic environment and cannot be used
as the system configuration for a company's daily processing environment.
Therefore, once the company's environment has been restored, the system is
re-IPLed with the company's own MVS system which is tailored for their
processing requirements.
Prior to the availability of floor systems, the disaster recovery
vendors provided a set of hardware with no operating system loaded. This
required each company to bring its own system software and perform an
initial IPL from a one pack MVS system in order to perform its
restoration. The one pack MVS system is referred to as a rescue pack. It
consists of one or two tape volumes containing the system datasets
(e.g., PARMLIB, master catalog, JES spool and checkpoint datasets, and
page datasets) required to perform an IPL, along with a copy of FDR or
DFDSS which is used to perform the restoration. Since the system
environment defined within the rescue pack does not reflect the
configuration required to run the normal system, the system will need to
be re-IPLed once it has been completely restored. It should be noted that
a rescue pack is also used for onsite contingency planning in the event of
a failure within the System Resident Volume (SYSRES).
The use of the floor system saves a significant amount of time (i.e.,
approximately one hour) that would otherwise be required to load the tapes
and perform a standalone IPL. It should also be noted that not all
disaster recovery vendors provide floor systems.
Offsite Facility Compatibility
The probability that a disaster recovery vendor will provide an offsite
facility that matches the exact equipment requirements of its customers is
remote. Each company should analyze these differences to determine the
impact on its offsite contingency strategy. A difference in the
processors between a company's home site and a disaster recovery vendor's
offsite processing facility would not be a major compatibility issue as
long as the company's memory and CHPID (channel path identifier)
requirements are met.
However, differences in DASD can be a major problem if the offsite
processing center's DASD uses different device types or if its DASD has a
lower density than a company's home site. For instance, it would not be
possible to restore a full volume backup from a company's home site's
triple density DASD volume to a hot site's single density volume.
In addition, the hot site should have an amount of DASD equivalent to
what a company requires to perform its production processing. This
includes work packs (i.e., used for sort work space and temporary
datasets), since the workspace requirements of the batch processing cycle
remain the same at the hot site. The DASD used to support software
development would not need to be available at the hot site since software
development is not usually required by an offsite contingency plan.
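The compatibility rules described above (same device type, equal or
greater density, and enough volumes overall) can be expressed as a small
check. The following is only a sketch under stated assumptions: the
Volume class, the function names, and the cylinder counts are
illustrative and are not taken from any vendor's tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Volume:
    device_type: str  # illustrative label such as "3380"
    cylinders: int    # usable capacity of the volume (assumed unit)


def can_restore_full_volume(home: Volume, hot: Volume) -> bool:
    """A full volume restore needs the same device type and at least
    as much capacity on the target as on the source."""
    return home.device_type == hot.device_type and hot.cylinders >= home.cylinders


def unplaceable_volumes(home_vols, hot_vols):
    """Return the home-site volumes for which no hot-site volume is left.

    Home volumes are placed largest first, each on the smallest unused
    hot-site volume that can hold it.
    """
    free = sorted(hot_vols, key=lambda v: v.cylinders)
    missing = []
    for home in sorted(home_vols, key=lambda v: v.cylinders, reverse=True):
        match = next((h for h in free if can_restore_full_volume(home, h)), None)
        if match is None:
            missing.append(home)
        else:
            free.remove(match)
    return missing


# Illustrative figures only: a single density and a triple density volume.
single = Volume("3380", 885)
triple = Volume("3380", 2655)
print(can_restore_full_volume(triple, single))  # triple onto single density
```

Running such a check against the vendor's current equipment list before
each test gives an early warning when the hot site's DASD no longer covers
the production volumes.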
Preparation Requirements
Planning for a disaster, which necessitates a transfer of operation
from a company's home site to the hot site (i.e., offsite processing
center), does not culminate with the development of a contingency plan.
Components used in the restoration process are constantly changing, so
the preparation for a restore must be kept current. This section of the
article discusses these preparation requirements.
Backup Strategies
The success of the restoration operation at an offsite processing
center is predicated on a complete backup of the home site's production
environment which has integrity. No matter how comprehensive the
contingency plan or the number of times that a contingency plan is tested,
there is no method available for recreating the data of a missing backup
tape or a corrupted backup. There are several approaches used to perform a
system backup. The first method is a full volume backup, which requires a
number of magnetic tapes (i.e., a backup of a triple density volume
requires up to six tapes depending on the type of compression used by the
controller).
Most installations do not perform full volume backups on a daily basis,
either because their environments are not constantly changing or because
of the amount of time required to perform a full backup. Therefore,
incremental backups are performed on those days on which a full volume
backup is not performed. Installations typically use products such as FDR
or HSM, whose catalog identifies the datasets that have changed and
therefore require a backup.
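The selection rule behind an incremental backup can be sketched in a few
lines. This is a minimal illustration, not how FDR or HSM work
internally: the dictionary of last-changed timestamps and the function
name are assumptions standing in for the product's own catalog.

```python
from datetime import datetime


def datasets_needing_backup(last_changed, last_full_backup):
    """Return, in name order, the datasets changed after the last full backup.

    last_changed maps a dataset name to the time it was last modified
    (a hypothetical stand-in for a backup product's catalog).
    """
    return sorted(
        name for name, changed in last_changed.items()
        if changed > last_full_backup
    )


# Hypothetical dataset names and times for illustration.
changed = {
    "PROD.PAYROLL.MASTER": datetime(1993, 3, 2, 22, 0),
    "PROD.GL.HISTORY": datetime(1993, 2, 20, 1, 0),
    "PROD.AR.OPEN": datetime(1993, 3, 1, 3, 30),
}
full = datetime(1993, 2, 28, 23, 0)
print(datasets_needing_backup(changed, full))
```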
Both methods described are backups initiated by the operations area,
which is not familiar with the activity that occurs within the
applications themselves. This point is critical since files will be
corrupted if a backup occurs while the files are open. This control issue
applies to every backup strategy mentioned. It is the reason that the most
effective backup strategy for application systems is to have the
application development staff (i.e., who have knowledge of which files
are open at specific times) define the backup process. This will
ensure that the backups which are performed have integrity.
In the near future, IBM will release a new method for performing
backups which provides integrity through the use of concurrent copying.
IBM has placed into the control unit the ability to take a backup of a
file while it is open and still maintain integrity.
MVSCP GEN
Since the hardware used at the hot site is different from the home
site, the disaster recovery vendor will send each of its customers a
copy of the MVSCP GEN that is running at the hot site as part of an
ongoing maintenance schedule. The MVSCP GEN defines all of an
installation's devices, such as the console addresses that you IPL from,
DASD, printers, terminals, and controllers.
Companies review the MVSCP GEN to determine the addresses used by
the hot site. If the hot site's hardware addresses are not contained in
the company's own MVSCP GEN, JCL errors will occur since the system will
not recognize the unit names (e.g., tape drive) that are coded in the
installation's jobs. Therefore, installations must blend the hot site's
addresses into the GEN, since the unit names that are hardcoded into a
company's JCL cannot be changed. This is accomplished using the Eligible
Device Table (EDT) within the MVSCP GEN, which allows an installation to
reference multiple addresses, including hardware addresses that are
defined at the home site and the hot site. The EDT maps the unit names
used by one's installation to the equipment addresses which the disaster
recovery vendor has supplied.
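Conceptually, the EDT behaves like a lookup table from unit names to
device addresses. The sketch below models that idea with a plain
dictionary to show what blending the two sites' addresses means and how
unresolved unit names could be spotted in advance. The function names,
unit names, and addresses are all hypothetical, and this is not the MVSCP
input format.

```python
def blend_edt(home_edt, hot_site_addresses):
    """Build a table holding both home-site and hot-site addresses
    for every unit name, without modifying the inputs."""
    blended = {name: list(addrs) for name, addrs in home_edt.items()}
    for name, addrs in hot_site_addresses.items():
        target = blended.setdefault(name, [])
        for addr in addrs:
            if addr not in target:
                target.append(addr)
    return blended


def unresolved_unit_names(jcl_units, edt):
    """Return the unit names coded in JCL that map to no device in the table."""
    return sorted(u for u in set(jcl_units) if not edt.get(u))


# Hypothetical unit names and device addresses.
home = {"TAPE": ["0480"], "SYSDA": ["0A40"]}
hot = {"TAPE": ["0580", "0581"], "SYSDA": ["0B00"]}
edt = blend_edt(home, hot)
print(unresolved_unit_names(["TAPE", "SYSDA", "CART"], edt))
```

A unit name reported as unresolved here corresponds to a job that would
fail with a JCL error at the hot site.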
When restoring to a hot site which has either a floor system or an
installation's standalone IPL system, the MVSCP GEN must be rerun to
define (i.e., GEN) the addresses of the hot site along with the EDT.
DASD Addresses
As previously mentioned, in order to run your installation's system in
a different location without changing the operating environment,
provisions must be made to allow the system to address the hardware
located at the hot site. This includes the addressing of DASD by one's
system. For example, catalogs are used to locate the DASD on which
datasets reside or to identify the volumes on which APF libraries are
stored. Since DASD addresses are different for the home site and hot site,
installations must alter the volume ID record on the VTOC of the DASD used
at the hot site to match the identification that the installation uses at
its home site. This technique is referred to as "clipping the pack".
Typically installations have JCL prepared which performs this function.
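Before the clipping JCL is run, it helps to have an explicit list of
which hot-site device receives which home-site volume ID. The following
sketch only builds that list; it does not generate utility control
statements. The function name, the device addresses, and the volume IDs
are assumptions made for illustration.

```python
def clip_plan(home_volume_ids, hot_site_volumes):
    """Pair each home-site volume ID with a hot-site device.

    hot_site_volumes maps a device address to the volume ID the vendor
    currently has on that device. Returns a list of
    (address, current_id, new_id) tuples in address order.
    """
    if len(set(home_volume_ids)) != len(home_volume_ids):
        raise ValueError("home-site volume IDs must be unique")
    if len(home_volume_ids) > len(hot_site_volumes):
        raise ValueError("not enough hot-site volumes to clip")
    devices = sorted(hot_site_volumes.items())
    return [
        (address, current_id, new_id)
        for new_id, (address, current_id) in zip(home_volume_ids, devices)
    ]


# Hypothetical addresses and volume IDs.
plan = clip_plan(
    ["PROD01", "PROD02"],
    {"0B01": "HOT002", "0B00": "HOT001", "0B02": "HOT003"},
)
for address, current_id, new_id in plan:
    print(f"device {address}: relabel {current_id} as {new_id}")
```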
Restore Jobs
The preparation step which requires the most constant change is the JCL
used to restore the system. The JCL which calls for the backup tapes
changes each day, since the tape volser for each night's backup changes.
Most installations have a product or have devised their own automated
process for creating restore jobs.
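The core of such an automated process is deciding, for each volume, which
tapes the restore job must call for and in what order. A minimal sketch
follows, assuming a simple backup log of full and incremental backups;
the Backup record, the function names, and the tape volsers are
hypothetical and do not represent any product's format.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Backup:
    volume: str      # DASD volume that was backed up
    taken_on: date   # date of the backup
    kind: str        # "full" or "incremental"
    tapes: tuple     # tape volsers written by the backup, in order


def restore_sequence(backups, volume):
    """Return the latest full backup of a volume followed by every
    later incremental backup, oldest first."""
    mine = [b for b in backups if b.volume == volume]
    fulls = [b for b in mine if b.kind == "full"]
    if not fulls:
        raise ValueError(f"no full backup recorded for {volume}")
    latest_full = max(fulls, key=lambda b: b.taken_on)
    later = sorted(
        (b for b in mine
         if b.kind == "incremental" and b.taken_on > latest_full.taken_on),
        key=lambda b: b.taken_on,
    )
    return [latest_full, *later]


def tape_list(backups, volume):
    """Flatten the restore sequence into the ordered list of tape volsers."""
    return [t for b in restore_sequence(backups, volume) for t in b.tapes]


# Hypothetical backup log.
log = [
    Backup("PROD01", date(1993, 2, 21), "full", ("T00001", "T00002")),
    Backup("PROD01", date(1993, 2, 28), "full", ("T00010", "T00011")),
    Backup("PROD01", date(1993, 2, 25), "incremental", ("T00005",)),
    Backup("PROD01", date(1993, 3, 2), "incremental", ("T00021",)),
    Backup("PROD01", date(1993, 3, 1), "incremental", ("T00020",)),
    Backup("PROD02", date(1993, 2, 28), "full", ("T00012",)),
]
print(tape_list(log, "PROD01"))
```

Regenerating this list after every backup cycle keeps the restore JCL in
step with the tapes actually sent offsite.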
Other Preparation Requirements
The following items should also be considered when making preparations
for restoring a system at a hot site:
- special JES definitions (JES2PARM) which reference the EP lines
  and printers that are used by the hot site
- a special CONSOLxx PARMLIB member to define the console
  addresses used by the hot site
- a special NCP GEN for the contingency site, since the
  telecommunication lines are mapped differently at the hot site
Other Offsite Contingency Planning Considerations
When developing a contingency strategy for restoring one's system at an
offsite processing center, a tape management system is the best
organizational tool to ensure that the proper tapes are sent to and
recalled from the offsite media storage facility.
When reviewing the process used for sending tapes to the offsite
storage facility, provision should be made to ensure that the JCL and
other support items required to restore a system at the hot site are also
shipped to the offsite storage facility.
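A simple reconciliation between what the restore needs and what the
offsite storage facility actually holds catches missing items before a
disaster does. This sketch assumes both lists are available as plain
item names; the function name and the item labels are hypothetical.

```python
def missing_from_offsite(required_items, offsite_inventory):
    """Return, in name order, the required items not held at the offsite facility."""
    return sorted(set(required_items) - set(offsite_inventory))


# Hypothetical item labels: backup tapes plus restore support items.
required = ["T00010", "T00011", "T00020", "RESTORE JCL", "CLIP JCL"]
offsite = ["T00010", "T00011", "T00020", "RESTORE JCL"]
print(missing_from_offsite(required, offsite))
```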
All good recovery plans include a plan for restoring the system
back to one's home site when the disaster is over. This critical step is
overlooked by most contingency plans. Therefore, careful analysis is
required to ensure that the methods used to operate your system at the hot
site are compatible with the operating requirements at the home site. For
example, if your installation is restoring its system to hot site
DASD which contains more space than your home site's DASD, then the
additional space will be used by the installation as datasets expand in
size during normal daily processing. When it is time to return to the home
site, a full volume backup of the DASD at the hot site is performed.
However, the DASD at the home site will not have the space available to
perform the restore. The solution for such a situation is to allocate a
dummy dataset on the DASD at the hot site which fills the space created by
the difference in the two sites' DASD capacity.
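The size of the dummy dataset is simply the capacity difference between
the two volumes. A minimal sketch of that arithmetic follows, assuming
capacity is measured in cylinders; the function name and the figures are
illustrative.

```python
def dummy_dataset_cylinders(home_cylinders, hot_cylinders):
    """Return the space to reserve on the hot-site volume so that the
    data on it can never outgrow the home-site volume."""
    return max(0, hot_cylinders - home_cylinders)


# Illustrative capacities: a smaller home-site volume, a larger hot-site one.
print(dummy_dataset_cylinders(home_cylinders=2226, hot_cylinders=3339))
```

When the hot-site volume is the same size as or smaller than the home-site
volume, the result is zero and no dummy dataset is needed.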
It should be noted that when a restore is performed at the hot site,
the disaster recovery vendor's floor system has security installed, but it
is bypassed. Security will not be in effect until an installation re-IPLs
its system with its own configuration, which occurs after all of the
restores have been performed.
Many third party vendor system software products used by an
installation have controls which prevent the software from being used on a
different CPU. Most of these vendors provide a facility to allow their
software to be temporarily used on a different CPU by providing a vendor
zap. Typically these zaps will only function for a set period of time.
Vendors do not provide these facilities in advance except in order to
perform a contingency test. Therefore, steps to contact the vendor and
apply the zap should be included in the contingency plan.
All contingency plans should contain the exact steps needed to restore
the system. The most effective contingency plans identify the steps
which can be performed concurrently because they are not dependent on
other tasks. This is important since the objective is to restore the
system in the least amount of time possible.
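One way to find the steps that can run concurrently is to record each
step's prerequisites and group the steps into waves, where every step in
a wave depends only on steps in earlier waves. The sketch below uses
Python's standard graphlib module (Python 3.9 or later). The step names
and their dependencies are hypothetical examples, not a recommended plan.

```python
from graphlib import TopologicalSorter


def concurrent_waves(dependencies):
    """Group steps into waves that can each be worked on in parallel.

    dependencies maps a step to the set of steps that must finish first.
    A circular dependency raises graphlib.CycleError.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # steps with no unfinished prerequisite
        waves.append(ready)
        sorter.done(*ready)
    return waves


# Hypothetical restoration steps.
steps = {
    "restore system volumes": set(),
    "restore application volumes": set(),
    "contact software vendors": set(),
    "re-IPL with own system": {
        "restore system volumes",
        "restore application volumes",
    },
    "apply vendor zaps": {"contact software vendors", "re-IPL with own system"},
}
for number, wave in enumerate(concurrent_waves(steps), start=1):
    print(number, wave)
```

Steps listed in the same wave can be assigned to different people at the
same time, which is where the elapsed restoration time is saved.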
Conclusion
The information provided in this article is intended to provide a
background on the critical processes and functions required to restore a
system at an offsite processing center. The contingency plan should
contain procedures for performing these critical functions. However, there
are various methods that can be used to perform these functions, and these
should be considered when reviewing a contingency plan.
This article was written more than five years ago.
Events may have changed since this article was written.
Copyright 2006, Audit Serve, Inc. All rights reserved.
Reproduction, which includes links from other Web sites, is prohibited except by
permission in writing.
This article appeared in a past issue of the Audit Vision
E-Mail Newsletter.