IMPROVEMENTS ARE NEEDED TO STRENGTHEN SSA'S OVERALL RECOVERY TESTING
The March 1997 recovery testing at COMDISCO did not meet all of
its testing objectives. The disaster recovery team (DRT) was able
to re-establish the data processing and network environments; however,
they were unable to complete the on-line and batch application testing
with the FOs. We believe that if the DRT had more time they could
have completed more objectives. Improvements are needed to strengthen
SSA's overall recovery testing process.
Application Test Objectives Were Not Completed
This test produced several new circumstances, described below, that
resulted in an unstable operating environment when the applications
were being tested on Saturday, March 1. The environment was unstable
because the DRT did not have enough time to resolve operating
and application start-up problems, which were caused by the following:
- missing data files;
- new release versions of several support software products being
introduced at the same time;
- inexperience of new personnel; and
- new hardware.
If the DRT had more time up front to solve the start-up problems,
we believe most of the test applications could have been successfully
completed on March 1. SSA's dynamic data processing and application
environments are becoming more complex each year. Given these complexities
and interdependencies, we believe that regardless of the extent of
planning by the DRT there will always be the risk of unanticipated problems.
The window of opportunity for testing on-line applications is on
Saturdays when FOs are closed and the network can be switched over
to COMDISCO. Late Saturday and Sunday are used to execute the batch
systems, perform on-line maintenance, and purge the system of SSA
test data. The DRT needs an additional 24 hours up-front (start Thursday
at 8 a.m. rather than Friday at 8 a.m.) to resolve any operating
start-up problems so on-line application testing can begin on time
early Saturday morning.
6 of the 12 Critical Workload Areas Have Been Tested to Date
After four testing opportunities at COMDISCO (December 1993, August
1994, January 1996, March 1997), only 6 of the 12 critical workload
areas have been tested. Of the six areas that have been tested, only
the on-line queries, processing title XVI claims and MTEXT workloads
have been totally successful. There was also limited success in processing
post entitlement events (for example, some applications have run
successfully while others have not). See Appendix A for a list of
the 12 workload areas.
We believe that only 6 workload areas have been tested to date
because of incomplete planning by SSA for testing all the applications
in the 12 critical workload areas. Our conclusion is based on the
following:
- SSA does not have a multi-year (master) application test scheduling
plan to ensure that all critical workload areas are tested on a
cyclical basis, i.e., every 3 years. According to SSA, each test
plan stands on its own merit, which means the results from each
test have not been compiled for developing an overall application
testing plan schedule.
- In our discussions with SSA, there were some inconsistencies
within SSA components as to what the critical workloads were within
the 12 workload areas. The inconsistencies in defining the critical
workloads indicate that planning needs improvement. For example,
we noted inconsistencies in the latest BRP document dated January
31, 1996 which identified the critical workloads. We questioned
why the 800-number system to schedule appointments and referrals
was listed as a critical workload in Appendix F of the BRP but
not listed in the executive summary as a critical workload. One
SSA component said it was a critical workload, while another said
it was not. In another example, we inquired why the MTEXT workload,
which had been scheduled for the March 1997 test, was canceled.
The reason given was that SSA now believes this workload is
not critical. Originally, it was believed that some new beneficiaries
would not get their checks unless the MTEXT notices were generated.
Better planning would have resulted in eliminating the MTEXT workload
from the critical workload list.
- For the March 1997 recovery test, one application test objective
was to process title II claims through the MCS. However, not all
title II claims are processed through MCS. While all claims are
initiated through MCS, if MCS identifies exceptions (such as missing
Master Beneficiary Record data) the claim must then be processed
either through the Claims Automated Process System (CAPS) or through
the Manual Adjustment Debit, Credit and Award Process (MADCAP).
In February 1997, MCS processed 70 percent of the claims, CAPS processed
4 percent, and MADCAP processed 16 percent. Testing for only those
title II claims that could be processed through MCS overlooks about
30 percent of all title II claims.
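The scheduling gap described in the first item above can be made concrete
with a little arithmetic. The sketch below is illustrative only: it assumes
the 3-year cycle mentioned above and a test opportunity at COMDISCO every
12 to 18 months, and the function name and parameters are hypothetical and
not part of any SSA plan.

```python
import math


def areas_per_test(total_areas: int, cycle_months: int, interval_months: int) -> int:
    """Minimum number of workload areas to cover at each test so that
    every area is tested once per cycle (illustrative assumption)."""
    # Number of test opportunities that fit inside one cycle.
    tests_per_cycle = cycle_months // interval_months
    if tests_per_cycle < 1:
        raise ValueError("no test opportunity falls within the cycle")
    # Spread the areas evenly, rounding up so none is left out.
    return math.ceil(total_areas / tests_per_cycle)


# 12 critical workload areas, 36-month cycle, tests every 12 or 18 months.
for interval in (12, 18):
    print(interval, areas_per_test(12, 36, interval))
```

Under these assumptions, a master schedule would need to cover four to six
workload areas at each test to reach all 12 areas within the cycle.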
Finally, SSA only has the opportunity to test every 12 to 18 months
at COMDISCO. Currently, SSA is testing between three and four applications
per test date. Testing a larger number of applications would be more
No Documented Performance Standards Exist to Measure Stress Test Results
The purpose of stress testing is to determine the volume of transactions
at which the network would experience significant delays. These tests
are designed to simulate how the system will perform under actual
conditions with a high volume of transactions being processed at
one time. For this test, DIET officials said they were at about 350
transactions per second before the network began experiencing delays.
In comparison, we have been told that during the peak time for a
normal day, the National Computer Center (NCC) will process over
900 transactions per second. However, the DIET stress test results
cannot be measured since there are no documented performance standards.
Consequently, the SSA officials that we talked with could not say
whether this service performance level at COMDISCO would be acceptable
in a disaster situation. The results are not meaningful unless they
can be measured against a stated service performance standard.
Also, for the March 1997 test, the results (350 transactions/second)
that were achieved were based only on log-on/log-off and query-only
transaction profiles. The profiles used for the test excluded those
transactions that would have resulted in an action to update a data base. Since
this was not representative of a typical daily production transaction
mix at NCC, these stress results are even less meaningful. We were
told that not all transaction profiles could be used for this test
because of some technical limitations.
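To illustrate what measuring against a stated standard could look like,
the sketch below compares the reported stress test figure with the reported
NCC peak. It is a minimal example under stated assumptions: the class and
function names are hypothetical, and the required fraction of peak is a
placeholder, since no documented standard exists. Because the NCC peak was
described as over 900 transactions per second, the computed fraction is an
upper bound.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StressResult:
    measured_tps: float         # throughput reached before delays began
    peak_production_tps: float  # normal-day peak reported for the NCC


def fraction_of_peak(result: StressResult) -> float:
    """Share of normal peak throughput reached at the recovery site."""
    return result.measured_tps / result.peak_production_tps


def meets_standard(result: StressResult, required_fraction: float) -> bool:
    """Compare a result with a service standard expressed as a fraction
    of peak. The standard itself is a hypothetical placeholder."""
    if not 0 < required_fraction <= 1:
        raise ValueError("required_fraction must be in (0, 1]")
    return fraction_of_peak(result) >= required_fraction


# Figures reported for the March 1997 test and the NCC peak.
march_1997 = StressResult(measured_tps=350, peak_production_tps=900)
print(f"{fraction_of_peak(march_1997):.0%} of normal peak")
```

With a documented standard in place of the placeholder, the same comparison
would give a clear pass or fail result for each test.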
in Establishing the Support Environment and Incompatibilities
between Different Facility Complexes Prevented the Successful
Completion of the MTAS Workload
Most of SSA's critical workload applications run in the PPF
complex environment; however, several applications run outside it.
Examples of these applications include Falcon, PSC/OCRO batch, and
MTAS which run in the MISF complex and VTAM and NETVIEW which reside
in the Network Management Facility complex environment.
For the March 1997 test, SSA tested the MTAS application at COMDISCO.
This was the third time the time and attendance application did not
meet all of the test objectives. One reason for the problem is that SSA
has attempted to execute an MISF application in the PPF environment.
According to SSA officials, this presents a number of logistical
and technical problems, such as record blocking lengths, which to
date have made the MTAS application incompatible with the PPF environment.
Also, because most of the other non-PPF critical workload applications
have not been tested to date, SSA has no assurance these applications
Sent to COMDISCO from the OSSF May Not Include All Critical
Files to be sent to COMDISCO from the OSSF currently are judgmentally
selected from over 45,000 tapes at the OSSF. This process is prone to
human error: not all critical tapes may be selected, and valuable
time would then be lost in a disaster recovery situation. This condition occurred
in the March 1997 test when several MTAS and IDMS files were missing.
While SSA has made some improvements in the development of the back-up
tape pick list, further automation of the process is still needed.
The recovery pick list should be automated since all the critical
workloads are known and all the files associated with these workloads
can be identified. The improvements that were made make the process
more flexible in that the pick list can be generated outside the
SSA complex. Prior to this improvement, the tapes had to be selected
by a person located in the NCC complex. The improvements permit the
Office of Systems Design and Development and the Office of Telecommunications
and Systems Operations personnel to select tapes from a remote site
using a laptop computer and a modem.
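One way to picture an automated pick list is as a lookup from critical
workloads to their files, and from files to the tapes that hold them. The
sketch below is a minimal illustration, not SSA's process: the function,
the data structures, and every file and tape name in it are hypothetical.

```python
def build_pick_list(critical_workloads, files_by_workload, tape_by_file):
    """Return (sorted tape IDs to ship, files with no known tape).

    All inputs are hypothetical stand-ins for a workload inventory
    and a tape catalog.
    """
    tapes = set()
    unresolved = []
    for workload in critical_workloads:
        for file_name in files_by_workload.get(workload, []):
            tape = tape_by_file.get(file_name)
            if tape is None:
                # Flag the gap before the test, not during recovery.
                unresolved.append((workload, file_name))
            else:
                tapes.add(tape)
    return sorted(tapes), unresolved


# Made-up inventory and catalog entries for illustration only.
files_by_workload = {
    "MTAS": ["MTAS.MASTER", "MTAS.TRANS"],
    "QUERIES": ["QUERY.INDEX"],
}
tape_by_file = {"MTAS.MASTER": "T00017", "QUERY.INDEX": "T00342"}

pick_list, missing = build_pick_list(
    ["MTAS", "QUERIES"], files_by_workload, tape_by_file
)
print(pick_list)
print(missing)
```

Reporting unresolved files alongside the pick list is the point of the
design: a file with no catalogued tape is surfaced when the list is built
and not discovered as missing at the recovery site.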