Implementing Transparent Subsystem Failure Recovery within the NOAA Jason Ground System
16th Ground System Architectures Workshop
The National Oceanic and Atmospheric Administration (NOAA) Jason Ground System (NJGS) is a consolidated next-generation ground system that will support the simultaneous operation of the OSTM/Jason-2 and Jason-3 ocean surface topography missions. The NJGS will consist of several independent subsystems for spacecraft command and control, telemetry processing, and data archiving and distribution. It will also provide the other Jason-3 mission partners, namely the National Aeronautics and Space Administration (NASA), the Centre National d'Etudes Spatiales (CNES), and the European Organisation for the Exploitation of Meteorological Satellites (EUMETSAT), with access to mission operational data and science products.
To assure high availability and multi-level resilience against equipment failures, the NJGS will employ a subsystem redundancy scheme in which two or more independent instances of each subsystem provide fully redundant functionality. The redundancy mechanism selected for the NJGS is an enhancement of the one used in NOAA's existing Jason-2 Ground System (J2GS); the two approaches differ in that the J2GS provides redundancy among parallel sets of subsystems, whereas the NJGS will allow individual subsystems to be independently replaced.
The J2GS subsystem redundancy implementation imposed a number of limitations on the mission partners. Chief among these was the non-transparency of the subsystem replacement process; following any subsystem failure, all mission partners with access to the failed subsystem would be required to reconfigure their communications interfaces to access its replacement. The need for this explicit change in configuration imposed significant overhead on subsystem failure recovery, since any replacement required a coordinated international effort to complete.
The NJGS subsystem redundancy mechanism removes this constraint on partner interoperability by making subsystem replacement transparent. Multiple instances of each externally accessible NJGS subsystem present a common interface to the mission partners, and the integrity of that interface is maintained during the change-over from a failed subsystem to its standby counterpart. Subsystem replacement can therefore be completed without any configuration changes on the part of the external mission partners.
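The idea of a common interface that survives change-over can be illustrated with a minimal sketch. This is not the NJGS implementation, and the abstract does not state which mechanism the NJGS uses to preserve its interfaces. The class names, the endpoint string, and the health flag below are all hypothetical; the sketch only shows that a partner holding a single stable endpoint needs no reconfiguration when the active instance behind it changes.

```python
from dataclasses import dataclass


@dataclass
class SubsystemInstance:
    """One redundant instance of a subsystem (hypothetical model)."""
    name: str
    healthy: bool = True

    def handle(self, request: str) -> str:
        # A failed instance cannot serve requests.
        if not self.healthy:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} served {request}"


@dataclass
class CommonInterface:
    """Stable external endpoint fronting a set of redundant instances.

    Partners only ever see `endpoint`; which instance is active is an
    internal detail (assumption made for this illustration).
    """
    endpoint: str
    instances: list
    active_index: int = 0

    @property
    def active(self) -> SubsystemInstance:
        return self.instances[self.active_index]

    def fail_over(self) -> SubsystemInstance:
        # Promote the first healthy instance; the endpoint is untouched.
        for i, inst in enumerate(self.instances):
            if inst.healthy:
                self.active_index = i
                return inst
        raise RuntimeError("no healthy instance available")

    def request(self, payload: str) -> str:
        # Change over to a standby if the active instance has failed.
        if not self.active.healthy:
            self.fail_over()
        return self.active.handle(payload)


if __name__ == "__main__":
    primary = SubsystemInstance("archive-A")
    standby = SubsystemInstance("archive-B")
    archive = CommonInterface("archive.njgs.example", [primary, standby])

    # The partner's configuration is set once and never edited.
    partner_config = {"archive_endpoint": archive.endpoint}

    print(archive.request("product query"))
    primary.healthy = False  # simulate a subsystem failure
    print(archive.request("product query"))
    print(partner_config["archive_endpoint"] == archive.endpoint)
```

In this sketch the partner-side configuration refers only to the endpoint, so the failure of the first instance is absorbed inside `CommonInterface`. Under the J2GS approach described above, the equivalent step would have required each partner to edit its own configuration to point at the replacement.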
This presentation discusses the key elements of the subsystem-level redundancy scheme and the mechanism through which the NJGS recovers from a subsystem failure. The presentation focuses on the implementation of common external interfaces for NJGS subsystems accessed by external partners, and the method by which these interfaces are preserved during subsystem failure recovery.