It has been proven time and again that hardware and hot-swap capabilities alone will not deliver the high availability (HA) levels required in today's telecommunication, wireless, and networking applications. Strong software architectures are also required to reach the five 9s (99.999%) and eventually six 9s (99.9999%) plateaus.
On the software front, the real-time operating system (RTOS) remains one of the key elements in making HA come to life. The RTOS holds the key for controlling CPUs, boards and more. At the same time, it's tasked with ensuring resource availability in both the space and time domains. Here's a look at how designers can optimize RTOS designs to deliver these capabilities.
Memory Protection: Where it All Begins
In any embedded communication design, fault tolerance begins with memory protection. For over 20 years, microprocessors have had memory management units (MMU) that enable an OS to run applications in their own protected address spaces. Yet most commercial RTOSes in use today turn off the MMU when they boot. As a result, all applications and the kernel itself reside in the same memory space and usually run in supervisor mode.
In traditional communication architectures, any application has direct access to the kernel's code and data and to the code and data of all the other applications in the system. Thus, a single errant pointer in the statistics-gathering program can trash the call processing program or corrupt the kernel and take down the entire system.
For a highly available and reliable system, the RTOS must be able to prevent code running in one address space from accessing memory of another address space. The RTOS uses the hardware MMU to enforce this protection, thereby rendering careless or malicious corruption between applications an impossibility. A bug in the statistics-gathering program must not be able to affect the operation of the call processing program or the kernel itself. The fact that non-memory protected operating systems are still used in complex embedded systems where reliability and availability are important is truly a wonder.
In HA systems, an RTOS must reliably detect and handle stack overflow errors. An RTOS that does not provide memory protection is unacceptable since it cannot detect when a task has written data past the end of its stack. In contrast, an RTOS that supports memory protection and virtual memory can leave a page's worth (or more) of memory below each stack area unmapped so that a task overflowing its stack will be caught by a hardware memory protection fault.
When employing this strategy, an RTOS that implements a virtual memory model has a distinct advantage over one that employs a flat memory model. In a flat memory model, each memory page represents actual RAM (or ROM), so unmapping a page effectively causes that chunk of physical memory to be lost. Embedded systems typically cannot afford to waste system resources in this manner.
By implementing a virtual memory model, the RTOS allows developers to leave a chunk of virtual addresses, rather than a chunk of physical memory, unmapped below each stack. When the stack's virtual memory is mapped to physical addresses, the RTOS does not need to assign space for the unmapped chunk of virtual memory. The physical memory consumed is limited to the actual size of the stack, thereby conserving system resources.
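The guard-page idea can be illustrated with a toy model. The sketch below is not taken from any particular RTOS: the `AddressSpace` class, the 4-kbyte page size, and the `ProtectionFault` exception are all assumptions made for illustration. It shows that the guard region consumes virtual addresses only, while a write below the stack is caught.

```python
PAGE_SIZE = 4096  # assumed page size for this sketch


class ProtectionFault(Exception):
    """Stands in for the hardware memory protection fault."""


class AddressSpace:
    """Toy page table: virtual page number -> physical frame number."""

    def __init__(self):
        self.page_table = {}
        self.next_frame = 0

    def map_stack(self, base_vaddr, stack_pages, guard_pages=1):
        """Reserve guard pages below the stack; back only the stack with RAM."""
        first_stack_page = base_vaddr // PAGE_SIZE + guard_pages
        for vpn in range(first_stack_page, first_stack_page + stack_pages):
            self.page_table[vpn] = self.next_frame
            self.next_frame += 1
        # Lowest valid stack address; the guard region below it stays unmapped.
        return first_stack_page * PAGE_SIZE

    def write(self, vaddr):
        """Fault on any access to a virtual page that has no mapping."""
        if vaddr // PAGE_SIZE not in self.page_table:
            raise ProtectionFault(hex(vaddr))

    def physical_bytes_used(self):
        return len(self.page_table) * PAGE_SIZE
```

In this model, a four-page stack with a one-page guard consumes exactly four pages of physical memory, and a write just below the lowest stack address raises the fault.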
Supporting HA Architectures
Many critical systems ensure HA by employing a distributed architecture composed of redundant nodes that take over when active nodes fail. Each redundant node in a high availability architecture must be able to quickly and reliably detect a failure in one of the active nodes in order to start processing the tasks usually handled by the "dead" node. This requirement holds true regardless of whether the active node and redundant node are running on the same processor or on different processors that are connected by a network or common backplane.
An RTOS can assist system designers in effectively meeting this requirement by providing a built-in heartbeat within each interprocessor and intraprocessor communication channel. In this scheme, once a communication channel is opened between a redundant node and an active node, the redundant node continually receives heartbeat messages from the active node during normal operation.
When the heartbeat fails to arrive, the redundant node can execute a failover operation in which it takes over the servicing usually handled by the active node. The communication channel can also be used to checkpoint application-specific state between the active and redundant servers.
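A minimal sketch of the redundant node's side of this scheme follows. The `HeartbeatMonitor` class, its method names, and the idea of passing the current time in as an argument are assumptions made to keep the example self-contained; a real RTOS would deliver heartbeats through its own channel API.

```python
class HeartbeatMonitor:
    """Runs on the redundant node and watches one channel to the active node."""

    def __init__(self, timeout, on_failover):
        self.timeout = timeout          # longest tolerated gap between heartbeats
        self.on_failover = on_failover  # callback that takes over the service
        self.last_beat = None
        self.checkpoint = None          # most recent application state received
        self.active = False             # True once this node has taken over

    def receive(self, now, state=None):
        """Called when a heartbeat arrives; it may carry checkpointed state."""
        self.last_beat = now
        if state is not None:
            self.checkpoint = state

    def poll(self, now):
        """Called periodically; returns True if this call triggered a failover."""
        if self.active or self.last_beat is None:
            return False
        if now - self.last_beat > self.timeout:
            self.active = True
            self.on_failover(self.checkpoint)
            return True
        return False
```

Because the checkpoint travels on the same channel as the heartbeat, the failover callback starts from the last state the active node reported rather than from scratch.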
The Same API Helps
An RTOS that uses the same API for all intertask communication, regardless of whether the tasks run on the same processor and regardless of the communication device used to pass messages between tasks located on different processors, is especially well suited for use in distributed HA systems.
First, by using a common communication API, an RTOS provides designers with the ability to use a single method for communicating between tasks, regardless of whether the tasks are running on the same processor or not. Among other things, this enables a smooth, staged development process wherein designers can develop a distributed system on a single board during the prototype stage before incrementally moving the system's components to distributed boards in the architecture. With a common communication API, only minimal code changes must be made before moving components to different processors.
Second, a common communication API allows developers to use the same method of communicating between tasks without consideration of what type of communications network (TCP/IP, backplane protocol, custom network type, etc.) the tasks are using to communicate. Not only does this simplify the initial design, but it also streamlines the process of porting an application to a new system configuration that uses different communications media.
In addition to using a common communication API, an RTOS should provide a mechanism that does not require clients to specify a target node location, but rather allows clients to communicate with a named service within a server cluster. Within this framework, system developers can create an HA architecture by providing multiple redundant servers, all publishing the same named service. If one server fails, the remaining servers can service clients without knowledge or care about which nodes in the cluster are active.
This named service model makes a redundant system both easier to implement and easier to modify while the system is running in the field. During development, the system designer is not required to provide a client with information about how many redundant servers will be available, nor what the addresses of those servers will be. This results in simpler code for the client, and also provides the designer with the ability to extend the system without recompiling any of the client's code.
For example, if a server processor is taken down for repairs, it can be replaced with a different processor running the server program, even though the alternate processor may have a different address (for example, be in a different backplane slot, have a different IP address on a TCP/IP network, etc.). If the server is running on the same processor as the client, a new instance of the server can be loaded in another address space without having to modify the client with additional information about the new address space.
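The named service model can be sketched as a small directory that clients call by service name only. Everything here is hypothetical: the `ServiceDirectory` class, the use of `ConnectionError` to represent a dead server, and the service and address names are illustrative choices, not part of any RTOS API.

```python
class ServiceUnavailable(Exception):
    """Raised when no server publishing the named service can answer."""


class ServiceDirectory:
    def __init__(self):
        self._servers = {}  # service name -> list of (address, handler)

    def publish(self, name, address, handler):
        """A server announces that it offers the named service."""
        self._servers.setdefault(name, []).append((address, handler))

    def withdraw(self, name, address):
        """Remove one server, for example when its board is pulled for repair."""
        remaining = [s for s in self._servers.get(name, []) if s[0] != address]
        self._servers[name] = remaining

    def call(self, name, request):
        """Clients name the service, never the node that provides it."""
        for _address, handler in self._servers.get(name, []):
            try:
                return handler(request)
            except ConnectionError:
                continue  # this server is down; try the next redundant one
        raise ServiceUnavailable(name)
```

The client code contains no addresses and no server count, so servers can be added, moved to another slot, or replaced without touching or recompiling the client.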
Guaranteeing Resource Availability
An RTOS used in HA systems must be able to guarantee that system resources in the space (memory) and time (CPU time) domains will always be available to critical processes no matter what is happening with other processes in the system. Let's take a look at guaranteeing resources in both domains, starting with the space domain.
To promote HA, a critical application cannot, as a result of malicious or careless execution of another application, run out of memory resources. In most RTOSes, memory used to hold task control blocks and other kernel objects comes from a central store. When a task running in the statistics gathering application creates a new task or another kernel object, the kernel carves off a chunk of memory from this central store to hold the data for this object.
When a task running in the call processing program needs to create a kernel object, the kernel satisfies this request from the same central store. A bug, poor design, or inadequate testing in the statistics gathering program could cause it to create too many kernel objects and exhaust the central store. In this situation, the call processing program is not able to create its needed kernel object, which can cause the program to fail disastrously. In Linux-based systems, the kernel handles all memory allocation requests from applications. It is not hard for one application to cause memory starvation of another application in a separate, memory-protected process; certain memory allocation requests can even crash the kernel!
In order to guarantee that this scenario cannot occur, the RTOS must provide a memory quota system wherein the system designer statically defines how much physical memory each address space has. For example, the statistics gathering application may be provided a maximum of 128 kbytes of memory while the call processing application receives an allocation of a maximum of 196 kbytes of memory. If the statistics gathering application encounters the aforementioned failure scenario, the program may exhaust its own 128 kbytes of memory. The call processing program and its 196 kbytes of memory, however, are wholly unaffected.
The RTOS should treat memory as a hard currency: when an application wants to create a kernel object, the application must provide a portion of its memory quota to satisfy the request. Each address space should have its own heap for its dynamic memory allocation requests. This kind of space domain protection must be part of the fundamental RTOS design. A central memory store, or limits that are assigned dynamically and at an application's discretion, is insufficient.
If an RTOS provides a memory quota system, dynamic loading of low criticality applications can be tolerated since the high criticality applications already running have the physical memory they require to run. In addition, the memory used to hold the new address spaces should come from the memory quota of the dynamic loader's address space. If this memory comes from a central store, then address space creation can fail if a malicious or carelessly written application attempts to create too many new address spaces.
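The quota scheme described in this section can be modeled in a few lines. This is a sketch under stated assumptions: the `AddressSpaceQuota` class, the `runaway` helper, and the 64-byte object size are invented for illustration, while the 128-kbyte and 196-kbyte figures come from the example above.

```python
KB = 1024


class QuotaExceeded(Exception):
    """Raised when an address space tries to spend more memory than it owns."""


class AddressSpaceQuota:
    def __init__(self, name, quota_bytes):
        self.name = name
        self.quota_bytes = quota_bytes  # fixed statically by the system designer
        self.used_bytes = 0

    def create_kernel_object(self, size_bytes):
        """Pay for a kernel object out of this address space's own quota."""
        if self.used_bytes + size_bytes > self.quota_bytes:
            raise QuotaExceeded(self.name)
        self.used_bytes += size_bytes

    def free_bytes(self):
        return self.quota_bytes - self.used_bytes


def runaway(space, object_size):
    """Model the failure scenario: create objects until the quota runs out."""
    created = 0
    try:
        while True:
            space.create_kernel_object(object_size)
            created += 1
    except QuotaExceeded:
        return created
```

When the statistics gathering space runs away, it fails against its own 128 kbytes; the call processing space can still create objects from its untouched 196 kbytes.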
Delivering Guarantees in the Time Domain
The vast majority of RTOS task schedulers are priority-based and preemptive. In this model, the highest priority task in the system always gets to run. If there are multiple tasks at the highest priority level, the tasks share runtime (called time slicing).
To support HA, the RTOS's scheduler should also have the ability to make tasks non-preemptible. This is known as cooperative multitasking. Under cooperative multitasking, a task runs until it blocks itself (for example, by waiting on a semaphore), thereby allowing another task to run. The inherent flaw with these simple scheduling schemes is that they do not include any provision for guaranteeing runtime for critical tasks.
Consider the following scenario: a system includes two tasks at the same priority level. Task A is a non-critical, background task. Task B is a critical task that needs to get at least 40 percent of the runtime in order to get its work done. Because Task A and B are at the same priority level, the typical scheduler will time slice them so that both tasks get 50 percent of the runtime. At this point, Task B is able to get its work done.
Now suppose Task A spawns a new task at the same priority level. Consequently, there are three highest priority tasks sharing the runtime. Suddenly, Task B is only getting 33 percent of the runtime and cannot get its critical work done. For that matter, if the code in Task A has a bug or virus, it may spawn dozens or even hundreds of "confederate" tasks, causing Task B to get a tiny fraction of the runtime.
One solution to this problem is to enable the system designer to inform the scheduler of a task's maximum "weight" within the priority level. When a task spawns another equal priority task, the creating task must give up part of its own weight to the new task.
In our previous example, suppose the system designer had assigned weight to Task A and Task B such that Task A has 60 percent of the runtime and Task B has 40 percent of the runtime. When Task A spawns a third task, it must provide part of its own weight, say 30 percent. Now Task A and the new task each have 30 percent of the runtime, but critical Task B's 40 percent of the runtime remains intact.
In the example above, Task A can spawn many confederate tasks without affecting the ability of Task B to get its work done; Task B's processing resource is thus guaranteed. A task scheduler that provides this kind of guaranteed resource availability in addition to the standard scheduling techniques is a requirement of many critical embedded systems.
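The arithmetic of the two schemes can be checked with a short model. The `WeightedLevel` class and `equal_share` function are hypothetical names used only for this sketch; the percentages are the ones from the Task A and Task B example.

```python
def equal_share(task_count):
    """Percent of runtime per task under plain time slicing."""
    return 100 / task_count


class WeightedLevel:
    """Tasks at one priority level, each holding a percentage of the runtime."""

    def __init__(self, weights):
        if sum(weights.values()) > 100:
            raise ValueError("weights exceed 100 percent")
        self.weights = dict(weights)

    def spawn(self, parent, child, percent):
        """The creating task pays for the new task out of its own weight."""
        if percent <= 0 or percent >= self.weights[parent]:
            raise ValueError("parent cannot donate that much weight")
        self.weights[parent] -= percent
        self.weights[child] = percent
```

With plain time slicing, `equal_share(3)` is about 33 percent, below the 40 percent Task B needs. With weights, `WeightedLevel({"A": 60, "B": 40})` followed by `spawn("A", "A1", 30)` leaves Task B at 40 percent no matter how many times Task A subdivides its own share.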
A problem inherent in all task schedulers is that they are ignorant of the application or address space in which the tasks reside. Augmenting our previous example, suppose that Task A executes in the statistics gathering address space while critical Task B executes in the call processing address space. The two applications are partitioned and protected in the space domain, but not in the time domain.
Designers of HA systems require the ability to guarantee that the runtime characteristics of the statistics gathering application cannot possibly affect the runtime characteristics of the call processing system. Task schedulers simply cannot make this guarantee. Consider a situation in which Task B normally gets all the runtime it needs because it has been given a higher priority than Task A or any of the other tasks in the statistics gathering application. Due to a bug, poor design, or improper testing, Task B may lower its own priority (the ability to do so is available with practically all task schedulers), allowing a task in the statistics gathering application to gain control of the processor's runtime. Similarly, Task A may raise its priority above the priority of Task B with the same effect. The only way to guarantee that the tasks in address spaces of different criticality cannot affect each other is to provide an address space level, or partition, scheduler.
Designers of safety critical software have known about the problems with task schedulers for a long time. To solve this problem, these designers have developed the partition-scheduling concept, which is a major part of ARINC Specification 653, an Avionics Application Software Standard Interface. The ARINC 653 partition scheduler runs partitions, or address spaces, according to a timeline established by the system designer. Each address space is provided one or more windows of execution within the repeating timeline. During each window, all the tasks in the other address spaces are not runnable; only the tasks within the currently active address space are runnable (and typically are scheduled according to the standard task scheduling rules). When the call processing application's window is active, its processing resource is guaranteed; the statistics gathering application cannot run and take away processing time from the critical application.
Although not specified in ARINC 653, a prudent addition to the implementation is to provide the concept of a background partition. When there are no runnable tasks within the active partition, the partition scheduler should be able to run background tasks, if any, in the background partition instead of idling. An example background task might be a low priority diagnostic agent that runs occasionally but does not have hard real-time requirements.
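A repeating timeline with a background partition might be modeled as follows. This sketch does not use the ARINC 653 interface; the `PartitionSchedule` class, the tick-based time unit, and the window lengths are assumptions chosen to make the behavior easy to follow.

```python
class PartitionSchedule:
    """Repeating timeline of execution windows, one partition per window."""

    def __init__(self, windows, background=None):
        # windows: list of (partition name, duration in ticks), set statically
        self.windows = list(windows)
        self.period = sum(duration for _name, duration in self.windows)
        self.background = background

    def active_partition(self, tick):
        """Return the partition that owns the window containing this tick."""
        offset = tick % self.period
        for name, duration in self.windows:
            if offset < duration:
                return name
            offset -= duration

    def partition_to_run(self, tick, runnable):
        """runnable: set of partitions that currently have runnable tasks."""
        owner = self.active_partition(tick)
        if owner in runnable:
            return owner
        if self.background is not None and self.background in runnable:
            return self.background  # use the otherwise idle window
        return None  # nothing to run; the processor idles
```

Note that a runnable statistics task is never chosen during the call processing window, even when the call processing partition has nothing to do; only the background partition may fill that time.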
Attempts have been made to add partition scheduling on top of commercial off-the-shelf operating systems by selectively halting all the tasks in the active partition and then running all the tasks in the next partition. Thus, partition-switching time is linear with the number of tasks in the partitions: an unacceptably poor implementation. The RTOS must implement the partition scheduler within the kernel and guarantee that partition switching takes constant time and is as fast as possible.
Handling Faults
When a task faults, the kernel must provide a mechanism whereby notification can be sent to a health monitor that is in charge of performing fault recovery actions. This health monitor should run in its own address space since the data in the address space containing the faulted task may be corrupted.
The kernel of the RTOS must give the health monitor permission to close down the faulted task or its entire address space, and allow the health monitor to restart the appropriate address space. Because some failures may not directly cause a hardware exception (for example, a deadlock introduced by an application design flaw), it is important that the kernel provide a software watchdog capability whereby the health monitor is notified when a periodic task does not execute its expected code sequence.
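One way to picture the software watchdog is as a table of check-in deadlines that is scanned on a regular basis. The `SoftwareWatchdog` class, its method names, and the explicit `now` arguments are hypothetical; they stand in for whatever facility a given kernel provides.

```python
class SoftwareWatchdog:
    """Notifies the health monitor when a periodic task misses its check-in."""

    def __init__(self, notify):
        self.notify = notify  # callback into the health monitor
        self.tasks = {}       # task name -> (period, time of last check-in)

    def register(self, task, period, now):
        self.tasks[task] = (period, now)

    def check_in(self, task, now):
        """Called by the task each time it completes its expected sequence."""
        period, _last = self.tasks[task]
        self.tasks[task] = (period, now)

    def scan(self, now):
        """Report every task whose last check-in is older than its period."""
        late = [
            task
            for task, (period, last) in self.tasks.items()
            if now - last > period
        ]
        for task in late:
            self.notify(task)
        return late
```

A deadlocked task raises no hardware exception, but it also stops checking in, so the next scan after its period expires reports it to the health monitor.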
In addition to recovery actions taken by the health monitor, the kernel must provide an event logging mechanism that captures the state of the system (for example, kernel service calls, task context switches, and interrupts) prior to the fault.
Protection from Service Calls
The kernel must also protect itself against improper service calls. Many kernels pass the actual pointer to a newly created kernel object, such as a semaphore, back to the application as a handle and then dereference this pointer when the application passes it into subsequent kernel service calls. The application can pass in an invalid pointer with disastrous results.
No kernel service call must ever be permitted to cause the kernel to fail. The RTOS should employ opaque descriptors for application references to kernel objects, and the parameters to all service calls must be validated by the kernel, thereby making this kind of failure impossible.
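The difference between raw pointers and opaque descriptors can be sketched with a small handle table. The `Kernel` class, the per-owner keying, and the `InvalidHandle` exception are illustrative assumptions rather than any real kernel's interface.

```python
class InvalidHandle(Exception):
    """Returned to the caller instead of letting the kernel fail."""


class Kernel:
    def __init__(self):
        self._objects = {}  # (owner, descriptor) -> kernel object
        self._next_descriptor = 1

    def create_semaphore(self, owner, count=1):
        """Hand back a small integer, never the object's address."""
        descriptor = self._next_descriptor
        self._next_descriptor += 1
        self._objects[(owner, descriptor)] = {"type": "semaphore", "count": count}
        return descriptor

    def semaphore_post(self, owner, descriptor):
        obj = self._lookup(owner, descriptor, "semaphore")
        obj["count"] += 1
        return obj["count"]

    def _lookup(self, owner, descriptor, expected_type):
        """Validate every parameter before touching any kernel data."""
        obj = None
        if isinstance(descriptor, int):
            obj = self._objects.get((owner, descriptor))
        if obj is None or obj["type"] != expected_type:
            raise InvalidHandle(repr(descriptor))
        return obj
```

A garbage value, a stale number, or another application's descriptor all end in a clean error to the caller; none of them is ever dereferenced.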
Mandatory Access Control
Discretionary access control to critical system objects is insufficient in HA systems. An example of a discretionary access control is a UNIX file: a process or thread can, at its sole discretion, modify the permissions on a file, thereby permitting access to the file by another process in the system.
Discretionary access controls are useful for some kinds of objects in some kinds of systems. But an RTOS that is used in an HA system must be able to go one step further and provide mandatory access control of critical system objects.
For example, consider a communications device, access to which is controlled by a call processing application. The system designer must be able to set up the system statically such that the call processing program and only the call processing program has access to this device. Another application in the system cannot dynamically request and obtain access to this device. Additionally, the call processing program cannot dynamically provide access to the device to any other application in the system.
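A statically defined access table captures the essence of this example. The sketch below is an assumption-laden model: `ACCESS_TABLE`, `open_device`, and the application and device names are invented, and a read-only mapping stands in for a configuration that is fixed when the system is built.

```python
from types import MappingProxyType

# Fixed by the system designer; there is deliberately no call that edits it.
ACCESS_TABLE = MappingProxyType({
    "comm_device": frozenset({"call_processing"}),
})


def open_device(application, device):
    """Grant access only if the static table names this application."""
    allowed = ACCESS_TABLE.get(device, frozenset())
    if application not in allowed:
        raise PermissionError(f"{application} may not open {device}")
    return (application, device)
```

Unlike the UNIX file case, no application holds a right that it can pass along: the statistics gathering program cannot request access at run time, and the call processing program has no means of granting it.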
Virtual Device Driver
The kernel used in HA systems must be extremely stable and reliable. A good method of ensuring that the kernel remains stable throughout the development and maintenance of a critical system is to minimize the code that must be added to the kernel or the kernel's memory space.
In a virtual device driver, the device-specific code that must be in the kernel space is minimized (e.g., interrupt service routines). The bulk of the device driver work is handled in user space by an application that employs kernel APIs to manipulate the device. For block devices, it should be possible to map device buffers directly into the virtual device driver space for efficiency. For memory-mapped I/O, it also should be possible to provide the virtual device driver application access to physical memory as small as a single byte in order to maximize system reliability; access to physical memory that is not specifically required by the application should not be provided.
For example, suppose a critical system requires TCP/IP support. Why should the kernel assume the risks and maintenance requirements for the code needed to support TCP/IP communication? By running TCP/IP in its own virtual address space, a problem in the stack can be detected and handled without threat of causing the entire system to fail. This is an example of why a microkernel (as opposed to monolithic) RTOS architecture is ideal for HA systems. In monolithic systems such as Linux, complex system software such as TCP/IP stacks and file systems run in the same memory space as the kernel.
Virtual device drivers can minimize downtime by allowing developers to modify, upgrade, or replace system components without having to reboot the system. As long as the modified component still uses the API defined in the virtual device driver interface, the running kernel will be able to communicate with the component when it is loaded into the system. If the driver were built into the kernel instead of being loaded into its own virtual address space, reconfiguration of the component would always require the CPU to be halted while the modified kernel was reloaded. By definition, high availability systems cannot afford to experience this downtime.
Wrap Up
Many RTOSes in use today were originally designed for software systems that were simpler and ran on microprocessors that did not have memory protection hardware. With the ever-increasing computing power and complexity of applications in today's embedded systems, fault tolerance and HA features in a modern real-time operating system have become a requirement. Especially stringent are the requirements for HA systems. Fault tolerance begins with memory protection but extends to much more, including support for high availability architectures and the ability to guarantee resource availability in the time and space domains.
About the Author
David Kleidermacher is director of engineering at Green Hills Software. He has a B.S. in Computer Science from Cornell University and can be reached at davek@ghs.com.