There are three distinct areas of research related to adaptive control of software processes. The first of these areas is the use of coprocessors in an asymmetric processing configuration. The second is in the area of intrusion detection and autonomic response for system self-protection. Finally, the third is the area of modeling of software behavior or activity.
The majority of the work in the asymmetric coprocessor arena has been devoted to protecting the execution environment through the use of encryption techniques. IBM Corporation currently manufactures two coprocessors, the 4758 and the 4764, which provide total encryption capability for the z/OS and PC operating environments. In essence, the entire system and its operation are walled off behind a security facade of encryption. Unfortunately, the activity of the code executing in this environment is not monitored by the security coprocessor; backdoors and Trojans can still execute with impunity within the system.
Another distinct approach to the notion of a coprocessor that is also in the security arena is one developed by Helbig et al. In this work, the coprocessor is completely interposed between the main central processing unit (CPU) and the rest of the system on the computer. This configuration has the net effect of isolating the principal CPU from the rest of the conventional computer system. All control and data flow pass directly through the interposed security coprocessor.
Another significant approach to coprocessor security in the software systems environment is that suggested by Zambreno et al. This approach proposes the development of a hardware coprocessor that monitors program activity at the register level, examining the activity of a program at an extremely fine level of granularity. It is also dependent on knowledge of the operation of a specific compiler to model the program's register utilization.
Within the domain of the work on autonomic response, it is clear that neither encryption nor obfuscation can prevent the leakage of control flow information. The focus of recent work in this area has been to prevent information leakage from the address bus. Yet another approach specifies a run time infrastructure to provide exploitation resistant communication and coordination services even in the face of distributed attacks. Still another approach contemplates that security attributes must be specified at the beginning of a software development process and then designed into the system.
The problem of maintaining the security of a software process can perhaps best be addressed by understanding that a computer program is, in reality, an abstract machine. The processes that occur within the abstract machine may be measured as the program executes. Through this measurement activity, it should be possible to build a mathematical model of the certified operating characteristics of that abstract machine. Once such a model has been established, it should be possible to monitor the activity of the software machine in a manner very similar to monitoring the activity of a physical hardware system. The various activities of the software machine will generate measurable characteristics. These characteristics can, in turn, be monitored while the software is performing known and certifiable activity. The data generated from this process might then be used as a basis for forming a mathematical description of the state space of normal or nominal program activity.
The activity of an executing software system is visible in the various subsystems of a computer. The CPU, for example, is dependent entirely on the software that is executing on that system. This means that the operation of a software abstract machine is made manifest in the physical world in the operation of the CPU, the bus traffic to and from the CPU and the contents of memory at various points in the execution of the software.
In addition to monitoring the execution of a software system in real time, it should also be possible to control the execution of the software. That is, the software system might be altered at run time in response to a detected abnormal condition. Perhaps the simplest example of the alteration of software at run time is represented in the abnormal termination of software by a monitoring/control system. The notion of imposing external control on an executing software system based on data derived from the operation of the software is thus a natural extension of the concept of process control. It should be possible to greatly extend this simple concept to provide substantial benefits in controlling the execution of a software program in response to conditions detected by monitoring the process being executed.
Accordingly, an exemplary novel approach has been developed to implement the dynamic monitoring and control of software processes. This approach provides a mechanism for the dynamic measurement of executing software systems, a mechanism for using the resulting measurement data to determine whether a software process is executing within a pre-established nominal framework, and a mechanism for modifying the execution of a software system if it is executing outside a certified execution framework. As used herein and in the claims that follow, it will be understood that the term “software process” is intended to be generally synonymous, in its singular and plural forms respectively, with the term “software program.”
The activity of a software process can be monitored by an adjunct hardware system that tracks the effects of software execution on a principal CPU. This technique can be implemented by attaching a hardware decoder to one or more buses in the CPU. The hardware decoder can monitor these buses and send measurement telemetry based on the monitored data to an analytical system that determines whether the software is executing within an acceptable or certified range. This strategy is a pure hardware approach, in that the measurement and analytical functionality is implemented entirely in additional hardware.
There is also an exemplary hybrid design embodiment, wherein the executable code as generated by a compiler includes observable events, such as a write to a specific memory location that may be detected by assisting hardware. In this case, the assisting hardware is watching for specific events, such as writes to specific memory locations, although it will be understood that in this novel approach, the assisting hardware can function to detect other types of defined events and is not limited to detecting a system writing to one or more specific memory locations.
Given the intrusive nature of a software-based monitoring process, it is clearly preferable to employ an unobtrusive approach for the measurement of executing software. In an exemplary methodology, the monitoring function is a separate monitoring environment that is implemented by a separate controller or analysis system. The basic structure of such a system includes the system being monitored, i.e., a “monitored computer,” and a system that performs the monitoring function, i.e., a “monitor engine.” On the monitored computer side, there are two distinct software systems being monitored: the operating system executed by the system, and the set of application software (i.e., one or more software programs) that runs under the aegis of the operating system. On the monitor engine side, there is a single software system that can serve to implement the controller function; that software system is referred to herein as an “analytical engine” (or AE).
The general model of software process monitoring in accord with the present approach is shown in
Software that is to be monitored is certified. Each certified software process has one or more certificates associated with it. A certificate is a compact representation of expected state evolution. The analytical engine manages the set of certificates for each process to be executed and uses the information encoded by an associated certificate to characterize the validity of current program state as it executes. The current program state is deduced from telemetry data obtained during program execution. If the difference between the current program state and a state that is predicted by the certificate for the process increases above a pre-established threshold, the analytical engine notifies an adaptive engine that corrective action may need to be taken. The action taken by the adaptive engine is determined beforehand within the policy engine. The policy engine is under direct control of the system security administrator. Corrective actions that may be taken include, but are not limited to: program termination, priority reduction, and dynamic modification. In all cases, any corrective action taken is reported via the security administration interface of the policy engine as soon as it occurs.
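The control loop just described can be sketched as follows. This is a minimal illustration only; the class and function names, the distance measure, and the threshold are assumptions chosen for exposition, not part of any actual embodiment:

```python
# Hypothetical sketch of the monitoring control loop: compare observed
# program state against a certificate, and return the policy engine's
# pre-established corrective action when a threshold is exceeded.

CORRECTIVE_ACTIONS = {"terminate", "reduce_priority", "modify"}

class Certificate:
    """Compact representation of expected state evolution: maps each
    expected state word to its certified probability."""
    def __init__(self, word_probs):
        self.word_probs = dict(word_probs)

def deviation(observed_counts, certificate):
    """Sum of absolute differences between observed word frequencies and
    the certified probabilities (one simple distance measure)."""
    total = sum(observed_counts.values()) or 1
    words = set(observed_counts) | set(certificate.word_probs)
    return sum(abs(observed_counts.get(w, 0) / total
                   - certificate.word_probs.get(w, 0.0)) for w in words)

def monitor_step(observed_counts, certificate, threshold, policy_action):
    """If deviation exceeds the pre-established threshold, return the
    corrective action chosen beforehand by the policy engine."""
    if deviation(observed_counts, certificate) > threshold:
        assert policy_action in CORRECTIVE_ACTIONS
        return policy_action      # reported via the admin interface
    return None                   # nominal: no corrective action
```

In this sketch the threshold and distance measure stand in for whatever statistical test an actual analytical engine would apply.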
The analytical engine, policy engine, and certificate store reside in an independent hardware-based system that cannot be influenced by the monitored system in any way that is not directly related to system reliability or security. The adaptive engine is implemented within a protected part of the monitored operating system in a way that safely enables any of a number of possible corrective actions to be taken.
The certificate associated with control of a process is a compact representation of expected process state evolution. Certificate data include, but are not limited to, probabilities of expected state sequences. In one exemplary embodiment, state sequences are encoded as words arising from an abstract alphabet representing function calls and returns. In this case, the telemetry data are obtained by instrumenting the software with brief instruction sequences that report calls and returns to the analytical engine interface. In another exemplary embodiment, telemetry data are obtained directly from the monitored processor hardware without affecting the monitored software process. Here the state evolution alphabet contains address values read from the processor instruction pointer when key control flow instructions are executed. In this case, the certificate is encoded in a fashion that facilitates proper address translation when any form of code relocation is employed by the monitored software system.
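The first embodiment's encoding of call and return events as words over an abstract alphabet might look like the following sketch. The symbol scheme (open parenthesis for a call, close parenthesis for a return) is purely an illustrative assumption:

```python
# Illustrative encoding of state sequences as words over an abstract
# alphabet of function calls and returns.

def encode_events(events):
    """Map (kind, function) telemetry events to alphabet symbols:
    a call to f becomes 'f(' and a return from f becomes ')f'."""
    symbols = []
    for kind, fn in events:
        if kind == "call":
            symbols.append(fn + "(")
        elif kind == "return":
            symbols.append(")" + fn)
        else:
            raise ValueError("unknown event kind: " + kind)
    return "".join(symbols)
```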
The foundation of the methodology is to utilize a non-intrusive measurement methodology as the basis for monitoring, analyzing, and adapting the activity of a monitored software system. It is possible to monitor, exclusively in software, the operation of the system in order to determine whether it is behaving normally. However, such monitoring imposes considerable overhead, both in time and space; this overhead is precisely the part of the system that designers would wish to eliminate in a final implementation of the software system, to reduce costs. Even if the monitoring overhead is retained in the final system, its very presence confounds its own ability to monitor the system by adding complexity and possibly introducing unwanted behaviors, such as additional interrupts, longer system call executions, and undetectable vulnerabilities.
This Summary has been provided to introduce a few concepts in a simplified form that are further described in detail below in the Description. However, this Summary is not intended to identify key or essential features of the claimed subject matter, nor is it intended to be used as an aid in determining the scope of the claimed subject matter.
Various aspects and attendant advantages of one or more exemplary embodiments and modifications thereto will become more readily appreciated as the same becomes better understood by reference to the following detailed description, when taken in conjunction with the accompanying drawings, wherein:
Exemplary embodiments are illustrated in referenced Figures of the drawings. It is intended that the embodiments and Figures disclosed herein are to be considered illustrative rather than restrictive. No limitation on the scope of the technology and of the claims that follow is to be imputed to the examples shown in the drawings and discussed herein.
Central to the present novel approach is the concept that control flow, both within and between program modules of a software system, is important for understanding the execution of a software system. Basically, this control flow is made visible at the CPU level through software updates to an Instruction Pointer (IP) register during an instruction execution phase in the CPU. This IP register may be altered by a number of distinct program activities. For example, it may be directly updated by a jump, branch, or test instruction. Its contents may be saved and updated in a call-return sequence. Finally, the IP register may be altered by an external event represented by an interrupt.
In the normal flow of program control, the IP register is automatically updated during the fetch cycle. The program activities noted above further update the IP register during the execute cycle. It is also important in the present approach that a special-purpose decoder be introduced into the architecture of the system to capture these changes to the IP register.
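The classification such a decoder might apply to IP-altering activity can be sketched as follows. The mnemonics and category names are illustrative assumptions; they are not tied to any particular instruction set:

```python
# Hypothetical classification of how the next IP value is produced,
# mirroring the three IP-altering activities described above.

DIRECT_UPDATE = {"jmp", "jz", "jnz"}     # jump/branch/test instructions
SAVE_AND_UPDATE = {"call", "ret"}        # call-return sequences

def classify_ip_change(mnemonic, interrupt_pending=False):
    """Return how the next IP value will be produced."""
    if interrupt_pending:
        return "external"        # an interrupt vectors the IP externally
    if mnemonic in DIRECT_UPDATE:
        return "direct"          # jump, branch, or test writes the IP
    if mnemonic in SAVE_AND_UPDATE:
        return "call_return"     # IP saved (call) or restored (return)
    return "sequential"          # normal fetch-cycle increment
```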
In this exemplary embodiment of non-intrusive software measurement and monitoring, program control flow is directly monitored within a central processing unit (CPU) of a computer system, virtually independent of program execution. For example,
The user's software is compiled to run within a CISC program space 44, which is organized in a fashion consistent with a von Neumann computer architecture 42. However, the software execution is governed by activities within the RISC that is organized more like a Harvard Architecture 56, where instructions and data reside in separate spaces. An ability to locate instructions and data anywhere in a shared program space is an attribute commonly associated with the von Neumann organization. The RISC has very little or no control over the memory address translations required to implement the instruction and data relocation implied by the von Neumann architecture. As a result, to accurately characterize program control flow using CPU instruction execution monitoring, it is necessary to also account for instruction address relocation in CISC program space 44. One way to achieve this requirement is to locate a portion of the analytical engine within the hardware that controls memory address translations. Another is to modify program certificates and/or their interpretation by the analytical engine.
A program's certificate encodes the statistical probabilities of its state evolution. In this example, program control flow is used to characterize state evolution. Here, the control flow is encoded as instructions and addresses, i.e., instructions that either change or refer to addresses in CISC program space 44. Since the von Neumann architecture allows instruction and data addresses to vary with time, it is necessary to account for this variation somewhere in the control flow analysis process performed by analytical engine 20. This function is accomplished either dynamically, during program execution, or statically, when instruction blocks are loaded into memory. In the dynamic approach, a portion of analytical engine 20 resides in the memory management hardware, where it continuously monitors translations of IP 58 addresses to addresses in CISC program space 44. In the exemplary embodiment of
Variations of instruction execution monitoring are also possible for other CPU organizations. For example, monitoring of pure RISC and Very-Long-Instruction-Word (VLIW) computer architectures would proceed in a fashion similar to that detailed in
CPU bus monitoring obtains program control flow information from outside the CPU via any of its associated bus interfaces. As shown in the simplified architecture illustrated in
In a variation of the architecture shown in
Peripheral assisted monitoring employs peripheral hardware to assist with the software monitoring task. For example, a monitoring system that is interfacing via the Peripheral Component Interconnect (PCI) or PCI express (PCIe) bus is shown in
Peripheral assisted monitoring is a hybrid approach that requires the monitored software to be modified to cooperate with the monitoring peripheral. Here, the software systems to be monitored are modified so that call and return instructions are logged as events in the assisting hardware. To achieve this function, a modified compiler is used to instrument the code with writes to a specific memory location that will be used for process monitoring. PMR 192 can then be accessed by analytical engine 20 for each corresponding call and return instruction. Typically, the additional code that must be added for each call and return is approximately equivalent to two assembly-level instructions on an Intel Corp. Pentium™ class processor: the first instruction loads a register with the data to be written to the PMR, and the second instruction writes that data to the PMR.
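The effect of the compiler instrumentation can be simulated in a few lines. In this sketch a Python list stands in for the memory-mapped PMR, and a decorator stands in for the modified compiler; a real implementation emits the two machine instructions described above at each call and return:

```python
# Simulation of compiler-inserted process-monitoring writes. All names
# are illustrative; PMR here is a list the analytical engine would drain.

PMR = []  # stands in for the memory-mapped process monitoring register

def pmr_write(event, address):
    """The two-instruction sequence modeled in software."""
    datum = (event, address)   # instruction 1: load register with datum
    PMR.append(datum)          # instruction 2: write datum to the PMR

def instrumented(fn):
    """Stands in for the modified compiler: log each call and return."""
    def wrapper(*args, **kwargs):
        pmr_write("call", fn.__name__)
        try:
            return fn(*args, **kwargs)
        finally:
            pmr_write("return", fn.__name__)
    return wrapper
```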
At the point where a monitored process crosses the threshold of nominal operation, the analytical engine on the PCIe board generates an interrupt to transfer control from the executing process to the Linux kernel. This interrupt is trapped by the interrupt service routine, which in turn, passes control on to adaptive engine 16, so that the adaptive engine can manage the anomaly in the executing software.
An execution certificate for a program establishes the set point for the software process controller. It captures the statistical probabilities associated with certified program state evolution. At the point that any certified software process is executed, its execution certificate will be acquired by the process controller.
The precise structure of the execution certificate is a function of the measurement domains that are chosen to be monitored. The execution certificate is, at its core, a mathematical description of the certified activity of a program that is to be monitored. The execution certificate will typically reside in a secure and protected domain of the analytical engine. As each process is initiated by the operating system, the certificate associated with the process will be retrieved by the analytical engine. The activity of the monitored process is then followed by the analytical engine, as it analyzes the data arriving from the executing process.
If, for example, the executing process is to be monitored for all “between module” activity, the execution certificate would consist of an n-ary call tree constructed during the software calibration process. This n-ary tree would, in fact, comprise a subset of all possible arcs in the potential call tree of the program, i.e., just those arcs that were observed while the program activity was being certified. Each of the possible paths in this n-ary tree would be represented by a word in the execution vocabulary of the program. Thus, the certificate would also contain a list of these words, together with the probability of each being encountered as the program executes.
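One possible shape for such a “between module” certificate is sketched below. The module names, arc structure, and probabilities are invented purely for illustration:

```python
# Illustrative "between module" execution certificate: the observed
# subset of call-tree arcs plus the probability of each execution word.

certificate = {
    "arcs": {                       # observed arcs of the n-ary call tree
        "main": ["parse", "run"],   # main was observed calling parse, run
        "run": ["step"],            # run was observed calling step
    },
    "word_probs": {                 # observed paths as execution words
        ("main", "parse"): 0.40,
        ("main", "run", "step"): 0.60,
    },
}

def word_is_certified(word, cert):
    """A word is certified only if every arc along it was observed
    during calibration."""
    return all(child in cert["arcs"].get(parent, [])
               for parent, child in zip(word, word[1:]))
```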
All certified software systems must have a stored execution certificate. This certificate may actually accompany the software system in an encrypted form in the software load module. Alternatively, the certificate may reside in read-only memory within the analytical engine. In an operational certified system in accord with this novel approach, no software will be enabled to execute without a valid execution certificate.
The underlying system architecture has two significant components. The analytical engine measures the continuing operation of the software system for nominal activity. When the system is diagnosed to be in an abnormal state, the analytical engine captures control and hands it to the adaptive engine to ameliorate, terminate, or correct the activity that drove the system to its abnormal state.
There are two distinct phases of operation for the analytical engine. The first mode of operation is the calibration phase. During this mode of operation, the system will be exercised in its normal mode of operation. The analytical engine will then build the model of normal system activity from the repertoire of functions that occur during this observation interval. When the system has been appropriately calibrated, it is then ready to be placed in its operational mode. In essence, then, there is a learning phase for a software process that precedes an operational phase in the use of the analytical engine to monitor that process. A fundamental concept is that there is no standard model of software activity. The normal activity of a system is entirely dependent on the role that the software system will be asked to play. When the same software system is deployed in a number of different operational contexts, the label of “normal” is defined by the specific context in which the software system is used. The obverse of this coin is that abnormal system activity is also context dependent. The same assault on a system will be expressed differently in system activity depending on the operations being performed on that system. A key concept to the success of this monitoring strategy is that it is adaptive and will function equally well in a host of different contexts.
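The calibration phase described above can be reduced to a simple sketch: exercise the system in its normal mode, count the execution words observed, and convert the counts into a context-dependent nominal model. The function name and data shape are assumptions for illustration:

```python
# Sketch of the calibration phase: build the model of normal activity
# from the repertoire of words observed during the observation interval.

from collections import Counter

def calibrate(observation_runs):
    """observation_runs: iterable of word sequences seen while the
    system performs its normal functions in this specific context."""
    counts = Counter()
    for run in observation_runs:
        counts.update(run)
    total = sum(counts.values())
    # The working certificate: nominal probabilities for this context.
    return {word: n / total for word, n in counts.items()}
```

Because the probabilities are derived from observation in a specific deployment context, the same program calibrated in two contexts would yield two different models, which is exactly the context dependence noted above.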
The precise nature of the adaptive engine is dependent on the role that the monitor system is serving. If the issue at hand is one of reliability, safety, or survivability, the adaptive engine can be empowered to modify the flow of execution of the software application, thus removing the offending functionality. If, on the other hand, the criterion is a performance related issue, then the abnormal activity is indicative of the sub-optimal configuration of the hardware. In this context the adaptive engine may be empowered to alter the configuration of the principal CPU to reflect the change in the operating environment.
The adaptive engine also establishes the necessary tools to support the execution environment of safety-critical applications. Perhaps the greatest threat to the reliable operation of a modern software system is the unanticipated demands placed on the system by its environment. As a consequence, the system may well shift from a reliable execution framework to an uncertified and unreliable one.
As a program executes a particular user operation, it transfers control from one module to another. There is always a main program module that receives control as the program begins to execute. The structure of the executing program may be represented as a call tree, where the root of the call tree represents the main program module. Each program functionality is represented by one or more sub-trees of this call tree, depending on the number of operations that are implemented by that functionality.
Just as each node of the call tree provides an abstraction of the processor instructions executed when a module is instantiated, a sub-tree represents an abstraction of a function that is performed when its root module is instantiated. In
The key contribution of the monitoring architecture is the characterization of the reliability of a software system in terms of the system's certified activities as it executes its various operations, and the implications of those activities for system survivability. The certified assessment of program activity is accomplished dynamically, while the program is executing, to identify changes in software activity directly attributable to a failure event or to the execution of an unprecedented operation (e.g., an attack). It is understood that no software system can be thoroughly or exhaustively tested for all possible contingencies. However, it is possible to certify a range of software behaviors that represent the certified program activity of a correctly designed and specified software system, for a defined context.
By incorporating the monitoring function directly into the system design methodology, it is possible to drastically shift away from the current paradigm that addresses reliability, security, and survivability in an add-on fashion, occurring at the end of the design cycle. Instead, a unified and integrated design methodology for monitored systems is outlined below. The essence of the new architecture is that it will provide the ability to reliably monitor an executing software system in real time. It will also provide the infrastructure to modify the executing process should anomalies in the activity of this system occur while it is executing.
The second aspect of the operation of the analytical engine is the nominal operational phase. This is the normal mode of execution monitoring for the analytical engine. During this phase of execution, each new execution word is validated against the nominal distribution of the words in the execution alphabet. There are two aspects of this validation process. First, as each new word is formed on the call stack, the word must be part of the execution vocabulary. Second, if the certified probability of encountering a given word is very low, then that word must not begin to occur with high frequency during execution.
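These two validation checks can be sketched as a single function. The thresholds (what counts as “rare” and what counts as “too frequent”) are illustrative assumptions, as are the parameter names:

```python
# Sketch of operational-phase validation: (1) each word must be in the
# certified vocabulary; (2) a word certified as rare must not occur
# with high frequency. Thresholds are illustrative only.

def validate_word(word, cert_probs, observed_counts, total_observed,
                  rare_prob=0.01, freq_limit=0.10):
    if word not in cert_probs:
        return "unknown word"            # outside execution vocabulary
    observed_freq = observed_counts.get(word, 0) / max(total_observed, 1)
    if cert_probs[word] < rare_prob and observed_freq > freq_limit:
        return "rare word occurring too often"
    return "nominal"
```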
In addition to the real-time monitoring of process activity, the analytical engine must have the capability of managing abnormal software execution scenarios that unfold in a very few instruction cycles. Again, the most pertinent example of such an attack is a buffer overflow. One possible symptom of such an attack is that the program will attempt to fetch an instruction that resides in the program data (D) space. In that case, there must be logic in the analytical engine to store the instruction (I) and D space boundaries and to ensure that all program fetches occur in the program I space.
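The boundary check itself is simple, as the following sketch shows; the address ranges here are arbitrary illustrative values, not those of any real memory map:

```python
# Minimal sketch of the instruction-space boundary check: every program
# fetch must fall inside instruction (I) space. A fetch from data (D)
# space is the classic buffer-overflow symptom described above.

I_SPACE = range(0x1000, 0x8000)   # assumed instruction-space boundaries
D_SPACE = range(0x8000, 0xF000)   # assumed data-space boundaries

def check_fetch(address):
    """Return True only if the fetch address lies in I space."""
    if address in I_SPACE:
        return True
    # A fetch from D space (attack symptom) or from outside both spaces
    # is abnormal in either case.
    return False
```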
Exemplary logical steps for implementing the present approach are shown in a flowchart 90 in
However, if the process is already certified in decision step 98, a step 104 loads a new certificate that is associated with the new process. Next, a step 106 provides for the monitoring engine to start monitoring the new process, e.g., by measuring the executing process in a step 108. A decision step 110 determines if the process execution is nominal. If so, the step of measuring the process continues in step 108 (note—although not shown, once the process is complete, the logic returns to step 96 to load a new process for execution). If the process execution is not within the nominal range, a decision step 111 determines if the policy requires termination of the process (as being abnormal), and if so, a step 112 terminates the process. Next, a step 113 sets an administrative alert to indicate that the previous process was terminated. The logic then returns to step 96 to load a new process.
Referring to decision step 100, if the new process just loaded is to be calibrated, a decision step 114 determines if the new process is a new uncertified process for which calibration has not yet been started. If so, a step 116 provides for initiating a new certificate. The new process is then permitted to continue until it reaches normal termination in a step 118. A decision step 122 then determines if the certification is complete. If not, a step 124 stores a working certificate for the process, and the logic returns to step 96. Conversely, if the certification is complete, a step 126 converts the working certificate to an actual certificate before also returning to step 96.
If decision step 114 determines that the process is not a new uncertified process, but instead, that the certification is being developed, a step 120 loads the certificate that is under development for the process. The logic then again proceeds to step 118 to enable the process to reach a normal termination.
If decision step 111 determines that the policy does not require termination of the process due to an abnormal execution, a step 128 implements a process adaptation as defined by the policy. Next, a step 129 sets an administrative alert to indicate that the process has been adapted in an attempt to correct the anomaly in the execution of the process. The logic then continues with step 108, to determine if the adaptation was successful.
Once the logic in flowchart 90 is complete, each process running on the monitored computer should have been provided with a corresponding certificate.
A task is, by definition, the smallest unit of a software process that can be scheduled by an operating system and is typically a main program or a thread of a main program. In a single processor system, each task is typically assigned a process identification (PID) when the task is initiated. Tasks may be active, actually executing on the CPU, or tasks may simply be ready, which means that they are awaiting execution in a process queue. Tasks may also be inactive, if they are awaiting the arrival of services to be delivered by the operating system. In a multiprocessing system, there will be two or more CPUs, and the operating system must then bind each task to a particular CPU.
In regard to the concept of a task, a block diagram of a more detailed view of an exemplary embodiment of a software process control system in a monitored computer 11 is illustrated in
The real function of the task messenger is to monitor the task-switching function in the operating system. The task messenger is a software process embedded in the operating system to track the assignment of software processes to CPUs. Each process is, in turn, identified by its PID and its name. The task messenger builds a vector <PID, Process-Name, CPU#> at each context switch in the operating system.
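The context-switch vector built by the task messenger can be sketched as a small record. The record shape and hook name are assumptions; only the <PID, Process-Name, CPU#> fields come from the description above:

```python
# Sketch of the task messenger's context-switch record: the
# <PID, Process-Name, CPU#> vector built at each context switch.

from typing import NamedTuple

class TaskSwitch(NamedTuple):
    pid: int
    process_name: str
    cpu: int

def on_context_switch(pid, name, cpu, log):
    """Called by a (hypothetical) OS hook at each context switch; the
    resulting vector is forwarded to the task monitor."""
    vector = TaskSwitch(pid, name, cpu)
    log.append(vector)
    return vector
```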
The process controller, or analytical engine, can only monitor one process at a time. The function of binding executing processes to the set points or certificates that represent their nominal activity is the function of the task monitor. As each new vector representing a task switch is generated by the task messenger, task monitor 142 can use this vector to bind the process controller to the certificate representing that task.
The adaptive engine component of the system can reside in the memory of the monitored computer, for example, as an attachment to the operating system environment. This component will be invoked at the discretion of the analytical engine through the system interrupt structure, and the adaptive engine can alternatively be invoked by the process controller. In the last analysis, the adaptive engine will likely be the mechanism that captures control when an abnormal condition arises within the process controller.
In the event that an abnormal activity is observed by the process controller, data indicating the nature of the abnormality can be transmitted by the process controller to the adaptive engine. The response of the adaptive engine to the departure from certified behavior can be determined, a priori, by an operational policy set by the system administrator. The simplest possible response by the adaptive engine to a noted departure from normal process execution would be to instruct the operating system to terminate the aberrant process.
Each abnormal condition is carefully articulated in the operational policy. For each abnormal condition, the system can take appropriate action as dictated by the underlying policy. Thus, a key component in an effective embodiment of the adaptive engine will be the design of the protocol for the policy that will in turn, govern the operation of the adaptive engine.
There are two issues related to the operational policy for the system. The first policy component specifies when to take action, and the second component specifies the action that is to be taken. The analytical engine, for example, may be instructed to take action when a new word is observed on the execution call stack. The associated action would likely be to generate an interrupt of the system bus so that the adaptive engine can acquire control of the system. On the other side, the adaptive engine would receive control as a result of this interrupt. It can then implement a predefined policy action associated with the recovery from the unexpected program activity.
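The two-part policy (when to act, what action to take) reduces naturally to a condition-to-action table. The condition names, action names, and default below are illustrative assumptions, not a prescribed policy language:

```python
# Illustrative policy protocol: the first component of each entry is
# the condition (when to act), the second the predefined action.

POLICY = {
    "new_word_on_call_stack": "interrupt_and_invoke_adaptive_engine",
    "rare_word_high_frequency": "reduce_priority",
    "fetch_from_data_space": "terminate_process",
}

def decide(condition):
    """Map an observed abnormal condition to its predefined action;
    unknown conditions default to the most conservative response."""
    return POLICY.get(condition, "terminate_process")
```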
The precise set of immediate response policy requirements should be carefully articulated during the initial design stages. Also these immediate response policies should be part of the system policy architecture.
A system administrator can establish policy. A very important role for the policy engine is to interface with the system administrator through a user interface, such as user interface 136 in
It is possible to measure quite a large number of program attributes at the level of the computer that is actually executing a program. One objective is to discover a minimal set of such measures that will provide the resolution necessary to determine whether an active software process is performing in a normal/certified manner.
In the simplest form, the operation of a software process might be described by a single measure or variable. In such a model, a decision might be made, for example, to measure the number of procedure calls during a fixed time interval. Such a single variable measurement system would be denoted as a univariate process control system. Alternatively, the measurement space may include multiple distinct variables that will be measured simultaneously. A process control system based on multiple measures would be known as a multivariate process control system.
One of the single most important aspects of the concept of software measurement is that it is possible to construct a mathematical description or model of normal behavior. The foundation of this model is derived from work in dynamic software measurement. To lay the foundation for a measurement-based, dynamic monitoring system that permits the real time assessment of software reliability, it is necessary to establish a conceptual foundation for program execution that lends itself to a suitable instrumentation for the monitoring and failure analysis processes.
A software system in operation will distribute its activity across a set of distinct operations. Thus, it is possible to define more precisely the notion of system activity with regard to the executing software system.
In the subsequent discussion of program operation, it is useful to make the description of program specification, design, and implementation somewhat more precise by introducing several notational conveniences. This discussion can begin by observing that there are really two distinct abstract machines, or models, that define the implementation of any software system.
The first abstract machine is an operational machine, which interfaces directly with the hardware. The embedded system provides a suite of services to the hardware system, and each of these services causes the operational machine to perform a series of actions called operations. It is the purpose of this operational machine to articulate exactly what the software system must do to provide the necessary services dictated by the embedded software system requirements.
The second abstract machine is a functional machine, which is animated by a set of functionalities that describes exactly how each system operation is implemented. Whereas the operational abstract machine articulates what the software system will look like to the hardware system in which it is embedded, the functional abstract machine is the entity that is actually created by the software design process. Turning now to the precise relationship between the operational abstract machine and the functional abstract machine, it is quite conceivable that a system could be constructed wherein there is a one-to-one mapping between a user's operational model and the functional model. That is, for each user operation, there might be exactly one corresponding functionality. In most cases, however, there may be several discrete functionalities that must be executed to express the system services provided by the operational abstract machine.
Each operational machine includes a set, O, of operations that animate it. Similarly, each functional system has a set, F, of functionalities that animate it. For each operation, o ∈ O, that the system may perform, there will be a subset, F(o) ⊂ F, of functionalities that will implement it. It is possible, then, to define a relation IMPLEMENTS over O×F such that IMPLEMENTS (o,f) is true if functionality f is used in the implementation of an operation, o. Within each operation, one or more of the system's functionalities will be expressed. For a given operation, o, these expressed functionalities are those with the property F(o)={f: F|IMPLEMENTS(o,f)}.
Each functionality exercises a particular aspect of the functional machine. As long as the system operational profile remains stable, the manner in which the functional machine actually executes is also stable. However, when there is a major shift in the operational profile by the system, then there will be a concomitant shift in the functional profile as well, which redistributes the activity of the functional machine and results in uncharacteristic behavior of the functional machine. This change in the usage of the system constitutes anomalous system activity. It should be noted that this definition of anomalous system activity is much more precise than that used in intrusion detection, i.e., anomaly detection.
Let M be a set of program modules for a system. The software design process is then basically a matter of assigning functionalities f ∈ F to specific program modules m ∈ M. The design process may be thought of as the process of defining a relation, ASSIGNS, over F×M such that ASSIGNS(f, m) is true if functionality f is expressed in module m.
Each operation in O is distinctly expressed by a set of functionalities. If a particular operation, o, is defined by functionalities fa and fb, then the set of program modules that are bound to operation o is M(o)=M(fa) ∪ M(fb), where M(f)={m: M|ASSIGNS(f, m)}.
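Under these definitions, the IMPLEMENTS and ASSIGNS relations and the derived mappings F(o) and M(o) can be sketched directly as sets of pairs. The particular operations, functionalities, and modules below are invented for illustration:

```python
# Sketch of the O/F/M mappings (example relations invented for illustration).
IMPLEMENTS = {("o1", "f1"), ("o1", "f2"), ("o2", "f2"), ("o2", "f3")}
ASSIGNS = {("f1", "m1"), ("f2", "m2"), ("f2", "m3"), ("f3", "m4")}

def F_of(o):
    """F(o): the functionalities expressed by operation o."""
    return {f for (op, f) in IMPLEMENTS if op == o}

def M_of_f(f):
    """M(f): the modules in which functionality f is expressed."""
    return {m for (fn, m) in ASSIGNS if fn == f}

def M_of(o):
    """M(o): the union of M(f) over all f in F(o)."""
    return set().union(*(M_of_f(f) for f in F_of(o)))
```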
There is a distinct mapping from the set of operations to the set of program modules. Each operation is associated with a distinct set of functionalities. These individual functionalities are, in turn, associated with a distinct set of modules. The mappings are explained using an example 150 shown in
The modules can be organized into a program call tree as a function of the program design process. The call tree for the above example will look like exemplary tree 160 shown in
As this hypothetical program executes, it will distribute its activity across the arcs of the call tree. This activity is characterized in terms of execution paths that begin at the root and end at all interior and leaf nodes of the tree. In Table 1, above, the possible execution paths of
In this hypothetical execution interval, functionality f3 was invoked during the execution of operation o1, but it was not invoked during the execution of operation o2, as is evidenced by the fact that p(w11)=p(w12)=p(w13)=0.0.
Any of the root nodes of the sub-trees of the call tree are potential candidates for threads of execution. Accordingly, it is appropriate to associate execution threads with functionalities.
During its development, and particularly, during the test process, a software system will be subjected to a wide range of activity. The upshot of this testing activity is that the software will have been exercised over a subset of its possible operational space. Having tested the software through a set of user operations that induced a set of measurable activity on the software, the developer might then be willing to certify the operation of the software as long as it was used in a manner similar to the way that it was tested. To this end, the mathematical description of nominal activity is referred to herein as a “software certificate.” It will become the set point for the process control system. As long as the software is used in the same manner that it was tested, that is, as long as its use does not depart from the certified activity represented by the software certificate, it is reasonable to expect that it would work reliably or that it is not being compromised.
The greatest threat to the reliable operation of the system is the unanticipated operational demands placed on the system by a changed operational environment, which can happen in one of two distinct ways. From a security perspective, the unanticipated operational demand can occur if a system vulnerability has been exploited. From a reliability perspective, the system may have been driven into an untested and uncertified module domain. As a consequence, the system will shift from a reliable operational profile (i.e., as certified by the software developer), to an uncertified profile. In this event, it will be important to understand whether the new behavior can be tested and possibly certified as reliable. To do so, the software system should be suitably instrumented to provide sufficient information to reconstruct the system activity and validate the correctness of that behavior and the associated system components. The main objective of the dynamic measurement methodology is to capture any activity that is considered uncertified, in real time. It is possible, from the observed activity of the software modules, to determine with a certain level of confidence, the reliability of a system under one or more certified operational profiles.
A key assumption is that uncertified software activity has important consequences. The departure of a software system from its underlying certificate is a likely indication that the software/hardware component has failed, a malicious attack has occurred, or that the user(s) have initiated a sequence of uncertified operations. The failure event itself is made tangible through the execution of uncertified system activity. The monitor engine functions by noting the difference between the current state of the system and a model that represents the normal, or certified, execution environment. Once a determination has been made by the analytical engine as to the specific nature of the departure from the certificate or set point, corrective action may then be initiated by the adaptive engine.
The basic notion of certified software execution is that the certificate specifies the range of acceptable values that the measurement attributes may take while the target software system is executing. When an executing software system is observed to be operating outside the range of the certificate, then the adaptive engine will alter the course of execution of the software process. This alteration may include, for example, eliminating some of the software functionality, which will, in turn, constrain the software's activity to a subset of possible observation values in the controlled variables; it may instead result in termination of the software process or some other limitation on its operational scope.
The attribute space of dynamic software measurement is very large. However, these measurements can be partitioned into measurements that are taken at the module level and above, and those that are taken within each program module. The set of measurements taken at the module level of granularity and above are referred to herein as “Between Module Measurements.” The set of measurements that are taken when a single module is executing are referred to herein as “Within Module Measurements.”
In order to provide a clear explanation, it will be necessary to define the concept of a program module. For the purposes of this discussion and as used herein, a program module is a set of machine instructions that can be accessed through the use of a machine CALL instruction. The set of machine instructions begins with an instruction that is the object (destination) of the CALL instruction. It is delimited by a machine RETURN instruction. In other words, control is passed to a program module by a CALL instruction, and control is relinquished by the module through the use of a RETURN instruction.
An executing program may be represented structurally in one of two distinct ways. By design, the modules are linked together into a call graph data structure. In this representation, a node in the graph represents a module. Incoming arcs to this node represent calls to the module, and the outgoing arcs from the node represent calls to other program modules. There are a number of distinct measures that may be developed from this call graphical representation. The major problem with this representation is that a module or a sub-graph representing a functionality may be used in a number of different contexts. It might be perfectly normal for the module (or the root of the sub-graph) to be invoked in one or two different contexts but completely abnormal for it to be invoked in any other context.
An alternate means of representing a program is as a call tree. In this approach, each program module is represented by a node in the call tree. However, the indegree, i.e., the number of entering edges, is restricted to be one. Thus, if a program module is invoked by several different modules, it will be represented by many different nodes in the call tree—one for each module invoking that module.
Beginning with the main program module, the called program module names may be placed on a call stack. Each program module will be a letter in the potential execution alphabet of the program. At any point in the program execution, the instantaneous description of the program call stack contains an ordered set of letters from the program execution alphabet. This set of letters represents a word, wi, from the execution vocabulary, W, of the program. The key to the success of this approach of real time monitoring of an executing process is quite simple. There is a vast disparity between the cardinality of the set, W, i.e., of all possible words in the execution vocabulary and the cardinality of the set of words, WC, that actually occurs in a certified execution context. That is, the number of elements in the set WC is very much smaller than the number of elements in W.
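The call-stack "word" mechanism described above can be sketched by treating each stack snapshot as a tuple and testing it against the certified vocabulary WC. The module names and certified vocabulary below are invented for illustration:

```python
# Sketch: treat the instantaneous call stack as a "word" and flag words
# outside the certified vocabulary WC (names invented for illustration).
certified_vocabulary = {
    ("main",),
    ("main", "parse"),
    ("main", "parse", "read_token"),
}

call_stack = []
new_words = []

def on_call(module):
    """Record a CALL: push the module and check the resulting word."""
    call_stack.append(module)
    word = tuple(call_stack)
    if word not in certified_vocabulary:
        new_words.append(word)   # uncertified activity observed

def on_return():
    """Record a RETURN: pop the current module."""
    call_stack.pop()

on_call("main"); on_call("parse"); on_call("evaluate")
```

The membership test is cheap precisely because, as noted above, WC is very much smaller than W.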
The distribution of words in an execution vocabulary is also directly dependent on the execution context. Let pi=p(wi) represent the probability of encountering word wi ∈ WC during the normal execution of a program. The underlying probability distribution of the pi is a distinct attribute of the program execution in a particular context. The nominal activity, then, of a program in execution is embodied in the set WC and the probability distribution associated with each of the elements of this set. The key concept here is that when the execution framework of a program changes, then the distribution of the pi will also change. It is also possible that a new word, wj, will appear on the call stack, where wj ∉ WC.
The distribution of the wi is far from uniform. Just as is the case with the English language, some words will occur very frequently, while others may not be expressed at all. This concept is described by the notion of entropy. Accordingly, the entropy, h, of an application in this context is given by h = −Σi p(wi) log p(wi). Thus, it is a relatively simple matter to ascertain the underlying distribution of the pi. It is a key feature of most software systems that they are very low entropy applications, which in turn means that the calibration phase of a typical application is also very short.
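As a sketch, the entropy h can be estimated from an observed stream of call-stack words via their relative frequencies. Base-2 logarithms are assumed here, since the text does not fix a base:

```python
from collections import Counter
from math import log2

def execution_entropy(observed_words):
    """h = -sum_i p(w_i) log p(w_i), estimated from observed frequencies."""
    counts = Counter(observed_words)
    total = sum(counts.values())
    return -sum((c / total) * log2(c / total) for c in counts.values())
```

A stream dominated by a single word yields an entropy near zero, consistent with the low-entropy character of most software systems noted above.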
If there is a significant change in the entropy of a system due to its usage patterns, when it is placed in service, from that determined during its calibration, then there is a clear indication that the system is being exercised in a very different manner from that for which the calibration was established. There are two possible cases here. Either the entropy will rise above the calibration entropy, or it will be lower. In either event, the change in entropy will reflect the fact that the software is being exercised in a different manner from the calibration activity. System entropy, then, is also a dynamic measure of system activity—and thus, is an indicator of abnormal system activity.
It is also possible to measure the distribution of dwell time for each word in the execution vocabulary. The actual real time (measured in processor cycles) can easily be measured for each instantiation of each word. If the actual execution time associated with a program module when the program is placed in service is at variance with the calibrated execution time, then the program module is likely being exercised in a manner different from its calibration suite of activities.
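A dwell-time measurement might be sketched as follows. The text measures dwell in processor cycles; as an assumption, a portable monotonic nanosecond clock stands in for a cycle counter here:

```python
import time

def measure_dwell_ns(fn, *args):
    """Return (result, dwell) for one instantiation of a module.

    The document measures dwell time in processor cycles; this sketch
    substitutes a monotonic nanosecond clock as a portable stand-in.
    """
    start = time.perf_counter_ns()
    result = fn(*args)
    return result, time.perf_counter_ns() - start

result, dwell = measure_dwell_ns(sum, range(1000))
```

Each observed dwell would then be compared against the calibrated distribution for that word.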
It is further possible to monitor the flow of data into and out of each program module. From a mathematical perspective, a program module performs a functional transformation on a point, a, in an argument space, to a new point, b. In this case, a=<a1, a2, . . . , an>, where ai is the value of the ith argument in a call to the module and where the dimensionality of the argument space for the program module is given by n. Each of the ai is defined on a finite set of integral values, typically represented by a bit string in computer memory.
Following the same logic, b=<b1, b2, . . . , bm>. The set of certified values for a constitutes the domain on which the transformation is defined, and the set of certified values for b constitutes its range. Thus, for the jth program module, this functional transformation can be represented by b=fj(a).
If a module is passed a point in the argument space that is outside the set of certified argument points, then it is possible that the execution of the module will produce anomalous results. Similarly, if a module transforms a certified point, a, in the argument space to a new point, e.g., b′, which is not in the certified set of result values, then the module can be said to have produced an anomalous result.
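This certified argument/result check might be sketched as a wrapper around a module fj. The certified sets and the example module below are invented for illustration:

```python
# Sketch of certified argument/result checking around a module f_j.
# The certified sets and the example module are invented for illustration.
certified_arguments = {(1, 2), (3, 4)}   # certified points a
certified_results = {3, 7}               # certified points b = f_j(a)

def checked_call(fj, a):
    """Flag uncertified inputs a and uncertified outputs b = f_j(a)."""
    if a not in certified_arguments:
        return ("uncertified_argument", a)
    b = fj(*a)
    if b not in certified_results:
        return ("uncertified_result", b)
    return ("ok", b)

add = lambda x, y: x + y   # stand-in for the jth program module
```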
Possible Within Module Measures
Perhaps the most significant measures that can be taken on the internal operation of a module relate to the flow of control within the module. The “within module” structure is best explained with a flow graph representation. This flow graph includes a set of nodes and edges. The nodes represent activity events in the program flow, such as processing or decisions, and the edges represent program flow from one node to another.
A control flow graph of a program module is constructed from a directed graph representation of the program module that can be defined as follows:
The flow graph representation of a program, F=(E′, N′, s, t), is a directed graph that satisfies the following properties.
All other nodes are members of exactly one of the following three categories: processing nodes, predicate nodes, or receiving nodes.
If (a, b) is an edge from node a to node b, then node a is an immediate predecessor of node b, and node b is an immediate successor of node a. The set of all immediate predecessors for node a is denoted as IP(a). The set of all immediate successors for node b is denoted as IS(b). No node may have itself as a successor. That is, a may not be a member of IS(a). In addition, no processing node may have a processing node as a successor node. All successor nodes to a processing node must be either predicate nodes or receiving nodes. Similarly, no processing node may have a processing node as its predecessor.
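The immediate predecessor and successor sets IP(a) and IS(a) can be computed directly from an edge list. The example graph below is invented for illustration:

```python
# Sketch: immediate predecessors IP(a) and immediate successors IS(a)
# computed from an edge list (example graph invented for illustration).
edges = [("s", "a"), ("a", "b"), ("a", "c"), ("b", "t"), ("c", "t")]

def IP(node):
    """Set of all immediate predecessors of node."""
    return {u for (u, v) in edges if v == node}

def IS(node):
    """Set of all immediate successors of node."""
    return {v for (u, v) in edges if u == node}
```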
From this control flow graph representation, two essential control flow primitive metrics emerge:
A path P in a flow graph F is a sequence of edges ⟨(a1, a2), (a2, a3), . . . , (aN−1, aN)⟩, where all ai (i=1, . . . , N) are elements of N′. P is a path from node a1 to node aN. An execution path in F is any path P from s to t.
Another very important feature of a flow graph, the representation of program iteration constructs, must be considered. A program may contain cycles of nodes created by while statements, for statements, and so forth. These iterative structures are called cycles, as opposed to the more familiar concept of a programming loop.
A path through a flow graph is an ordered set of edges (s, . . . , t) that begins on a starting node s and ends on a terminal node t. A path may contain one or more cycles. Each distinct cycle cannot occur more than once in a sequence. That is, the sub-path (a, b, c, a) is a legal sub-path, but the sub-path (a, b, c, a, b, c, a) is not, because the sub-path (a, b, c, a) occurs twice.
The total path set of a node a is the set of all paths (s, a) that go from the start node to node a itself. The cardinality of this set is the total path count of node a. That is, each node singles out a distinct number of paths that begin at the starting node and end with the node itself, and the path count of a node is the number of such paths. The module path count is the number of total paths from s to t.
Cycles are permitted in paths. For each cyclical structure, exactly two paths are counted: (a) one that includes the code in the cycle; and, (b) one that does not. In this sense, each cycle contributes a minimum of two paths to the total path count.
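Once each cycle has been reduced to its two counted alternatives (taken or not taken), the remaining graph is acyclic, and per-node path counts can be sketched with a memoized traversal. The example edge set is invented for illustration:

```python
from functools import lru_cache

# Sketch: per-node total path counts on a flow graph whose cycles have
# already been reduced per the two-path rule above, so the remaining
# graph is acyclic. Example edges invented for illustration.
edges = {("s", "a"), ("s", "b"), ("a", "b"), ("a", "t"), ("b", "t")}

@lru_cache(maxsize=None)
def path_count(node):
    """Number of distinct paths from s to this node."""
    if node == "s":
        return 1
    return sum(path_count(u) for (u, v) in edges if v == node)
```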
As each program module is executed, it will exercise a subset of the possible paths in the module flow graph. The cardinality of the exercised subset is a measure that can be taken on the execution activity. When the software has been calibrated for the purposes of certification, the sub-flow graph representing the totality of module paths from the certification process for that module will constitute the certified sub-flow graph. When the program module is placed into service, it is possible to determine whether the path selected by the most recent module execution is a member of the set of certified paths from the certified sub-flow graph. The nature of an uncertified departure of the module execution may be measured by the cardinality of new module paths and new nodes that are not in the certified sub-flow graph.
Within the certified sub-flow graph, there is a set of processing nodes A′={a1, a2, . . . , am}, where A′ ⊂ A and A represents the complete set of processing nodes from the actual module flow graph. It is possible to define a probability distribution pi=Pr(ai), such that pi represents the probability of executing the code in processing node ai under a certified distribution of processing node activity. When the program containing the module is placed into service at some future time, the new observed distribution of processing node activity can be represented by p′i=fi/F, where fi is the observed frequency of execution of processing node ai, and F represents the cumulative count of the executions of all processing blocks within that module. It is then possible to measure the disparity between the current distribution of activity within the module processing blocks and the certified distribution with a distance function over the pi and p′i.
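The specific distance function is not reproduced above; one plausible choice over the two processing-node distributions, sketched here purely as an assumption, is the total variation distance:

```python
# Hypothetical choice of distance function between the certified
# distribution p_i and the observed distribution p'_i: total variation
# distance, zero when in-service activity matches certified activity.
def node_activity_distance(p, p_prime):
    """Distributions are dicts mapping processing node -> probability."""
    nodes = set(p) | set(p_prime)
    return 0.5 * sum(abs(p.get(a, 0.0) - p_prime.get(a, 0.0)) for a in nodes)
```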
The certified processing node set is the set of all processing nodes that will be executed during the certification process. At run time, it is possible to enumerate the number of processing nodes that are executed, which are not members of the certified processing node set.
The arguments passed to each program module when control is passed to it may be represented as a vector of dimensionality n. Each time that a program module is invoked, it will be supplied with a new argument list, again represented as a vector. Each of these vectors represents a point in an n-dimensional space. During the certification process, the set of all such argument vector points will constitute a cluster of values in the n-dimensional space. This cluster may be represented by a single point, which corresponds to the centroid of the cluster. The centroid may be computed in a number of different ways.
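One plausible reading, since the original definition is not reproduced here, is to compute the centroid as the component-wise mean of the certified argument vectors:

```python
# Sketch (an assumption, as the original definition is elided): the
# centroid of a cluster of n-dimensional argument vectors taken as the
# component-wise arithmetic mean.
def centroid(vectors):
    n = len(vectors[0])   # dimensionality of the argument space
    k = len(vectors)      # number of certified argument points
    return tuple(sum(v[j] for v in vectors) / k for j in range(n))
```

An in-service argument vector far from this centroid would then suggest a departure from certified usage.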
The preceding discussion is not intended to provide an exhaustive list of exemplary “within module” measures. Instead, it simply indicates some of the within module measures that can be employed for monitoring the run time activity within program modules.
The actual telemetry from an executing software system may be gathered either in software through the use of software probes or it may be gathered directly from the hardware system on which the software is executing. The software being monitored may be modified to incorporate such software probes. Typically, a software probe will include a Call statement inserted into the software at one or more predefined points. When the Call statement is encountered at run time, control is passed by the Call to a monitor routine that records the event.
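A software probe of this kind might be sketched as follows; the probe points and monitor routine are hypothetical names, standing in for the Call statements inserted at predefined points:

```python
# Sketch of a software probe: an inserted Call passes control to a
# monitor routine that records the event (names invented).
events = []

def probe(point_name):
    """Monitor routine invoked by the inserted Call statement."""
    events.append(point_name)

def monitored_module(x):
    probe("module_entry")    # inserted probe Call
    result = x * 2
    probe("module_exit")     # inserted probe Call
    return result

value = monitored_module(21)
```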
Experience has shown, however, that the act of monitoring a software system in software involves a substantial amount of computational overhead. In the case of real time embedded systems, this overhead may dramatically affect the ability of the monitored software to respond within critical time frames dictated by the temporal design constraints of the embedded software itself. One value of the exemplary software implementation is that it clearly demonstrates the viability of the modeling effort, even if it may not be as desirable as using hardware for monitoring a software process.
Although the concepts disclosed herein have been described in connection with the preferred form of practicing them and modifications thereto, those of ordinary skill in the art will understand that many other modifications can be made thereto within the scope of the claims that follow. Accordingly, it is not intended that the scope of these concepts in any way be limited by the above description, but instead be determined entirely by reference to the claims that follow.