Other objects and further features of the present invention will become more apparent from the following detailed description when read in conjunction with the accompanying drawings:
With reference to figures, an embodiment of the present invention will now be described.
A trouble task detecting program as an embodiment of the present invention provides a function to detect a state that an application program operating on a multitask OS, which has such a function that a plurality of tasks having respective priority levels operate, enters an infinite loop operating state by some cause.
That is, according to the embodiment of the present invention, when a CPU's 100% load state occurs continuously upon operation of the multitask OS, it is possible to determine whether a cause thereof is illegal operation (infinite loop operation or such), or is merely temporary continuation of a high load state due to regular high load processing. Then, when it is determined that illegal operation of the program has caused the situation, tasks which are candidates of the actual cause thereof (referred to as ‘suspicious task’, hereinafter) are specified.
Further, when it is determined that the illegal operation has caused the situation, a notification is generated externally that a trouble state has occurred.
Further, when it is determined that the illegal operation has caused the situation, a countermeasure thereto is selected, and is set.
Further, when a continuation of a high-load state is detected, information of the task acting as the cause thereof or candidates thereof is obtained as a history, and after that, the history is readable.
Further, when a continuation of a high-load state is detected, and also, this situation does not correspond to a temporary event caused by regular high-load processing but corresponds to an event in which data exchange continues infinitely between a plurality of tasks, i.e., so-called ping-pong phenomenon, this fact is detected.
In the embodiment of the present invention, it is assumed that the OS has the following four functions i), ii), iii) and iv):
i) The respective tasks are executed according to their predetermined task priorities (see
ii) When switching of the task to be executed (so-called ‘task switching’) has occurred, the corresponding task is identified (in
iii) A currently executed state of the task is obtained (see
iv) A message transmission/reception state between the tasks (see
The above-mentioned function i) corresponds to such a function that, when the task priority is previously given to each task, each task (i.e., an application task) operates according to the priority.
The above-mentioned function ii) corresponds to the function 2 of
The above-mentioned function iii) corresponds to a function determining which of predetermined three types of execution states the currently executed task belongs to (see
For
Dispatch: operation of giving an execution right, thereby causing another task to enter a state upon execution, and entering itself a state executable.
Preemption: operation of receiving the execution right and entering a state upon execution.
Receive: operation of entering a state waiting for execution for waiting for receiving a message.
Send, Start: operation of a task in a state waiting for execution transmitting a predetermined message, and entering a state executable or a state upon execution.
Stop: operation of entering a state waiting for execution from a state executable in a predetermined condition.
Each task state will now be described:
State upon execution (Running):
Only one task per processor can be in the Running state at any given time;
The task in the Running state executes an instruction of a given program.
The task scheduler causes the task to wait until there are no tasks in the Ready states having the priority higher than the currently executed task.
The task scheduler carries out context switch (i.e., task switching) immediately when another task having the higher priority enters the Ready state, and thus, the task having the higher priority is to be executed earlier.
When the currently executed task is blocked by a system call or such, the task state is changed to the Waiting state. At this time, the scheduler selects the task having the higher priority, causes the same to enter the Ready state, and also, causes the same to be executed.
State executable (Ready):
The task is executed when all the tasks having the higher priorities have finished.
State waiting for execution (Waiting):
The task in the Waiting state either waits for occurrence of a specific event, or has already entered a stop state.
The task in the Waiting state does not require the CPU in this stage.
A system call causing the task to enter the Waiting state is called a blocking system call.
The task may enter the Waiting state for the following reasons:
1) It waits for arrival of a signal message;
2) It waits for elapse of a predetermined delay time;
3) It waits for a semaphore;
4) It waits for a high-speed semaphore;
5) It waits for completion of the system call;
6) It has been explicitly stopped by the system call (‘suspend’ or such);
7) It has reached a breakpoint.
Next, an example of transition of the task state will be described for each case:
Transition from the Running state:
Running→Ready (an arrow of Dispatch in
When the task of the higher priority than that of the own task currently executed is executed, the execution right is dispatched thereto.
Running→Waiting (an arrow of Receive)
It occurs when the currently executed task enters the signal message waiting state, the delay time elapse waiting state, the semaphore waiting state or such.
Transition from the Ready state:
Ready→Running (an arrow of Preemption)
The execution right is preempted when there are no tasks in the Running/Ready states of the higher priorities than that of the own task currently executed.
Ready→Waiting (an arrow of Stop)
When the task in the Ready state is forcibly suspended by means of the system call, the task enters the Waiting state (the suspended task returns to the original state when being resumed).
Transition from the Waiting state:
Waiting→Running (an arrow of Send, Start):
When the own task is in the message waiting state and has the priority higher than that of the currently executed task (in the Running state), and then, the other task sends the message which the own task receives, or the task itself is created or started (create&start), the own task enters the Running state.
Waiting→Ready (an arrow of Send, Start):
When the own task is in the message waiting state and has the priority lower than or the same as that of the currently executed task (in the Running state), and then, the other task sends the message which the own task receives, or the task itself is created or started (create&start), the own task enters the Ready state.
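The transitions described above can be summarized as a small state table. The following is a minimal Python sketch for illustration only and is not part of the embodiment. The names TaskState, TRANSITIONS and next_state are hypothetical, the event labels are the arrow names used in the text, and the convention that a larger number means a higher priority is an assumption.

```python
from enum import Enum


class TaskState(Enum):
    RUNNING = "Running"   # state upon execution
    READY = "Ready"       # state executable
    WAITING = "Waiting"   # state waiting for execution


# (current state, event) -> next state, following the arrows named in the text.
# "Send/Start" depends on a priority comparison and is handled in next_state().
TRANSITIONS = {
    (TaskState.RUNNING, "Dispatch"): TaskState.READY,
    (TaskState.RUNNING, "Receive"): TaskState.WAITING,
    (TaskState.READY, "Preemption"): TaskState.RUNNING,
    (TaskState.READY, "Stop"): TaskState.WAITING,
}


def next_state(state, event, own_priority=None, running_priority=None):
    """Return the next state of a task for the given event.

    Assumption: a larger number means a higher priority.
    """
    if state is TaskState.WAITING and event == "Send/Start":
        # Higher priority than the running task: Waiting -> Running.
        if own_priority > running_priority:
            return TaskState.RUNNING
        # Lower or equal priority: Waiting -> Ready.
        return TaskState.READY
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"no transition for {state.name} on {event!r}") from None
```

For example, next_state(TaskState.WAITING, "Send/Start", own_priority=5, running_priority=3) yields the Running state, while equal priorities yield the Ready state.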
The above-mentioned function iv) corresponds to a function to obtain information (a message queue or such) such as a message destination, during message transmission/reception between the tasks, such as that shown in
The trouble task detecting program according to the embodiment of the present invention is configured to have instructions to cause a computer to execute the following functions 1 (F1), 2 (F2), 3 (F3) and 4 (F4).
Function 1: CPU load monitoring function;
Function 2: task switching history obtaining function;
Function 3: trouble suspicious task extracting function; and
Function 4: trouble suspicious task monitoring function
The function 1 monitors whether or not the CPU's 100% load state continues, and, executes processing of the function 3 when detecting that the CPU's 100% load state continues more than a predetermined time.
The function 2 is a function to obtain a corresponding task ID and system time (ideally, granularity thereof being not more than 1 millisecond) as history information at the time when task switching has occurred.
The function 3 is started up when the function 1 has detected the CPU's 100% load state continuation for the predetermined time. Based on the history information obtained by the function 2, the function 3 extracts, as the suspicious tasks for the trouble task, the highest-ranked tasks among those whose values (i.e., the number of execution times, the length of execution time, or such) exceed a predetermined threshold. When no task exceeds the above-mentioned predetermined threshold, processing returns to the function 1.
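The extraction carried out by the function 3 can be sketched as follows. This is a minimal illustration under stated assumptions, not the embodiment's implementation: each history entry is assumed to be a (system time in milliseconds, task ID) pair recorded when that task is switched in, a task's operation time is assumed to be the interval until the next entry, and the threshold is interpreted as a share of the logged time. The function name extract_suspicious_tasks is hypothetical; the defaults of 6 tasks and 15% are the set values given for the embodiment.

```python
from collections import defaultdict


def extract_suspicious_tasks(history, top_n=6, occupancy_threshold=0.15):
    """Return up to top_n task IDs whose share of the logged time exceeds the threshold.

    history: chronological list of (system_time_ms, task_id) pairs,
    one per task switch (assumed format).
    """
    totals = defaultdict(int)
    # Assumption: a task runs from its own entry until the next entry,
    # so the final entry contributes no measurable duration.
    for (start, task_id), (end, _next_id) in zip(history, history[1:]):
        totals[task_id] += end - start

    span = sum(totals.values())
    if span == 0:
        return []

    # Rank by total operation time, longest first, and keep the top_n.
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return [task_id for task_id, total in ranked[:top_n]
            if total / span > occupancy_threshold]
```

An empty result corresponds to the case in which no task exceeds the threshold and processing returns to the function 1.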
The function 4 periodically monitors the execution states of the suspicious tasks extracted by the function 3 for a predetermined time, and checks whether or not an infinite loop operation state has occurred there.
When the function 4 has not found that the suspicious tasks enter the states waiting for execution, this means that the suspicious tasks have not released their execution rights. Accordingly, the function 4 determines that these tasks have entered the infinite loop operation states, and thus, executes predetermined trouble responding processing, i.e., restarts the corresponding tasks, carries out system restart, or such.
On the other hand, when it can be determined that the suspicious tasks have entered the states waiting for execution, it is determined that these tasks have not entered the infinite loop operation states, and thus, they are removed from the monitoring targets. That is, these tasks are excluded from the suspicious tasks.
When there are thus no suspicious tasks to be monitored, the function 4 is finished. Further, when the function 1 has detected that the CPU load falls during the monitoring by the function 4, the function 4 is also finished.
Further, when the function 4 has found the tasks entering the infinite loop operation states, the function 4 notifies of this fact externally. That is, output to a console or such, is carried out.
Furthermore, when the function 4 has found the tasks entering the infinite loop operation states, the trouble responding processing for recovery of the tasks may be selected.
Further, a function 5, i.e., a suspicious task history obtaining function, is provided such that, while the function 4 stores the information of the tasks extracted as the suspicious tasks as the history, the same may be read by the function 5 according to a predetermined command or such.
When all the extracted tasks are excluded from the suspicious tasks and also the function 1 detects that the CPU's 100% load state continues for a long time during the monitoring operation by the function 4, there is a possibility that the above-mentioned ping-pong phenomenon has occurred rather than the infinite loop operation states of the specific tasks. Therefore, the task which executes the function 4 is provided with the following function 6, i.e., a ping-pong phenomenon monitoring function, by which existence/absence of the ping-pong phenomenon is determined.
The function 6 reads the history information of the suspicious tasks obtained by the function 5, and, when the plurality of tasks appear in the history, the function 6 reads the message transmission/reception states (i.e., the message queue information or such) of these suspicious tasks. Thus, it is determined whether or not the destinations of the messages are those between the suspicious tasks. When it is determined, as a result, that the message transmission/reception by the suspicious tasks corresponds to the message transmission/reception between the suspicious tasks, it is determined that a program trouble has occurred due to a ping-pong phenomenon. As a result, the predetermined trouble responding processing, such as system restart or such, is carried out.
By providing the above-described configuration according to the embodiment of the present invention, the trouble task detecting program according to the embodiment of the present invention provides the following advantages:
That is, in the related art, when a CPU enters a high-load situation, erroneous determination that a trouble has occurred may be made as mentioned above. In contrast thereto, according to the present embodiment, it is possible to determine, with a high accuracy, whether or not the CPU high-load state continuation corresponds to merely a temporary event caused by regular high-load processing, or corresponds to actually problematic high-load state continuation due to the program trouble such as the ping-pong phenomenon.
Further, in the related art, even when the high-load state continuation due to the ping-pong phenomenon has actually occurred, it may not be possible to positively distinguish it from a temporary high-load state due to regular high-load processing. In contrast thereto, according to the embodiment, it is possible to accurately detect the program trouble due to the ping-pong phenomenon.
The above-mentioned ping-pong phenomenon will now be described in detail.
For example, as shown in
Next, the above-mentioned respective functions of the trouble task detecting program according to the embodiment of the present invention will be described in further detail.
The function 1 (F1) determines whether or not the CPU's 100% load state continues.
This operation is, as illustrated in
As shown in
In
On the other hand, when the timer outputs a time-out (‘time-out’ of Step S2), the continuous time-out counter counts up (Step S5), and the function 3 is executed (Step S6). It is noted that the task A executes the function 3.
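The timer handling of the function 1 can be sketched as below. This is a hedged illustration only: the class name CpuLoadMonitor and its methods are hypothetical, time is passed in explicitly in seconds, and it is assumed that a received keep-alive notification restarts the timer and clears the continuous time-out counter. The 5-minute timer value is the set value given for the embodiment; the on_timeout hook stands in for starting the function 3.

```python
class CpuLoadMonitor:
    """Sketch of the function 1: detect that keep-alive notifications have stopped."""

    def __init__(self, timer_seconds=300, on_timeout=None):
        self.timer_seconds = timer_seconds   # set time in the timer (5 minutes)
        self.on_timeout = on_timeout         # hook standing in for the function 3
        self.continuous_timeouts = 0         # continuous time-out counter
        self.last_restart = 0.0              # time at which the timer was last restarted

    def keep_alive(self, now):
        """A keep-alive notification arrived, so the sender was able to run."""
        self.last_restart = now
        self.continuous_timeouts = 0         # assumed: reset on receipt

    def poll(self, now):
        """Call periodically; return True when the timer has expired."""
        if now - self.last_restart < self.timer_seconds:
            return False
        self.continuous_timeouts += 1        # corresponds to Step S5
        self.last_restart = now              # restart the timer for the next period
        if self.on_timeout is not None:
            self.on_timeout(self.continuous_timeouts)   # corresponds to Step S6
        return True
```

With these values, five consecutive time-outs correspond to 25 minutes of continued high load.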
In the example of
Next, in the above-mentioned function 2 (F2), all the logs are collected always when task switching occurs. This function is executed each time the task switching occurs, and operation shown in
That is, being triggered by occurrence of the task switching, the system time (in the granularity of 1 millisecond) is obtained from the OS, and a corresponding task ID is obtained. Then, the thus-obtained information is recorded in sequence in a format shown in
This function 2 is executed by a handler function of the OS, i.e., for example, by a SwapIn handler function in a case of OSE (Operating System Embedded). Accordingly, this function is not executed by the task but is started up and executed by the OS itself by means of the program function activity.
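The history recording of the function 2 can be sketched as a bounded buffer. This is an illustration under assumptions and not the embodiment's handler: the class name TaskSwitchHistory is hypothetical, the method on_task_switch merely stands in for the OS handler, and discarding the oldest entry once the maximum of 2000 entries is reached is an assumed policy.

```python
import time
from collections import deque


class TaskSwitchHistory:
    """Sketch of the function 2: keep the most recent task-switch records."""

    def __init__(self, max_entries=2000):
        # A deque with maxlen silently drops the oldest entry when full (assumed policy).
        self._entries = deque(maxlen=max_entries)

    def on_task_switch(self, task_id, system_time_ms=None):
        """Record the task ID and the system time in milliseconds."""
        if system_time_ms is None:
            # Stand-in for the system time obtained from the OS.
            system_time_ms = time.monotonic_ns() // 1_000_000
        self._entries.append((system_time_ms, task_id))

    def snapshot(self):
        """Return the recorded entries, oldest first."""
        return list(self._entries)
```

Keeping the record step to a single append reflects that it runs on every task switch and should therefore stay short.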
Next, assuming that the infinite loop operation states may have occurred on the specified task as a cause of the CPU's 100% load state continuation, the function 3 (F3) extracts corresponding candidates as the suspicious tasks.
Specifically, a flow chart of
That is, from the maximum 2000 logs, a total operation time, which indicates how long time (milliseconds) each task has operated, is calculated, in task ID units (Step S31 of
As shown in
On the other hand, when some corresponding tasks occur (Yes in Step S34), they are regarded as the suspicious tasks, and a predetermined message is sent to another task (one corresponding to the task T3 in
The function 4 is a function to determine whether or not the infinite loop operation state has occurred. The function 4 is executed with the priority higher than those of the application task group (see
The task executing the function 4 is a separate task (one corresponding to the task T3 in
Immediately after the start of the execution of the function 4, the information of the list of the suspicious tasks extracted by the function 3 as mentioned above is logged by the function 5 (Step S41). After the logging, it is determined whether or not the CPU's 100% load state monitored by the function 1 still continues (Step S42). When it does not continue, it is determined that no mal-operation (illegal processing) such as the infinite loop operation or such has occurred, and merely a regular over-load situation has occurred. Then, the execution of the function 4 is finished (No in Step S42). On the other hand, when it is determined that the CPU's 100% load state still continues (Yes in Step S42), Step S43 is then executed.
In Step S43, the states of the suspicious tasks are obtained by the program function activity executed by the OS. For example, in the above-mentioned case of OSE, the function of get_pcb is used. The states of the tasks may be any ones of the above-mentioned three types, shown in
When the tasks are in the states waiting for execution (Yes in Step S45), this means that the corresponding tasks are in the states waiting for messages or such. As a result, it can be determined that no infinite loop operation has occurred. Accordingly, the tasks waiting for execution are excluded from the suspicious tasks, and thus, are excluded from those to be further monitored (Step S46).
When the corresponding tasks are in the states other than those waiting for execution, this means that these tasks continue operation. Accordingly, these tasks are left in the suspicious tasks (No in Step S45).
The same test is carried out on each of all the tasks included in the suspicious tasks (a loop of Steps S44 and S45 (as well as S46 if applicable)). After the test has been completed for all the suspicious tasks (Yes in Step S47), Step S48 is executed.
For all the suspicious tasks still left, a check counter is provided for each thereof, and it counts up by one. Next, in Step S49, it is determined whether or not the count value of each counter has reached a predetermined threshold, i.e., 600 times (changeable).
When there is the suspicious task having the count value of the check counter of 600 times (Yes in Step S49), this task is determined as the trouble task, and it is determined that the infinite loop operation has occurred by this task. Then, the predetermined trouble responding processing is started (Step S50).
On the other hand, when each suspicious task does not have the count value of the check counter of 600 times (No in Step S49), it is determined that the monitoring should be further continued. As a result, after an elapse of a predetermined retry time, i.e., 100 milliseconds (changeable) (Step S51), operation of the function 4 is carried out again from the beginning (Steps S42 through S49).
The test is thus repeated at most 600 times, at intervals of the above-mentioned 100 milliseconds. As a result, the test by the function 4 continues for up to 1 minute in total (600 × 100 milliseconds = 60 seconds).
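The monitoring loop of the function 4 can be sketched as follows. This is a minimal illustration under assumptions, not the embodiment's code: the function name monitor_suspicious_tasks and the verdict strings are hypothetical, get_state stands in for the OS call that returns a task's state, and load_continues stands in for the result of the function 1. The defaults of 600 checks and a 100-millisecond retry time are the embodiment's set values; the sleep function is injectable so that the loop can be exercised without waiting.

```python
import time

WAITING = "Waiting"


def monitor_suspicious_tasks(suspicious, get_state, load_continues,
                             check_limit=600, retry_seconds=0.1,
                             sleep=time.sleep):
    """Return (verdict, tasks); verdict is 'load_ended', 'cleared' or 'infinite_loop'."""
    counters = {task_id: 0 for task_id in suspicious}   # one check counter per task
    while True:
        if not load_continues():                        # Step S42
            return "load_ended", []
        for task_id in list(counters):                  # Steps S43 to S47
            if get_state(task_id) == WAITING:
                del counters[task_id]                   # Step S46: no longer suspicious
        if not counters:
            return "cleared", []
        for task_id in counters:                        # Step S48
            counters[task_id] += 1
        trouble = [t for t, n in counters.items() if n >= check_limit]   # Step S49
        if trouble:
            return "infinite_loop", trouble             # Step S50 would follow
        sleep(retry_seconds)                            # Step S51
```

A task must therefore be observed outside the Waiting state on every one of 600 consecutive checks before it is reported; a single observation in the Waiting state removes it from monitoring.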
A case can be assumed where the operation for the test by the function 4 is repeated, it is determined that none of the suspicious tasks is problematic (i.e., No in Step S45→S46), and thus, no suspicious tasks are left consequently. In such a case, it is possible to either finish the operation of the function 4 upon determination that no infinite loop operation has occurred, or start a state for executing the above-mentioned function 6 upon determination that the ping-pong phenomenon may have occurred. It is possible to set either alternative arbitrarily.
The above-mentioned function 5 (F5) is a logging function (Step S41 of
In this logging function, logging information as shown in
In each time of the logging operation, updating of the counter (Counter) (Step S61), recording of the apparatus time (Time) (Step S62), recording of the apparatus system time (SystemTimer) (Step S67) and recording of the suspicious task list (TaskList) at the time (Step S68) are carried out at once.
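One logging operation of the function 5 can be sketched as a single record holding the four fields named above. This is an illustration only: the class names SuspiciousTaskLogEntry and SuspiciousTaskLog are hypothetical, and the field types are assumptions, since the text gives only the field names Counter, Time, SystemTimer and TaskList.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List


@dataclass
class SuspiciousTaskLogEntry:
    counter: int           # Counter
    time: datetime         # Time (apparatus time)
    system_timer: int      # SystemTimer (apparatus system time, assumed integer)
    task_list: List[int]   # TaskList (suspicious task IDs at that time)


class SuspiciousTaskLog:
    """Sketch of the function 5: append-only history of suspicious task lists."""

    def __init__(self):
        self.entries = []
        self._counter = 0

    def record(self, task_list, now, system_timer):
        """Update the counter and store all four fields in one entry."""
        self._counter += 1
        entry = SuspiciousTaskLogEntry(self._counter, now, system_timer, list(task_list))
        self.entries.append(entry)
        return entry

    def last(self, n):
        """Return the most recent n entries, oldest first."""
        return self.entries[-n:] if n > 0 else []
```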
The above-mentioned function 6 (F6) is a function to determine whether or not the ping-pong phenomenon has occurred, when the function 4 determines that no infinite loop operation has occurred. This function 6 executes operation of a flow chart shown in
In
In Step S73, in the logging information recorded by means of the execution of the function 5, the last 5 times of the logs are read, and it is determined whether or not the same task ID occurs every time there.
In the example of
When no plurality of tasks meeting the requirements of Step S73 can be found out (No), it is determined that no ping-pong phenomenon has occurred, and the execution of the function 6 is finished. On the other hand, when a plurality of tasks meeting the requirements have been found out, Step S74 is executed.
In Step S74, the tasks found out in Step S73 are regarded as ping-pong suspicious tasks. That is, in this example, the tasks 0x000B and 0x000C are regarded as the ping-pong suspicious tasks. After that, the states of these ping-pong suspicious tasks are analyzed.
In this example, the task states of the above-mentioned tasks 0x000B and 0x000C are obtained. At this time, for example, the above-mentioned get_pcb function is used, and the queue information of the corresponding signals is read. In the queue, messages transmitted to the tasks are stored, and the transmission source information of each message is read. When the transmission source task of the message thus read corresponds to the respective one of the ping-pong suspicious tasks, i.e., the tasks of 0x000B and 0x000C in this example (Yes in Step S75), this means that these ping-pong suspicious tasks exchange the messages therebetween. Accordingly, in this case, it is determined that the ping-pong phenomenon has actually occurred. As a result, the previously set trouble responding processing is started (Step S76).
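The determination in Steps S73 through S75 can be sketched as follows. This is a hedged illustration, not the embodiment's implementation: the function name find_ping_pong_tasks is hypothetical, recent_task_lists is assumed to hold the suspicious task list of each log in chronological order, and senders_by_task is an assumed mapping from a task ID to the transmission source IDs of the messages in its queue. Requiring every queued message to originate from a candidate task is one possible reading of Step S75.

```python
def find_ping_pong_tasks(recent_task_lists, senders_by_task, required=5):
    """Return the sorted IDs of tasks judged to exchange messages in a loop, or []."""
    if len(recent_task_lists) < required:
        return []
    last = recent_task_lists[-required:]

    # Step S73: task IDs that appear in every one of the last `required` logs.
    candidates = set(last[0]).intersection(*last[1:])
    if len(candidates) < 2:
        return []   # a plurality of tasks is needed for a ping-pong exchange

    # Steps S74 and S75 (assumed reading): every candidate has queued messages,
    # and every such message was sent by a candidate task.
    for task_id in candidates:
        senders = senders_by_task.get(task_id, [])
        if not senders or any(sender not in candidates for sender in senders):
            return []
    return sorted(candidates)
```

With the example of the text, five logs that all contain the tasks 0x000B and 0x000C, together with queues in which each of the two tasks holds only messages from the other, yield [0x000B, 0x000C].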
In the trouble responding processing, operation of a flow chart of
First, setting as to whether or not the trouble contents should be notified of, is read (Step S81). When the notification is required (Yes), notifying processing according to setting previously made by a command is carried out (Step S82). After that, designated predetermined trouble operation is executed (Step S83).
Below is a list of the parameters set for execution of each of the above-mentioned functions 1 through 6, with the specific values set in the embodiment shown in parentheses:
Function 1:
the continuous time-out counter (started from 0);
the keep alive notification generating period (10 seconds);
the set time in the timer (5 minutes)
Function 2:
the set maximum number of times of logging (2000)
Function 3:
the set number of the list highest tasks to extract (6);
the CPU occupancy threshold (15%)
Function 4:
the set times in the check counter (600 times);
the retry waiting time (100 milliseconds)
Function 5:
none
Function 6:
the function valid/invalid setting (valid);
the set high load-state continuation time (25 minutes=5 histories)
Next, the settings in the above-mentioned trouble responding processing are shown below:
Trouble responding processing:
the notification required/non-required setting (required);
the specific notification method (the following item 2) is selected):
1) notify to another task;
2) output to the console;
3) make a trap (TRAP) notification;
4) generate an alarm (ALM)
Trouble operation (the following item 5) is selected):
1) delete the trouble task;
2) delete and re-generate the trouble task;
3) suspend the trouble task and start operation thereof again;
4) stop the system;
5) restart the system;
6) do nothing
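The settings listed above and the flow of Steps S81 through S83 can be sketched as a small configuration object and a dispatch function. This is an illustration only: all class, member and function names are hypothetical, the numeric values simply mirror the item numbers of the lists above, and the actual notification and recovery actions are supplied by the caller as plain callables.

```python
from dataclasses import dataclass
from enum import Enum


class Notification(Enum):
    OTHER_TASK = 1
    CONSOLE = 2
    TRAP = 3
    ALARM = 4


class TroubleOperation(Enum):
    DELETE_TASK = 1
    DELETE_AND_REGENERATE_TASK = 2
    SUSPEND_AND_RESUME_TASK = 3
    STOP_SYSTEM = 4
    RESTART_SYSTEM = 5
    DO_NOTHING = 6


@dataclass
class TroubleResponseSettings:
    notify: bool = True                                              # required
    method: Notification = Notification.CONSOLE                      # item 2)
    operation: TroubleOperation = TroubleOperation.RESTART_SYSTEM    # item 5)


def respond_to_trouble(settings, task_ids, notifiers, operations):
    """Notify if required, then run the selected trouble operation."""
    if settings.notify:                                  # Step S81
        notifiers[settings.method](task_ids)             # Step S82
    return operations[settings.operation](task_ids)      # Step S83
```

The defaults reproduce the selections of the embodiment (notification required, output to the console, restart the system); other combinations are chosen by constructing TroubleResponseSettings with different members.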
As shown in
The OS of the computer 100 is a multitask OS, and has the above-mentioned functions i), ii), iii) and iv).
Further, the above-described trouble task detecting program in the embodiment of the present invention is stored in the nonvolatile memory 113 such as the flash memory, or downloaded through the network via the interface card 120 and the communication device 114, and then, is stored in the SDRAM 112.
After that, the CPU 111 executes the trouble task detecting program, and thus, carries out the above-mentioned functions 1 through 6 described above with reference to
The present invention may be applied not only to an OS of a stand-alone computer, but also to various built-in OSs for computers provided for controlling an automobile and so forth.
The present invention is not limited to the above-described embodiment, and variations and modifications may be made without departing from the basic concept of the present invention claimed below.
The present application is based on Japanese Priority Application No. 2006-285343, filed on Oct. 19, 2006, the entire contents of which are hereby incorporated herein by reference.
Number | Date | Country | Kind |
---|---|---|---|
2006-285343 | Oct 2006 | JP | national |