When monitoring a computer system, such as a server or a personal computer, a software-based monitor is used, which in general provides snapshot descriptions of the system state at various times. In one typical arrangement that has been in use for many years on a variety of platforms, the monitor periodically collects information about the system state, such as every few seconds or minutes, and then (optionally) stores the information in association with a timestamp in a persistent store. The collected information, which generally comprises the values of system counters at the time of sampling, such as for measuring CPU operations, disk operations and so forth, may then be analyzed.
In general, to collect the samples for a given test, the monitor sleeps for a defined interval, or a timer is used to trigger the monitor to collect the next sample set at the next interval. The sleep technique fails to account for the time taken to collect the data; the timer technique factors in this collection time, but still does not account for other delays, which may be cumulative. As a result, with either technique, software-based monitoring suffers from accuracy problems, including misleading data and lost samples.
In some measuring/monitoring environments, such inaccuracy (e.g., computed as a relative error percentage) is acceptable. However, some environments require more accurate monitoring where such inaccuracy is not acceptable. For example, lost samples pose a problem when comparing data from consecutive days, because each set has a different number of data samples and a different effective average sampling time. Thus, when seeking accurate measurements, including when measuring at a relatively high rate of sampling, existing monitoring mechanisms are not acceptable in certain environments.
This Summary is provided to introduce a selection of representative concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used in any way that would limit the scope of the claimed subject matter.
Briefly, various aspects of the subject matter described herein are directed towards a technology by which the quality of data collected during a computer system monitoring test is improved for subsequent analysis. In one example aspect, monitoring includes collecting data corresponding to the computer system's state. An interval is computed based upon an actual start time associated with this current iteration and a desired interval. A subsequent data collection iteration is performed after waiting for the computed interval. The computed interval may be further based on an elapsed data collection time that accounts for any delay in collecting the data. In another example aspect, computing the interval may include adjusting a sleep time based on a prediction obtained from historical data, e.g., of actual past iteration start times.
By computing the interval based on an actual system time to dynamically adjust the sleep time, samples are not lost, as data collection is more evenly performed at a steadier rate, and is performed closer to the desired interval. Further, a prediction based on historical data moves the start time closer to that desired. Either dynamic adjustment or prediction, or a combination of both, improves data quality.
In another aspect, by recording an elapsed data collection time in association with the data collected in each iteration, the elapsed data collection time may be used as a measure of error when later analyzing the data collected in that current iteration. The elapsed data collection time may also be used in estimating a time value corresponding to when each part (e.g., counter) of the data collection process was actually read. The elapsed data collection time may be further used to estimate a number of processor time slices taken to collect the data; the number (of one or more processor time slices) may be used in estimating the time value for when each counter was actually read, and/or in computing a measure of error associated with reading a given counter.
Other advantages may become apparent from the following detailed description when taken in conjunction with the drawings.
The present invention is illustrated by way of example and not limited in the accompanying figures in which like reference numerals indicate similar elements and in which:
Various aspects of the technology described herein are generally directed towards improving the quality and accuracy of data collected by a system monitor, thereby providing for improved analysis of the collected data. In one aspect, one or more timing mechanisms adjust sampling intervals to provide more timely and consistent sample sets for subsequent analysis. In another aspect, the uncertainty of errors associated with collected data is reduced and otherwise computed to facilitate better data analysis.
While many of the examples herein are described with respect to a computer system such as a server or personal computer, it is understood that these are only examples, and that any computing device or set of devices capable of system state monitoring for data analysis may benefit from the technology described herein. As such, the present invention is not limited to any particular embodiments, aspects, concepts, structures, functionalities or examples described herein. Rather, any of the embodiments, aspects, concepts, structures, functionalities or examples described herein are non-limiting, and the present invention may be used in various ways that provide benefits and advantages in software-based measurement and monitoring in general.
Turning to
In general, a software monitor 104 periodically wakes up or is woken up by a timer (at an interval set by a test person for example, hereinafter a “tester”), and includes a data recording mechanism 106 that gets the system time 108, collects the counter values C1-Cn in a sample set, and records the sample set along with a timestamp corresponding to the system time in a data store 110 (in general, this is referred to as “sampling” herein). The software monitor 104 then goes back to sleep until the next sampling iteration. However, as described below, the software monitor 104 does not actually wake up at exactly the scheduled time, but rather is subject to system delays.
More particularly, because the software monitor program has to share the computer system with other programs, the time at which the sampling occurs does not exactly match the requested sampling interval. By way of a simplified example, if sampling is to occur once every second, (scheduled to awake at exactly 1.0 second in this example), in actuality the sampling may not be started until 1.1 seconds because other processes may delay starting the sampling process by 100 milliseconds (ms). In general, the shorter the sampling interval that is chosen by the tester, the more that this delay becomes problematic.
As a result, one accuracy-related problem arises from the delays in the sampling times, caused by various artifacts of the operating system scheduling that cause sampling to start later than expected. This creates uneven effective time sampling intervals between samplings. However, some analytical methods assume or prefer time-series data that are evenly spaced in time.
Moreover, not only may the starting time of each sample be slightly time delayed, but these time delays may be cumulative. For example, if a sample (to collect a sample set of data) is taken 10 ms later than planned, the timing for every sample set taken thereafter may be shifted by these 10 ms. If there are three delays of 100 ms on three consecutive sample sets, the fourth sample set will be taken 300 ms later than expected, plus that fourth sample set's delay.
Such timing problems often lead to lost sample sets, even when a much larger sampling interval is chosen. For example, given an average delay of 50 ms and a fifteen second sampling interval, at the end of a day 288 seconds are lost, and 19 fewer sample sets than expected will have been collected. On heavily-loaded servers, as much as ten percent of sample sets may be lost. This prevents or greatly complicates certain types of data analysis, such as when comparing sample data collected over different days.
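The arithmetic behind the fifteen second example can be checked with a short Python sketch. The function name and the first-order model (one average delay added per scheduled sample set) are assumptions made for illustration; integer milliseconds are used to avoid rounding issues.

```python
def lost_sample_sets(avg_delay_ms, interval_ms, duration_ms=86_400_000):
    """First-order estimate of drift for a monitor whose delays accumulate.

    Returns (scheduled sample sets, cumulative drift in ms, sample sets lost).
    """
    scheduled = duration_ms // interval_ms   # sample sets expected in the period
    drift_ms = scheduled * avg_delay_ms      # each sample adds one average delay
    lost = drift_ms // interval_ms           # whole intervals consumed by drift
    return scheduled, drift_ms, lost


scheduled, drift_ms, lost = lost_sample_sets(avg_delay_ms=50, interval_ms=15_000)
print(scheduled, drift_ms / 1000, lost)  # 5760 288.0 19
```

With 5,760 scheduled sample sets per day, a 50 ms average delay accumulates to 288 seconds, which is a little over nineteen fifteen-second intervals.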
As shown in
To this end, constant system time rather than the fixed interval time is used to compute the desired start time for the next sampling. This may be accomplished by setting a sleep time or by setting an external (variable) sleep timer 114 (as shown in
Steps 202, 204 and 206 are generally one-time initialization operations, beginning at step 202 which represents preparing a list of counters to be collected, and step 204 which reads the value of the requested interval, e.g., as set by the tester. Step 206 initializes a variable representative of the next starting time, which is the current system time 108 plus the requested interval.
Step 208 begins the sampling iteration, including setting the timestamp for this iteration. Note that the loop beginning at step 208 essentially loops as long as required by the tester; thus although
Step 210 represents reading the counter values C1-Cn from the system and storing the results. Note that the tester can specify which set or subset of counters to read, and as described below, may specify a read order. Further note that not only are the counter values stored, but also the timestamp indicating when the sampling began, as well as (optionally, as described below) the amount of time taken to collect the data, e.g., a current system time after collection. For example, this may be the same current system time as used in step 212 (below), minus the current system time when collecting began.
Step 212 determines the sleep time, based on the nextStartTime variable previously determined (either during initialization or in a previous iteration) minus the current system time, which has changed since the time before reading began that corresponds to the timestamp. In other words, before initially starting data collection, or from a previous iteration once looping has begun, the current system time was read, added to the value of the interval, and set to the variable (nextStartTime). After data collection, the algorithm computes how much time remains until the end of the next interval (the start of the next collection), as measured by the system clock. Then the nextStartTime variable is updated with the interval value, at step 214.
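The loop of steps 206 through 214, followed by the sleep, can be sketched roughly as follows. This is a minimal Python sketch under stated assumptions, not the described implementation: the names `run_monitor`, `read_counters`, `now` and `sleep` are hypothetical, a monotonic clock stands in for the system time 108, and the in-memory list stands in for the data store 110.

```python
import time


def run_monitor(read_counters, interval, iterations,
                now=time.monotonic, sleep=time.sleep):
    """Collect `iterations` sample sets, scheduling against the clock.

    `read_counters` is a caller-supplied callable returning the counter values.
    Returns a list of (timestamp, elapsed collection time, values) records.
    """
    records = []
    next_start = now() + interval                 # step 206: first boundary
    for _ in range(iterations):
        timestamp = now()                         # step 208: iteration timestamp
        values = read_counters()                  # step 210: read the counters
        after = now()
        records.append((timestamp, after - timestamp, values))
        sleep_time = next_start - after           # step 212: time left to boundary
        next_start += interval                    # step 214: advance the boundary
        if sleep_time > 0:                        # skip sleeping if already late
            sleep(sleep_time)
    return records
```

Because `next_start` advances by exactly one interval per iteration regardless of when the monitor actually woke, a late wake-up shortens the following sleep instead of shifting every later sample. Passing `now` and `sleep` as parameters is a design choice made here so the loop can be exercised with a simulated clock.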
Step 216 generally corresponds to the predictive mechanism 116 of
Steps 218 and 220 represent sleeping for the sleep time, that is, as stored in the computed sleepTime variable at step 212. Note that although
By way of a numerical example, consider a test starting at system time of 10,000 ms with a requested interval of 1,000 ms. The nextStartTime is initially set to 10,000 ms+1,000 ms=11,000 ms. In this example, consider that the elapsed time taken to read the counter values is 100 ms in the first execution of the loop, whereby the system time read after that (at step 210) is 10,100 ms. The sleep time is thus calculated as 11,000 ms−10,100 ms=900 ms, and the nextStartTime is thus 12,000 ms.
Continuing with the example, the monitor thus goes to sleep for 900 ms (rather than the interval of 1,000 ms), but because this time the scheduler adds a 50 ms delay, the monitor actually wakes up after 950 ms. In the second execution of the loop, the system time read (at step 208) is thus 11,050 ms, rather than 11,000 ms, because of the scheduler delay. If reading the counter values takes 300 ms in this iteration, the system time read at step 212 is 11,350 ms; the sleep time is thus computed at step 212 as 12,000 ms−11,350 ms=650 ms, and the nextStartTime is then updated to 13,000 ms.
In this manner, the sleep time after data collection accounts for any delays, including delays in both the scheduling time and the data collection time. In one example, resulting time intervals appear as represented in
As can be seen in
In this manner, there is thus achieved the correct number of samples per time period as specified by the tester. In addition, the actual sampling times are closer to the beginning of each interval. Note that this does not remove the delay caused by scheduling, but rather removes the additive effect of such delay. Further, not all intervals are equal because the delay caused by scheduling still remains; however, the effective intervals between samples oscillate around the requested interval, not around some load-dependent value and/or system-dependent value larger than the interval. This allows comparing data from different days, because regardless of differences in load, the mean effective interval is the same as the requested interval.
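The difference between a fixed sleep and a sleep computed against the clock can be illustrated with a small deterministic simulation. This is only a sketch: the function name, the constant collection time and the list of scheduler delays are all hypothetical inputs, with times in milliseconds.

```python
def start_times(interval, collect, delays, adjusted):
    """Simulate the start time of each sampling iteration.

    interval: requested sampling interval
    collect:  time spent reading counters in every iteration
    delays:   scheduler delay added to each wake-up, one per iteration
    adjusted: True to sleep until the next boundary, False for a fixed sleep
    """
    now = 0
    next_start = interval
    starts = []
    for delay in delays:
        starts.append(now)
        now += collect                          # data collection takes time
        if adjusted:
            sleep = max(0, next_start - now)    # time remaining to the boundary
            next_start += interval
        else:
            sleep = interval                    # ignores collection and delays
        now += sleep + delay                    # scheduler wakes the monitor late
    return starts


print(start_times(1000, 100, [50] * 4, adjusted=False))  # [0, 1150, 2300, 3450]
print(start_times(1000, 100, [50] * 4, adjusted=True))   # [0, 1050, 2050, 3050]
```

In the fixed-sleep run the offset from the requested schedule grows by 150 ms per iteration, whereas in the adjusted run each start remains within one scheduler delay of its boundary, so the delay is still present but no longer additive.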
Turning to a further explanation of step 216, for many data analysis algorithms, it is better if the starting times are as evenly spaced as possible. In one example implementation described herein, this may be implemented by the prediction mechanism 116 of
Using recent delay times as the historical data, for example, if the last m delay times are kept as the recent history, the sleep time computed for the next interval may be adjusted based on this history so as to aim for sampling to begin earlier than an exact interval boundary. As a more particular example, if m is three and each of the last three delays was 100 ms while the next interval is expected to start at 5000 ms, the next sampling start time is moved earlier by 100 ms (by lowering the sleep time by 100 ms) to compensate for the predicted delay of 100 ms, that is, to start at 4900 ms. An average of the previous m delay times is one very straightforward way to predict the next delay time, but as can be readily appreciated, virtually any suitable mathematical computation may be used for the prediction; any of various known methods of making statistically valid predictions of an expected delay may be employed.
If the delay occurs as predicted, sampling starts 100 ms late, exactly at 5000 ms, and it is seen that the estimate was correct. If less than the full predicted delay occurred, the next sampling starts earlier than expected; however, the history changes whereby the estimate of the delay is updated for the next iteration, so that the next prediction corresponds to a smaller (or eventually no) delay. Had the delay been larger than predicted, starting will have occurred slightly later, but this will increase the prediction time and thereby further reduce the sleep time, whereby the next sampling attempts to start even earlier.
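A moving-average predictor of this kind can be sketched in a few lines of Python. The class and method names are hypothetical, and measuring the delay as the actual wake-up time minus the requested wake-up time is an assumption about how the history is kept; any other statistically valid predictor could be substituted.

```python
from collections import deque


class DelayPredictor:
    """Predict the next wake-up delay as the mean of the last m delays."""

    def __init__(self, m=3):
        self.history = deque(maxlen=m)   # oldest entries are dropped automatically

    def record(self, requested_wake, actual_wake):
        """Store how late the monitor woke relative to the requested time."""
        self.history.append(actual_wake - requested_wake)

    def predict(self):
        """Return the predicted delay, or 0 when there is no history yet."""
        if not self.history:
            return 0
        return sum(self.history) / len(self.history)

    def adjust(self, sleep_time):
        """Shorten a computed sleep time by the predicted delay."""
        return max(0, sleep_time - self.predict())


predictor = DelayPredictor(m=3)
for _ in range(3):
    predictor.record(requested_wake=1000, actual_wake=1100)  # three 100 ms delays
print(5000 - predictor.predict())  # 4900.0
```

Because the history is bounded at m entries, a smaller observed delay pulls the average down on the next iteration, which matches the self-correcting behavior described above.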
For example in comparing collection start times in
Note that the prediction mechanism 116 of
As can be seen, dynamic adjustment and/or prediction improve data quality by keeping the sampling rate consistent, eliminating cumulative delays and thereby eliminating lost sample sets, and/or reducing unevenness in the starting times. As a result, each sampled data set is closer to its recorded timestamp. Further, statistical consideration issues that exist when trying to compare data from the server on several consecutive days are resolved. Still further, the reduction of the uneven effective time sampling intervals between samplings facilitates the use of analytical methods that assume or prefer time-series data that are evenly spaced in time.
Turning to another aspect of improving sampled data quality for subsequent data analysis, it is considered herein that the data collection time for each sampling may be different between sample sets. For example, if a collection of counters takes several hundred milliseconds of elapsed time (e.g., because the computer is heavily loaded or for other reasons), the monitor needs to consider that the last-collected values were likely obtained several hundred milliseconds later than the first collected values. As a result, even though all counter values in that sample are stored and marked with the same timestamp, they do not represent the actual values that existed in the system at the same time. This creates the potential for misleading interpretation of the data.
Thus, a second set of accuracy-related problems is caused by the variability of finite times required to collect the sample data from one sampling to another. In general, this is because data collection may take longer than one processor time slice to complete, whereby any number of processor time slices used by other processes may be in between any two processor time slices that are used for data collection. In practice, it has been seen that data collection times can vary among sample sets from tens to hundreds of milliseconds.
To mitigate the adverse effects of variable data collection times, while taking the sample, the elapsed time taken to collect the sample is recorded (e.g., at step 210 of
As a first type of compensation, the elapsed time (the recorded time taken to collect the sample) may be used as a measure of error, as generally represented in
As another type of compensation, instead of using the single timestamp for all counters of a sampling set, the relative position of a counter in the counter list and the elapsed data collection time recorded for that sampling may be used to estimate a more realistic time that any given counter was really collected. To this end, the differences in time between collecting the individual counters in the sample are assumed to be mathematically related (e.g., proportional) to the position of that counter in the monitor counter list. For example, if it takes two seconds to collect the samples and there are one hundred counters on the list, it can be estimated that counter number one hundred was collected two seconds after counter number one. This data may be used to analyze the overlap of intervals between two different counters, and to interpolate a more likely actual time for each counter.
Step 604 represents interpolating such a time for each counter (although as can be readily appreciated, grouped subsets of counters may be treated together, e.g., counters one through ten may have one timestamp, counters eleven through twenty another, and so forth). Linear interpolation is one straightforward type of time compensation, although as can be readily appreciated, other mathematical methods may be used.
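A linear interpolation of per-counter read times might look like the following Python sketch. The function name is hypothetical, and the sketch assumes the first counter is read at the recorded timestamp and the last counter at the timestamp plus the recorded elapsed time, with the others spaced evenly in list order.

```python
def interpolate_read_times(timestamp, elapsed, n_counters):
    """Estimate when each counter was read, assuming evenly spaced reads.

    timestamp:  recorded start time of the sampling iteration
    elapsed:    recorded elapsed data collection time, in the same units
    n_counters: number of counters in the list (must be at least 1)
    """
    if n_counters < 1:
        raise ValueError("n_counters must be at least 1")
    if n_counters == 1:
        return [timestamp]
    step = elapsed / (n_counters - 1)           # spacing between adjacent reads
    return [timestamp + i * step for i in range(n_counters)]


times = interpolate_read_times(timestamp=0.0, elapsed=2.0, n_counters=100)
print(times[0], round(times[-1], 6))  # 0.0 2.0
```

Using the two second, one hundred counter example, counter number one is placed at the timestamp and counter number one hundred two seconds later.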
As also represented by step 604, the error of any time estimation itself may be estimated and associated with its corresponding data. That is, the estimate of the time of when a counter was really collected, and the elapsed time (or its portion), may be used as a measure of error associated with when the counter was really collected; for example, the first counter has close to zero error.
Compensation may also use an estimate of how many processor slices were used in the collection of that data sample. Note that the processor slice time may be obtained from the operating system, such as during initialization, and saved with the data set. Thus, for any counter that was collected, the number of slices into which data collection was split may be used as a factor in the estimate of time error, as generally represented by step 606. Moreover, the estimate of the time error for any given counter may be computed by combining one or more of steps 602, 604 and 606, e.g., using the elapsed time to collect the sample and the estimated number of processor slices that it took to collect the sample as measures of error.
Note that step 604 may be a relatively rough interpolation estimate that assumes linearity of sample collection times (if proportionality is used as the mathematical relationship). Straight linear interpolation is somewhat inaccurate when the monitor request is sliced into several processor time slices; nevertheless, it is still more beneficial than not in data analysis.
However, the data quality can be improved by knowing and compensating for the processor time slices, because with this information an estimate of how many slices the collection request was cut into (from the ratio of elapsed time for each collection request versus the CPU slice time) may be made.
More particularly, further reasoning may be made from the spectrum of data collection times on a given computer. The shortest data collection times are the ones collected uninterrupted, such as in a single time slice or in consecutive time slices. The longest data collection times are those with the largest number of interrupts, in two or more (but likely several) processor slices. For each sampling, an estimate may be made as to how many processor time slices the data collection required.
With this number-of-slices estimate, an estimate may be made as to which counters were read in which time slice. For example, if a four-hundred counter sampling took two time slices, the counters numbered one to two hundred can be considered in the first time slice, and counters numbered two-hundred one to four-hundred in the second time slice; the interpolated times for counters numbered two-hundred one to four-hundred can thus be adjusted with an offset value computed or interpolated for that second time slice. Step 608 represents adjusting the interpolated times based on time slice estimates. Note that this is only an estimate, although statistically valid.
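The slice estimate and the assignment of counters to slices can be sketched as follows in Python. The function names are hypothetical, rounding the elapsed-time-to-slice-time ratio up to a whole number is an assumption, and the counters are assumed to be split evenly across the estimated slices, as in the four-hundred counter example.

```python
import math


def estimate_slices(elapsed_ms, slice_ms):
    """Estimate the number of slices from the elapsed/slice-time ratio."""
    return max(1, math.ceil(elapsed_ms / slice_ms))


def slice_of_counter(index, n_counters, n_slices):
    """Return the 0-based slice assumed to contain the 0-based counter index."""
    return index * n_slices // n_counters


# Counter numbers 200 and 201 are indices 199 and 200.
print(slice_of_counter(199, 400, 2), slice_of_counter(200, 400, 2))  # 0 1
```

An offset for each later slice can then be added to the interpolated times of the counters assigned to it; how that offset is derived is left open here, as it is in the description.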
In the above example, the counters closer to counter number one or closer to number four hundred are more likely to have been read in the properly guessed time slice, whereas it is more uncertain for counters closer to number two hundred whether a given counter was read in the first or second time slice. Thus, at step 608, the timing error value that may be associated with each counter (e.g., at step 606) may be recomputed or adjusted based on its position relative to the time slices.
Continuing with the same example, the uncertainty increases from zero or near-zero error at counter number one (because it is highly probable that counter 1 was read in the first time slice) to its highest uncertainty value around counter number two hundred, and decreases back towards zero for counter number four hundred (because it is highly probable that counter number four hundred was read in the second time slice). The time errors for each counter can be computed and/or adjusted based on this uncertainty value.
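One possible shape for such an uncertainty value is sketched below in Python. The description does not prescribe a formula, so the triangular weighting used here (1.0 at an estimated slice boundary, falling linearly to 0.0 one slice-width away) is purely an illustrative assumption, as is the function name.

```python
def boundary_uncertainty(index, n_counters, n_slices):
    """Relative uncertainty (0.0 to 1.0) about which slice a counter was read in.

    index is the 0-based position of the counter in the reading order.
    Slice boundaries are assumed to split the counter list evenly.
    """
    if n_slices < 2:
        return 0.0                               # one slice: nothing to confuse
    width = n_counters / n_slices                # counters per estimated slice
    nearest = min(abs(index - j * width) for j in range(1, n_slices))
    return max(0.0, 1.0 - nearest / width)


print(boundary_uncertainty(0, 400, 2))    # 0.0
print(boundary_uncertainty(200, 400, 2))  # 1.0
```

For four hundred counters in two slices, the value is zero at counter number one, peaks around counter number two hundred, and falls back to nearly zero at counter number four hundred. The value could be used to scale a per-counter time error.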
Thus, because it is not known where an allotted time slice ends with respect to reading the counters, any time-slice adjustment introduces error. However, this can be mitigated to an extent by considering the reading order. For example, putting the most important counters first (or last), so that they will be read during the first (or last) time slice, reduces error with respect to the most important counters.
Further, the reading order may be varied to control the error distribution. For example, given enough samplings, a random reading order distributes the error evenly among counters. Counters may be read in a backwards order every other time. Counters may be read with a different starting counter, e.g., if there are one hundred counters, counters number one to one hundred may be read in that order in the first iteration, counters two to one hundred followed by counter one in the second iteration, and so forth.
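The three orderings just described can be sketched in Python as follows. The function name and the mode strings are hypothetical, counters are identified by their 1-based numbers, and iterations are counted from zero.

```python
import random


def reading_order(n_counters, iteration, mode, rng=None):
    """Return the counter numbers (1-based) in the order to read them.

    mode "alternate": forwards on even iterations, backwards on odd ones
    mode "rotate":    start one counter later on each iteration
    mode "random":    shuffle the order (rng may be a random.Random instance)
    """
    order = list(range(1, n_counters + 1))
    if mode == "alternate":
        return order if iteration % 2 == 0 else order[::-1]
    if mode == "rotate":
        k = iteration % n_counters
        return order[k:] + order[:k]
    if mode == "random":
        (rng or random).shuffle(order)
        return order
    raise ValueError(f"unknown mode: {mode}")


print(reading_order(5, 1, "rotate"))     # [2, 3, 4, 5, 1]
print(reading_order(4, 1, "alternate"))  # [4, 3, 2, 1]
```

If the order is varied, the order actually used would need to be stored with each sample set so that any later per-counter time estimate uses the correct positions.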
Further, a combination of the above ordering techniques may be employed. For example, if at least forty counters are assured to be read in the first time slice, the most important forty counters may always be read first, with the other counters beginning at counter forty-one read randomly, read alternating between forwards and backwards and/or read with a varied starting counter. In this manner, the first forty counters have zero error, with the time-slice estimation error more evenly distributed among the remaining counters.
The invention is operational with numerous other general purpose or special purpose computing system environments or configurations. Examples of well known computing systems, environments, and/or configurations that may be suitable for use with the invention include, but are not limited to: personal computers, server computers, hand-held or laptop devices, tablet devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like.
The invention may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, and so forth, which perform particular tasks or implement particular abstract data types. The invention may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in local and/or remote computer storage media including memory storage devices.
With reference to
The computer 710 typically includes a variety of computer-readable media. Computer-readable media can be any available media that can be accessed by the computer 710 and includes both volatile and nonvolatile media, and removable and non-removable media. By way of example, and not limitation, computer-readable media may comprise computer storage media and communication media. Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer-readable instructions, data structures, program modules or other data. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by the computer 710. Communication media typically embodies computer-readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media. The term “modulated data signal” means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of example, and not limitation, communication media includes wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, RF, infrared and other wireless media. Combinations of any of the above should also be included within the scope of computer-readable media.
The system memory 730 includes computer storage media in the form of volatile and/or nonvolatile memory such as read only memory (ROM) 731 and random access memory (RAM) 732. A basic input/output system 733 (BIOS), containing the basic routines that help to transfer information between elements within computer 710, such as during start-up, is typically stored in ROM 731. RAM 732 typically contains data and/or program modules that are immediately accessible to and/or presently being operated on by processing unit 720. By way of example, and not limitation,
The computer 710 may also include other removable/non-removable, volatile/nonvolatile computer storage media. By way of example only,
The drives and their associated computer storage media, described above and illustrated in
The computer 710 may operate in a networked environment using logical connections to one or more remote computers, such as a remote computer 780. The remote computer 780 may be a personal computer, a server, a router, a network PC, a peer device or other common network node, and typically includes many or all of the elements described above relative to the computer 710, although only a memory storage device 781 has been illustrated in
When used in a LAN networking environment, the computer 710 is connected to the LAN 771 through a network interface or adapter 770. When used in a WAN networking environment, the computer 710 typically includes a modem 772 or other means for establishing communications over the WAN 773, such as the Internet. The modem 772, which may be internal or external, may be connected to the system bus 721 via the user input interface 760 or other appropriate mechanism. A wireless networking component 774 such as comprising an interface and antenna may be coupled through a suitable device such as an access point or peer computer to a WAN or LAN. In a networked environment, program modules depicted relative to the computer 710, or portions thereof, may be stored in the remote memory storage device. By way of example, and not limitation,
An auxiliary subsystem 799 (e.g., for auxiliary display of content) may be connected via the user interface 760 to allow data such as program content, system status and event notifications to be provided to the user, even if the main portions of the computer system are in a low power state. The auxiliary subsystem 799 may be connected to the modem 772 and/or network interface 770 to allow communication between these systems while the main processing unit 720 is in a low power state.
While the invention is susceptible to various modifications and alternative constructions, certain illustrated embodiments thereof are shown in the drawings and have been described above in detail. It should be understood, however, that there is no intention to limit the invention to the specific forms disclosed, but on the contrary, the intention is to cover all modifications, alternative constructions, and equivalents falling within the spirit and scope of the invention.