Devices that run software fail at varying rates over time. Failures are unavoidable occurrences that often stem from the inherent imperfection of complex hardware and software systems. It has been a longstanding practice to identify software failures by storing records of failures when they occur on devices, and then collecting those failure records in a central repository for analysis and issue identification. However, this approach has recently become less effective and less convenient for improving the experiences of device users. Software developers take advantage of increasing hardware capabilities and write code to capture larger amounts of failure data with finer granularity. Moreover, devices with high levels of network connectivity may be subjected to frequent updates, software installations, and configuration changes, which tends to increase software failure rates.
These factors have led to a proliferation of failure data, which can cause problems. Increasing amounts of failure data require additional network bandwidth and power to transmit from a device to a collection service. For resource-limited devices such as mobile phones, this can have varying degrees of impact on battery life, network usage fees, available processor cycles, etc. In addition, increasing volume, granularity, and frequency of debugging data received by a software provider's collection system can make it difficult to prioritize issues that are occurring on devices. It has not previously been appreciated that the expansion of failure data and the corresponding range of issues being reported make it difficult to identify the issues that have the greatest impact on the actual usability of devices.
Described below are techniques related to reducing amounts of failure data while improving the content of the failure data to enable rapid identification of issues that are having the greatest individual or collective impact on users.
The following summary is included only to introduce some concepts discussed in the Detailed Description below. This summary is not comprehensive and is not intended to delineate the scope of the claimed subject matter, which is set forth by the claims presented at the end.
Embodiments relate to a device ecosystem in which devices collect and forward failure data to a control system that collects and analyzes the failure data. The devices record, categorize, transform, and report failure data to the control system. Failures on a device can be counted and also correlated over time with tracked changes in state of the device (e.g., in use, active, powered on). Different types of Mean Time To Failure (MTTF) statistics are efficiently computed in an ongoing manner. A pool of statistical failure data pushed by devices can be used by the control system to select devices from which to pull detailed failure data.
Many of the attendant features will be explained below with reference to the following detailed description considered in connection with the accompanying drawings.
The present description will be better understood from the following detailed description read in light of the accompanying drawings, wherein like reference numerals are used to designate like parts in the accompanying description.
Embodiments discussed below relate to improving failure reporting and issue analysis. Discussion will begin with an overview of a device ecosystem in which devices collect and forward failure data to a control system that collects and analyzes the failure data. Covered next will be software embodiments to run on a device to record, transform, and report failure data. Examples of categories of failures and details of how related failure data can be derived and summarized are then discussed. This is followed by explanation of types of failure statistics and how they can be efficiently computed and maintained over potentially long periods of time. Described next are techniques to capture and incorporate, into failure data, data about device state that can relate failure issues to likelihoods or degrees of negative effects on users. Finally, central collection and employment of failure data is described, including how a large pool of statistical failure data pushed by devices can inform how a control system selects devices from which to pull detailed failure data.
A telemetry framework is implemented at the devices 104 and at a control system 105. Telemetry instrumentation on the devices 104 collects failure data and pushes failure reports 106 across a network 108 to a telemetry collection service 110 of the control system 105. The control system 105 can be implemented as software running on one or more server devices. The collection service 110 receives the failure reports 106, parses them for syntactic correctness, extracts the failure data, and stores the extracted failure data in a telemetry database 114. The failure reports 106 might be structured documents, and the collection service 110 can be implemented as an HTTPS (hypertext transfer protocol secure) server servicing file upload requests or HTTP posts. Techniques for reporting and collecting diagnostic data are known and details thereof may be found elsewhere. The control system 105 may also have a telemetry controller 116. As described further below, the telemetry controller 116 uses the failure data in the telemetry database 114 to select devices for acquisition of detailed failure data and sends pull requests 118 to those devices.
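By way of a hypothetical illustration only, and not as a description of any particular embodiment, a failure report might be expressed as a small structured (e.g., JSON) document and posted to the collection service over HTTPS. The field names, endpoint, and values in the following Python sketch are assumptions made solely for illustration.

# Illustrative sketch only; field names and endpoint are hypothetical assumptions.
import json
import urllib.request

def upload_failure_report(report, url="https://telemetry.example.com/reports"):
    """Serialize a failure report and post it to a hypothetical collection service."""
    body = json.dumps(report).encode("utf-8")
    request = urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.status  # e.g., 200 when the report was collected

example_report = {
    "device_id": "device-1234",          # hypothetical identifier
    "report_period_hours": 24,
    "entries": [
        {"failure_type": "app_crash", "count": 3,
         "uptime_seconds": 79200, "active_use_seconds": 14400},
    ],
}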
The telemetry collector 110 stores the device's failure data into the telemetry database 114, which is used by the telemetry controller 116. The telemetry controller 116 queries the telemetry database 114 and obtains the device's failure data. If the failure data indicates a sufficient impairment of the device 104 or usability thereof, the telemetry controller 116 transmits a pull request 118. The device 104 responds to the pull request 118 by transmitting detailed failure data 119 or debugging data to the telemetry controller 116 or another collection point such as a debugging system.
The failure events 162 are recorded as failure records 164 in a failure log 166. Failure reporting and recording can be implemented in known ways. However, the failure log 166 can possibly contain a large number of failure records 164 covering a wide range of issues of varying significance to the user. Consequently, simply sending the failure log 166 to the control system 105 would be inefficient and of limited value. To improve the quality and information density of the failure data that is ultimately sent in a failure report 106, several techniques are used on the device 104.
To filter and condense the failure records 164, an event filter 168 is configured to recognize different categories or types of failure records 164, determine which failure category they are associated with, and store them (or indications thereof such as timestamps and failure-type identifiers) in corresponding failure logs 170. As an example, consider an application generating a first failure record that identifies an internal logic error and a second failure record that indicates an erroneous termination of the application. Suppose also that a system service fails and a corresponding third failure record is generated. The event filter 168 might: skip the first failure record, identify the second failure record as belonging to a first category of failures and store the second failure record (or a portion of its information) in a first failure log 170, and recognize that the third failure record belongs to a second category of failures and store the third failure record in a second failure log 170. The result is that the failure logs 170 accumulate select categories of failure records. The failure records may include typical diagnostic information such as timestamps, identification of the source of the failure, the type of failure or failure event, state of the device or software thereon when the failure occurred, etc.
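The following is a minimal Python sketch of such category-based filtering; the failure-record fields, category names, and the mapping from failure types to categories are hypothetical and used only for illustration.

# Minimal sketch of an event filter; record fields and category names are hypothetical.
from collections import defaultdict

# Map of recognized failure types to failure categories; unlisted types are skipped.
CATEGORY_BY_TYPE = {
    "app_abnormal_termination": "application_failures",
    "service_failure": "system_failures",
}

failure_logs = defaultdict(list)  # category -> list of condensed failure indications

def filter_failure_record(record):
    """Store a condensed indication of the record in the log for its category, if any."""
    category = CATEGORY_BY_TYPE.get(record.get("type"))
    if category is None:
        return  # e.g., an internal logic error not tracked by any category is skipped
    failure_logs[category].append({
        "timestamp": record["timestamp"],
        "failure_type": record["type"],
        "source": record.get("source"),
    })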
As noted above, the agent software also collects time computation information that can be incorporated into the failure data to improve the meaningfulness of statistical calculations such as mean time to failure (MTTF). As observed by the inventors, not all recorded failures on a device are failures that affect a user of the device. As further observed by the inventors, some failures are unlikely to be noticed by a user because they occur while the failing software is running in the background or is not visible to the user. Moreover, as observed by the inventors, some failures occur while a device is powered on but is not being actively used, and those failures are therefore less likely to have affected the user. As further observed by the inventors, the amount of time that a device is powered on and/or in active use can significantly affect the predictive value of failure statistics such as MTTF. By capturing the right type of data, user-affecting failure statistics can be computed. That is to say, a statistic such as “mean time to user-noticeable failure” or the like can be computed.
To that end, a time computation monitor 172 logs the times of various types of occurrences on the device 104 or of various changes of a state of the device 104. Time events can be obtained from any source, such as hooks 174 into the kernel, applications, a windowing system, system services, the failure log 166, other logs such as boot logs, and so forth. In one embodiment, the time computation monitor 172 captures boundaries of types of time periods such as uptime and active use time. Beginnings of uptime periods are bounded by any indications of the device being powered on and/or booted. Ends of uptime periods can be identified from information corresponding to: the device being powered off by the user, the operating system being shut down or restarted cleanly, a type of failure that is usually accompanied by a restart of a device, any arbitrary last timestamp in any log that precedes a significant time without timestamps, etc.
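One illustrative way to assemble uptime periods from such indications, assuming hypothetical event names and a chronologically ordered event log, is sketched below in Python.

# Sketch of deriving uptime periods from logged power/boot events; event names are hypothetical.
START_EVENTS = {"power_on", "boot_complete"}
END_EVENTS = {"clean_shutdown", "restart", "crash_followed_by_restart"}

def uptime_periods(events):
    """events: iterable of (timestamp, event_name) pairs in chronological order."""
    periods, start = [], None
    for timestamp, name in events:
        if name in START_EVENTS and start is None:
            start = timestamp          # an uptime period begins
        elif name in END_EVENTS and start is not None:
            periods.append((start, timestamp))  # the uptime period ends
            start = None
    return periods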
In a similar vein, the time computation monitor 172 can capture boundaries of periods of active use of the device. A period of active use can be identified by recognizing when certain types of activities are “live” or ongoing. Because the activities that are monitored can be concurrent (overlap), activity periods (periods when any activity type occurs) can be recognized by (i) identifying the start of an activity period by detecting when an activity of any type begins while no activity is currently in progress, and (ii) identifying the end of that activity period by detecting when there ceases to be an activity of any type in progress. In other words, a period of activity corresponds to a period of time during which there was continuously at least one activity in progress; a long activity period can be defined by sequences of perhaps short overlapping activities. Time periods can be marked by start times and end times.
Following are some examples of occurrences that can be used to identify different types of activities, any mix of which can indicate a period of active use:
(i) backlight is powered on, then
(ii) backlight is powered off;
(i) speaker starts playing audio for >5 seconds, then
(ii) speaker stops playing audio for >5 seconds;
(i) headphone jack starts playing audio for >5 seconds, then
(ii) headphone jack stops playing audio for >5 seconds;
(i) Bluetooth radio starts transmitting a phone call, music, or other persistent audio signal for >5 seconds, then
(ii) Bluetooth radio stops transmitting a phone call, music, or other persistent audio signal for >5 seconds;
(i) an application starts running under the lock screen, then
(ii) an application stops running under the lock screen.
To summarize, the time computation monitor 172 records one or more types of time-computation periods (e.g., periods of being powered up, periods of active use, etc.) by storing corresponding start/end timestamps for different types of time-computation periods in a time computation event log 175.
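A minimal sketch of deriving active-use periods by merging overlapping activity intervals, assuming activities are represented as (start, end) timestamp pairs, might look like the following Python.

# Illustrative sketch: merge overlapping (start, end) activity intervals into active-use periods.
def merge_activity_periods(activities):
    """activities: iterable of (start, end) timestamps; returns merged active-use periods."""
    merged = []
    for start, end in sorted(activities):
        if merged and start <= merged[-1][1]:
            # Activity begins while another is still in progress: extend the current period.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            # No activity in progress: a new active-use period begins.
            merged.append((start, end))
    return merged

# Example: three short overlapping activities form one long active-use period.
# merge_activity_periods([(0, 10), (5, 20), (18, 25), (40, 50)]) -> [(0, 25), (40, 50)]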
Returning to
Finally, a report generator 180 periodically (e.g., every 24 hours) uses the observation log 178 to add up the statistics for each failure type during the most recent report period (the time since a last failure report was generated).
Incremental observations can be performed by keeping track of which portions of the time and failure logs have not been processed. Each time the observation logger executes, it consumes the portions of the logs that have not yet been processed, and then updates the logs accordingly.
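A minimal sketch of that bookkeeping, assuming each log is an append-only sequence of entries, follows; the class and names are hypothetical.

# Minimal sketch of incremental log consumption; the log representation is an assumption.
class IncrementalLogReader:
    """Tracks how much of an append-only log has already been processed."""
    def __init__(self):
        self.processed_count = 0

    def consume_new_entries(self, log_entries):
        """Return only the entries appended since the last call, then advance the marker."""
        new_entries = log_entries[self.processed_count:]
        self.processed_count = len(log_entries)
        return new_entries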
Returning to
The report generator generates a report observation for each failure type, each of which is stored in a report log 212, file, telemetry report package, etc. Conceptually, the report generator computes the same types of statistics that the observation logger computes, but for longer intervals, and by combining the statistics in the observation log rather than by parsing the failure logs 170 and the time computation event log 175. Specifically, at step 206, the report generator generates a report observation by obtaining and combining the observations in the observation log for each failure type, for the current reporting cycle (e.g., for all observations that have not yet been reported). That is, a report observation includes a report entry—a set of failure counts and time durations—for each failure type. In addition to periodically computing the report observations, the report generator keeps cumulative statistics for each failure type. At step 208, those cumulative statistics are updated per the new report observation, and at step 210 the new observation report, with cumulative statistics, is stored in the report log 212 or some other container such as a report 106 for transmission to the telemetry collector.
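For illustration, assuming hypothetical observation fields (a failure count plus uptime and active-use durations), combining observations for a failure type and deriving an MTTF-style value can be as simple as summing counts and durations, as in the following Python sketch.

# Sketch of combining per-period observations into a report entry; field names are hypothetical.
def combine_observations(observations):
    """Each observation: {'failures': int, 'uptime_s': float, 'active_use_s': float}."""
    entry = {"failures": 0, "uptime_s": 0.0, "active_use_s": 0.0}
    for obs in observations:
        entry["failures"] += obs["failures"]
        entry["uptime_s"] += obs["uptime_s"]
        entry["active_use_s"] += obs["active_use_s"]
    return entry

def mean_time_to_failure(total_time_s, failure_count):
    """MTTF over the chosen time basis (uptime or active use); undefined when no failures occurred."""
    return total_time_s / failure_count if failure_count else None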
In practice, each failure type will have a similar failure entry that is generated and reported by each execution of the report generator (see
With the database 114 containing accumulated failure data 331 from respective devices for possibly long periods of time up to nearly current time, the control system 105 performs a process 332 for pulling additional failure or debugging data, if needed. The process 332 starts with an initial dataset from a set of one or more devices. The dataset can be filtered based on a variety of query conditions, such as device type, date or duration, software installed, software or operating system version, firmware, or any other data associated with devices. In one embodiment, rich device data can be linked in from other systems that track devices. In another embodiment, device information is provided in the failure reports 212. In any case, given a dataset of devices, the corresponding failure data for each device is obtained. Any of the MTTF calculations described herein are performed for each device using the corresponding data from the database 114 (how to combine sequences of statistics for a device is discussed below with reference to
The control system 105 also has a process 334 for pulling detailed telemetry or failure data from the devices identified as having significant MTTF values. The process 334 can be an ongoing process that pulls data from any device that enters the queue. Any time process 332 is run, the process 334 will begin sending pull requests 336 to devices as they enter the queue, even while the process 332 is running. Alternatively, the process 334 can be a batch process that communicates with devices after process 332 has finished. The control system 105 sends a pull request 336 to a selected device via the network 108. The agent or telemetry software on the targeted device performs a process 338, which involves receiving the request 336, collecting the requested data such as debugging logs, binary crash dumps, crash reports, execution traces, or any other information on the device. The detailed telemetry data 340 is then returned to the control system 105 or another collection point such as a bug management system. In one embodiment, the telemetry data 340 can include information such as a failure log 170 for a failure category whose MTTF triggered the request 336 for additional telemetry data. The detailed telemetry data 340 can also be included in the next report that will be sent by the device.
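The following Python sketch illustrates one possible selection step on the control system side; the threshold value, queue, and statistic shape are assumptions made for illustration rather than features of any particular embodiment.

# Illustrative sketch: select devices whose MTTF indicates likely user impact and queue pull requests.
from queue import Queue

PULL_REQUEST_QUEUE = Queue()
MTTF_THRESHOLD_S = 4 * 60 * 60  # hypothetical: fewer than four hours of active use per failure

def select_devices_for_pull(device_statistics):
    """device_statistics: iterable of (device_id, active_use_s, failure_count) tuples."""
    for device_id, active_use_s, failure_count in device_statistics:
        if failure_count == 0:
            continue
        mttf = active_use_s / failure_count
        if mttf < MTTF_THRESHOLD_S:
            PULL_REQUEST_QUEUE.put(device_id)  # process 334 consumes the queue and sends pull requests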
As noted above, if statistics in reports from a device are stored as received, i.e., if the statistics of a device for each report (e.g., daily) are stored, MTTF statistics can be computed for arbitrary sequences of those time periods. For instance, if the database 114 is storing N days' worth of statistics for a device, then an MTTF for an arbitrary period from day J to day K can be computed by combining the statistics of those days. Alternatively, the stored statistics can be consolidated into larger time units, such as weeks or months, which trades granularity for less storage use. The granularity of a device's statistics can be graduated, where granularity decreases with age; for example, daily reports are stored for the last 30 days, which are later consolidated into weekly statistics for the last 6 months, which are later consolidated into monthly statistics for the last year, etc. When a new month arrives, for example, the weekly MTTF statistics for that month can be summed and MTTF values for that month can be calculated therefrom.
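A sketch of both operations, assuming per-day records holding a failure count and time totals, follows in Python; the record shape and field names are hypothetical.

# Sketch of combining stored daily statistics; the record shape is an assumption.
def mttf_over_range(daily_stats, first_day, last_day, time_field="active_use_s"):
    """daily_stats: dict of consecutive day index -> {'failures', 'active_use_s', 'uptime_s'}."""
    days = range(first_day, last_day + 1)
    failures = sum(daily_stats[d]["failures"] for d in days)
    total_time = sum(daily_stats[d][time_field] for d in days)
    return total_time / failures if failures else None

def consolidate(days):
    """Collapse a sequence of daily records into one coarser record (e.g., a week or a month)."""
    return {
        "failures": sum(d["failures"] for d in days),
        "active_use_s": sum(d["active_use_s"] for d in days),
        "uptime_s": sum(d["uptime_s"] for d in days),
    }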
As discussed in U.S. patent application Ser. No. 14/676,214, a software updating system 420 can be constructed to use device telemetry data to inform which devices should receive which available operating system or application updates. The MTTF failure data and techniques for identifying problematic devices can be used to select which devices to update and/or which updates to use. The MTTF failure data of an individual device has (or can be linked to) update-relevant information about the device, for instance a device model or make, a software version, a type of CPU, an amount of memory, a type of cellular network, a cellular provider identity, or anything else.
An update monitor 422 receives an indication from the control system 105 that a particular device is to be targeted for possible updating. The update monitor 422 optionally passes update-selection data to a diagnostic system (not shown). The update-selection data might be any information about the device and/or the MTTF that triggered its selection, such as: identity of the device, the relevant MTTF type, a failure event type that contributed to the MTTF value, etc. Information about the device's configuration such as software version, model, operating system, etc., can be passed with the update-selection information, or such information can be obtained by the diagnostic system. The diagnostic system in turn determines a best update and informs the update monitor 422 accordingly. The update monitor 422 then informs an update distributor 424 of the identified device and the identified update, and the update distributor 424 causes the update to be sent to the device.
The system architecture is not important. What is significant is leveraging the MTTF data to automatically prioritize which devices should receive updates, or to automatically determine which devices should be updated and/or which updates to apply to which devices. Instead of sending an update to a selected device, a notification can be provided to the device, or the identity of the update can be associated with the device, for example at a website or software distribution service regularly visited by the device. When a device visits a page of the website or communicates with the software distribution service, the device displays information about the update associated with the device.
The MTTF data can also be used by a tool 430 such as a client application. The tool 430 accesses the MTTF data from the control system 105. The tool 430 then displays user interfaces 432 for visualizing and exploring the MTTF data.
The computing device 450 may have a display 452, a network interface 454, as well as storage 456 and processing hardware 458, which may be a combination of any one or more of: central processing units, graphics processing units, analog-to-digital converters, bus chips, Field-programmable Gate Arrays (FPGAs), Application-specific Integrated Circuits (ASICs), Application-specific Standard Products (ASSPs), or Complex Programmable Logic Devices (CPLDs), etc. The storage 456 may be any combination of magnetic storage, static memory, volatile memory, etc. The meaning of the term “storage”, as used herein, does not refer to signals or energy per se, but rather refers to physical apparatuses, possibly virtualized, including physical media such as magnetic storage media, optical storage media, memory devices, etc., but not signals per se. The hardware elements of the computing device 450 may cooperate in ways well understood in the art of computing. In addition, input devices may be integrated with or in communication with the computing device 450. The computing device 450 may have any form factor or may be used in any type of encompassing device. The computing device 450 may be in the form of a handheld device such as a smartphone, a tablet computer, a gaming device, a server, a rack-mounted or backplaned computer-on-a-board, a system-on-a-chip, or others.
Embodiments and features discussed above can be realized in the form of information stored in volatile or non-volatile computer or device readable media. This is deemed to include at least media such as optical storage (e.g., compact-disk read-only memory (CD-ROM)), magnetic media, flash read-only memory (ROM), or any current or future means of storing digital information. The stored information can be in the form of machine executable instructions (e.g., compiled executable binary code), source code, bytecode, or any other information that can be used to enable or configure computing devices to perform the various embodiments discussed above. This is also deemed to include at least volatile memory such as random-access memory (RAM) and/or virtual memory storing information such as central processing unit (CPU) instructions during execution of a program carrying out an embodiment, as well as non-volatile media storing information that allows a program or executable to be loaded and executed. The embodiments and features can be performed on any type of computing device, including portable devices, workstations, servers, mobile wireless devices, and so on.