This disclosure relates generally to monitoring telemetry data, and, more particularly, to methods and apparatus to monitor telemetry data associated with computing devices.
Users of computing devices may report performance issues and/or other device problems to device manufacturers, service providers, etc., to aid in troubleshooting the issues/problems. Computing devices may also make such reports automatically, and include telemetry data with the reports. However, when a performance issue is reported by a user of a computing device, or the computing device itself automatically, the manufacturers/providers often do not have sufficient information (e.g., telemetry data) to recreate the complex environment in which the performance issue occurred.
The figures are not to scale. In general, the same reference numbers will be used throughout the drawing(s) and accompanying written description to refer to the same or like parts. As used herein, connection references (e.g., attached, coupled, connected, and joined) may include intermediate members between the elements referenced by the connection reference and/or relative movement between those elements unless otherwise indicated. As such, connection references do not necessarily imply that two elements are directly connected and/or in fixed relation to each other.
Unless specifically stated otherwise, descriptors such as “first,” “second,” “third,” etc. are used herein without imputing or otherwise indicating any meaning of priority, physical order, arrangement in a list, and/or ordering in any way, but are merely used as labels and/or arbitrary names to distinguish elements for ease of understanding the disclosed examples. In some examples, the descriptor “first” may be used to refer to an element in the detailed description, while the same element may be referred to in a claim with a different descriptor such as “second” or “third.” In such instances, it should be understood that such descriptors are used merely for identifying those elements distinctly that might, for example, otherwise share a same name.
As noted above, users of computing devices may report performance issues and/or other device problems to device manufacturers, service providers, etc., to aid in troubleshooting the issues/problems. The computing devices may additionally or alternatively make such reports automatically, and include telemetry data with the reports. In either case, a performance issue or other problem (generally referred to as an “issue” herein) in the field is first detected, and then reported to the appropriate contact via the appropriate channel. Assuming a case is made to investigate the issue, resources are spent to reproduce and root cause the issue in the lab. However, reproduction and root causing may not be possible due to a lack of sufficient information (e.g., telemetry data), in which case the reported issue is merely classified as a sighting. Although techniques such as instrumentation or telemetry can be used to help detect field issues, they may not provide a holistic view of the system and/or account for interaction among the various components of the system.
In examples disclosed herein, a telemetry monitoring tool collects telemetry data upon the detection of an operational condition. Example implementations of the telemetry monitoring tool, which can be implemented in hardware, software, and/or a combination thereof, include a telemetry collector to collect a first set of telemetry data to form a telemetry data timeline and detect operational conditions, an actuator to collect a second set of telemetry data corresponding to a particular detected operational condition, an annotator to annotate the telemetry data timeline, a data reporter to report telemetry data from the telemetry collector and annotator, and a policy file updater to implement an updated policy file containing instructions to update the operation of the telemetry monitoring tool. By obtaining the second set of telemetry data along with the telemetry data timeline, complex execution contexts seen in the field can be reproduced.
Examples disclosed herein allow device manufacturers and/or providers to collect improved telemetry data, enabling them to better understand operational conditions that occur in the field. Additionally, these device manufacturers and/or providers can use this better understanding to create products that minimize the impact of operational conditions in the field, which results in enhanced performance of overall systems. Furthermore, this tool can be licensed and implemented on several platforms, allowing licensed companies to better understand operational conditions that occur on their computing devices.
The example computing device 100 generates telemetry data (e.g., application data, system data, etc.) that can be monitored and collected. In the illustrated example, the computing device 100 communicates with the example backend server 110 through the example network 105. However, the example computing device 100 could alternatively communicate with the example backend server 110 directly. Furthermore, in some examples, the example computing device 100 may communicate directly with an example lab environment 115.
The example network 105 creates a pathway for client device data to be communicated to the example backend server 110. The example network 105 of the illustrated example of
The example backend server 110 of the illustrated example of
The example lab environment 115 recreates the complex environment seen on an example computing device 100 by recreating the execution process of the example computing device before, during, and after the operational condition is detected. The example lab environment 115 performs root cause analysis using the data received at the example backend server 110. The example lab environment 115 also generates an example updated policy file 120 to improve data collection before, after, and in response to operational conditions and operational condition detection.
In the example system of
In the illustrated example of
The example telemetry collector 305 of the illustrated example of
In the illustrated example of
CPU hot-spot analysis may include the amount of time a CPU spent executing one or more functions within program code, which can be used to determine if execution of the program code went as expected, and where future program code optimizations may be focused. GPU hot-spot analysis, similar to CPU hot-spot analysis, may include the amount of time a GPU spent executing one or more functions within program code, which can be used to determine if execution of the code went as expected, and where future code optimizations may be focused. CPU and GPU hot-spot analyses are used when known context indicates where the operational condition may be stemming from. System-wide micro-architectural analysis is used when there is no such context. This analysis examines the entire system, generating data that may indicate a possible cause of the operational condition or where to spend additional resources to improve the functionality of the system.
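As a non-limiting illustration, the per-function time accounting described above can be sketched as follows. The sample format (function name paired with seconds spent) and the name `hot_spots` are assumptions made for this sketch only; the disclosure does not specify how hot-spot data is represented.

```python
from collections import Counter

def hot_spots(samples, top=3):
    """Rank functions by total execution time.

    `samples` is a hypothetical list of (function_name, seconds) pairs,
    for example one entry per profiler sample.
    """
    totals = Counter()
    for function_name, seconds in samples:
        totals[function_name] += seconds
    # most_common() returns the entries with the largest totals first.
    return totals.most_common(top)

samples = [("render", 0.5), ("decode", 0.25), ("render", 0.25), ("io", 0.125)]
print(hot_spots(samples, top=2))
```

The functions at the head of the returned list are the candidates on which future code optimizations may be focused.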
The examples provided above are examples of operational conditions and the further analysis and/or measurements to be conducted by the actuator 310 in response to the example telemetry collector 305 detecting one of these operational conditions. If multiple detected operational conditions coincide (e.g., overlap, occur simultaneously, occur within a window of time of each other, etc.), the actuator 310 may access and execute the policy file instructions for the operational conditions in an order specified in the policy file 330. In some alternative examples, the actuator 310 could collect analysis data for multiple operational conditions in parallel, with the analysis data for each operational condition to be included in the second set of telemetry data. In some examples, the actuator 310 collects a second and a third set of telemetry data, the second set of telemetry data to include analysis data from a first operational condition, the third set of telemetry data to include analysis data from a second operational condition.
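One possible way to handle coinciding operational conditions in a policy-specified order is sketched below. The condition names and the `order` list are hypothetical; the disclosure does not define a schema for the policy file 330.

```python
# Hypothetical policy fragment: earlier entries are handled first.
POLICY_ORDER = ["cpu_temp_high", "fps_low", "ux_distress"]

def order_triggers(triggers, policy_order):
    """Return coinciding triggers sorted by the order given in the policy."""
    rank = {name: position for position, name in enumerate(policy_order)}
    # Triggers the policy does not list sort last; sorted() is stable,
    # so such triggers keep their arrival order.
    return sorted(triggers, key=lambda name: rank.get(name, len(rank)))

print(order_triggers(["ux_distress", "unknown", "cpu_temp_high"], POLICY_ORDER))
```

Keeping the order in the policy file, rather than in program code, would allow an updated policy file to change the order without changing the tool itself.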
The actuator 310 of the illustrated example of
In the illustrated example of
In the illustrated example of
In the illustrated example of
While an example manner of implementing the telemetry monitoring tool of
Flowcharts representative of example hardware logic, machine readable instructions, hardware implemented state machines, and/or any combination thereof for implementing the example telemetry monitoring tool 125 of
The machine readable instructions described herein may be stored in one or more of a compressed format, an encrypted format, a fragmented format, a compiled format, an executable format, a packaged format, etc. Machine readable instructions as described herein may be stored as data or a data structure (e.g., portions of instructions, code, representations of code, etc.) that may be utilized to create, manufacture, and/or produce machine executable instructions. For example, the machine readable instructions may be fragmented and stored on one or more storage devices and/or computing devices (e.g., servers) located at the same or different locations of a network or collection of networks (e.g., in the cloud, in edge devices, etc.). The machine readable instructions may require one or more of installation, modification, adaptation, updating, combining, supplementing, configuring, decryption, decompression, unpacking, distribution, reassignment, compilation, etc. in order to make them directly readable, interpretable, and/or executable by a computing device and/or other machine. For example, the machine readable instructions may be stored in multiple parts, which are individually compressed, encrypted, and stored on separate computing devices, wherein the parts when decrypted, decompressed, and combined form a set of executable instructions that implement one or more functions that may together form a program such as that described herein.
In another example, the machine readable instructions may be stored in a state in which they may be read by processor circuitry, but require addition of a library (e.g., a dynamic link library (DLL)), one or more shared objects, a software development kit (SDK), an application programming interface (API), etc. in order to execute the instructions on a particular computing device or other device. In another example, the machine readable instructions may need to be configured (e.g., settings stored, data input, network addresses recorded, etc.) before the machine readable instructions and/or the corresponding program(s) can be executed in whole or in part. Thus, machine readable media, as used herein, may include machine readable instructions and/or program(s) regardless of the particular format or state of the machine readable instructions and/or program(s) when stored or otherwise at rest or in transit.
The machine readable instructions described herein can be represented by any past, present, or future instruction language, scripting language, programming language, etc. For example, the machine readable instructions may be represented using any of the following languages: C, C++, Java, C#, Perl, Python, JavaScript, HyperText Markup Language (HTML), Structured Query Language (SQL), Swift, etc.
As mentioned above, the example processes of
“Including” and “comprising” (and all forms and tenses thereof) are used herein to be open ended terms. Thus, whenever a claim employs any form of “include” or “comprise” (e.g., comprises, includes, comprising, including, having, etc.) as a preamble or within a claim recitation of any kind, it is to be understood that additional elements, terms, etc. may be present without falling outside the scope of the corresponding claim or recitation. As used herein, when the phrase “at least” is used as the transition term in, for example, a preamble of a claim, it is open-ended in the same manner as the term “comprising” and “including” are open ended. The term “and/or” when used, for example, in a form such as A, B, and/or C refers to any combination or subset of A, B, C such as (1) A alone, (2) B alone, (3) C alone, (4) A with B, (5) A with C, (6) B with C, and (7) A with B and with C. As used herein in the context of describing structures, components, items, objects and/or things, the phrase “at least one of A and B” is intended to refer to implementations including any of (1) at least one A, (2) at least one B, and (3) at least one A and at least one B. Similarly, as used herein in the context of describing structures, components, items, objects and/or things, the phrase “at least one of A or B” is intended to refer to implementations including any of (1) at least one A, (2) at least one B, and (3) at least one A and at least one B. As used herein in the context of describing the performance or execution of processes, instructions, actions, activities and/or steps, the phrase “at least one of A and B” is intended to refer to implementations including any of (1) at least one A, (2) at least one B, and (3) at least one A and at least one B. 
Similarly, as used herein in the context of describing the performance or execution of processes, instructions, actions, activities and/or steps, the phrase “at least one of A or B” is intended to refer to implementations including any of (1) at least one A, (2) at least one B, and (3) at least one A and at least one B.
As used herein, singular references (e.g., “a”, “an”, “first”, “second”, etc.) do not exclude a plurality. The term “a” or “an” entity, as used herein, refers to one or more of that entity. The terms “a” (or “an”), “one or more”, and “at least one” can be used interchangeably herein. Furthermore, although individually listed, a plurality of means, elements or method actions may be implemented by, e.g., a single unit or processor. Additionally, although individual features may be included in different examples or claims, these may possibly be combined, and the inclusion in different examples or claims does not imply that a combination of features is not feasible and/or advantageous.
The example telemetry collector 305 then determines if an operational condition has been detected (block 410). A first example of an operational condition is a scenario in which a UX-distress signal is asserted consecutively for more than a threshold number of seconds. A second example of an operational condition is a scenario in which a moving average of the frames per second (FPS) measured for a foreground window drops below a threshold. Yet another example of an operational condition is a scenario in which a moving average of the temperature measured for a CPU core goes above a threshold. If the example telemetry collector 305 determines that an operational condition has been detected (e.g., block 410 returns a result of YES), the example telemetry collector 305 outputs a trigger indicative of that operational condition and continues to collect telemetry data (block 415). If the example telemetry collector 305 determines that an operational condition has not occurred (e.g., block 410 returns a result of NO), the example telemetry collector 305 continues to collect telemetry data (block 405). In some examples, multiple operational conditions are detected coincidentally (e.g., overlap, occur simultaneously, occur within a window of time of each other, etc.). As a result, multiple triggers are output coincidentally and handled by the example actuator 310. Regardless of whether or not an operational condition is detected, the example telemetry collector 305 continuously collects telemetry data.
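The moving-average and consecutive-assertion conditions described above can be sketched, under stated assumptions, as two small detectors. The class names, the window size, and the thresholds are hypothetical and are chosen only for illustration.

```python
from collections import deque

class MovingAverageTrigger:
    """Fires when the moving average crosses a threshold."""

    def __init__(self, window, threshold, below=True):
        self.samples = deque(maxlen=window)  # oldest sample drops out automatically
        self.threshold = threshold
        self.below = below  # True: fire when the average drops below the threshold

    def update(self, value):
        self.samples.append(value)
        if len(self.samples) < self.samples.maxlen:
            return False  # not enough samples for a full window yet
        average = sum(self.samples) / len(self.samples)
        return average < self.threshold if self.below else average > self.threshold

class ConsecutiveAssertTrigger:
    """Fires when a signal stays asserted for more than a threshold of seconds."""

    def __init__(self, threshold_seconds):
        self.threshold_seconds = threshold_seconds
        self.asserted_since = None

    def update(self, timestamp, asserted):
        if not asserted:
            self.asserted_since = None  # de-assertion resets the run
            return False
        if self.asserted_since is None:
            self.asserted_since = timestamp
        return timestamp - self.asserted_since > self.threshold_seconds

fps_trigger = MovingAverageTrigger(window=3, threshold=30.0)
print([fps_trigger.update(fps) for fps in [60, 60, 20, 20, 20]])
```

Constructing `MovingAverageTrigger` with `below=False` gives the complementary check for a measurement that rises above a threshold.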
The example actuator 310 then determines if it has detected a trigger signal (block 510). If the example actuator 310 has not detected a trigger (e.g., block 510 returns a result of NO), then the example actuator 310 continues to wait for a trigger (block 505). If the example actuator 310 has detected a trigger (e.g., block 510 returns a result of YES), the example actuator 310 then collects a second set of telemetry data (block 515). In examples disclosed herein, instructions within the example policy file 330 dictate the collection of the second set of telemetry data. The policy file 330 indicates the measurements and/or analysis to be performed and collected in response to each specific trigger. These indicated measurements and/or analysis are targeted at gathering trigger-specific information crucial to environment recreation in the backend. However, any other approach of gathering information specific to a trigger may additionally or alternatively be used.
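A minimal sketch of policy-driven collection is given below, assuming the policy maps each trigger name to a list of measurement names and that each measurement is backed by a callable. Both the mapping layout and the names used here are hypothetical; the disclosure does not specify the format of the policy file 330.

```python
def collect_second_set(trigger, policy, collectors):
    """Run every measurement the policy lists for this trigger."""
    second_set = {}
    for name in policy.get(trigger, []):
        collector = collectors.get(name)
        if collector is not None:  # skip measurements this device cannot take
            second_set[name] = collector()
    return second_set

# Hypothetical policy and stub collectors, for illustration only.
policy = {"fps_low": ["gpu_hot_spots", "running_apps"]}
collectors = {
    "gpu_hot_spots": lambda: [("shade", 0.5)],
    "running_apps": lambda: ["game.exe"],
}
print(collect_second_set("fps_low", policy, collectors))
```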
The example annotator 315 then annotates the telemetry data timeline created by the example telemetry collector 305 (block 520). An example annotation made by the annotator 315 on the telemetry data timeline could indicate the time at which a trigger was output by the example telemetry collector 305, the time at which the trigger was detected by the example actuator 310, the type of analysis performed or collected within the second data set, a list of applications running within the example processor platform 800 at the time the trigger was output or detected, or any other data specific to the complex environment of the example processor platform 800.
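The annotation fields listed above could be recorded as in the following sketch. The `TelemetryTimeline` class and its field names are assumptions made for illustration and are not taken from the disclosure.

```python
from dataclasses import dataclass, field

@dataclass
class TelemetryTimeline:
    samples: list = field(default_factory=list)      # first set of telemetry data
    annotations: list = field(default_factory=list)  # notes added by the annotator

    def annotate(self, output_at, detected_at, analysis, running_apps):
        """Attach one annotation describing a trigger to the timeline."""
        note = {
            "trigger_output_at": output_at,      # time the collector output the trigger
            "trigger_detected_at": detected_at,  # time the actuator detected it
            "analysis": analysis,                # type of analysis in the second set
            "running_apps": list(running_apps),  # applications running at that time
        }
        self.annotations.append(note)
        # Keep annotations ordered by the time the trigger was output.
        self.annotations.sort(key=lambda a: a["trigger_output_at"])
        return note

timeline = TelemetryTimeline()
timeline.annotate(12.0, 12.1, "gpu_hot_spots", ["game.exe"])
print(timeline.annotations)
```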
The data reporter 320 then determines if it has detected a report request (block 610). If the data reporter 320 has not detected a request to report data (e.g., block 610 returns a result of NO), then the data reporter 320 continues to wait for a request to report data (block 605). If the data reporter 320 has received a request to report data (e.g., block 610 returns a result of YES), the data reporter 320 prepares the telemetry data timeline and second set of telemetry data to be reported (block 615). In some examples, the data reporter 320 omits data points within the telemetry data timeline to only include data within a threshold period relative to when an operational condition was detected. In some examples, the data reporter 320 omits data points within the second set of telemetry data to only include data points within a threshold period relative to when an operational condition was detected.
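The omission of data points outside a threshold period can be sketched as a simple window filter. The function name, the (timestamp, value) point format, and the separate `before` and `after` periods are assumptions for this sketch.

```python
def within_window(points, condition_time, before, after):
    """Keep only points whose timestamp lies in
    [condition_time - before, condition_time + after].

    `points` is a hypothetical list of (timestamp, value) pairs.
    """
    start = condition_time - before
    end = condition_time + after
    return [point for point in points if start <= point[0] <= end]

timeline = [(0.0, 60), (4.0, 58), (9.0, 25), (10.0, 20), (12.0, 22), (30.0, 60)]
print(within_window(timeline, condition_time=10.0, before=5.0, after=5.0))
```

Reporting only the window around the detected operational condition reduces the amount of data sent to the backend server 110 while retaining the data collected before, during, and after the condition.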
The data reporter 320 then reports the telemetry data timeline and second set of telemetry data to the backend server 110 (block 620). In this example, the data reporter 320 interacts with an example network interface 220 to communicate with the backend server 110. However, any other approach of communicating data may additionally or alternatively be used. After the data has been reported, the example data reporter 320 again waits for a request to report data (block 605).
The lab environment 115 is prepared to receive the first and second sets of telemetry data from the backend server 110 for lab evaluation (block 710). In some examples, preparing the data for lab evaluation includes reducing the total amount of data to reduce the amount of resources used during lab evaluation. In some examples, there is not enough data received from the backend server 110 for lab evaluation. In some examples, preparation for lab evaluation is not necessary, and the data remains unchanged.
The lab environment 115 is prepared to recreate the complex environment as seen in the field using the data prepared for lab analysis (block 715). The environment is recreated using the data received at the example backend server 110, which includes the first and second sets of telemetry collected by the example telemetry collector 305 and example actuator 310.
The lab environment 115 is prepared to perform root cause analysis using the recreated complex environment to determine the cause of the operational condition experienced in the field (block 720). Using the results of the root cause analysis, the lab environment 115 is prepared to generate an example updated policy file 120 containing updated instructions (block 725).
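Once an updated policy file 120 is available, the policy file on the computing device is replaced with it. A minimal sketch of such a replacement follows, assuming (for illustration only) that the policy is stored as JSON at a local path; the disclosure does not specify the file format, and how the file is retrieved from the server is outside this sketch.

```python
import json
import os
import tempfile

def replace_policy_file(path, new_policy):
    """Replace the policy file at `path` without exposing a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    # Write to a temporary file in the same directory so the final
    # rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as tmp:
            json.dump(new_policy, tmp)
        os.replace(tmp_path, path)  # swaps the old file for the new one
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

Writing the new policy to a temporary file first means that a reader of the policy file sees either the old policy or the new policy, but not a partially written one.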
The example processor platform 800 of the illustrated example includes a processor 812. The processor 812 of the illustrated example is hardware. For example, the processor 812 can be implemented by one or more integrated circuits, logic circuits, microprocessors, GPUs, DSPs, or controllers from any desired family or manufacturer. The hardware processor may be a semiconductor based (e.g., silicon based) device. In this example, the processor implements the telemetry monitoring tool 125. In this example, the processor implements the telemetry collector 305 to collect a first set of telemetry data to be stored in local memory 813. In this example, the processor also contains trigger logic to detect an operational condition within the first set of telemetry data. The processor also implements the actuator 310 by collecting a second set of telemetry data to be stored in local memory 813. In this example, the processor also annotates the first set of telemetry data, implementing the annotator 315. The example processor interacts with an interface circuit 820 and network 805 to report the telemetry data timeline and second data sets to the backend server 110, implementing the data reporter 320. The processor is able to receive updated policy files, implementing the policy file updater 325.
The processor 812 of the illustrated example includes a local memory 813 (e.g., a cache). The processor 812 of the illustrated example is in communication with a main memory including a volatile memory 814 and a non-volatile memory 816 via a bus 818. The volatile memory 814 may be implemented by Synchronous Dynamic Random Access Memory (SDRAM), Dynamic Random Access Memory (DRAM), RAMBUS® Dynamic Random Access Memory (RDRAM®) and/or any other type of random access memory device. The non-volatile memory 816 may be implemented by flash memory and/or any other desired type of memory device. Access to the main memory 814, 816 is controlled by a memory controller.
The example processor platform 800 of the illustrated example also includes an interface circuit 820. The interface circuit 820 may be implemented by any type of interface standard, such as an Ethernet interface, a universal serial bus (USB), a Bluetooth® interface, a near field communication (NFC) interface, and/or a PCI express interface.
In the illustrated example, one or more input devices 822 are connected to the interface circuit 820. The input device(s) 822 permit(s) a user to enter data and/or commands into the processor 812. The input device(s) can be implemented by, for example, an audio sensor, a microphone, a camera (still or video), a keyboard, a button, a mouse, a touchscreen, a track-pad, a trackball, isopoint and/or a voice recognition system.
One or more output devices 824 are also connected to the interface circuit 820 of the illustrated example. The output devices 824 can be implemented, for example, by display devices (e.g., a light emitting diode (LED), an organic light emitting diode (OLED), a liquid crystal display (LCD), a cathode ray tube display (CRT), an in-place switching (IPS) display, a touchscreen, etc.), a tactile output device, a printer and/or speaker. The interface circuit 820 of the illustrated example, thus, typically includes a graphics driver card, a graphics driver chip and/or a graphics driver processor.
The interface circuit 820 of the illustrated example also includes a communication device such as a transmitter, a receiver, a transceiver, a modem, a residential gateway, a wireless access point, and/or a network interface to facilitate exchange of data with external machines (e.g., computing devices of any kind) via a network 105. The communication can be via, for example, an Ethernet connection, a digital subscriber line (DSL) connection, a telephone line connection, a coaxial cable system, a satellite system, a line-of-sight wireless system, a cellular telephone system, etc.
The example processor platform 800 of the illustrated example also includes one or more mass storage devices 828 for storing software and/or data. Examples of such mass storage devices 828 include floppy disk drives, hard drive disks, compact disk drives, Blu-ray disk drives, redundant array of independent disks (RAID) systems, and digital versatile disk (DVD) drives.
The coded instructions 832 of
A block diagram illustrating an example software distribution platform 905 to distribute software such as the example computer readable instructions 832 of
From the foregoing, it will be appreciated that example methods, apparatus and articles of manufacture have been disclosed that allow device manufacturers and/or providers to collect improved telemetry data, enabling them to better understand operational conditions that occur in the field. Continuously collecting both primary and secondary telemetry data would consume substantial system resources on telemetry collection. Thus, the selective collection of secondary telemetry data allows system resources to be focused on productive work as opposed to being used extensively on telemetry collection, reducing the resources necessary to obtain and report the telemetry data. Additionally, these device manufacturers and/or providers can use this better understanding to create products that minimize the impact of operational conditions in the field, which results in enhanced performance of overall systems. Furthermore, this tool can be licensed and implemented on several platforms, allowing licensed companies to better understand operational conditions that occur on their computing devices. The disclosed methods, apparatus and articles of manufacture allow for the complex environment seen in the field to be recreated in a lab setting for further root cause analysis. The resulting analysis gives manufacturers/providers an improved understanding of operational conditions in the field, thus increasing the efficiency and scalability of future technologies. The disclosed methods, apparatus and articles of manufacture are accordingly directed to one or more improvement(s) in the functioning of a computer.
Example 1 includes an apparatus to perform telemetry monitoring. The apparatus of example 1 includes a telemetry collector to: collect a first set of telemetry data to form a telemetry data timeline associated with a computing device, the first set of telemetry data collected based on a policy file, and output a trigger indicative of an operational condition specified in the policy file. The apparatus of example 1 also includes an actuator to collect a second set of telemetry data associated with the computing device in response to the trigger, the second set of telemetry data collected based on the policy file. The apparatus of example 1 further includes a data reporter to report the telemetry data timeline and the second set of telemetry data to a server in response to a request.
Example 2 includes the apparatus of example 1, wherein the second set of telemetry data includes at least one of (a) central processing unit (CPU) hot-spot analysis data, (b) graphics processing unit (GPU) hot-spot analysis data, or (c) system-wide micro-architectural analysis data.
Example 3 includes the apparatus of example 1 or example 2, and further includes an annotator to annotate the telemetry data timeline with an annotation to indicate a time at which the operational condition was detected.
Example 4 includes the apparatus of any of examples 1 to 3, wherein the request is generated via a user input of the computing device.
Example 5 includes the apparatus of any of examples 1 to 3, wherein the request is generated automatically by the computing device.
Example 6 includes the apparatus of any of examples 1 to 5, wherein the operational condition is a first operational condition, the trigger is a first trigger, the telemetry collector is to output a second trigger indicative of a second operational condition that coincides with the first operational condition, and the actuator is to collect a third set of telemetry data associated with the computing device in response to the second trigger, the actuator to collect the second set of telemetry data and the third set of telemetry data based on an order specified in the policy file.
Example 7 includes the apparatus of any of examples 1 to 6, wherein the policy file is a first policy file, and further including a policy file updater to retrieve a second policy file from a server, and replace the first policy file with the second policy file.
Example 8 includes at least one non-transitory computer readable medium comprising instructions, which, when executed, cause at least one processor to at least: (i) collect a first set of telemetry data to form a telemetry data timeline associated with a computing device, the first set of telemetry data collected based on a policy file, (ii) generate a trigger indicative of an operational condition specified in the policy file, (iii) collect a second set of telemetry data associated with the computing device in response to the trigger, the second set of telemetry data collected based on the policy file, and (iv) report the telemetry data timeline and the second set of telemetry data to a server in response to a request.
Example 9 includes the at least one non-transitory computer readable medium of example 8, wherein the second set of telemetry data includes at least one of (a) central processing unit (CPU) hot-spot analysis data, (b) graphics processing unit (GPU) hot-spot analysis data, or (c) system-wide micro-architectural analysis data.
Example 10 includes the at least one non-transitory computer readable medium of example 8 or example 9, wherein the instructions, when executed, cause the at least one processor to annotate the telemetry data timeline with an annotation to indicate a time at which the operational condition was detected.
Example 11 includes the at least one non-transitory computer readable medium of any of examples 8 to 10, wherein the request is generated via a user input of the computing device.
Example 12 includes the at least one non-transitory computer readable medium of any of examples 8 to 10, wherein the request is generated automatically by the computing device.
Example 13 includes the at least one non-transitory computer readable medium of any of examples 8 to 12, wherein the operational condition is a first operational condition, the trigger is a first trigger, and the instructions, when executed, cause the at least one processor to: (i) generate a second trigger indicative of a second operational condition that coincides with the first operational condition, and (ii) collect a third set of telemetry data associated with the computing device in response to the second trigger, the collection of the second and third sets of telemetry data based on an order specified in the policy file.
Example 14 includes the at least one non-transitory computer readable medium of any of examples 8 to 13, wherein the policy file is a first policy file, and the instructions, when executed, cause the at least one processor to: (i) retrieve a second policy file from a server, and (ii) replace the first policy file with the second policy file.
Example 15 is a method that includes collecting, by executing an instruction with at least one processor, a first set of telemetry data to form a telemetry data timeline associated with a computing device, the first set of telemetry data collected based on a policy file. The method of example 15 also includes generating a trigger indicative of an operational condition specified in the policy file. The method of example 15 further includes collecting, by executing an instruction with the at least one processor, a second set of telemetry data associated with the computing device in response to the trigger, the second set of telemetry data collected based on the policy file. The method of example 15 also includes reporting the telemetry data timeline and the second set of telemetry data to a server in response to a request.
Example 16 includes the method of example 15, wherein the second set of telemetry data includes at least one of (a) central processing unit (CPU) hot-spot analysis data, (b) graphics processing unit (GPU) hot-spot analysis data, or (c) system-wide micro-architectural analysis data.
Example 17 includes the method of example 15 or example 16, and further includes annotating the telemetry data timeline with an annotation indicating a time at which the operational condition was detected.
Example 18 includes the method of any of examples 15 to 17, wherein the request is generated at least one of (a) via a user input of the computing device, or (b) automatically by the computing device.
Example 19 includes the method of any of examples 15 to 18, wherein the operational condition is a first operational condition, the trigger is a first trigger, and further includes generating a second trigger indicative of a second operational condition that coincides with the first operational condition, and collecting a third set of telemetry data associated with the computing device in response to the second trigger, the collection of the second and third sets of telemetry data based on an order specified in the policy file.
Example 20 includes the method of any of examples 15 to 19, wherein the policy file is a first policy file, and further includes retrieving a second policy file from a server, and replacing the first policy file with the second policy file.
Although certain example methods, apparatus and articles of manufacture have been disclosed herein, the scope of coverage of this patent is not limited thereto. On the contrary, this patent covers all methods, apparatus and articles of manufacture fairly falling within the scope of the claims of this patent.
The following claims are hereby incorporated into this Detailed Description by this reference, with each claim standing on its own as a separate embodiment of the present disclosure.