This disclosure relates in general to the field of computer processing, and more particularly, though not exclusively, to runtime processor optimizations.
The demand for high-performance and power-efficient computer processors is continuously increasing. Existing processor architectures, however, are unable to efficiently adapt to actual workload patterns encountered at runtime, thus limiting their ability to be dynamically optimized to achieve maximum performance and/or power efficiency.
The present disclosure is best understood from the following detailed description when read with the accompanying figures. It is emphasized that, in accordance with the standard practice in the industry, various features are not necessarily drawn to scale, and are used for illustration purposes only. Where a scale is shown, explicitly or implicitly, it provides only one illustrative example. In other embodiments, the dimensions of the various features may be arbitrarily increased or reduced for clarity of discussion.
The following disclosure provides many different embodiments, or examples, for implementing different features of the present disclosure. Specific examples of components and arrangements are described below to simplify the present disclosure. These are, of course, merely examples and are not intended to be limiting. Further, the present disclosure may repeat reference numerals and/or letters in the various examples. This repetition is for the purpose of simplicity and clarity and does not in itself dictate a relationship between the various embodiments and/or configurations discussed. Different embodiments may have different advantages, and no particular advantage is necessarily required of any embodiment.
Example embodiments of this disclosure will now be described with more particular reference to the attached FIGURES.
The various components in the illustrated example of computing system 100 will now be discussed further below.
Edge devices 110 may include any equipment and/or devices deployed or connected near the “edge” of a communication system 100. In the illustrated embodiment, edge devices 110 include end-user devices 112 (e.g., desktops, laptops, mobile devices), Internet-of-Things (IoT) devices 114, and gateways and/or routers 116, among other examples. Edge devices 110 may communicate with each other and/or with other remote networks and services (e.g., cloud services 120) through one or more networks and/or communication protocols, such as communication network 150. Moreover, in some embodiments, certain edge devices 110 may include the processor optimization functionality described throughout this disclosure.
End-user devices 112 may include any device that enables or facilitates user interaction with computing system 100, including, for example, desktop computers, laptops, tablets, mobile phones and other mobile devices, and wearable devices (e.g., smart watches, smart glasses, headsets), among other examples.
IoT devices 114 may include any device capable of communicating and/or participating in an Internet-of-Things (IoT) system or network. IoT systems may refer to new or improved ad-hoc systems and networks composed of multiple different devices (e.g., IoT devices 114) interoperating and synergizing for a particular application or use case. Such ad-hoc systems are emerging as more and more products and equipment evolve to become “smart,” meaning they are controlled or monitored by computer processors and are capable of communicating with other devices. For example, an IoT device 114 may include a computer processor and/or communication interface to allow interoperation with other components of system 100, such as with cloud services 120 and/or other edge devices 110. IoT devices 114 may be “greenfield” devices that are developed with IoT capabilities from the ground-up, or “brownfield” devices that are created by integrating IoT capabilities into existing legacy devices that were initially developed without IoT capabilities. For example, in some cases, IoT devices 114 may be built from sensors and communication modules integrated in or attached to “things,” such as equipment, toys, tools, vehicles, living things (e.g., plants, animals, humans), and so forth. Alternatively, or additionally, certain IoT devices 114 may rely on intermediary components, such as edge gateways or routers 116, to communicate with the various components of system 100.
IoT devices 114 may include various types of sensors for monitoring, detecting, measuring, and generating sensor data and signals associated with characteristics of their environment. For instance, a given sensor may be configured to detect one or more respective characteristics, such as movement, weight, physical contact, temperature, wind, noise, light, position, humidity, radiation, liquid, specific chemical compounds, battery life, wireless signals, computer communications, and bandwidth, among other examples. Sensors can include physical sensors (e.g., physical monitoring components) and virtual sensors (e.g., software-based monitoring components). IoT devices 114 may also include actuators to perform various actions in their respective environments. For example, an actuator may be used to selectively activate certain functionality, such as toggling the power or operation of a security system (e.g., alarm, camera, locks) or household appliance (e.g., audio system, lighting, HVAC appliances, garage doors), among other examples.
Indeed, this disclosure contemplates use of a potentially limitless universe of IoT devices 114 and associated sensors/actuators. IoT devices 114 may include, for example, any type of equipment and/or devices associated with any type of system 100 and/or industry, including transportation (e.g., automobile, airlines), industrial manufacturing, energy (e.g., power plants), telecommunications (e.g., Internet, cellular, and television service providers), medical (e.g., healthcare, pharmaceutical), food processing, and/or retail industries, among others. In the transportation industry, for example, IoT devices 114 may include equipment and devices associated with aircraft, automobiles, or vessels, such as navigation systems, autonomous flight or driving systems, traffic sensors and controllers, and/or any internal mechanical or electrical components that are monitored by sensors (e.g., engines). IoT devices 114 may also include equipment, devices, and/or infrastructure associated with industrial manufacturing and production, shipping (e.g., cargo tracking), communications networks (e.g., gateways, routers, servers, cellular towers), server farms, electrical power plants, wind farms, oil and gas pipelines, water treatment and distribution, wastewater collection and treatment, and weather monitoring (e.g., temperature, wind, and humidity sensors), among other examples. IoT devices 114 may also include, for example, any type of “smart” device or system, such as smart entertainment systems (e.g., televisions, audio systems, videogame systems), smart household or office appliances (e.g., heat-ventilation-air-conditioning (HVAC) appliances, refrigerators, washers and dryers, coffee brewers), power control systems (e.g., automatic electricity, light, and HVAC controls), security systems (e.g., alarms, locks, cameras, motion detectors, fingerprint scanners, facial recognition systems), and other home automation systems, among other examples. IoT devices 114 can be statically located, such as mounted on a building, wall, floor, ground, lamppost, sign, water tower, or any other fixed or static structure. IoT devices 114 can also be mobile, such as devices in vehicles or aircraft, drones, packages (e.g., for tracking cargo), mobile devices, and wearable devices, among other examples. Moreover, an IoT device 114 can also be any type of edge device 110, including end-user devices 112 and edge gateways and routers 116.
Edge gateways and/or routers 116 may be used to facilitate communication to and from edge devices 110. For example, gateways 116 may provide communication capabilities to existing legacy devices that were initially developed without any such capabilities (e.g., “brownfield” IoT devices). Gateways 116 can also be utilized to extend the geographical reach of edge devices 110 with short-range, proprietary, or otherwise limited communication capabilities, such as IoT devices 114 with Bluetooth or ZigBee communication capabilities. For example, gateways 116 can serve as intermediaries between IoT devices 114 and remote networks or services, by providing a front-haul to the IoT devices 114 using their native communication capabilities (e.g., Bluetooth, ZigBee), and providing a back-haul to other networks 150 and/or cloud services 120 using another wired or wireless communication medium (e.g., Ethernet, Wi-Fi, cellular). In some embodiments, a gateway 116 may be implemented by a dedicated gateway device, or by a general purpose device, such as another IoT device 114, end-user device 112, or other type of edge device 110.
In some instances, gateways 116 may also implement certain network management and/or application functionality (e.g., IoT management and/or IoT application functionality for IoT devices 114), either separately or in conjunction with other components, such as cloud services 120 and/or other edge devices 110. For example, in some embodiments, configuration parameters and/or application logic may be pushed or pulled to or from a gateway device 116, allowing IoT devices 114 (or other edge devices 110) within range or proximity of the gateway 116 to be configured for a particular IoT application or use case.
Cloud services 120 may include services that are hosted remotely over a network 150, or in the “cloud.” In some embodiments, for example, cloud services 120 may be remotely hosted on servers in a datacenter (e.g., application servers or database servers). Cloud services 120 may include any services that can be utilized by or for edge devices 110, including but not limited to, data storage, computational services (e.g., data analytics, searching, diagnostics and fault management), security services (e.g., surveillance, alarms, user authentication), mapping and navigation, geolocation services, network or infrastructure management, IoT application and management services, payment processing, audio and video streaming, messaging, social networking, news, and weather, among other examples. Moreover, in some embodiments, certain cloud services 120 may include the processor optimization functionality described throughout this disclosure.
Network 150 may be used to facilitate communication between the components of computing system 100. For example, edge devices 110, such as end-user devices 112 and IoT devices 114, may use network 150 to communicate with each other and/or access one or more remote cloud services 120. Network 150 may include any number or type of communication networks, including, for example, local area networks, wide area networks, public networks, the Internet, cellular networks, Wi-Fi networks, short-range networks (e.g., Bluetooth or ZigBee), and/or any other wired or wireless networks or communication mediums.
Any, all, or some of the computing devices of system 100 may be adapted to execute any operating system, including Linux or other UNIX-based operating systems, Microsoft Windows, Windows Server, MacOS, Apple iOS, Google Android, or any customized and/or proprietary operating system, along with virtual machines adapted to virtualize execution of a particular operating system.
On-Chip Processor Optimization
The primary obstacle to performing effective processor optimizations at runtime is accurately and reliably recognizing the different patterns or phases of the processing workloads encountered by a processor. Efficient and reliable workload phase recognition is crucial to building flexible processor architectures that can adapt on-the-fly in response to real-world circumstances and user needs. The embodiments described below provide such efficient and reliable on-chip phase recognition.
Processor optimization unit 200 analyzes processor workloads in real-time to recognize and learn workload phases and adapt to real-world data variations at runtime. In some embodiments, for example, on-chip machine learning may be used to learn and recognize the signatures associated with different workload phases, enabling consistent and stable phase recognition even in unanticipated runtime conditions. Processor optimization unit 200 provides reliable phase recognition using various machine learning and statistical techniques, such as soft-thresholding, convolution, and/or chi-squared error models, as discussed further below. These statistical techniques are applied to streams of real-time performance event counters, enabling stable phase recognition across both fine-grained time scales of tens-of-thousands of instructions, and coarse-grained time scales of tens-of-millions of instructions. In this manner, a processor can be optimized or adapted based on the specific workload phases that are encountered, for example, by adjusting processor voltage to improve power efficiency, adjusting the width of the execution pipeline during periods with systematically poor speculation, tailoring branch prediction, cache pre-fetch, and/or scheduling units based on identified program characteristics and patterns, and so forth.
In order to adapt a processor to recurring patterns of a program state at a fine-grained scale, learning and recognition of workload phases must be reliably performed on-chip in a manner that is robust to unexpected runtime conditions. The embodiments described throughout this disclosure address various obstacles facing reliable on-chip workload phase recognition at runtime. First, small noisy variations in workload patterns (e.g., variations in architecture-level event counters) are amplified at short time-scales relative to the program-driven patterns that must be recognized. Next, small changes in the timing of pattern recurrence can cause unstable local recognition (e.g., oscillations) when recognition is applied to streaming processor event counter data. Finally, programs may produce data at runtime that was neither anticipated at design time nor captured during offline analysis, leading to unexpected phase recognition results and potentially poor adaptation decisions. To address these obstacles, various machine learning and statistical techniques can be implemented on-chip to model the event counter data, such as soft-thresholding to filter noise, convolution to provide invariance to small temporal shifts, and a chi-squared probability model to address out-of-set data detection.
The illustrated embodiment balances various tradeoffs in order to achieve reliable workload phase recognition even for noisy streaming workload data. For example, with respect to the universe of possible workload phases, the workload phases for which architectural optimizations are being targeted must be accurately recognized on-chip, while also guaranteeing accurate negative recognition of all other workload phases. Moreover, immediate and stable phase recognition must be achieved even without the flexibility to roll results up into summary statistics over large volumes of data for offline analysis. Accordingly, the illustrated embodiment is designed to tolerate widely varying workload data without requiring prior training on a comprehensive dataset, coarse summary statistics, or offline computations.
For example, soft-thresholding can be used to implement a local rule for reducing small noise variations to a tolerable level, without needing to individually tailor or adjust the noise filtering threshold for different workloads. Moreover, convolutional pattern matching facilitates shift invariance in order to stabilize phase recognition within local windows of event counter data. Finally, chi-squared testing can then be used to recognize unexpected workload phases or program states based on a probability model of both the bias and magnitude of errors between new and previously recognized workload signatures.
In this manner, real-time learning and recognition of workload phases can be performed reliably without any tailored or manual parameter adjustments (e.g., per-workload parameter tuning, post-processing, or smoothing), which is a mandatory constraint for on-chip optimizations. This is accomplished by analyzing the distribution of differences in event counters between real-time workload data and known (e.g., previously recognized) workload signatures. This approach aligns closely with real-world workload patterns, as the differences in event counter values from one workload snapshot to the next often have a normal or Gaussian distribution, even though the actual workload event counts do not. Accordingly, this approach is more robust than other workload recognition approaches, such as those that simply employ a threshold associated with the magnitude of the differences in event counts.
In the illustrated embodiment, processor optimization unit 200 includes functionality for event monitoring 210, phase recognition 220, and runtime optimization 230. Event monitoring 210 is used to track, aggregate, and filter various performance-related event counters for each processing workload. Phase recognition 220 is then used to recognize or learn the phase of a particular workload based on the processed event counter data obtained during the event monitoring 210 stage. Runtime optimization 230 is then used to perform the appropriate processor optimizations based on the particular workload phase that is recognized using phase recognition 220.
First, various performance-related event counters 214 are tracked for each processing workload snapshot. The event counters 214 can include any operational or performance aspects tracked by a processor, such as the number of branch prediction hits and misses, the number of cache hits and misses, the number of loads from memory, the amount of data transmitted internally within a processor, and the number of instructions issued to different parts of the instruction pipeline, among other examples. Moreover, these event counters 214 are tracked and processed separately for each processing workload snapshot. For example, a workload snapshot may be a configurable number of processor instructions (represented as t_recognition instructions), such as 10,000 processor instructions. Accordingly, event counters 214 are tracked for each workload snapshot based on the defined workload size.
The event counters 214 associated with the current processing workload snapshot are first aggregated into an event vector 215. The event counter data in event vector 215 is then processed and/or filtered to reduce noise. In some embodiments, for example, “soft-thresholding” may be used to reduce the noise to a tolerable level. For example, using soft-thresholding, any event counters in event vector 215 whose values are below a particular threshold (θ_noise) may be truncated to 0. The particular threshold (θ_noise) used for soft-thresholding may be varied to control the degree of noise reduction applied to the event counter data.
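For purposes of illustration, the soft-thresholding step described above may be sketched in software as follows. This is a minimal sketch assuming a NumPy representation of the event vector; in the disclosed embodiments the equivalent filtering would be performed by on-chip logic rather than software, and the function and variable names are illustrative assumptions.

```python
import numpy as np

def soft_threshold(event_vector: np.ndarray, theta_noise: int = 32) -> np.ndarray:
    """Truncate statistically unstable counter values to zero.

    Any event counter whose value falls below theta_noise is set to 0;
    larger values pass through unchanged.
    """
    filtered = event_vector.copy()
    filtered[filtered < theta_noise] = 0
    return filtered

# Example: one snapshot of eight event counters with theta_noise = 32
snapshot = np.array([5, 120, 31, 0, 64, 7, 3200, 33])
print(soft_threshold(snapshot))  # counters below 32 are truncated to 0
```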
After noise reduction is performed, the event vector 215 for the current workload may then be stored in an event buffer 216. In some embodiments, for example, an event buffer 216 may be used to store the event vectors for a configurable number of recent workload snapshots (defined by the workload window size, w_recognition). For example, if the workload window size is defined to be three workload snapshots (w_recognition=3), the event buffer 216 will maintain event vectors 218a-c for the three most recent workload snapshots (e.g., the current workload and the two preceding workloads). Phase recognition can then be performed using the event vectors 218 associated with the current processing window, as described further below in connection with phase recognition functionality 220.
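Continuing the illustrative sketch above, the event buffer may be modeled as a fixed-length queue holding the w_recognition most recent (soft-thresholded) event vectors; the names below are illustrative assumptions rather than actual hardware interfaces.

```python
from collections import deque
import numpy as np

W_RECOGNITION = 3  # workload window size (number of recent snapshots)

# Event buffer 216: the soft-thresholded event vectors for the most
# recent W_RECOGNITION workload snapshots.
event_buffer = deque(maxlen=W_RECOGNITION)

def on_snapshot_complete(raw_counters: np.ndarray) -> None:
    """Called once per workload snapshot (i.e., every t_recognition instructions)."""
    event_buffer.append(soft_threshold(raw_counters))  # sketched above
    if len(event_buffer) == W_RECOGNITION:
        window = list(event_buffer)  # the current workload window 217
        # ...hand `window` to the phase recognition stage sketched below...
```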
In some embodiments, the various parameters used for monitoring and processing events may be configurable, including the number and type of event counters (t_counter), the noise reduction threshold (θ_noise), the size of a workload snapshot (t_recognition), and the size of the current workload window (w_recognition).
For example, the number and type of event counters tracked for phase recognition purposes (represented as t_counter total counters) may be adjusted to control the accuracy and/or speed of the phase recognition. Tracking a larger number of event counters may result in more accurate phase recognition, but may require more processing time. In some embodiments, for example, phase recognition may be performed using 600 or more event counters (e.g., t_counter=600), while other embodiments may track a reduced set of event counters while still achieving good phase recognition performance, such as 60 event counters (e.g., t_counter=60) or even as few as 20 event counters (e.g., t_counter=20).
As another example, the noise reduction threshold (θ_noise) used for soft-thresholding may be varied to control the degree of noise reduction applied to the event counter data for a particular workload. Larger threshold values may filter more noise and thus may result in more accurate phase recognition, whereas smaller threshold values may admit more noise and thus may result in diminished phase recognition performance. In some embodiments, performing soft-thresholding using a threshold value of at least 32 (θ_noise=32) may be sufficient to filter event counter values that are statistically unstable. For example, if soft-thresholding is performed using a noise threshold of 32 (θ_noise=32), any event counters in event vector 215 with values below 32 would be truncated to 0.
Finally, the size of a workload (t_recognition) can be adjusted to control the minimum detectable phase size. Moreover, the size of the current workload window (w_recognition) can be adjusted to control the sensitivity for recognizing changes in phase. For example, a larger workload window may result in slower but more accurate reactions to phase changes, while a smaller workload window may result in faster but less accurate reactions to phase changes.
In the illustrated embodiment, phase recognition is performed using a nearest neighbor lookup technique based on convolutional chi-squared testing. Since phases may contain natural patterns that last longer than the size of a workload snapshot (t_recognition) (e.g., longer than 10,000 instructions), a known phase is represented by a phase signature comprised of back-to-back event vectors or histograms. Each phase signature is comprised of a configurable number of histograms (w_signature), such as 3 histograms per signature. The number of histograms (w_signature) in each phase signature can be chosen to encompass the maximum expected duration of recurring patterns within any given phase. Representing phase signatures using a large number of histograms may result in coarse phase definitions that encompass multiple microarchitecture states, while using a small number of histograms may produce fine-grained phase definitions that repeat back-to-back. In some embodiments or configurations, the number of histograms in a phase signature may mirror the size of the workload processing window (e.g., w_signature=w_recognition).
Phase recognition can be performed by comparing the current workload window 217 to a library of known phases 221. For example, in the illustrated embodiment, convolutional chi-squared comparisons are used to compare the current workload window 217 to each known phase 221. For example, in order to compare the current workload window 217 to a particular known phase 221, each event vector 218 in the current workload window 217 is compared with each histogram 223 in the particular signature 222. This results in a number of comparisons equal to the workload window size multiplied by the number of histograms in the phase signature (e.g., number of comparisons = w_recognition × w_signature). Moreover, each comparison can be performed by computing the chi-squared distance between a particular event vector 218 and a particular phase signature histogram 223. These calculations are performed for each event vector 218 and each histogram 223 of each known phase 221. The results of these chi-squared calculations are then filtered to identify the known phase with the closest matching score. This process provides shift invariance by choosing the strongest match within the w_recognition window of most recent workload snapshots, against any of the w_signature phase signature histograms, regardless of order.
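As an illustrative software sketch of this convolutional nearest-neighbor lookup, the phase library may be modeled as a mapping from phase labels to lists of signature histograms; the phase_match_probability helper is an assumption here and is sketched after the probability discussion below.

```python
def best_phase_match(window, phase_library):
    """Convolutional nearest-neighbor lookup over the library of known phases.

    window        -- list of w_recognition recent event vectors (the current window)
    phase_library -- dict mapping phase label -> list of w_signature histograms

    Every event vector is compared against every histogram of every known phase
    (w_recognition * w_signature comparisons per phase); the lowest probability
    (strongest match) per phase is kept, and the phase with the overall lowest
    probability is returned.
    """
    best_label, best_p = None, float("inf")
    for label, signature in phase_library.items():
        p_phase = min(
            phase_match_probability(vec, hist)  # helper sketched below
            for vec in window
            for hist in signature
        )
        if p_phase < best_p:
            best_label, best_p = label, p_phase
    return best_label, best_p
```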
Using chi-squared calculations to perform these phase comparisons is based on a straightforward assumption about events during a phase: although the actual event counts may fluctuate, the differences in event counts from one workload snapshot to the next should be normally distributed. Extreme fluctuations are evidence that the workload has entered a different phase. Accordingly, a chi-squared test statistic is computed as the squared sum of differences between the current phase signature histogram u and recently measured data v, scaled by the variance of differences for that event, as illustrated by the following formula:
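X² = Σ_i ((u_i − v_i) − μ_(u−v),i)² / σ²_(u−v),i

(One possible form consistent with the description above; the sum runs over the event counters i that remain non-zero after soft-thresholding.)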
In the above formula, μ_(u−v) represents the average difference between two workload snapshots of each counter, and σ²_(u−v) represents the variance of the differences between subsequent snapshots of each event type. These parameters are computed in advance and are fixed for all workloads. Finally, the probability that two event vectors represent a different phase can be determined by comparing the computed test statistic to the chi-squared distribution using a probability lookup table. For example, the lookup can be performed using a chi-squared cumulative distribution function (CDF), as illustrated below, where X² represents the computed test statistic and k represents the number of non-zero counter values that remain after soft-thresholding is performed:
p = chi-squared_CDF(X², k − 1)
The computed probability p represents the likelihood that two event vectors represent a different phase. Accordingly, a phase match is identified when p is below a certain threshold (e.g., below 0.5). However, if the current processing window does not match any known phase signatures within that threshold, then it is determined that a new phase has been identified, and thus a new phase label is assigned.
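The following sketch illustrates one possible software rendering of this chi-squared matching and new-phase decision, using the SciPy chi-squared CDF in place of the on-chip probability lookup table. The precomputed mu_diff and var_diff arrays, the treatment of the non-zero counters, and the helper names are illustrative assumptions.

```python
import numpy as np
from scipy.stats import chi2

# Precomputed mean and variance of the snapshot-to-snapshot difference for each
# event counter (fixed for all workloads); the values below are placeholders.
mu_diff = np.zeros(8)
var_diff = np.full(8, 100.0)

P_MATCH_THRESHOLD = 0.5  # a phase match when p falls below this value

def phase_match_probability(v: np.ndarray, u: np.ndarray) -> float:
    """Probability that event vector v and signature histogram u belong to
    different phases, per the chi-squared model described above."""
    nonzero = (u != 0) | (v != 0)            # counters surviving soft-thresholding
    k = int(nonzero.sum())
    if k < 2:
        return 1.0                            # too little evidence to claim a match
    diff = (u[nonzero] - v[nonzero]) - mu_diff[nonzero]
    x2 = float(np.sum(diff ** 2 / var_diff[nonzero]))
    return float(chi2.cdf(x2, df=k - 1))      # on-chip: a probability lookup table

def recognize_phase(window, phase_library):
    """Recognize a known phase for the current window, or learn a new one."""
    if phase_library:
        label, p = best_phase_match(window, phase_library)  # sketched above
        if p < P_MATCH_THRESHOLD:
            return label                      # known phase recognized
    new_label = "phase_%d" % len(phase_library)
    phase_library[new_label] = list(window)   # use the current window as signature
    return new_label
```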
In the illustrated embodiment, each chi-squared comparison 224 is performed using an arithmetic unit 225, accumulator 226, and probability lookup table 227. For example, the chi-squared test statistic identified above (X2) is calculated using arithmetic unit 225 and accumulator 226. Arithmetic unit 225 performs arithmetic on each pair of event counters in the current phase histogram (u) and the recent event vector data (v), while accumulator 226 sums the results. The resulting chi-squared test statistic is then converted into a corresponding probability using probability lookup table 227. A probability is determined in this manner for each histogram 223 in the signature 222 of a known phase 221. The probability that indicates the best match 228 is then output as the probability associated with the particular known phase 221. Once a probability has been determined in this manner for each known phase, the probabilities of the known phases are then compared to identify the known phase with the best match 229.
Finally, phase recognition must be performed efficiently in order to avoid any delay or latency in determining when a transition to a new phase has occurred. Assuming a workload snapshot size of t_recognition=10,000 instructions and a maximum instructions-per-clock (IPC) of 7.0, phase recognition must be performed in approximately 1500 clock cycles. There are two primary sources of latency associated with the described embodiment of phase recognition: event monitoring and phase matching. With respect to event monitoring, since no preprocessing of event counter vectors is required other than soft-thresholding, the latency is simply the time required to route t_counter event counter values to the phase recognition unit 220, resulting in a fixed delay. With respect to phase matching, the phase recognition approach described above requires w_recognition × w_signature chi-squared matching operations per known phase, where each matching operation is composed of parallel arithmetic operations on t_counter event counters followed by a probability table lookup. To provide an example of the phase recognition latency, assuming 16 known phases have been recognized, the workload window size and the phase signature histogram size are each set to 5 (w_recognition=w_signature=5), the number of event counters is 20 (t_counter=20), and the match computation time is 10 cycles, recognizing a phase requires a baseline of 800 cycles (e.g., 10 cycles × 16 known phases × 5 phase signature histograms). Moreover, because the phase matching operations are data parallel, the convolutional matching performed against each histogram of a known phase can be performed in parallel, further reducing the overall recognition latency.
The flowchart may begin at block 402 by collecting performance data for the current processing workload. For example, in some embodiments, various performance-related event counters may be tracked for the current processing workload. The event counters can include any operational or performance aspects tracked by a processor, including the number of branch prediction hits and misses, the number of cache hits and misses, the number of loads from memory, the amount of data transmitted internally within a processor, and the number of instructions issued to different parts of the instruction pipeline, among other examples. Moreover, in some embodiments, these event counters may be tracked and processed separately for workload snapshots of a defined size (e.g., 10,000 instructions).
The flowchart may then proceed to block 404 to filter the performance data to reduce noise. In some embodiments, for example, “soft-thresholding” may be used to reduce the noise to a tolerable level. For example, using soft-thresholding, any event counters whose values are below a particular threshold (θ_noise) may be truncated to 0. The particular threshold (θ_noise) used for soft-thresholding may be varied to control the degree of noise reduction applied to the event counter data.
The flowchart may then proceed to block 406 to perform phase recognition, for example, by comparing the performance data for the current workload snapshot to a library of known phases. In some embodiments, phase recognition is performed using a nearest neighbor lookup technique based on convolutional chi-squared testing. For example, in order to compare the current workload snapshot to a particular known phase, the event data for the current workload window is compared to a signature for the known phase. The comparisons can be performed by computing the chi-squared distance between event data and a phase signature. The results of these chi-squared calculations are then filtered to identify the known phase with the closest matching score. This process provides shift invariance by choosing the strongest match within a window of recent workload snapshots, against any of the phase signatures, regardless of order.
The flowchart may then proceed to block 408 to determine whether the current workload snapshot matches a known phase. For example, in some embodiments, a match is detected if the probability associated with the closest chi-squared score is below a particular threshold (e.g., below 0.5). If a match is detected, the flowchart proceeds to block 410, where a known phase is recognized. Otherwise, if the current workload snapshot does not match any of the known phases, the flowchart proceeds to block 412, where a new phase is recognized and added to the library of known phases.
The flowchart may then proceed to block 414 to perform runtime optimizations based on the recognized phase. For example, a processor can be optimized or adapted based on the specific workload phases that are encountered, for example, by adjusting processor voltage to improve power efficiency, adjusting the width of the execution pipeline during periods with systematically poor speculation, tailoring branch prediction, cache pre-fetch, and/or scheduling units based on identified program characteristics and patterns, and so forth.
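As a purely hypothetical illustration of this optimization step, a recognized phase label could index a table of processor configuration settings; the knob names and values below are placeholders and do not correspond to any actual hardware interface.

```python
# Hypothetical per-phase tuning table; knob names and values are placeholders.
PHASE_TUNING = {
    "phase_0": {"voltage_mv": 850, "pipeline_width": 4, "prefetch_level": 2},
    "phase_1": {"voltage_mv": 780, "pipeline_width": 2, "prefetch_level": 0},
}
DEFAULT_TUNING = {"voltage_mv": 900, "pipeline_width": 4, "prefetch_level": 1}

def apply_runtime_optimizations(phase_label, write_knob):
    """Program the settings associated with the recognized phase.

    write_knob is a platform-specific callback (also hypothetical) that writes
    a single reconfigurable processor setting.
    """
    for knob, value in PHASE_TUNING.get(phase_label, DEFAULT_TUNING).items():
        write_knob(knob, value)
```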
At this point, the flowchart may be complete. In some embodiments, however, the flowchart may restart and/or certain blocks may be repeated. For example, in some embodiments, the flowchart may restart at block 402 to continue collecting runtime information to optimize the performance of computing devices.
Cloud-Based Processor Optimization
An example embodiment of cloud-based processor optimization 500 is described below.
First, runtime data 502 (e.g., program and/or hardware states) is collected from processors 514 or other chips of user devices 510, and the runtime data 502 is uploaded to a cloud service 520. For example, in some embodiments, an optimization unit 516 of a processor 514 may collect runtime data 502 from certain components 518 of the processor, and the runtime data 502 may then be provided to the cloud service 520. The cloud service 520 then uses the runtime data 502 to perform machine learning at data-center scale to recognize workload patterns and derive optimization-related metadata 504 for the user devices 510. For example, in some embodiments, the cloud service 520 may derive optimization-related metadata 504 using branch modeling 521, data access modeling 522, and/or phase identification 523. The cloud service 520 then distributes the optimization metadata 504 to the user devices 510, which the user devices 510 then use to perform appropriate runtime processor optimizations.
For example, in some embodiments, cloud service 520 may use machine learning to derive runtime hardware optimizations by: (1) collecting trace data from user devices 510 at runtime; (2) analyzing program structure using large-scale data-driven modeling and learning techniques; and (3) returning metadata 504 to the user devices 510 that can be used to adjust reconfigurable processor components 518 or other hardware. In this manner, processors and other hardware can be tailored to user applications 511 at runtime, providing improved flexibility and performance over approaches that only allow similar tuning to be performed during the development stage (e.g., profile-guided optimization techniques).
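For illustration, the client side of this round trip might be sketched as follows; the endpoint URL, payload layout, and write_knob callback are illustrative assumptions and not part of any actual service interface.

```python
import json
from urllib import request

CLOUD_ENDPOINT = "https://cloud-optimizer.example.invalid/optimize"  # placeholder

def upload_runtime_data(trace_records):
    """Upload collected runtime data 502 and receive optimization metadata 504."""
    payload = json.dumps({"traces": trace_records}).encode()
    req = request.Request(CLOUD_ENDPOINT, data=payload,
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:
        return json.loads(resp.read())

def apply_optimization_metadata(metadata, write_knob):
    """Push cloud-derived settings into the reconfigurable components 518."""
    for component, settings in metadata.get("components", {}).items():
        for knob, value in settings.items():
            write_knob(component, knob, value)  # hypothetical hardware interface
```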
In general, performing “off-chip” modeling and machine learning is ideal for use cases where the delay and data transmission costs of transmitting data off-chip can be amortized by strong long-term performance on a small set of workloads. Example use cases include servers that repetitively execute high-performance workloads and/or devices that accelerate specific binaries as a performance differentiator.
The illustrated cloud-based learning service is designed to drive adjustments and optimizations on an ongoing basis and can be used with any reconfigurable processor component 518, including branch prediction units (BPU), cache pre-fetchers, and schedulers, among other examples. In this manner, processors and other hardware can be tailored to user applications 511 at runtime without requiring changes or access to source code, providing improved flexibility and performance over approaches that only allow similar tuning to be performed during the development stage, such as profile-guided optimization techniques. Moreover, the class of performance optimizations that can be derived by applying machine learning to runtime data is far more extensive than that of profile-guided optimization, which requires representative datasets at design time and realistic recompilation time. In particular, cloud-based computing enables processor optimizations to be derived using sophisticated machine learning techniques (e.g., convolutional neural networks and data dependency tracking) that cannot be implemented “on-chip” by a processor due to performance constraints. Leveraging cloud-based computing to adapt a processor to its workload at runtime can reduce application development time and cost, particularly when building highly-optimized applications. Moreover, cloud-based computing enables processors to be adapted to novel workloads in a manner that is orders-of-magnitude more powerful than on-chip adaptation mechanisms. For example, the limited-scope pattern matching used in on-chip branch predictors is unable to recognize and leverage long-term data-dependency relationships. Similarly, basic stride detection policies used in data pre-fetchers are unable to capture data access patterns over tens-of-thousands of instructions. By contrast, leveraging cloud-based tracing enables identification of long-term predictive relationships between data-dependent branches that are beyond the reach of on-chip learning mechanisms. These relationships can be translated into predictive rules used for performing runtime optimizations and improving processor performance. Finally, the performance of legacy code is still maintained even on new platforms and processors that support cloud-based processor optimization.
Use case 600 illustrates an example of using cloud-based computing to improve branch prediction for a processor, for example, by improving speculation for hard-to-predict branches. As explained further below, various runtime information associated with the processor (e.g., instruction, register, and memory data) is mined during execution of an application, and data-dependency tracking is then leveraged to derive custom prediction rules for hard-to-predict branches. For example, if a hard-to-predict branch is identified in the application, a snippet of the application preceding the hard-to-predict branch (e.g., the retired instructions and any registers or memory addresses that were accessed) is recorded and analyzed to identify relationships between data-dependent execution branches. The identified relationships can then be used, for example, to build custom prediction rules to improve speculation for a critical application on a customer machine.
The data-dependency analysis used for discovering relationships among branches is implemented using backward and forward search procedures. A backward search can be performed using information associated with a hard-to-predict branch. For example, a backward search can be instantiated using a starting point in a trace (e.g., the hard-to-predict branch), a minimum lookback window for terminating the search, and a storage location or data value of interest to be tracked (e.g., a data value used in the branch condition). The lookback window that precedes the specified starting point is then searched to identify the instruction pointer and position of the most recent instruction that modifies the tracked data value, along with any operands used in the modification. If a corresponding instruction within the lookback window is identified, the procedure recursively calls additional backward searches for each operand used in the modification.
A forward search can be performed using a starting point in a trace, a maximum look-ahead window for terminating the search, and a tracked data value known to be unmodified in the identified trace window. The look-ahead window that follows the specified starting point corresponds to a “stable” period in which the tracked data value is not modified. The stable period is searched to identify peer branches whose conditions check the tracked data value. For example, the forward search procedure first enumerates all conditional branches within the stable period, and then triggers a backward search for each conditional branch using search limits defined by the branch position and the original starting point of the forward search. The forward search then flags any branch whose backward search reveals the tracked data value to be a contributor to the branch condition.
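To make the search procedures concrete, the following sketch shows one possible software rendering over a simplified trace representation, in which each trace record is a dict with “pos” (equal to its index in the list), “is_branch”, “condition” (the data location tested by a branch), “reads”, and “writes” fields; these field names and the recursion structure are illustrative assumptions rather than a definitive implementation.

```python
def backward_search(trace, start, tracked, min_pos=0, deps=None):
    """Collect the data locations that `tracked` depends on, walking backward
    from position `start` down to `min_pos`, recursing through each operand
    of the most recent instruction that modifies the tracked location."""
    deps = set() if deps is None else deps
    deps.add(tracked)
    for insn in reversed(trace[min_pos:start]):
        if tracked in insn["writes"]:              # most recent modification
            for operand in insn["reads"]:          # recurse on each operand
                backward_search(trace, insn["pos"], operand, min_pos, deps)
            break
    return deps

def forward_search(trace, start, end, tracked):
    """Within the stable period [start, end), flag peer branches whose
    conditions depend on the tracked (unmodified) data location."""
    peers = []
    for insn in trace[start:end]:
        if insn["is_branch"]:
            deps = backward_search(trace, insn["pos"], insn["condition"], min_pos=start)
            if tracked in deps:
                peers.append(insn["pos"])          # interdependent peer branch
    return peers
```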
Accordingly, a backward search can be performed for a hard-to-predict branch in a trace, and forward searches can then be performed for all stable periods identified in its execution path. In this manner, peer branches can be identified whose conditions rely on values that also affect the hard-to-predict branch. Statistically, the directions of the peer branches contain predictive information about the hard-to-predict branch, and thus can be used to train a custom predictor, such as a decision tree. For example, a neural network can be trained for the hard-to-predict branches to determine if any improvements in prediction accuracy can be achieved. First, in the feature identification step, learned weights in the neural network can be used to determine correlated branches or features. These features can then be used to build feature vectors, which are used to train a classification model (e.g., a decision tree) to predict the branch outcome. In some embodiments, the classification model could be implemented using a decision tree, although other approaches can also be used.
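For illustration, once the peer branches have been flagged, a decision-tree predictor could be trained as in the following sketch; the feature and label arrays are placeholders, and scikit-learn is used purely as an example of an off-chip training framework.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Each row is one dynamic occurrence of the hard-to-predict branch; each column
# is the observed direction (taken=1 / not-taken=0) of one flagged peer branch.
peer_directions = np.array([
    [1, 0, 1],
    [0, 1, 1],
    [1, 0, 0],
    [0, 1, 0],
])
branch_outcomes = np.array([1, 0, 1, 0])  # direction of the hard-to-predict branch

predictor = DecisionTreeClassifier(max_depth=3)
predictor.fit(peer_directions, branch_outcomes)

# Predict the target branch from the peer directions observed so far.
print(predictor.predict([[1, 0, 1]]))     # -> [1] ("taken") for this toy data
```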
Use case 600 illustrates an example snippet of instruction trace data 610 collected during execution of an application, which precedes a hard-to-predict branch in the application. The instruction trace data 610 is analyzed by a cloud service using the data dependency analysis described above in order to optimize branch prediction performance. In some cases, a user device executing a particular application may provide the instruction trace data 610 to the cloud service, or alternatively, the cloud service may execute the user application directly to obtain the instruction trace data 610.
In the illustrated example, a hard-to-predict branch is identified at instruction 47 (e.g., a jump zero instruction). Accordingly, in step one 601, a backward search is instantiated using the storage location of the branch condition (e.g., register dl) as the tracked data value, and a minimum lookback window that extends to the beginning of the trace. The backward search is used to identify the most recent modification to register dl and identify any prior dependencies. In the illustrated example, the backward search identifies instruction 33 and determines that memory location 99f80a8 is a prior dependency. At step two 602, a forward search is performed to enumerate branches found in the stable period between instructions 33 and 47, and branches are found at instructions 34, 39, and 44. At step three 603, local backward searches are performed to determine the dependencies of each branch in the stable period identified by the forward search (e.g., the branches at instructions 34, 39, and 44), and the results are checked for overlap with register dl. In this case, the original hard-to-predict branch at instruction 47 and the branch at instruction 34 are found to have interdependent conditions. Accordingly, the direction of the peer branch at instruction 34 can be used as predictive information for the hard-to-predict branch, and can be used to train a custom predictor to improve the branch prediction performance for the hard-to-predict branch.
In general, a map-reduce framework can be used to perform a given task using distributed and/or parallel processing, for example, by distributing the task across various servers in a cloud-based data center. A map-reduce framework provides well-supported infrastructure for large-scale parallel computations, including data distribution, fault tolerance, and straggler detection, among other examples. The illustrated map-reduce implementation 700 demonstrates the increase in analytical power that results from moving program analysis for hardware optimization to the cloud.
In the illustrated example 700, the branch prediction analysis described above is implemented as a set of map and reduce procedures, which are described below.
First, a “map parent” procedure 701 is called to initiate a backward search for each hard-to-predict branch. The map parent procedure 701 emits a key-value pair identifying the hard-to-predict branch and the stable period, where the stable period is a triple containing the starting position, tracked data location, and ending position for a forward search.
Next, a “reduce parent” procedure 702 is called for each stable period emitted from a backward search performed by the map parent procedure 701. The reduce parent procedure 702 initiates a forward search, which emits the peer branches and the lower boundary of the stable period, which can subsequently be used to conduct local backward searches.
The “map peer” procedure 703 is called for each enumerated branch found in a stable period for a hard-to-predict branch (e.g., the branches emitted by the reduce parent procedure 702). The map peer procedure 703 performs a local backward search and determines whether the tracked data location from the reduce parent procedure 702 is in the list of dependent data locations. Whenever an interdependent peer branch is identified, the map peer procedure 703 emits a key-value pair identifying the hard-to-predict branch and the position of the peer branch instruction.
The “reduce peer” procedure 704 aggregates all interdependent peer branches associated with a hard-to-predict branch and then reports the aggregated branches for further analysis and branch prediction optimization.
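For illustration, the four procedures might be rendered as the following plain-Python generators emitting key-value pairs; the find_stable_periods helper is hypothetical, backward_search and forward_search refer to the sketch above, and a production implementation would run these under an actual map-reduce framework.

```python
def map_parent(branch_id, trace):
    """Map parent (701): backward search for one hard-to-predict branch,
    emitting each stable period as a (start, tracked location, end) triple."""
    for start, tracked, end in find_stable_periods(trace, branch_id):  # hypothetical
        yield branch_id, (start, tracked, end)

def reduce_parent(branch_id, stable_period, trace):
    """Reduce parent (702): forward search over one stable period, emitting
    candidate peer branches with the lower boundary for local backward searches."""
    start, tracked, end = stable_period
    for peer_pos in forward_search(trace, start, end, tracked):
        yield branch_id, (peer_pos, start, tracked)

def map_peer(branch_id, candidate, trace):
    """Map peer (703): local backward search for one candidate peer branch."""
    peer_pos, lower_bound, tracked = candidate
    deps = backward_search(trace, peer_pos, trace[peer_pos]["condition"], lower_bound)
    if tracked in deps:
        yield branch_id, peer_pos            # interdependent peer branch found

def reduce_peer(branch_id, peer_positions):
    """Reduce peer (704): aggregate all interdependent peers for this branch."""
    return branch_id, sorted(set(peer_positions))
```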
Finally, the results of this analysis can be used to build or train a custom predictor for the targeted hard-to-predict branch. Various prediction approaches can be used depending on the reconfiguration options available for a particular processor, including a decision tree trained to associate the directions of the flagged peer branches with the direction of the hard-to-predict branch, or a custom indexing function used by a lookup-based predictor (e.g., a tagged geometric length (TAGE) based predictor).
The flowchart may begin at block 802 by receiving runtime data from a client device. In some embodiments, runtime data (e.g., program and/or hardware states) is collected by a client computing device, and the runtime data is then sent from the client device to a cloud service. For example, an optimization unit of a client processor may collect runtime data from certain components of the processor, and the runtime data may then be provided to the cloud service. Alternatively, the cloud service may obtain the runtime data by directly executing a particular client application.
The flowchart may then proceed to block 804 to analyze the runtime data. For example, the cloud service can use the runtime data to perform machine learning at a data-center-scale to recognize workload patterns and derive optimizations for the client device. For example, in some embodiments, the cloud service may analyze the runtime data using branch modeling, data access modeling, and/or phase recognition.
The flowchart may then proceed to block 806 to generate optimization metadata for the client device. The optimization metadata, for example, is derived from the analysis of runtime data, and contains information relating to processor optimizations that can be performed by the client device.
The flowchart may then proceed to block 808 to send the optimization metadata to the client device. For example, the cloud service sends the optimization metadata to the client device, enabling the client device to use the optimization metadata to perform the appropriate runtime optimizations. In this manner, processors and other hardware can be tailored to client applications at runtime, providing improved flexibility and performance over approaches that only allow similar tuning to be performed during the development stage (e.g., profile-guided optimization techniques).
At this point, the flowchart may be complete. In some embodiments, however, the flowchart may restart and/or certain blocks may be repeated. For example, in some embodiments, the flowchart may restart at block 802 to continue collecting runtime information to optimize the performance of computing devices.
Processor Optimization Using On-Chip and Cloud Learning
The flowchart may begin at block 902 by collecting runtime information associated with a computing device. The runtime information, for example, could include any performance or operational information associated with the computing device (or an associated processor or application), including performance related data (e.g., performance event counters for a processor), processor or application state information (e.g., instruction, register, and/or memory data from an application trace), and so forth.
In some cases, the runtime information may be collected by the computing device and/or an associated processor. In some cases, the runtime information may also be collected by a cloud optimization service. For example, in some cases, the computing device could transmit runtime information to the cloud optimization service, or alternatively, the cloud optimization service could execute the application associated with the computing device to collect the runtime information directly.
The flowchart may then proceed to block 904 to receive and/or determine runtime optimization information for the computing device. The runtime optimization information may be determined, for example, using machine learning based on the collected runtime information. In some cases, the runtime optimization information may be determined by the computing device and/or an associated processor. The runtime optimization information may also be determined for the computing device by a cloud optimization service, and then transmitted from the cloud optimization service to the computing device.
In some cases, the runtime optimization information may be determined using phase recognition (e.g., using the on-chip phase recognition techniques described above).
In some cases, the runtime optimization information may be determined using branch prediction learning to improve the branch prediction performance of the computing device (e.g., using the cloud-based branch prediction analysis described above).
The flowchart may then proceed to block 906 to perform one or more runtime optimizations for the computing device based on the runtime optimization information. For example, based on the runtime optimization information received at block 904, various optimizations can be performed to improve the performance of the computing device, such as adjusting processor voltage to improve power efficiency, adjusting the width of the execution pipeline during periods with systematically poor speculation, tailoring branch prediction, cache pre-fetch, and/or scheduling units based on identified program characteristics and patterns, and so forth.
At this point, the flowchart may be complete. In some embodiments, however, the flowchart may restart and/or certain blocks may be repeated. For example, in some embodiments, the flowchart may restart at block 902 to continue collecting runtime information to optimize the performance of computing devices.
Example Computer Architectures
Example Core Architectures
The front end unit 1030 includes a branch prediction unit 1032 coupled to an instruction cache unit 1034, which is coupled to an instruction translation lookaside buffer (TLB) 1036, which is coupled to an instruction fetch unit 1038, which is coupled to a decode unit 1040. The decode unit 1040 (or decoder) may decode instructions, and generate as an output one or more micro-operations, micro-code entry points, microinstructions, other instructions, or other control signals, which are decoded from, or which otherwise reflect, or are derived from, the original instructions. The decode unit 1040 may be implemented using various different mechanisms. Examples of suitable mechanisms include, but are not limited to, look-up tables, hardware implementations, programmable logic arrays (PLAs), microcode read only memories (ROMs), etc. In one embodiment, the core 1090 includes a microcode ROM or other medium that stores microcode for certain macroinstructions (e.g., in decode unit 1040 or otherwise within the front end unit 1030). The decode unit 1040 is coupled to a rename/allocator unit 1052 in the execution engine unit 1050.
The execution engine unit 1050 includes the rename/allocator unit 1052 coupled to a retirement unit 1054 and a set of one or more scheduler unit(s) 1056. The scheduler unit(s) 1056 represents any number of different schedulers, including reservation stations, central instruction window, etc. The scheduler unit(s) 1056 is coupled to the physical register file(s) unit(s) 1058. Each of the physical register file(s) units 1058 represents one or more physical register files, different ones of which store one or more different data types, such as scalar integer, scalar floating point, packed integer, packed floating point, vector integer, vector floating point, status (e.g., an instruction pointer that is the address of the next instruction to be executed), etc. In one embodiment, the physical register file(s) unit 1058 comprises a vector registers unit, a write mask registers unit, and a scalar registers unit. These register units may provide architectural vector registers, vector mask registers, and general purpose registers. The physical register file(s) unit(s) 1058 is overlapped by the retirement unit 1054 to illustrate various ways in which register renaming and out-of-order execution may be implemented (e.g., using a reorder buffer(s) and a retirement register file(s); using a future file(s), a history buffer(s), and a retirement register file(s); using register maps and a pool of registers; etc.). The retirement unit 1054 and the physical register file(s) unit(s) 1058 are coupled to the execution cluster(s) 1060. The execution cluster(s) 1060 includes a set of one or more execution units 1062 and a set of one or more memory access units 1064. The execution units 1062 may perform various operations (e.g., shifts, addition, subtraction, multiplication) on various types of data (e.g., scalar floating point, packed integer, packed floating point, vector integer, vector floating point). While some embodiments may include a number of execution units dedicated to specific functions or sets of functions, other embodiments may include only one execution unit or multiple execution units that all perform all functions. The scheduler unit(s) 1056, physical register file(s) unit(s) 1058, and execution cluster(s) 1060 are shown as being possibly plural because certain embodiments create separate pipelines for certain types of data/operations (e.g., a scalar integer pipeline, a scalar floating point/packed integer/packed floating point/vector integer/vector floating point pipeline, and/or a memory access pipeline that each have their own scheduler unit, physical register file(s) unit, and/or execution cluster—and in the case of a separate memory access pipeline, certain embodiments are implemented in which only the execution cluster of this pipeline has the memory access unit(s) 1064). It should also be understood that where separate pipelines are used, one or more of these pipelines may be out-of-order issue/execution and the rest in-order.
The set of memory access units 1064 is coupled to the memory unit 1070, which includes a data TLB unit 1072 coupled to a data cache unit 1074 coupled to a level 2 (L2) cache unit 1076. In one exemplary embodiment, the memory access units 1064 may include a load unit, a store address unit, and a store data unit, each of which is coupled to the data TLB unit 1072 in the memory unit 1070. The instruction cache unit 1034 is further coupled to a level 2 (L2) cache unit 1076 in the memory unit 1070. The L2 cache unit 1076 is coupled to one or more other levels of cache and eventually to a main memory.
By way of example, the exemplary register renaming, out-of-order issue/execution core architecture may implement the pipeline 1000 as follows: 1) the instruction fetch unit 1038 performs the fetch and length decoding stages 1002 and 1004; 2) the decode unit 1040 performs the decode stage 1006; 3) the rename/allocator unit 1052 performs the allocation stage 1008 and renaming stage 1010; 4) the scheduler unit(s) 1056 performs the schedule stage 1012; 5) the physical register file(s) unit(s) 1058 and the memory unit 1070 perform the register read/memory read stage 1014; the execution cluster 1060 performs the execute stage 1016; 6) the memory unit 1070 and the physical register file(s) unit(s) 1058 perform the write back/memory write stage 1018; 7) various units may be involved in the exception handling stage 1022; and 8) the retirement unit 1054 and the physical register file(s) unit(s) 1058 perform the commit stage 1024.
The core 1090 may support one or more instruction sets (e.g., the x86 instruction set (with some extensions that have been added with newer versions); the MIPS instruction set of MIPS Technologies of Sunnyvale, Calif.; the ARM instruction set (with optional additional extensions such as NEON) of ARM Holdings of Sunnyvale, Calif.), including the instruction(s) described herein. In one embodiment, the core 1090 includes logic to support a packed data instruction set extension (e.g., AVX1, AVX2), thereby allowing the operations used by many multimedia applications to be performed using packed data.
It should be understood that the core may support multithreading (executing two or more parallel sets of operations or threads), and may do so in a variety of ways including time sliced multithreading, simultaneous multithreading (where a single physical core provides a logical core for each of the threads that physical core is simultaneously multithreading), or a combination thereof (e.g., time sliced fetching and decoding and simultaneous multithreading thereafter such as in the Intel® Hyperthreading technology).
While register renaming is described in the context of out-of-order execution, it should be understood that register renaming may be used in an in-order architecture. While the illustrated embodiment of the processor also includes separate instruction and data cache units 1034/1074 and a shared L2 cache unit 1076, alternative embodiments may have a single internal cache for both instructions and data, such as, for example, a Level 1 (L1) internal cache, or multiple levels of internal cache. In some embodiments, the system may include a combination of an internal cache and an external cache that is external to the core and/or the processor. Alternatively, all of the cache may be external to the core and/or the processor.
Thus, different implementations of the processor 1100 may include: 1) a CPU with the special purpose logic 1108 being integrated graphics and/or scientific (throughput) logic (which may include one or more cores), and the cores 1102A-N being one or more general purpose cores (e.g., general purpose in-order cores, general purpose out-of-order cores, a combination of the two); 2) a coprocessor with the cores 1102A-N being a large number of special purpose cores intended primarily for graphics and/or scientific (throughput) computing; and 3) a coprocessor with the cores 1102A-N being a large number of general purpose in-order cores. Thus, the processor 1100 may be a general-purpose processor, coprocessor or special-purpose processor, such as, for example, a network or communication processor, compression engine, graphics processor, GPGPU (general purpose graphics processing unit), a high-throughput many integrated core (MIC) coprocessor (including 30 or more cores), embedded processor, or the like. The processor may be implemented on one or more chips. The processor 1100 may be a part of and/or may be implemented on one or more substrates using any of a number of process technologies, such as, for example, BiCMOS, CMOS, or NMOS.
The memory hierarchy includes one or more levels of cache within the cores, a set of one or more shared cache units 1106, and external memory (not shown) coupled to the set of integrated memory controller units 1114. The set of shared cache units 1106 may include one or more mid-level caches, such as level 2 (L2), level 3 (L3), level 4 (L4), or other levels of cache, a last level cache (LLC), and/or combinations thereof. While in one embodiment a ring-based interconnect unit 1112 interconnects the integrated graphics logic 1108, the set of shared cache units 1106, and the system agent unit 1110/integrated memory controller unit(s) 1114, alternative embodiments may use any number of well-known techniques for interconnecting such units. In one embodiment, coherency is maintained between one or more cache units 1106 and cores 1102A-N.
In some embodiments, one or more of the cores 1102A-N are capable of multi-threading. The system agent 1110 includes those components coordinating and operating cores 1102A-N. The system agent unit 1110 may include for example a power control unit (PCU) and a display unit. The PCU may be or include logic and components needed for regulating the power state of the cores 1102A-N and the integrated graphics logic 1108. The display unit is for driving one or more externally connected displays.
The cores 1102A-N may be homogenous or heterogeneous in terms of architecture instruction set; that is, two or more of the cores 1102A-N may be capable of executing the same instruction set, while others may be capable of executing only a subset of that instruction set or a different instruction set.
Example Computer Architectures
Referring now to
The optional nature of additional processors 1215 is denoted in
The memory 1240 may be, for example, dynamic random access memory (DRAM), phase change memory (PCM), or a combination of the two. For at least one embodiment, the controller hub 1220 communicates with the processor(s) 1210, 1215 via a multi-drop bus such as a frontside bus (FSB), a point-to-point interface such as QuickPath Interconnect (QPI), or a similar connection 1295.
In one embodiment, the coprocessor 1245 is a special-purpose processor, such as, for example, a high-throughput MIC processor, a network or communication processor, compression engine, graphics processor, GPGPU, embedded processor, or the like. In one embodiment, controller hub 1220 may include an integrated graphics accelerator.
There can be a variety of differences between the physical resources 1210, 1215 in terms of a spectrum of metrics of merit, including architectural, microarchitectural, thermal, and power consumption characteristics, and the like.
In one embodiment, the processor 1210 executes instructions that control data processing operations of a general type. Embedded within the instructions may be coprocessor instructions. The processor 1210 recognizes these coprocessor instructions as being of a type that should be executed by the attached coprocessor 1245. Accordingly, the processor 1210 issues these coprocessor instructions (or control signals representing coprocessor instructions) on a coprocessor bus or other interconnect, to coprocessor 1245. Coprocessor(s) 1245 accept and execute the received coprocessor instructions.
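As a purely hypothetical illustration of this dispatch behavior, the sketch below filters an instruction stream, executing general-purpose instructions locally and issuing instructions of a recognized coprocessor type to an attached coprocessor; the opcodes and callbacks are assumptions made only for this example.

    # Illustrative sketch (Python): recognizing coprocessor instructions in an instruction
    # stream and issuing them to an attached coprocessor. Opcodes are hypothetical.
    COPROCESSOR_OPCODES = {"COP_MATMUL", "COP_CONV"}

    def dispatch(instruction_stream, execute_locally, issue_to_coprocessor):
        for opcode, operands in instruction_stream:
            if opcode in COPROCESSOR_OPCODES:
                # e.g., placed on a coprocessor bus or other interconnect
                issue_to_coprocessor(opcode, operands)
            else:
                execute_locally(opcode, operands)

    dispatch([("ADD", (1, 2)), ("COP_MATMUL", ("A", "B"))],
             execute_locally=lambda op, args: print("processor:", op, args),
             issue_to_coprocessor=lambda op, args: print("coprocessor:", op, args))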
Referring now to
Processors 1370 and 1380 are shown including integrated memory controller (IMC) units 1372 and 1382, respectively. Processor 1370 also includes, as part of its bus controller units, point-to-point (P-P) interfaces 1376 and 1378; similarly, second processor 1380 includes P-P interfaces 1386 and 1388. Processors 1370, 1380 may exchange information via a point-to-point (P-P) interface 1350 using P-P interface circuits 1378, 1388. As shown in
Processors 1370, 1380 may each exchange information with a chipset 1390 via individual P-P interfaces 1352, 1354 using point-to-point interface circuits 1376, 1394, 1386, 1398. Chipset 1390 may optionally exchange information with the coprocessor 1338 via a high-performance interface 1339. In one embodiment, the coprocessor 1338 is a special-purpose processor, such as, for example, a high-throughput MIC processor, a network or communication processor, compression engine, graphics processor, GPGPU, embedded processor, or the like.
A shared cache (not shown) may be included in either processor or outside of both processors, yet connected with the processors via a P-P interconnect, such that either or both processors' local cache information may be stored in the shared cache if a processor is placed into a low power mode.
Chipset 1390 may be coupled to a first bus 1316 via an interface 1396. In one embodiment, first bus 1316 may be a Peripheral Component Interconnect (PCI) bus, or a bus such as a PCI Express bus or another third generation I/O interconnect bus, although the scope of the present invention is not so limited.
As shown in
Referring now to
Embodiments of the mechanisms disclosed herein may be implemented in hardware, software, firmware, or a combination of such implementation approaches. Embodiments of the invention may be implemented as computer programs or program code executing on programmable systems comprising at least one processor, a storage system (including volatile and non-volatile memory and/or storage elements), at least one input device, and at least one output device.
Program code, such as code 1330 illustrated in
The program code may be implemented in a high level procedural or object oriented programming language to communicate with a processing system. The program code may also be implemented in assembly or machine language, if desired. In fact, the mechanisms described herein are not limited in scope to any particular programming language. In any case, the language may be a compiled or interpreted language.
One or more aspects of at least one embodiment may be implemented by representative instructions stored on a machine-readable medium which represents various logic within the processor, which when read by a machine causes the machine to fabricate logic to perform the techniques described herein. Such representations, known as “IP cores,” may be stored on a tangible, machine-readable medium and supplied to various customers or manufacturing facilities to load into the fabrication machines that actually make the logic or processor.
Such machine-readable storage media may include, without limitation, non-transitory, tangible arrangements of articles manufactured or formed by a machine or device, including storage media such as hard disks; any other type of disk, including floppy disks, optical disks, compact disk read-only memories (CD-ROMs), compact disk rewritables (CD-RWs), and magneto-optical disks; semiconductor devices such as read-only memories (ROMs), random access memories (RAMs) such as dynamic random access memories (DRAMs), static random access memories (SRAMs), erasable programmable read-only memories (EPROMs), flash memories, electrically erasable programmable read-only memories (EEPROMs), and phase change memory (PCM); magnetic or optical cards; or any other type of media suitable for storing electronic instructions.
Accordingly, embodiments of the invention also include non-transitory, tangible machine-readable media containing instructions or containing design data, such as Hardware Description Language (HDL), which defines structures, circuits, apparatuses, processors and/or system features described herein. Such embodiments may also be referred to as program products.
Emulation
In some cases, an instruction converter may be used to convert an instruction from a source instruction set to a target instruction set. For example, the instruction converter may translate (e.g., using static binary translation, dynamic binary translation including dynamic compilation), morph, emulate, or otherwise convert an instruction to one or more other instructions to be processed by the core. The instruction converter may be implemented in software, hardware, firmware, or a combination thereof. The instruction converter may be on processor, off processor, or part on and part off processor.
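As one purely illustrative example, the sketch below converts instructions from a source instruction set into sequences of target instructions by way of a translation table, which is one simple form of the conversion described above; the instruction sets, opcodes, and table contents are hypothetical.

    # Illustrative sketch (Python) of an instruction converter driven by a translation table.
    # One source instruction may expand to zero, one, or several target instructions.
    TRANSLATION_TABLE = {
        "SRC_ADD": ["TGT_ADD"],
        "SRC_MULADD": ["TGT_MUL", "TGT_ADD"],  # no fused operation in the target set, so split it
        "SRC_NOP": [],
    }

    def convert(source_program):
        target_program = []
        for opcode, operands in source_program:
            # Unknown source opcodes fall back to a trap in this sketch.
            for target_opcode in TRANSLATION_TABLE.get(opcode, ["TGT_TRAP"]):
                target_program.append((target_opcode, operands))
        return target_program

    print(convert([("SRC_MULADD", ("r1", "r2", "r3")), ("SRC_ADD", ("r1", "r1", "r4"))]))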
The flowcharts and block diagrams in the FIGURES illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various aspects of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order or alternative orders, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The foregoing disclosure outlines features of several embodiments so that those skilled in the art may better understand various aspects of the present disclosure. Those skilled in the art should appreciate that they may readily use the present disclosure as a basis for designing or modifying other processes and structures for carrying out the same purposes and/or achieving the same advantages of the embodiments introduced herein. Those skilled in the art should also realize that such equivalent constructions do not depart from the spirit and scope of the present disclosure, and that they may make various changes, substitutions, and alterations herein without departing from the spirit and scope of the present disclosure.
All or part of any hardware element disclosed herein may readily be provided in a system-on-a-chip (SoC), including a central processing unit (CPU) package. An SoC represents an integrated circuit (IC) that integrates components of a computer or other electronic system into a single chip. The SoC may contain digital, analog, mixed-signal, and radio frequency functions, all of which may be provided on a single chip substrate. Other embodiments may include a multi-chip-module (MCM), with a plurality of chips located within a single electronic package and configured to interact closely with each other through the electronic package. In various other embodiments, the computing functionalities disclosed herein may be implemented in one or more silicon cores in Application Specific Integrated Circuits (ASICs), Field Programmable Gate Arrays (FPGAs), and other semiconductor chips.
As used throughout this specification, the term “processor” or “microprocessor” should be understood to include not only a traditional microprocessor (such as Intel's® industry-leading x86 and x64 architectures), but also matrix processors, graphics processors, and any ASIC, FPGA, microcontroller, digital signal processor (DSP), programmable logic device, programmable logic array (PLA), microcode, instruction set, emulated or virtual machine processor, or any similar “Turing-complete” device, combination of devices, or logic elements (hardware or software) that permit the execution of instructions.
Note also that in certain embodiments, some of the components may be omitted or consolidated. In a general sense, the arrangements depicted in the figures should be understood as logical divisions, whereas a physical architecture may include various permutations, combinations, and/or hybrids of these elements. It is imperative to note that countless possible design configurations can be used to achieve the operational objectives outlined herein. Accordingly, the associated infrastructure has a myriad of substitute arrangements, design choices, device possibilities, hardware configurations, software implementations, and equipment options.
In a general sense, any suitably-configured processor can execute instructions associated with data or microcode to achieve the operations detailed herein. Any processor disclosed herein could transform an element or an article (for example, data) from one state or thing to another state or thing. In another example, some activities outlined herein may be implemented with fixed logic or programmable logic (for example, software and/or computer instructions executed by a processor) and the elements identified herein could be some type of a programmable processor, programmable digital logic (for example, a field programmable gate array (FPGA), an erasable programmable read only memory (EPROM), an electrically erasable programmable read only memory (EEPROM)), an ASIC that includes digital logic, software, code, electronic instructions, flash memory, optical disks, CD-ROMs, DVD ROMs, magnetic or optical cards, other types of machine-readable mediums suitable for storing electronic instructions, or any suitable combination thereof.
In operation, a storage may store information in any suitable type of tangible, non-transitory storage medium (for example, random access memory (RAM), read only memory (ROM), field programmable gate array (FPGA), erasable programmable read only memory (EPROM), electrically erasable programmable ROM (EEPROM), or microcode), software, hardware (for example, processor instructions or microcode), or in any other suitable component, device, element, or object where appropriate and based on particular needs. Furthermore, the information being tracked, sent, received, or stored in a processor could be provided in any database, register, table, cache, queue, control list, or storage structure, based on particular needs and implementations, all of which could be referenced in any suitable timeframe. Any of the memory or storage elements disclosed herein should be construed as being encompassed within the broad terms ‘memory’ and ‘storage,’ as appropriate. A non-transitory storage medium herein is expressly intended to include any non-transitory special-purpose or programmable hardware configured to provide the disclosed operations, or to cause a processor to perform the disclosed operations. A non-transitory storage medium also expressly includes a processor having stored thereon hardware-coded instructions, and optionally microcode instructions or sequences encoded in hardware, firmware, or software.
Computer program logic implementing all or part of the functionality described herein is embodied in various forms, including, but in no way limited to, hardware description language, a source code form, a computer executable form, machine instructions or microcode, programmable hardware, and various intermediate forms (for example, forms generated by an HDL processor, assembler, compiler, linker, or locator). In an example, source code includes a series of computer program instructions implemented in various programming languages, such as an object code, an assembly language, or a high-level language such as OpenCL, FORTRAN, C, C++, JAVA, or HTML for use with various operating systems or operating environments, or in hardware description languages such as Spice, Verilog, and VHDL. The source code may define and use various data structures and communication messages. The source code may be in a computer executable form (e.g., via an interpreter), or the source code may be converted (e.g., via a translator, assembler, or compiler) into a computer executable form, or converted to an intermediate form such as byte code. Where appropriate, any of the foregoing may be used to build or describe appropriate discrete or integrated circuits, whether sequential, combinatorial, state machines, or otherwise.
In one example, any number of electrical circuits of the FIGURES may be implemented on a board of an associated electronic device. The board can be a general circuit board that can hold various components of the internal electronic system of the electronic device and, further, provide connectors for other peripherals. More specifically, the board can provide the electrical connections by which the other components of the system can communicate electrically. Any suitable processor and memory can be suitably coupled to the board based on particular configuration needs, processing demands, and computing designs. Other components such as external storage, additional sensors, controllers for audio/video display, and peripheral devices may be attached to the board as plug-in cards, via cables, or integrated into the board itself. In another example, the electrical circuits of the FIGURES may be implemented as stand-alone modules (e.g., a device with associated components and circuitry configured to perform a specific application or function) or implemented as plug-in modules into application specific hardware of electronic devices.
Note that with the numerous examples provided herein, interaction may be described in terms of two, three, four, or more electrical components. However, this has been done for purposes of clarity and example only. It should be appreciated that the system can be consolidated or reconfigured in any suitable manner. Along similar design alternatives, any of the illustrated components, modules, and elements of the FIGURES may be combined in various possible configurations, all of which are within the broad scope of this specification. In certain cases, it may be easier to describe one or more of the functionalities of a given set of flows by only referencing a limited number of electrical elements. It should be appreciated that the electrical circuits of the FIGURES and its teachings are readily scalable and can accommodate a large number of components, as well as more complicated/sophisticated arrangements and configurations. Accordingly, the examples provided should not limit the scope or inhibit the broad teachings of the electrical circuits as potentially applied to a myriad of other architectures.
Numerous other changes, substitutions, variations, alterations, and modifications may be ascertained to one skilled in the art and it is intended that the present disclosure encompass all such changes, substitutions, variations, alterations, and modifications as falling within the scope of the appended claims.
The following examples pertain to embodiments described throughout this disclosure.
One or more embodiments may include a processor, comprising: a processor optimization unit to: collect runtime information associated with a computing device, wherein the runtime information comprises information indicating a performance of the computing device during program execution; receive runtime optimization information for the computing device, wherein the runtime optimization information comprises information associated with one or more runtime optimizations for the computing device, and wherein the runtime optimization information is determined based on an analysis of the collected runtime information; and perform the one or more runtime optimizations for the computing device based on the runtime optimization information.
In one example embodiment of a processor, the processor optimization unit to receive the runtime optimization information for the computing device is further to determine the runtime optimization information.
In one example embodiment of a processor, the runtime information comprises a plurality of event counters associated with a workload of the computing device.
In one example embodiment of a processor, the processor optimization unit to determine the runtime optimization information is further to perform phase recognition for the workload of the computing device.
In one example embodiment of a processor, the processor optimization unit to perform phase recognition for the workload of the computing device is further to perform noise reduction using soft-thresholding.
In one example embodiment of a processor, the processor optimization unit to perform phase recognition for the workload of the computing device is further to identify a phase associated with the workload using a convolutional phase comparison.
In one example embodiment of a processor, the processor optimization unit to perform phase recognition for the workload of the computing device is further to identify a phase associated with the workload using a chi-squared calculation.
In one example embodiment of a processor, the processor optimization unit to receive the runtime optimization information for the computing device is further to receive the runtime optimization information from a cloud service remote from the computing device.
In one example embodiment of a processor: the runtime information comprises instruction trace data associated with an application executed on the computing device, wherein the instruction trace data comprises a plurality of branch instructions; and the runtime optimization information is determined by identifying a relationship associated with the plurality of branch instructions to improve branch prediction performed by the computing device.
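For illustration only, the sketch below shows one way the phase-recognition steps referenced in the examples above (noise reduction using soft-thresholding, followed by a chi-squared comparison of event-counter histograms against a library of known phases) might be expressed. The counter values, threshold, and phase library are hypothetical, and a convolutional phase comparison could be used in place of, or in addition to, the chi-squared calculation.

    # Illustrative sketch (Python): phase recognition from event counters using
    # soft-thresholding for noise reduction and a chi-squared comparison against
    # known phase histograms. All values are hypothetical.
    import numpy as np

    def soft_threshold(counters, lam):
        # Shrink each counter toward zero by lam and clamp at zero, suppressing small,
        # noisy fluctuations while preserving dominant counter activity.
        return np.sign(counters) * np.maximum(np.abs(counters) - lam, 0.0)

    def chi_squared_distance(observed, expected, eps=1e-9):
        # Chi-squared distance between two normalized event-counter histograms.
        observed = observed / (observed.sum() + eps)
        expected = expected / (expected.sum() + eps)
        return np.sum((observed - expected) ** 2 / (expected + eps))

    def recognize_phase(counters, phase_library, lam=5.0):
        # Denoise the sampled counters, then pick the known phase whose histogram is
        # closest in the chi-squared sense.
        denoised = soft_threshold(np.asarray(counters, dtype=float), lam)
        distances = {name: chi_squared_distance(denoised, np.asarray(hist, dtype=float))
                     for name, hist in phase_library.items()}
        return min(distances, key=distances.get), distances

    # Hypothetical phase library; each entry is an event-counter histogram for a known phase.
    phases = {"memory-bound": [900, 50, 30, 20], "compute-bound": [100, 600, 250, 50]}
    print(recognize_phase([850, 70, 40, 25], phases))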
One or more embodiments may include at least one machine accessible storage medium having instructions stored thereon, the instructions, when executed on a machine, cause the machine to: collect runtime information associated with a computing device, wherein the runtime information comprises information indicating a performance of the computing device during program execution; receive runtime optimization information for the computing device, wherein the runtime optimization information comprises information associated with one or more runtime optimizations for the computing device, and wherein the runtime optimization information is determined based on an analysis of the collected runtime information; and perform the one or more runtime optimizations for the computing device based on the runtime optimization information.
In one example embodiment of a storage medium, the instructions that cause the machine to receive the runtime optimization information for the computing device further cause the machine to determine the runtime optimization information.
In one example embodiment of a storage medium: the runtime information comprises a plurality of event counters associated with a workload of the computing device; and the instructions that cause the machine to determine the runtime optimization information further cause the machine to perform phase recognition for the workload of the computing device.
In one example embodiment of a storage medium, the instructions that cause the machine to perform phase recognition for the workload of the computing device further cause the machine to perform noise reduction using soft-thresholding.
In one example embodiment of a storage medium, the instructions that cause the machine to perform phase recognition for the workload of the computing device further cause the machine to identify a phase associated with the workload using a convolutional phase comparison.
In one example embodiment of a storage medium, the instructions that cause the machine to perform phase recognition for the workload of the computing device further cause the machine to identify a phase associated with the workload using a chi-squared calculation.
In one example embodiment of a storage medium: the runtime information comprises instruction trace data associated with an application executed on the computing device, wherein the instruction trace data comprises a plurality of branch instructions; and the runtime optimization information is determined by identifying a relationship associated with the plurality of branch instructions to improve branch prediction performed by the computing device.
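As a hypothetical illustration of identifying a relationship associated with a plurality of branch instructions in instruction trace data, the sketch below counts how often the outcome of one branch agrees with the outcome of a later branch; pairs that agree (or disagree) almost always indicate a correlation that a branch predictor could exploit. The trace format, addresses, and thresholds are assumptions made only for this example.

    # Illustrative sketch (Python): finding correlated branch pairs in a trace of
    # (branch_address, taken) records. Format and thresholds are hypothetical.
    from collections import defaultdict
    import random

    def correlated_branches(trace, min_samples=16, min_agreement=0.95):
        stats = defaultdict(lambda: [0, 0])  # (earlier, later) -> [samples, agreements]
        last_outcome = {}
        for addr, taken in trace:
            for prev_addr, prev_taken in last_outcome.items():
                if prev_addr == addr:
                    continue
                key = (prev_addr, addr)
                stats[key][0] += 1
                stats[key][1] += int(prev_taken == taken)
            last_outcome[addr] = taken
        related = []
        for (earlier, later), (samples, agreements) in stats.items():
            if samples >= min_samples:
                rate = agreements / samples
                if rate >= min_agreement or rate <= 1.0 - min_agreement:
                    related.append((hex(earlier), hex(later), rate))
        return related

    # Hypothetical trace in which branch 0x40 always repeats the outcome of branch 0x10.
    random.seed(0)
    trace = []
    for _ in range(50):
        outcome = random.random() < 0.5
        trace.extend([(0x10, outcome), (0x40, outcome)])
    print(correlated_branches(trace))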
One or more embodiments may include a method, comprising: collecting runtime information associated with a computing device, wherein the runtime information comprises information indicating a performance of the computing device during program execution; receiving runtime optimization information for the computing device, wherein the runtime optimization information comprises information associated with one or more runtime optimizations for the computing device, and wherein the runtime optimization information is determined based on an analysis of the collected runtime information; and performing the one or more runtime optimizations for the computing device based on the runtime optimization information.
In one example embodiment of a method, receiving the runtime optimization information for the computing device further comprises determining the runtime optimization information.
In one example embodiment of a method, the runtime information comprises a plurality of event counters associated with a workload of the computing device; and wherein determining the runtime optimization information comprises performing phase recognition for the workload of the computing device.
In one example embodiment of a method, performing phase recognition for the workload of the computing device comprises performing noise reduction using soft-thresholding.
In one example embodiment of a method, performing phase recognition for the workload of the computing device comprises identifying a phase associated with the workload using a convolutional phase comparison.
In one example embodiment of a method, performing phase recognition for the workload of the computing device comprises identifying a phase associated with the workload using a chi-squared calculation.
In one example embodiment of a method: the runtime information comprises instruction trace data associated with an application executed on the computing device, wherein the instruction trace data comprises a plurality of branch instructions; and the runtime optimization information is determined by identifying a relationship associated with the plurality of branch instructions to improve branch prediction performed by the computing device.
One or more embodiments may include a system, comprising: a communication interface to communicate with a computing device over one or more networks; and a plurality of processors for providing a cloud service for computer optimization, wherein the plurality of processors is to: collect runtime information associated with the computing device, wherein the runtime information comprises information indicating a performance of the computing device during program execution; determine runtime optimization information for the computing device, wherein the runtime optimization information comprises information associated with one or more runtime optimizations for the computing device, and wherein the runtime optimization information is determined based on an analysis of the collected runtime information; and provide the runtime optimization information to the computing device to optimize performance of the computing device.
In one example embodiment of a system: the runtime information comprises instruction trace data associated with an application executed on the computing device, wherein the instruction trace data comprises a plurality of branch instructions; and the plurality of processors to determine the runtime optimization information for the computing device is further to identify a relationship associated with the plurality of branch instructions to improve branch prediction performed by the computing device.
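By way of illustration, the following sketch shows the collect/analyze/provide flow of the system embodiment above: a computing device packages runtime information, and a cloud service returns runtime optimization information derived from its analysis. The message formats, field names, and decision logic are hypothetical and are used only to make the flow concrete.

    # Illustrative sketch (Python) of the collect/analyze/provide flow. All field names
    # and decision logic are hypothetical.
    import json

    def collect_runtime_information(sample_counters):
        # Device side: package event counters gathered during program execution.
        return {"device_id": "device-0", "event_counters": sample_counters}

    def analyze(runtime_information):
        # Cloud side: a trivial stand-in for the analysis that produces optimization
        # information, here a prefetcher setting chosen from the counter mix.
        counters = runtime_information["event_counters"]
        memory_bound = counters.get("cache_misses", 0) > counters.get("branch_misses", 0)
        return {"prefetcher": "aggressive" if memory_bound else "conservative"}

    def apply_optimizations(optimization_information):
        # Device side: act on the returned optimization information (here, just report it).
        print("applying:", json.dumps(optimization_information))

    info = collect_runtime_information({"cache_misses": 12000, "branch_misses": 300})
    apply_optimizations(analyze(info))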