In computer architecture, a branch predictor can be a part of a processor that determines whether a conditional branch in the instruction flow of a program is likely to be taken or not taken. This determination may be called a branch prediction. Branch predictors are important for achieving high performance in modern, superscalar processors, and can allow processors to fetch and execute instructions without waiting for a branch to be resolved. Most pipelined processors perform branch prediction of some form, because they generally must guess the address of the next instruction to fetch before the current instruction has been executed.
Branch prediction remains one of the important components of high performance in processors that exploit single-threaded performance. Modern branch predictors can achieve high accuracy on many codes, but further developments are needed if processors are to continue improving single-threaded performance. Accurate branch prediction will likely remain important for general-purpose processors, especially as the number of available cores exceeds the number of available threads.
Neural branch predictors, a class of correlating predictors that make a prediction for the current branch based on the history pattern observed for previous branches using a dot-product computation, have shown some promise in attaining high prediction accuracy. Neural branch predictors, however, have traditionally provided poor power and energy characteristics due to the computation requirement. Certain proposed designs have reduced predictor latency at the expense of some accuracy, but such designs remain uncompetitive from a power perspective. The requirement of computing a dot product with potentially tens or even hundreds of elements for every prediction may make such predictors unsuitable for industrial adoption in their current form.
The foregoing and other features of the present disclosure will become more fully apparent from the following description and appended claims, taken in conjunction with the accompanying drawings. Understanding that these drawings depict only several examples in accordance with the disclosure and are, therefore, not to be considered limiting of its scope, the disclosure will be described with additional specificity and detail through use of the accompanying drawings, in which:
In the following detailed description, reference is made to the accompanying drawings, which form a part hereof. In the drawings, similar symbols typically identify similar components, unless context dictates otherwise. The illustrative examples described in the detailed description, drawings, and claims are not meant to be limiting. Other examples may be utilized, and other changes may be made, without departing from the spirit or scope of the subject matter presented herein. It will be readily understood that the aspects of the present disclosure, as generally described herein, and illustrated in the Figures, can be arranged, substituted, combined, separated, and designed in a wide variety of different configurations, all of which are implicitly contemplated herein.
This disclosure is drawn to methods, apparatus, computer programs and systems related to branch prediction. Certain preferred embodiments of one such system are illustrated in the figures and described below. Many other embodiments are also possible; however, time and space limitations prevent an exhaustive list of those embodiments from being included in one document. Accordingly, other embodiments within the scope of the claims will become apparent to those skilled in the art from the teachings of this patent.
The figures include numbering to designate illustrative components of examples shown within the drawings, including the following: a computer system 100, a processor 101, a system bus 102, an operating system 103, an application 104, a read-only memory 105, a random access memory 106, a disk adapter 107, a disk unit 108, a communications adapter 109, an interface adapter 110, a display adapter 111, a keyboard 112, a mouse 113, a speaker 114, a display monitor 115, an analog branch predictor 200, a table of perceptrons 201, a branch history register 202, a hash function 203, a dot product 204, a bias weight 205, an updated weights vector 206, a weights vector 207, digital to analog converters 401, current splitters 402, current to voltage converters 403, comparators 404, a comparator output 411, training outputs 412 and 413, a magnitude line 422, current lines 423, a weight bias 424, a current source 450, a bias transistor 451, a ground 460, and an XOR function 465.
Referring to
Referring to
Input/Output (“I/O”) devices may also be connected to computer system 100 via a user interface adapter 110 and a display adapter 111. For example, a keyboard 112, a mouse 113 and a speaker 114 may be connected to bus 102 through user interface adapter 110. Data may be provided to computer system 100 through any of these example devices. A display monitor 115 may be connected to system bus 102 by display adapter 111. In this manner, a user can provide data or other information to computer system 100 through keyboard 112 and/or mouse 113, and obtain output from computer system 100 via display 115 and/or speaker 114.
The various aspects, features, embodiments or implementations of the invention described herein can be used alone or in various combinations. The methods of the present invention can be implemented by software, hardware or a combination of hardware and software. A detailed description of a branch predictor design according to one example that may be implemented using processor 101 is provided below in connection with
Many neural branch predictors can be derived from a perceptron branch predictor. In this example context, a perceptron can be a vector of h+1 small integer weights, where h is the history length of the predictor. Referring to
As an example, to predict a branch, a perceptron (e.g., a weights vector) 207 may be selected using a hash function 203 of the branch program counter (PC). The output of the perceptron 207 may be determined as a dot product 204 of the perceptron 207 and the history shift register 202, with the 0 (not-taken) values in the shift register being interpreted as −1. Added to the dot product 204 may be an extra bias weight 205 in the perceptron 207, which can take into account the tendency of a branch to be taken or not taken, without regard for its correlation to other branches. If the dot-product 204 result is at least 0, then the branch is predicted as being taken; otherwise, it is predicted as being not taken. Negative weight values generally denote inverse correlations. For example, if a weight with a value of −10 is multiplied by −1 in the shift register (i.e., not taken), the value −1·−10=10 will be added to the dot-product result, biasing the result toward a taken prediction, since the weight indicates a negative correlation with the not-taken branch represented by the history bit. The magnitude of the weight may indicate the strength of the positive or negative correlation. As with other predictors, the branch history shift register 202 may be speculatively updated and rolled back to the previous entry on a misprediction.
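For purposes of illustration only, the following sketch (in Python, with assumed names such as predict, hash_pc, and weight_table that are not part of the design described herein) models the prediction step just described as a software procedure:

    # Illustrative sketch of the perceptron prediction step described above.
    # All names are assumptions introduced only for this example.

    def hash_pc(pc, table_size):
        # A simple illustrative hash of the branch PC into the table of perceptrons.
        return pc % table_size

    def predict(weight_table, history, pc):
        """weight_table: list of perceptrons, each a list of h+1 small integers,
        with the bias weight stored first; history: list of h bits, 1 = taken,
        0 = not taken; pc: the branch program counter."""
        index = hash_pc(pc, len(weight_table))            # hash function 203
        perceptron = weight_table[index]                  # weights vector 207
        # Dot product 204, with each 0 (not-taken) history bit interpreted as -1,
        # plus the bias weight 205.
        output = perceptron[0] + sum(
            w * (1 if bit else -1) for w, bit in zip(perceptron[1:], history))
        taken = output >= 0                               # >= 0 predicts taken
        return taken, output, index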
When the branch outcome becomes known, the perceptron 207 that provided the prediction may be updated [206]. The perceptron 207 may be trained based on a result of a misprediction or when the magnitude of the perceptron output is below a specified threshold value. Upon training, both the bias weight 205 and the h correlating weights can be updated. The bias weight 205 may be incremented or decremented if the branch is taken or not taken, respectively. Each correlating weight in the perceptron 207 may be incremented if the predicted branch has the same outcome as the corresponding bit in the history register (e.g., a positive correlation) and decremented otherwise (e.g., a negative correlation) using a saturating arithmetic procedure. If there is no correlation between the predicted branch and a branch in the history register, the latter's corresponding weight may tend toward 0. If there is a high positive or negative correlation, the weight may have a large magnitude.
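Similarly, for purposes of illustration only, the following sketch (with assumed names and an assumed 7-bit signed weight range) models the training rule described above:

    # Illustrative sketch of the training rule; names and the weight range
    # are assumptions made only for this example.

    def train(weight_table, index, history, taken, output, threshold, max_weight=63):
        """Update the selected perceptron after the branch outcome is known:
        train on a misprediction, or when |output| is below the threshold."""
        predicted_taken = output >= 0
        if predicted_taken == taken and abs(output) >= threshold:
            return                                        # no training needed
        perceptron = weight_table[index]

        def saturate(w):                                  # saturating arithmetic
            return max(-max_weight - 1, min(max_weight, w))

        # Bias weight 205: incremented if taken, decremented if not taken.
        perceptron[0] = saturate(perceptron[0] + (1 if taken else -1))
        # Correlating weights: increment on agreement with the corresponding
        # history bit (positive correlation), decrement otherwise.
        for i, bit in enumerate(history, start=1):
            agree = (bit == 1) == taken
            perceptron[i] = saturate(perceptron[i] + (1 if agree else -1))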
Neural predictors, however, have traditionally shown poor power and energy characteristics due to certain computation requirements. Certain prior designs have somewhat reduced the predictor latency at the expense of some accuracy, but still remained unimpressive from a power perspective. As indicated above, the requirement of computing a dot product with potentially tens or even hundreds of elements for every prediction makes such predictors unsuitable for industrial adoption in their current form. Described herein below is an example of an analog implementation of such a neural predictor, which may significantly reduce the power requirements of the traditional neural predictor.
For example, DACs 401 can include binary current-steering DACs. With digital weight storage, DACs 401 may be required to convert digital weight values to analog values that can be combined efficiently. Although the perceptron weights can be 7 bits, 1 bit may be used to represent the sign of the weight, so 6-bit DACs may generally be utilized. There may be, e.g., one DAC 401 per weight, each possibly consisting of a current source 450 and a bias transistor 451, as well as one transistor corresponding to each bit in the weight. One example of a sample DAC 401 is illustrated in greater detail in block 420, which also shows sample components thereof.
This example can support a near-linear digital-to-analog conversion. For example, for a 4-bit base-2 digital magnitude, the widths of the DAC 401 transistors may be set to 1, 2, 4 and 8, and the transistors can draw currents, e.g., I, 2I, 4I, and 8I, respectively, as shown in greater detail at block 420. A switch can be used to steer each transistor current according to its corresponding weight bit, where, e.g., a weight bit of 1 may steer the current to the magnitude line [422] and a weight bit of 0 can steer it to ground [460]. In this example, if the digital magnitude to be converted is 5, or 0101, currents I and 4I may be steered to the magnitude line, while 2I and 8I may be steered to ground [460]. Based on the properties of Kirchhoff's current law, the magnitude line [422] can carry the sum of the currents whose weight bits are 1, and thus may approximate the digitally stored weight. The magnitude value may then be steered to a positive line or negative line [423] based on the XOR [465] of the sign bit for that weight and the appropriate history bit 424, effectively multiplying the signed weight value by the history bit 424. The positive and negative lines [423] may be shared across all weights, and again based on Kirchhoff's current law, all positive values can be added together, while all negative values may also be added together [405].
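For purposes of illustration only, the following behavioral sketch (with assumed names and assumed bit conventions, e.g., a sign bit of 1 denoting a negative weight and a history bit of 1 denoting taken) models the current steering and summation numerically; it is not a description of the circuit itself:

    # Illustrative behavioral model of the current steering and summation
    # described above; a numerical sketch, not a circuit design.

    def dac_magnitude(magnitude_bits, unit_current=1.0):
        """Sum the currents I, 2I, 4I, ... whose corresponding weight bits are 1;
        bits of value 0 are steered to ground and contribute nothing.
        magnitude_bits is least-significant bit first, so 5 = 0101 -> [1, 0, 1, 0]."""
        return sum(unit_current * (1 << i)
                   for i, bit in enumerate(magnitude_bits) if bit)

    def summed_lines(weights, history, unit_current=1.0):
        """Return the total currents on the positive and negative lines 423.
        weights: one (sign_bit, magnitude_bits) pair per weight, with sign_bit 1
        assumed to denote a negative weight; history: one bit per weight, with 1
        assumed to denote taken."""
        positive = negative = 0.0
        for (sign_bit, magnitude_bits), hist_bit in zip(weights, history):
            current = dac_magnitude(magnitude_bits, unit_current)
            # The XOR 465 of the sign bit and the history bit steers the magnitude,
            # effectively multiplying the signed weight by the history value.
            if sign_bit ^ hist_bit:
                positive += current
            else:
                negative += current
        return positive, negative

Under these assumptions, dac_magnitude([1, 0, 1, 0]) returns 5I, matching the example above in which currents I and 4I are steered to the magnitude line.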
Thereafter, the results can be provided to the current splitter 402. For example, the currents on the positive line and the negative line may be split roughly equally by, e.g., three transistors of the current splitter 402 to allow for three circuit outputs: a one-bit prediction and two bits that may be used to determine whether training should occur [412 and 413]. Splitting the current, rather than duplicating it through additional current mirrors, can maintain the relative relationship of the positive and negative weights without increasing the total current draw, thereby likely avoiding or reducing an increase in power consumption.
The outputs of the current splitter can be provided to the current to voltage converter 403. For example, the currents from the splitters 402 can pass through resistors of the current to voltage converter 403, thus creating voltages that may be used as input to the voltage comparators 404. For example, track-and-latch comparators 404, examples of which are shown in
In addition to a one-bit taken or not-taken prediction [411], the example circuit may latch two signals [412 and 413] that can be used when the branch is resolved to indicate whether the weights are to be updated. Training may occur if, e.g., the prediction was incorrect or if the absolute value of the difference between the positive and negative weights is less than the threshold value. Rather than actually determining the difference between the positive and negative lines, which would likely require more complex circuitry, the absolute value comparison may be split into two separate cases, e.g., one case for the positive weights being larger than the negative weights and the other case for the negative weights being larger than the positive ones. Instead of waiting for the prediction output P [411] to be produced, which may increase the total circuit delay, all three comparisons [411-413] may be performed in parallel, as is illustrated in
For example, “T” [412] is the relevant training bit if the prediction is taken, and “N” [413] is the relevant training bit if the prediction is not taken. To produce bit “T” [412], the threshold value may be added to the current on the negative line. If the prediction “P” [411] is 1 (taken) and the “T” [412] output is 0, which means the negative line (with the threshold value added) is larger than the positive line, then the difference between the positive and negative weights may be less than the threshold value and the predictor should train. Similarly, to produce bit “N” [413], the threshold value may be added to the current on the positive line. If the prediction “P” [411] is 0 (not taken) and the “N” [413] output is 1, which means the positive line (with the threshold value added) is larger than the negative line, then the difference between the negative and positive weights is less than the threshold value and the predictor should likewise train.
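For purposes of illustration only, the following sketch (with assumed names) models the three parallel comparisons and the resulting training decision as boolean logic:

    # Illustrative boolean model of the three parallel comparisons and the
    # training decision described above; names are assumptions for this example.

    def comparator_outputs(positive, negative, threshold):
        """P (411) is the taken/not-taken prediction; T (412) and N (413) are
        the training bits produced with the threshold added to the negative
        and positive lines, respectively."""
        p = positive >= negative
        t = positive > negative + threshold
        n = positive + threshold > negative
        return p, t, n

    def should_train(p, t, n, actual_taken):
        """Train on a misprediction, or when the absolute difference of the
        positive and negative weights is below the threshold, i.e., P = 1
        with T = 0, or P = 0 with N = 1."""
        mispredicted = p != actual_taken
        below_threshold = (p and not t) or ((not p) and n)
        return mispredicted or below_threshold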
In particular,
Disclosed in some examples is a method for providing a branch prediction using at least one analog branch predictor, comprising obtaining at least one current approximation of weights associated with correlations of branches to the branch predictions, and generating the branch predictions based on the at least one current approximation. In other examples, obtaining at least one current approximation comprises selecting a first vector from a table of weights, selecting a second vector from a global history shift register, converting the first and second vectors from a digital format to an analog format, and computing a dot product of the vectors. In further examples, the method may include adding a bias weight to the dot product of the vectors. In other examples, the first vector is selected from a table of weights using a hash function. In still other examples, the first and second vectors are converted using one or more binary current steering digital to analog converters. While in further examples, the dot product of the first and second vectors is obtained using a current summation. In some examples, the method may further comprise converting the dot product using a comparator acting as an analog to digital converter to convert the dot product of the vectors. In other examples, the method may further comprise scaling the vector from the table of weights. In further examples, the scaling is accomplished using a scaling factor according to the equation f(i)=1/(0.1111+0.037i), where i is a position in the first vector, and f(i) is a value representing the scaling factor. In still further examples, the method may additionally comprise updating the vector from the table of weights based on an accuracy of a previous prediction.
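For purposes of illustration only, the following sketch (with assumed names) shows how the example scaling factor f(i)=1/(0.1111+0.037i) recited above might be applied to a weights vector:

    # Illustrative sketch of the example scaling factor applied per position i
    # in the first (weights) vector; names are assumptions for this example.

    def scaling_factor(i):
        # i is a position in the first vector.
        return 1.0 / (0.1111 + 0.037 * i)

    def scale_weights(weights):
        """Scale each weight by f(i); the factor decreases as i grows, so
        lower-numbered positions receive proportionally larger scaling."""
        return [w * scaling_factor(i) for i, w in enumerate(weights)]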
Disclosed in other examples is a processing arrangement which when executing a software program is configured to obtain at least one current approximation of weights associated with correlations of branches to the branch predictions, and generate the branch predictions based on the at least one current approximation. In some examples, the configuration for obtaining at least one current approximation comprises a sub-configuration configured to select a first vector from a table of weights, select a second vector from a global history shift register, convert the first and second vectors from a digital format to an analog format, and compute a dot product of the vectors. In further examples, the arrangement may be further configured to add a bias weight to the dot product of the vectors. In yet further examples, the first vector is selected from a table of weights using a hash function. While in other examples, the first and second vectors are converted using one or more binary current steering digital to analog converters. In still other examples, the dot product of the first and second vectors is obtained using a current summation. While in other examples, the arrangement may be further configured to convert the dot product using a comparator acting as an analog to digital converter to convert the dot product of the vectors. And in other examples, the arrangement may be further configured to update the vector from the table of weights based on an accuracy of a previous prediction.
Disclosed in yet other examples is a computer accessible medium having stored thereon computer executable instructions for branch prediction within an analog branch predictor, wherein when a processing arrangement executes the instructions, the processing arrangement is configured to perform procedures comprising obtaining at least one current approximation of weights associated with correlations of branches to the branch predictions, and generating the branch predictions based on the at least one current approximation. In other examples, obtaining at least one current approximation comprises selecting a first vector from a table of weights, selecting a second vector from a global history shift register, converting the first and second vectors from a digital format to an analog format, and computing a dot product of the vectors.
The present disclosure is not to be limited in terms of the particular examples described in this application, which are intended as illustrations of various aspects. Many modifications and examples can be made without departing from its spirit and scope, as will be apparent to those skilled in the art. Functionally equivalent methods and apparatuses within the scope of the disclosure, in addition to those enumerated herein, will be apparent to those skilled in the art from the foregoing descriptions. Such modifications and examples are intended to fall within the scope of the appended claims. The present disclosure is to be limited only by the terms of the appended claims, along with the full scope of equivalents to which such claims are entitled. It is to be understood that this disclosure is not limited to particular methods, reagents, compounds, compositions or biological systems, which can, of course, vary. It is also to be understood that the terminology used herein is for the purpose of describing particular examples only, and is not intended to be limiting.
With respect to the use of substantially any plural and/or singular terms herein, those having skill in the art can translate from the plural to the singular and/or from the singular to the plural as is appropriate to the context and/or application. The various singular/plural permutations may be expressly set forth herein for sake of clarity.
It will be understood by those within the art that, in general, terms used herein, and especially in the appended claims (e.g., bodies of the appended claims) are generally intended as “open” terms (e.g., the term “including” should be interpreted as “including but not limited to,” the term “having” should be interpreted as “having at least,” the term “includes” should be interpreted as “includes but is not limited to,” etc.). It will be further understood by those within the art that if a specific number of an introduced claim recitation is intended, such an intent will be explicitly recited in the claim, and in the absence of such recitation no such intent is present. For example, as an aid to understanding, the following appended claims may contain usage of the introductory phrases “at least one” and “one or more” to introduce claim recitations. However, the use of such phrases should not be construed to imply that the introduction of a claim recitation by the indefinite articles “a” or “an” limits any particular claim containing such introduced claim recitation to examples containing only one such recitation, even when the same claim includes the introductory phrases “one or more” or “at least one” and indefinite articles such as “a” or “an” (e.g., “a” and/or “an” should be interpreted to mean “at least one” or “one or more”); the same holds true for the use of definite articles used to introduce claim recitations. In addition, even if a specific number of an introduced claim recitation is explicitly recited, those skilled in the art will recognize that such recitation should be interpreted to mean at least the recited number (e.g., the bare recitation of “two recitations,” without other modifiers, means at least two recitations, or two or more recitations). Furthermore, in those instances where a convention analogous to “at least one of A, B, and C, etc.” is used, in general such a construction is intended in the sense one having skill in the art would understand the convention (e.g., “a system having at least one of A, B, and C” would include but not be limited to systems that have A alone, B alone, C alone, A and B together, A and C together, B and C together, and/or A, B, and C together, etc.). In those instances where a convention analogous to “at least one of A, B, or C, etc.” is used, in general such a construction is intended in the sense one having skill in the art would understand the convention (e.g., “a system having at least one of A, B, or C” would include but not be limited to systems that have A alone, B alone, C alone, A and B together, A and C together, B and C together, and/or A, B, and C together, etc.). It will be further understood by those within the art that virtually any disjunctive word and/or phrase presenting two or more alternative terms, whether in the description, claims, or drawings, should be understood to contemplate the possibilities of including one of the terms, either of the terms, or both terms. For example, the phrase “A or B” will be understood to include the possibilities of “A” or “B” or “A and B.”
In addition, where features or aspects of the disclosure are described in terms of Markush groups, those skilled in the art will recognize that the disclosure is also thereby described in terms of any individual member or subgroup of members of the Markush group.
As will be understood by one skilled in the art, for any and all purposes, such as in terms of providing a written description, all ranges disclosed herein also encompass any and all possible subranges and combinations of subranges thereof. Any listed range can be easily recognized as sufficiently describing and enabling the same range being broken down into at least equal halves, thirds, quarters, fifths, tenths, etc. As a non-limiting example, each range discussed herein can be readily broken down into a lower third, middle third and upper third, etc. As will also be understood by one skilled in the art all language such as “up to,” “at least,” “greater than,” “less than,” and the like include the number recited and refer to ranges which can be subsequently broken down into subranges as discussed above. Finally, as will be understood by one skilled in the art, a range includes each individual member. Thus, for example, a group having 1-3 cells or cores refers to groups having 1, 2, or 3 cells or cores. Similarly, a group having 1-5 cells or cores refers to groups having 1, 2, 3, 4, or 5 cells or cores, and so forth.
While various aspects and examples have been disclosed herein, other aspects and examples will be apparent to those skilled in the art. The various aspects and examples disclosed herein are for purposes of illustration and are not intended to be limiting, with the true scope and spirit being indicated by the following claims.
The invention was made with U.S. Government support, at least in part, by the Defense Advanced Research Projects Agency, Grant number F33615-03-C-4106. Thus, the U.S. Government may have certain rights in the invention.