The present invention relates generally to quantum and classical digital superconducting circuits, and more specifically to superconducting array circuits.
Superconducting digital technology has provided computing and/or communications resources that benefit from high speed and low power dissipation. For decades, superconducting digital technology has lacked random-access memory (RAM) with adequate capacity and speed relative to logic circuits. This has been a major obstacle to industrialization for current applications of superconducting technology in telecommunications and signal intelligence, and can be especially forbidding for high-end and quantum computing.
In the field of digital logic, extensive use is made of well-known and highly developed complementary metal-oxide semiconductor (CMOS) technology. CMOS has been implemented in a number of computer systems to provide digital logic capability. As CMOS has begun to approach maturity as a technology, there is an interest in alternatives that may lead to higher performance in terms of speed, power dissipation, computational density, interconnect bandwidth, and the like. An alternative to CMOS technology comprises superconducting circuitry, utilizing superconducting Josephson junctions. Superconducting digital technology is required, as support circuitry in the refrigerated areas (4.2 degrees Kelvin and below), to achieve power and performance goals of quantum computers, among other applications.
Unlike CMOS circuits, which exploit simple wires, the interconnect technologies for connecting superconducting logic circuits are highly constrained, and thus complex to implement. There are essentially two separate circuit solutions for interconnects available to superconducting logic and memory circuits. They are (1) Josephson transmission lines (JTLs) that serve as local interconnects, and (2) passive transmission lines (PTLs) that serve as long-haul (distance) interconnects. A JTL is a superconducting circuit that contains two Josephson junctions, inductors, and a transformer (to couple in its AC operating energy). In modern superconducting processes, the JTL can link logic gates or memory circuits less than approximately 55 μm apart; the JTL link can only be distributed across about four logic gates. Individual JTL delays are about 12 picoseconds (ps). PTLs, on the other hand, include a PTL driver circuit, a passive transmission line (which includes ground shielding), and a PTL receiver. The signal reach of these circuits is about 0.1 mm/ps on the passive transmission line, plus frontend driver and backend receiver delays. Thus, optimal circuit designs for a superconducting system would favor PTL connections over JTL connections, like those found in memories currently under development that have radio frequency (RF)-transmission-line-based read paths (such as Josephson magnetic random-access memory (JMRAM)), although such circuit designs have not been developed to date.
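By way of example only and without limitation, the following Python sketch estimates end-to-end interconnect latency for a chain of JTLs versus a single PTL connection, using the approximate figures cited above; the combined PTL driver/receiver delay is an assumed placeholder value, not a characterized parameter.

```python
# Illustrative latency estimate for superconducting interconnect choices.
# Span, per-stage delay, and velocity figures are the approximate values cited
# above; the driver/receiver delay constant is an assumption for this sketch.
import math

JTL_SPAN_UM = 55.0               # approximate reach of one JTL link (micrometers)
JTL_DELAY_PS = 12.0              # approximate delay of one JTL (picoseconds)
PTL_VELOCITY_UM_PER_PS = 100.0   # ~0.1 mm/ps signal velocity on a PTL
PTL_ENDPOINT_DELAY_PS = 20.0     # assumed combined driver + receiver delay

def jtl_chain_delay_ps(distance_um: float) -> float:
    """Delay of a chain of JTLs covering the given distance."""
    links = math.ceil(distance_um / JTL_SPAN_UM)
    return links * JTL_DELAY_PS

def ptl_delay_ps(distance_um: float) -> float:
    """Delay of a PTL connection covering the given distance."""
    return PTL_ENDPOINT_DELAY_PS + distance_um / PTL_VELOCITY_UM_PER_PS

for d in (100.0, 1000.0, 5000.0):   # distances in micrometers
    print(f"{d:7.0f} um: JTL {jtl_chain_delay_ps(d):7.1f} ps, PTL {ptl_delay_ps(d):6.1f} ps")
```

As the estimates suggest, the PTL's fixed endpoint cost is quickly amortized over longer distances, which is why PTL connections are favored for long-haul paths.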
There is an interest in developing superconducting field-programmable logic arrays (FPLAs) and superconducting field-programmable gate arrays (FPGAs) to serve as programmable controllers and to perform more general-purpose functions for quantum computers. To this end, compelling designs for FPLAs and FPGAs are disclosed in U.S. Pat. No. 9,595,970, by W. Reohr and R. Voigt, and in a paper entitled, “Superconducting Magnetic Field Programmable Gate Array,” by N. Katam et al., IEEE Transactions on Applied Superconductivity, Vol. 28, No. 2, March 2018, respectively, the disclosures of which are incorporated by reference herein in their entirety. Unfortunately, using current state-of-the-art approaches, magnetic Josephson junction (MJJ) technology and other superconducting circuitry will not mature by the time entangled quantum bits (“qubits”) cross the necessary integration threshold to provide measurable performance advantages over CMOS technology.
The present invention, as manifested in one or more embodiments, addresses the above-identified problems with conventional superconducting systems by providing both general and tailored solutions for a variety of quantum computing memory architectures. In this regard, embodiments of the present invention provide a superconducting system that can be configured to function as logic and/or memory. More particularly, one or more embodiments of the invention may be configured to perform operations across multiple subarrays, and regions of a subarray can be used in a “logic” mode of operation while other regions of the subarray may be used in a “memory” mode of operation.
In accordance with one embodiment of the invention, a cache circuit for use in a computing system includes at least one random-access memory (RAM) and at least one directory coupled to the RAM. The RAM includes multiple memory cells configured to store data, comprising operands, operators and instructions. The directory is configured to index locations of operands, operators and/or instructions stored in the RAM. An operator stored in the RAM is configured to perform one or more computations based at least in part on operands retrieved from the RAM, and to compute results as a function of the retrieved operands and inherent states of the memory cells in the RAM. The directory is further configured to confirm that at least one of a requested operand, operator and instruction is stored in the RAM.
The cache circuit may further comprise an address translator, the address translator being controlled by an operating system and configured to translate logical memory to physical memory, and at least one translation lookaside buffer (TLB) operatively coupled to the RAM and the directory, the TLB being configured to store translated addresses generated by the address translator. The operating system controls movement of page data between the cache circuit and a file system operatively coupled to the cache circuit.
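By way of illustration only, the following simplified Python sketch models the behavioral relationship among the RAM, the directory, and the TLB described above; the class and method names are hypothetical and do not correspond to any claimed circuit structure.

```python
# Minimal behavioral sketch of the cache circuit described above: a RAM of cells,
# a directory that indexes what is resident, and a TLB of translated addresses.
class CacheSketch:
    def __init__(self, size: int):
        self.ram = [0] * size      # memory cells: operands, operators, instructions
        self.directory = {}        # logical address -> RAM index (residency index)
        self.tlb = {}              # logical page -> physical page (filled by the OS)

    def translate(self, logical_addr: int, page_size: int = 256):
        """Return a physical address via the TLB; None on a TLB miss."""
        page, offset = divmod(logical_addr, page_size)
        phys_page = self.tlb.get(page)
        return None if phys_page is None else phys_page * page_size + offset

    def lookup(self, logical_addr: int):
        """Directory confirms whether the requested item is resident in the RAM."""
        index = self.directory.get(logical_addr)
        return None if index is None else self.ram[index]

cache = CacheSketch(1024)
cache.tlb[0] = 4                   # OS-installed translation for logical page 0
cache.directory[7] = 42            # item at logical address 7 resides in cell 42
cache.ram[42] = 0xBEEF
print(cache.translate(7), cache.lookup(7))
```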
As the term may be used herein, “facilitating” an action includes performing the action, making the action easier, helping to carry the action out, or causing the action to be performed. Thus, by way of example only and without limitation, in the context of a processor-implemented method, instructions executing on one processor might facilitate an action carried out by instructions executing on a remote processor, by sending appropriate data or commands to cause or aid the action to be performed. For the avoidance of doubt, where an actor facilitates an action by other than performing the action, the action is nevertheless performed by some entity or combination of entities.
One or more embodiments of the invention or elements thereof may be implemented in the form of a computer program product including a computer readable storage medium with computer usable program code for performing the method steps indicated. Furthermore, one or more embodiments of the invention or elements thereof can be implemented in the form of a system (or apparatus) including a memory, and at least one processor that is coupled to the memory and configured to perform the exemplary method steps.
Yet further, in another aspect, one or more embodiments of the invention or elements thereof can be implemented in the form of means for carrying out one or more of the method steps described herein; the means can include (i) hardware module(s), (ii) software module(s) stored in a computer readable storage medium (or multiple such media) and implemented on a hardware processor, or (iii) a combination of (i) and (ii); any of (i)-(iii) implement the specific techniques, or elements thereof, set forth herein.
Techniques of the present invention can provide substantial beneficial technical effects. By way of example only and without limitation, techniques according to embodiments of the invention may provide one or more of the following advantages, among other benefits:
These and other features and advantages of the present invention will become apparent from the following detailed description of illustrative embodiments thereof, which is to be read in connection with the accompanying drawings.
The following drawings are presented by way of example only and without limitation, wherein like reference numerals (when used) indicate corresponding elements throughout the several views, and wherein:
It is to be appreciated that elements in the figures are illustrated for simplicity and clarity. Common but well-understood elements that may be useful or necessary in a commercially feasible embodiment are not necessarily shown in order to facilitate a less hindered view of the illustrated embodiments.
Thus, principles of the present invention, as manifested in one or more embodiments, will be described herein in the context of quantum and classical digital superconducting circuits, and more specifically to various embodiments of array systems configured to support, with programmability, data plane/path and control plane/path for entangled quantum bits (i.e., “qubits”) of a quantum computer. It is to be appreciated, however, that the invention is not limited to the specific devices, circuits, systems and/or methods illustratively shown and described herein. Rather, it will become apparent to those skilled in the art given the teachings herein that numerous modifications to the embodiments shown are contemplated and are within the scope of embodiments of the claimed invention. That is, no limitations with respect to the embodiments shown and described herein are intended or should be inferred.
Hot semiconductor embodiments work for many of the logical elements described herein, so long as the logic requirements for a memory cell, a programmable switch, and a fixed switch (each also described herein and associated with a metamorphosing memory (MM) architecture according to aspects of the present disclosure) are implemented in an underlying array circuit. At present, such a semiconductor circuit with an “OR” (“AND”) read column line may not be area-efficient, so it might not see widespread use.
Including principally both arrays (short for cell arrays) and buses, an MM according to one or more embodiments of the inventive concepts can be implemented to perform logic operations, memory operations, or both logic and memory operations in a superconducting environment, such as in a reciprocal quantum logic (RQL) quantum computing environment. A given array may include a plurality of superconducting array cells, arranged in at least one row and at least one column, which may be configured to selectively (e.g., after programming) perform logic operations and/or memory operations. Outputs of multiple arrays can be logically ORed together, outside each “array,” in accordance with one or more embodiments of the present disclosure. It is to be appreciated that logical “OR” functionality may be implemented using either OR or AND gates, in accordance with known principles, such as De Morgan's theorem. Thus, “AND”-style data flows in array columns and at their outputs for alternative superconducting circuits are also contemplated, rather than “OR”-style data flows principally used throughout the Detailed Description. Such signals will preferably be conveyed through Josephson transmission lines (JTLs) or through passive transmission lines (PTLs), PTL drivers, and/or PTL receivers, given the extended distances that may be covered. A standard JTL may propagate a signal across less than about 120 μm, thus providing only relatively short-range communications.
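By way of example only, the following brief Python sketch illustrates the equivalence, per De Morgan's theorem, between the OR-style column data flow described above and an AND-style alternative operating on complemented cell outputs.

```python
# Sketch of the OR-style column data flow and its AND-style dual per De Morgan's
# theorem: OR(a, b, ...) == NOT AND(NOT a, NOT b, ...).
from functools import reduce

def or_column(cell_outputs):
    """OR-style column: any asserted cell output asserts the column."""
    return reduce(lambda x, y: x or y, cell_outputs, False)

def and_style_column(cell_outputs):
    """Equivalent result using AND gates on complemented cell outputs."""
    return not reduce(lambda x, y: x and y, (not c for c in cell_outputs), True)

outputs = [False, True, False]
assert or_column(outputs) == and_style_column(outputs)
print(or_column(outputs))
```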
In general, microwave signals, such as, for example, single flux quantum (SFQ) pulses, may be used to control the state of a memory cell in a memory array. During read/write operations, word lines (i.e., row lines) and bit lines (i.e., column lines) may be selectively activated by SFQ pulses, or RQL pulses arriving via an address bus and via independent read and write control signals. These pulses may, in turn, control row-line and column-line driver circuits adapted to selectively provide respective word-line and bit-line currents to the relevant memory cells in the memory array.
Many forms of memory are suitable for use in this invention including, but not limited to, RQL-based random-access memory (RAM) and Josephson magnetic random-access memory (JMRAM), among other memory topologies.
For logic operations, at least one input that is associated with a corresponding row may receive respective logic input signal(s), and at least one output that is associated with a respective column may correspond to a logical output based on a predetermined logic operation associated with the superconducting cell array logic circuit system. As described herein, the term “logic signal” (including “logic input signal” and “logic output signal”), with respect to a superconducting cell array logic circuit, is intended to refer broadly to the presence of a signal (e.g., indicative of a logic-1 signal) or the absence of a signal (e.g., indicative of a logic-0 signal). Therefore, in the context of RQL, for example, the term “signal” may describe the presence of at least one SFQ pulse to indicate a first logic state (e.g., logic-1), or the absence of an SFQ pulse to indicate a second logic state (e.g., logic-0). Similarly, in the context of RQL, for example, the at least one pulse may correspond to a positive pulse (e.g., SFQ pulse or fluxon) followed by a negative pulse (e.g., negative SFQ pulse or anti-fluxon).
For memory operations, the memory element (e.g., a magnetic Josephson junction (MJJ)), or collective memory elements, in each of the superconducting array cells in the superconducting cell array logic circuit can store a digital state corresponding to one of a first binary state (e.g., logic-1) or a second binary state (e.g., logic-0) in response to a write-word current and a write-bit current associated with the MJJ. For example, the first binary state may correspond to a positive x-state, in which a superconducting phase is exhibited. As an example, the write-word and write-bit currents can each be provided on an associated (e.g., coupled, such as by a magnetic field, to the MJJ) write-word line (WWL) and an associated write-bit line (WBL), which, in conjunction with one another, can set the logic state of a selected MJJ. As the term is used herein, a “selected” MJJ may be defined as an MJJ designated for writing among a plurality of MJJs by activating current flow in its associated write-bit line. The digital state of a selected MJJ is preferably written by application of a positive or negative current flow within its associated write-bit line (for all known or postulated MJJs except a “toggle” MJJ). Moreover, to prevent the MJJ from being set to an undesired negative x-state, the MJJ may include a directional write element that is configured to generate a directional bias current through the MJJ during a data-write operation. Thus, the MJJ can be forced into a positive x-state to provide a superconducting phase in a predetermined direction.
In addition, the MJJ in each of the JMRAM memory cells in an array can provide an indication of the stored digital state in response to application of a read-word current and a read-bit current. The superconducting phase can thus lower a critical current associated with at least one Josephson junction of each of the JMRAM memory cells of a row in the array. Therefore, the read-bit current and a derivative of the read-word current (e.g., induced by the read-word current flowing through a transformer) can be provided, in combination, (i) to trigger the Josephson junction(s) to change a voltage on an associated read-bit line if the MJJ stores a digital state corresponding to the first binary state, and (ii) not to trigger the Josephson junction(s) to change the voltage on the associated read-bit line if the MJJ stores a digital state corresponding to the second binary state. Thus, the read-bit line may have a voltage present, the magnitude of which varies based on whether the digital state of the MJJ corresponds to the binary logic-1 state or the binary logic-0 state (e.g., between a non-zero and a zero amplitude). As used herein, the term “trigger” with respect to Josephson junctions is intended to refer broadly to a phenomenon of the Josephson junction generating a discrete voltage pulse in response to current flow through the Josephson junction exceeding a prescribed critical current level.
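By way of illustration only, and under assumed (arbitrary) current values, the following Python sketch models the read behavior described above, in which the readout Josephson junction triggers only when the cell is selected and the MJJ's stored state has lowered the junction's effective critical current.

```python
# Behavioral sketch of the JMRAM read operation described above. The current and
# critical-current values are arbitrary illustrative units, not device data.

JJ_CRITICAL_CURRENT = 1.0   # nominal critical current of the readout junction
PHASE_SUPPRESSION = 0.4     # assumed reduction contributed by the MJJ's phase (logic-1)
APPLIED_READ_CURRENT = 0.7  # assumed combined read-word-derived + read-bit current

def read_cell(stores_one: bool, read_word_sel: bool, read_bit_sel: bool) -> bool:
    """Return True if the junction triggers (a voltage appears on the read-bit line)."""
    if not (read_word_sel and read_bit_sel):
        return False                       # unselected cells never trigger
    effective_critical = JJ_CRITICAL_CURRENT - (PHASE_SUPPRESSION if stores_one else 0.0)
    return APPLIED_READ_CURRENT > effective_critical   # triggers only for a stored logic-1

print(read_cell(True, True, True), read_cell(False, True, True))   # True False
```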
Returning to the discussion of logic operations enabled by the MM according to embodiments of the present inventive concept, a predetermined logic operation may be based on a selective coupling of at least one input to at least one corresponding output via the superconducting array cells. Stated another way, the predetermined logic operation can be based on a selective coupling of the at least one input to the superconducting array cells, which are coupled to at least one corresponding output in each respective column.
As described herein, the term “selective coupling” with respect to a given one of the superconducting array cells is intended to refer broadly to a condition of a respective input of a given one of the superconducting array cells being either coupled or decoupled from a respective output of the given one of the superconducting array cells (e.g., via a programmable switch (PS)), or either coupled or decoupled from a respective one of the superconducting array cells. Therefore, for a given array of superconducting array cells, the array of superconducting array cells can have inputs that are selectively coupled to one or more outputs, such that all, some (e.g., a proper subset), or none of the input(s) to the respective superconducting array cells may be coupled to the output(s) via the respective superconducting array cells in a predetermined manner. Accordingly, input(s) that are coupled to the respective output(s) via the respective superconducting cell(s) based on the selective coupling are described herein as “coupled,” and thus the respective coupled superconducting cell(s) provide a corresponding output logic signal in response to a logic input signal. Conversely, input(s) that are not coupled to the respective output(s) via the respective superconducting cell(s) based on the selective coupling are described herein as “uncoupled” or “non-coupled,” and thus the respective uncoupled superconducting cell(s) do not provide a corresponding output logic signal in response to a logic input signal.
The selective coupling of the input(s) to the output(s) via the superconducting array cells (configured as programmable switches) can be beneficially field-programmable in a manner similar to a field-programmable gate array (FPGA). For example, the selective coupling may be based on the presence or absence of an inductive coupling of the input(s) to a superconducting quantum interference device (SQUID) associated with the respective superconducting cell, with the SQUID being coupled to a respective one of the output(s), or on the direct injection of an SFQ pulse or pulse pair (for RQL) to read the state of a non-destructive readout (NDRO) memory cell (see, e.g., U.S. Pat. No. 10,554,207 to Herr, et al., the disclosure of which is incorporated by reference herein in its entirety), etc. Therefore, one or more Josephson junctions associated with the SQUID may trigger in response to a respective logic input signal to provide a respective output signal in a first logic state based on inductive coupling or direct injection, or will not trigger in response to the respective logic input signal(s) to provide the respective output signal in a second logic state based on no inductive coupling or no direct injection.
As another example, each of at least a subset of the superconducting array cells may include a hysteretic magnetic Josephson junction device (HMJJD) configured to store a first magnetic state or a second magnetic state in response to at least one programming signal. Thus, the HMJJD may provide coupling between the respective input(s) and output(s) in the first magnetic state, and may provide non-coupling between the respective input(s) and output(s) in the second magnetic state. Accordingly, providing the programming signal(s) to set the magnetic state of the HMJJD may facilitate field-programmability of the superconducting cell array logic circuit system to set the logic operation.
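By way of example only and without limitation, the following Python sketch models the notion of selective coupling as a matrix of programmed switch states: each column output is the logical OR, over all rows, of the row input ANDed with the corresponding coupling state. The function and variable names are illustrative only.

```python
# Sketch of "selective coupling": a matrix of programmed switch states determines
# which row inputs reach which column outputs. Output j is the OR, over rows i,
# of (inputs[i] AND coupled[i][j]).

def evaluate_array(inputs, coupled):
    rows, cols = len(coupled), len(coupled[0])
    return [any(inputs[i] and coupled[i][j] for i in range(rows)) for j in range(cols)]

# Two inputs, two outputs: output 0 coupled to input 0 only, output 1 to both.
coupled = [[True, True],
           [False, True]]
print(evaluate_array([True, False], coupled))   # [True, True]
print(evaluate_array([False, True], coupled))   # [False, True]
```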
With reference to
Each superconducting cell (MC/PS/FS) 104 may be connected to a corresponding input of a column OR gate 106 that serves as a bit line or data line. Each of at least a subset of the superconducting cells 104 (MC/PS/FS) may be configured to implement the functionality of the OR gate 106, in whole or in part; that is, although shown as a separate block, it is to be understood that at least a portion of the function(s) of the column OR gate 106 may be integrated into the superconducting cells 104. In the illustrative embodiment depicted in
When using a memory cell as described in U.S. Pat. No. 10,554,207 to Anna Herr, et al. (which defines an RQL-read-path NDRO memory cell) in the superconducting array 102A, 102B, a two-input RQL OR gate would be included as part of the memory cell to form, when connected in series with the respective two-input RQL OR gates of other cells in the array, the multiple-input OR gate 106 (for bit line and/or data line) shown in
With continued reference to
In a mixed memory and logic operation or in a multiple-read memory operation, the output bus circuit 108 is preferably configured to perform bit-wise (i.e., bit-by-bit), or whole data field, combinations of a memory operation, first logic operation and/or second logic operation using OR logic (e.g., OR gate 110). One skilled in the relevant art, given the teachings herein, will appreciate the various possibilities of memory, logic and mixed-mode (memory and logic) operations that can be achieved using the MM 100, in accordance with embodiments of the inventive concept. In fact, ORing the results of one or more superconducting cells 104 among the superconducting array 102A, 102B configured as a programmable switch (PS), configured as a memory cell (MC), and/or configured as a fixed switch (FS), can be done within each array 102A, 102B, or, in yet another embodiment, among multiple arrays, such as through array enable signals, which can preferably be propagated, in the exemplary MM 100, by “WoDa” signals (
The output bus circuit 108 may, in one or more embodiments, include a plurality of OR gates 110 for combining data outputs of the arrays 102A, 102B, and a plurality of electable inverters 112 for enabling a realization of a comprehensive Boolean logic function in two passes through the MM 100, as will be discussed in further detail with respect to
Many of the FIGS. depict illustrative embodiments of the present invention that are directed to memory read operations, which may be enabled by selection of a single row of memory cells, and to logic operations, which may be enabled by rows of programmed switches that are invoked in a “none,” “some,” or “all” multi-row select read operation. Given this purpose, and for enhanced clarity of the description, all of the FIGS. (as interpreted by the inventors) notably exclude independent write address ports, write word lines (WWLs), and write bit lines (WBLs) of the arrays 102A, 102B, as will become apparent to those skilled in the art. It is important to note that superconducting read and write ports are typically separate even at the level of a memory cell circuit. Unlike 6-transistor (6-T) static random-access memory (SRAM), dynamic random-access memory (DRAM), and flash memories, among other CMOS-based memories, most, if not all, superconducting memory cells feature independent read and write ports.
For superconducting memories or programmable logic arrays (PLAs), a write operation configured to program a row of programmable switches (i.e., superconducting array cells (MC/PS/FS) 104 configured as programmable switches) can be performed in a manner similar to a write operation to set the state of a row of memory cells (i.e., superconducting array cells (MC/PS/FS) 104 configured as memory cells). Likewise, a column line (e.g., bit line, data line) oriented write operation can be performed to enable a match array of a content addressable memory (CAM), as known in the art. (See, e.g., U.S. Pat. No. 9,613,699, Apr. 4, 2017, to W. Reohr et al.).
One or more embodiments of the invention contemplate that certain artificial intelligence (AI) operations may gain an advantage if this architecture permits updates of/to at least a subset of programmable switches during a program or an algorithm runtime. Thus, considering that such updates can be desirable, there is essentially no formal difference between the timing of write operations directed to memory cells and programmable switches during the runtime of a program or an algorithm. On the other hand, a different style of writing, which exploits shift registers, can be used to program programmable switches (PSs) if these switches are programmed in advance of program runtime.
The MM 100 can further include proximate (i.e., local) decode logic 114 configured to direct data input or memory word line selection signals to a particular (i.e., selected) subset of rows within an array 102A, 102B, which is defined herein as a “region” (one of which is highlighted by underlying rectangle/region 132). Such row-oriented functionality is emphasized by input signals labeled with the prefix “WoDa,” where the “Wo” portion of the prefix represents a word line of a memory, and where the “Da” portion of the prefix represents data input to a logic array. In the exemplary MM 100 shown in
More particularly, in the array 102A shown in
Similarly, the second array 102B shown in
More particularly, in the array 102B shown in
Although a specific arrangement of logic gates is shown in
In the decode logic 114, 144 of the exemplary MM 100, the true signal feeds the AND gates 116 directly. For generation of the complement signals, the true signal is fed into an inverting input of the AnotB gates 118. A logic function made available within the family of RQL circuits, the AnotB gates 118 can be implemented, in one or more embodiments, as AND gates with their second (B) input inverted. As noted in
Row activation signals for a memory read operation and/or for logic operations (i.e., row activation signals directly propagating on row line <1:N>_A1, row line <1:N>_A2, row line <1:N>_B1, and row line <1:N>_B2) emerge from the AND gates 116 and AnotB gates 118 of the decode logic 114, 144. When activated by such row activation signal(s), selected superconducting memory cells 104 in a given row deliver their respective data states (either a “1-state” or a “0-state”) to corresponding OR gates 106, which serve, at least conceptually speaking, as bit lines (as the term is often used for conventional memory operations and architectures) and output data lines (as the term is used for conventional logic operations in the art of PLA design) of the arrays 102A, 102B. These bit lines and data lines preferably feed into the output bus circuit 108, which can operatively merge results from a plurality of operations applied to the arrays 102A, 102B and deliver these results to an intended location, as will be described in further detail in conjunction with
More formally within embodiments of the present disclosure, a “region” may be defined as a set of rows partially enabled by the “partial” address feeding decode logic 114, 144. Row activation requires simultaneous “region” and “WordData” (WoDa<#>) activation. A region in a given array 102A, 102B includes at least one row line.
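By way of illustration only, the following Python sketch models row activation as the logical AND of a region enable (derived from the “partial” address) and the corresponding WoDa bit, with an AnotB primitive shown for complement generation; the region size and signal names are assumptions for the example only.

```python
# Illustrative row-activation model for the decode logic: a row is activated only
# when its region is enabled by the "partial" address AND its WoDa bit is asserted.
# The region size (4 rows) and 16-row array depth are assumptions for this example.

def a_not_b(a: bool, b: bool) -> bool:
    """AnotB primitive: AND with the second (B) input inverted (complement select)."""
    return a and not b

def region_rows(partial_addr: int, rows_per_region: int) -> range:
    start = partial_addr * rows_per_region
    return range(start, start + rows_per_region)

def row_activations(partial_addr: int, rows_per_region: int, woda: list, total_rows: int) -> list:
    enabled = set(region_rows(partial_addr, rows_per_region))
    return [(r in enabled) and woda[r] for r in range(total_rows)]

woda = [True, False, True, False] + [False] * 12   # WoDa<1:16>; only rows 1 and 3 asserted
print(row_activations(partial_addr=0, rows_per_region=4, woda=woda, total_rows=16))
print(a_not_b(True, False))                         # complement-path example: True
```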
While the decode logic 114, 144 is shown as directly driving the row lines (i.e., row line <1:N>_A1, row line <1:N>_A2, row line <1:N>_B1, and row line <1:N>_B2) in
By way of example only and without limitation, for a 16-deep memory array addressed with what would be a maximum of 4 encoded address bits, possible combinations of encoded partial address bits (which may define a region) and WordData (WoDa) bits are indicated in the following table:
It is also important to recognize that column-oriented address logic (e.g., multiplexors), serving to enable a second form of address decoding in arrays, can be implemented by array output circuit(s) 119, which may be integrated in each array 102A, 102B, in some embodiments, or may reside external to the arrays in other embodiments. These array output circuits 119 may be configured to select a set of columns of data from one or more arrays 102A, 102B for output, as will be discussed in further detail herein below with respect to illustrative bit slice output dataflows of
Table 2, which has three columns, describes dynamic decoding to select more than one region. Two regions having four rows each would double the width of the maximum PLA column-based OR to an eight-input OR. For example, for that same exemplary 16-deep memory array, the possible combinations of encoded address bits (notably not fully defining a region), region enables (in the example of Table 2, there are two), and WordData bits are indicated in the following table:
Region enable bits in combination with encoded address bits define a region. The choice to exploit one or more regions for a logic operation (an operator function) depends on the size of the OR logic needed to perform a particular function. Moving forward to discuss the principal/predominant data flow logic in a system containing both control and data flow, it is important to note that while the illustrative MM 100 exploits OR-based logic for which non-controlling inputs are “0s,” it is similarly contemplated by embodiments of the invention that AND-based logic could be used in its place, in whole or in part, as will be understood by those skilled in the art. From a practical standpoint for RQL, it is advantageous to use an OR-based approach, which results in significant power savings given that a logic “0,” the non-controlling signal, consumes extremely little energy. In the case of AND-based logic, a non-controlling logic “1” propagates flux quantum pairs 180 degrees out of phase with each other with respect to the AC clocks of RQL (a flux quantum first and an anti-flux quantum second). These two flux quanta consume substantial energy.
The MM 100 can further include one or more array-output circuits 119 operatively coupled between OR gate(s) 106 of array(s) 102 and OR gate(s) 110 of output bus circuit 108. The array-output circuits 119 are preferably proximate to their corresponding array(s) 102. These array-output circuits 119, in one or more embodiments, can be adapted in the following manner (which will be discussed with reference to the schematic of
For possible adaptation (5), a rich set of instructions (e.g., programmable instructions) could be devised so that logic (memory or mixed logic and memory) operations preferably among neighboring outputs (physically adjacent outputs) of the input terms WoDa<1:N>_A and WoDa<1:N>_B could be executed. Note that WoDa<1:N>_A and WoDa<1:N>_B feed Array_A and Array_B, respectively. It is to be understood that the outputs of either array contain within-array-programmed ORing of WoDa<1:N>_A or of WoDa<1:N>_B. In a first combined output data field, only odd integers of the data output of Array_A are enabled, and even integers of the data output of Array_B are enabled, or, in a second alternative combined output data field, vice versa.
As will be understood, these embodiments recognize the criticality (crucial nature) of (i) proximate signals and (ii) regular data flows in realizing logic operations in superconducting circuits, especially given the limited reach or bandwidth capabilities (i.e., JTLs or PTLs, respectively) of superconducting interconnect in comparison to CMOS interconnect. Increasing the temporal and spatial confluence of various sets of signals along the output bus circuit 108, and into various processing units, assures a greater variety of nearest neighbor computations. In the example offered, the output can have twice as many input terms. For this example, the twice-as-many input terms are WoDa<1:N>_A and WoDa<1:N>_B.
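By way of example only, the following Python sketch illustrates the first combined output data field described above, in which odd-numbered outputs are taken from Array_A and even-numbered outputs from Array_B (the second alternative simply swaps the roles).

```python
# Sketch of the first combined output data field: odd-indexed outputs come from
# Array_A and even-indexed outputs from Array_B. Indexing is 1-based to match
# the <1:M> notation used in the text.

def combine_outputs(array_a_out, array_b_out):
    combined = []
    for k in range(1, len(array_a_out) + 1):
        combined.append(array_a_out[k - 1] if k % 2 == 1 else array_b_out[k - 1])
    return combined

print(combine_outputs([1, 1, 1, 1], [0, 0, 0, 0]))   # [1, 0, 1, 0]
```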
As will be discussed with respect to
It should be understood that branches, a redirection of present and future computation passing through the data flow, can be based on a number/text stored in memory, data just processed, and/or data stored in a register as understood by those skilled in the art.
When mixed logic and memory operations are conducted among arrays 102, certain portions of the field can be disabled. In one or more embodiments, AND gates can be inserted along with associated enables to elect a fraction of the data from an array 102 or set of arrays to be propagated in the output bus circuit 108 before OR gates 110. To support an understanding of these aforementioned capabilities of the array-output circuit(s) 119, the function and exemplary implementation of such array-output circuit(s) 119 will be described in greater detail in conjunction with
After understanding the detailed principles of an array 102 with a decode structure designed to support fungible circuits for memory, logic or mixed mode operations, it is now beneficial to speak broadly about array alternatives. Fungible arrays require array cells that can serve as either memory cells (MC) or programmable switches (PS). While this capability represents some embodiments of the present invention, it is important to recognize that the capabilities described in the present invention can be applied more generally to superconducting arrays, such as Josephson fixed logic arrays (JFLA). For example, other superconducting array cells 104 of these JFLAs form fixed switches (FS), which are advantageously more compact than programmable switches (PS).
RQL-read-path-NDRO array cells (represented by array cell 104) can be, for example, NDRO memory cells described in Burnett, R. et al., “Demonstration of Superconducting Memory for an RQL CPU,” Proceedings of the Fourth International Symposium on Memory Systems (2018), the disclosure of which is incorporated by reference herein in its entirety. Generally, these array cells 104 are meant to be representative of almost all other memory cells, for example, formed from ERSFQ or RQL circuit families. The name “RQL-read-path NDRO array cells” is thus intended herein to denote representative capabilities of these other memory cells, in terms of their potential bandwidths and latencies (which are abstracted in this detailed description), but it is not to be construed as limiting.
RF-read-path array cells (represented by array cell 104) can be, for example, NDRO memory cells for JMRAM, PRAM (passive random-access memory as known in the art), or JLA. It is important to recognize that not all array cells 104 are memory cells; some array cells, when selected, couple a specified Boolean state (a “0-state” or a “1-state”) to the OR gate 106; a particular variety of them cannot be written; they are what is known as mask-programmable array cells (a.k.a. fixed switches, abbreviated “FS”). The name “RF-read-path array cells” is thus intended herein to denote representative performance capabilities of these array cells 104, as well as structural aspects of their read port circuitry, whether they are memory cells, field-programmable superconducting array cells (actually a memory cell itself), or a superconducting cell retaining a fixed state (a.k.a. mask-programmed array cell), in terms of their potential bandwidths and latencies (which are abstracted in this detailed description). The term “RF-read-path” is not to be construed as limiting. With respect to RQL-read-path-NDRO array cells 104, RF-read-path array cells 104 can advantageously provide lower latency, but can have lower bandwidth (typically depending on design), principally because multiple flux quanta (signals) are propagated and consumed on microwave lines (“RF” lines). Multi-cycle recovery times are required by flux pumps to restore the multiple flux quanta required for a subsequent operation. Known flux pumps are single-flux-quantum capable, generating one flux quantum per cycle and storing it in a superconducting loop for future use.
Named herein are two alternative PLAs, the Josephson PLA and the Josephson magnetic PLA, which will be referred to respectively as “JPLA” and “JMPLA” (following the JMRAM naming convention of Josephson magnetic RAM). A JPLA is formed exclusively with Josephson superconducting array cells 104. Likewise, a JMPLA is formed exclusively with MJJ-based superconducting array cells 104. While the MM 100 populated exclusively with JPLA and/or JMPLA arrays 102 may or may not facilitate a random-access memory function as known in the art, it is still contemplated as a unique embodiment of the invention, as are exclusive memory embodiments and mixed memory and PLA embodiments. Additionally, the mixed memory and PLA embodiments can be further subdivided into (i) those in which the size of the PLAs and memories are fixed and (ii) those in which the size of the PLAs and memories can be programmed. For (ii), programming allots a subset of a fixed number of superconducting array cells 104 to PLAs and the remainder of array cells 104 to memory. It may be best to use the acronym “JPLARAM” to broadly refer to these fungible arrays 102 in which rows can be assigned to memory or logic functions as previously discussed.
The MM 100 can include a plurality of arrays 102, wherein the arrays can be of different types. Thus, the MM 100 can have arrays containing at least one of RQL-read-path NDRO array cells (can be memory or logic cells), RF-read-path NDRO array cells (can be memory or logic cells), RF-read-path magnetic array cells (can be memory or logic cells, which form, e.g., JMRAM), RF-read-path magnetic superconducting array cells (which form, e.g., JMPLA), and RF-read-path superconducting array cells (which form, e.g., JPLA). In other words, any known superconducting logic array or memory array can be incorporated into the MM 100, exclusively, or with other types of arrays.
Moreover, FPGA gates can be integrated into the MM 100. In particular, they can be incorporated into the decode logic 114 for a non-trivial decode at the front end of the MM 100, can be included in the array-output circuits 119 for a non-trivial-data-flow modification of the output data of the arrays 102, and can be included in the output bus circuit 108 at the back end of the MM 100 for, for example, (i) routing data to a unique set of entities and/or (ii) recording data bits in a bus.
In another embodiment, an individual array 102 can include different array cells 104 within it, so long as these cells conform to the same read path type (e.g., RQL-read-path or RF-read-path). For example, JPLA array cells 104 and JMPLA array cells 104 can be combined within an array in the following fashions: (i) in rows; (ii) in columns; or (iii) intermingled among rows and columns. A potential key for functionality of such a heterogeneous array cell approach can be to ensure substantially similar read-path properties (or elements) in columns and rows. In this embodiment, it should be noted that JLA array cells 104 do not have write ports, while JPLA array cells do. In other words, this embodiment principally concerns the read ports of the array cells 104.
Now concerning the JMPLA in particular: while the write time of the array cells 104 configured in the JMPLA, and of its write circuits, is long and power hungry, and thus may preclude these cells from forming a high-performance “memory” array 102, these array cells 104 can still form an important class of circuits known as a “field programmable logic” array 102.
Finally, to prepare for a discussion with respect to
It is important to note that if the decode logic 114 is included in the illustrative MM 100 shown in
The level of each signal in the timing diagram 200 indicates the Boolean value of that signal. An “x” (cross hatch) indicates that the Boolean value of the signal can be either a logic “0” or a logic “1.” Ignoring the presence of the electable inverter (e.g., 112 in
With continued reference to
Out of this subset of programmable switches, active programmable switches (i.e., active superconducting array cells (MC/PS/FS) 104) are preprogrammed (i.e., written) to store “1-states” so that, when activated by a row activation operation (a read operation of the superconducting array cells (MC/PS/FS) 104 in a given row), they couple their logic input/selection signals, WoDa<1:N>_A, to the inputs of corresponding OR gate(s) 106. The logic input signals selected by the active programmable switches can thus be transformed by operation of the OR gate(s) 106 (and representative OR gate(s) 110) and produce the resultant signals Data<1:M>_for_logic_OPS, which notably are generated within array A 102 exclusively. Since all WoDa<1:N>_B are “0s,” the outputs of array B 102 are “0s,” and thus they do not control the output of OR gates 110, as will be understood by those skilled in the art. It should be noted that the broader generalized Boolean capability of the MM 100 (in its capacity to provide a logic operation) will be discussed in greater detail with respect to
Before describing the set of input signals 204, a simple example is useful in illustrating the programming of a subset of the superconducting array cells 104 functioning as programmable switches (MC/PS/FS) associated with OR gates 106. Input signals 202 define both the data input (WoDa<1:N>_A) and the location/set of the programmable switches (MC/PS 104) used (controlled by a combination of the signals WoDa<1:N>_A and the exemplary 1-bit encoded “partial” address for Array A). For example, suppose WoDa<1> is assigned to logic signal E, WoDa<2> is assigned to logic signal E_not/bar, WoDa<3> is assigned to logic signal F, and WoDa<4> is assigned to logic signal F_not/bar. To set Datum<1>_for_logic_OPS equal to the OR of E and F_not/bar, the superconducting array cells (MC/PS/FS) 104 1_1_A1 and 4_1_A1 are programmed to a “1-state.” The superconducting array cells (MC/PS/FS) 104 2_1_A1 and 3_1_A1, associated with the same column in the MM 100, are programmed to a “0-state” so that their connections to OR gate 106, corresponding to Datum<1>_for_logic_OPS, are disabled.
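By way of illustration only, the following Python sketch verifies the programming example above: with WoDa<1:4>_A carrying E, E_not, F and F_not, and with cells 1_1_A1 and 4_1_A1 programmed to a “1-state,” the column OR produces Datum<1> = E OR F_not for all input combinations.

```python
# Sketch of the programming example above: WoDa<1>=E, WoDa<2>=E_not, WoDa<3>=F,
# WoDa<4>=F_not. Programming cells 1_1_A1 and 4_1_A1 to "1" makes Datum<1> = E OR F_not.

def column_output(woda, programmed):
    """OR of every WoDa input whose programmable switch stores a '1-state'."""
    return any(w and p for w, p in zip(woda, programmed))

programmed_col1 = [True, False, False, True]        # cells 1_1_A1 .. 4_1_A1

for E in (False, True):
    for F in (False, True):
        woda = [E, not E, F, not F]                 # WoDa<1:4>_A
        datum1 = column_output(woda, programmed_col1)
        assert datum1 == (E or not F)
print("Datum<1> = E OR F_not verified for all input combinations")
```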
With continued reference now to the illustrative timing diagram 200 shown in
Out of this subset of programmable switches (superconducting array cells 104), active programmable switches (MC/PS 104) have been preprogrammed (i.e., written) to store “1-states” so that, when activated by a row activation operation (a read operation of the MC/PSs 104 in a given row), they couple their logic input/selection signals, WoDa<1:N>_A, to the inputs of OR gate(s) 106. The logic input signals selected by the active programmable switches can thus be transformed by the logical operation of the OR gate(s) 106 and produce the resultant signals Data<1:M>_for_logic_OPS, which notably are generated within array A 102 exclusively.
Since all WoDa<1:N>_B signals are logic “0s,” the outputs of array B 102 are logic “0s,” and thus they do not control the output of OR gates 110, as will be understood by those skilled in the art. As previously stated, the broader generalized Boolean capability of the MM 100 (in its capacity to provide a logic operation) will be discussed in further detail with respect to
A third set of input signals 206 configures the MM 100 to perform logic functions that are designated by at least a subset of the superconducting array cells (MC/PS/FS) 104 configured as programmable switches of array B 102. More specifically, the third set of input signals 206 drives an OR-based logic operation defined by the “programmed” state of a subset of the programmable switches (superconducting array cells (MC/PS/FS) 104) in array B 102, where the active superconducting array cells (MC/PS/FS) 104, which participate in the logic operation, are located in the active rows defined by at least one “partial” address input to the decode logic 144. Potentially active programmable switches (superconducting array cells (MC/PS/FS) 104) in the logic operation notably have the suffix “B1.” The “B1” set of programmable switches (superconducting array cells (MC/PS/FS) 104) include cells in row 1, labeled 1_1_B1 through 1_M_B1, cells in in-between rows, and cells in row N, labeled as N_1_B1 through N_M_B1.
Out of this set of programmable switches (superconducting array cells (MC/PS/FS) 104), active programmable switches (superconducting array cells (MC/PS/FS) 104) are preprogrammed (i.e., written) to store “1-states” so that, when activated by a row activation operation (a read operation of the MC/PSs 104 in a given row), they couple their logic input/selection signals, WoDa<1:N>_B, to the inputs of corresponding OR gate(s) 106. The logic input signals selected by the active programmable switches (active subset of superconducting array cells (MC/PS/FS) 104) can thus be transformed by the logical operation of the OR gate(s) 106 and produce the resultant signals Data<1:M>_for_logic_OPS, which notably are generated within array B 102 exclusively.
Since all WoDa<1:N>_A signals are logic “0s,” the outputs of array A 102 are logic “0s,” and thus they do not control the output of the OR gates 110, as will be understood by those skilled in the art. As previously stated, the broader generalized Boolean capability of the MM 100 (in its capacity to provide a logic operation) will be discussed in further detail in conjunction with
A fourth set of input signals 208 configures the MM 100 to perform logic functions that are designated by at least a subset of the superconducting array cells (MC/PS/FS) 104 configured as programmable switches of array B 102. More specifically, the fourth set of input signals 208 drives an OR-based logic operation defined by the “programmed” state of a subset of programmable switches (superconducting array cells (MC/PS/FS) 104) in array B 102, where the active MC/PSs 104, which participate in the logic operation, are located in the active rows defined by at least one “partial” address input to the decode logic 144. Potentially active superconducting array cells (MC/PS/FS) 104 configured as programmable switches in the logic operation notably have the suffix “B2.” The “B2” set of programmable switches (superconducting array cells (MC/PS/FS) 104) include cells in row 1 being 1_1_B2 through 1_M_B2, cells in in-between rows, and cells in row N being N_1_B2 through N_M_B2.
Out of this set of programmable switches (superconducting array cells (MC/PS/FS) 104), active programmable switches (MC/PS 104) are preprogrammed (i.e., written) to store “1-states” so that, when activated by a row activation operation (a read operation of the MC/PSs 104 in a given row), they couple their logic input/selection signals, WoDa<1:N>_B, to the inputs of OR gate(s) 106. The logic input signals selected by the active programmable switches (subset of superconducting array cells (MC/PS/FS) 104) can thus be transformed by the logical operation of the OR gate(s) 106 and produce the resultant signals Data<1:M>_for_logic_OPS, which notably are generated within array B 102 exclusively.
Since all WoDa<1:N>_A signals are logic “0s,” the outputs of array A 102 are logic “0s,” and thus they do not control the output of the OR gates 110, as will become apparent to those skilled in the art. As previously stated, the broader generalized Boolean capability of the MM 100 (in its capacity to provide a logic operation) will be described in further detail in conjunction with
With continued reference to
Since this logic operation involves programmable switches (superconducting array cells (MC/PS/FS) 104) from two arrays, potentially active programmable switches in the logic operation notably have the suffixes “A1” and “B1.” The “A1” and “B1” set of programmable switches (superconducting array cells (MC/PS/FS) 104) include: (i) for array A 102, cells in row 1 labeled 1_1_A1 through 1_M_A1, cells in in-between rows, and cells in row N labeled N_1_A1 through N_M_A1; and (ii) for array B 102, cells in row 1 labeled 1_1_B1 through 1_M_B1, cells in in-between rows, and cells in row N labeled N_1_B1 through N_M_B1.
Out of this set of programmable switches (superconducting array cells (MC/PS/FS) 104), active programmable switches (MC/PS 104) are preprogrammed (i.e., written) to store “1-states” so that, when activated by a row activation operation (a read operation of the superconducting array cells (MC/PS/FS) 104 in a given row), they couple their logic input/selection signals, WoDa<1:N>_A and WoDa<1:N>_B, to the inputs of their corresponding OR gates 106. The logic input signals selected by the active programmable switches (subset of superconducting array cells (MC/PS/FS) 104) can thus be transformed by the logical operation of the OR gates 106 and 110, and produce the resultant signals Data<1:M>_for_logic_OPS. The resultant operation notably depends on the respective programmable switches (MC/PSs 104) in arrays A and B 102. The broader generalized Boolean capability of the MM 100 (in its capacity to provide a logic operation) will be described in further detail herein with reference to
A sixth set of input signals 212 in the illustrative timing diagram 200 configures the MM 100 to perform logic functions that are designated by at least a subset of the superconducting array cells (MC/PS/FS) 104 configured as programmable switches in respective arrays A and B 102. Specifically, the sixth set of input signals 212 drives an OR-based logic operation defined by the “programmed” state of respective subsets of programmable switches (i.e., superconducting array cells (MC/PS/FS) 104 configured as programmable switches) in arrays A and B 102, where the active MC/PSs 104, which participate in the logic operation, are located in the active rows defined by at least two “partial” address inputs to the decode logic elements 114, 144 of the respective arrays 102.
Since this logic operation involves programmable switches (subset of superconducting array cells (MC/PS/FS) 104) from two arrays 102, potentially active programmable switches in the logic operation notably have the suffixes “A1” and “B2.” The respective “A1” and “B2” subsets of programmable switches (superconducting array cells (MC/PS/FS) 104) include: (i) for array A 102, cells in row 1 labeled 1_1_A1 through 1_M_A1, cells in in-between rows, and cells in row N labeled N_1_A1 through N_M_A1; and (ii) for array B 102, cells in row 1 labeled 1_1_B2 through 1_M_B2, cells in in-between rows, and cells in row N labeled N_1_B2 through N_M_B2.
Out of this set of programmable switches (superconducting array cells (MC/PS/FS) 104), active programmable switches (MC/PS 104) are preprogrammed (i.e., written) to store “1-states” so that, when activated by a row activation operation (a read operation of the MC/PSs 104 in a given row), they couple their respective logic input/selection signals, WoDa<1:N>_A and WoDa<1:N>_B, to the inputs of their corresponding OR gates 106. The logic input signals selected by the active programmable switches (subset of superconducting array cells (MC/PS/FS) 104) can thus be transformed by the logical operation of the OR gates 106 and 110, and produce the resultant signals Data<1:M>_for_logic_OPS. The resultant operation notably depends on the programmable switches (superconducting array cells (MC/PS/FS) 104) in arrays A and B 102. The broader generalized Boolean capability of the MM 100 (in its capacity to provide a logic operation) will be described in further detail herein below.
The sets of input signals 202 through 212 in the illustrative timing diagram 200 were used to configure the superconducting array cells (MC/PS/FS) 104 in the MM 100 to perform logic operations. As previously stated, at least a portion of the superconducting array cells (MC/PS/FS) 104 may be configured to perform memory operations. In this regard, a seventh set of input signals 214 configures at least a subset of the superconducting array cells (MC/PS/FS) 104 in the MM 100 to perform memory operations. More particularly, the set of input signals 214 drives a memory operation in array A 102.
Active superconducting array cells (MC/PS/FS) 104 configured as memory cells, which source read data, are located in an active row defined by a combination of signal WoDa<1>_A, being a “1-state,” and the “partial” address input to the decode logic elements 114, being a “0-state.” WoDa<1>_A is intended to be exemplary of one selected row within a subset of rows, WoDa<1:N>_A, wherein all unselected rows WoDa<2:N>_A are in a “0-state,” associated with a particular decoded address. In this scenario, active memory cells are 1_1_A1 through 1_M_A1. Given that no other rows are selected, the outputs of the active memory cells (subset of superconducting array cells (MC/PS/FS) 104) pass through OR gates 106 of array A 102 and on through OR gates 110, and produce resultant signals Data<1:M>_for_memory_OPS.
Similarly, an eighth set of input signals 216 in the illustrative timing diagram 200 shown in
Entities 100, 304, 306, 308, and 310 can all be implemented with reciprocal quantum logic technologies, as known in the art, and associated RAMs (herein referred to as RQL-based RAMs).
The MM 100 can assist in (i) the synthesis of qubit control signals (e.g., potentially providing oversight, operators, and/or acquired test data for signal synthesis), in (ii) the generation of digital signals from qubit readout signals, in (iii) the processing of syndromes, and in (iv) the classical computations (classical subroutines) of an algorithm. For the control signal generation (i) in particular, far more energy-efficient waveform storage can be realized by superconducting circuits than by semiconductor circuits, at the expense of die/chip area with respect to SRAM: the number of word array tokens X equals 2^Y, where Y is the number of bits of the binary integer encoding an instruction. Each instruction is weighted for a particular set of qubit parameters.
No particular diagram is needed to discuss briefly how the present invention fits into the state of the art of qubits, which can be understood by considering the controls and applications that might be useful on a quantum computer. A brief overview of qubits indicates possible roles for the present invention and, moreover, makes some command execution steps clearer, as known in the art.
In a quantum computer, generally the following two actions are desired: turning on and off coupling between qubits, and applying excitations to individual qubits. Applied correctly, these actions can effect qubit behavior as a (quantum) logic gate on one, two, or more qubits. These operations/actions can be executed by properly applying a voltage tuned to the qubit(s) in question; exactly what this voltage does isn't too important for the present discussion. Unlike regular logic, this voltage is more akin to an analog signal than to a Boolean quasi-DC voltage. The more accurate the voltage value, the fewer corrections need to be applied. What adds complexity is that every qubit is fabricated slightly differently, and thus is different, given its unique analog physical characteristics. Fortunately, the target voltage signal waveform for each physical qubit doesn't change within any cool down. There are ways to approximate an optimal voltage signal waveform for each physical qubit. Simply put, a system can “guess” in the form of a well-defined signal, and extract the individual device parameters from the measurement, potentially repeating this process as a refinement until the desired level of optimization is reached.
It is evident where a local data cache (or instruction word memory) might be useful. Localized control over voltages is needed to apply a near-perfect, uniform voltage, selectively turning it on and off when and where needed for specific qubits. If the Boolean data defining the voltage signal waveform were retrieved from room temperature every time, that would at best make the whole process much less energy efficient (causing heat injection into the dilution refrigerator ... phonon noise). However, if the voltage values for the voltage waveforms are advantageously obtained locally in the “low temperature” area (4.2 Kelvin cold space), the number of data connections and I/O data streams can be significantly decreased, making the system significantly more power efficient.
In other words, if voltage values (e.g., 8-bit digital-to-analog converter (DAC) inputs) can be stored locally, for use at a certain time and place, the values can be pulled from the local storage. This is more akin to local data storage than anything else. In this case, one instance of an MM 100 can be used to define signals to control a set of qubits. Any solution for building waveforms and signals, which will be tailored to the system at hand and its underlying architecture, will benefit from in-memory logic operations that can transform a set of values and relationships into more individual and specific inputs to a general waveform generator.
A few common applications of a logic array, suitable to different levels of overall system complexity, include decompression of data, dynamic remote control, and automated system repair. None of these possible uses is exclusive to a system of any particular size; rather, these examples highlight the utility of a programmable logic array that can adapt to a system of any size, capabilities, and constraints.
A particularly simple application of a PLA would be to decompress stored (or incoming) data. A well-designed control system will not be expecting a random and unpredictable set of inputs but an organized one with observable patterns. Any stream with a pattern can exploit the redundancy to reduce superfluous bits and/or replace predictable data with a check bit or checksum to ensure data integrity. For example, an array of logical qubits may all have an average additional offset in a voltage of the same amount. Decomposing the explicit values into an average and an offset from the average may require fewer data storage resources. A sufficiently large system will see a reduction in resources needed when the PLA that decompresses data is smaller than the memory resources needed to store all values individually and explicitly. This concept can be applied to both locally stored data and data streamed in through a room-temperature connection.
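The following is a minimal Python sketch of the average-plus-offset decomposition described above. It is illustrative only: the function names, the 8-bit DAC codes, and the assumed 4-bit signed offset field are not taken from the disclosure.

```python
# Hypothetical sketch: decompose explicit qubit bias values into a shared
# average plus small per-qubit offsets, so fewer bits need to be stored or
# streamed from room temperature. Names and bit widths are illustrative only.

def compress(values, offset_bits=4):
    """Return (average, offsets) where each offset fits in offset_bits (signed)."""
    avg = round(sum(values) / len(values))
    offsets = [v - avg for v in values]
    limit = 2 ** (offset_bits - 1)
    if any(o < -limit or o >= limit for o in offsets):
        raise ValueError("offsets exceed the assumed field width")
    return avg, offsets

def decompress(avg, offsets):
    return [avg + o for o in offsets]

# Example: 8-bit DAC codes that cluster around a common value.
dac_codes = [130, 128, 131, 127, 129]
avg, offs = compress(dac_codes)
assert decompress(avg, offs) == dac_codes
```

The resource saving appears when the per-qubit offsets fit in far fewer bits than the explicit values they replace.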
Along the same lines, a PLA may enable dynamic remote control of a system. Frequently used signals can be defined in a library stored at the cold device. Calling a signal from room temperature is then a matter of sending down the appropriate index for lookup in the library instead of a full (lengthy) control signal. The device on the cold side then decodes the index for the desired signal and generates it in situ. Taken to its logical conclusion, this is a way of building up a set of control words for higher-level functionality of the system while leaving the details of the electrical signals within the system.
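A toy Python sketch of the index-based control idea follows; the library contents and sample values are invented purely for illustration, not drawn from the disclosure.

```python
# Hypothetical sketch of index-based remote control: the cold-side device
# holds a library of frequently used waveforms, and the room-temperature
# controller sends only a short index instead of the full signal.

SIGNAL_LIBRARY = {
    0: [0, 64, 128, 64, 0],   # e.g., a short ramp (illustrative samples)
    1: [128] * 8,             # e.g., a flat-top pulse
    2: [0, 255, 0],           # e.g., a fast spike
}

def generate_in_situ(index):
    """Decode an index received from room temperature into a local waveform."""
    return SIGNAL_LIBRARY[index]

samples = generate_in_situ(1)  # only the integer 1 crosses the temperature stages
```

The bandwidth saving is the difference between transmitting one small index and transmitting the full sample stream for every invocation.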
Customization for either the physical device in use or the use case of interest can be done once ahead of operation and then reused until a new use case or device is needed. In a similar application, if a large number of individual processing units all receive the same instructions but operate on different sets of data (a SIMD type of use), a PLA could operate as a control manager for the individual cores.
Finally, a PLA may be part of an internal data feedback loop. Mechanisms within the system that provide measurement data may send a result to the PLA, which could analyze it and take appropriate action. In any array with a non-negligible failure rate, if the array units can be interchanged, spare units can be automatically activated to assume the role of a failed unit. This allows a separation of concerns, leaving detailed management of resources to the "hardware" while the overall use case and algorithm are managed externally (e.g., by "software" operating at room temperature to control the overall behavior of the system). This is analogous to a hard drive controller automatically managing good and bad sectors of a hard drive without the need for the operating system of a computer to take on this task.
Returning to the discussion of quantum computing in general, it is important to understand what kind of operations can be done on qubits. Of particular importance, due to its utility as a known starting state, is creating “Bell” states out of pairs of qubits. This is done by coupling, exciting, and rotating (which for this discussion just means a certain voltage waveform extending for a certain amount of time) the qubits of interest. A generic algorithm may start with all qubits in such a state, or only some, intending to either create Bell states later or allow the yet-unused qubits to serve another purpose. In such an algorithm, every even qubit might be used to create a “Bell” state, while, with no voltage waveform applied, every odd qubit remains in the ground (|0>) state. Or, every qubit could be used to form an array of qubits in the “Bell” state. This kind of common and repeated operation can be a useful command to execute frequently. A PLA could receive a short command to do this, which would then execute other commands local to the array, ultimately sending voltage values to DACs, mixers, or any other electrical devices, each with their own unique values designed to excite each parametrically unique qubit to an identical “Bell” state (the uniqueness being recorded as the “acquired test data” noted earlier).
Another common step may be to measure the qubit state after a specific amount of time. “Measuring” may also involve applying the right voltage signal waveforms to couple qubits to readout devices. Again, the higher-level capability of readout can be defined as a command and the details of applying voltages left to the PLA. With the right voltage waveforms collected for the digital-to-analog and analog-to-digital converters of the qubits, the quantum computation entities 320 can be operated. If used for qubit control, any of the common qubit gates can be defined as a command to the PLA. Each one is ultimately applying voltages, but it is applying the right voltages for the right amount of time on each qubit (or coupler). All in all, with proper values stored as inputs to the control mechanisms and the desired commands stored as indices in a defined library, all gates and readout of interest can be reduced to a set of commands. This is the opposite of generating all signals explicitly at room temperature and sending down analog signals directly to qubits.
In view of the above, it can be seen that an MM 100 (or, in combination, principally with a cache) can be used much closer to the qubits (e.g., for storing up a single DAC value in each line and applying it for one cycle only, then moving on to the next) or much farther away from them (e.g., for higher level commands, like a gate operation). Given that instructions are shared between many qubits, the quantum control architecture (which involves 300 of
Returning now specifically to the discussion of the quantum computing system 300 of
Functioning as a central entity, the illustrative MM 100 shown in
In general, exemplary logic operations of the quantum computing system 300 include: (i) data path/plane operations performed by quantum computation entities 320; (ii) data path/plane operations performed by the MM 100; (iii) data path/plane operations performed by “low temperature” logic entities 304; (iv) data path/plane operations performed by “room temperature” processing entities 302; (v) control path/plane operations performed by the collective entities 100, 304, 306, 308, 310, 312, 314, 316 in overseeing the data path/plane operations of the quantum computation entities 320; (vi) control operations performed by the collective entities 304, 306, 308, 310, 312, 314, 316 in overseeing the MM 100; (vii) control operations performed by the “room temperature” processing entities 302 in overseeing the collective entities 100, 304, 306, 308, 310, 312, 314, 316, 318, 320; and (viii) a system of feedbacks and acknowledgments among various entities after completing tasks.
The data path/plane operations performed by the illustrative quantum computing system 300 can be, for example, Boolean-based addition, comparisons, or multiplication, or can be qubit-based wave function operations (parallel via superposition of entangled states) to realize, for example, Shor's algorithm (for factoring), Grover's algorithm (for searching unstructured lists), and quantum Fourier transforms (QFTs). Considering the rich set of quantum computing technology alternatives (analog and digital/Boolean gate-like), the possibilities are numerous and therefore will not be fully described herein.
Attempting to optimize what operations are performed and within what entities (i.e., where) in the quantum computing system 300, for the lowest overall cost or for the greatest performance, involves consideration of intrinsic latencies, bandwidths, power consumption, circuit yield, and functional and environmental factors (e.g., those necessary for quantum computation) of the various constituent entities, among other factors. Programmed/set for a window of time during which a particular algorithm is run, functions can be intertwined; the logic, memory, and mixed configurations can be provided, in part, by the collective superconducting array cells 104 or, more generally speaking, by superconducting array cells 104 and FPGA superconducting elements. In fact, as already stated, a goal of this invention is to have circuits/entities serve roles for logic, memory, and mixed/combined operations. It is to be appreciated that the entities in the illustrative quantum computing system 300 are explicitly organized and generically named to convey the broad capabilities of various embodiments of the present invention.
Embodiments of the invention can be used in conjunction with known superconducting memories, or portions thereof, for beneficially improving the capabilities of such conventional superconducting memories, given their diverse specifications, with specialized entities adapted to address the unique attributes of each memory. An all-RF read path (i.e., radio frequency (RF) transmission-line-based read path system) of JMRAM or PRAM advantageously has far lower latencies than an all-RQL read path, but unfortunately can lower overall bandwidth (due to multi-flux-quanta signaling, which requires flux pumps). These differences, while noted herein, are not the subject of the present invention.
Embodiments of the inventive concept further seek to improve performance in light of the extremely poor yields associated with conventional superconducting electronics. Compared with modern CMOS chips, superconducting chips support at least one hundred times fewer Boolean circuits. Just as in the case of the illustrative MM 100 shown in
Portions of "quantum" algorithms, programs and applications can be processed in all temperature zones using the various physical systems available at each temperature: qubits (at milli-Kelvin), superconducting analog and digital circuits (less than or equal to 4.2 degrees Kelvin; liquid helium), potential intermediate stages of CMOS and bipolar circuits (liquid nitrogen temperature, 77 Kelvin), and CMOS and bipolar circuits (at "room temperature," e.g., a range from about -50 to 100 degrees Celsius). In embodiments of the invention, the specific design of a hardware/application/algorithm across different physical systems, having different logic and memory circuits, is not emphasized. Rather, the entities, their capabilities and their interactions with the MM (e.g., MM 100 depicted in
"Low temperature" logic and memory management entities are named as such because they are likely to contain superconducting circuits such as RQL, but there is a reasonable possibility that they may contain CMOS circuits as well (two such examples, for Josephson magnetic FPGAs and JMPLAs, are discussed in Appendix G). Much information regarding exact circuit choices (e.g., CMOS versus RQL) remains conjecture at the moment, given the immature state of quantum computing. Thus, with respect to quantum computing system 300, the following principal elements according to one or more embodiments of the invention will each be discussed in broad detail (rather than as a more precise design like other embodiments, e.g., the MM 100).
In general, the “low temperature” logic entities 304 can be directed to various purposes, such as, for example: (i) performing control; (ii) managing data pipelines of the MM 100; (iii) preparing data for use in the MM 100 (e.g., true/complement generation for WoDa<1:N> signals, determination of encoded “partial” address<1:?>, and, in concert with multiplexors 308 and 310, aligning signals on appropriate phases and RQL cycles); (iv) improving the compute efficiency of operations associated with the illustrative MM 100 and/or the quantum computation entities 320; and (v) operatively interacting with “room temperature” processing entities 302, quantum computation entities 320 and each other 304, among other functions.
"Low temperature" logic entities 304 include a true complement Boolean data generation logic 305 (i.e., inverters and buffers) that operatively sources the MM 100 with true complement data (WoDa<1:N>) to be processed by the logic enabled within the MM 100. "Low temperature" logic entities 304 can further include sparse logic 316 that can perform logic functions so that a second pass through the MM 100, which is typically required for Boolean logic completeness, can be avoided. Avoiding a second pass can improve performance.
Multiplexor entities for word line address and/or data 308 can be just a simple multiplexor(s) that selects among logic operations, memory read operations, and mixed memory and logic operations for the MM 100. It can also be enhanced to include slightly more complex control signals which set specified bit ranges to logical “0s” (using AND gates, as known by those skilled in the art) to assure only a subset of rows in a location selected by multiplexor entities for encoded “partial” addresses 310 (decoder 114) can be activated/selected; in this row-zeroing-out mode (setting bit fields to logical “0s”), the multiplexor 308 can, for example, provide a secondary decode function to the decoder 114.
Multiplexor entities for encoded “partial” addresses 310 can be a simple multiplexor(s) that chooses array locations (e.g., arrays 102 in
Memory read and write request buses 312 are highlighted as entities in the illustrative quantum computing system 300 depicted in
“Low temperature” logic and memory overlap entities 314 can include reservation tables for all operations occurring within the MM 100, in one or more embodiments. Such operations may involve arrays 102 of different column/output widths over a two-pass logic operation. It is important to note that two passes may be necessary to generate a complete Boolean function, as will be discussed in conjunction
In one or more embodiments, the sparse logic 316 can include at least one of the following: (i) Josephson FPGA (field programmable gate array) cells and interconnect for performing more customized bit-by-bit (i.e., bitwise) operations; or (ii) fixed logic functions for implementing frequently used functions. Sparse logic 316, in one or more embodiments, can include instruction-branch-partial-resolution logic, such as logic performing a comparison of only the high-order bits of two operands involved in a conditional clause containing a less-than or greater-than statement. If a branch cannot be resolved by such logic, a second pass can be executed.
Quantum computation entities 320 include a plurality of entangled qubits. Generally, interactions/communications among the quantum computation entities 320 and other entities (note that the interaction with the "low temperature" logic entities 304 is highlighted) involve complex analog electronics, represented here by A_to_D and D_to_A communication entities for Qubits 318. The "low temperature" logic entities can participate in control plane/path operations (e.g., quantum error correction code (QECC)) and data plane/path operations (e.g., Shor's algorithm for factoring, Grover's algorithm for searching an unordered list, and the quantum Fourier transform (QFT) algorithm).
Logic operations, like OR functions, can be and are preferably performed in the output bus circuit 108 in the illustrative MM 100 shown in
It should be noted that the differences between
As previously discussed, in
The electable inverter 400 can function as a buffer or as an inverter, or can output "0s." To invoke inversion during an exemplary Pass 1 of the MM 100, the Pass 1 control signal is set to a logic "1" and the Pass 2 control signal is set to a logic "0." To invoke buffering during an exemplary Pass 2 of the MM 100, the Pass 2 control signal is set to a logic "1" and the Pass 1 control signal is set to a logic "0." Otherwise, when logic operations are not active, the electable inverter 400 can beneficially reduce power consumption by having both the Pass 1 and Pass 2 control signals set to "0," which saves energy in generating and propagating the signals associated with the control inputs (e.g., Pass 1 and Pass 2), in computing the signal driven through the output "Out," and in propagating any "1s" of "Out" to downstream logic.
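A minimal Python behavioral model of the control encoding just described is given below; the function name and argument names are ours, and only the truth-table behavior stated in the text is modeled.

```python
# A minimal behavioral model of the electable inverter 400 as described above:
# Pass 1 asserted -> invert, Pass 2 asserted -> buffer, neither -> force "0".
# The encoding is taken from the text; the function name is an assumption.

def electable_inverter(datum, pass1, pass2):
    if pass1 and not pass2:
        return 0 if datum else 1   # inversion during Pass 1
    if pass2 and not pass1:
        return datum               # buffering during Pass 2
    return 0                       # both controls low: emit "0" to save power

assert electable_inverter(1, pass1=1, pass2=0) == 0
assert electable_inverter(1, pass1=0, pass2=1) == 1
assert electable_inverter(1, pass1=0, pass2=0) == 0
```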
Represented as glue logic circuits 506, the exemplary logic flow diagram 500 to generate a universal Boolean expression further includes “low temperature” logic entities 304, multiplexor entities for word line address and/or data 308, multiplexor entities for encoded “partial” addresses 310 (all entities 304, 308, 310 can be represented as glue logic circuits 506 in
It is important to recognize that, as the “key” in
When De Morgan's Law 510 (for an OR_not transform) is applied, it can be seen that pass 1 and pass 2 logic representations 502 and 504, respectively, can be represented with AND gates 512, OR gate 514, and inversion bubbles 516. In combination with T/C generation for Boolean “data” terms 305, the aforementioned resulting logic forms a universal Boolean expression. It should be noted that the labeling of logic gates 106, 110, 112 from
Addressing the inverter bubbles 516, positioned just before the AND gates 512, it should be noted that inversions actually result from the transform 510 applied to combinations of OR gates 106 and 110, in one or more embodiments. Moreover, it should be noted that the programmable switches (PS) 508, situated in front of AND gates 512 and OR gate 514, provide the programmability of the logic.
The embodiment of the logic flow diagram 500 to generate a universal Boolean function should be considered illustrative rather than limiting. Other significant, but intermediate, manipulations of data at the output of multiplexors 308 and arrays 102, for example, can be beneficial, as has already been discussed in connection with some embodiments of the invention. Moreover, pass 1 ORing and pass 2 ANDing can be enabled by configuring the electable inverters 112 as buffers in pass 1 and then as inverters in pass 2, respectively. In general, N pass traversals can occur (where N is an integer ranging from 1 to an extremely large number) to form a prescribed logic expression or, perhaps, to co-mingle logic operations with memory operations to implement a prescribed algorithm.
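To make the two-pass De Morgan principle concrete, the following Python sketch evaluates a small sum-of-products expression the way the text describes: an OR over complemented inputs followed by inversion yields the AND (product) terms on pass 1, and a plain OR combines them on pass 2. The function names, masks, and data values are illustrative assumptions, not the patent's circuits.

```python
# A sketch, under our own naming, of how two passes through an OR-based array
# plus electable inversion can realize a sum-of-products (AND-OR) expression
# via De Morgan's law: AND(a, b) == NOT(OR(NOT a, NOT b)).

def pass1_and_terms(tc_inputs, product_masks):
    """Pass 1: OR of complemented inputs, then invert -> AND (product) terms."""
    terms = []
    for mask in product_masks:              # each mask lists the inputs in a product
        or_of_complements = any(not tc_inputs[i] for i in mask)
        terms.append(not or_of_complements) # electable inverter active on pass 1
    return terms

def pass2_or(terms, sum_mask):
    """Pass 2: OR selected product terms; inverter configured as a buffer."""
    return any(terms[i] for i in sum_mask)

# f = (a AND b) OR (NOT a AND c) with a=1, b=1, c=0  -> expected 1
a, b, c = True, True, False
inputs = {0: a, 1: b, 2: c, 3: not a}       # true/complement generation (as in 305)
products = [(0, 1), (3, 2)]
assert pass2_or(pass1_and_terms(inputs, products), sum_mask=(0, 1)) is True
```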
It should be further understood that the exemplary block diagram 500 depicted in
Assuming the Datum_Out<1> signal from an upstream array(s) is a logic “0” (given that the upstream arrays are disabled from generating data in the pipeline for the cycle(s) under consideration), specified control signal settings will trigger the following exemplary behavior(s) of the array-output circuit 600 noted in the MM 100 as 119:
Collectively, exemplary behaviors [3] and [4] of the array-output circuit 600 embody multiplexing. As opposed to time division multiplexing (TDM), which preserves data in data beats across cycles (for this example, two cycles), traditional multiplexing, such as described by the combination of behaviors [3] and [4], discards a subset of the data. In this exemplary array-output circuit 600, half the data is lost if multiplexor requests are enabled.
It is also important to recognize that column-oriented address logic (multiplexors) serves to enable a second form of address decoding for arrays. As implemented by the array output circuit(s) 600 (119), these circuits can select various sets of columns of data, sourced from one or more arrays 102 in an MM 100, for output, as will be discussed with respect to bit slice output dataflows of
Under the oversight of instructions, which implement a certain computer architecture corresponding to a particular program, control logic (which drives, for example, signals Enable<1>, Enable<2>, Enable_TDM<2> in the illustrative array-output circuit 600 shown in
Neighbor-to-neighbor two-input even-odd data flow operations include, for example, those associated with RQL. Two-input logic includes, for example, AND, OR, XOR (a composite function of RQL AndOr and RQL AnotB), RQL AndOr, and RQL AnotB gates.
Referring to
An output of the AND gate 802 is supplied to a first input of OR gate 804, and an output of AND gate 806 is delayed by the P cycle delay module 808 before being supplied to a second input of the OR gate 804. An output of OR gate 804 is supplied to a first input of OR gate 812. An output of the AND gate 810 is supplied to an input of the W cycle delay module 814 before being supplied to a second input of the OR gate 812. An output of the OR gate 812 generates the Datum_Out signal.
The following exemplary application of control signals is illustrative, but not limiting, of an exemplary behavior of the programmable copy delay circuit 800:
For the cycle(s) of interest, the fixed copy delay circuit 850 serves (i) to feed the datum provided to input Datum_In on cycle N to output Datum_Out on cycle N, where N represents an arbitrary input data cycle, (ii) to feed the datum provided to input Datum_In on cycle N to output Datum_Out on cycle N+P, where P is an integer associated with the P cycle delay module 808, and (iii) to feed the datum provided to input Datum_In on cycle N to output Datum_Out on cycle N+W, where W is an integer associated with the W cycle delay module 814.
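A cycle-accurate behavioral sketch of the programmable copy delay datapath (AND 802/806/810, OR 804/812, delay modules 808 and 814) is shown below in Python. The enable names (en_direct, en_p, en_w) and the deque-based delay model are assumptions introduced for illustration.

```python
# A behavioral sketch of the programmable copy delay (PCD) datapath described
# above. P and W are the delays of modules 808 and 814; enables are assumed.

from collections import deque

class ProgrammableCopyDelay:
    def __init__(self, p_cycles, w_cycles):
        self.p_line = deque([0] * p_cycles, maxlen=p_cycles)  # P cycle delay 808
        self.w_line = deque([0] * w_cycles, maxlen=w_cycles)  # W cycle delay 814

    def clock(self, datum_in, en_direct=0, en_p=0, en_w=0):
        direct = datum_in & en_direct          # AND 802: undelayed copy
        p_out = self.p_line[0]                 # oldest value leaves delay 808
        self.p_line.append(datum_in & en_p)    # AND 806 feeds delay 808
        w_out = self.w_line[0]                 # oldest value leaves delay 814
        self.w_line.append(datum_in & en_w)    # AND 810 feeds delay 814
        return (direct | p_out) | w_out        # OR 804 then OR 812 -> Datum_Out

pcd = ProgrammableCopyDelay(p_cycles=2, w_cycles=4)
outputs = [pcd.clock(d, en_p=1) for d in (1, 0, 0, 0, 0)]
assert outputs == [0, 0, 1, 0, 0]              # the input copy emerges P=2 cycles later
```

Enabling more than one path at once simply ORs the differently delayed copies onto Datum_Out, which is consistent with the OR-tree structure described above.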
It is also important to recognize that column-oriented address logic (e.g., multiplexors) serves to enable a second form of address decoding for arrays. The array output data flow 1000 of
In one or more embodiments, a first one of the first plurality of array output circuits 600 may be connected to a first pair of adjacent column lines, <x> and <x+1> (where x is an integer), in array_A 102A, and a second one of the first plurality of array output circuits 600 may be connected to a second pair of adjacent column lines, <x+2> and <x+3>, in array_A. Likewise, a first one of the second plurality of array output circuits 600 may be connected to a first pair of adjacent column lines, <x> and <x+1>, in array_B 102B, and a second one of the second plurality of array output circuits 600 may be connected to a second pair of adjacent column lines, <x+2> and <x+3>, in array_B. The first pair of adjacent column lines in each array 102A, 102B may be considered to be associated with an “even” bit slice, and the second pair of adjacent column lines in each array 102A, 102B may be considered to be associated with an “odd” bit slice.
Each of the array output circuits may be configured to select at least a given one of the corresponding column lines to which it is connected as a function of one or more enable signals supplied thereto. For example, the first one of the first plurality of array output circuits 600 may be configured to receive a first set of one or more enable signals, ens_A<0,1>, the second one of the first plurality of array output circuits 600 may be configured to receive a second set of one or more enable signals, ens_A<2,3>, the first one of the second plurality of array output circuits 600 may be configured to receive a third set of one or more enable signals, ens_B<0,1>, and the second one of the second plurality of array output circuits 600 may be configured to receive a fourth set of one or more enable signals, ens_B<2,3>. In one or more embodiments, the enable signals ens_A<0,1>, ens_A<2,3>, ens_B<0,1>, ens_B<2,3> may comprise “control bits” in a corresponding dataflow outside the array 102. Exemplary control bits will be discussed in the cache section with reference to
Outputs of each of the array output circuits 600 configured to select the first pairs of adjacent column lines <x>, <x+1> in array_A 102A and array_B 102B may be supplied to corresponding inputs of a first one of the OR gates 1002, which may correspond to the even bit slice. Similarly, outputs of each of the array output circuits 600 configured to select the second pair of adjacent column lines <x+2>, <x+3> in array_A 102A and array_B 102B may be supplied to corresponding inputs of a second one of the OR gates 1002, which may correspond to the odd bit slice.
Outputs of the OR gates 1002 may be supplied to corresponding programmable copy delay (PCD) circuits 800. More particularly, an input of a first PCD circuit 800, which may be associated with the even bit slice, is preferably configured to receive an output of the first (even) OR gate 1002, and a second PCD circuit 800, which may be associated with the odd bit slice, is preferably configured to receive an output of the second (odd) OR gate 1002. Each of the PCD circuits 800 is preferably configured to generate an output signal that is a delayed copy of the input signal supplied thereto, as a function of one or more corresponding enable signals supplied thereto. The first PCD circuit 800 may be configured to receive a first set of one or more PCD enable signals, ens_even PCD, and the second PCD circuit 800 may be configured to receive a second set of one or more PCD enable signals, ens_odd PCD. In one or more embodiments, an amount of delay generated by the PCD circuits 800 may be controllable based on the PCD enable signals.
Respective outputs generated by the PCD circuits 800 may be supplied to corresponding inputs of the nearest neighbor logic (NNL) 700. The nearest neighbor logic 700, in one or more embodiments, is preferably configured to implement 2-input logic operations (such as an XOR) as a function of one or more enable signals supplied thereto. Specifically, the nearest neighbor logic 700 may be configured to receive a first set of one or more enable signals, ens_even NNL, and a second set of one or more enable signals, ens_odd NNL, for selecting, as an output of the nearest neighbor logic 700, a nearest neighbor in the even bit slice or a nearest neighbor in the odd bit slice, respectively; the nearest neighbor output for the even bit slice is selected by setting the ens_even NNL enable signal(s) to an appropriate level or data pattern, and likewise the nearest neighbor output for the odd bit slice is selected by setting the ens_odd NNL enable signal(s) to an appropriate level or data pattern.
The outputs of the nearest neighbor logic 700 may be supplied to corresponding inputs of the elective inverters (EI) 400. Each of the elective inverters 400 may be configured to generate a corresponding datum output as a function of an enable signal(s) supplied thereto. Specifically, the elective inverter 400 associated with the even bit slice is preferably configured to generate an output datum, Datum<Z>_for_logic_OPS, as a function of a corresponding enable signal, ens_even EI, supplied thereto, and the elective inverter 400 associated with the odd bit slice is preferably configured to generate an output datum, Datum<Z+1>_for_logic_OPS, as a function of a corresponding enable signal, ens_odd EI, supplied thereto.
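The following is a deliberately simplified, functional Python sketch of the even/odd bit-slice chain just described (array output circuits 600, OR gates 1002, PCD 800, NNL 700, EI 400). Delays are omitted, the per-circuit enables are collapsed into simple selects and flags, and the XOR is placed on both outputs only to keep the example short; none of these simplifications come from the disclosure.

```python
# A simplified functional sketch of the even/odd bit-slice dataflow described
# above (array output circuits 600 -> OR 1002 -> PCD 800 -> NNL 700 -> EI 400).

def array_output(columns, select):            # 600: pick one of two adjacent columns
    return columns[select]

def bit_slice(array_a, array_b, sel_a, sel_b, nnl_xor=False, invert=False):
    even = array_output(array_a[0:2], sel_a) | array_output(array_b[0:2], sel_b)  # OR 1002 (even)
    odd  = array_output(array_a[2:4], sel_a) | array_output(array_b[2:4], sel_b)  # OR 1002 (odd)
    if nnl_xor:                                # 700: a 2-input neighbor operation (XOR shown)
        even, odd = even ^ odd, even ^ odd
    if invert:                                 # 400: electable inversion
        even, odd = 1 - even, 1 - odd
    return even, odd                           # Datum<Z>, Datum<Z+1>_for_logic_OPS

# array_A columns <x..x+3> = 1,0,1,1 ; array_B columns = 0,0,0,1
assert bit_slice([1, 0, 1, 1], [0, 0, 0, 1], sel_a=0, sel_b=0, nnl_xor=True) == (0, 0)
```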
Each of the superconducting array circuits further includes a plurality of output ports connected to corresponding multiplexers in the network. More particularly, the north superconducting array circuit 100N includes a first output port, West_mux_from_N, a second output port, South_mux_from_N, and a third output port, East_mux_from_N. The north superconducting array circuit 100N is configured to pass received data from its input data port, North_in, to a given one of its output ports as a function of the Address_in signal supplied to its address input port. The south superconducting array circuit 100S includes a first output port, West_mux_from_S, a second output port, North_mux_from_S, and a third output port, East_mux_from_S. The south superconducting array circuit 100S is configured to pass received data from its input data port, South_in, to a given one of its output ports as a function of the Address_in signal supplied to its address input port. The east superconducting array circuit 100E includes a first output port, West_mux_from_E, a second output port, South_mux_from_E, and a third output port, North_mux_from_E. The east superconducting array circuit 100E is configured to pass received data from its input data port, East_in, to a given one of its output ports as a function of the Address_in signal supplied to its address input port. Likewise, the west superconducting array circuit 100W includes a first output port, North_mux_from_W, a second output port, South_mux_from_W, and a third output port, East_mux_from_W. The west superconducting array circuit 100W is configured to pass received data from its input data port, West_in, to a given one of its output ports as a function of the Address_in signal supplied to its address input port. The north-south-east-west network element 1100 can preferably support more than one non-conflicting transaction concurrently (e.g., north to south, south to north, east to west, and west to east).
The network element 1100 further includes a plurality of merge OR circuits 1102. Each of the merge OR circuits is configured to generate an output as a function of a logical sum of the signals applied to its inputs. Specifically, a first merge OR circuit, which may be a "north" merge OR circuit 1102N, is configured to receive, as inputs, output signals generated by the superconducting array circuits 100S, 100E, 100W, for a "north" multiplexer, North_mux_from_E, North_mux_from_W and North_mux_from_S, and to generate, as an output, a North_out signal. A second merge OR circuit, which may be a "south" merge OR circuit 1102S, is configured to receive, as inputs, output signals generated by the superconducting array circuits 100N, 100E, 100W, for a "south" multiplexer, South_mux_from_N, South_mux_from_E and South_mux_from_W, and to generate, as an output, a South_out signal. A third merge OR circuit, which may be an "east" merge OR circuit 1102E, is preferably configured to receive, as inputs, output signals generated by the superconducting array circuits 100N, 100S, 100W, for an "east" multiplexer, East_mux_from_N, East_mux_from_W and East_mux_from_S, and to generate, as an output, an East_out signal. Likewise, a fourth merge OR circuit, which may be a "west" merge OR circuit 1102W, is preferably configured to receive, as inputs, output signals generated by the superconducting array circuits 100N, 100S, 100E, for a "west" multiplexer, West_mux_from_S, West_mux_from_E and West_mux_from_N, and to generate, as an output, a West_out signal. As shown in
The MM 100 can, for example, be used as a mesh network routing element 1100 in a networking system, receiving data from any first direction, either North, South, East, or West, and sending the data along at least one of a set of second directions, North, South, East, and West, the at least one second direction notably being not the first. The data from the first direction is applied to the WoDa inputs, and the Data_for_logic_OPS is divided into three outputs corresponding to the set of possible second directions (in this exemplary planar network). Each address selects a region (which in this instance includes a partial encoded address, an array 102 enable address, and possibly a column address).
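A small Python sketch of the routing behavior follows. It abstracts the array circuits and merge OR circuits 1102 into a dictionary of output ports; the function name and transaction format are illustrative assumptions.

```python
# A hypothetical routing sketch for the north-south-east-west network element
# 1100: data entering from one direction is steered, by address, to one of the
# other three directions, and the merge OR circuits 1102 combine the per-source
# contributions on each output port.

DIRECTIONS = ("N", "S", "E", "W")

def route(transactions):
    """transactions: list of (source_direction, destination_direction, datum)."""
    outputs = {d: 0 for d in DIRECTIONS}
    for src, dst, datum in transactions:
        if dst == src:
            raise ValueError("the second direction must differ from the first")
        outputs[dst] |= datum        # merge OR 1102 on the destination port
    return outputs

# Two non-conflicting transactions handled concurrently (N->S and E->W).
assert route([("N", "S", 1), ("E", "W", 1)]) == {"N": 0, "S": 1, "E": 0, "W": 1}
```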
The term sparse logic (e.g., sparse logic 316 in
While the logic for bit field comparisons is known in the art, bit field comparisons (by comparators), useful for the lookup path which will be discussed with reference to
High-bandwidth passive transmission line (PTL) circuits are particularly well-suited for the low latency distribution of control signals (described as enables), which typically have wide fan-outs that steer many bits of data flow logic.
A common limitation in current superconducting logic technologies is the distance SFQ pulses can travel before being captured and retransmitted. This limitation arises in different contexts and for different reasons. A common solution is to transmit the signal along a PTL, rather than an active JTL; this makes sense when the resources for changing from JTL to PTL, and back again, are less than the resources required to actively transmit the same signal. However, this adds a new limitation: the bandwidth of the PTLs is limited, often by the "charge time" for building up a signal for a PTL driver (what is actually being stored is flux quanta; in practice, this is the time required to build up a multi-SFQ signal). Generally, PTLs can only take one flux quantum at a time, and therefore PTLs take time to recover. Embodiments of the invention beneficially provide a solution to this PTL "recovery" issue, and therefore are able to achieve increased bandwidth while minimizing resource usage.
In an extreme solution, a single signal could be split to N different PTL drivers, where N is the number of clock cycles required to ready a PTL driver. In this way, the bandwidth of the PTLs collectively would match that of the incoming line, at the cost of a large number of PTLs and corresponding PTL drivers. In more practical terms, such long-distance signals are likely to be less frequent, for example a single synchronization bit every 64 cycles. In such cases, a similar round-robin approach can be used with far fewer resources. Of course, the amount of resources needed will ultimately be dictated by the requirements of the rest of the system.
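The sizing trade-off described above reduces to a simple ratio, sketched below in Python; the function name and parameters are illustrative, and real designs would also account for driver and receiver latencies.

```python
# A back-of-the-envelope sketch for sizing the round-robin fan-out: if a PTL
# driver needs `recovery_cycles` clock cycles to recharge and the incoming
# line can present a new datum every `data_interval` cycles, the number of
# parallel drivers needed is roughly their ratio (rounded up).

import math

def drivers_needed(recovery_cycles, data_interval=1):
    return max(1, math.ceil(recovery_cycles / data_interval))

assert drivers_needed(recovery_cycles=4, data_interval=1) == 4    # full-rate stream
assert drivers_needed(recovery_cycles=64, data_interval=64) == 1  # sync bit every 64 cycles
```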
With continued reference to
Similarly, a second one of the PTL circuits, 1430B, may include an AND gate 1404B having a first input adapted to receive the data input signal Datum_In, and a second input adapted to receive a second enable signal. The input data signal Datum_In supplied to the AND gate 1404B, in some embodiments, may be buffered first, such as by the non-inverting buffer 1402. An output of the AND gate 1404B is preferably supplied to an input of a PTL driver (i.e., transmitter) 1408B, and an output of the PTL driver 1408B may be fed to a corresponding passive transmission line 1410B. The signal conveyed by the passive transmission line 1410B may be supplied to an input of a PTL receiver 1412B, and an output of the PTL receiver 1412B may be combined with outputs of one or more other PTL circuits (e.g., PTL circuit 1430A) to form the data output signal, Datum_Out, of the PTL distribution circuit 1400.
The first and second enable signals, collectively the one-hot bus 1406, supplied to the second input of the AND gates 1404A and 1404B in the first and second PTL circuits 1430A and 1430B, respectively, may be generated by the delay circuit 1420. In one or more embodiments, the delay circuit 1420 includes an OR gate 1422 having a first input adapted to receive a buffered version of an initialize signal, Initialize, supplied to the delay circuit. In the illustrative PTL distribution circuit 1400, a non-inverting buffer 1408 may be used to generate the buffered version of the initialize signal supplied to the OR gate 1422, although use of an inverting buffer is similarly contemplated. An output of the OR gate 1422 may be supplied to a first delay line 1424, and an output of the first delay line may be supplied to an input of a second delay line 1426. An output of the second delay line 1426 is preferably fed back to a second input of the OR gate 1422. The outputs of the respective delay lines 1424, 1426 are preferably used to generate the first and second enable signals supplied to the PTL circuits 1430A, 1430B.
In one or more embodiments, each of the first and second delay lines 1424, 1426 may include a plurality of series-connected non-inverting or inverting buffers having a prescribed delay value associated therewith. The delay values corresponding to the first and second delay lines 1424, 1426 are preferably the same, although it is similarly contemplated that, in some embodiments, the first and second delay lines 1424, 1426 may have different delay values. In the delay circuit 1420, the second enable signal will have a longer delay value compared to the first enable signal, since the initialize signal is passed through at least one additional delay line to generate the second enable signal. In one or more embodiments, the delay value corresponding to each of the delay lines 1424, 1426 may be equal to the fractional PTL driver recovery period of the whole circuit.
As previously stated, the outputs of the respective PTL circuits 1430A, 1430B are preferably combined to form the output data signal Datum_Out. To accomplish this, the respective outputs of the PTL receivers 1412A, 1412B in the PTL circuits 1430A, 1430B, can be supplied to corresponding inputs of an OR gate 1414; an output of the OR gate 1414 preferably forms the Datum_Out signal as a logical summation of the output signals from the PTL circuits 1430A, 1430B. Thus, whenever any one of the outputs of the PTL circuits 1430A, 1430B is a logic high, the Datum_Out signal will be a logic high (i.e., logic “1”), and the Datum_Out signal will only be a logic low (i.e., logic “0”) when all of the respective outputs of the PTL circuits 1430A, 1430B are a logic low. Although shown as an OR gate 1414, it is to be understood that similar summation functionality may be achieved using other logic gates, such as inverters, AND gates, NOR gates and/or NAND gates (e.g., using De Morgan's theorem), as will become apparent to those skilled in the art.
The exemplary techniques for distributing a datum signal(s) using PTLs as described in conjunction with
As in the illustrative PTL distribution circuit 1400 shown in
As previously stated in conjunction with the PTL distribution circuit 1400 shown in
An output count value generated on one or more outputs, collectively 1510, may be indicative of a delay amount. These outputs 1510, and the count value associated therewith, are preferably supplied to the second input of the corresponding AND gates 1404A, 1404B, 1404C, 1404D in each of the respective PTL circuits 1430A, 1430B, 1430C, 1430D. If the count value generated by the counter unit 1520 exceeds a prescribed threshold value, the counter unit may generate a rejection flag 1508 to indicate that an overflow condition has occurred; alternatively, the counter unit 1520 may generate the rejection flag 1508 only when the count value exceeds the prescribed threshold and an input Datum_In signal 1506 is received.
With reference now to
In one or more embodiments, the skewed data Datum<1:16>_for_logic_OPS: (i) can undergo no delay to feed the early branch resolution logic 1602 (which may process the most significant bits in a comparison); and/or (ii) can be delayed by P cycles to feed the late branch resolution logic 1604, which can process a combination of skewed early and late data of Datum<1:M> (for at least portions of the skewed data). The MM may have rapid single flux quantum (RSFQ) or RQL-based-read-path NDRO arrays. It is important to recognize that, with additional enables (not explicitly shown, but contemplated), the programmable copy delay circuits 800 may be configured to align various bit fields in the late branch resolution logic 1604 to overlay or correct functional operations in pipelined logic.
In a practical application of a processing system, to improve performance, serial instructions are often fetched and decoded before branches are resolved, thereby tying up many resources including caches. Therefore, when a branch will be taken in an instruction stream (as detected by the early branch resolution logic 1602), even though the branch address may be unknown at the time, it is important to stop serial fetches and decodes occurring in sequence after the branch instruction (wrongly processed) in an operation. As is known in the art, a pipeline flush (also known as a pipeline break or pipeline stall) is a procedure enacted by a processor when it cannot ensure that its instruction pipeline will be correctly processed in the next clock cycle. The pipeline flush essentially frees all pipelines occupied by the serial processing of instructions (wrongly processed) for use by other new processes contending for resources. Such pipeline control actions improve performance and save power, and might involve, for example, the stoppage (i.e., disablement of propagation) of wrongly fetched data in the MM 100 (
The early branch resolution logic 1602 shown in
A second input signal, an “initialize” input initializes an enablement/trigger/launch loop circuit 1420, which itself is responsible for activating one of the PTL drivers 1408A, 1408B, readied with sufficient flux, to drive a signal applied to the input Datum_In across one of the PTLs 1410A, 1410B.
The initialization procedure of the enablement/trigger/launch loop circuit 1420 follows: At some point before operation of the circuit, a one-time logical “1” is applied to the initialize input of JTL 1408, which drives the signal through OR gate 1422. The signal then passes through the one-bit shift register 1424 in one clock cycle and splits, going to both an input of AND gate 1404 and the other one-bit shift register 1426. Shift register 1426 retains the in-flight signal for one clock cycle before feeding it back to OR gate 1422 and the other AND gate 1406. In this way, on alternating clock cycles, either AND gate 1404 or AND gate 1406 (but not both) will see a logical 1. In this sense, enablement/trigger/launch loop circuit 1420 functions as a “round robin” trigger generator, with a two-bit, one-hot output. A continuous stream of data at Datum_In will now alternatingly hit PTL Driver 1408A and 1408B on alternating clock cycles. Assuming substantially similar delays through passive transmission lines 1410A, 1410B and passive transmission line receivers 1412A, 1412B, OR gate 1414 will merge the transmissions back into a single output, Datum_Out.
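The alternating one-hot behavior described above can be captured in a few lines of Python. The sketch below idealizes the circuit: PTL flight time and driver recharge are ignored, and the enable state simply toggles each cycle, which is our modeling assumption rather than a statement about the hardware.

```python
# A cycle-by-cycle sketch of the two-channel round-robin described above: the
# circulating "1" alternates between the two enables, so successive data bits
# are steered to alternating PTL drivers and re-merged by the output OR.

def simulate_round_robin(data_stream):
    enable = [1, 0]                      # state of shift registers 1424/1426 after initialize
    merged = []
    for datum in data_stream:
        ch0 = datum & enable[0]          # AND gate gating PTL driver 1408A
        ch1 = datum & enable[1]          # AND gate gating PTL driver 1408B
        merged.append(ch0 | ch1)         # OR 1414 recombines the two PTL receivers
        enable = [enable[1], enable[0]]  # the "1" circulates on alternating cycles
    return merged

# With equal (and here ignored) PTL delays, a continuous input stream is
# reproduced at the merged output.
assert simulate_round_robin([1, 1, 0, 1]) == [1, 1, 0, 1]
```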
If the duty cycle of the circuit is less than 100% of the available bandwidth of the underlying SFQ technology, such as RQL (for example, 25%), alternative embodiments can be implemented with more sophisticated versions of the input and round-robin N-bit, one-hot circuits. It is contemplated that, in other embodiments of the invention, data that cannot be driven through the PTLs, because none of the PTLs is charged with flux quanta, can be redirected. That is, data arriving before a PTL driver is available will not be transmitted via PTL; this feature may not be used or necessary for a properly constrained input data sequence.
The invention contains at least two gated transmission channels 1430 (each a set of at least one AND gate 1404, at least one PTL driver 1408, at least one PTL transmission line 1410, and at least one PTL receiver 1412), at least one N-way OR gate 1414, at least one input JTL 1402 (which in general may be split into any number of JTLs necessary to achieve sufficient fan-out), at least one counter unit 1520 (which itself has at least one data input 1506, can have a rejection flag output 1508, and has N outputs for the "one-hot" bus 1406), and at least one "one-hot, N-channel" output bus 1406.
The input signal 1502 is split by JTL 1402 (or an appropriate splitter tree) to go to each of the four AND gates 1404A, 1404B, 1404C, 1404D as well as to the counter unit 1520. Upon receiving an input bit, the counter unit can send the bit to the “rejection flag” output 1508 if no PTL driver is ready. It is contemplated that the circuit providing input to 1500 is designed such that input only arrives when sufficient PTL drivers are ready, and as such the description of this implementation will assume no early data arrives.
The counter unit 1520 has a data input 1506 from the split, and outputs four (or more generally, N) one-hot outputs for each of the PTLs, collectively indicated as the data bus 1406. The one-hot outputs may also overall have a “no-hot” state (logical zero on all outputs) if no PTLs are ready to receive data. The counter unit first has an internal timer counting up to one-fourth of the recharge time of the PTL drivers. (In general, the counter increases to a value such that the next PTL will be ready for the next incoming logical 1 bit; this value is highly dependent on the frequency of incoming data, the size of the incoming data, the number of PTL drivers available, and the recharge time of the PTL drivers.) The counter unit waits until this counter has reached the appropriate value and then cycles through the four one-hot outputs upon arrival of the data-present signal. Each of the one-hot outputs goes to a separate AND gate. The one-hot outputs will maintain a single logical 1 output until a data input is received, after which it will have a “no-hot” output until the counter once again reaches its maximum value, after which the next line in the one-hot output switches to a logical 1. In this way, incoming data arriving early to the AND gates 1404 will not proceed to the PTL driver 1408 until the appropriate PTL driver is ready.
The remainder of circuit 1500 functions similarly to circuit 1400, with only one of the AND gates primed to pass through data at a time. (Note that, by design, the circuit 1400 has as many channels as it takes cycles to recharge the PTL driver, and as such never needs a "no-hot" output state of the bus. Otherwise, the counter unit 1520 provides the same functionality as the enablement/trigger/launch loop circuit 1420 of circuit 1400.) The input data is split, and each copy goes to one of the AND gates to which the one-hot counter unit signals also go. In this way, only one of the AND gates 1404A, 1404B, 1404C, and 1404D is ever active, and it sends the signal through from the output of the AND gates to one of the PTL drivers 1408A, 1408B, 1408C, and 1408D. By design, the next PTL used, either PTL 1410A, 1410B, 1410C, or 1410D, will be the next one in sequence.
In summary, the counter unit will direct a data input to the failure flag output unless one of the one-hot outputs is active. Otherwise, one and only one of the "one-hot" outputs is active and will allow the split input data to pass through that, and only that, AND gate to the appropriate PTL driver. The counter unit outputs are "no-hot" while the counter is running up to the maximum value, to prevent access to the PTL drivers while they are still charging up.
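The following Python class is a behavioral sketch of the counter unit just summarized. The class name, parameter values, and the simple per-cycle counting model are assumptions; it reproduces the "no-hot"/rejection behavior described above under an idealized clock.

```python
# A behavioral sketch of the counter unit 1520: it holds a one-hot enable for
# the next ready PTL driver, drops to a "no-hot" state while drivers recharge,
# and raises the rejection flag if a datum arrives before any driver is ready.

class CounterUnit:
    def __init__(self, n_channels=4, recharge_cycles=16):
        self.n = n_channels
        self.wait = recharge_cycles // n_channels  # cycles until the next driver is ready
        self.count = self.wait                     # start with one driver ready
        self.next_channel = 0

    def clock(self, datum_in):
        one_hot = [0] * self.n
        rejection = 0
        if datum_in:
            if self.count >= self.wait:
                one_hot[self.next_channel] = 1     # steer this datum to the ready driver
                self.next_channel = (self.next_channel + 1) % self.n
                self.count = 0                     # that driver now begins recharging
            else:
                rejection = 1                      # no driver ready: flag the datum
        self.count = min(self.count + 1, self.wait)
        return one_hot, rejection

cu = CounterUnit()
out0, rej0 = cu.clock(1)    # accepted on channel 0
out1, rej1 = cu.clock(1)    # too early: drivers still recharging, rejection flag set
assert out0 == [1, 0, 0, 0] and rej0 == 0
assert out1 == [0, 0, 0, 0] and rej1 == 1
```

As noted above, a system that properly constrains its input data rate would never exercise the rejection path.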
In a more sophisticated design, the counter could count up to the total time needed for a PTL driver to recharge, starting at the fourth (or, in general, Nth) input bit, and independently cycle through the one-hot outputs. In this way, four consecutive bits can be sent, but at the cost of a longer overall recharge time before any other bit can be sent. This is useful for a "burst mode" transmission where (in this example) four consecutive bits need to be sent in quick succession but the overall frequency of such transmissions is the same as the full recharge time of the PTL drivers. This is not particularly different from the implementation described above, except that the one-hot outputs cycle from the first to the last bit before reaching a "no-hot" output state. In other words, the PTL triggering times do not need to be periodic. Instead, they can occur one cycle after the next, collectively in bursts, should there be a need for a burst signal.
In a real processing circuit/system, to improve performance, serial instructions are fetched and decoded before branches are resolved, tying up many resources including a cache. Therefore, if a branch will be taken in an instruction stream (as detected by the early branch resolution circuit 1602), even though it is not known to which address, it is important to stop serial fetches and decodes occurring in sequence after the branch instruction (wrongly processed) in an operation. As known in the art, a pipeline flush frees all pipelines occupied by the serial processing of instructions (wrongly processed) for use by other new processes contending for resources. Such pipeline control actions improve performance and save power, and might involve, for example, the stoppage (disablement of propagation) of wrongly fetched data in the MM 100 pipeline by setting the enables of 600 (also known as the array output circuit 119) all to a "0-state."
The early branch resolution logic represents one particular form of sparse logic 316 (see
In more general terms, multiple inputs ranging from early to late can be fed to a plurality of logic pipelines (e.g., 1602, 1604). The multi-pipeline system with embedded variable delay elements 1600 includes a plurality of programmable copy delay circuits 800, a first early processing pipeline 1602, and a second late processing pipeline 1604. As discussed with respect to
With regard to caches, there are various known cache architectures, including set-associative caches. Note that a 4-way (and 3-way) set-associative cache is exemplary for the present disclosure, but embodiments of the invention are not limited to such cache architectures.
The exemplary level 1 cache 1706 may include a data RAM 1708, or other addressable storage element, a directory 1710, and a translation lookaside buffer (TLB) 1712, which is often defined as a memory cache that stores recent translations of logical/virtual memory to absolute/physical memory. In one or more embodiments, the data RAM 1708 may be fungible—configured to perform logic, memory, and mixed memory and logic operations. The data RAM 1708 is preferably configured to store lines of data, each line of data comprising, for example, contiguous data, independently addressable, and/or contiguous instructions, also independently addressable. Furthermore, the data RAM 1708 may comprise one or more operands, one or more instructions, and/or one or more operators stored in at least a portion of the data RAM. In one or more embodiments, the data RAM 1708 may be implemented as at least a portion of the MM 100 (see
The file system 1702, like the data RAM 1708, may comprise one or more operands, one or more instructions, and/or one or more operators stored in at least a portion of the file system. Similarly, the main memory 1704 may comprise one or more operands, one or more instructions, and/or one or more operators stored in at least a portion of the main memory. The main memory 1704 is preferably operatively coupled to the cache 1706, such as via a bus 1716 or other connection arrangement. Although only one bus 1716 is shown in
In accordance with one or more embodiments of the invention, the level 1 cache 1706 may be advantageously configured as a "compute cache." To begin a discussion regarding aspects of the present inventive concept associated with the level 1 cache 1706 configured as a "compute cache," it is important to recall the inclusion of operand(s), instruction(s), and operator(s) in all regions of the file and memory system 1700, including the file system 1702, the main memory 1704 and the data RAM 1708 in the cache 1706. An important benefit that is unique to the level 1 cache 1706 according to aspects of the present disclosure is that it may be configured: (i) to enable fetching of one or more operators from the main memory 1704 (or the file system 1702, or other levels of the cache 1706 (not explicitly shown for simplicity)); (ii) to enable a lookup of one or more operators (and, of course, operands and instructions) by the directory 1710 and the TLB 1712, for example via a lookup path 1714 in the cache 1706; (iii) to enable processing of one or more operators (also one or more operations) by the data RAM 1708; and (iv) to enable storage of operators in the data RAM 1708 alongside instructions and operands. Additional functionality resident in the MM 100 within the level 1 cache 1706—for example, the output data flows depicted as a bit slice 1000 (including its driving arrays 102A, 102B) in
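For readers less familiar with set-associative organization, the Python sketch below shows a minimal 4-way directory lookup of the kind the directory 1710 performs, returning the (index, setid) pair that later passes reference. The address field widths, tag layout, and function name are illustrative assumptions, not the patent's specific format.

```python
# A minimal sketch of a 4-way set-associative directory lookup of the kind the
# level 1 cache 1706 performs for operands, instructions, and (here) operators.
# The saved (index, setid) pair is what a later pass through the MM references.

def directory_lookup(directory, address, index_bits=6, line_bits=6):
    index = (address >> line_bits) & ((1 << index_bits) - 1)
    tag = address >> (line_bits + index_bits)
    for setid, entry_tag in enumerate(directory[index]):   # 4 ways per congruence class
        if entry_tag == tag:
            return True, index, setid                      # hit: remember index and setid
    return False, index, None                              # miss: fetch from next level

directory = [[None] * 4 for _ in range(64)]
directory[5][2] = 0x1A                                     # pretend an operator line is resident
hit, index, setid = directory_lookup(directory, (0x1A << 12) | (5 << 6) | 0x10)
assert hit and index == 5 and setid == 2
```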
After describing several elements in the exemplary file and memory system 1700 and defining their respective functions, a top-down description will now be provided. First, in accordance with one or more embodiments of the invention, known computer architectures may be enhanced with new instructions for exploiting capabilities of the underlying attributes of the MM 100 and its external support logic. By way of example only and without limitation or loss of generality, one (partial) format of an illustrative instruction which may be suitable for use with embodiments of the invention is as follows:
For the exemplary instruction format shown above, “address” may refer to a logical/effective/virtual storage address of an instruction, operand or operation, and “reg” may refer to the register in which an operand is stored (e.g., general purpose register (GPR) or register holding results from an earlier pass). Moreover, in the exemplary instruction format shown above, “control” may define one or more logic actions of primarily the dataflow outside the superconducting array(s) 102A, 102B of the MM 100 (
In decoding an instruction, the controls/enables may be defined by the instruction name itself. In contrast, “address_or_register_defined controls” can be enables of at least one of data flow and address flow entities, defined, at least in part, by a particular logical address or defined by a particular register. Address or register defined controls, as opposed to “instruction defined controls,” may make it easier for diagnostic purposes (e.g., to fix bugs) involving the control bit values, because only a software change would be needed as opposed to a hardware change. In the particular example that follows, it will be assumed that the control bits are not defined by a logical address, and thus the control bits are sent to the level 1 cache 1706.
Whether the controls are sourced from an instruction, an address, or a register, they can drive data flow entities, driving, for example, (i) ens_A<0:3>, ens_B<0:3>, ens_even PCD, ens_odd PCD, ens_even NNL, ens_odd NNL, ens_even EI, and ens_odd EI, all signals being associated with the illustrative bit slice 1000 in
In an alternative instruction set architecture (ISA) according to one or more embodiments of the inventive concept, as part of the operator fields, the specific values and bit positions of these control bits would be specifically listed. Instruction decode would send these values and bit positions to the instruction unit, which would send them to the level 1 cache 1706.
An operation can be a collection of instructions, each of which can have an operator(s) and operand(s) associated therewith. Thus, in the context of a cache, an “operation” may invoke dataflow circuitry outside the array 102A, 102B of the MM 100 (
To set up an operator, the contents of the operator's storage location can be used in an MM (see, e.g., block 100 in
For performing an operation, controls will preferably be used to invoke external (i.e., outside the array 102A, 102B in
Some non-limiting examples of control bits for use in conjunction with embodiments of the invention can be as follows:
Traditionally, the term “split” cache refers to separate instruction and operand (data) caches, and the term “unified” cache refers to one cache that contains both instructions and operands. The term “fully split,” as used herein, is intended to broadly refer to separate instruction, operand, and operator caches, and the term “fully unified” (a compute cache), as used herein, is intended to broadly refer to one cache that contains instructions, operands and operators (and performs operations). While embodiments of the invention may be configured as a “split” operator-only cache, this cache example describes a fully unified cache as the MM (100 in
Depending on programming use cases, traditional caches can fill differently, determining the residency of instructions and data during the execution of processes. This new fully unified cache memory according to one or more embodiments of the invention (e.g., level 1 cache(s) 1706 of
The “low temperature” memory management entities 306 within
Discussed subsequently is an exemplary setup process to enable (i) retrieval of an instruction, (ii) retrieval of an operand, and (iii) a pass of an instruction/operation for logic (operator). In one or more embodiments, all setup processes will initially involve a fetch.
If the instruction(s), operator(s), and operand(s) were all identified by their addresses, and their lookups all had directory misses, the fetches to the next level of storage hierarchy and corresponding unified cache installs would be similar. If they later had all directory hits, the operand(s) data returns would be saved in registers. During the execution of an instruction, the operator's control (derived from the instruction name or pointed to by the instruction with a reference to address_control) would be saved in a register in the cache control area.
The instruction and operand fetches would be normal memory requests/accesses to the cache. In mapping cache function into the superconducting entities (i.e., entities 304, 306, 308, 310, 312, and 314) of
Unlike an instruction or operand lookup, the operator's lookup would not enable or access the unified cache. Instead, a subset of the operator's logical/effective address and directory hit setid (or way or sid) would be saved to prepare for the execution of a portion of an instruction, where the portion of the instruction execution occurs in an address-defined-PLA portion of a MM(s) 100 (where the one or more operands get processed). In a more complex architected interaction(s), a combined memory and address-defined-PLA interaction(s) can be performed concurrently within a MM(s) 100. While the MM(s) can support such interaction(s), the interaction(s) is made complex by the practical control requirements needed to assure consistency in system state during the interaction and thereafter (e.g., write interactions must be managed and barred from interfering). The saved information is the cache index and setid, which will be referenced later to perform at least part of an operation.
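The following Python sketch, offered only as a behavioral approximation, illustrates the distinction drawn above: an operator lookup does not read the unified cache but simply records the cache index and hit setid for a later PLA pass. The function name, the directory layout, and the index/line bit widths are assumptions made for the example.

```python
def operator_lookup(logical_address, directory, index_bits=6, line_bits=5):
    """Behavioral sketch: an operator lookup does not read the unified cache.

    It only records which congruence class (cache index) and which setid
    hit, so that a later PLA pass through the MM can address the operator's
    rows directly.  'directory' is assumed to be a list of per-index lists
    of entries, each entry a dict with a 'log' tag field.
    """
    index = (logical_address >> line_bits) & ((1 << index_bits) - 1)
    tag = logical_address >> (line_bits + index_bits)
    for setid, entry in enumerate(directory[index]):
        if entry is not None and entry["log"] == tag:
            # Save index and setid in the cache control area; no array read.
            return {"index": index, "setid": setid}
    return None   # directory miss: fetch the operator line and install it first
```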
This and subsequent paragraphs describe one pass through the MM 100, and supporting/surrounding logic (e.g., entities 304), which, in different words, can be referred to as the execution of an instruction. It is important to understand that control bits can modify dataflow function outside array(s) 102 of
It was discussed earlier that there are several possible sources of the control bits, and that the option of sourcing them from a logical address was not chosen for this example. It was also mentioned that the setup process saved the control bits in a register in the cache control area. These control bits within the cache control area are referenced during a pass through the MM.
The operand(s) register values are used to set multiple WoDa bits (See 322, 308, 100, 114). In the case of an operator, if the number of WoDa bits being used is less than the number of WoDa bits available, then the MM(s) 100 can be made immune to the interference that the unused true-complement WoDa signals (bits) would cause on a result by setting the state of the programmable switches 104 in the unused rows (associated with the unused true-complement WoDa signals) in MM(s) 100 to zero.
The operator's saved cache array index and setid are used to set the encoded partial address, associated with rows, as well as the column address made possible by the array output circuit 600 (acting as a column multiplexor). (Also see controls ens A/B<0:3> 600 of
A PLA access of the data RAM 1708 (MM 100) is now needed to perform the desired Boolean operation (as depicted in
For the PLA access (computation), part of the array address is encoded 328. The input data lines 322 (later WoDas), which are fed by bitwise true and complement versions of the operand(s) generated by the T/C Boolean data generation entity 305, set multiple bits of a bus that, for operand(s) and instruction(s) fetches, contains a decoded portion of the array address, known as the WoDa bits (which pass through multiplexor 308). Upon processing within the MM 102, the corresponding result(s) (data<1:M> for logic OPS) 112 of
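A behavioral model of one pass through the MM's PLA-like structure may help here. The Python sketch below expands each operand bit into a true/complement WoDa pair, ORs the enabled rows into each column (modeling the programmable switches 104 feeding the column ORs 106), and optionally applies the end inversion (EI 112); rows for unused WoDa pairs are kept at zero switch state so they cannot disturb the result, as noted above. The function name and data layout are illustrative assumptions, not a description of the actual circuits.

```python
def mm_pla_pass(operand_bits, switches, ei_invert):
    """One behavioral pass through the MM's PLA-like OR structure.

    operand_bits : list of 0/1 operand bits; each bit is expanded into a
                   true row and a complement row (the WoDa pair).
    switches     : switches[row][col] is the programmable switch state
                   (1 = row connected into that column OR).  Rows for
                   unused WoDa pairs should be all zero.
    ei_invert    : if True, the end-inversion (EI) stage inverts each column.
    """
    # Expand operands into interleaved true/complement WoDa rows.
    woda = []
    for b in operand_bits:
        woda.extend([b, 1 - b])

    n_cols = len(switches[0])
    out = []
    for col in range(n_cols):
        # Column OR (dot-OR) of every enabled WoDa row.
        col_or = any(woda[row] and switches[row][col] for row in range(len(woda)))
        out.append((not col_or) if ei_invert else col_or)
    return [int(v) for v in out]

# Two operand bits -> four WoDa rows; one output column wired to the
# complement rows, with EI inverting: behaves as AND of the two bits.
switches = [[0], [1], [0], [1]]           # rows: a, ~a, b, ~b
print(mm_pla_pass([1, 1], switches, ei_invert=True))  # -> [1]
print(mm_pla_pass([1, 0], switches, ei_invert=True))  # -> [0]
```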
Examples of PLA accesses are shown in timing diagram examples 202 through 212 of
This completes the description of one pass through the MM 100. The desired operation, however, may require more than one pass through the MM (which can be enabled by executing multiple instructions). See pass 2 in
One example of a 2-pass operation is an and-or, where the “and” is done on the first pass and the “or” is done on the second pass (see
For the first pass instruction, some dataflow control bits are set as follows:
The inverted bus feeds the true complement Boolean data generation block 305, within low temperature logic entities 304. The outputs from block 305 feed input data lines for PLA requests 322, which propagate through the WoDa mux 308, which feed the WoDa bits into the MM 100. Through this propagation, the inverted bus that fed block 305 negates the inversion bubbles 516. This source of bitwise complement operand(s) connects into the column 106, via enabled programmable switches 104. The output of the column OR(s) 106 propagate to the “ORs 110,” which propagate to EI 112, which behaves as an invert due to the EI control (See upper half of pass 1 logic representation 502). Because the source data is inverted, the or-invert propagation behaves as an “and” 512 (See lower half of pass 1 logic representation 502). This “and” feeds MM 100 output data<1:M> for_logic_ops which feeds back into a holding register within low temperature logic entities 304.
Only the simplest dataflow, that of
For a multipass operation, a separate instruction can be used for each pass. The first pass may dump its results in a register. The second pass may then pick up an operand from that register. This is one way that passes of a multipass operation can link to each other.
For the second pass instruction, some dataflow control bits are set as follows:
The operand for a second pass instruction can be retrieved from the holding register holding the first pass result. The holding register can be located in entity 304. This time a bitwise “true” holding register result, within low temperature logic entities 304, feeds the true complement Boolean data generation block 305. The outputs from block 305 feed input data lines for PLA requests 322 (via an internal multiplexor, which selects between the outputs of T/C Boolean data generation entity 305 and Data<1:M>_for_logic_OPS), which propagate through the WoDa mux 308, which feed the WoDa bits into the MM 100. Through the array cells 104, this source then feeds the column ORs 106, the output of which propagate to the ORs 110, which propagate to EI 112, which behaves as a non-inverting buffer due to the EI control (see upper half of pass 2 logic representation 504). The or-buffer propagation behaves as an “or” 514 (see lower half of pass 2 logic representation 504). This “OR” feeds MM 100 output data<1:M> for_logic_OPS.
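The two passes just described can be summarized by the following self-contained Python example, which models only the column-OR-plus-optional-EI behavior (a simplifying assumption) to compute (a AND b) OR c: pass 1 feeds bitwise complements with EI inverting (an AND by De Morgan's law), and pass 2 feeds the held pass-1 result and the next operand with EI acting as a buffer (an OR).

```python
def or_plane(inputs, ei_invert):
    """Column OR of the selected inputs, optionally inverted by EI."""
    v = any(inputs)
    return int(not v) if ei_invert else int(v)

a, b, c = 1, 0, 1   # compute (a AND b) OR c in two passes

# Pass 1: feed bitwise complements of a and b, with EI set to invert.
# OR of the complements, then invert, is AND(a, b) by De Morgan's law.
pass1 = or_plane([1 - a, 1 - b], ei_invert=True)

# The pass-1 result lands in a holding register; pass 2 picks it up as a
# "true" operand together with c, with EI acting as a buffer: a plain OR.
pass2 = or_plane([pass1, c], ei_invert=False)

print(pass2)                       # -> 1
assert pass2 == (a & b) | c        # matches the intended and-or
```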
In one or more embodiments, the directory array contains a bit that specifies whether the entry is for an operation. This bit protects the line from being LRU'ed out (by a “least recently used” management scheme known in the art), unless all available setids for a given directory index are operations. If a special flush request comes in for that address, this protection bit is cleared. Further details regarding a use of this function are provided herein below.
When a fetch request comes to the L1 cache for an operator, it will come along with two control bits that will be installed in the directory entry:
There will also be a control bit put in the directory indicating an operator. The “operator bit” will be used for debug and for various other uses to be determined. One use of the read-only bit is, if the operator fetch misses in the L1 cache, to fetch the line from the cache hierarchy with read-only status (as opposed to fetching it conditional-exclusive). Another use of the read-only bit is to detect an exception or an error if a store to the line is attempted. The two purposes of the read-only bit, then, are to improve performance in a multiprocessor system and to detect when operators are being improperly modified.
The “line protect” bit is used to override the directory's least recently used (LRU) management, so that the line does not get LRU'ed out of the cache, except for unusual cases. One unusual case would be another CPU in a multiprocessing system storing to that line. The reason for the “line protect” bit is to preserve the operator for an extended period of time, until usage of it is complete, even if usage is infrequent enough such that the operator would normally LRU out.
There would be a corresponding new “flush operator(s)” (one or more operators) request to the level 1 cache. It would either come with a corresponding line address, or increment through all directory indexes. It would turn off the directory's line protect bit(s).
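A hedged software sketch of the directory-management behavior described above follows: LRU victim selection skips line-protected entries unless every setid at the index is protected, and a “flush operator(s)” request clears the protect bits. The data layout and function names are hypothetical.

```python
def choose_lru_victim(entries):
    """Pick the LRU victim for one directory index.

    entries: list of dicts with 'lru_rank' (higher = older) and
             'line_protect'.  Protected lines are skipped unless every
             setid at this index is protected (the unusual case noted
             in the text).
    """
    candidates = [e for e in entries if not e["line_protect"]]
    if not candidates:           # all setids hold operators: fall back to plain LRU
        candidates = entries
    return max(candidates, key=lambda e: e["lru_rank"])

def flush_operators(directory):
    """'Flush operator(s)' request: clear line-protect bits across all indexes."""
    for entries in directory.values():
        for e in entries:
            e["line_protect"] = False
```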
In an alternative embodiment of the invention, it is contemplated that the operator line itself can be modified while the system is running, to modify the operation based on results from earlier computations (a self-modifying code). There would need to be logic in place to prevent the operation from being stored-to while it is being used.
Referring again to
As part of how the term “symmetric” is being defined in accordance with embodiments of the present disclosure, the LA directory and corresponding level 1 cache 1706 must be indexed only by address bits that are not subject to translation. This restriction limits the maximum size of the cache. Furthermore, if the directories are separate, they must still be addressed with the same array index bits.
The term “synonyms,” as may be used herein, is intended to refer broadly to two (or more) logical addresses that map to the same absolute address. This design point beneficially removes the possibility of two synonyms being in different cache index addresses, since the cache is not indexed by address bits that are subject to translation.
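The following Python sketch illustrates why this is so under an assumed geometry (4 KB pages and 64-byte lines, neither of which is specified by the disclosure): when the cache index is drawn only from address bits below the page boundary, two synonyms necessarily map to the same cache index.

```python
PAGE_BITS = 12        # assumed 4 KB pages: bits 0..11 are untranslated
LINE_BITS = 6         # assumed 64-byte cache lines
INDEX_BITS = PAGE_BITS - LINE_BITS   # index drawn only from untranslated bits

def cache_index(logical_address):
    """Index uses only address bits below the page boundary."""
    return (logical_address >> LINE_BITS) & ((1 << INDEX_BITS) - 1)

# Two synonyms: different logical pages that map to the same absolute page
# (translation not shown), with the same in-page offset.  They land at the
# same cache index by construction.
syn_a = 0x0001_2340
syn_b = 0x0FED_2340
assert cache_index(syn_a) == cache_index(syn_b)
```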
The LA directory according to aspects of the present inventive concept preferably integrates the functions of the TLB 1712 and directory 1710. Therefore, any requests that normally invalidate only directory entries (e.g., cross interrogates (XIs)), and any requests that normally invalidate only TLB entries (such as ipte=invalidate page table entry), will invalidate LA directory entries.
The main fields in the logical directory portion of an LA directory are the full logical (or effective) page address and, if it exists, the full logical address extension (e.g., alet, asce, etc); those fields may be abbreviated as “log address” and “ext.” The main field in the absolute directory portion of an LA directory is the full absolute (or physical) page address; that field may be abbreviated as “absolute address.”
The LA directory arrays may be sliced like a TLB, in terms of the three fields: log address, ext and absolute address. However, the LA directory's array indexing and set associativity are like a directory. For example, assume an LA directory and cache that is four-way set associative (i.e., four setids).
For a fetch request, the requestor's log address and ext are preferably compared against the corresponding values in the LA directory's four setids. If there is a hit, the corresponding data is selected from the cache to return to the requestor. The absolute address is not used at all for a fetch hit or a fetch miss.
For an XI (cross-interrogate or invalidate), the XI's absolute address may be compared against corresponding values in the LA directory's four setids. If there is a hit, the corresponding LA directory valid bit is turned off. If the LA directory implementation is split, the XI searches may run in parallel with fetches.
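The fetch and XI behaviors of the LA directory described above can be sketched behaviorally as follows; the entry layout and function names are assumptions for illustration only.

```python
from dataclasses import dataclass

@dataclass
class LADirEntry:
    valid: bool
    log: int          # logical (effective) page address
    ext: int          # logical address extension (alet/asce), if present
    absolute: int     # absolute (physical) page address

def la_fetch_lookup(setids, fth_log, fth_ext):
    """Fetch: compare the requestor's log/ext against all setids; return hit setid."""
    for sid, e in enumerate(setids):
        if e.valid and e.log == fth_log and e.ext == fth_ext:
            return sid          # the absolute address is not consulted at all
    return None                 # miss: send log/ext to the next level of hierarchy

def la_xi_invalidate(setids, xi_abs):
    """XI: compare the XI's absolute address; clear the valid bit on a hit."""
    for e in setids:
        if e.valid and e.absolute == xi_abs:
            e.valid = False
```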
In one or more embodiments of the invention, the interface between the level 1 cache 1706 and a next level of storage hierarchy preferably receives two related changes as follows:
In one or more embodiments of the invention, changes to the system architecture may be performed for implementing certain other features or performance enhancements, including changes within a next level (e.g., level 2) storage hierarchy to support the new design, and changes for TLB spop (special op) handling (ipte, etc.), which may affect how virtual addressing, page swapping, etc., function, among other possible modifications.
As previously stated with regard to synonyms, this design point removes the possibility of two synonyms being in different cache index addresses, since the cache is not indexed by address bits that are subject to translation. However, the new design, according to some embodiments, can have synonyms within the same cache/LA directory index address, in different setids, where different log address/ext values have the same corresponding absolute page address; one or more embodiments of the invention may provide multiple ways to handle this case.
In terms of array layout comparisons for a traditional design versus the new design according to aspects of the inventive concept, as a rough approximation, using a reasonable set of assumptions, the new design may use about five percent more array area for the level 1 cache unit compared to a traditional base design. However, this slight increase in array area may result in increased data RAM 1708 bandwidth. See the pros/cons list below.
With reference to
The traditional lookup circuitry 1800 further comprises a plurality of sets of comparison blocks 1803, 1807 and 1811, each set of comparison blocks including multiple comparators (e.g., two in this example, one for each setid X, Y). Each comparator in the first set of comparison blocks 1803 is configured to compare an output of a corresponding one of the TLB log arrays 1802 (setid X, Y) with an fth_log address supplied to the comparator. Each comparator in the second set of comparison blocks 1807 is configured to compare an output of a corresponding one of the TLB ext arrays 1806 (setid X, Y) with an fth_ext address supplied to the comparator. Likewise, each comparator in a third set of comparison blocks 1811 is configured to compare an output of a corresponding one of the TLB abs arrays 1810 (setid X, Y) with an spop_abs address supplied to the comparator. “fth_log” is the logical address of a fetch lookup request. “fth_ext” is the logical address extension of a fetch lookup request. “spop_abs” is the absolute address of a TLB special op, such as ipte (invalidate page table entry).
Outputs of the comparators in the first and second sets of comparison blocks 1803 and 1807, respectively, are supplied as inputs to corresponding AND gates 1808 (one AND gate for each setid X, Y). Respective outputs of the AND gates 1808 form an output signal, tlb log/ext hit x/y, of the traditional lookup circuitry 1800. The tlb log/ext hit x/y output signal is a source of the control AO 1812, which selects the TLB absolute address, sent to the next level of cache hierarchy for directory misses. An output of each comparator in the third set of comparison blocks 1811 forms a corresponding output signal inv_tlb. At least a portion of the outputs of the TLB abs arrays 1810 may form a TLB absolute address which is supplied to an AND-OR (AO) gate 1812. The AO gate 1812 is configured to select the TLB absolute address from the TLB abs arrays 1810 with the tlb log/ext hit x/y output signal from the AND gates 1808 to form an address signal, L2 fth abs adr, which may be sent to a next level of cache hierarchy (e.g., “L2,” standing for level 2 cache) when a TLB hit and an absolute directory miss occurs.
A portion of the outputs of the TLB abs arrays 1810 (corresponding to setid X) may be supplied as an input to a second AO gate 1817. The second AO gate 1817 may also be configured to receive a cross-interrogate absolute signal, xi_abs, and a bypass control signal, byp. An output generated by the second AO gate 1817 may be supplied as an input to a fourth set of comparison blocks 1815. The fourth set of comparison blocks 1815 preferably includes multiple comparators (eight in this example: a pair (setids X, Y) for each setid A through D in the absolute directory 1814; that is, setids AX, AY, BX, BY, . . . , DX, DY). Outputs from the set of comparison blocks 1815 may be supplied as an input to a set of AO gates 1816 (four in this example; one for each setid A through D in the absolute directory 1814). The tlb log/ext hit x/y signal generated by the AND gates 1808 may be supplied as an input to the set of AO gates 1816, which is configured to generate absolute directory hit signals, abs dir hit A/B/C/D, corresponding to each of the setids in the absolute directory 1814.
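For readers who find a behavioral summary helpful, the traditional lookup path just described can be approximated by the following Python sketch, which flattens the comparators and AND/AO gating into boolean expressions over one directory index; the parallel eight-way absolute directory compare collapses, behaviorally, to comparing the winning TLB absolute address. All names and the data layout are illustrative assumptions.

```python
def traditional_lookup(tlb, abs_dir, fth_log, fth_ext):
    """Behavioral sketch of the traditional path (one directory index).

    tlb     : per TLB setid (X, Y): dict with 'log', 'ext', 'abs'
    abs_dir : per directory setid (A..D): absolute page address
    Returns (tlb_hit_setid, abs_dir_hit_setid, l2_fetch_abs_adr).
    """
    # TLB hit per setid = log compare AND ext compare (the AND gates 1808).
    tlb_hits = [t["log"] == fth_log and t["ext"] == fth_ext for t in tlb]
    if not any(tlb_hits):
        return None, None, None          # TLB miss: request an address translation

    x = tlb_hits.index(True)
    tlb_abs = tlb[x]["abs"]              # AO-selected TLB absolute address

    # In hardware, all eight TLB-setid x directory-setid compares run in
    # parallel and the TLB hit then selects four of them; behaviorally this
    # collapses to comparing the winning TLB absolute address.
    for sid, abs_addr in enumerate(abs_dir):
        if abs_addr == tlb_abs:
            return x, sid, None          # directory hit: late-select the data RAM
    return x, None, tlb_abs              # directory miss: send abs addr to L2
```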
Referring now to
The enhanced operator lookup circuitry 1900 further includes a plurality of sets of comparison blocks 1903, 1907 and 1911, each set of comparison blocks including multiple comparators (e.g., four in this example, one for each setid A through D of the arrays 1902, 1906, 1910). Each comparator in the first set of comparison blocks 1903 is configured to compare an output of a corresponding one of the directory log arrays 1902 (setid A, B, C, D) with an fth_log signal supplied to the comparator. Each comparator in the second set of comparison blocks 1907 is configured to compare an output of a corresponding one of the directory extension arrays 1906 (setid A, B, C, D) with an fth_ext signal supplied to the comparator. Likewise, each comparator in a third set of comparison blocks 1911 is configured to compare an output of a corresponding one of the directory absolute arrays 1910 (setid A, B, C, D) with an xi/spop_abs signal supplied to the comparator.
Outputs from the first and second sets of comparison blocks 1903, 1907 may be supplied as inputs to AND gates 1908, configured to generate respective directory log/extension output signals indicative of a hit occurring between the directory log and extension arrays associated with setids A through D. The fth_log and fth_ext control signals may also be supplied to an AO gate 1912 for forming a log/ext address which may be sent to a next level of cache hierarchy (e.g., “L2,” standing for level 2 cache) when an fth log/extension miss occurs. Outputs generated by the third set of comparison blocks 1911 may form control signals indicative of a directory absolute hit occurring in the directory absolute arrays 1910 associated with setids A through D.
Comparing the illustrative traditional lookup circuitry 1800 shown in
The traditional lookup circuitry 1800 includes eight absolute directory comparators 1815 (setids AX, AY, BX, BY, CX, CY, DX, DY) rather than four as used in the enhanced lookup circuitry 1900. This is because both TLB setids in the traditional lookup circuitry 1800 are compared against the directory in parallel, to reduce latency. The tlb log/ext X/Y hit result of the AND gates 1808 is then used to select four of the eight absolute directory compare results via the AO gate 1816 (labelled “abs dir hit A/B/C/D”). The enhanced lookup circuitry 1900 does not have this parallel-compare-neckdown scheme.
The absolute directory hit setid (labelled “abs dir hit A/B/C/D”) signal generated by the AO gates 1816 in the traditional lookup circuitry 1800 is used for the late select to data RAM 1708 (
In the traditional lookup circuitry 1800, the TLB absolute address generated by the AO gate 1812 is the source of the address sent to the next level of cache hierarchy (e.g., L2 cache) for directory misses. The enhanced lookup circuitry 1900 instead sends the requestor's log address and ext, generated as an output of the AO gate 1912, to the next level of cache hierarchy (e.g., L2 cache) for directory misses.
The traditional lookup circuitry 1800 detects a TLB hit (labelled “tlb log/ext hit x/y”) through AND gates 1808 if it has both a TLB log address hit resulting from comparators 1803 and a TLB ext hit resulting from comparators 1807. If the traditional lookup circuitry 1800 detects a TLB miss, it sends an address translation request to a larger TLB or address translation unit that is likely located nearby. By contrast, the enhanced lookup circuitry 1900 does not differentiate between a TLB hit and a TLB miss. Rather, the next level of storage hierarchy (here described previously as L2 cache) checks its TLB to determine TLB hit/miss status.
As previously described, the traditional lookup circuitry 1800 includes a bypass (“byp”) path associated with AO gate 1817 on one of the TLB setids (e.g., setid X) feeding the directory absolute address comparators 1815, which is used for XI searches. The TLB absolute address comparators 1811 are only used for TLB special ops. In the enhanced lookup circuitry 1900, directory absolute address comparators 1911 are used for XI searches and TLB special ops. The enhanced lookup circuitry 1900 may also use the directory absolute address comparators 1911 for synonym detection for directory misses, depending on how synonyms are handled.
Aspects according to the present disclosure presented herein are well-suited for use in conjunction with superconducting logic. For example, if a superconducting system includes a cache hierarchy, level 1 cache (e.g., level 1 cache 1706 in
A superconducting system does not require, and moreover cannot support, a large level 1 cache array size. Embodiments of the present disclosure can beneficially exploit this practical design restriction: because the cache is not indexed by any address bits that require translation, the cache size is inherently limited, which is acceptable in a superconducting system.
It was previously mentioned that there may be TLB spop handling performance concerns. Ironically, a smaller cache size may help alleviate such concerns. That is because lines that may otherwise be a performance concern, for example due to being re-referenced with an old translation that is not in the directory, may be overwritten or removed in a smaller cache anyway, for instance as a result of a least recently used (LRU) caching scheme or the like, thereby requiring a refetch to level 2 cache regardless of whether the translation was available.
The idea presented here is well-suited to fungible arrays for memory, logic, and mixed memory/logic operations (see diagrams 300 and 950).
This can be seen when comparing the old versus new lookup schemes, in terms of the ‘merged directories/cache,’ described with reference to
For the traditional lookup circuitry 1800, the following seven steps would be needed:
For the enhanced lookup circuitry 1900, only two steps are needed, as follows:
However, it should be noted that the traditional lookup scheme can result in a more symmetric lookup array layout for the merged idea, in terms of array row bits lining up with each other, for the traditional lookup (TLB vs. absolute directory) compared to the enhanced lookup (directory log/ext vs. directory absolute). This can be seen when comparing the two examples in
In summary, there are various advantages provided by embodiments of the enhanced lookup scheme, including, but not limited to, the following:
Some cons of the enhanced lookup scheme, as a trade-off for the benefits, may include:
An L1 cache 1706 has a corresponding lookup path 1714 that consists of arrays such as a TLB 1712, a logical directory 1710, and an absolute directory (also 1710). Typically, these arrays are separate from each other and are accessed mostly or fully in parallel, to minimize data return latency and thereby optimize performance.
Consider an alternative array layout where the lookup path arrays and the L1 cache array are mostly or fully accessed in series. Furthermore, consider array layouts where these serially accessed arrays, instead of being vertically sliced into separate arrays, are horizontally sliced into different groups of addresses/rows, within one or more arrays, as shown in
The term “array” will continue to be used when referring to the cache or lookup arrays, even though they are now only subsets of physical arrays.
When balancing real estate between the TLB, directory(s), and cache, assume the best tradeoff is for the cache to use the majority of the real estate, and assume that the most efficient physical array depth is a power of two. Because the lookup rows and the cache rows share the same power-of-two-deep physical array(s), achieving both goals means the desired cache array depth is not a power of two. Typical array addressing does not accommodate this.
One key solution is to:
This means the setids of the cache are horizontally sliced. For the 2 examples in
It is preferred that the lookup path arrays have their setids sliced vertically, so that their setids can be compared against in parallel. See (i) log/absolute directory setids A through C/F 2010 and 2020 in
It is mentioned above that this idea allows for easier implementation of higher set associativities. Accessing the arrays in series removes the need for parallel compares that had existed only for the purpose of further latency reduction, such as parallel absolute directory compares against multiple TLB setids 1815. Using the directory hit setids as encodes of cache address bits, instead of late selects, is more area-efficient. Although the examples below vertically slice all setids for a given lookup array, an alternative for handling high set-associativity is to do a combination of vertical and horizontal slicing. For example, grouping half the setids in one horizontal slice, and the other half in another horizontal slice, where each group of 3 setids is vertically sliced, like this:
This would increase latency further, but would allow for reducing the number of setids read in parallel, along with reducing the number of comparators needed.
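One possible row-addressing arrangement for such a horizontally sliced merged array is sketched below, under assumed (not disclosed) dimensions: the lookup-path slices occupy a fixed block of rows, the remaining non-power-of-two rows hold the cache, and the directory hit setid is encoded directly into the cache row address rather than used as a late select.

```python
ARRAY_DEPTH = 512          # assumed power-of-two physical depth
LOOKUP_ROWS = 128          # assumed rows given to the lookup-path slices
CACHE_ROWS = ARRAY_DEPTH - LOOKUP_ROWS   # cache depth is NOT a power of two
NUM_SETIDS = 3             # e.g., setids A..C
CACHE_INDEXES = CACHE_ROWS // NUM_SETIDS

def cache_row(index, hit_setid):
    """Encode the directory hit setid into the row address (no late select)."""
    assert 0 <= index < CACHE_INDEXES and 0 <= hit_setid < NUM_SETIDS
    return LOOKUP_ROWS + hit_setid * CACHE_INDEXES + index

def lookup_row(slice_base, index):
    """Lookup-path slices (directory/TLB) live in the first LOOKUP_ROWS rows."""
    return slice_base + index

# Example: setid B (1), congruence class 17.
print(cache_row(17, 1))    # a row in the cache region of the merged array
```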
For
What follows is a description of the example in
The sequence for an exemplary XI is the following:
The sequence for an exemplary fetch is the following:
The comparator efficiency follows: the row position of the absolute directory AA fields could be lined up with the position of the log directory LA or EXT field. Then the absolute directory AA comparators could be shared with the log directory LA or EXT compares.
What follows is a description of the example in
The sequence for XI is the following:
The sequence for TLB AA spop (special op) is the following:
The sequence for a fetch is the following:
Concerning comparator efficiency, the row position of the 6 absolute directory AA fields could be lined up with the position of the 6 TLB LA/EXT/AA fields as seen in 2120 and 2120. Then the 6 absolute directory AA comparators could be shared with the TLB LA/EXT/AA fields.
Fungible arrays for memory, logic, and mixed memory/logic operations are well-suited for use with merged arrays.
The intrinsic flexibility of the hardware, covering logic, memory, and both, permits a variety of hardware organizations suited to achieve far higher performance per die area than a general purpose processor. Moreover, the underlying circuit structure is regular, which lends itself to simpler implementations in the highly constrained design execution process of a microwave design in superconducting technology. Again, looking at the fetch for the example in
There are many alternative embodiments of merged arrays that use different addressing combinations, different set associativities, larger cache sizes, more array instances, different building block dimensions, different lookup approaches, etc. An alternative concept would be to structure the fungible arrays and surrounding logic such that various lookup/cache structures could be added and removed during machine operation.
It is to be understood that the methods, circuits, systems and components depicted herein are merely exemplary. Alternatively, or in addition, the functionality described herein can be performed, at least in part, by one or more hardware logic components. For example, and without limitation, illustrative types of hardware logic components that can be used include Field-Programmable Gate Arrays (FPGAs), Application-Specific Integrated Circuits (ASICs), Application-Specific Standard Products (ASSPs), System-on-a-Chip systems (SOCs), Complex Programmable Logic Devices (CPLDs), etc. In an abstract but still definite sense, any arrangement of components to achieve the same functionality is effectively “associated” such that the desired functionality is achieved. Hence, any two components herein combined to achieve a particular functionality can be seen as “associated with” each other such that the desired functionality is achieved, irrespective of architectures or inter-medial components. Likewise, any two components so associated can also be viewed as being “operably connected,” or “coupled,” to each other to achieve the desired functionality.
The functionality associated with the examples described in this disclosure can also include instructions stored in a non-transitory media. The term “non-transitory media” as used herein refers to any media storing data and/or instructions that cause a machine, such as a processor, to operate in a specific manner. Exemplary non-transitory media include non-volatile media and/or volatile media. Non-volatile media include, for example, a hard disk, a solid-state drive, a magnetic disk or tape, an optical disk or tape, a flash memory, an EPROM, NVRAM, MRAM, or other such media, or networked versions of such media.
Volatile media include, for example, dynamic memory, such as DRAM, SRAM, a cache, or other such media. Non-transitory media is distinct from, but can be used in conjunction with, transmission media. Transmission media is used for transferring data and/or instructions to or from a machine. Exemplary transmission media include coaxial cables, fiber-optic cables, copper wires, and wireless media, such as radio waves.
Furthermore, those skilled in the art will recognize that boundaries between the functionality of the above-described operations are merely illustrative. The functionality of multiple operations may be combined into a single operation, and/or the functionality of a single operation may be distributed in additional operations. Moreover, alternative embodiments may include multiple instances of a particular operation, and the order of operations may be altered in various other embodiments.
Although the disclosure provides specific examples, various modifications and changes can be made without departing from the scope of the disclosure as set forth in the claims below. Accordingly, the specification and figures are to be regarded in an illustrative rather than a restrictive sense, and all such modifications are intended to be included within the scope of the present disclosure. Any benefits, advantages, or solutions to problems that are described herein with regard to a specific example are not intended to be construed as a critical, required, or essential feature or element of any or all the claims.
Furthermore, the terms “a” or “an,” as used herein, are defined as one or more than one. Also, the use of introductory phrases such as “at least one” and “one or more” in the claims should not be construed to imply that the introduction of another claim element by the indefinite articles “a” or “an” limits any particular claim containing such introduced claim element to inventions containing only one such element, even when the same claim includes the introductory phrases “one or more” or “at least one” and indefinite articles such as “a” or “an.” The same holds true for the use of definite articles.
Unless stated otherwise, terms such as “first” and “second” are used to arbitrarily distinguish between the elements such terms describe. Thus, these terms are not necessarily intended to indicate spatial, temporal, or other prioritization of such elements.
This application claims the benefit of and priority to U.S. Provisional Patent Application No. 63/425,160, filed Nov. 14, 2022, entitled “Superconducting Memory, Programmable Logic Arrays, and Fungible Arrays,” U.S. Provisional Patent Application No. 63/394,130, filed on Aug. 1, 2022, entitled “Control and Data Flow Logic for Reading and Writing Large Capacity Memories, Logic Arrays, and Interchangeable Memory and Logic Arrays Within Superconducting Systems,” and U.S. Provisional Patent Application No. 63/322,694, filed Mar. 23, 2022, entitled “Control Logic, Buses, Memory and Support Circuitry for Reading and Writing Large Capacity Memories Within Superconducting Systems,” the disclosures of which are incorporated by reference herein in their entirety for all purposes.
| Filing Document | Filing Date | Country | Kind |
|---|---|---|---|
| PCT/US2023/016090 | 3/23/2023 | WO | |

| Number | Date | Country |
|---|---|---|
| 63425160 | Nov 2022 | US |
| 63394130 | Aug 2022 | US |
| 63322694 | Mar 2022 | US |