Electronic devices have become an integral part of daily life. Many electronic applications perform various operations, such as a TopK operation, an ArgMax operation, a SoftMax operation, etc., on input data, e.g., a vector, in order to make certain determinations. For example, in one particular application in machine learning (ML), a TopK operation is used to identify the top K indices or entries with the highest probabilities among a large set of data entries, e.g., when classifying an image among thousands of classes. The TopK operation has also become a common operator in other applications, such as ad-hoc search and retrieval in relational databases, document databases, multimedia databases, etc. A special case of the TopK operation is the ArgMax operation, in which K is set equal to one, to identify the largest data value and its index. To perform an ArgMax operation, elements in a vector are compared to one another to identify the largest value and the index location associated with the largest value.
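For illustration only, the following is a minimal software sketch of the ArgMax and TopK operations in plain Python; it is a reference model, not the hardware implementation described later, and the example scores are hypothetical.

```python
def argmax(values):
    """Return the index of the largest element by comparing elements to one another."""
    best_index = 0
    for i, v in enumerate(values):
        if v > values[best_index]:
            best_index = i
    return best_index

def topk(values, k):
    """Return the indices of the K largest elements; ArgMax is the K=1 special case."""
    return sorted(range(len(values)), key=lambda i: values[i], reverse=True)[:k]

scores = [0.10, 0.70, 0.05, 0.15]   # hypothetical class probabilities
print(argmax(scores))               # -> 1
print(topk(scores, 2))              # -> [1, 3]
```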
In general, a SoftMax operation operates on input data, e.g., data associated with an image, a translation term, etc., and normalizes the input data into a probability distribution with respective probability values. For example, the SoftMax operation may be performed on an image of a cat to determine the likelihood that the image is a subject of interest, e.g., a cat as opposed to a dolphin, a dog, a pole, etc. The SoftMax operation has become even more important in light of recent developments in ML, e.g., self-driving vehicles identifying objects, language translation determining the meaning of a term, etc. The probability distribution can be used in a self-driving application to identify the likelihood that an object is a person as opposed to a pole, whereas in translation applications the SoftMax operation is used to determine the correct translation of a term, as an example.
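As a minimal illustration of the normalization just described, a SoftMax over a small vector of hypothetical scores could be sketched as follows (the max-subtraction refinement used later for numerical stability is omitted here for brevity):

```python
import math

def softmax(scores):
    """Normalize raw scores into a probability distribution that sums to 1.0."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 0.5, -1.0]           # hypothetical scores, e.g., cat vs. dog vs. pole
probabilities = softmax(logits)     # approximately [0.79, 0.18, 0.04]
```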
The amount of data being processed has increased substantially in recent years, given the increase in ML applications as well as the increase in the amount of data being exchanged. Unfortunately, conventional systems utilize a single processor (element) to process large amounts of data, resulting in large delays and slower processing speed for various ML operations such as the SoftMax operation, the ArgMax operation, etc. For example, some conventional systems gather distributed data onto a single processing element in order to perform the desired operation, e.g., a SoftMax operation, sequentially, which is time consuming and unsuitable for many applications, e.g., self-driving vehicles. Moreover, using a single processing element usually results in the local memory, e.g., static random access memory (SRAM), being inadequate to perform the operations without a need to utilize the external double data rate (DDR) memory, which results in additional data transfer latencies.
Aspects of the present disclosure are best understood from the following detailed description when read with the accompanying figures. It is noted that, in accordance with the standard practice in the industry, various features are not drawn to scale. In fact, the dimensions of the various features may be arbitrarily increased or reduced for clarity of discussion.
The following disclosure provides many different embodiments, or examples, for implementing different features of the subject matter. Specific examples of components and arrangements are described below to simplify the present disclosure. These are, of course, merely examples and are not intended to be limiting. In addition, the present disclosure may repeat reference numerals and/or letters in the various examples. This repetition is for the purpose of simplicity and clarity and does not in itself dictate a relationship between the various embodiments and/or configurations discussed.
Before various embodiments are described in greater detail, it should be understood that the embodiments are not limiting, as elements in such embodiments may vary. It should likewise be understood that a particular embodiment described and/or illustrated herein has elements which may be readily separated from the particular embodiment and optionally combined with any of several other embodiments or substituted for elements in any of several other embodiments described herein. It should also be understood that the terminology used herein is for the purpose of describing certain concepts, and the terminology is not intended to be limiting. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood in the art to which the embodiments pertain.
A need has arisen to perform certain ML operations, e.g., SoftMax, ArgMax, TopK, etc., on ML hardware with a plurality of processing tiles that enables data to be processed in a much faster fashion in comparison to the sequential processing of a single processing element, thereby improving the processing speed. Leveraging multiple processing tiles addresses inadequacies associated with data movement between local memory, e.g., SRAM, and external memory such as DDR, because a large data set is broken down into smaller data sets, which can be processed by each processing tile locally without a need to access the external memory once the data is stored locally.
Specifically, the core is configured to divide the plurality of ML commands between the core, e.g., host or host CPU, and the inference engine for efficient execution thereof. The ML commands, e.g., SoftMax, TopK, ArgMax, etc., are compiled by the compiler into ISA instructions, and the ISA instructions and the relevant data associated with them are transmitted from the core and the memory to the instruction-streaming engine and the data-streaming engine, respectively, for efficient streaming to the inference engine for execution. The data- and instruction-streaming engines are configured to send one or more data streams, e.g., data sub-vectors to be operated on by the plurality of processing elements, and the compiled ML commands, e.g., ISA instructions corresponding to SoftMax, TopK or ArgMax, to the inference engine in response to the programming instructions received from the core.
It is appreciated that, in some embodiments, the ML commands transmitted from the core to the data/instruction-streaming engines are in a function call format, thereby enabling different processors with different instruction set architectures to be programmed using one type of instruction set architecture. To the core, the operation being performed appears to be a write operation into a memory component, but in reality the operation being done is the passing of specific instructions along with their associated data, via a function call, to the streaming engines for transmission to the inference engine where they can be executed. The inference engine is configured to process the instruction/data streams received from the data/instruction-streaming engines for the ML operation according to the programming instructions received from the instruction/data-streaming engines.
For a non-limiting example, the inference engine may include 64 processing elements (each processing element may further include a plurality of smaller processing elements, PE and POD, that are described in U.S. patent application Ser. No. 16/226,508, filed Dec. 19, 2018, which is incorporated herein by reference in its entirety). Each of those processing elements is configured to receive a sub-vector and an instruction (i.e., compiled SoftMax instructions, an ArgMax instruction, etc.). As such, multiple sub-vectors may be operated on simultaneously, thereby reducing the processing time. For illustrative purposes, it is assumed that there are 64 processing elements (also referred to as processing tiles), where each processing element is configured to process 64 elements with a depth of 10 (i.e., 10 vectors). However, it is appreciated that any number of processing tiles may be used, each being capable of processing any number of elements, such as 32 as opposed to 64, with a different depth, such as 5. In some examples, 4 processing elements may each receive a sub-vector (of 32 elements each, as an example) to process an ArgMax operation on vector data of size 128 elements in parallel, while the other 60 processing elements of the inference engine may operate on a different vector or perform a different ML operation altogether. Accordingly, the index associated with the largest value of the vector can be identified.
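A software-only sketch of the 128-element ArgMax example above, assuming four tiles of 32 elements each (tile behavior is simulated in plain Python and the names are hypothetical), might look like this:

```python
def local_argmax(sub_vector, base_index):
    """Each simulated tile returns its local (maximum value, global index) pair."""
    best = max(range(len(sub_vector)), key=lambda i: sub_vector[i])
    return sub_vector[best], base_index + best

def parallel_argmax(vector, num_tiles=4):
    """Split the vector across tiles, then reduce the per-tile results to a global index."""
    size = len(vector) // num_tiles
    local_results = [local_argmax(vector[t * size:(t + 1) * size], t * size)
                     for t in range(num_tiles)]
    _, index = max(local_results)   # largest local value wins; its original index is returned
    return index
```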
The proposed ML hardware architecture is highly efficient, flexible and optimized for high-efficiency ML computing while programmable to adapt to the changing environment, usage, applications and algorithms for ML with reduced overhead. By providing hardware support to streamline data/instruction flow, the proposed ML hardware architecture improves system-level performance by significantly reducing the hardware overhead involved in moving data and/or instructions in existing computing architectures. Moreover, the programming instruction set reduces the number of instructions required to perform certain tasks, e.g., processing, moving data, loading data, etc. The proposed ML hardware architecture works well with existing software frameworks and code and may be applied to a wide variety of ML algorithms and neural networks including but not limited to convolutional neural network (CNN), recurrent neural network (RNN), gradient boosting machine (GBM), generative adversarial neural network, decision trees, random forest, support vector machine (SVM), clustering, Markov random field (MRF), etc.
A SoftMax operation, when compiled, is generally broken down into sub-operations or tasks. For example, a SoftMax operation generally involves identifying the maximum value within a given vector. The maximum value is then subtracted from each element of the vector and an exponential of each result is computed to form exponential values. The exponential values are summed and the sum is inverted to form an inverted value. Finally, the exponential values are multiplied by the inverted value. The various steps in the SoftMax operation are summarized below: (1) find the maximum value of the vector; (2) subtract the maximum value from each element; (3) compute the exponential of each result to form exponential values; (4) sum the exponential values; (5) invert the sum to form an inverted value; and (6) multiply each exponential value by the inverted value.
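For a single vector held entirely in local memory, these steps could be sketched as follows (an illustrative reference model only; the distributed, multi-tile version is described below):

```python
import math

def softmax_steps(vector):
    max_value = max(vector)                       # (1) find the maximum value
    shifted = [v - max_value for v in vector]     # (2) subtract the maximum from each element
    exps = [math.exp(s) for s in shifted]         # (3) exponentiate each result
    total = sum(exps)                             # (4) sum the exponential values
    inverted = 1.0 / total                        # (5) invert the sum
    return [e * inverted for e in exps]           # (6) multiply each exponential by the inverted value
```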
Performing the SoftMax operation on an ML hardware with multiple processing tiles is challenging. For example, if a vector is divided into a plurality of sub-vectors then identifying the largest or maximum value may need certain data to be exchanged between the processing tiles. Other operations of the SoftMax operation may similarly need certain information to be exchanged between the processing tiles. It is appreciated that generally latency increases as the amount of data exchange between processing tiles increases. As such, performing the operations for the SoftMax operation in an efficient manner utilizing the architecture of the ML hardware is critical.
The proposed approach performs the SoftMax operation in an efficient manner while leveraging the architecture of the ML hardware with multiple processing tiles to increase processing speed and reduce latencies associated with data movement. The architecture of the ML hardware is described first, before describing the proposed approach to perform an ML operation such as a SoftMax operation.
In the example of
The ML hardware in this nonlimiting example receives an input data 299. For illustration purposes, the input data 299 is presumed to be of size (4096, 10). The input data 299 may be represented as a data matrix in which each of the 10 columns is one 4096-element vector.
In other words, the input data 299 includes 10 vectors, each vector having 4096 elements. In some examples the vector A=(a1, . . . , a4096) may be associated with a first image, while vector B=(b1, . . . , b4096), which is the second column of the input data 299, may be associated with a second image. It is appreciated that vector C=(c1, . . . , c4096) is the third column of the input data 299 and associated with a third image, vector D=(d1, . . . , d4096) is the fourth column of the input data 299 and associated with a fourth image, vector E=(e1, . . . , e4096) is the fifth column of the input data 299 and associated with a fifth image, vector F=(f1, . . . , f4096) is the sixth column of the input data 299 and associated with a sixth image, vector G=(g1, . . . , g4096) is the seventh column of the input data 299 and associated with a seventh image, vector H=(h1, . . . , h4096) is the eighth column of the input data 299 and associated with an eighth image, vector I=(i1, . . . , i4096) is the ninth column of the input data 299 and associated with a ninth image, and vector J=(j1, . . . , j4096) is the tenth column of the input data 299 and associated with a tenth image.
Since each vector includes 4096 elements, no one processing tile can process the entire vector alone (because each processing tile is capable of processing 64 elements) unless it is processed sequentially, which is inefficient, slow and suffers from unnecessary latency.
Accordingly, the input data 299 may be divided into smaller portions. For example, vector A=(a1, . . . , a4096) may be divided into 64 portions each with 64 elements. In other words, vector A=(A1, A2, . . . , A64) where A1=(a1, . . . , a64), A2=(a65, . . . , a128), A3=(a129, . . . , a192), etc. Similarly vector B=(b1, . . . , b4096) may be divided into 64 portions each with 64 elements. In other words, vector B=(B1, B2, . . . , B64) where B1=(b1, . . . , b64), B2=(b65, . . . , b128), B3=(b129, . . . , b192), etc. Similarly other vectors of the input data 299 are divided into smaller portions. For example, vector C=(C1, C2, . . . , C64) where C1=(c1, . . . , c64), C2=(c65, . . . , c128), C3=(c129, . . . , c192), etc., vector D=(D1, D2, . . . , D64) where D1=(d1, . . . , d64), D2=(d65, . . . , d128), D3=(d129, . . . , d192), etc., vector E=(E1, E2, . . . , E64) where E1=(e1, . . . , e64), E2=(e65, . . . , e128), E3=(e129, . . . , e192), etc., vector F=(F1, F2, . . . , F64) where F1=(f1, . . . , f64), F2=(f65, . . . , f128), F3=(f129, . . . , f192), etc., vector G=(G1, G2, . . . , G64) where G1=(g1, . . . , g64), G2=(g65, . . . , g128), G3=(g129, . . . , g192), etc., vector H=(H1, H2, . . . , H64) where H1=(h1, . . . , h64), H2=(h65, . . . , h128), H3=(h129, . . . , h192), etc., vector I=(I1, I2, . . . , I64) where I1=(i1, . . . , i64), I2=(i65, . . . , i128), I3=(i129, . . . , i192), etc., and vector J=(J1, J2, . . . , J64) where J1=(j1, . . . , j64), J2=(j65, . . . , j128), J3=(j129, . . . , j192), etc. As such, the input data 299 is divided and forms the data input associated with each processing tile. For example, data 201A, 202A, 203A, 204A, . . . , 264A, are formed for processing by their respective processing tiles 201, . . . , 264.
It is appreciated that data 201A may comprise the first 64 elements from each image. In other words, data 201A=(A1, B1, C1, D1, . . . , J1), which is transmitted to processing tile 201 for processing. Data 202A may comprise the second set of 64 elements from each image, i.e., data 202A=(A2, B2, C2, D2, . . . , J2), which is transmitted to processing tile 202 for processing. It is appreciated that data 203A is the third set of 64 elements and is formed similar to above and transmitted to the third processing tile 203. It is appreciated that other data inputs are similarly formed and transmitted to their respective processing tiles, i.e., data 204A is formed and transmitted to processing tile 204, . . . , and data 264A is formed and transmitted to processing tile 264.
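A rough software model of this partitioning, assuming the (4096, 10) input is held as 10 lists of 4096 elements (names such as input_299 and tiles are illustrative, not a hardware API), might be:

```python
NUM_TILES = 64
ELEMS_PER_TILE = 64

def partition(input_299):
    """tiles[t] holds elements [t*64, (t+1)*64) of each of the 10 image vectors."""
    tiles = []
    for t in range(NUM_TILES):
        start, end = t * ELEMS_PER_TILE, (t + 1) * ELEMS_PER_TILE
        tiles.append([image[start:end] for image in input_299])
    return tiles   # tiles[0] corresponds to data 201A, tiles[1] to data 202A, etc.
```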
Accordingly, the first 64 elements of the 10 images are processed by tile 201, while the second 64 elements of the 10 images are processed by tile 202, the third 64 elements of the 10 images are processed by tile 203, and the other portions of the images are processed by the other processing tiles simultaneously. In other words, processing tile 201 processes data 201A concurrently with processing tile 202 processing data 202A, processing tile 203 processing data 203A, and so forth, through processing tile 264 processing data 264A.
It is appreciated that the input data 299 is divided based on the architecture of the ML hardware 100 and its capabilities, e.g., ability to process 64 elements with depth of 10 simultaneously, as well as the size of the data being received. As such, the input data 299 may have been divided differently for a different ML hardware architecture with different capabilities.
Once each processing tile receives its respective data for processing, the SoftMax operation may be performed. First, each processing tile performs an operation to find a maximum value for a vector associated with each image. For example, each processing tile may perform an ArgMax operation, TopK operation, etc., on each vector associated with each image. For example, the processing tile 201 performs operations to find the maximum value (say a11) for entries within A1, maximum value (say b11) for entries within B1, maximum value (say c11) for entries within C1, maximum value (say d11) for entries within D1, maximum value (say e11) for entries within E1, maximum value (say f11) for entries within F1, maximum value (say g11) for entries within G1, maximum value (say h11) for entries within H1, maximum value (say i11) for entries within I1, and maximum value (say j11) for entries within J1. The processing tile 201 therefore may form a vector with 10 elements, each element representing the maximum value of the data portion of its respective image. For example, the processing tile 201 may form vector (a11, b11, c11, d11, e11, f11, g11, h11, i11, j11). Simultaneously, other processing tiles, i.e., processing tiles 202-264, find the maximum values for their respective input data similar to processing tile 201. Accordingly, each processing tile in this nonlimiting example finds 10 maximum values, one for each portion of the image that is received. For example, processing tile 202 forms vector (a21, b21, c21, d21, e21, f21, g21, h21, i21, j21) where a21 is the maximum value within A2, b21 is the maximum value within B2, c21 is the maximum value within C2, d21 is the maximum value within D2, e21 is the maximum value within E2, f21 is the maximum value within F2, g21 is the maximum value within G2, h21 is the maximum value within H2, i21 is the maximum value within I2, and j21 is the maximum value within J2. Similarly, processing tiles 203-264 form their respective vectors for the maximum values, one for each portion of the image that is received. It is appreciated that operations to find the maximum values, as described above, occur simultaneously between processing tiles 201-264, thereby improving the processing speed.
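In software terms, the local step performed by each tile could be modeled as follows, where tile_data is the tile's 10 portions of 64 elements from the partitioning sketch above:

```python
def local_maxima(tile_data):
    """One local maximum per image, e.g., (a11, b11, ..., j11) for tile 201."""
    return [max(portion) for portion in tile_data]
```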
It is appreciated that since data associated with one image is scattered over multiple processing tiles (in this example 64 processing tiles), the maximum values found are local maximum values and not global. In order to find the global maximum value for each image, the local maximums that have been found by each processing tile is communicated to other processing tiles in order for each processing tile to find its global maximum value for each image. In other words, the processing tiles 201-264 may exchange their maximum values with other processing tiles such that each processing tile can find the global maximum for each image via, e.g., an all2all operation among the processing tiles. For example, processing tile 201 may send the vector (a11, b11, c11, d11, e11, f11, g11, h11, i11, j11) to other processing tiles while the processing tile 202 may send the vector (a21, b21, c21, d21, e21, f21, g21, h21, i21, j21) to other processing tiles, etc. Accordingly, each processing tile can independently process and find its global maximum for each image. Alternatively, the maximum values may be transmitted to one processing tile for processing and the result may be communicated to processing tiles 201-264.
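A simple model of this exchange and reduction, with the all2all exchange simulated as a list holding every tile's 10 local maxima, might be:

```python
def global_maxima(all_local_maxima):
    """all_local_maxima[tile][image] -> the global maximum for each image."""
    num_images = len(all_local_maxima[0])
    return [max(tile_maxima[img] for tile_maxima in all_local_maxima)
            for img in range(num_images)]
```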
Accordingly, the maximum value for vector A may be found to be amax, for vector B may be found to be bmax, for vector C may be found to be cmax, for vector D may be found to be dmax, for vector E may be found to be emax, for vector F may be found to be fmax, for vector G may be found to be gmax, for vector H may be found to be hmax, for vector I may be found to be imax, and for vector J may be found to be jmax.
Once the global maximums for each image are found, each processing tile performs an elementwise subtraction of the respective maximum value. For example, processing tile 201 may subtract amax from elements within A1 while it may subtract bmax from elements within B1, etc. For example, (a1−amax), (a2−amax), . . . , (a64−amax), (b1−bmax), (b2−bmax), . . . , (b64−bmax), (c1−cmax), (c2−cmax), . . . , (c64−cmax), (d1−dmax), (d2−dmax), . . . , (d64−dmax), (e1−emax), (e2−emax), . . . , (e64−emax), (f1−fmax), (f2−fmax), . . . , (f64−fmax), (g1−gmax), (g2−gmax), . . . , (g64−gmax), (h1−hmax), (h2−hmax), . . . , (h64−hmax), (i1−imax), (i2−imax), . . . , (i64−imax), (j1−jmax), (j2−jmax), . . . , (j64−jmax). Similarly, other processing tiles 202-264 subtract the maximum values from each element. For example, processing tile 202 may subtract amax from elements within A2 while it may subtract bmax from elements within B2, etc., processing tile 203 may subtract amax from elements within A3 while it may subtract bmax from elements within B3, etc. As such, subtraction of the maximum values occurs simultaneously between processing tiles 201-264 in an elementwise fashion (i.e., the respective global maximum of each image is subtracted from each element of the vector).
Once the subtraction of the global maximums, as described above, is complete, the processing tiles 201-264 perform an exponential operation on each element (i.e., forming a value that is between 0-1 inclusive) to form an exponential element. For example, processing tile 201 may perform exp(a1−amax), exp(a2−amax), . . . , exp(a64−amax), exp(b1−bmax), exp(b2−bmax), . . . , exp(b64−bmax), exp(c1−cmax), exp(c2−cmax), . . . , exp(c64−cmax), exp(d1−dmax), exp(d2−dmax), . . . , exp(d64−dmax), exp(e1−emax), exp(e2−emax), . . . , exp(e64−emax), exp(f1−fmax), exp(f2−fmax), . . . , exp(f64−fmax), exp(g1−gmax), exp(g2−gmax), . . . , exp(g64−gmax), exp(h1−hmax), exp(h2−hmax), . . . , exp(h64−hmax), exp(i1−imax), exp(i2−imax), . . . , exp(i64−imax), exp(j1−jmax), exp(j2−jmax), . . . , exp(j64−jmax). It is appreciated that each processing tile may also perform a summation for each vector to form a partial sum of an image. For example, the processing tile 201 may form ten partial sums, one for each image. In some embodiments, a separate register may be used for storing the partial sum value. It is appreciated that accumulation of the exponential values in the separate register makes the partial sum available without a need to perform an additional summation later. The partial sum for A1 may be represented below:
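Using the notation above, the partial sum for A1 would take the form exp(a1−amax)+exp(a2−amax)+ . . . +exp(a64−amax), i.e., the sum of the exponential elements of A1.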
Similarly, partial sums for B1, C1, D1, E1, F1, G1, H1, I1, and J1 are formed in the same manner, e.g., the partial sum for B1 is exp(b1−bmax)+exp(b2−bmax)+ . . . +exp(b64−bmax).
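The per-tile work of subtracting the global maxima, exponentiating, and accumulating a running partial sum per image (mirroring the separate accumulation register mentioned above) could be sketched, purely as a software model, as:

```python
import math

def exp_and_partial_sums(tile_data, maxima):
    """tile_data[image] is this tile's 64-element portion; maxima[image] is the global maximum."""
    exp_data, partial_sums = [], []
    for portion, m in zip(tile_data, maxima):
        exps = [math.exp(x - m) for x in portion]   # elementwise subtract, then exponentiate
        exp_data.append(exps)
        partial_sums.append(sum(exps))              # running partial sum for this image
    return exp_data, partial_sums
```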
It is appreciated that other processing tiles may similarly form their respective partial sums for each image. In some embodiments, the partial sums that are created by each processing tile are transmitted to the other processing tiles such that a complete sum value for each image can be formed via, e.g., an all2all operation among the processing tiles. For example, the complete sum for vector A associated with the first image may be formed by adding the 64 partial sums for A1 through A64, i.e., exp(a1−amax)+exp(a2−amax)+ . . . +exp(a4096−amax).
Similarly, complete sums for vectors B, C, D, E, F, G, H, I, and J associated with the second through tenth images can be formed in the same manner.
It is appreciated that in some embodiments, instead of sending the partial sums to every processing tile, the partial sums may be transmitted to one processing tile to calculate a global sum, as described above. The global sum associated with each image may then be transmitted to each processing tile.
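Whichever variant is used (an all2all exchange or a single designated tile), the reduction of the exchanged partial sums into a global sum per image can be modeled, again purely as a software sketch, as:

```python
def global_sums(partial_sums_by_tile):
    """partial_sums_by_tile[tile][image] -> the global sum for each image."""
    num_images = len(partial_sums_by_tile[0])
    return [sum(tile_sums[img] for tile_sums in partial_sums_by_tile)
            for img in range(num_images)]
```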
It is appreciated that once the global sum for each image is formed, each processing tile may compute 1/global sum for each image. In other words, each processing tile, e.g., processing tiles 201-264, computes 1/global sum for each image, i.e., for vectors A, B, C, . . . , J, to form a scaled value. For example, 1/Global Sum for A, 1/Global Sum for B, 1/Global Sum for C, . . . , 1/Global Sum for J are formed by each processing tile.
Once the scaled value is formed, each processing tile may perform an elementwise operation by calculating the multiplication of the scaled value by each exponential element. For example, processing tile 201 may multiply each exponential element for vector A, e.g., exp(a1−amax), exp(a2−amax), . . . , exp(a64−amax), by 1/Global Sum for A, multiply each exponential element for vector B, e.g., exp(b1−bmax), exp(b2−bmax), . . . , exp(b64−bmax), by 1/Global Sum for B, etc. Similarly, other processing tiles perform a multiplication of the scaled values in an elementwise fashion. For example, the processing tile 202 may multiply each exponential element for vector A, e.g., exp(a65−amax), exp(a66−amax), . . . , exp(a128−amax), by 1/Global Sum for A, multiply each exponential element for vector B, e.g., exp(b65−bmax), exp(b66−bmax), . . . , exp(b128−bmax), by 1/Global Sum for B, etc. Other processing tiles also perform a similar calculation. It is appreciated that the processing tiles may perform the calculation simultaneously, thereby improving the processing speed.
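The final per-tile step, forming the scaled value 1/Global Sum for each image and multiplying it into each locally held exponential element, could be sketched as:

```python
def normalize(exp_data, sums):
    """exp_data[image] holds this tile's exponential elements; sums[image] is the global sum."""
    scaled = [1.0 / s for s in sums]                  # 1/Global Sum per image
    return [[e * scale for e in exps]                 # elementwise multiplication
            for exps, scale in zip(exp_data, scaled)]
```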
Accordingly, the processing tiles 201-264 perform a SoftMax operation on the elements of the input data 299. It is appreciated that leveraging the architecture of the ML hardware, as described above, enables data to be processed (e.g., a SoftMax operation) in a much faster fashion in comparison to the sequential processing of a single processing element, thereby improving the processing speed. Leveraging multiple processing tiles addresses inadequacies associated with data movement between local memory, e.g., SRAM, and external memory such as DDR because a large data set is broken down into smaller data sets, which can be processed by each processing tile locally without a need to access the external memory once the data is stored locally. Moreover, it is appreciated that intelligently exchanging local data to derive the global data also reduces latencies associated with unnecessarily sharing data.
It is appreciated that the embodiments are described with respect to a SoftMax operation for illustrative purposes and should not be construed as limiting the scope of the embodiments. For example, any operation may be performed by the ML hardware leveraging its parallel processing capabilities. As one example, large data may be received by the ML hardware. The large data may be divided into smaller portions and each portion may be sent to one processing tile among a plurality of processing tiles for processing. Each processing tile performs a series of operations to calculate local values associated with each processing tile. The local values may be exchanged among the plurality of processing tiles to calculate global value(s). The global value(s) may be used by each processing tile to perform additional operations on its respective received data. The process may be repeated until the final results are computed.
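As a hypothetical outline only, this general pattern (divide, compute local values, exchange to form global values, continue locally) could be captured as:

```python
def distributed_op(data, split, local_step, combine, finish, num_tiles=64):
    tiles = split(data, num_tiles)                   # divide the large data into smaller portions
    local_values = [local_step(t) for t in tiles]    # per-tile local computation
    global_value = combine(local_values)             # exchange/reduce local values into global value(s)
    return [finish(t, global_value) for t in tiles]  # each tile continues with the global value(s)
```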
Another example of implementing a SoftMax on an ML hardware is described in
Accordingly, unlike in
Similar to
It is appreciated that the processing tiles may be within an inference engine of an ML hardware and may perform local and/or global operations simultaneously, as described above.
It is appreciated that various methodologies may be utilized in finding the maximum value of a vector. Below is a description of finding the maximum value according to some embodiments that were described in U.S. patent application Ser. No. 17/511,111, filed on Oct. 26, 2021, which is incorporated herein by reference in its entirety.
The foregoing description of various embodiments of the claimed subject matter has been provided for the purposes of illustration and description. It is not intended to be exhaustive or to limit the claimed subject matter to the precise forms disclosed. Many modifications and variations will be apparent to the practitioner skilled in the art. Embodiments were chosen and described in order to best describe the principles of the invention and its practical application, thereby enabling others skilled in the relevant art to understand the claimed subject matter, the various embodiments and the various modifications that are suited to the particular use contemplated.
This application is a nonprovisional application of and claims the benefit of and priority to U.S. Provisional Patent Application No. 63/282,557, filed on Nov. 23, 2021, which is incorporated herein by reference in its entirety. This application is also a continuation-in-part application of and claims the benefit of and priority to U.S. patent application Ser. No. 17/511,111, filed on Oct. 26, 2021. U.S. patent application Ser. No. 17/511,111 claims the benefit of and priority to U.S. Provisional Patent Application No. 63/105,861, filed Oct. 26, 2020, and entitled "METHOD AND APPARATUS FOR PERFORMING ARGMAX OPERATIONS IN PARALLEL ON MACHINE LEARNING HARDWARE," which is incorporated herein by reference in its entirety. This application is also a continuation-in-part application of and claims the benefit of and priority to U.S. patent application Ser. No. 17/248,045, filed Jan. 6, 2021, entitled "INSTRUCTION SET ARCHITECTURE (ISA) FORMAT FOR MULTIPLE INSTRUCTION SET ARCHITECTURES IN MACHINE LEARNING INFERENCE ENGINE," which is a continuation application of U.S. patent application Ser. No. 16/226,508, filed Dec. 19, 2018, which claims the benefit of and priority to U.S. Provisional Patent Application No. 62/628,130, filed on Feb. 8, 2018, U.S. Provisional Patent Application No. 62/644,352, filed on Mar. 16, 2018, and U.S. Provisional Patent Application No. 62/675,076, filed on May 22, 2018, all of which are incorporated herein by reference in their entirety.
Number | Date | Country
---|---|---
63/282,557 | Nov. 23, 2021 | US
63/105,861 | Oct. 26, 2020 | US
62/675,076 | May 22, 2018 | US
62/644,352 | Mar. 16, 2018 | US
62/628,130 | Feb. 8, 2018 | US
Relation | Number | Date | Country
---|---|---|---
Parent | 16/226,508 | Dec. 2018 | US
Child | 17/511,111 | | US
Parent | 17/511,111 | Oct. 2021 | US
Child | 17/590,994 | | US
Parent | 17/248,045 | Jan. 2021 | US
Child | 17/590,994 | | US