Methods and apparatus for performing matrix transformations within a memory array

Description

RELATED APPLICATIONS

This application is related to U.S. patent application Ser. No. 16/002,644 filed Jun. 7, 2018 and entitled “AN IMAGE PROCESSOR FORMED IN AN ARRAY OF MEMORY CELLS”, U.S. patent application Ser. No. 16/211,029, filed Dec. 5, 2018, and entitled “METHODS AND APPARATUS FOR INCENTIVIZING PARTICIPATION IN FOG NETWORKS”, Ser. No. 16/242,960, filed Jan. 8, 2019, and entitled “METHODS AND APPARATUS FOR ROUTINE BASED FOG NETWORKING”, Ser. No. 16/276,461, filed on Feb. 14, 2019, and entitled “METHODS AND APPARATUS FOR CHARACTERIZING MEMORY DEVICES”, Ser. No. 16/276,471, filed on Feb. 14, 2019, and entitled “METHODS AND APPARATUS FOR CHECKING THE RESULTS OF CHARACTERIZED MEMORY SEARCHES”, and Ser. No. 16/276,489, filed on Feb. 14, 2019, and entitled “METHODS AND APPARATUS FOR MAINTAINING CHARACTERIZED MEMORY DEVICES”, each of the foregoing incorporated herein by reference in its entirety.

A portion of the disclosure of this patent document contains material that is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the patent document or the patent disclosure, as it appears in the Patent and Trademark Office patent files or records, but otherwise reserves all copyright rights whatsoever.

BACKGROUND
1. Technological Field

The following relates generally to the field of data processing and device architectures. Specifically, a processor-memory architecture that converts a memory array into a matrix fabric for matrix transformations and performing matrix operations therein is disclosed.

2. Description of Related Technology

Memory devices are widely used to store information in various electronic devices such as computers, wireless communication devices, cameras, digital displays, and the like. Information is stored by programming different states of a memory device. For example, binary devices have two states, often denoted by a logical “1” or a logical “0.” To access the stored information, the memory device may read (or sense) the stored state in the memory device. To store information, the memory device may write (or program) the state in the memory device. So-called volatile memory devices may require power to maintain this stored information, while non-volatile memory devices may persistently store information even after the memory device itself has, for example, been power cycled. Different memory fabrication methods and constructions enable different capabilities. For example, dynamic random access memory (DRAM) offers high density volatile storage inexpensively. Incipient research is directed to resistive random access memory (ReRAM) which promises non-volatile performance similar to DRAM.

Processor devices are commonly used in conjunction with memory devices to perform a myriad of different tasks and functionality. During operation, a processor executes computer readable instructions (commonly referred to as “software”) from memory. The computer readable instructions define basic arithmetic, logic, controlling, input/output (I/O) operations, etc. As is well known in the computing arts, relatively basic computer readable instructions can perform a variety of complex behaviors when sequentially combined. Processors tend to emphasize circuit constructions and fabrication technologies that differ from memory devices. For example, processing performance is generally related to clock rates, thus most processor fabrication methods and constructions emphasize very high rate transistor switching structures, etc.

Over time, both processors and memory have increased in speed and power consumption. Typically, these improvements are a result of shrinking device sizes because electrical signaling is physically limited by the dielectric of the transmission medium and distance. As previously alluded to, most processors and memories are manufactured with different fabrication materials and techniques. Consequently, even though processors and memory continue to improve, the physical interface between processors and memories is a “bottleneck” to the overall system performance. More directly, no matter how fast a processor or memory can work in isolation, the combined system of processor and memory is performance limited to the rate of transfer allowed by the interface. This phenomenon has several common names e.g., the “processor-memory wall”, the “von Neumann Bottleneck Effect”, etc.

SUMMARY

The present disclosure provides, inter alia, methods and apparatus for converting a memory array into a matrix fabric for matrix transformations and performing matrix operations therein.

In one aspect of the present disclosure, a non-transitory computer readable medium is disclosed. In one exemplary embodiment, the non-transitory computer readable medium includes: an array of memory cells, where each memory cell of the array of memory cells is configured to store a digital value as an analog value in an analog medium; a memory sense component, where the memory sense component is configured to read the analog value of a first memory cell as a first digital value; and logic. In one exemplary embodiment, the logic is further configured to: receive a matrix transformation opcode; operate the array of memory cells as a matrix multiplication unit (MMU) based on the matrix transformation opcode; wherein each memory cell of the MMU modifies the analog value in the analog medium in accordance with the matrix transformation opcode and a matrix transformation operand; configure the memory sense component to convert the analog value of the first memory cell into a second digital value in accordance with the matrix transformation opcode and the matrix transformation operand; and responsive to reading the matrix transformation operand into the MMU, write a matrix transformation result based on the second digital value.

In one variant, the matrix transformation opcode indicates a size of the MMU. In one such variant, the matrix transformation opcode corresponds to a frequency domain transform operation. In one exemplary variant, the frequency domain transform operation spans at least one other MMU.

In one variant, the matrix transformation opcode identifies one or more analog values corresponding to one or more memory cells. In one such variant, the one or more analog values corresponding to the one or more memory cells are stored within a look-up-table (LUT) data structure.

In one variant, each memory cell of the MMU comprises resistive random access memory (ReRAM) cells; and each memory cell of the MMU multiplies the analog value in the analog medium in accordance with the matrix transformation opcode and the matrix transformation operand.

In one variant, each memory cell of the MMU further accumulates the analog value in the analog medium with a previous analog value.

In one variant, the first digital value is characterized by a first radix of two (2); and the second digital value is characterized by a second radix greater than two (2).

In one aspect of the present disclosure, a device is disclosed. In one embodiment, the device includes a processor coupled to a non-transitory computer readable medium; where the non-transitory computer readable medium includes one or more instructions which, when executed by the processor, cause the processor to: write a matrix transformation opcode and a matrix transformation operand to the non-transitory computer readable medium; wherein the matrix transformation opcode causes the non-transitory computer readable medium to operate an array of memory cells as a matrix structure; wherein the matrix transformation operand modifies one or more analog values of the matrix structure; and read a matrix transformation result from the matrix structure.

In one variant, the non-transitory computer readable medium further comprises one or more instructions which, when executed by the processor, cause the processor to: capture image data comprising one or more captured color values; and wherein the matrix transformation operand comprises the one or more captured color values and the matrix transformation result comprises one or more shifted color values.

In one variant, the non-transitory computer readable medium further comprises one or more instructions which, when executed by the processor, cause the processor to: receive video data comprising one or more image blocks; wherein the matrix transformation operand comprises the one or more image blocks and the matrix transformation result comprises one or more frequency domain image coefficients; and wherein the one or more analog values of the matrix structure accumulate the one or more frequency domain image coefficients from video data over time.

In one variant, the matrix transformation opcode causes the non-transitory computer readable medium to operate another array of memory cells as another matrix structure; and the matrix transformation result associated with the matrix structure and another matrix transformation result associated with another matrix structure are logically combined.

In one variant, the one or more analog values of the matrix structure are stored within a look-up-table (LUT) data structure.

In one aspect of the present disclosure, a method to perform transformation matrix operations is disclosed. In one embodiment, the method includes: receiving a matrix transformation opcode; configuring an array of memory cells of a memory into a matrix structure, based on the matrix transformation opcode; configuring a memory sense component based on the matrix transformation opcode; and responsive to reading a matrix transformation operand into the matrix structure, writing a matrix transformation result from the memory sense component.

In one variant, configuring the array of memory cells includes connecting a plurality of word lines and a plurality of bit lines corresponding to a row dimension and a column dimension associated with the matrix structure.

In one variant, the method also includes determining the row dimension and the column dimension from the matrix transformation opcode.

In one variant, configuring the array of memory cells includes setting one or more analog values of the matrix structure based on a look-up-table (LUT) data structure.

In one variant, the method includes identifying an entry from the LUT data structure based on the matrix transformation opcode.

In one variant, configuring the memory sense component enables matrix transformation results having a radix greater than two (2).

In one aspect, an apparatus configured to configure a memory device into a matrix fabric is disclosed. In one embodiment, the apparatus includes: a memory; a processor configured to access the memory; pre-processor logic configured to allocate one or more memory portions for use as a matrix fabric.

In another aspect of the disclosure, a computerized image processing device apparatus configured to dynamically configure a memory into a matrix fabric is disclosed. In one embodiment, the computerized image processing device includes: a camera interface; digital processor apparatus in data communication with the camera interface; and a memory in data communication with the digital processor apparatus and including at least one computer program.

In another aspect of the disclosure, a computerized video processing device apparatus configured to dynamically configure a memory into a matrix fabric is disclosed. In one embodiment, the computerized video processing device includes: a camera interface; digital processor apparatus in data communication with the camera interface; and a memory in data communication with the digital processor apparatus and including at least one computer program.

In another aspect of the disclosure, a computerized wireless access node apparatus configured to dynamically configure a memory into a matrix fabric is disclosed. In one embodiment, the computerized wireless access node includes: a wireless interface configured to transmit and receive RF waveforms in the spectrum portion; digital processor apparatus in data communication with the wireless interface; and a memory in data communication with the digital processor apparatus and including at least one computer program.

In an additional aspect of the disclosure, computer readable apparatus is described. In one embodiment, the apparatus includes a storage medium configured to store one or more computer programs within or in conjunction with characterized memory. In one embodiment, the apparatus includes a program memory or HDD or SSD on a computerized controller device. In another embodiment, the apparatus includes a program memory, HDD or SSD on a computerized access node.

These and other aspects shall become apparent when considered in light of the disclosure provided herein.

In another aspect of the present disclosure, a computerized apparatus is disclosed. In one embodiment, the computerized apparatus includes: control logic configured to, when operated: receive one or more instructions for an input transformation from a processor apparatus; cause configuration of a plurality of memory elements of a memory as a matrix multiplication unit (MMU) based at least on the received one or more instructions; cause a memory sense component to convert an analog value associated with a memory element of the MMU into a digital value based at least on the received one or more instructions; and based at least on the converted digital value, obtain an output result from the memory sense component.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1A is a diagram of processor-memory architecture and a graphical depiction of an associated matrix operation.

FIG. 1B is a diagram of processor-PIM architecture and a graphical depiction of an associated matrix operation.

FIG. 2 is a logical block diagram of one exemplary implementation of a memory device in accordance with various principles of the present disclosure.

FIG. 3 is an exemplary side-by-side illustration of a first memory device configuration and a second memory device configuration.

FIG. 4 is a graphical depiction of a matrix operation performed in accordance with the principles of the present disclosure.

FIG. 5A is a logical block diagram of one exemplary implementation of processor-memory architecture.

FIG. 5B is a logical flow diagram of one exemplary set of matrix operations, performed in accordance with the principles of the present disclosure.

FIG. 5C is an alternate logical flow diagram of one exemplary set of matrix operations, performed in accordance with the principles of the present disclosure.

FIG. 6 is a block diagram of one exemplary method of converting a memory array into a matrix fabric and performing matrix operations therein.

DETAILED DESCRIPTION

Reference is now made to the drawings wherein like numerals refer to like parts throughout.

As used herein, the term “application” (or “app”) refers generally and without limitation to a unit of executable software that implements a certain functionality or theme. The themes of applications vary broadly across any number of disciplines and functions (such as on-demand content management, e-commerce transactions, brokerage transactions, home entertainment, calculator etc.), and one application may have more than one theme. The unit of executable software generally runs in a predetermined environment; for example, the unit could include a downloadable application that runs within an operating system environment.

As used herein, the term “computer program” or “software” is meant to include any sequence of human or machine cognizable steps which perform a function. Such program may be rendered in virtually any programming language or environment including, for example, C/C++, Fortran, COBOL, PASCAL, assembly language, markup languages (e.g., HTML, SGML, XML, VoXML), and the like, as well as object-oriented environments such as the Common Object Request Broker Architecture (CORBA), Java™ (including J2ME, Java Beans, etc.), Register Transfer Language (RTL), VHSIC (Very High Speed Integrated Circuit) Hardware Description Language (VHDL), Verilog, and the like.

As used herein, the term “decentralized” or “distributed” refers without limitation to a configuration or network architecture involving multiple computerized devices that are able to perform data communication with one another, rather than requiring a given device to communicate through a designated (e.g., central) network entity, such as a server device. For example, a decentralized network enables direct peer-to-peer data communication among multiple UEs (e.g., wireless user devices) making up the network.

As used herein, the term “distributed unit” (DU) refers without limitation to a distributed logical node within a wireless network infrastructure. For example, a DU might be embodied as a next-generation Node B (gNB) DU (gNB-DU) that is controlled by a gNB CU described above. One gNB-DU may support one or multiple cells; a given cell is supported by only one gNB-DU.

As used herein, the terms “Internet” and “internet” are used interchangeably to refer to inter-networks including, without limitation, the Internet. Other common examples include but are not limited to: a network of external servers, “cloud” entities (such as memory or storage not local to a device, storage generally accessible at any time via a network connection, and the like), service nodes, access points, controller devices, client devices, etc. 5G-servicing core networks and network components (e.g., DU, CU, gNB, small cells or femto cells, 5G-capable external nodes) residing in the backhaul, fronthaul, crosshaul, or an “edge” thereof proximate to residences, businesses and other occupied areas may be included in “the Internet.”

As used herein, the term “memory” includes any type of integrated circuit or other storage device adapted for storing digital data including, without limitation, random access memory (RAM), pseudostatic RAM (PSRAM), dynamic RAM (DRAM), synchronous dynamic RAM (SDRAM) including double data rate (DDR) class memory and graphics DDR (GDDR) and variants thereof, ferroelectric RAM (FeRAM), magnetic RAM (MRAM), resistive RAM (ReRAM), read-only memory (ROM), programmable ROM (PROM), electrically erasable PROM (EEPROM or EPROM), DDR/2 SDRAM, EDO/FPMS, reduced-latency DRAM (RLDRAM), static RAM (SRAM), “flash” memory (e.g., NAND/NOR), phase change memory (PCM), 3-dimensional cross-point memory (3D Xpoint), and magnetoresistive RAM (MRAM), such as spin torque transfer RAM (STT RAM).

As used herein, the terms “microprocessor” and “processor” or “digital processor” are meant generally to include all types of digital processing devices including, without limitation, digital signal processors (DSPs), reduced instruction set computers (RISC), general-purpose processors (GPP), microprocessors, gate arrays (e.g., FPGAs), PLDs, reconfigurable computer fabrics (RCFs), array processors, secure microprocessors, and application-specific integrated circuits (ASICs). Such digital processors may be contained on a single unitary IC die, or distributed across multiple components.

As used herein, the term “server” refers to any computerized component, system or entity regardless of form which is adapted to provide data, files, applications, content, or other services to one or more other devices or entities on a computer network.

As used herein, the term “storage” refers without limitation to computer hard drives (e.g., hard disk drives (HDD), solid state drives (SSD)), Flash drives, DVR device, memory, RAID devices or arrays, optical media (e.g., CD-ROMs, Laserdiscs, Blu-Ray, etc.), or any other devices or media capable of storing content or other information, including semiconductor devices (e.g., those described herein as memory) capable of maintaining data in the absence of a power source. Common examples of memory devices that are used for storage include, without limitation: ReRAM, DRAM (e.g., SDRAM, DDR SDRAM, DDR2 SDRAM, DDR3 SDRAM, DDR4 SDRAM, GDDR, RLDRAM, LPDRAM, etc.), DRAM modules (e.g., RDIMM, VLP RDIMM, UDIMM, VLP UDIMM, SODIMM, SORDIMM, Mini-DIMM, VLP Mini-DIMM, LRDIMM, NVDIMM, etc.), managed NAND, NAND Flash (e.g., SLC NAND, MLC NAND, TLC NAND, Serial NAND, 3D NAND, etc.), NOR Flash (e.g., Parallel NOR, Serial NOR, etc.), multichip packages, hybrid memory cube, memory cards, solid state storage (SSS), and any number of other memory devices.

As used herein, the term “Wi-Fi” refers to, without limitation and as applicable, any of the variants of IEEE Std. 802.11 or related standards including 802.11 a/b/g/n/s/v/ac or 802.11-2012/2013, 802.11-2016, as well as Wi-Fi Direct (including inter alia, the “Wi-Fi Peer-to-Peer (P2P) Specification”, incorporated herein by reference in its entirety).

As used herein, the term “wireless” means any wireless signal, data, communication, or other interface including without limitation Wi-Fi, Bluetooth/BLE, 3G (3GPP/3GPP2), HSDPA/HSUPA, TDMA, CBRS, CDMA (e.g., IS-95A, WCDMA, etc.), FHSS, DSSS, GSM, PAN/802.15, WiMAX (802.16), 802.20, Zigbee®, Z-wave, narrowband/FDMA, OFDM, PCS/DCS, LTE/LTE-A/LTE-U/LTE-LAA, analog cellular, CDPD, satellite systems, millimeter wave or microwave systems, acoustic, and infrared (i.e., IrDA).

Overview

The aforementioned “processor-memory wall” performance limitations can be egregious where a processor-memory architecture repeats similar operations over a large data set. Under such circumstances, the processor-memory architecture has to individually transfer, manipulate, and store each element of the data set, iteratively. For example, a matrix multiplication of 4×4 (sixteen (16) elements) takes four (4) times as long as a matrix multiplication of 2×2 (four (4) elements). In other words, the cost of iterative matrix operations scales with the number of matrix elements, i.e., quadratically with the matrix dimension.

Various embodiments of the present disclosure are directed to converting a memory array into a matrix fabric for matrix transformations and performing matrix operations therein. Matrix transformations are commonly used in many different applications and can take a disproportionate amount of processing and/or memory bandwidth. For example, many image signal processing (ISP) techniques commonly use matrix transformations for e.g., color interpolation, white balance, color correction, color conversion, etc. Video compression uses e.g., the discrete cosine transform (DCT) to identify video image data that can be removed with minimum fidelity loss. Many communication technologies employ fast Fourier transforms (FFTs) and matrix multiplication for beamforming and/or massive multiple input multiple output (MIMO) channel processing.

Exemplary embodiments described herein perform matrix transformations within a memory device that includes a matrix fabric and matrix multiplication unit (MMU). In one exemplary embodiment, the matrix fabric uses a “crossbar” construction of resistive elements. Each resistive element stores a level of impedance that represents the corresponding matrix coefficient value. The crossbar connectivity can be driven with an electrical signal representing the input vector as an analog voltage. The resulting signals can be converted from analog voltages to digital values by an MMU to yield a vector-matrix product. In some cases, the MMU may additionally perform various other logical operations within the digital domain.
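The crossbar behavior described above can be sketched numerically. The following is a minimal idealized model, not the disclosed circuit: it assumes each resistive element is represented by a conductance, that each output line sums the contributions of all inputs, and that the sense step is a uniform quantizer. The names crossbar_read and sense, the full-scale value, and the number of levels are all hypothetical.

```python
def crossbar_read(voltages, conductances):
    """Ideal crossbar: output j is the sum over inputs i of V[i] * G[i][j]."""
    rows = len(conductances)
    cols = len(conductances[0])
    return [
        sum(voltages[i] * conductances[i][j] for i in range(rows))
        for j in range(cols)
    ]


def sense(analog_values, full_scale, levels):
    """Assumed sense model: uniform quantizer with `levels` output codes."""
    step = full_scale / (levels - 1)
    return [min(levels - 1, max(0, round(v / step))) for v in analog_values]


# Matrix coefficients stored as conductances (arbitrary units, illustrative).
G = [[1.0, 2.0],
     [3.0, 4.0]]
# Input vector driven onto the rows as voltages (arbitrary units).
V = [1.0, 0.5]

# All output lines settle together; no per-element loop crosses an interface.
analog_out = crossbar_read(V, G)
# Convert the analog outputs to digital codes (0.5 units per code here).
codes = sense(analog_out, full_scale=8.0, levels=17)
```

In this model every output element is produced by the same read, which is the property the disclosure refers to when it contrasts the fabric with element-by-element iteration.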

Unlike existing solutions that iterate through each element of the matrix to calculate the element value, the crossbar matrix fabric described hereinafter computes multiple elements of the matrix “atomically” i.e., in a single processing cycle. For example, at least a portion of a vector-matrix product may be calculated in parallel. The “atomicity” of matrix fabric based computations yields significant processing improvements over iterative alternatives. In particular, while iterative techniques grow as a function of matrix size, atomic matrix fabric computations are independent of matrix dimensions. In other words, an N×N vector-matrix product can be completed in a single atomic instruction.

Various embodiments of the present disclosure internally derive and/or use matrix coefficient values to further minimize interface transactions. As described in greater detail herein, many useful matrix transformations may be characterized by “structurally defined dimensions” and performed with “structurally defined coefficients.” Structural definition refers to those aspects of a matrix computation that are defined for a specific matrix structure (e.g., the rank and/or size of the matrix); in other words, the matrix coefficients can be inferred from the matrix structure and need not be explicitly provided via the processor-memory interface. For example, as described in greater detail hereinafter, the various coefficients for mathematical transforms (such as “twiddle factors” for the fast Fourier transform (FFT)) are a function of the matrix size. Similarly, ISP filtering and/or massive MIMO channel coding techniques may use e.g., predefined matrixes and/or codebooks of matrixes having known structures and weighting.
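To illustrate how coefficients can follow from the matrix size alone, the sketch below builds a discrete Fourier transform matrix from nothing but its dimension N. It assumes the conventional definition W[r][c] = exp(-2πi·r·c/N), which the text above does not spell out; the function names dft_matrix and apply_matrix are hypothetical.

```python
import cmath


def dft_matrix(n):
    """Return the n x n DFT matrix; every entry is determined by n alone."""
    return [
        [cmath.exp(-2j * cmath.pi * row * col / n) for col in range(n)]
        for row in range(n)
    ]


def apply_matrix(matrix, vector):
    """Multiply a square matrix by a column vector."""
    return [
        sum(row[k] * vector[k] for k in range(len(vector)))
        for row in matrix
    ]


# Only the size (4) is supplied; no coefficient values are passed in.
W4 = dft_matrix(4)

# A unit impulse transforms to a flat spectrum of ones.
spectrum = apply_matrix(W4, [1, 0, 0, 0])
```

Because the caller supplies only the size, a device that derives coefficients this way would not need them transferred across the processor-memory interface.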

As a brief aside, practical limitations on component manufacture limit the capabilities of each element within an individual memory device. For example, most memory arrays are only designed to discern between two (2) states (logical “1”, logical “0”). While existing memory sense components may be extended to discern higher levels of precision (e.g., four (4) states, eight (8) states, etc.), increasing the precision of memory sense components may be impractical to support the precision required for large transforms typically used in e.g., video compression, mathematical transforms, etc.
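As a rough illustration of the precision trade-off, the following sketch quantizes one analog value with two (2), four (4), and eight (8) sense states and records the resulting error. It assumes a uniform quantizer over a normalized range of 0 to 1; the function name quantize and the sample value 0.6 are hypothetical.

```python
def quantize(value, levels, full_scale=1.0):
    """Map a value in [0, full_scale] to the nearest of `levels` states.

    Returns the digital code and the analog value that code represents.
    """
    step = full_scale / (levels - 1)
    code = min(levels - 1, max(0, round(value / step)))
    return code, code * step


analog = 0.6  # illustrative analog value on a normalized scale
results = []
for levels in (2, 4, 8):
    code, reconstructed = quantize(analog, levels)
    # Store (states, code, absolute error) for comparison.
    results.append((levels, code, abs(analog - reconstructed)))
```

For this sample value the error shrinks as states are added, but each added state also narrows the margin the sense component has to resolve, which is the practical limit noted above.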

To these ends, various embodiments of the present disclosure logically combine one or more matrix fabrics and/or MMUs to provide greater degrees of precision and/or processing sophistication than would otherwise be possible. In one such embodiment, a first matrix fabric and/or MMU may be used to calculate a positive vector-matrix product and a second matrix fabric and/or MMU may be used to calculate a negative vector-matrix product. The positive and negative vector-matrix products can be summed to determine the net vector-matrix product. In another such embodiment, multiple simple matrix transformations can be used to implement a larger matrix transformation. For example, the first stage of FFT processing for an FFT of size N may be decomposed into M FFTs of size N/M. Thus, the first stage of a 64-point FFT can be decomposed into thirty-two (32) 2-point FFTs, sixteen (16) 4-point FFTs, and/or eight (8) 8-point FFTs, depending on a variety of factors (e.g., precision, speed, cost, power consumption, etc.). Handling FFT butterfly transformations in matrix fabric can also be further sequenced or parallelized in accordance with any number of other design considerations. Other examples of logical matrix operations can be substituted with equivalent success (e.g., decomposition, common matrix multiplication, etc.) given the contents of the present disclosure.
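The positive/negative combination can be sketched as follows. This is one possible reading, stated as an assumption: a signed matrix is split into two non-negative matrices (one holding the positive coefficients, one holding the magnitudes of the negative coefficients), each is multiplied separately, the second product is given a negative sign, and the two are summed. The function names vec_mat and split_signed are hypothetical.

```python
def vec_mat(a, m):
    """Vector-matrix product b[j] = sum over i of a[i] * m[i][j]."""
    return [
        sum(a[i] * m[i][j] for i in range(len(a)))
        for j in range(len(m[0]))
    ]


def split_signed(m):
    """Split a signed matrix into two non-negative matrices (pos, neg)."""
    pos = [[max(x, 0) for x in row] for row in m]
    neg = [[max(-x, 0) for x in row] for row in m]
    return pos, neg


M = [[2, -1],
     [-3, 4]]
a = [1, 2]

pos, neg = split_signed(M)
b_pos = vec_mat(a, pos)                    # product from the first fabric
b_neg = [-x for x in vec_mat(a, neg)]      # second fabric, sign applied after
b_net = [p + n for p, n in zip(b_pos, b_neg)]
```

Here b_net matches the product computed directly on the signed matrix, while each individual fabric only ever holds non-negative coefficient values.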

Certain applications can save a significant amount of power by turning off system components when not in use. For example, video compression may benefit from “sleep” during video blanking intervals (when no video data is active), etc. However, the sleep procedure often requires a processor and/or memory to shuttle data from operational volatile memory to non-volatile storage memory such that the data is not lost while powered down. Wake-up procedures are also needed to retrieve the stored information from the non-volatile storage memory. Shuttling data back and forth between memories is an inefficient use of processor-memory bandwidth. Consequently, various embodiments disclosed herein leverage the “non-volatile” nature of the matrix fabric. In such embodiments, the matrix fabric can retain its matrix coefficient values even when the memory has no power. More directly, the non-volatile nature of the matrix fabric enables a processor and memory to transition into sleep/low power modes or to perform other tasks without shuffling data from volatile memory to non-volatile memory and vice versa.

Various other combinations and/or variants of the foregoing will be readily appreciated by artisans of ordinary skill, given the contents of the present disclosure.

Detailed Description of Exemplary Embodiments

Exemplary embodiments of the apparatus and methods of the present disclosure are now described in detail. While these exemplary embodiments are described in the context of the previously described processor and/or memory configurations, the general principles and advantages of the disclosure may be extended to other types of processor and/or memory technologies, the following therefore being merely exemplary in nature.

It will also be appreciated that while described generally in the context of a consumer device (within a camera device, video codec, cellular phone, and/or network base station), the present disclosure may be readily adapted to other types of devices including, e.g., server devices, Internet of Things (IoT) devices, and/or for personal, corporate, or even governmental uses, such as those outside the proscribed “incumbent” users such as U.S. DoD and the like. Yet other applications are possible.

Other features and advantages of the present disclosure will immediately be recognized by persons of ordinary skill in the art with reference to the attached drawings and detailed description of exemplary embodiments as given below.

Processor Memory Architectures

FIG. 1A illustrates one common processor-memory architecture 100 useful for illustrating matrix operations. As shown in FIG. 1A, a processor 102 is connected to a memory 104 via an interface 106. In the illustrative example, the processor multiplies the elements of an input vector a against a matrix M to calculate the vector-matrix product b. Mathematically, the input vector a is treated as a single column matrix having a number of elements equivalent to the number of rows in the matrix M.

In order to calculate the first element of the vector-matrix product b₀, the processor must iterate through each permutation of input vector a elements for each element within a row of the matrix M. During the first iteration, the first element of the input vector a₀ is read, the current value of the vector-matrix product b₀ is read and the corresponding matrix coefficient value M_0,0 is read. The three (3) read values are used in a multiply-accumulate operation to generate an “intermediary” vector-matrix product b₀. Specifically, the multiply-accumulate operation calculates: (a₀·M_0,0)+b₀ and writes the result value back to b₀. Notably, b₀ is an “intermediary value.” After the first iteration but before the second iteration, the intermediary value of b₀ may not correspond to the final value of the vector-matrix product b₀.

During the second iteration, the second element of the input vector a₁ is read, the previously calculated intermediary value b₀ is retrieved, and a second matrix coefficient value M_1,0 is read. The three (3) read values are used in a multiply-accumulate operation to generate the first element of the vector-matrix product b₀. The second iteration completes the computation of b₀.

While not expressly shown, the iterative process described above is also performed to generate the second element of the vector-matrix product b₁. Additionally, while the foregoing example is a 2×2 vector-matrix product, the techniques described therein are commonly extended to support vector-matrix computations of any size. For example, a 3×3 vector-matrix product calculation iterates over an input vector of three (3) elements for each of the three (3) rows of the matrix; thus, requiring nine (9) iterations. A matrix operation of 1024×1024 (which is not uncommon for many applications) would require more than one million iterations. More directly, the aforementioned iterative process scales quadratically as a function of the matrix dimension.
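The iterative process above can be sketched in a few lines. This is an illustrative model only: it follows the indexing of the example (b₀ accumulates a₀·M_0,0 and then a₁·M_1,0), and it counts one iteration per multiply-accumulate. The function name iterative_vec_mat is hypothetical.

```python
def iterative_vec_mat(a, m):
    """Compute b[j] = sum over i of a[i] * m[i][j], one MAC per iteration."""
    n = len(a)
    b = [0] * n
    iterations = 0
    for j in range(n):          # one output element at a time
        for i in range(n):      # one input element at a time
            # Read a[i], m[i][j], and the intermediary b[j]; write b[j] back.
            b[j] = (a[i] * m[i][j]) + b[j]
            iterations += 1
    return b, iterations


# 2x2 example: four iterations in total.
b, iterations = iterative_vec_mat([1, 2], [[1, 2],
                                           [3, 4]])

# The iteration count for an N x N product is N * N.
count_3 = 3 * 3            # 9
count_1024 = 1024 * 1024   # 1,048,576
```

Each pass through the inner loop corresponds to three reads and one write in the description above, which is why the iteration count translates directly into interface traffic.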

Even though the foregoing discussion is presented in the context of a vector-matrix product, artisans of ordinary skill will readily appreciate that a matrix-matrix product can be performed as a series of vector-matrix products. For example, a first vector-matrix product corresponding to the first single column matrix of the input vector is calculated, a second vector-matrix product corresponding to the second single column matrix of the input vector is calculated, etc. Thus, a 2×2 matrix-matrix product would require two (2) vector-matrix calculations (i.e., 2×4=8 total), and a 3×3 matrix-matrix product would require three (3) vector-matrix calculations (i.e., 3×9=27 total).
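The decomposition of a matrix-matrix product into a series of vector-matrix products may be sketched as shown below. This is a minimal illustration under stated assumptions: the helper names are hypothetical, and each input vector is stored as one inner list of `A` purely for convenience of the sketch.

```python
def vector_matrix_product(a, M):
    """Iterative vector-matrix product; also returns the multiply-accumulate count."""
    b = [0] * len(M[0])
    macs = 0
    for j in range(len(M[0])):
        for i in range(len(M)):
            b[j] += a[i] * M[i][j]
            macs += 1
    return b, macs


def matrix_matrix_product(A, M):
    """Treat each inner list of A as an input vector and multiply it against M."""
    result, total_macs = [], 0
    for a in A:
        b, macs = vector_matrix_product(a, M)
        result.append(b)
        total_macs += macs
    return result, total_macs


C, total = matrix_matrix_product([[1, 2], [3, 4]], [[0, 1], [1, 0]])
print(C, total)  # [[2, 1], [4, 3]] 8
```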

Artisans of ordinary skill in the related arts will readily appreciate that each iteration of the process described in FIG. 1A is bottlenecked by the bandwidth limitations of interface 106 (the “processor-memory wall”). Even though the processor and the memory may have internal buses with very high bandwidths, the processor-memory system can only communicate as fast as the interface 106 can support electrical signaling (based on the dielectric properties of the materials used in the interface 106 (typically copper) and the transmission distance (~1-2 centimeters)). Moreover, the interface 106 may also include a variety of additional signal conditioning, amplification, noise correction, error correction, parity computations, and/or other interface based logic that further increases transaction times.

One common approach to improve the performance of matrix operations is to perform the matrix operation within the local processor cache. Unfortunately, the local processor cache takes processor die space, and has a much higher cost-per-bit to manufacture than e.g., comparable memory devices. As a result, the processor's local cache size is usually much smaller (e.g., a few megabytes) than its memory (which can be many gigabytes). From a practical aspect, the smaller local cache is a hard limitation on the maximum number of matrix operations that can be performed locally within the processor. As another drawback, large matrix operations result in poor cache utilization since only one row and one column are being accessed at a time (e.g., for a 1024×1024 vector-matrix product, only 1/1024 of the cache is in active use during a single iteration). Consequently, while processor cache implementations may be acceptable for small matrices, this technique becomes increasingly less desirable as matrix operations grow in complexity.

Another common approach is a so-called processor-in-memory (PIM). FIG. 1B illustrates one such processor-PIM architecture 150. As shown therein, a processor 152 is connected to a memory 154 via an interface 156. The memory 154 further includes a PIM 162 and a memory array 164; the PIM 162 is tightly coupled to the memory array 164 via an internal interface 166.

Similar to the process described in FIG. 1A supra, the processor-PIM architecture 150 of FIG. 1B multiplies the elements of an input vector a against a matrix M to calculate the vector-matrix product b. However, the PIM 162 reads, multiply-accumulates, and writes to the memory array 164 internally via the internal interface 166. The internal interface 166 is much shorter than the external interface 156; additionally, the internal interface 166 can operate natively without e.g., signal conditioning, amplification, noise correction, error correction, parity computations, etc.

While the processor-PIM architecture 150 yields substantial improvements in performance over e.g., the processor-memory architecture 100, the processor-PIM architecture 150 may have other drawbacks. For example, the fabrication techniques (“silicon process”) are substantially different between processor and memory devices because each silicon process is optimized for different design criteria. For example, the processor silicon process may use thinner transistor structures than memory silicon processes; thinner transistor structures offer faster switching (which improves performance) but suffer greater leakage (which is undesirable for memory retention). As a result, manufacturing a PIM 162 and memory array 164 in the same wafer results in at least one of them being implemented in a sub-optimal silicon process. Alternatively, the PIM 162 and memory array 164 may be implemented within separate dies and joined together; die-to-die communication typically increases manufacturing costs and complexity and may suffer from various other detriments (e.g., introduced by process discontinuities, etc.)

Moreover, artisans of ordinary skill in the related arts will readily appreciate that the PIM 162 and the memory array 164 are “hardened” components; a PIM 162 cannot store data, nor can the memory array 164 perform computations. As a practical matter, once the memory 154 is manufactured, it cannot be altered to e.g., store more data and/or increase/decrease PIM performance/power consumption. Such memory devices are often tailored specifically for their application; they are costly to design and modify, and in many cases they are “proprietary” and/or customer/manufacturer specific. Moreover, since technology changes at a very rapid pace, these devices are quickly obsoleted.

For a variety of reasons, improved solutions for matrix operations within processors and/or memory are needed. Ideally, such solutions would enable matrix operations within a memory device in a manner that minimizes performance bottlenecks of the processor-memory wall. Furthermore, such solutions should flexibly accommodate a variety of different matrix operations and/or matrix sizes.

Exemplary Memory Device

FIG. 2 is a logical block diagram of one exemplary implementation of a memory device 200 manufactured in accordance with the various principles of the present disclosure. The memory device 200 may include a plurality of partitioned memory cell arrays 220. In some implementations, each of the partitioned memory cell arrays 220 may be partitioned at the time of device manufacture. In other implementations, the partitioned memory cell arrays 220 may be partitioned dynamically (i.e., subsequent to the time of device manufacture). The memory cell arrays 220 may each include a plurality of banks, each bank including a plurality of word lines, a plurality of bit lines, and a plurality of memory cells arranged at, for example, intersections of the plurality of word lines and the plurality of bit lines. The selection of the word line may be performed by a row decoder 216 and the selection of the bit line may be performed by a column decoder 218.

The plurality of external terminals included in the memory device 200 may include address terminals 260, command terminals 262, clock terminals 264, data terminals 240 and power supply terminals 250. The address terminals 260 may be supplied with an address signal and a bank address signal. The address signal and the bank address signal supplied to the address terminals 260 are transferred via an address input circuit 202 to an address decoder 204. The address decoder 204 receives, for example, the address signal and supplies a decoded row address signal to the row decoder 216, and a decoded column address signal to the column decoder 218. The address decoder 204 may also receive the bank address signal and supply the bank address signal to the row decoder 216 and the column decoder 218.

A command signal is supplied to the command terminals 262. The command terminals 262 may include one or more separate signals such as e.g., row address strobe (RAS), column address strobe (CAS), and read/write (R/W). The command signal input to the command terminals 262 is provided to the command decoder 208 via the command input circuit 206. The command decoder 208 may decode the command signal to generate various control signals. For example, the RAS can be asserted to specify the row where data is to be read/written, and the CAS can be asserted to specify the column where data is to be read/written. In some variants, the R/W command signal determines whether the contents of the data terminals 240 are written to the memory cell arrays 220, or read therefrom.

During a read operation, the read data may be output externally from the data terminals 240 via a read/write amplifier 222 and an input/output circuit 224. Similarly, when the write command is issued and a row address and a column address are timely supplied with the write command, a write data command may be supplied to the data terminals 240. The write data command may be supplied via the input/output circuit 224 and the read/write amplifier 222 to a given memory cell array 220 and written in the memory cell designated by the row address and the column address. The input/output circuit 224 may include input buffers, in accordance with some implementations.
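The addressing and read/write flow described above can be loosely modeled in software. The following is a minimal sketch only: the class name `ToyMemoryDevice`, the field widths `ROW_BITS` and `COL_BITS`, and the packing of row and column into one address word are all assumptions for illustration, and the RAS/CAS strobe timing of an actual device is not modeled.

```python
ROW_BITS = 3  # hypothetical row address width
COL_BITS = 3  # hypothetical column address width


class ToyMemoryDevice:
    """Software stand-in for banked memory cell arrays with row/column decoding."""

    def __init__(self, banks=2):
        rows, cols = 1 << ROW_BITS, 1 << COL_BITS
        self.arrays = [[[0] * cols for _ in range(rows)] for _ in range(banks)]

    @staticmethod
    def decode(address):
        """Split an address into the row and column selections."""
        row = (address >> COL_BITS) & ((1 << ROW_BITS) - 1)
        col = address & ((1 << COL_BITS) - 1)
        return row, col

    def access(self, bank, address, write, data=None):
        """The write flag plays the role of the R/W command signal."""
        row, col = self.decode(address)
        if write:
            self.arrays[bank][row][col] = data
            return None
        return self.arrays[bank][row][col]


device = ToyMemoryDevice()
device.access(bank=1, address=0b101011, write=True, data=1)
print(device.access(bank=1, address=0b101011, write=False))  # 1
```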

The clock terminals 264 may be supplied with external clock signals for synchronous operation. In one variant, the clock signal is a single ended signal; in other variants, the external clock signals may be complementary (differential signaling) to one another and are supplied to a clock input circuit 210. The clock input circuit 210 receives the external clock signals and conditions the clock signal to ensure that the resulting internal clock signal has sufficient amplitude and/or frequency for subsequent locked loop operation. The conditioned internal clock signal is supplied to a feedback mechanism (internal clock generator 212) to provide a stable clock for internal memory logic. Common examples of internal clock generation logic 212 include without limitation: digital or analog phase locked loop (PLL), delay locked loop (DLL), and/or frequency locked loop (FLL) operation.

In alternative variants (not shown), the memory device 200 may rely on external clocking (i.e., with no internal clock of its own). For example, a phase controlled clock signal may be externally supplied to the input/output (IO) circuit 224. This external clock can be used to clock in written data, and clock out data reads. In such variants, IO circuit 224 provides a clock signal to each of the corresponding logical blocks (e.g., address input circuit 202, address decoder 204, command input circuit 206, command decoder 208, etc.).

The power supply terminals 250 may be supplied with power supply potentials. In some variants (not shown), these power supply potentials may be supplied via the input/output (I/O) circuit 224. In some embodiments, the power supply potentials may be isolated from the I/O circuit 224 so that power supply noise generated by the IO circuit 224 does not propagate to the other circuit blocks. These power supply potentials are conditioned via an internal power supply circuit 230. For example, the internal power supply circuit 230 may generate various internal potentials that e.g., remove noise and/or spurious activity, as well as boost or buck potentials, provided from the power supply potentials. The internal potentials may be used in e.g., the address circuitry (202, 204), the command circuitry (206, 208), the row and column decoders (216, 218), the RW amplifier 222, and/or any various other circuit blocks.

A power-on-reset circuit (PON) 228 provides a power on signal when the internal power supply circuit 230 can sufficiently supply internal voltages for a power-on sequence. A temperature sensor 226 may sense a temperature of the memory device 200 and provides a temperature signal; the temperature of the memory device 200 may affect some memory operations.

In one exemplary embodiment, the memory arrays 220 may be controlled via one or more configuration registers. In other words, these configuration registers selectively configure one or more memory arrays 220 into one or more matrix fabrics and/or matrix multiplication units (MMUs) described in greater detail herein. More directly, the configuration registers may enable the memory cell architectures within the memory arrays to dynamically change e.g., their structure, operation, and functionality. These and other variations would be readily apparent to one of ordinary skill given the contents of the present disclosure.

FIG. 3 provides a more detailed side-by-side illustration of the memory array and matrix fabric circuitry configurations. The memory array and matrix fabric circuitry configurations of FIG. 3 both use the same array of memory cells, where each memory cell is composed of a resistive element 302 that is coupled to a word-line 304 and a bit-line 306, and optionally accumulation circuitry 307. In the first configuration 300, the memory array circuitry is configured to operate as a row decoder 316, a column decoder 318, and an array of memory cells 320. In the second configuration 350, the matrix fabric circuitry is configured to operate as a row driver 317, a matrix multiplication unit (MMU) 319, and an analog crossbar fabric (matrix fabric) 321. In one exemplary embodiment, a look-up-table (LUT) and associated logic 315 can be used to store and configure different matrix multiplication unit coefficient values.

In one exemplary embodiment of the present disclosure, the memory array 320 is composed of a resistive random access memory (ReRAM). ReRAM is a non-volatile memory that changes the resistance of memory cells across a dielectric solid-state material, sometimes referred to as a “memristor.” Current ReRAM technology may be implemented within a two-dimensional (2D) layer or a three-dimensional (3D) stack of layers; however higher order dimensions may be used in future iterations. The complementary metal oxide semiconductor (CMOS) compatibility of the crossbar ReRAM technology may enable both logic (data processing) and memory (storage) to be integrated within a single chip. A crossbar ReRAM array may be formed in a one transistor/one resistor (1T1R) configuration and/or in a configuration with one transistor driving n resistive memory cells (1TNR), among other possible configurations.

Multiple inorganic and organic material systems may enable thermal and/or ionic resistive switching. Such systems may, in a number of embodiments, include: phase change chalcogenides (e.g., Ge₂Sb₂Te₅, AgInSbTe, among others); binary transition metal oxides (e.g., NiO, TiO₂, among others); perovskites (e.g., Sr(Zr)TiO₃, PCMO, among others); solid state electrolytes (e.g., GeS, GeSe, SiO_x, Cu₂S, among others); organic charge transfer complexes (e.g., Cu tetracyanoquinodimethane (TCNQ), among others); organic charge acceptor systems (e.g., Al amino-dicyanoimidazole (AIDCN), among others); and/or 2D (layered) insulating materials (e.g., hexagonal BN, among others); among other possible systems for resistive switching.

In the illustrated embodiment, the resistive element 302 is a non-linear passive two-terminal electrical component that can change its electrical resistance based on a history (e.g., hysteresis or memory) of current application. In at least one exemplary embodiment, the resistive element 302 may form or destroy a conductive filament responsive to the application of different polarities of currents to the first terminal (connected to the word-line 304) and the second terminal (connected to the bit-line 306). The presence or absence of the conductive filament between the two terminals changes the conductance between the terminals. While the present operation is presented within the context of a resistive element, artisans of ordinary skill in the related arts will readily appreciate that the principles described herein may be implemented within any circuitry that is characterized by a variable impedance (e.g., resistance and/or reactance). Variable impedance may be effectuated by a variety of linear and/or non-linear elements (e.g., resistors, capacitors, inductors, diodes, transistors, thyristors, etc.)

For illustrative purposes, the operation of the memory array 320 in the first configuration 300 is briefly summarized. During operation in the first configuration, a memory “write” may be effectuated by application of a current to the memory cell corresponding to the row and column of the memory array. The row decoder 316 can selectively drive various ones of the row terminals so as to select a specific row of the memory array circuitry 320. The column decoder 318 can selectively sense/drive various ones of the column terminals so as to “read” and/or “write” to the corresponding memory cell that is uniquely identified by the selected row and column (as emphasized in FIG. 3 by the heavier line width and blackened cell element). As noted above, the application of current results in the formation (or destruction) of a conductive filament within the dielectric solid-state material. In one such case, a low resistance state (ON-state) is used to represent the logical “1” and a high resistance state (OFF-state) is used to represent a logical “0”. In order to switch a ReRAM cell, a first current with specific polarity, magnitude, and duration is applied to the dielectric solid-state material. Subsequently thereafter, a memory “read” may be effectuated by application of a second current to the resistive element and sensing whether the resistive element is in the ON-state or the OFF-state based on the corresponding impedance. Memory reads may or may not be destructive (e.g., the second current may or may not be sufficient to form or destroy the conductive filament.)
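The two-state operation of the first configuration may be illustrated with the following minimal sketch. The conductance values `G_ON` and `G_OFF`, the read voltage `V_READ`, and the current-threshold sensing scheme are assumed purely for illustration; the physical write current (polarity, magnitude, duration) is abstracted into a direct assignment of conductance.

```python
G_ON = 1.0e-4    # siemens; hypothetical low resistance (ON-state) conductance
G_OFF = 1.0e-6   # siemens; hypothetical high resistance (OFF-state) conductance
V_READ = 0.2     # volts; hypothetical read potential, assumed non-destructive
I_THRESHOLD = V_READ * (G_ON + G_OFF) / 2  # sense threshold between the two states


class BinaryReRamArray:
    """Each cell holds one of two conductance states selected by row and column."""

    def __init__(self, rows, cols):
        self.g = [[G_OFF] * cols for _ in range(rows)]

    def write(self, row, col, bit):
        # Stands in for the write current that forms or destroys the filament.
        self.g[row][col] = G_ON if bit else G_OFF

    def read(self, row, col):
        # Sense the current through the selected cell and compare to the threshold.
        current = V_READ * self.g[row][col]
        return 1 if current > I_THRESHOLD else 0


array = BinaryReRamArray(rows=2, cols=4)
array.write(1, 2, 1)
print(array.read(1, 2), array.read(0, 0))  # 1 0
```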

Artisans of ordinary skill in the related arts will readily appreciate that the foregoing discussion of memory array 320 in the first configuration 300 is consistent with existing memory operation in accordance with e.g., ReRAM memory technologies. In contrast, the second configuration 350 uses the memory cells as an analog crossbar fabric (matrix fabric) 321 to perform matrix multiplication operations. While the exemplary implementation of FIG. 3 corresponds to a 2×4 matrix multiplication unit (MMU), other variants may be substituted with equivalent success. For example, a matrix of arbitrarily large size (e.g., 3×3, 4×4, 8×8, etc.) may be implemented (subject to the precision enabled by digital-to-analog conversion (DAC) 308 and analog-to-digital conversion (ADC) 310 components).

In analog crossbar fabric (matrix fabric) 321 operation, each of the row terminals is concurrently driven by an analog input signal, and each of the column terminals is concurrently sensed for the analog output (which is an analog summation of the voltage potentials across the corresponding resistive elements for each row/column combination). Notably, in the second configuration 350, all of the row and column terminals associated with a matrix multiplication are active (as emphasized in FIG. 3 by the heavier line widths and blackened cell elements). In other words, the ReRAM crossbar fabric (matrix fabric) 321 uses the matrix fabric structure to perform an “analog computation” that calculates a vector-matrix product (or scalar-matrix product, matrix-matrix product, etc.)
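The analog summation performed by the crossbar may be approximated numerically as shown below. This sketch assumes an idealized model in which each cell contributes a current equal to its row voltage multiplied by its conductance, and each column output is the sum of those contributions; wire resistance, noise, and device non-linearity are ignored, and the function name is hypothetical.

```python
def crossbar_outputs(row_voltages, conductances):
    """Idealized crossbar: column j collects sum_i V[i] * G[i][j]."""
    if len(row_voltages) != len(conductances):
        raise ValueError("one voltage is required per row of the fabric")
    cols = len(conductances[0])
    # In hardware every column is sensed concurrently; the loop here is only
    # a sequential software stand-in for that single analog step.
    return [
        sum(v * g_row[j] for v, g_row in zip(row_voltages, conductances))
        for j in range(cols)
    ]


outputs = crossbar_outputs([0.5, 1.0], [[1.0, 2.0], [3.0, 4.0]])
print(outputs)  # [3.5, 5.0]
```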

Notably, the concurrent vector-matrix product calculation within the crossbar fabric is atomic. Specifically, the analog computation of vector-matrix products can complete in a single access cycle. As previously mentioned, an atomic operation is immune to data race conditions. Moreover, the vector-matrix product calculation performs calculations on all rows and all columns of the matrix operation concurrently; in other words, the vector-matrix product calculation does not scale in complexity as a function of matrix dimension. While fabrication constraints (e.g., ADC/DAC granularity, manufacturing tolerance, etc.) may limit the amount of precision and complexity that a single matrix fabric can produce, multiple matrix operations may be mathematically combined together to provide much higher precisions and complexities.

For example, in one exemplary embodiment of the present disclosure, inputs are converted to the analog domain by the DAC 308 for analog computation, but may also be converted back to the digital domain by the ADC 310 for subsequent digital and/or logical manipulation. In other words, the arithmetic logic unit 312 can enable sophisticated numeric manipulation of matrix fabric 321 output. Such capabilities may be used where the analog domain cannot implement the required computation due to practical implementation limitations (e.g., manufacturing cost, etc.)
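The effect of converter precision on a mixed-signal computation can be sketched as follows. The uniform quantizer, the bit widths, and the full-scale ranges are assumptions for illustration only and are not taken from the DAC 308 or ADC 310 of FIG. 3.

```python
def quantize(value, bits, full_scale=1.0):
    """Round a value in [0, full_scale] to the nearest of 2**bits uniform levels."""
    steps = (1 << bits) - 1
    clipped = min(max(value, 0.0), full_scale)
    return round(clipped / full_scale * steps) * full_scale / steps


def mixed_signal_product(a, M, dac_bits, adc_bits, adc_full_scale):
    """DAC the inputs, form the analog vector-matrix product, then ADC the outputs."""
    analog_in = [quantize(x, dac_bits) for x in a]
    analog_out = [
        sum(x * row[j] for x, row in zip(analog_in, M))
        for j in range(len(M[0]))
    ]
    return [quantize(y, adc_bits, adc_full_scale) for y in analog_out]


a = [0.3, 0.7]
M = [[0.5, 0.25], [0.25, 0.5]]
exact = [0.325, 0.425]
coarse = mixed_signal_product(a, M, dac_bits=2, adc_bits=2, adc_full_scale=2.0)
fine = mixed_signal_product(a, M, dac_bits=12, adc_bits=12, adc_full_scale=2.0)
print(coarse, fine)
```

Under these assumptions the 12-bit result lies within 0.001 of the exact product, while the 2-bit result is visibly coarser; any residual error could then be handled by subsequent digital manipulation.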

Consider the illustrative example of FIG. 4, where a simple “FFT butterfly” calculation 400 can be performed via a 2×4 matrix fabric. While conductance can be increased or decreased, conductance cannot be made “negative.” As a result, subtraction may need to be performed within the digital domain. The FFT butterfly operation is described in the following matrix multiplication (EQN. 1):

$\begin{matrix} [\begin{matrix} a_{0} \\ a_{1} \end{matrix}] [\begin{matrix} 1 & 1 \\ 1 & - 1 \end{matrix}] = [\begin{matrix} a_{0} + a_{1} \\ a_{0} - a_{1} \end{matrix}] & EQN . 1 \end{matrix}$

This simple FFT butterfly 400 of EQN. 1 can be decomposed into two distinct matrices representing the positive and negative coefficients (EQN. 2 and EQN. 3):

$\begin{matrix} [\begin{matrix} a_{0} \\ a_{1} \end{matrix}] [\begin{matrix} 1 & 1 \\ 1 & 0 \end{matrix}] = [\begin{matrix} a_{0} + a_{1} \\ a_{0} \end{matrix}] & EQN . 2 \end{matrix}$

$\begin{matrix} [\begin{matrix} a_{0} \\ a_{1} \end{matrix}] [\begin{matrix} 0 & 0 \\ 0 & 1 \end{matrix}] = [\begin{matrix} 0 \\ a_{1} \end{matrix}] & EQN . 3 \end{matrix}$

EQN. 2 and EQN. 3 can be implemented as analog computations with the matrix fabric circuitry. Once calculated, the resulting analog values may be converted back to the digital domain via the aforementioned ADC. Existing ALU operations may be used to perform subtraction in the digital domain (EQN. 4):

$\begin{matrix} [\begin{matrix} a_{0} + a_{1} \\ a_{0} \end{matrix}] - [\begin{matrix} 0 \\ a_{1} \end{matrix}] = [\begin{matrix} a_{0} + a_{1} \\ a_{0} - a_{1} \end{matrix}] & EQN . 4 \end{matrix}$

In other words, as illustrated in FIG. 4, a 2×2 matrix can be further subdivided into a 2×2 positive matrix and a 2×2 negative matrix. The ALU can add/subtract the results of the 2×2 positive matrix and the 2×2 negative matrix to generate a single 2×2 matrix. Artisans of ordinary skill in the related arts will readily appreciate the wide variety and/or capabilities enabled by ALUs. For example, ALUs may provide arithmetic operations (e.g., add, subtract, add with carry, subtract with borrow, negate, increment, decrement, pass through, etc.), bit-wise operations (e.g., AND, OR, XOR, complement), and bit-shift operations (e.g., arithmetic shift, logical shift, rotate, rotate through carry, etc.) to enable e.g., multiple-precision arithmetic, complex number operations, and/or otherwise extend MMU capabilities to any degree of precision, size, and/or complexity.
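The decomposition of EQN. 1 into EQN. 2 through EQN. 4 may be sketched as follows. The helper names are hypothetical; the two non-negative products stand in for the analog matrix fabric computations, and the final element-wise subtraction stands in for the digital ALU operation.

```python
def vector_matrix_product(a, M):
    """b[j] = sum over i of a[i] * M[i][j]."""
    return [sum(a[i] * M[i][j] for i in range(len(M))) for j in range(len(M[0]))]


def split_signed(M):
    """Split M into non-negative matrices P and N such that M = P - N."""
    positive = [[max(c, 0) for c in row] for row in M]
    negative = [[max(-c, 0) for c in row] for row in M]
    return positive, negative


def signed_product(a, M):
    positive, negative = split_signed(M)
    p = vector_matrix_product(a, positive)   # analog computation (EQN. 2)
    n = vector_matrix_product(a, negative)   # analog computation (EQN. 3)
    return [x - y for x, y in zip(p, n)]     # digital subtraction (EQN. 4)


BUTTERFLY = [[1, 1], [1, -1]]
print(split_signed(BUTTERFLY))           # ([[1, 1], [1, 0]], [[0, 0], [0, 1]])
print(signed_product([5, 3], BUTTERFLY))  # [8, 2]
```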

As used herein, the terms “digital” and/or “logical” within the context of computation refer to processing logic that uses quantized values (e.g., “0” and “1”) to represent symbolic values (e.g., “ON-state”, “OFF-state”). In contrast, the term “analog” within the context of computation refers to processing logic that uses the continuously changeable aspects of physical signaling phenomena such as electrical, chemical, and/or mechanical quantities to perform a computation. Various embodiments of the present disclosure may represent analog input and/or output signals as a continuous electrical signal. For example, a voltage potential may have different possible values (e.g., any value between a minimum voltage (0V) and a maximum voltage (1.8V) etc.). Combining analog computing with digital components may be performed with digital-to-analog converters (DACs), analog-to-digital converters (ADCs), arithmetic logic units (ALUs), and/or variable gain amplification/attenuation.

Referring back to FIG. 3, in order to configure the memory cells into the crossbar fabric (matrix fabric) 321 of the second configuration 350, each of the resistive elements may be written with a corresponding matrix coefficient value. Unlike the first configuration 300, the second configuration 350 may write varying degrees of impedance (representing a coefficient value) into each ReRAM cell using an amount of current having a polarity, magnitude, and duration selected to set a specific conductance. In other words, by forming/destroying conductive filaments of varying conductivity, a plurality of different conductivity states can be established. For example, applying a first magnitude may result in a first conductance, applying a second magnitude may result in a second conductance, applying the first magnitude for a longer duration may result in a third conductance, etc. Any permutation of the foregoing writing parameters may be substituted with equivalent success. More directly, rather than using two (2) resistance states (ON-state, OFF-state) to represent two (2) digital states (logic “1”, logic “0”), the varying conductance can use a multiplicity of states (e.g., three (3), four (4), eight (8), etc.) to represent a continuous range of values and/or ranges of values (e.g., [0, 0.33, 0.66, 1], [0, 0.25, 0.50, 0.75, 1], [0, 0.125, 0.250, . . . , 1], etc.).
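The mapping of a coefficient value onto a limited set of conductance states may be sketched as follows. This illustration assumes uniformly spaced, normalized states between 0 and 1 and nearest-level selection; the function names are hypothetical, and the relationship between write current parameters and the resulting conductance is not modeled.

```python
def conductance_levels(num_states):
    """Uniformly spaced normalized conductance states from 0 to 1 inclusive."""
    if num_states < 2:
        raise ValueError("at least two states are required")
    return [k / (num_states - 1) for k in range(num_states)]


def program_cell(coefficient, num_states):
    """Pick the available conductance state closest to the desired coefficient."""
    levels = conductance_levels(num_states)
    return min(levels, key=lambda level: abs(level - coefficient))


print(conductance_levels(5))  # [0.0, 0.25, 0.5, 0.75, 1.0]
print(program_cell(0.6, 5))   # 0.5
print(program_cell(0.6, 2))   # 1.0
```

With only two states the coefficient 0.6 collapses to logic “1”, whereas additional states allow a closer approximation of the intended value.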

In one embodiment of the present disclosure, the matrix coefficient values are stored ahead of time within a look-up-table (LUT) and configured by associated control logic 315. During an initial configuration phase, the matrix fabric 321 is written with matrix coefficient values from the LUT via control logic 315. Artisans of ordinary skill in the related arts will readily appreciate that certain memory technologies may also enable write-once-use-many operation. For example, even though forming (or destroying) a conductive filament for a ReRAM cell may require a specific duration, magnitude, polarity, and/or direction of current; subsequent usage of the memory cell can be repeated many times (so long as the conductive filament is not substantially formed nor destroyed over the usage lifetime). In other words, subsequent usages of the same matrix fabric 321 configuration can be used to defray initial configuration times.

Furthermore, certain memory technologies (such as ReRAM) are non-volatile. Thus, once matrix fabric circuitry is programmed, it may enter a low power state (or even be powered off) to save power when not in use. In some cases, the non-volatility of the matrix fabric may be leveraged to further improve power consumption. Specifically, unlike existing techniques which may re-load matrix coefficient values from non-volatile memory for subsequent processing, the exemplary matrix fabric can store the matrix coefficient values even when the memory device is powered off. On subsequent wake-up, the matrix fabric can be directly used.

In one exemplary embodiment, the matrix coefficient values may be derived according to the nature of the matrix operation. For example, the coefficients for certain matrix operations can be derived ahead of time based on the “size” (or other structurally defined parameter) and stored within the LUT. As but two such examples, the discrete Fourier transform computed by the fast Fourier transform (FFT) (EQN. 5) and the discrete cosine transform (DCT) (EQN. 6) are reproduced infra:

$\begin{matrix} X [k] = \sum_{n = 0}^{N - 1} x (n) e^{- j \frac{2 π k}{N} n} & EQN . 5 \end{matrix}$

$\begin{matrix} X [k] = \sum_{n = 0}^{N - 1} x (n) \cos [\frac{π}{N} (n + \frac{1}{2}) k] & EQN . 6 \end{matrix}$

As can be mathematically determined from the foregoing equations, the matrix coefficient values (also referred to as the “twiddle factors”) are determined according to the size of the transform. For example, the coefficients for an 8-point FFT are:

$e^{- j \frac{π}{4}}, e^{- j \frac{π}{2}}, e^{- j \frac{3 π}{4}}, \dots$

etc. In other words, once the size of the FFT is known, the values for $\frac{2 π k}{N}$ (where k is 0, 1, 2, 3 . . . 7) can be set a priori. In fact, the coefficients for larger FFTs include the coefficients for smaller FFTs. For example, a 64-point FFT has 64 coefficient values, which include all 32 coefficients used in a 32-point FFT, and all 16 coefficients for a 16-point FFT, etc. More directly, a single LUT may contain all the coefficients to support any number of different transforms.
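The nesting of smaller FFT coefficient sets within a larger set can be checked numerically. The following minimal sketch uses the standard Python `cmath` module; the function names and the strided-lookup scheme are assumptions for illustration and are not asserted to be the LUT organization of any particular embodiment.

```python
import cmath


def twiddle_factors(n):
    """Coefficients e^(-j*2*pi*k/n) for k = 0 .. n-1, per EQN. 5."""
    return [cmath.exp(-2j * cmath.pi * k / n) for k in range(n)]


def twiddles_from_lut(lut, n):
    """Read the n-point coefficients out of a larger table by striding through it."""
    if len(lut) % n != 0:
        raise ValueError("LUT size must be a multiple of the requested transform size")
    stride = len(lut) // n
    return lut[::stride]


lut = twiddle_factors(64)              # one table sized for the largest transform
subset_32 = twiddles_from_lut(lut, 32)  # every second entry of the 64-point table
subset_16 = twiddles_from_lut(lut, 16)  # every fourth entry of the 64-point table
print(len(subset_32), len(subset_16))  # 32 16
```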

In another exemplary embodiment, the matrix coefficient values may be stored ahead of time. For example, the coefficients for certain matrix multiplication operations may be known or otherwise defined by e.g., an application or user. For example, image processing computations, such as are described in co-owned and co-pending U.S. patent application Ser. No. 16/002,644 filed Jun. 7, 2018 and entitled “AN IMAGE PROCESSOR FORMED IN AN ARRAY OF MEMORY CELLS”, previously incorporated supra, may define a variety of different matrix coefficient values so as to effect e.g., defect correction, color interpolation, white balance, color adjustment, gamma lightness, contrast adjustment, color conversion, down-sampling, and/or other image signal processing operations.

In another example, the coefficients for certain matrix multiplication operations may be determined or otherwise defined by e.g., user considerations, environmental considerations, other devices, and/or other network entities. For example, wireless devices often experience different multipath effects that can interfere with operation. Various embodiments of the present disclosure determine multipath effects and correct for them with matrix multiplication. In some cases, the wireless device may calculate each of the independent different channel effects based on degradation of known signaling. The differences between an expected and an actual reference channel signal can be used to determine the noise effects that it experienced (e.g., attenuation over specific frequency ranges, reflections, scattering, and/or other noise effects). In other embodiments, a wireless device may be instructed to use a predetermined “codebook” of beamforming configurations. The codebook of beamforming coefficients may be less accurate but may be preferable for other reasons (e.g., speed, simplicity, etc.).
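One simple way to derive correction coefficients from a known reference signal may be sketched as follows. This sketch assumes, purely for illustration, that each frequency component experiences an independent complex gain, so that the correction reduces to a diagonal matrix; the function names and sample values are hypothetical, and practical channel estimation is considerably more involved.

```python
def estimate_channel(expected, received):
    """Per-component channel estimate: ratio of received to expected reference values."""
    return [r / e for e, r in zip(expected, received)]


def correction_matrix(channel):
    """Diagonal matrix whose entries invert the estimated channel effects."""
    n = len(channel)
    return [[1 / channel[i] if i == j else 0 for j in range(n)] for i in range(n)]


def vector_matrix_product(a, M):
    return [sum(a[i] * M[i][j] for i in range(len(M))) for j in range(len(M[0]))]


true_channel = [0.5 + 0j, 2j]   # hypothetical attenuation and phase rotation
reference = [1 + 0j, 1j]        # known signaling
received_reference = [h * x for h, x in zip(true_channel, reference)]
estimate = estimate_channel(reference, received_reference)

sent = [3 + 0j, 4 + 0j]
received = [h * x for h, x in zip(true_channel, sent)]
corrected = vector_matrix_product(received, correction_matrix(estimate))
print(corrected)
```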

As previously alluded to, the matrix coefficient values are stored ahead of time within a look-up-table (LUT) and configured by associated control logic 315. In one exemplary embodiment, the matrix fabric may be configured via dedicated hardware logic. Such internal hardware logic may not be limited by processor word size; thus matrix coefficient values of any dimension may be concurrently configurable (e.g., a 4×4, 8×8, 16×16, etc.). While the present disclosure is presented in the context of internal control logic 315, external implementations may be substituted with equivalent success. For example, in other embodiments, the logic includes an internal processor-in-memory (PIM) that can set the matrix coefficient values based on LUT values in a series of reads and writes. In still other examples, an external processor can perform the LUT and/or logic functionality.

FIG. 5A is a logical block diagram of one exemplary implementation of a processor-memory architecture 500 in accordance with the various principles described herein. As shown in FIG. 5A, a processor 502 is coupled to a memory 504; the memory includes a look-up-table (LUT) 506, a control logic 508, a matrix fabric and corresponding matrix multiplication unit (MMU) 510, and a memory array 512.

In one embodiment, the LUT 506 stores a plurality of matrix value coefficients, dimensions, and/or other parameters, associated with different matrix operations. In one exemplary embodiment, the LUT 506 stores a plurality of fast Fourier transform (FFT) “twiddle factors”; where various subsets of the twiddle factors are associated with different FFT dimensions. For example, a LUT 506 that stores the twiddle factors for a 64-point FFT has 64 coefficient values, which include all 32 coefficients used in a 32-point FFT, and all 16 coefficients for a 16-point FFT, etc. In another exemplary embodiment, the LUT 506 stores a plurality of discrete cosine transform (DCT) “twiddle factors” associated with different DCT dimensions. In other embodiments, the LUT 506 stores a plurality of different matrix coefficient values for image signal processing (ISP) e.g., defect correction, color interpolation, white balance, color adjustment, gamma lightness, contrast adjustment, color conversion, down-sampling, and/or other image signal processing operations. In yet another embodiment of the LUT 506, the LUT 506 may include various channel matrix codebooks that may be predefined and/or empirically determined based on radio channel measurements.
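The subset relationship between twiddle factor tables can be sketched as follows. This Python fragment is an illustrative assumption only: it takes the twiddle factors of an N-point FFT to be the N complex values exp(-2πjk/N), and the helper names `twiddle` and `twiddles_from_lut` are hypothetical. Under that assumption, every second entry of a 64-entry table is an entry of the 32-point table, and every fourth entry is an entry of the 16-point table.

```python
import cmath

def twiddle(n_points, k):
    """k-th twiddle factor of an n_points FFT (assumed form exp(-2*pi*j*k/N))."""
    return cmath.exp(-2j * cmath.pi * k / n_points)

# Hypothetical LUT contents for a 64-point FFT.
lut_64 = [twiddle(64, k) for k in range(64)]

def twiddles_from_lut(lut, n_points):
    """Select the twiddle factors of a smaller FFT from a larger FFT's LUT."""
    if len(lut) % n_points != 0:
        raise ValueError("LUT size must be a multiple of the requested size")
    stride = len(lut) // n_points
    return lut[::stride]

lut_32 = twiddles_from_lut(lut_64, 32)   # every 2nd entry of the 64-point LUT
lut_16 = twiddles_from_lut(lut_64, 16)   # every 4th entry of the 64-point LUT
```

A single stored table can therefore serve several FFT dimensions, with the dimension selecting only the stride used to read it.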

In one embodiment, the control logic 508 controls operation of the matrix fabric and MMU 510 based on instructions received from the processor 502. In one exemplary embodiment, the control logic 508 can form/destroy conductive filaments of varying conductivity within each of the memory cells of a matrix fabric in accordance with the aforementioned matrix dimensions and/or matrix value coefficients provided by the LUT 506. Additionally, the control logic 508 can configure a corresponding MMU to perform any additional arithmetic and/or logical manipulations of the matrix fabric. Furthermore, the control logic 508 may select one or more digital vectors to drive the matrix fabric, and one or more digital vectors to store the logical outputs of the MMU.

In the processing arts, an “instruction” generally includes different types of “instruction syllables”: e.g., opcodes, operands, and/or other associated data structures (e.g., registers, scalars, vectors).

As used herein, the term “opcode” (operation code) refers to an instruction that can be interpreted by a processor logic, memory logic, or other logical circuitry to effectuate an operation. More directly, the opcode identifies an operation to be performed on one or more operands (inputs) to generate one or more results (outputs). Both operands and results may be embodied as data structures. Common examples of data structures include without limitation: scalars, vectors, arrays, lists, records, unions, objects, graphs, trees, and/or any number of other form of data. Some data structures may include, in whole or in part, referential data (data that “points” to other data). Common examples of referential data structures include e.g., pointers, indexes, and/or descriptors.

In one exemplary embodiment, the opcode may identify one or more of: a matrix operation, the dimensions of the matrix operation, and/or the row and/or column of the memory cells. In one such variant, an operand is a coded identifier that specifies the one or more digital vectors that are to be operated upon. For example, an instruction to process a 64-point FFT on an input digital vector, and store the results in an output digital vector might include the opcode and operands: FFT64 ($input, $output), where: FFT64 identifies the size and nature of the 64-point FFT operation, $input identifies an input digital vector base address, and $output identifies an output digital vector base address. In another such example, the 64-point FFT may be split into two distinct atomic operations e.g., FFT64($address) that converts the memory array at the $address into a 64-point matrix fabric, and MULT($address, $input, $output) that stores the vector-matrix product of the $input and the matrix fabric at $address to $output.
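By way of illustration only, the textual instruction forms given above can be split into an opcode and its operands as sketched below. The `Instruction` class, the `parse_instruction` helper, and the regular expression are hypothetical; an actual device would more likely use a binary encoding, which the disclosure does not define.

```python
import re
from dataclasses import dataclass

@dataclass(frozen=True)
class Instruction:
    opcode: str        # e.g. "FFT64" or "MULT"
    operands: tuple    # e.g. ("$input", "$output")

# Hypothetical textual form: OPCODE(operand, operand, ...)
_PATTERN = re.compile(r"^\s*([A-Za-z_]\w*)\s*\(\s*(.*?)\s*\)\s*$")

def parse_instruction(text):
    """Split an instruction string into its opcode and operand syllables."""
    match = _PATTERN.match(text)
    if match is None:
        raise ValueError(f"not an instruction: {text!r}")
    opcode, operand_text = match.groups()
    operands = tuple(part.strip() for part in operand_text.split(",")
                     if part.strip())
    return Instruction(opcode, operands)

# Single combined instruction.
single = parse_instruction("FFT64 ($input, $output)")

# The same work split into two atomic instructions.
split = [
    parse_instruction("FFT64($address)"),
    parse_instruction("MULT($address, $input, $output)"),
]
```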

FIG. 5A illustrates an instruction interface that is functionally separate and distinct from the input/output (I/O) memory interface. In one such embodiment, the instruction interface may be physically distinct (e.g., having different pins and/or connectivity). In other embodiments, the instruction interface may be multiplexed with the I/O memory interface (e.g., sharing the same control signaling, and address and/or data bus but in a distinct communication mode). In still other embodiments, the instruction interface may be virtually accessible via the I/O memory interface (e.g., as registers located within address space that is addressable via the I/O interface). Still other variants may be substituted by artisans of ordinary skill, given the contents of the present disclosure.

In one embodiment, the matrix fabric and MMU 510 are tightly coupled to a memory array 512 to read and write digital vectors (operands). In one exemplary embodiment, the operands are identified for dedicated data transfer hardware (e.g., a direct memory access (DMA)) into and out of the matrix fabric and MMU 510. In one exemplary variant, the digital vectors of data may be of any dimension, and are not limited by processor word size. For example, an instruction may specify an operand of N-bits (e.g., 2, 4, 8, 16, etc.). In other embodiments, the control logic 508 (e.g., DMA logic) can read/write to the matrix fabric 510 using the existing memory row/column bus interfaces. In still other embodiments, the control logic 508 (e.g., DMA logic) can read/write to the matrix fabric 510 using the existing address/data and read/write control signaling within an internal memory interface.

FIG. 5B provides a logical flow diagram of one exemplary set of matrix operations 550 within the context of the exemplary embodiment 500 described in FIG. 5A. As shown therein, the processor 502 writes an instruction to the memory 504 through interface 507 that specifies an opcode (e.g., characterized by a matrix M_x,y) and the operands (e.g., digital vectors a, b).

The control logic 508 determines whether or not the matrix fabric and/or matrix multiplication unit (MMU) should be configured/reconfigured. For example, a section of the memory array is converted into one or more matrix fabrics and weighted with the associated matrix coefficient values defined by the matrix M_x,y. Digital-to-analog (DAC) row drivers and analog-to-digital (ADC) sense amps associated with the matrix fabric may need to be adjusted for dynamic range and/or amplification. Additionally, one or more MMU ALU components may be coupled to the one or more matrix fabrics.

When the matrix fabric and/or matrix multiplication unit (MMU) are appropriately configured, the input operand a is read by the digital-to-analog (DAC) and applied to the matrix fabric M_x,y for analog computation. The analog result may additionally be converted with analog-to-digital (ADC) conversion for subsequent logical manipulation by the MMU ALUs. The output is written into the output operand b.
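The flow of FIG. 5B may be loosely modeled in software as shown below. This is a minimal sketch under stated assumptions: the class `MatrixFabricModel`, the opcode names `SWAP2` and `SCALE2`, and the use of exact arithmetic in place of DAC/ADC conversion are all hypothetical. The only behavior it is meant to show is that the fabric is reconfigured when the requested matrix differs from the one already held, and is otherwise reused.

```python
class MatrixFabricModel:
    """Toy software stand-in for control logic plus matrix fabric (assumed)."""

    def __init__(self, lut):
        self.lut = lut              # opcode -> coefficient matrix (list of rows)
        self.configured = None      # opcode whose coefficients are loaded
        self.weights = None
        self.configure_count = 0    # how many times the fabric was (re)written

    def execute(self, opcode, input_vector):
        # Configure or reconfigure only when the fabric holds another matrix.
        if self.configured != opcode:
            self.weights = self.lut[opcode]
            self.configured = opcode
            self.configure_count += 1
        # Vector-matrix product: output[j] = sum_i input[i] * weights[i][j].
        columns = zip(*self.weights)
        return [sum(a * w for a, w in zip(input_vector, column))
                for column in columns]

# Hypothetical LUT with two 2x2 operations.
lut = {"SWAP2": [[0, 1], [1, 0]], "SCALE2": [[2, 0], [0, 2]]}
device = MatrixFabricModel(lut)

b1 = device.execute("SWAP2", [3, 4])    # configures the fabric
b2 = device.execute("SWAP2", [5, 6])    # reuses the existing configuration
b3 = device.execute("SCALE2", [5, 6])   # reconfigures the fabric
```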

FIG. 5C provides an alternative logical flow diagram of one exemplary set of matrix operations 560 within the context of the exemplary embodiment 500 described in FIG. 5A. In contrast to the flow diagram of FIG. 5B, the system of FIG. 5C uses an explicit instruction to convert the memory array into a matrix fabric. Providing further degrees of atomicity in instruction behaviors can enable a variety of related benefits including for example, pipeline design and/or reduced instruction set complexity.

More directly, when the matrix fabric contains the appropriate matrix value coefficients M_x,y, matrix operations may be efficiently repeated. For example, image processing computations, such as are described in co-owned and co-pending U.S. patent application Ser. No. 16/002,644 filed Jun. 7, 2018 and entitled “AN IMAGE PROCESSOR FORMED IN AN ARRAY OF MEMORY CELLS”, previously incorporated supra, may configure a number of matrix fabric and MMU processing elements so as to pipeline e.g., defect correction, color interpolation, white balance, color adjustment, gamma lightness, contrast adjustment, color conversion, down-sampling, and/or other image signal processing operations. Each one of the pipeline stages may be configured once, and repeatedly used for each pixel (or group of pixels) of the image. For example, the white balance pipeline stage may operate on each pixel of data using the same matrix fabric with the matrix coefficient values set for white balance; the color adjustment pipeline stage may operate on each pixel of data using the same matrix fabric with the matrix coefficient values set for color adjustment, etc. In another such example, the first stage of a 64-point FFT can be handled in thirty-two (32) atomic MMU computations (thirty-two (32) 2-point FFTs) using the same FFT “twiddle factors” (described supra).
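The reuse of one configured fabric across repeated computations can be sketched with the 64-point FFT example. The fragment below assumes a radix-2 decimation-in-time arrangement in which the first stage pairs sample n with sample n + 32; the disclosure does not state which pairing is used, and the names `BUTTERFLY_2PT` and `first_stage` are hypothetical. The same 2×2 coefficient matrix is applied thirty-two times without being rewritten.

```python
# 2-point DFT matrix: its only coefficients are the twiddle factors +1 and -1.
BUTTERFLY_2PT = [[1, 1],
                 [1, -1]]

def vec_mat(vector, matrix):
    """Vector-matrix product: output[j] = sum_i vector[i] * matrix[i][j]."""
    return [sum(v * row[j] for v, row in zip(vector, matrix))
            for j in range(len(matrix[0]))]

def first_stage(samples):
    """Apply one 2-point FFT to each pair (x[n], x[n + N/2]) (assumed pairing)."""
    half = len(samples) // 2
    return [vec_mat([samples[n], samples[n + half]], BUTTERFLY_2PT)
            for n in range(half)]

samples = list(range(64))       # placeholder input data
stage = first_stage(samples)    # thirty-two 2-point results
```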

Moreover, artisans of ordinary skill in the related arts will further appreciate that some matrix fabrics may have additional versatilities and/or uses beyond their initial configuration. For example, as previously noted, a 64-point FFT has 64 coefficient values, which include all 32 coefficients used in a 32-point FFT. Thus, a matrix fabric that is configured for 64-point operation could be reused for 32-point operation with the appropriate application of the 32-point input operand a on the appropriate rows of the 64-point FFT matrix fabric. Similarly, FFT twiddle factors are a superset of discrete cosine transform (DCT) twiddle factors; thus, an FFT matrix fabric could also be used (with appropriate application of input operand a) to calculate DCT results.
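One way the 64-point to 32-point reuse could work is sketched below. The sketch assumes the fabric holds a full 64×64 DFT matrix with entry exp(-2πj·row·col/64), and it selects "the appropriate rows" as the even-numbered rows; both choices are illustrative assumptions, as are the function names. Driving row 2n with the n-th input sample and reading the first 32 columns yields the 32-point DFT, because exp(-2πj·2n·k/64) equals exp(-2πj·n·k/32).

```python
import cmath

def dft_matrix(n_points):
    """n_points x n_points DFT matrix (assumed fabric contents)."""
    return [[cmath.exp(-2j * cmath.pi * row * col / n_points)
             for col in range(n_points)]
            for row in range(n_points)]

def vec_mat(vector, matrix):
    """Vector-matrix product: output[j] = sum_i vector[i] * matrix[i][j]."""
    return [sum(v * row[j] for v, row in zip(vector, matrix))
            for j in range(len(matrix[0]))]

fabric_64 = dft_matrix(64)      # configured once for 64-point operation

def dft32_on_fabric64(input_32):
    """Compute a 32-point DFT using only the 64-point fabric."""
    driven = [0j] * 64
    for n, value in enumerate(input_32):
        driven[2 * n] = value               # drive even rows, leave odd rows idle
    return vec_mat(driven, fabric_64)[:32]  # read the first 32 columns

input_32 = [complex(n, -n) for n in range(32)]   # placeholder operand
reused = dft32_on_fabric64(input_32)
reference = vec_mat(input_32, dft_matrix(32))    # dedicated 32-point matrix
```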

Still other permutations and/or variants of the foregoing example will be made clear to those of ordinary skill in the related arts, given the content of the present disclosure.

Methods

Referring now to FIG. 6, a logical flow diagram of one exemplary method 600 for converting a memory array into a matrix fabric for matrix transformations and performing matrix operations therein is presented.

At step 602 of the method 600, a memory device receives one or more instructions. In one embodiment, the memory device receives the instruction from a processor. In one such variant, the processor is an application processor (AP) commonly used in consumer electronics. In other such variants, the processor is a baseband processor (BB) commonly used in wireless devices.

As a brief aside, so-called “application processors” are processors that are configured to execute an operating system (OS) and one or more applications, firmware, and/or software. The term “operating system” refers to software that controls and manages access to hardware. An OS commonly supports processing functions such as e.g., task scheduling, application execution, input and output management, memory management, security, and peripheral access.

A so-called “baseband processor” is a processor that is configured to communicate with a wireless network via a communication protocol stack. The term “communication protocol stack” refers to the software and hardware components that control and manage access to the wireless network resources. A communication protocol stack commonly includes without limitation: physical layer protocols, data link layer protocols, medium access control protocols, network and/or transport protocols, etc.

Other peripheral and/or co-processor configurations may similarly be substituted with equivalent success. For example, server devices often include multiple processors sharing a common memory resource. Similarly, many common device architectures pair a general purpose processor with a special purpose co-processor (such as a graphics engine, or digital signal processor (DSP)) and a shared memory resource. Common examples of such processors include without limitation: graphics processing units (GPUs), video processing units (VPUs), tensor processing units (TPUs), neural network processing units (NPUs), digital signal processors (DSPs), and/or image signal processors (ISPs). In other embodiments, the memory device receives the instruction from an application specific integrated circuit (ASIC) or other forms of processing logic e.g., field programmable gate arrays (FPGAs), programmable logic devices (PLDs), camera sensors, audio/video processors, and/or media codecs (e.g., image, video, audio, and/or any combination thereof).

In one exemplary embodiment, the memory device is a resistive random access memory (ReRAM) arranged in a “crossbar” row-column configuration. While the various embodiments described herein assume a specific memory technology and specific memory structure, artisans of ordinary skill in the related arts given the contents of the present disclosure will readily appreciate that the principles described herein may be broadly extended to other technologies and/or structures. For example, certain programmable logic structures (e.g., commonly used in field programmable gate arrays (FPGAs) and programmable logic devices (PLDs)) may have similar characteristics to memory with regard to capabilities and topology. Similarly, certain processor and/or other memory technologies may vary resistance, capacitance, and/or inductance; in such cases, varying impedance properties may be used to perform analog computations. Additionally, while the “crossbar” based construction provides a physical structure that is well adapted to two-dimensional (2D) matrix structures, other topologies may be well adapted to higher order mathematical operations (e.g., matrix-matrix products via three-dimensional (3D) memory stacking, etc.)

In one exemplary embodiment, the memory device further includes a controller. The controller receives the one or more instructions and parses each instruction into one or more instruction components (also commonly referred to as “instruction syllables”). In one exemplary embodiment, the instruction syllables include at least one opcode and one or more operands. For example, an instruction may be parsed into an opcode, a first source operand, and a destination operand. Other common examples of instruction components may include without limitation, a second source operand (for binary operations), a shift amount, an absolute/relative address, a register (or other reference to a data structure), an immediate data structure (i.e., a data structure provided within the instruction itself), a subordinate function, and/or branch/link values (e.g., to be executed depending on whether an instruction completes or fails).

In one embodiment, each received instruction corresponds to an atomic memory controller operation. As used herein, an “atomic” instruction is an instruction that completes within a single access cycle. In contrast, a “non-atomic” instruction is an instruction that may or may not complete within a single access cycle. Even though non-atomic instructions might complete in a single cycle, they must be treated as non-atomic to prevent data race conditions. A race condition occurs where data that is being accessed by a processor instruction (either a read or write) may be accessed by another processor instruction before the first processor instruction has a chance to complete; the race condition may unpredictably result in data read/write errors. In other words, an atomic instruction guarantees that the data cannot be observed in an incomplete state.

In one exemplary embodiment, an atomic instruction may identify a portion of the memory array to be converted to a matrix fabric. In some cases, the atomic instruction may identify characteristic properties of the matrix fabric. For example, the atomic instruction may identify the portion of the memory array on the basis of e.g., location within the memory array (e.g., via offset, row, column), size (number of rows, number of columns, and/or other dimensional parameters), granularity (e.g., the precision and/or sensitivity). Notably, atomic instructions may offer very fine grained control over memory device operation; this may be desirable where the memory device operation can be optimized in view of various application specific considerations.

In other embodiments, a non-atomic instruction may specify portions of the memory array that are to be converted into a matrix fabric. For example, the non-atomic instruction may specify various requirements and/or constraints for the matrix fabric. The memory controller may internally allocate resources so as to accommodate the requirements and/or constraints. In some cases, the memory controller may additionally prioritize and/or de-prioritize instructions based on the current memory usage, memory resources, controller bandwidth, and/or other considerations. Such implementations may be particularly useful where memory device management is unnecessary and would otherwise burden the processor.

In one embodiment, the instruction specifies a matrix operation. In one such variant, the matrix operation may be a vector-matrix product. In another variant, the matrix operation may be a matrix-matrix product. Still other variants may be substituted by artisans of ordinary skill in the related arts, given the contents of the present disclosure. Such variants may include e.g., scalar-matrix products, higher order matrix products, and/or other transformations including e.g., linear shifts, rotations, reflections, and translations.

As used herein, the terms “transformation”, “transform”, etc. refer to a mathematical operation that converts an input from a first domain into a second domain. Transformations can be “injective” (distinct elements of the first domain map to distinct elements of the second domain), “surjective” (every element of the second domain is mapped to by at least one element of the first domain), or “bijective” (both injective and surjective; a unique one-to-one mapping of elements from the first domain to the second domain).
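For finite mappings, the three properties can be checked mechanically. The following sketch assumes the standard set-theoretic definitions (injective: no two inputs share an output; surjective: every element of the second domain is reached; bijective: both); the `classify` helper and the sample mappings are hypothetical.

```python
def classify(mapping, second_domain):
    """Classify a finite mapping given as a dict {input: output}."""
    images = list(mapping.values())
    injective = len(set(images)) == len(images)        # no output is shared
    surjective = set(images) == set(second_domain)     # every output is reached
    return {
        "injective": injective,
        "surjective": surjective,
        "bijective": injective and surjective,
    }

one_to_one = classify({1: "a", 2: "b"}, ["a", "b"])
many_to_one = classify({1: "a", 2: "a"}, ["a"])
partial_cover = classify({1: "a"}, ["a", "b"])
```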

More complex mathematically defined transformations that are regularly used in the computing arts include Fourier transforms (and their derivatives, such as the discrete cosine transform (DCT)), Hilbert transforms, Laplace transforms, and Legendre transforms. In one exemplary embodiment of the present disclosure, matrix coefficient values for mathematically defined transformations can be calculated ahead of time and stored within a look-up-table (LUT) or other data structure. For example, twiddle factors for the fast Fourier transform (FFT) and/or DCT can be calculated and stored within a LUT. In other embodiments, matrix coefficient values for mathematically defined transformations can be calculated by the memory controller during (or in preparation for) the matrix fabric conversion process.

Other transformations may not be based on a mathematical definition per se, but may instead be defined based on e.g., an application, another device, and/or a network entity. Such transformations may be commonly used in encryption, decryption, geometric modeling, mathematical modeling, neural networks, network management, and/or other graph theory based applications. For example, wireless networks may use a codebook of predetermined antenna weighting matrices so as to signal the most commonly used beamforming configurations. In other examples, certain types of encryption may agree upon and/or negotiate between different encryption matrices. In such embodiments, the codebook or matrix coefficient values may be agreed ahead of time, exchanged in an out-of-band manner, exchanged in-band, or even arbitrarily determined or negotiated.

Empirically determined transformations may also be substituted with equivalent success given the contents of the present disclosure. For example, empirically derived transformations that are regularly used in the computing arts include radio channel coding, image signal processing, and/or other mathematically modeled environmental effects. For example, a multi-path radio environment can be characterized by measuring channel effects on e.g., reference signals. The resulting channel matrix can be used to constructively interfere with signal reception (e.g., improving signal strength) while simultaneously destructively interfering with interference (e.g., reducing noise). Similarly, an image that has a skewed hue can be assessed for overall color balance, and mathematically corrected. In some cases, an image may be intentionally skewed based on e.g., user input, so as to impart an aesthetic “warmth” to an image.

Various embodiments of the present disclosure may implement “unary” operations within a memory device. Other embodiments may implement “binary”, or even higher order “N-ary” matrix operations. As used herein, the terms “unary”, “binary”, and “N-ary” refer to operations that take one, two, or N input data structures, respectively. In some embodiments, binary and/or N-ary operations may be subdivided into one or more unary matrix in-place operators. As used herein, an “in-place” operator refers to a matrix operation that stores or translates its result into its own state (e.g., its own matrix coefficient values). For example, a binary operation may be decomposed into two (2) unary operations: a first in-place unary operation is executed (the result is stored “in-place”); thereafter, a second unary operation can be performed on the matrix fabric to yield the binary result (for example, a multiply-accumulate operation).

Still other embodiments may serialize and/or parallelize matrix operations based on a variety of considerations. For example, sequentially related operations may be performed in a “serial” pipeline. For example, image processing computations, such as are described in co-owned and co-pending U.S. patent application Ser. No. 16/002,644 filed Jun. 7, 2018 and entitled “AN IMAGE PROCESSOR FORMED IN AN ARRAY OF MEMORY CELLS”, previously incorporated supra, may configure a number of matrix fabric and MMU processing elements to pipeline e.g., defect correction, color interpolation, white balance, color adjustment, gamma lightness, contrast adjustment, color conversion, down-sampling, etc. Pipelined processing can often produce very high data throughput with minimal matrix fabric resources. In contrast, unrelated operations may be performed in “parallel” with separate resources. For example, the first stage of a 64-point FFT can be handled with thirty-two (32) separate matrix fabrics configured as 2-point FFTs. Highly parallelized operation can greatly reduce latency; however, the overall matrix fabric resource utilization may be very high.

In one exemplary embodiment, the instruction is received from a processor via a dedicated interface. Dedicated interfaces may be particularly useful where the matrix computation fabric is treated akin to a co-processor or a hardware accelerator. Notably, dedicated interfaces do not require arbitration, and can be operated at very high speeds (in some cases, at the native processor speed). In other embodiments, the instruction is received via a shared interface.

The shared interface may be multiplexed in time, resource (e.g., lanes, channels, etc.), or other manner with other concurrently active memory interface functionality. Common examples of other memory interface functionality include without limitation: data input/output, memory configuration, processor-in-memory (PIM) communication, direct memory access, and/or any other form of blocking memory access. In some variants, the shared interface may include one or more queuing and/or pipelining mechanisms. For example, some memory technologies may implement a pipelined interface so as to maximize memory throughput.

In some embodiments, the instructions may be received from any entity having access to the memory interface. For example, a camera co-processor (image signal processor (ISP)) may be able to directly communicate with the memory device to e.g., write captured data. In certain implementations, the camera co-processor may be able to offload its processing tasks to a matrix fabric of the memory device. For example, the ISP may accelerate/offload/parallelize e.g., color interpolation, white balance, color correction, color conversion, etc. In other examples, a baseband co-processor (BB) may be able to directly communicate with the memory device to e.g., read/write data for transaction over a network interface. The BB processor may be able to offload e.g., FFT/IFFT, channel estimation, beamforming calculations, and/or any number of other networking tasks to a matrix fabric of a memory device. Similarly, video and/or audio codecs often utilize DCT/IDCT transformations, and would benefit from matrix fabric operations. Still other variants of the foregoing will be readily appreciated by artisans of ordinary skill in the related arts, given the contents of the present disclosure.

Various implementations of the present disclosure may support a queue of multiple instructions. In one exemplary embodiment, matrix operations may be queued together. For example, multiple vector-matrix multiplications may be queued together in order to effectuate a matrix multiplication. Similarly, as previously noted, a higher order transform (e.g., FFT1024) may be achieved by queuing multiple iterations of a lower order constituent transform (e.g., FFT512, etc.) In yet another example, ISP processing for an image may include multiple iterations over the iteration space (each iteration may be queued in advance). Still other queuing schemes may be readily substituted by artisans of ordinary skill in the related arts with equal success, given the contents of the present disclosure.
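The queuing of vector-matrix multiplications to effectuate a matrix multiplication can be sketched as follows. The sketch assumes that one operand matrix is held in the fabric while each row of the other operand is queued as a separate vector-matrix operation; the function names and the use of a software queue are illustrative assumptions.

```python
from collections import deque

def vec_mat(vector, matrix):
    """Vector-matrix product: output[j] = sum_i vector[i] * matrix[i][j]."""
    return [sum(v * row[j] for v, row in zip(vector, matrix))
            for j in range(len(matrix[0]))]

def matmul_via_queue(left, fabric):
    """Matrix-matrix product built from queued vector-matrix products."""
    queue = deque(left)          # one queued operation per row of `left`
    result = []
    while queue:
        # Each dequeued row drives the same, already configured fabric.
        result.append(vec_mat(queue.popleft(), fabric))
    return result

left = [[1, 2], [3, 4]]          # rows are queued as input vectors
fabric = [[5, 6], [7, 8]]        # coefficients held by the matrix fabric
product = matmul_via_queue(left, fabric)
```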

In some cases, matrix operations may be cascaded together to achieve matrix operations of a higher rank. For example, a higher order FFT (e.g., 1024×1024) can be decomposed into multiple iterations of lower rank FFTs (e.g., four (4) iterations of 512×512 FFTs, sixteen (16) iterations of 256×256 FFTs, etc.). In other examples, arbitrarily sized N-point DFTs (e.g., where N is not a power of 2) can be implemented by cascading DFTs of other sizes. Still other examples of cascaded and/or chained matrix transformations may be substituted with equivalent success, the foregoing being purely illustrative.
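The disclosure does not spell out how the cascade is arranged. One reading that is consistent with "four iterations" of half-size operations is a block partition of the coefficient matrix, sketched below for a generic vector-matrix product. The block partition, the function names, and the small 4×4 example are all assumptions; FFT-specific factorizations (which exploit twiddle factor structure to use fewer operations) are not shown.

```python
def vec_mat(vector, matrix):
    """Vector-matrix product: output[j] = sum_i vector[i] * matrix[i][j]."""
    return [sum(v * row[j] for v, row in zip(vector, matrix))
            for j in range(len(matrix[0]))]

def block(matrix, row0, col0, size):
    """Extract a size x size sub-matrix starting at (row0, col0)."""
    return [row[col0:col0 + size] for row in matrix[row0:row0 + size]]

def cascaded_vec_mat(vector, matrix):
    """Compute an N-point product from four (N/2)-point products."""
    half = len(vector) // 2
    top, bottom = vector[:half], vector[half:]
    top_left = vec_mat(top, block(matrix, 0, 0, half))
    top_right = vec_mat(top, block(matrix, 0, half, half))
    bottom_left = vec_mat(bottom, block(matrix, half, 0, half))
    bottom_right = vec_mat(bottom, block(matrix, half, half, half))
    # Partial results are combined by addition (e.g., by MMU logic).
    first_half = [a + b for a, b in zip(top_left, bottom_left)]
    second_half = [a + b for a, b in zip(top_right, bottom_right)]
    return first_half + second_half

matrix = [[1, 2, 3, 4],
          [5, 6, 7, 8],
          [9, 10, 11, 12],
          [13, 14, 15, 16]]
vector = [1, 0, 2, 1]
cascaded = cascaded_vec_mat(vector, matrix)
direct = vec_mat(vector, matrix)
```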

As previously alluded to, the ReRAM's non-volatile nature retains memory contents even when the ReRAM is unpowered. Thus, certain variants of the processor-memory architecture may enable one or more processors to independently power the memory. In some cases, the processor may power the memory when the processor is inactive (e.g., keeping the memory active while the processor is in low power). Independent power management of the memory may be particularly useful for e.g., performing matrix operations in memory, even while the processor is asleep. For example, the memory may receive a plurality of instructions to execute; the processor can transition into a sleep mode until the plurality of instructions have been completed. Still other implementations may use the non-volatile nature of ReRAM to hold memory contents while the memory is powered off; for example, certain video and/or image processing computations may be held within ReRAM during inactivity.

At step 604 of the method 600, a memory array (or portion thereof) may be converted into a matrix fabric based on the instruction. As used herein, the term “matrix fabric” refers to a plurality of memory cells having a configurable impedance that, when driven with an input vector, yield an output vector and/or matrix. In one embodiment, the matrix fabric may be associated with a portion of the memory map. In some such variants, the portion is configurable in terms of its size and/or location. For example, a configurable memory register may determine whether a bank is configured as a memory or as a matrix fabric. In other variants, the matrix fabric may reuse and/or even block memory interface operation. For example, the memory interface may be GPIO based (e.g., in one configuration, the pins of the memory interface may selectively operate as ADDR/DATA during normal operation, or as e.g., FFT16, etc. during matrix operation).

In one embodiment, the instruction identifies a matrix fabric characterized by structurally defined coefficients. In one exemplary embodiment, a matrix fabric contains the coefficients for a structurally defined matrix operation. For example, a matrix fabric for an 8×8 FFT is an 8×8 matrix fabric that has been pre-populated with structurally defined coefficients for an FFT. In some variants, the matrix fabric may be pre-populated with coefficients of a particular sign (positive, negative) or of a particular radix (the most significant bits, least significant bits, or intermediary bits).

As used herein, the term “structurally defined coefficients” refers to the fact that the coefficients of the matrix multiplication are defined by the matrix structure (e.g., the size of the matrix), not the nature of the operation (e.g., multiplying operands). For example, a structurally defined matrix operation may be identified by e.g., a row and column designation (e.g., 8×8, 16×16, 32×32, 64×64, 128×128, 256×256, etc.). While the foregoing discussions are presented in the context of full rank matrix operations, deficient matrix operators may be substituted with equivalent success. For example, a matrix operation may have asymmetric columns and/or rows (e.g., 8×16, 16×8, etc.). In fact, many vector-based operations may be treated as a row with a single column, or a column with a single row (e.g., 8×1, 1×8).

In some hybrid hardware/software embodiments, controlling logic (e.g., a memory controller, processor, PIM, etc.) may determine whether resources exist to provide the matrix fabric. In one such embodiment, a matrix operation may be evaluated by a pre-processor to determine whether or not it should be handled within software or within dedicated matrix fabric. For example, if the existing memory and/or matrix fabric usage consumes all of the memory device resources, then the matrix operation may need to be handled within software rather than via the matrix fabric. Under such circumstances, the instruction may be returned incomplete (resulting in traditional matrix operations via processor instructions). In another such example, configuring a temporary matrix fabric to handle a simple matrix operation may yield so little return that the matrix operation should be handled within software.

Various considerations may be used in determining whether a matrix fabric should be used. For example, memory management may allocate portions of the memory array for memory and/or matrix fabric. In some implementations, portions of the memory array may be statically allocated. Static allocations may be preferable to reduce memory management overhead and/or simplify operational overhead (wear leveling, etc.). In other implementations, portions of the memory array may be dynamically allocated. For example, wear-leveling may be needed to ensure that a memory uniformly degrades in performance (rather than wearing out high usage areas). Still other variants may statically and/or dynamically allocate different portions; for example, a subset of the memory and/or matrix fabric portions may be dynamically and/or statically allocated.

As a brief aside, wear leveling memory cells can be performed in any discrete amount of memory (e.g., a bank of memory, a chunk of memory, etc.) Wear leveling matrix fabric may use similar techniques; e.g., in one variant, wear leveling matrix fabric portions may require that the entire matrix fabric is moved in aggregate (the crossbar structure cannot be moved in pieces). Alternatively, wear leveling matrix fabric portions may be performed by first decomposing the matrix fabric into constituent matrix computations and dispersing the constituent matrix computations to other locations. More directly, matrix fabric wear leveling may indirectly benefit from the “logical” matrix manipulations that are used in other matrix operations (e.g., decomposition, cascading, parallelization, etc.). In particular, decomposing a matrix fabric into its constituent matrix fabrics may enable better wear leveling management with only marginally more complex operation (e.g., the additional step of logical combination via MMU).

In one exemplary embodiment, conversion includes reconfiguring the row decoder to operate as a matrix fabric driver that variably drives multiple rows of the memory array. In one variant, the row driver converts a digital value to an analog signal. In one variant, digital-to-analog conversion includes varying a conductance associated with a memory cell in accordance with a matrix coefficient value. Additionally, conversion may include reconfiguring the column decoder to perform analog decoding. In one variant, the column decoder is reconfigured to sense analog signals corresponding to a column of varying conductance cells that are driven by corresponding rows of varying signaling. The column decoder converts an analog signal to a digital value. While the foregoing construction is presented in one particular row-column configuration, other implementations may be substituted with equal success. For example, a column driver may convert a digital value to an analog signal, and a row decoder may convert an analog signal to a digital value. In another such example, a three-dimensional (3D) row-column-depth memory may implement 2D matrices in any permutation (e.g., row-driver/column-decoder, row-driver/depth-decoder, column-driver/depth-decoder, etc.) and/or 3D matrix permutations (e.g., row-driver/column-decoder-driver/depth-decoder).
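As a hedged illustration of the row-driver/column-decoder arrangement, assume an idealized crossbar that obeys Ohm's law and Kirchhoff's current law, with no line resistance or leakage. The symbols below are introduced only for this illustration: if row i is driven with voltage V_i and the cell at row i, column j has conductance G_ij, then the current sensed on column j is

$I_j = \sum_{i} V_i \, G_{ij}$

so that, under these assumptions, each sensed column current corresponds to the dot product of the driven input vector with one column of matrix coefficients.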

In one exemplary embodiment, the matrix coefficient values correspond to a structurally determined value. Structurally determined values may be based on the nature of the operation. For example, a fast Fourier transform (FFT) on a vector of length N (where N is a power of 2) can be performed with FFT butterfly operations (of 2×2) or some higher order of butterfly (e.g., 4×4, 8×8, 16×16, etc.). Notably, the intermediate constituent FFT butterfly operation weighting is defined as a function of the unit circle

$(\text{e.g., } e^{-j \frac{2 \pi k n}{N}})$

where both n and k are determined from the FFT vector length N; in other words, the FFT butterfly weighting operations are structurally defined according to the vector length N. As a practical matter, a variety of different transformations are similar in this regard. For example, the discrete Fourier transform (DFT) and the discrete cosine transform (DCT) both use structurally defined coefficients.
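To illustrate how such coefficients follow from the vector length alone, the sketch below builds a full N×N DFT coefficient matrix from the unit-circle weighting. It is a minimal sketch with hypothetical function names, and it uses a direct DFT matrix rather than staged FFT butterflies.

```python
import cmath


def dft_matrix(n_points):
    """Build the N x N DFT matrix W[k][n] = exp(-2j*pi*k*n/N)."""
    return [[cmath.exp(-2j * cmath.pi * k * n / n_points)
             for n in range(n_points)]
            for k in range(n_points)]


def apply_matrix(w, x):
    """Matrix-vector product, i.e., the transform of x under w."""
    return [sum(w_kn * x_n for w_kn, x_n in zip(row, x)) for row in w]
```

No input data is needed to populate the matrix; every entry depends only on N and the indices k and n.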

In one exemplary embodiment, the matrix fabric itself has structurally determined dimensions. Structurally determined dimensions may be based on the nature of the operation; for example, ISP white balance processing may use a 3×3 matrix (corresponding to different values of Red (R), Green (G), Blue (B), Luminance (Y), Chrominance Red (Cr), Chrominance Blue (Cb), etc.). In another such example, channel matrix estimations and/or beamforming codebooks are often defined in terms of the number of multiple-input-multiple-output (MIMO) paths. For example, a 2×2 MIMO channel has a corresponding 2×2 channel matrix and a corresponding 2×2 beamforming weighting. Various other structurally defined values and/or dimensions useful for matrix operations may be substituted by artisans of ordinary skill in the related arts, given the contents of the present disclosure.
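As a brief illustration of a structurally determined 3×3 dimension, the sketch below applies a 3×3 matrix to a single three-component pixel. The gain values and the names `WHITE_BALANCE` and `apply_3x3` are assumptions chosen purely for illustration.

```python
def apply_3x3(matrix, pixel):
    """Multiply a 3x3 matrix by a 3-component pixel (e.g., R, G, B)."""
    return tuple(sum(matrix[r][c] * pixel[c] for c in range(3))
                 for r in range(3))


# Assumed per-channel gains, chosen only for illustration.
WHITE_BALANCE = [
    [2.0, 0.0, 0.0],  # red gain
    [0.0, 1.0, 0.0],  # green gain
    [0.0, 0.0, 1.5],  # blue gain
]
```

The matrix dimension is fixed by the three color components regardless of the image size; only the coefficient values change between scenes.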

Certain variants may additionally subdivide matrix coefficient values so as to handle manipulations that may be impractical to handle otherwise. Under such circumstances, a matrix fabric may include only a portion of the matrix coefficient values (to perform only a portion of the matrix operation). For example, performing signed operations and/or higher radix computations may require levels of manufacturing tolerance that are prohibitively expensive. A signed matrix operation may be split into positive and negative matrix operations (which are later summed by a matrix multiplication unit (MMU) described elsewhere herein). Similarly, a high radix matrix operation may be split into, e.g., a most significant bit (MSB) portion, a least significant bit (LSB) portion, and/or any intermediary bits (which may be bit shifted and summed by the aforementioned MMU). Still other variants would be readily appreciated by artisans of ordinary skill, given the contents of the present disclosure.
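The signed split may be illustrated with the following minimal sketch, in which a signed coefficient matrix is separated into two non-negative matrices and the partial products are recombined digitally. The function names are hypothetical, and the recombination is written as a subtraction of the negative-part magnitudes, which is one assumed way of realizing the described summation.

```python
def split_signed(m):
    """Split m into non-negative matrices (pos, neg) with m = pos - neg."""
    pos = [[max(v, 0) for v in row] for row in m]
    neg = [[max(-v, 0) for v in row] for row in m]
    return pos, neg


def vec_mat(x, m):
    """Plain vector-matrix product x @ m using nested lists."""
    return [sum(x[i] * m[i][j] for i in range(len(x)))
            for j in range(len(m[0]))]


def signed_vec_mat(x, m):
    """Combine the two unsigned partial products into a signed result."""
    pos, neg = split_signed(m)
    return [p - n for p, n in zip(vec_mat(x, pos), vec_mat(x, neg))]
```

Each of the two matrices holds only non-negative values, so neither fabric needs to represent a sign.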

In one exemplary embodiment, the matrix coefficient values are determined ahead of time and stored in a look-up-table for later reference. For example, a matrix operation that has both structurally determined dimensions and structurally determined values may be stored ahead of time. As but one such example, an FFT of eight (8) elements has structurally determined dimensions (8×8) and structurally determined values

$(\text{e.g., } e^{-j \frac{\pi}{4}}, e^{-j \frac{\pi}{2}}, e^{-j \frac{3 \pi}{4}}, \text{ etc.})$

An FFT8 instruction may result in the configuration of an 8×8 matrix fabric that is pre-populated with the corresponding FFT8 structurally determined values. As another such example, antenna beamforming coefficients are often defined ahead of time within a codebook; a wireless network may identify a corresponding index within the codebook to configure antenna beamforming. For example, a MIMO codebook may identify the possible configurations for a 4×4 MIMO system; during operation, the selected configuration can be retrieved from a codebook based on an index thereto.
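A look-up-table of this kind might be sketched as follows. The opcode strings, the table name `COEFF_LUT`, and the function `configure_fabric` are hypothetical, and the stored values are full DFT matrices assumed here for simplicity.

```python
import cmath

# Hypothetical opcode names mapped to transform sizes (illustrative only).
_SIZES = {"FFT4": 4, "FFT8": 8}


def _dft_coefficients(n):
    """Structurally determined coefficients exp(-2j*pi*k*m/n)."""
    return [[cmath.exp(-2j * cmath.pi * k * m / n) for m in range(n)]
            for k in range(n)]


# Coefficients are computed once, ahead of time, and only looked up later.
COEFF_LUT = {op: _dft_coefficients(n) for op, n in _SIZES.items()}


def configure_fabric(opcode):
    """Return (rows, cols, coefficients) for a pre-populated fabric."""
    coeffs = COEFF_LUT[opcode]
    return len(coeffs), len(coeffs[0]), coeffs
```

In this sketch the opcode alone selects both the fabric dimensions and its contents, so nothing is recomputed when the instruction is received.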

While the foregoing examples are presented in the context of structurally defined dimensions and/or values, other embodiments may use dimensions and/or values that are defined based on one or more other system parameters. For example, less granularity may be required for low power operation. Similarly, as previously alluded to, various processing considerations may weigh in favor of (or against) performing matrix operations within a matrix fabric. Additionally, matrix operation may affect other memory considerations including without limitation: wear leveling, memory bandwidth, process-in-memory bandwidth, power consumption, row, column, and/or depth decoding complexity, etc. Artisans of ordinary skill in the related arts given the contents of the present disclosure may substitute a variety of other considerations, the foregoing being purely illustrative.

At step 606 of the method 600, one or more matrix multiplication units may be configured on the basis of the instruction. As previously alluded to, certain matrix fabrics may implement logical operations (mathematical identities) to handle a single stage of a matrix operation; however, multiple stages of matrix fabrics may be cascaded together to achieve more complex matrix operations. In one exemplary embodiment, a first matrix is used to calculate positive products of a matrix operation and a second matrix is used to calculate the negative products of a matrix operation. The resulting positive and negative products can be compiled within an MMU to provide a signed matrix multiplication. In one exemplary embodiment, a first matrix is used to calculate a first radix portion of a matrix operation and a second matrix is used to calculate a second radix portion of a matrix operation. The resulting radix portions can be bit shifted and/or summed within an MMU to provide a larger radix product.
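The radix recombination may be illustrated with the following minimal sketch, which assumes unsigned integer coefficients and a two-way split into upper and lower bit fields. The function names and the default of four low bits are assumptions of the sketch.

```python
def split_radix(m, low_bits):
    """Split unsigned integer coefficients into (msb, lsb) parts.

    Each coefficient v satisfies v == (msb << low_bits) + lsb.
    """
    mask = (1 << low_bits) - 1
    msb = [[v >> low_bits for v in row] for row in m]
    lsb = [[v & mask for v in row] for row in m]
    return msb, lsb


def vec_mat(x, m):
    """Plain vector-matrix product x @ m using nested lists."""
    return [sum(x[i] * m[i][j] for i in range(len(x)))
            for j in range(len(m[0]))]


def radix_vec_mat(x, m, low_bits=4):
    """Recombine the partial products by bit shifting and summing."""
    msb, lsb = split_radix(m, low_bits)
    hi, lo = vec_mat(x, msb), vec_mat(x, lsb)
    return [(h << low_bits) + l for h, l in zip(hi, lo)]
```

With four low bits, each constituent matrix holds values of at most 15 even though the combined coefficients span 0 to 255.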

As a brief aside, logical matrix operations are distinguished from analog matrix operations. The exemplary matrix fabric converts analog voltages or current into digital values that are read by the matrix multiplication unit (MMU). Logical operations can manipulate digital values via mathematical properties (e.g., via matrix decomposition, etc.); analog voltages or current cannot be manipulated in this manner.

More generally, different logical manipulations can be performed with groups of matrices. For example, a matrix can be decomposed or factorized into one or more constituent matrices. Similarly, multiple constituent matrices can be aggregated or combined into a single matrix. Additionally, matrices may be expanded in row and/or column to create a rank-deficient matrix of larger dimension (but identical rank). Such logic may be used to implement many higher order matrix operations. For example, multiplying two matrices together may be decomposed as a number of vector-matrix multiplications. These vector-matrix multiplications may be further implemented as multiply-accumulate logic within a matrix multiplication unit (MMU). In other words, even non-unary operations may be handled as a series of piece-wise unary matrix operations. More generally, artisans of ordinary skill in the related arts will readily appreciate that any matrix operation which can be expressed in whole, or in part, as a unary operation may greatly benefit from the various principles described herein.
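The decomposition of a matrix-matrix product into vector-matrix products may be sketched as follows. This is an illustrative software model with hypothetical function names, not a description of any particular MMU implementation.

```python
def vec_mat(x, m):
    """One vector-matrix product, built from multiply-accumulate steps."""
    out = [0] * len(m[0])
    for i, x_i in enumerate(x):
        for j, m_ij in enumerate(m[i]):
            out[j] += x_i * m_ij  # multiply-accumulate
    return out


def mat_mat(a, b):
    """Compute a @ b one row of a at a time; b stays fixed throughout."""
    return [vec_mat(row, b) for row in a]
```

Here the fixed matrix `b` plays the role of the configured coefficients, and each row of `a` is presented in turn as an input vector.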

Various embodiments of the present disclosure use matrix multiplication units (MMUs) as glue logic between multiple constituent matrix fabrics. Additionally, MMU operation may be selectively switched for connectivity to various rows and/or columns. Not all matrix fabrics may be used concurrently; thus, depending on the current processing and/or memory usage, matrix fabrics may be selectively connected to MMUs. For example, a single MMU may be dynamically connected to different matrix fabrics.

In some embodiments, controlling logic (e.g., a memory controller, processor, PIM, etc.) may determine whether resources exist to provide the MMU manipulations within e.g., column decoder or elsewhere. For example, the current MMU load may be evaluated by a pre-processor to determine whether or not an MMU may be heavily loaded. Notably, the MMU is primarily used for logical manipulations, thus any processing entity with equivalent logical functionality may assist with the MMU's tasks. For example, a processor-in-memory (PIM) may offload MMU manipulations. Similarly, matrix fabric results may be directly provided to the host processor (which can perform logical manipulations in software).

More generally, various embodiments of the present disclosure contemplate sharing MMU logic among multiple different matrix fabrics. The sharing may be based on e.g., a time sharing scheme. For example, the MMU may be assigned to a first matrix fabric during one time slot, and a second matrix fabric during another time slot. In other words, unlike the physical structure of the matrix fabric (which is statically allocated for the duration of the matrix operation), the MMU performs logical operations that can be scheduled, subdivided, allocated, reserved, and/or partitioned in any number of ways. Additionally, various embodiments of the matrix fabric are memory-based and non-volatile. As a result, the matrix fabric may be configured in advance, and read from when needed; the non-volatile nature ensures that the matrix fabric retains contents without requiring processing overhead even if e.g., the memory device is powered off.
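One possible time sharing scheme is a simple round-robin assignment, sketched below. The function `time_share` and the notion of a fixed number of slots per fabric are assumptions made only for illustration; other scheduling policies could be substituted.

```python
from collections import deque


def time_share(pending):
    """Assign one MMU to fabrics in round-robin time slots.

    pending maps a fabric name to the number of slots of MMU work it needs.
    Returns the fabric served in each successive slot.
    """
    queue = deque((name, slots) for name, slots in pending.items()
                  if slots > 0)
    schedule = []
    while queue:
        name, slots = queue.popleft()
        schedule.append(name)  # MMU is connected to this fabric
        if slots > 1:
            queue.append((name, slots - 1))  # rejoin the queue for later
    return schedule
```

For example, fabrics needing two, one, and three slots would be served in an interleaved order rather than one after another.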

If both matrix fabrics and corresponding matrix multiplication units (MMUs) are successfully converted and configured, then at step 608 of the method 600, the matrix fabric is driven based on the instruction and a logical result is calculated with the one or more matrix multiplication units at step 610. In one embodiment, one or more operands are converted into an electrical signal for analog computation via the matrix fabric. The analog computation results from driving an electrical signal through the matrix fabric elements; for example, the voltage drop is a function of a coefficient of the matrix fabric. The analog computation result is sensed and converted back to a digital domain signal. Thereafter, the one or more digital domain values are manipulated with the one or more matrix multiplication units (MMUs) to create a logical result.
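The drive, sense, and convert sequence may be modeled in software as follows. This is a minimal sketch under idealized assumptions (linear conversion, an ideal crossbar with no noise or line resistance); the function names and the step sizes `v_step` and `g_step` are hypothetical.

```python
def dac(codes, v_step=0.1):
    """Row driver: digital operand codes to drive voltages (assumed linear)."""
    return [c * v_step for c in codes]


def crossbar(voltages, conductances):
    """Ideal analog step: column current I_j = sum_i V_i * G_ij."""
    cols = len(conductances[0])
    return [sum(v * row[j] for v, row in zip(voltages, conductances))
            for j in range(cols)]


def adc(currents, i_step):
    """Sense component: column currents back to integer codes."""
    return [round(i / i_step) for i in currents]


def drive_and_sense(codes, coeffs, v_step=0.1, g_step=1e-6):
    """Drive the fabric with operand codes and return digital results."""
    # Each integer coefficient is represented by a proportional conductance.
    g = [[c * g_step for c in row] for row in coeffs]
    return adc(crossbar(dac(codes, v_step), g), v_step * g_step)
```

The digital values returned by `drive_and_sense` correspond to the sensed results that would subsequently be manipulated by the one or more MMUs.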

It will be recognized that while certain aspects of the disclosure are described in terms of a specific sequence of steps of a method, these descriptions are only illustrative of the broader methods of the disclosure, and may be modified as required by the particular application. Certain steps may be rendered unnecessary or optional under certain circumstances. Additionally, certain steps or functionality may be added to the disclosed embodiments, or the order of performance of two or more steps permuted. Furthermore, features from two or more of the methods may be combined. All such variations are considered to be encompassed within the disclosure disclosed and claimed herein.

The description set forth herein, in connection with the appended drawings, describes example configurations and does not represent all the examples that may be implemented or that are within the scope of the claims. The term “exemplary” used herein means “serving as an example, instance, or illustration,” and not “preferred” or “advantageous over other examples.” The detailed description includes specific details for the purpose of providing an understanding of the described techniques. These techniques, however, may be practiced without these specific details. In some instances, well-known structures and devices are shown in block diagram form in order to avoid obscuring the concepts of the described examples.

Information and signals described herein may be represented using any of a variety of different technologies and techniques. For example, data, instructions, commands, information, signals, bits, symbols, and chips that may be referenced throughout the above description may be represented by voltages, currents, electromagnetic waves, magnetic fields or particles, optical fields or particles, or any combination thereof.

While the above detailed description has shown, described, and pointed out novel features of the disclosure as applied to various embodiments, it will be understood that various omissions, substitutions, and changes in the form and details of the device or process illustrated may be made by those skilled in the art without departing from the disclosure. This description is in no way meant to be limiting, but rather should be taken as illustrative of the general principles of the disclosure. The scope of the disclosure should be determined with reference to the claims.

It will be further appreciated that while certain steps and aspects of the various methods and apparatus described herein may be performed by a human being, the disclosed aspects and individual methods and apparatus are generally computerized/computer-implemented. Computerized apparatus and methods are necessary to fully implement these aspects for any number of reasons including, without limitation, commercial viability, practicality, and even feasibility (i.e., certain steps/processes simply cannot be performed by a human being in any viable fashion).

The functions described herein may be implemented in hardware, software executed by a processor, firmware, or any combination thereof. If implemented in software executed by a processor, the functions may be stored on or transmitted over as one or more instructions or code on a computer-readable apparatus (e.g., storage medium). Computer-readable media include both non-transitory computer storage media and communication media including any medium that facilitates transfer of a computer program from one place to another. A non-transitory storage medium may be any available medium that can be accessed by a general purpose or special purpose computer. Also, any connection is properly termed a computer-readable medium. For example, if the software is transmitted from a website, server, or other remote source using a coaxial cable, fiber optic cable, twisted pair, digital subscriber line (DSL), or wireless technologies such as infrared, radio, and microwave, then the coaxial cable, fiber optic cable, twisted pair, digital subscriber line (DSL), or wireless technologies such as infrared, radio, and microwave are included in the definition of medium. Disk and disc, as used herein, include CD, laser disc, optical disc, digital versatile disc (DVD), floppy disk and Blu-ray disc where disks usually reproduce data magnetically, while discs reproduce data optically with lasers. Combinations of the above are also included within the scope of computer-readable media.

Claims

1. An apparatus comprising: a memory controller; a plurality of arrays of memory cells, where each memory cell of the plurality of arrays of memory cells are configured to store a digital value as an analog value in an analog medium, wherein: the plurality of arrays of memory cells comprises a first bank of arrays of the plurality of arrays of memory cells and a second bank of arrays of the plurality of arrays of memory cells, the first bank of arrays is configured as a memory, the second bank of arrays is configured as a matrix fabric, and the plurality of arrays of memory cells are dynamically configurable to be partitioned in either of the first bank of arrays or the second bank of arrays; and a memory sense component, where the memory sense component is configured to read the analog value of a first memory cell as a first digital value, wherein the memory controller is configured to: receive image data comprising one or more captured color values; receive a matrix transformation opcode; operate the second bank of the plurality of arrays of memory cells as a matrix multiplication unit (MMU) based on the matrix transformation opcode, wherein: each memory cell of the MMU modifies the analog value in the analog medium in accordance with the matrix transformation opcode and a matrix transformation operand, first read values from a first subset of the memory cells from the first bank are used in a first multiply-accumulate operation to generate a first intermediary vector-matrix product, the first intermediary vector-matrix product is written back to one of the first subset of the memory cells, second read values from a second subset of the memory cells from the first bank are used in a second multiply-accumulate operation to generate a second intermediary vector-matrix product, the second subset is different from the first subset but includes the one of the first subset of the memory cells to which the first intermediary vector-matrix product is written back to, and the matrix transformation operand comprises the one or more captured color values; configure the memory sense component to convert the analog value of the first memory cell into a second digital value in accordance with the matrix transformation opcode and the matrix transformation operand; and based at least on a reading of the matrix transformation operand into the MMU, write a matrix transformation result based on the second digital value, the matrix transformation result comprising one or more shifted color values respectively associated with the one or more captured color values.
2. The apparatus of claim 1, wherein the matrix transformation opcode indicates a size of the MMU.
3. The apparatus of claim 2, wherein the matrix transformation opcode corresponds to a frequency domain transform operation.
4. The apparatus of claim 3, wherein the frequency domain transform operation spans at least one other MMU.
5. The apparatus of claim 1, wherein the matrix transformation opcode identifies one or more analog values corresponding to one or more memory cells.
6. The apparatus of claim 5, wherein the one or more analog values corresponding to the one or more memory cells are stored within a look-up-table (LUT) data structure.
7. The apparatus of claim 1, wherein each memory cell of the MMU comprises resistive random access memory (ReRAM) cells; and wherein the each memory cell of the MMU multiplies the analog value in the analog medium in accordance with the matrix transformation opcode and the matrix transformation operand.
8. The apparatus of claim 7, wherein the analog value in the analog medium is accumulated along with a previous analog value.
9. The apparatus of claim 1, wherein the first digital value is characterized by a first radix of two (2); and wherein the second digital value is characterized by a second radix greater than two (2).
10. The apparatus of claim 1, wherein the matrix transformation opcode corresponds to a matrix operation comprising both positive and negative coefficients.
11. The apparatus of claim 1, wherein: the matrix transformation opcode causes the memory controller to operate another array of memory cells as another matrix structure; and the memory controller is further configured to: logically combine the matrix transformation result with another matrix transformation result associated with the another matrix structure.
12. A computerized memory apparatus comprising: a plurality of memory elements each configured to store digital values as analog values, wherein the plurality of memory elements are dynamically configurable to be partitioned in either a first bank of memory elements or a second bank of memory elements; a memory sense component configured to read the analog values; a data interface configured to perform data communication with a processor apparatus; and control logic configured to, when operated: receive one or more instructions for an input transformation from the processor apparatus; cause configuration of the first bank of memory elements as a matrix multiplication unit (MMU) based at least on the received one or more instructions, wherein the configuration of the first bank of memory elements as the MMU enables atomic matrix fabric computations independent of matrix dimensions; cause the memory sense component to convert a first analog value, associated with a memory element of the second bank of memory elements configured as a memory, into a first digital value based at least on the received one or more instructions; and based at least on the converted first digital value, obtain an output result from the memory sense component; and wherein at least the conversion of the first analog value by the memory sense component and the obtainment of the output result each occur while the processor apparatus is in a sleep or low power mode.
13. The computerized memory apparatus of claim 12, wherein the control logic is further configured to, when operated, configure respective impedance values for the plurality of memory elements.
14. The computerized memory apparatus of claim 13, wherein the one or more instructions for the input transformation comprise one or more opcodes for performance of at least a matrix transformation via use of the configured impedance values.
15. The computerized memory apparatus of claim 12, wherein the control logic is further configured to, when operated, obtain the first analog value associated with the memory element from a lookup table (LUT), the first analog value comprising a value associated with one of a voltage or a current.
16. The computerized memory apparatus of claim 12, wherein: each of the plurality of memory elements comprises resistive random access memory (ReRAM).
17. A method to perform matrix transformation operations, comprising: receiving, by a computerized image processing device, video data comprising one or more image blocks, wherein the video data is captured by a camera of the computerized image processing device; receiving, by the computerized image processing device, a matrix transformation opcode; configuring, by the computerized image processing device, an array of memory cells of a memory into a matrix structure, based on the matrix transformation opcode, wherein: a first plurality of the memory cells are configured into the matrix structure, a second plurality of the memory cells are configured as a memory, and the first and second plurality of memory cells are dynamically configurable to be partitioned in either of a first subset or a second subset; reading, by the computerized image processing device, values from a first subset of the second plurality of memory cells for use in a first multiply-accumulate operation to generate a first intermediary vector-matrix product; writing, by the computerized image processing device, the first intermediary vector-matrix product back to one of the first subset of the first plurality of memory cells; reading, by the computerized image processing device, values from a second subset of the second plurality of memory cells for use in a second multiply-accumulate operation to generate a second intermediary vector-matrix product, wherein the second subset is different from the first subset but includes the one of the first subset of the first plurality of memory cells to which the first intermediary vector-matrix product is written back to; configuring, by the computerized image processing device, a memory sense component based on the matrix transformation opcode; and based on a reading of a matrix transformation operand into the matrix structure, writing, by the computerized image processing device, a matrix transformation result from the memory sense component, wherein: (i) the matrix transformation operand comprises the one or more image blocks and is configured to modify one or more analog values of the matrix structure, (ii) the matrix transformation result comprises one or more frequency domain image coefficients, and (iii) the one or more analog values of the matrix structure accumulate the one or more frequency domain image coefficients from the video data over time.
18. The method of claim 17, wherein the configuring the array of memory cells comprises connecting a plurality of word lines and a plurality of bit lines corresponding to a row dimension and a column dimension associated with the matrix structure.
19. The method of claim 18, further comprising determining the row dimension and the column dimension from the matrix transformation opcode.
20. The method of claim 17, wherein the configuring the array of memory cells comprises setting one or more analog values of the matrix structure based on a look-up-table (LUT) data structure.
21. The method of claim 20, further comprising identifying an entry from the LUT data structure based on the matrix transformation opcode.
22. A computerized memory apparatus comprising: a plurality of memory elements each configured to store digital values as analog values, wherein the plurality of memory elements are dynamically configurable to be partitioned in either a first bank of memory elements or a second bank of memory elements; a memory sense component configured to read the analog values; a data interface configured to perform data communication with a processor apparatus; wherein the plurality of memory elements are powered independently by the processor apparatus; and control logic configured to, when operated: receive one or more instructions for an input transformation from the processor apparatus, the processor apparatus entering a sleep or low power mode after the receipt of the one or more instructions by the computerized memory; and while the processor apparatus is in the sleep or low power mode and the plurality of memory elements are powered independently by the processor apparatus: cause configuration of the first bank of memory elements as a matrix multiplication unit (MMU) based at least on the received one or more instructions, wherein the configuration of the first bank of memory elements as the MMU enables atomic matrix fabric computations independent of matrix dimensions; cause the memory sense component to convert a first analog value, associated with a memory element of the second bank of memory elements configured as a memory, into a first digital value based at least on the received one or more instructions; and based at least on the converted first digital value, obtain an output result from the memory sense component.
23. A computerized memory apparatus comprising: a plurality of memory elements each configured to store digital values as analog values, wherein the plurality of memory elements are dynamically configurable to be partitioned in either a first bank of memory elements or a second bank of memory elements; a memory sense component configured to read the analog values; a data interface configured to perform data communication with a processor apparatus; and control logic configured to, when operated: receive one or more instructions for an input transformation from the processor apparatus, the processor apparatus entering a sleep mode after the receipt of the one or more instructions by the computerized memory; cause configuration of the first bank of memory elements as a matrix multiplication unit (MMU) based at least on the received one or more instructions, wherein the configuration of the first bank of memory elements as the MMU enables atomic matrix fabric computations independent of matrix dimensions; cause the memory sense component to convert a first analog value, associated with a memory element of the second bank of memory elements configured as a memory, into a first digital value based at least on the received one or more instructions; and based at least on the converted first digital value, obtain an output result from the memory sense component; wherein the output result is maintained in the computerized memory during the sleep mode of the processor.

8971124	Manning	Mar 2015	B1
9015390	Klein	Apr 2015	B2
9047193	Lin et al.	Jun 2015	B2
9165023	Moskovich et al.	Oct 2015	B2
9343155	De Santis et al.	May 2016	B1
9430735	Vali et al.	Aug 2016	B1
9436402	De Santis et al.	Sep 2016	B1
9659605	Zawodny et al.	May 2017	B1
9659610	Hush	May 2017	B1
9697876	Tiwari et al.	Jul 2017	B1
9761300	Willcock	Sep 2017	B1
9947401	Navon et al.	Apr 2018	B1
10243773	Shattil	Mar 2019	B1
10340947	Oh et al.	Jul 2019	B2
10430493	Kendall	Oct 2019	B1
10440341	Luo et al.	Oct 2019	B1
10558518	Nair et al.	Feb 2020	B2
10621038	Yang	Apr 2020	B2
10621267	Strachan	Apr 2020	B2
10719296	Lee	Jul 2020	B2
10867655	Harms et al.	Dec 2020	B1
10878317	Hatcher et al.	Dec 2020	B2
10896715	Golov	Jan 2021	B2
11074318	Ma et al.	Jul 2021	B2
11449577	Luo	Sep 2022	B2
11853385	Luo	Dec 2023	B2
11928177	Luo	Mar 2024	B2
20010007112	Porterfield	Jul 2001	A1
20010008492	Higashiho	Jul 2001	A1
20010010057	Yamada	Jul 2001	A1
20010028584	Nakayama et al.	Oct 2001	A1
20010043089	Forbes et al.	Nov 2001	A1
20020059355	Peleg et al.	May 2002	A1
20030167426	Slobodnik	Sep 2003	A1
20030216964	MacLean et al.	Nov 2003	A1
20030222879	Lin et al.	Dec 2003	A1
20040003337	Cypher	Jan 2004	A1
20040073592	Kim et al.	Apr 2004	A1
20040073773	Demjanenko	Apr 2004	A1
20040085840	Vali et al.	May 2004	A1
20040095826	Perner	May 2004	A1
20040154002	Ball et al.	Aug 2004	A1
20040205289	Srinivasan	Oct 2004	A1
20040211260	Girmonsky	Oct 2004	A1
20040240251	Nozawa et al.	Dec 2004	A1
20050015557	Wang et al.	Jan 2005	A1
20050078514	Scheuerlein et al.	Apr 2005	A1
20050097417	Agrawal et al.	May 2005	A1
20050125477	Genov et al.	Jun 2005	A1
20050154271	Radsal	Jul 2005	A1
20050160311	Hartwell et al.	Jul 2005	A1
20050286282	Ogura	Dec 2005	A1
20060047937	Selvaggi et al.	Mar 2006	A1
20060069849	Rudelic	Mar 2006	A1
20060146623	Mizuno et al.	Jul 2006	A1
20060149804	Luick et al.	Jul 2006	A1
20060181917	Kang et al.	Aug 2006	A1
20060215432	Wickeraad et al.	Sep 2006	A1
20060225072	Lari et al.	Oct 2006	A1
20060291282	Liu et al.	Dec 2006	A1
20060294172	Zhong	Dec 2006	A1
20070103986	Chen	May 2007	A1
20070171747	Hunter et al.	Jul 2007	A1
20070180006	Gyoten et al.	Aug 2007	A1
20070180184	Sakashita et al.	Aug 2007	A1
20070195602	Fong et al.	Aug 2007	A1
20070261043	Ho et al.	Nov 2007	A1
20070285131	Sohn	Dec 2007	A1
20070285979	Turner	Dec 2007	A1
20070291532	Tsuji	Dec 2007	A1
20080019209	Lin	Jan 2008	A1
20080025073	Arsovski	Jan 2008	A1
20080037333	Kim et al.	Feb 2008	A1
20080052711	Forin et al.	Feb 2008	A1
20080071939	Tanaka et al.	Mar 2008	A1
20080137388	Krishnan et al.	Jun 2008	A1
20080165601	Matick et al.	Jul 2008	A1
20080178053	Gorman et al.	Jul 2008	A1
20080215937	Dreibelbis et al.	Sep 2008	A1
20080234047	Nguyen	Sep 2008	A1
20080235560	Colmer et al.	Sep 2008	A1
20090006900	Lastras-Montano et al.	Jan 2009	A1
20090067218	Graber	Mar 2009	A1
20090129586	Miyazaki et al.	May 2009	A1
20090154238	Lee	Jun 2009	A1
20090154273	Borot et al.	Jun 2009	A1
20090221890	Saffer	Sep 2009	A1
20090254697	Akerib et al.	Oct 2009	A1
20100067296	Li	Mar 2010	A1
20100091582	Vali et al.	Apr 2010	A1
20100172190	Lavi et al.	Jul 2010	A1
20100191999	Jeddeloh	Jul 2010	A1
20100210076	Gruber et al.	Aug 2010	A1
20100226183	Kim	Sep 2010	A1
20100308858	Noda et al.	Dec 2010	A1
20100332895	Billing et al.	Dec 2010	A1
20110051523	Manabe et al.	Mar 2011	A1
20110063919	Chandrasekhar et al.	Mar 2011	A1
20110093662	Walker et al.	Apr 2011	A1
20110099216	Sun et al.	Apr 2011	A1
20110103151	Kim et al.	May 2011	A1
20110119467	Cadambi et al.	May 2011	A1
20110122695	Li et al.	May 2011	A1
20110140741	Zerbe et al.	Jun 2011	A1
20110219260	Nobunaga et al.	Sep 2011	A1
20110267883	Lee et al.	Nov 2011	A1
20110317496	Bunce et al.	Dec 2011	A1
20120005397	Lim et al.	Jan 2012	A1
20120017039	Margetts	Jan 2012	A1
20120023281	Kawasaki et al.	Jan 2012	A1
20120120705	Mitsubori et al.	May 2012	A1
20120131079	Vu	May 2012	A1
20120134216	Singh	May 2012	A1
20120134225	Chow	May 2012	A1
20120134226	Chow	May 2012	A1
20120140540	Agam et al.	Jun 2012	A1
20120148044	Fang et al.	Jun 2012	A1
20120182798	Hosono et al.	Jul 2012	A1
20120198310	Tran et al.	Aug 2012	A1
20120239991	Melik-Martirosian	Sep 2012	A1
20120246380	Akerib et al.	Sep 2012	A1
20120265964	Murata et al.	Oct 2012	A1
20120281486	Rao et al.	Nov 2012	A1
20120303627	Keeton et al.	Nov 2012	A1
20130003467	Klein	Jan 2013	A1
20130061006	Hein	Mar 2013	A1
20130107623	Kavalipurapu et al.	May 2013	A1
20130117541	Choquette et al.	May 2013	A1
20130124783	Yoon et al.	May 2013	A1
20130132702	Patel et al.	May 2013	A1
20130138646	Sirer et al.	May 2013	A1
20130163362	Kim	Jun 2013	A1
20130173888	Hansen et al.	Jul 2013	A1
20130205114	Badam et al.	Aug 2013	A1
20130219112	Okin et al.	Aug 2013	A1
20130227361	Bowers et al.	Aug 2013	A1
20130283122	Anholt et al.	Oct 2013	A1
20130286705	Grover et al.	Oct 2013	A1
20130321207	Monogioudis et al.	Dec 2013	A1
20130326154	Haswell	Dec 2013	A1
20130332707	Gueron et al.	Dec 2013	A1
20130332799	Cho et al.	Dec 2013	A1
20140089725	Ackaret et al.	Mar 2014	A1
20140101519	Lee et al.	Apr 2014	A1
20140143630	Mu et al.	May 2014	A1
20140172937	Linderman et al.	Jun 2014	A1
20140185395	Seo	Jul 2014	A1
20140215185	Danielsen	Jul 2014	A1
20140219003	Ebsen et al.	Aug 2014	A1
20140245105	Chung et al.	Aug 2014	A1
20140250279	Manning	Sep 2014	A1
20140258646	Goss et al.	Sep 2014	A1
20140344934	Jorgensen	Nov 2014	A1
20140365548	Mortensen	Dec 2014	A1
20150016191	Tsai et al.	Jan 2015	A1
20150039960	Varanasi et al.	Feb 2015	A1
20150063052	Manning	Mar 2015	A1
20150070979	Zhu et al.	Mar 2015	A1
20150078108	Cowles et al.	Mar 2015	A1
20150120987	Wheeler	Apr 2015	A1
20150134713	Wheeler	May 2015	A1
20150185799	Robles et al.	Jul 2015	A1
20150270015	Murphy et al.	Sep 2015	A1
20150279466	Manning	Oct 2015	A1
20150288710	Zeitlin et al.	Oct 2015	A1
20150324290	Leidel et al.	Nov 2015	A1
20150325272	Murphy	Nov 2015	A1
20150340080	Queru	Nov 2015	A1
20150356009	Wheeler et al.	Dec 2015	A1
20150356022	Leidel et al.	Dec 2015	A1
20150357007	Manning et al.	Dec 2015	A1
20150357008	Manning et al.	Dec 2015	A1
20150357019	Wheeler et al.	Dec 2015	A1
20150357020	Manning	Dec 2015	A1
20150357021	Hush	Dec 2015	A1
20150357022	Hush	Dec 2015	A1
20150357023	Hush	Dec 2015	A1
20150357024	Hush et al.	Dec 2015	A1
20150357027	Shu et al.	Dec 2015	A1
20150357047	Tiwari	Dec 2015	A1
20160062672	Wheeler	Mar 2016	A1
20160062673	Tiwari	Mar 2016	A1
20160062692	Finkbeiner et al.	Mar 2016	A1
20160062733	Tiwari	Mar 2016	A1
20160063284	Tiwari	Mar 2016	A1
20160064045	La Fratta	Mar 2016	A1
20160064047	Tiwari	Mar 2016	A1
20160098208	Willcock	Apr 2016	A1
20160098209	Leidel et al.	Apr 2016	A1
20160110135	Wheeler et al.	Apr 2016	A1
20160118137	Zhang	Apr 2016	A1
20160125919	Hush	May 2016	A1
20160154596	Willcock et al.	Jun 2016	A1
20160155482	La Fratta	Jun 2016	A1
20160188250	Wheeler	Jun 2016	A1
20160196142	Wheeler et al.	Jul 2016	A1
20160196856	Tiwari et al.	Jul 2016	A1
20160225422	Tiwari et al.	Aug 2016	A1
20160231935	Chu	Aug 2016	A1
20160255502	Rajadurai et al.	Sep 2016	A1
20160266873	Tiwari et al.	Sep 2016	A1
20160266899	Tiwari	Sep 2016	A1
20160267951	Tiwari	Sep 2016	A1
20160292080	Leidel et al.	Oct 2016	A1
20160306584	Zawodny et al.	Oct 2016	A1
20160306614	Leidel	Oct 2016	A1
20160350230	Murphy	Dec 2016	A1
20160365129	Willcock	Dec 2016	A1
20160371033	La Fratta et al.	Dec 2016	A1
20160375360	Poisner et al.	Dec 2016	A1
20160379115	Burger	Dec 2016	A1
20170026185	Moses	Jan 2017	A1
20170052906	Lea	Feb 2017	A1
20170178701	Willcock et al.	Jun 2017	A1
20170188030	Sakurai et al.	Jun 2017	A1
20170192691	Zhang et al.	Jul 2017	A1
20170192844	Lea et al.	Jul 2017	A1
20170192936	Guo	Jul 2017	A1
20170213597	Micheloni	Jul 2017	A1
20170220526	Buchanan	Aug 2017	A1
20170228192	Willcock et al.	Aug 2017	A1
20170235515	Lea et al.	Aug 2017	A1
20170236564	Zawodny et al.	Aug 2017	A1
20170242902	Crawford, Jr. et al.	Aug 2017	A1
20170243623	Kirsch et al.	Aug 2017	A1
20170262369	Murphy	Sep 2017	A1
20170263306	Murphy	Sep 2017	A1
20170269865	Willcock et al.	Sep 2017	A1
20170269903	Tiwari	Sep 2017	A1
20170277433	Willcock	Sep 2017	A1
20170277440	Willcock	Sep 2017	A1
20170277581	Lea et al.	Sep 2017	A1
20170277637	Willcock et al.	Sep 2017	A1
20170278559	Hush	Sep 2017	A1
20170278584	Rosti	Sep 2017	A1
20170285988	Dobelstein et al.	Oct 2017	A1
20170293434	Tiwari	Oct 2017	A1
20170293912	Furche et al.	Oct 2017	A1
20170301379	Hush	Oct 2017	A1
20170308328	Lim et al.	Oct 2017	A1
20170309314	Zawodny et al.	Oct 2017	A1
20170329577	Tiwari	Nov 2017	A1
20170336989	Zawodny et al.	Nov 2017	A1
20170337126	Zawodny et al.	Nov 2017	A1
20170337953	Zawodny et al.	Nov 2017	A1
20170352391	Hush	Dec 2017	A1
20170371539	Mai et al.	Dec 2017	A1
20180012636	Alzheimer et al.	Jan 2018	A1
20180018559	Yakopcic	Jan 2018	A1
20180024769	Howe et al.	Jan 2018	A1
20180024926	Penney et al.	Jan 2018	A1
20180025759	Penney et al.	Jan 2018	A1
20180025768	Hush	Jan 2018	A1
20180032458	Bell	Feb 2018	A1
20180033478	Tanaka et al.	Feb 2018	A1
20180039484	La Fratta et al.	Feb 2018	A1
20180046405	Hush et al.	Feb 2018	A1
20180046461	Tiwari	Feb 2018	A1
20180060069	Rosti et al.	Mar 2018	A1
20180067802	Ha et al.	Mar 2018	A1
20180074754	Crawford, Jr.	Mar 2018	A1
20180075899	Hush	Mar 2018	A1
20180075926	Sagiv et al.	Mar 2018	A1
20180088850	Willcock	Mar 2018	A1
20180102147	Willcock et al.	Apr 2018	A1
20180108397	Venkata et al.	Apr 2018	A1
20180129558	Das	May 2018	A1
20180130515	Zawodny	May 2018	A1
20180136871	Leidel	May 2018	A1
20180242190	Khoryaev et al.	Aug 2018	A1
20180301189	Hu et al.	Oct 2018	A1
20180309451	Lu et al.	Oct 2018	A1
20180314590	Kothamasu	Nov 2018	A1
20180321942	Yu et al.	Nov 2018	A1
20180336552	Bohli et al.	Nov 2018	A1
20180350433	Hu	Dec 2018	A1
20180364785	Hu et al.	Dec 2018	A1
20180373675	Strachan et al.	Dec 2018	A1
20180373902	Muralimanohar et al.	Dec 2018	A1
20190036772	Agerstam et al.	Jan 2019	A1
20190066780	Hu et al.	Feb 2019	A1
20190080108	Gomez Claros et al.	Mar 2019	A1
20190179869	Park	Jun 2019	A1
20190180839	Kim	Jun 2019	A1
20190205741	Gupta et al.	Jul 2019	A1
20190246418	Loehr et al.	Aug 2019	A1
20190295680	Anzou	Sep 2019	A1
20190304056	Grajewski et al.	Oct 2019	A1
20190333056	Wilkinson et al.	Oct 2019	A1
20190347039	Velusamy et al.	Nov 2019	A1
20190349426	Smith et al.	Nov 2019	A1
20190354421	Brandt et al.	Nov 2019	A1
20190391829	Cronie et al.	Dec 2019	A1
20200004583	Kelly et al.	Jan 2020	A1
20200012563	Yang	Jan 2020	A1
20200020393	Al-Shamma	Jan 2020	A1
20200027079	Kurian	Jan 2020	A1
20200034528	Yang et al.	Jan 2020	A1
20200051628	Joo et al.	Feb 2020	A1
20200065029	Kim et al.	Feb 2020	A1
20200065650	Tran	Feb 2020	A1
20200097359	O'Connor et al.	Mar 2020	A1
20200143470	Pohl et al.	May 2020	A1
20200186607	Murphy et al.	Jun 2020	A1
20200193280	Torng et al.	Jun 2020	A1
20200201697	Torng et al.	Jun 2020	A1
20200210516	Espig	Jul 2020	A1
20200213091	Mai	Jul 2020	A1
20200221518	Schmitz et al.	Jul 2020	A1
20200226233	Penugonda et al.	Jul 2020	A1
20200257552	Park et al.	Aug 2020	A1
20200264688	Harms	Aug 2020	A1
20200264689	Harms	Aug 2020	A1
20200265915	Harms	Aug 2020	A1
20200279012	Khaddam-Aljameh	Sep 2020	A1
20200381057	Park et al.	Dec 2020	A1
20210098047	Harms et al.	Apr 2021	A1
20210149984	Luo	May 2021	A1
20210173893	Luo	Jun 2021	A1
20210274581	Schmitz et al.	Sep 2021	A1
20230014169	Luo	Jan 2023	A1
20240078286	Luo	Mar 2024	A1

Foreign Referenced Citations (28)

Number	Date	Country
1142162	Feb 1997	CN
1452396	Oct 2003	CN
101330616	Dec 2008	CN
102141905	Aug 2011	CN
108960418	Dec 2018	CN
109117416	Jan 2019	CN
112201287	Jan 2021	CN
113767436	Dec 2021	CN
0214718	Mar 1987	EP
2026209	Feb 2009	EP
H0831168	Feb 1996	JP
2009259193	Nov 2009	JP
100211482	Aug 1999	KR
20100134235	Dec 2010	KR
20130049421	May 2013	KR
135267	May 2001	WO
WO-0165359	Sep 2001	WO
WO-2010079451	Jul 2010	WO
WO-2011031260	Mar 2011	WO
WO-2013062596	May 2013	WO
WO-2013081588	Jun 2013	WO
WO-2013095592	Jun 2013	WO
WO-2017137888	Aug 2017	WO
WO-2017220115	Dec 2017	WO
2020146417	Jul 2020	WO
2020168114	Aug 2020	WO
2020118047	Nov 2020	WO
2020227145	Nov 2020	WO

Non-Patent Literature Citations (33)

Entry
L. Hongxia and H. Shitan, “High Performance Algorithm for Twiddle Factor of Variable-size FFT Processor and its Implementation,” 2012 International Conference on Industrial Control and Electronics Engineering, 2012, pp. 1078-1081, doi: 10.1109/ICICEE.2012.285. (Year: 2012).
R. Cai, A. Ren, Y. Wang and B. Yuan, “Memristor-Based Discrete Fourier Transform for Improving Performance and Energy Efficiency,” 2016 IEEE Computer Society Annual Symposium on VLSI (ISVLSI), 2016, pp. 643-648, doi: 10.1109/ISVLSI.2016.124. (Year: 2016).
P. Chi et al., “PRIME: A Novel Processing-in-Memory Architecture for Neural Network Computation in ReRAM-Based Main Memory,” 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA), 2016, pp. 27-39, doi: 10.1109/ISCA.2016.13. (Year: 2016).
Mittal, S. A Survey of ReRAM-Based Architectures for Processing-In-Memory and Neural Networks. Mach. Learn. Knowl. Extr. Jan. 2019, 75-114. https://doi.org/10.3390/make1010005 (Year: 2018).
Yuhao Wang et al., “Optimizing Boolean embedding matrix for compressive sensing in RRAM crossbar,” 2015 IEEE/ACM International Symposium on Low Power Electronics and Design (ISLPED), 2015, pp. 13-18, doi: 10.1109/ISLPED.2015.7273483. (Year: 2015).
S. Liu, Y. Wang, M. Fardad and P. K. Varshney, “A Memristor-Based Optimization Framework for Artificial Intelligence Applications,” in IEEE Circuits and Systems Magazine, vol. 18, No. 1, pp. 29-44, Firstquarter 2018, doi: 10.1109/MCAS.2017.2785421. (Year: 2018).
M. Hu and J. P. Strachan, “Accelerating Discrete Fourier Transforms with dot-product engine,” 2016 IEEE International Conference on Rebooting Computing (ICRC), Nov. 10, 2016, pp. 1-5, doi: 10.1109/ICRC.2016.7738682. (Year: 2016).
Nunez-Yanez, J., Amiri, S., Hosseinabady, M. et al. Simultaneous multiprocessing in a software-defined heterogeneous FPGA. J Supercomput 75, 4078-4095 (2019). https://doi.org/10.1007/s11227-018-2367-9 (Year: 2018).
Adibi J., et al., “Processing-In-Memory Technology for Knowledge Discovery Algorithms,” Proceeding of the Second International Workshop on Data Management on New Hardware, Jun. 25, 2006, 10 pages, Retrieved from the internet URL: http://www.cs.cmu.edu/~damon2006/pdf/adibi06inmemory.pdf.
Boyd S. W., et al., “On the General Applicability of Instruction-Set Randomization,” IEEE Transactions on Dependable and Secure Computing, 2010, vol. 7 (3), 14 pages.
Cardarilli G.C., et al., “Design of a Fault Tolerant Solid State Mass Memory,” IEEE Transactions on Reliability, Dec. 2003, vol. 52(4), pp. 476-491.
Debnath B., et al., “Bloomflash: Bloom Filter on Flash-Based Storage,” 31st Annual Conference on Distributed Computing Systems, Jun. 20-24, 2011, 10 pages.
Derby J. H., et al., “A High-Performance Embedded DSP Core with Novel SIMD Features,” IEEE International Conference on Acoustics, Speech, and Signal Processing, Apr. 6-10, 2003, vol. 2, pp. 301-304.
Draper J., et al., “The Architecture of the DIVA Processing-In-Memory Chip,” International Conference on Supercomputing, Jun. 22-26, 2002, 12 pages, Retrieved from the internet URL: http://www.isi.edu/~draper/papers/ics02.pdf.
Dybdahl H., et al., “Destructive-Read in Embedded DRAM, Impact on Power Consumption,” Journal of Embedded Computing-Issues in Embedded Single-chip Multicore Architectures, Apr. 2006, vol. 2 (2), 10 pages.
Elliott D. G., et al., “Computational RAM: Implementing Processors in Memory,” IEEE Design and Test of Computers Magazine, Jan.-Mar. 1999, vol. 16 (1), 10 pages.
Kogge P. M., et al., “Processing In Memory: Chips to Petaflops,” May 23, 1997, 8 pages, Retrieved from the internet URL: http://www.cs.ucf.edu/courses/cda5106/summer02/papers/kogge97PIM.pdf.
Message Passing Interface Forum 1.1 “4.9.3 MINLOC and MAXLOC”, Jun. 12, 1995, 5 pages, Retrieved from the internet [URL: https://www.mpi-forum.org/docs/mpi-1.1/mpi-11-html/node79.html].
Pagiamtzis K., “Content-Addressable Memory Introduction”, Jun. 25, 2007, 6 pages, Retrieved from the internet [URL: http://www.pagiamtzis.com/cam/camintro].
Pagiamtzis K., et al., “Content-Addressable Memory (CAM) Circuits and Architectures: A Tutorial and Survey,” IEEE Journal of Solid-State Circuits, Mar. 2006, vol. 41 (3), 16 pages.
Stojmenovic I, “Multiplicative Circulant Networks Topological Properties and Communication Algorithms,” Discrete Applied Mathematics, 1997, vol. 77, pp. 281-305.
Co-pending U.S. Appl. No. 13/449,082, entitled, “Methods and Apparatus for Pattern Matching,” filed Apr. 17, 2012, 37 pages.
Co-pending U.S. Appl. No. 13/774,553, entitled, “Neural Network in a Memory Device,” filed Feb. 22, 2013, 63 pages.
Co-pending U.S. Appl. No. 13/774,636, entitled, “Memory as a Programmable Logic Device,” filed Feb. 22, 2013, 30 pages.
Co-pending U.S. Appl. No. 13/743,686, entitled, “Weighted Search and Compare in a Memory Device,” filed Jan. 17, 2013, 25 pages.
Co-pending U.S. Appl. No. 13/796,189, entitled, “Performing Complex Arithmetic Functions in a Memory Device,” filed Mar. 12, 2013, 23 pages.
Methods and Apparatus for Performing Video Processing Matrix Operations Within a Memory Array, U.S. Appl. No. 16/689,981, filed Nov. 20, 2019, Inventor: Fa-Long Luo, Status: Notice of Allowance Mailed—Application Received in Office of Publications, Status Date: Jun. 29, 2021.
Methods and Apparatus for Performing Diversity Matrix Operations Within a Memory Array, U.S. Appl. No. 16/705,096, filed Dec. 5, 2019, Inventor: Fa-Long Luo, Status: Docketed New Case—Ready for Examination, Status Date: Jan. 10, 2020.
Methods and Apparatus for Performing Video Processing Matrix Operations Within a Memory Array, U.S. Appl. No. 17/948,126, filed Sep. 19, 2022, Inventor: Fa-Long Luo, Status: Application Undergoing Preexam Processing, Status Date: Sep. 19, 2022.
International Search Report and Written Opinion for PCT Application No. PCT/US2013/043702, mailed Sep. 26, 2013, 11 pages.
Athreyas, Nihar, et al., “Memristor-CMOS Analog Coprocessor for Acceleration of High-Performance Computing Applications.” ACM Journal on Emerging Technologies in Computing Systems (JETC), ACM, Nov. 1, 2018.
Extended European Search Report, EP 20802085.9, mailed on May 19, 2023.
Athreyas, Nihar, et al., “Memristor-CMOS Analog Coprocessor for Acceleration of High-Performance Computing Applications.” ACM Journal on Emerging Technologies in Computing Systems, vol. 14 No. 3, Nov. 1, 2018.

Related Publications (1)

	Number	Date	Country
	20200349217 A1	Nov 2020	US
