The present invention relates to a tensor processor performing matrix multiplication.
For a general-purpose processor offering high computational flexibility, matrix operations take place serially, one at a time, while requiring continuous access to cache memory, thus generating the so-called “von Neumann bottleneck”. Specialized architectures for neural networks (NNs), such as Graphics Processing Units (GPUs) and Tensor Processing Units (TPUs), have been engineered to reduce the effect of the von Neumann bottleneck, enabling cutting-edge machine learning models. The paradigm of these architectures is domain specificity: unlike CPUs, they are optimized to perform operations such as convolutions or Matrix-Vector Multiplications (MVMs) in parallel, deployed for instance via systolic algorithms.
GPUs have thousands of processing cores optimized for matrix math operations, providing tens to hundreds of TFLOPS (tera floating-point operations per second) of performance, which makes GPUs the obvious computing platform for deep (i.e. multi-layered) NN-based artificial intelligence (AI) such as machine-learning (ML) applications. GPUs and TPUs are particularly beneficial with respect to CPUs, but when used to implement deep NNs performing inference on large 2-dimensional data sets such as images, they are rather power-hungry and require long computation times (>tens of ms). Moreover, smaller matrix multiplications for less complex inference tasks (e.g. MNIST, CIFAR-10 datasets) are still challenged by a non-negligible latency, predominantly due to the access overhead of the various memory hierarchies and the latency of executing each instruction in the GPU.
Given this context of computational hardware for obtaining architectures that efficiently mimic some functionality of the biological circuitry of the brain, it is necessary to explore and reinvent the operational paradigms of current logic computing platforms when performing matrix algebra. Sequential, temporized operations, and their associated continuous access to memory, must be replaced with massively parallelized, distributed analog dynamical units, towards delivering efficient post-CMOS devices and systems, summarized as non-von Neumann architectures. In this paradigm shift, the wave nature of light and its inherent operations, such as interference and diffraction, can play a major role in enhancing computational throughput while concurrently reducing the power consumption of neuromorphic platforms.
In recent years, the revolutionizing impact of NNs contributed to the development of a plethora of emerging technologies, ranging from free-space diffractive optics to nanophotonic processors, aiming to improve the computational efficiency of specific tasks performed by NNs. Integrated photonic platforms can indeed provide parallel, power-efficient and low-latency computing, which is possible because analog wave chips can a) perform the dot product inherently using light-matter interactions, such as via a phase shifter or modulator, b) enable signal accumulation (summation) by either electromagnetic coherent interference or incoherent accumulation through detectors, and c) enable parallelism strategies and higher throughput using multiplexing schemes such as wavelength- or polarization-division multiplexing, for example.
A system comprises an engine receiving one or more inputs and configured to conduct optical and/or electro-optical tensor operations on the input(s) (one or more physical inputs) by performing optical, electro-optical, or all-optical dot-product multiplications, together with either coherent or incoherent summation, thus performing multiply-accumulate (MAC) operations. The entire photonic tensor core (PTC) processor is composed of modular PTC sub-modules, which perform said MAC operations.
Each PTC sub-module comprises a photonic dot product engine (PDPE) having one or more first inputs and one or more second inputs. The first and/or second input is a matrix, a vector, or a scalar. The PTC and PDPE use integrated photonics, and/or fiber optics, and/or free-space optics, and/or a combination of these to optically perform the dot-product multiplication of the first input and the second input. A plurality of PTC sub-modules forms a Photonic Tensor Core (PTC) processor unit.
In describing the illustrative, non-limiting embodiments of the invention illustrated in the drawings, specific terminology will be resorted to for the sake of clarity. However, the invention is not intended to be limited to the specific terms so selected, and it is to be understood that each specific term includes all technical equivalents that operate in a similar manner to accomplish a similar purpose. Several embodiments of the invention are described for illustrative purposes, it being understood that the invention may be embodied in other forms not specifically shown in the drawings.
Turning to the drawings and to Table 1, V and M stand for Vector and Matrix, respectively. In the example embodiment shown in the figures, the dot product engine (100) has 4 reconfigurable inputs (2), with optional DACs (3), and 4 inputs (1), with optional DACs (4). Each dot product engine (4 inputs (1) and 4 reconfigurable elements (2)) performs 4 multiplications, followed by the post-multiplication accumulations (40) and (26). Different tensor operations can be decomposed into multiplications and additions which, according to the algorithm complexity (a function of the dimensions of the matrices), require corresponding utilization.
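For illustration only, the arithmetic of one such 4-input dot product engine can be sketched in software as follows (a minimal Python sketch; the function name and example values are hypothetical and not part of the specification):

```python
# Minimal sketch (hypothetical names/values): the arithmetic of one 4-input
# dot product engine, i.e. 4 element-wise multiplications followed by
# accumulation, forming one multiply-accumulate (MAC) chain.
def dot_product_engine(a: list[float], b: list[float]) -> float:
    """Multiply each input element by its reconfigurable weight, then sum."""
    assert len(a) == len(b) == 4              # 4 inputs and 4 reconfigurable elements
    products = [x * w for x, w in zip(a, b)]  # 4 parallel multiplications
    return sum(products)                      # post-multiplication accumulation

print(dot_product_engine([1.0, 2.0, 3.0, 4.0], [0.5, 0.5, 0.5, 0.5]))  # 5.0
```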
The first input A (1) consists of optical signals that are either modulated (i.e. carrying encoded data, termed herein Case 2) or un-modulated photons (termed herein Case 1) impinging on the input ports of A. For the latter, the input can be, for example, a grating coupler of a photonic integrated circuit (PIC), a fiber optic system, or a free-space implementation using digital light processing (DLP) technology, such as a spatial light modulator (SLM) or a digital micromirror device (DMD), for example.
As further shown, the Tensor Assembly (100) can optionally include one or more Digital-to-Analog Converters (DACs) (4), (3) at each of the first and second inputs, respectively. The input time-variant signals (input matrix A) can be electrical data (Case 1) and/or optical data (Case 2). The electrical data entering (1) and the kernel input (2) can be analog and/or digital.
Referring momentarily to the drawings, to provide some illustrative examples: Cases i and iv rely on photonic non-volatile memories, such as those provided by phase-change materials, a nearby electrical capacitor, or similar. For this exemplary photonic-memory-based option, what separates Case i from Case iv is whether the spectral filter is just passive, with the dot-product performed post-filter (Case i), or actively tuned to perform the dot-product (Case iv).
The spectral filter can be any type of frequency filter, such as a tunable microring resonator (MRR), for example, as shown in the drawings.
Physically, the PDPE takes a signal (A) and amplitude-weights it based on a value B. For example, if data A is a number and B is a number between 0 and 1, then the ‘weighting’, i.e. the dot-product term, equals the A-value times B. This is one multiplication, and there are N performed per Di,j PTC sub-module.
Thus, the PDPE (5) can perform matrix-matrix, matrix-vector, or vector-matrix multiplication. That is, the entire tensor-core processor (50) is formed from a plurality of such PDPE-based PTC sub-modules.
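As a minimal sketch (illustrative only; square N×N operands assumed for brevity), the decomposition of a matrix-matrix multiplication into the per-sub-module dot products described above can be expressed as:

```python
# Minimal sketch (illustrative; square NxN matrices assumed): a matrix-matrix
# multiplication decomposes into independent dot products
# D[i][j] = row_i(A) . col_j(B), each mapping onto one PTC sub-module.
def matrix_multiply(A: list[list[float]], B: list[list[float]]) -> list[list[float]]:
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n))  # one PDPE dot product
             for j in range(n)]                        # one column of B per j
            for i in range(n)]                         # one row of A per i
```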
In either case, the photonic PDPE performs these multiplications more efficiently than electronic counterparts because of its inherent parallelism, such as that afforded by multiplexing options. That is, no sequential iterations are needed, meaning that all multiplications happen at the same time, with short runtime, at a power consumption similar to electronics.
Thus, the dot-product options (24) refer to the various configurations of the PDPE (5) itself, which are set forth in the drawings.
As shown, each Dot Product implementation has twelve (2×6) implementation options, all detailed in the drawings: the two input cases (Case 1 and Case 2) combined with the six configuration cases (Cases i-vi).
For component scaling, in Cases i-iii the number of DACs is 2N³ (Case 2) or N² (Case 1), and the number of spectral filter components (e.g. MRRs) scales with 2N³; note, however, that all spectral filters are ‘passive’ or require only minimal (e.g. coarse WDM) spectral tuning. For Cases iv-vi, the number of DACs equals the number of spectral filters (e.g. MRRs), namely 2N² (Case 2) or N² (Case 1); note, however, that sensitive ‘active’ N-bit tuning of the spectral filters is required.
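A minimal sketch transcribing these scaling rules (the function names are hypothetical; N denotes the dimension of an N×N operation):

```python
# Minimal sketch (hypothetical names) of the component-count scaling above.
def dac_count(case_group: str, data_case: int, n: int) -> int:
    """DAC count: Cases i-iii -> 2N^3 (Case 2) or N^2 (Case 1);
    Cases iv-vi -> 2N^2 (Case 2) or N^2 (Case 1)."""
    if case_group == "i-iii":
        return 2 * n**3 if data_case == 2 else n**2
    return 2 * n**2 if data_case == 2 else n**2

def spectral_filter_count(case_group: str, data_case: int, n: int) -> int:
    """Spectral-filter (e.g. MRR) count: 2N^3 for Cases i-iii (passive);
    equal to the DAC count for Cases iv-vi (active N-bit tuning)."""
    return 2 * n**3 if case_group == "i-iii" else dac_count("iv-vi", data_case, n)
```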
The Runtime Latency scales as follows. For Cases i-iii with Case 1: Σ{TOF+Rx}; with Case 2: Σ{TOF+A-DAC+A-RC+Rx}, and if kernel reconfiguration is required, add Σ{B-DAC+B-RC}. Definitions: TOF = time-of-flight; Rx = receiver; A-RC and B-RC = RC-delay times from the A- and B-inputs, respectively. For Cases iv-vi, each pass incurs Σ{TOF+MRR-RC+MRR-DAC+Rx}, or Σ{TOF+B-DAC+B-RC+Rx} if kernel reconfiguration is required, where MRR-RC is the latency of the tunable spectral filters, such as MRRs, and MRR-DAC is the DAC latency for tuning them. In total, for Cases iv-vi with Case 1: N×{Σ{TOF+Rx+(N−1)×{MRR-RC+MRR-DAC}}}; for Cases iv-vi with Case 2: N×{Σ{TOF+Rx+MRR-RC+MRR-DAC}}.
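For concreteness, a minimal latency-estimator sketch of the formulas above; all component latencies are assumed, illustrative values only, not taken from the specification:

```python
# Minimal sketch of the runtime-latency formulas above. All component
# latencies (in ps) are assumed, illustrative values.
TOF = 10.0             # time-of-flight through the chip (assumed)
RX = 20.0              # receiver/readout latency (assumed)
A_DAC = A_RC = 15.0    # DAC and RC delays on the A-input (assumed)
B_DAC = B_RC = 15.0    # DAC and RC delays on the B (kernel) input (assumed)
MRR_RC = 5.0           # tunable spectral-filter (MRR) RC delay (assumed)
MRR_DAC = 15.0         # DAC latency for tuning the MRRs (assumed)

def latency_cases_i_iii(data_case: int, reconfigure_kernel: bool = False) -> float:
    """Runtime latency (ps) for Cases i-iii."""
    t = TOF + RX if data_case == 1 else TOF + A_DAC + A_RC + RX
    if reconfigure_kernel:
        t += B_DAC + B_RC
    return t

def latency_cases_iv_vi(data_case: int, n: int) -> float:
    """Runtime latency (ps) for Cases iv-vi, for an NxN operation."""
    if data_case == 1:
        return n * (TOF + RX + (n - 1) * (MRR_RC + MRR_DAC))
    return n * (TOF + RX + MRR_RC + MRR_DAC)

print(latency_cases_i_iii(data_case=2, reconfigure_kernel=True))  # 90.0 ps
print(latency_cases_iv_vi(data_case=2, n=4))                      # 200.0 ps
```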
Referring to the drawings, each Tensor Assembly (100) has an output (6), termed D. This output (6) is either an optical signal or an electrical one. After the dot-product multiplication, the result is in the optical domain. The summation can be performed in two conceptually different ways: either coherently in the optical domain (Case z), or electrically, using either a single photodetector (Cases a, b) or a combination of photodetectors (i.e. balanced detectors) (Cases c, d).
Referring further to the drawings, the combined input for all the wavelengths is received at a multiplexer (MUX) (8), which combines the first input signals for all the wavelengths into a single first signal placed on a common input bus. Note that, if desired, this multiplexer could also be omitted, and the signal could be multiplied with B (2) without multiplexing.
Turning back to the drawings, the multiplied outputs from each wavelength are combined to form a combined optical signal (42) across all the wavelengths. That combined optical signal (42) is received by a photodetector (44), which sums the optical data across all wavelengths and converts it to an electrical signal output (46), D0,0(t) = Σi A0,i(t)·Bi,0(t). That output (46) forms the output (6) of the Tensor Assembly (100).
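A minimal behavioral sketch of this wavelength-parallel multiply-and-detect step (simplified and assumed: ideal detector, with losses, noise and quantization ignored):

```python
# Minimal behavioral sketch (idealized). Each wavelength channel i carries
# the product A[0][i](t) * B[i][0](t); the photodetector (44) sums across
# wavelengths, modeling D0,0(t) = sum_i A0,i(t) * Bi,0(t).
import numpy as np

n_wavelengths, t_samples = 4, 8                   # illustrative sizes
A_row = np.random.rand(n_wavelengths, t_samples)  # A0,i(t), one row per wavelength
B_col = np.random.rand(n_wavelengths, t_samples)  # Bi,0(t), the kernel weights

per_channel = A_row * B_col      # element-wise multiplication per wavelength
D_00 = per_channel.sum(axis=0)   # incoherent summation at the photodetector
```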
It is further noted that, in another embodiment, an amplifier (58) need not be provided in the Tensor Assembly (100).
See Table 2 below for some performance gains for the optical input case. The optical output case with coherent summation has no RC-delay, but requires phase stabilization.
The passive filtering offers more control over the inter-channel crosstalk and potentially extends the number of wavelengths in a Dense WDM (DWDM) scheme without being affected by the quality-factor variation induced by variation of the absorption coefficient. The PTC output rate (the D's) can be increased by a factor of N, though N more wavelengths are needed, since the spectral filters are used only passively.
The different wavelengths are weighted in a quantized electro-absorption scheme (i.e. amplitude modulation), thus performing element-wise multiplication. The element-wise multiplications are then incoherently summed (Cases a-d) using a photodetector (44) or balanced photodetectors (45, 47), optionally followed by an amplification stage (46, 58), such as a trans-impedance amplifier, as illustrated in the drawings.
The optical engine (5) unit system can perform matrix-matrix, matrix-vector, or vector-matrix multiplications optically using integrated photonics, optical free-space, or a combination thereof, herein termed a Photonic Tensor Core (PTC).
The invention has a wide variety of applications, including Optical Artificial Intelligence Hardware, Photonic Machine Learning, and the Photonic Tensor Core. Since vector-matrix, dot-product, and matrix-matrix multiplications are fundamental operations for Neural Networks, using a photonic accelerator, i.e. a PTC that can perform such operations, speeds up the intelligent decisions of a NN while also saving energy.
The architecture has a plurality (e.g. an array) of PTC sub-modules (5) that make up a photonic tensor core (50), enabling real-time intelligent computing at the edge of ultra-high-speed mobile networks (5G and beyond) and internet-connected devices, with throughputs of the order of peta-operations per second at delays as short as tens of picoseconds, which is 2 orders of magnitude faster and more efficient than current electronic architectures. The product includes a photonic chip, which integrates reprogrammable, multi-state, low-loss photonic memory, able to perform dot-products and vector-matrix multiplications, operations at the heart of machine-learning algorithms, fully in parallel and inherently with a time complexity of O(1). The time delay after programming the cores (for an already-trained NN) is given by the time-of-flight of the photons in the chip, which is a few tens of picoseconds. The cores can be easily programmed using multi-state photonic memories, thus not requiring additional Digital-to-Analog Converters (DACs).
There are currently two major bottlenecks in the energy efficiency of artificial intelligence (AI) accelerators: data movement, and the performance of MAC operations, or tensor operations. Light is an established communication medium and has traditionally been used to address data movement at larger scales. As photonic links are scaled to shorter distances and some of their practical problems are addressed, photonic devices have the potential to relieve both of these bottlenecks on-chip simultaneously. Such photonic systems have been proposed in various configurations to accelerate NN operations; however, their main advantage comes from addressing MAC operations directly. The claimed PTC unit enables seamless system control and effective integration, while delivering high computational performance and competitive cost due to the integrated photonics platform.
Hardware for Machine Intelligence: Most NNs comprise multiple layers of interconnected neurons/nodes. Each neuron and layer, as well as the network interconnectivity, is essential to performing the task for which the network has been trained. In their fully-connected layers, NNs strongly rely on vector-matrix math operations, in which large matrices of input data and weights are multiplied according to the training. Complex multi-layered deep NNs, in fact, require sizeable bandwidth and low latency to satisfy the vast number of operations required for performing large matrix multiplications without sacrificing efficiency and speed. Since the dawn of the computing era, due to the ubiquity of matrix math, which extends to neuromorphic computing, researchers have been investigating optimized ways to efficiently multiply matrices. A NN requires convolutional layers (CONV) and fully-connected layers (FC) to perform classification tasks. Thus, the PTC, by means of doing VMMs (via MACs), performs the CONV layer of a NN, as sketched below.
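For example, a CONV layer can be lowered to a matrix multiplication via the standard im2col technique (a minimal sketch of that general technique, not of the invention's specific implementation):

```python
# Minimal sketch of the standard im2col lowering: a 'valid' convolution becomes
# one matrix-vector multiplication, executable by a VMM/MAC engine such as a PTC.
import numpy as np

def conv2d_as_matmul(image: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    # im2col: each receptive field becomes one row of a patch matrix
    patches = np.array([image[i:i + kh, j:j + kw].ravel()
                        for i in range(oh) for j in range(ow)])
    # the convolution is now a single matrix-vector multiplication
    return (patches @ kernel.ravel()).reshape(oh, ow)

print(conv2d_as_matmul(np.arange(16.0).reshape(4, 4), np.ones((2, 2))))
```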
Rationale for Photonics in Intelligent Information Processing: Smaller matrix multiplications for less complex inference tasks are still challenged by a non-negligible latency, predominantly due to the access overhead of the various memory hierarchies and the latency of executing each instruction in the GPU. Within this paradigm shift, the ‘wave’ nature of light and its inherent operations, such as interference and diffraction, can play a major role in enhancing computational throughput while concurrently reducing the power consumption of neuromorphic platforms. In recent years, the revolutionizing impact of NNs contributed to the development of a plethora of emerging technologies, ranging from free-space diffractive optics to nanophotonic processors, aiming to improve the computational efficiency of specific tasks performed by NNs.
Integrated photonic platforms can provide parallel, power-efficient and low-latency computing, which is possible because analog wave chips can a) perform the dot-product inherently, such as via phase shifters or amplitude-modulating components, b) enable signal accumulation (summation) by either electromagnetic coherent interference or incoherent accumulation through photodetectors, and c) enable parallelism strategies and higher throughput using a variety of MUX schemes (e.g. wavelength, polarization, frequency, orbital angular momentum). These MUX options are, to first order, ‘orthogonal’ to each other, thus allowing for a second-order MUX of simultaneous use. Additionally, assisted by state-of-the-art theoretical frameworks, future technologies should perform computing tasks in the domain in which their time-varying input signals lie, thus exploiting and leveraging their intrinsic physical operations. In this view, photons are an ideal match for computing node-distributed networks and engines performing intelligent tasks over large data at the edge of a network (e.g. 5G, MIMO, data centers, astronomic telescope arrays, particle-accelerator sensory networks, etc.), where the data signals may already exist in the form of photons (e.g. surveillance cameras, optical sensors, etc.), thus pre-processing/-filtering information for early feature extraction, and/or intelligently regulating the amount of data traffic that is allowed to proceed downstream towards in-depth compute and decision-making systems such as data centers, cloud systems, and operator headquarters.
However, the functionality of memory for storing the trained weights is not straightforwardly achieved in optics, at least not in a non-volatile implementation, and therefore usually requires additional circuitry and components (i.e. DACs, memory) and related static power consumption, eroding the overall benefits (energy efficiency and speed) of photonics. Therefore, computing AI-systems and machine-learning (ML) tasks while transferring and storing data exclusively in the optical domain is highly desirable, because of the inherently large bandwidth, low residual crosstalk, and short delay of optical information transfer.
The invention can also be used for a variety of use cases/applications, ranging from 5G networks and scientific data processing to data centers and data security. Note that VMM-based processing performs machine-learning tasks, and hence can be used ubiquitously, across the board, in a plethora of applications.
The present invention is significantly faster (1-2 orders of magnitude) and 1 order of magnitude more efficient when performing matrix multiplication with 8-bit precision, with respect to current electronic implementations based on tensor computing.
An illustrative initial performance analysis of a PTC for selected physical options is as follows: considering photonic-foundry Ge photodetectors, microring resonators (radius = 10 μm) and AIM Photonics disc modulators, the latency of an individual photonic tensor sub-unit (e.g. unit D2,1) requires Σ{E2O+TOF+Rx+readout} ≈ 65 ps for processing a 4×4 matrix multiplication, resulting in 64 MACs computed at 4-bit precision. This delivers a total 0.5-2 POPS (peta-operations per second) throughput for ~250 4×4 PTC units when limiting the maximum die area to 800 mm² (assumed: 4-bit DAC area = 0.05 mm²), limited mainly by the E2O conversion (i.e. the DACs). For an optical data input (e.g. a camera), the peak throughput increases to 16 POPS for only a few watts of power. If pipelining could be used, the 65 ps drops to ~20 ps latency, thus improving throughput by 3×. Hence, one could consider sharing DAC usage amongst cores (Table 2).
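The rough throughput arithmetic behind these figures can be checked in a few lines (values taken from the estimate above; counting 1 MAC as 2 operations is an assumption consistent with common practice, not stated in the text):

```python
# Minimal arithmetic check of the throughput estimate above (illustrative
# values from the text; 1 MAC = 2 operations is an assumption).
macs_per_matmul = 4 * 4 * 4           # 64 MACs per 4x4 matrix multiplication
ops_per_matmul = 2 * macs_per_matmul  # multiply + add per MAC
latency_s = 65e-12                    # ~65 ps per photonic tensor sub-unit
units = 250                           # ~250 4x4 PTC units per 800 mm^2 die

throughput = units * ops_per_matmul / latency_s
print(f"{throughput / 1e15:.2f} POPS")  # ~0.49 POPS, the low end of 0.5-2
```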
Table 2 is a Tensor Core performance comparison. An electronic-data-fed Photonic Tensor Core (PTC) offers a 2-10× throughput improvement over NVIDIA's T4, and for optical data (e.g. a camera) the improvement is ~100× (chip area limited to a single die, ~800 mm²). *10:1 DAC reuse. **Optical data input (no DACs). ***Inference only. In Table 2, column 2 is Case 2, column 3 is Case 1, and column 4 is the electronic prior art.
The foregoing description and drawings should be considered as illustrative only of the principles of the invention. The invention may be configured in a variety of manners and is not intended to be limited by the embodiment. Numerous applications of the invention will readily occur to those skilled in the art. Therefore, it is not desired to limit the invention to the specific examples disclosed or the exact construction and operation shown and described. Rather, all suitable modifications and equivalents may be resorted to, falling within the scope of the invention.
Filing Document: PCT/US2020/028516
Filing Date: 4/16/2020
Country: WO