Machine learning is becoming ubiquitous in edge computing applications, where large networks of low-power smart sensors preprocess their data remotely before relaying it to a central server. Since much of this preprocessing relies on deep neural networks (DNNs), great effort has gone into developing size, weight, and power (SWaP)-constrained hardware and efficient models for DNN inference at the edge. However, many state-of-the-art DNNs are so large that they can only be run in a data center, as their model sizes exceed the memories of SWaP-constrained edge processors. Such DNNs cannot be run on the edge, so sensors must transmit their data to the server for analysis, leading to severe bandwidth bottlenecks.
To address these problems with running DNN inference at the edge, we introduce NetCast, an optical neural network architecture that circumvents limitations on DNN size, allowing DNNs of arbitrary size to be run on SWaP-constrained edge devices. NetCast uses a server-client protocol and architecture that exploit wavelength-division multiplexing (WDM), difference detection and integration, optical weight delivery, and the extremely large bandwidth of optical links to enable low-power DNN inference at the edge for networks of arbitrary size, unbounded by the SWaP constraints of edge devices. This enables the edge deployment of whole new classes of neural networks that have heretofore been restricted to data centers.
More generally, NetCast provides a server-client architecture for performing DNN inference in SWaP-constrained edge devices. By broadcasting the synaptic weights optically from a central server, this architecture significantly reduces the memory and power requirements of the edge device, enabling data center-scale deep learning on low-power platforms that is not possible today.
The central server encodes a matrix (the DNN weights) into an optical pulse train. It transmits the encoded optical pulse train over a link (e.g., a free-space or fiber link, potentially with optical fan-out) to one or more clients (edge devices). Each client uses a combination of optical modulation, wavelength multiplexing, and photodetection to compute the matrix-vector product $\sum_n w_{mn} x_n$ between the weights (received over the link) and the DNN layer inputs, also called activations, which are stored on the client. Many layers are run sequentially, allowing each client to perform inference for DNNs of arbitrary size and depth without needing to store the weights in memory.
This client-server architecture has several advantages over existing applications. At present, to perform deep learning on edge devices, there are limited options, each with its own drawback(s). These options include: (1) upload the data and run the DNN in the cloud at the cost of bandwidth, latency, and privacy issues; (2) run the full DNN on the edge device—but note the memory and power requirements often exceed the device's SWaP constraints; or (3) compress the DNN so that it can run with lower power and memory—often not possible, and will degrade the DNN's performance (classification accuracy, etc.). In contrast, the present technology can simultaneously provide local data storage, SWaP constraint satisfaction, and high-performing (uncompressed) DNNs.
Applications for the NetCast client-server protocol and architecture include: bringing high-performance deep learning to light-weight edge or fog devices in the Internet-of-Things; enabling low-power fiber-coupled smart sensors on advanced machinery (aircraft, cars, ships, satellites, etc.); and distributing DNNs to large free-space sensor networks (e.g., for environmental monitoring, disaster relief, mining, oil/gas exploration, geospatial intelligence, or security). For highly utilized DNNs, data centers can also use the architecture to reduce the energy consumption of DNN inference.
NetCast can be implemented as follows. A server generates a weight signal comprising an optical carrier modulated with a set of spectrally multiplexed weights for a DNN, then transmits the weight signal to a client via an optical link. The client receives the weight signal and computes a matrix-vector product of (i) the set of spectrally multiplexed weights modulated onto the optical carrier and (ii) inputs to a layer of the DNN. The server can store the set of spectrally multiplexed weights in its (local) memory and retrieve the set of spectrally multiplexed weights from its (local) memory.
The server can generate the weight signal by, at each of a plurality of time steps, modulating WDM channels of the optical carrier with respective entries of a column of a weight matrix of the DNN. In this case, the client can compute the matrix-vector product by modulating the weight signal with the inputs to the layer of the DNN, demultiplexing the WDM channels of the weight signal modulated with the input to the layer of the DNN, and sensing powers of the respective WDM channels of the weight signal modulated with the input to the layer of the DNN. The client can modulate the weight signal with the inputs to the layer of the DNN by intensity-modulating inputs to a Mach-Zehnder modulator with amplitudes of the inputs to the layer of the DNN and encoding signs of the inputs to the layer of the DNN with the Mach-Zehnder modulator.
The server can also generate the weight signal by modulating an intensity of the optical carrier with amplitudes of the set of spectrally multiplexed weights before coupling the optical carrier into a set of ring resonators and modulating the optical carrier with signs of the set of spectrally multiplexed weights using the ring resonators. Or the server can generate the weight signal by encoding the set of spectrally multiplexed weights in a complex amplitude of the optical carrier, in which case the client computes the matrix-vector product in part by detecting interference of the weight signal with a local oscillator modulated with the inputs to the layer of the DNN.
The spectrally multiplexed weights may form a weight matrix, in which case the client can compute the matrix-vector product by weighting columns of the weight matrix with the inputs to the layer of the DNN to produce spectrally multiplexed products; demultiplexing the spectrally multiplexed products; and detecting the spectrally multiplexed products with respective photodetectors. In this case, weighting the columns of the weight matrix with the inputs to the layer of the DNN may include simultaneously modulating a plurality of wavelength channels. Alternatively, the client can weight rows of the weight matrix with the inputs to the layer of the DNN to produce temporally multiplexed products and detect the temporally multiplexed products with at least one (and perhaps only one) photodetector. In this case, weighting the rows of the weight matrix with the inputs to the layer of the DNN may include independently modulating each of a plurality of wavelength channels.
A NetCast system may include both a server and one or more clients. The server may include a first memory, a (laser) source, and a first modulator operably coupled to the first memory and the source. In operation, the first memory stores weights (a weight matrix) for the DNN. The source emits an optical carrier (e.g., a frequency comb). And the first modulator generates a weight signal comprising the weights modulated onto wavelength-division multiplexed (WDM) channels of the optical carrier. The client, which is operably coupled to the server via an optical link, includes a second memory, a second modulator, and a frequency-selective detector. In operation, the second memory stores activations for a layer of the DNN. The second modulator, which is operably coupled to the second memory, modulates the activations onto the weight signal, thereby generating a matrix-vector product of the weights and the activations. And the frequency-selective detector, which is operably coupled to the modulator, detects the WDM channels of the matrix-vector product.
The first modulator can modulate the WDM channels of the optical carrier with respective entries of a column of a weight matrix of the DNN over respective time steps. It can include micro-ring resonators configured to modulate WDM channels. The frequency-selective detector can include one pair of ring resonators for each WDM channel and one balanced detector for each pair of ring resonators.
In some cases, the first modulator can modulate signs of the weights onto the optical carrier, in which case the server further includes an intensity modulator, operably coupled to the first modulator, to modulate amplitudes of the weights onto the optical carrier. Similarly, the second modulator can modulate signs of the activations onto the weight signal, in which case the client includes at least one intensity modulator, operably coupled to the second modulator, to modulate amplitudes of the activations onto the weight signal.
A coherent NetCast system also includes a server and at least one client. The coherent NetCast server includes a first memory to store the weights for the DNN, a laser source to generate a frequency comb, and a frequency-selective modulator, operably coupled to the first memory and the laser source, to generate a weight signal comprising the weights modulated onto WDM channels of the frequency comb. The client is operably coupled to the server via an optical link and includes a second memory, a local oscillator (LO), a modulator, and a frequency-selective detector. The second memory stores activations for a layer of the DNN. The LO generates an LO frequency comb phase-locked to the frequency comb. The modulator is operably coupled to the second memory and to the LO and modulates the activations onto the LO frequency comb. And the frequency-selective detector is operably coupled to the modulator and detects interference of the weight signal and the LO frequency comb, thereby producing a matrix-vector product of the weight signals and the activations.
The frequency-selective modulator can include one pair of ring resonators for each of the WDM channels arranged on different arms of a Mach-Zehnder interferometer. The frequency-selective detector can include one pair of ring resonators for each of the WDM channels and one balanced detector for each pair of ring resonators.
All combinations of the foregoing concepts and additional concepts discussed in greater detail below (provided such concepts are not mutually inconsistent) are contemplated as being part of the inventive subject matter disclosed herein. In particular, all combinations of claimed subject matter appearing at the end of this disclosure are contemplated as being part of the inventive subject matter disclosed herein. Terminology explicitly employed herein that also may appear in any disclosure incorporated by reference should be accorded a meaning most consistent with the particular concepts disclosed herein.
The skilled artisan will understand that the drawings primarily are for illustrative purposes and are not intended to limit the scope of the inventive subject matter described herein. The drawings are not necessarily to scale; in some instances, various aspects of the inventive subject matter disclosed herein may be shown exaggerated or enlarged in the drawings to facilitate an understanding of different features. In the drawings, like reference characters generally refer to like features (e.g., functionally similar and/or structurally similar elements).
The output port of the beam splitter 113 is coupled to the optical link 120, which can be a fiber link 121 (e.g., polarization-maintaining fiber (PMF) or single-mode fiber (SMF) with polarization control at the output), free-space link 122, or optical link with fan-outs 123 for connecting to multiple clients 130. If the server 110 is connected to multiple clients 130, it can be connected to each client 130 via a different (type of) optical link 120. In addition, a given optical link 120 may include multiple segments, including multiple fiber or free-space segments connected by amplifiers or repeaters.
Each client 130 includes a PBS 131 with two output ports, which are coupled to respective input ports of a Mach-Zehnder modulator (MZM) 133 with a phase modulator 132 in the path from one PBS output to the corresponding MZM input. The outputs of the MZM 133 are demultiplexed into an array of difference detectors 135, one per wavelength channel. Demultiplexing can be achieved with various passive optics, including arrayed waveguide gratings, unbalanced Mach-Zehnder trees, and ring filter arrays (shown here). In the ring-based implementation, the light is filtered with banks of WDM ring resonators 134. The ring resonators 134 in each bank are tuned to the same resonance frequencies ω1 through ωM as the micro-ring modulators 112 in the server 110. Each resonator 134 is paired with a corresponding resonator in the other bank that is tuned to the same resonance frequency. These pairs of resonators 134 are evanescently coupled to respective differential detectors 135, such that each differential detector 135 is coupled to a pair of resonators 134 resonant at the same frequency (e.g., ω1). In this arrangement, the pairs of resonators 134 act as passband filters that couple light at a particular frequency from the MZM 133 to the respective differential detectors 135.
The differential detectors 135 are coupled to an analog-to-digital converter (ADC) 136 that converts analog signals from the differential detectors 135 into digital signals that can be stored in a RAM 137. The RAM 137 also stores inputs to one or more layers of the DNN. The RAM 137 is coupled to a DAC 138 that is coupled in turn to the MZM 133. The DAC 138 drives the MZM 133 with the DNN layer inputs stored in the RAM 137 as described below.
The NetCast optical neural network 100 works as follows. Data is encoded using a combination of time multiplexing and WDM: the server 110 and client 130 perform an M×N matrix-vector product in N time steps over M wavelength channels. At each time step (indexed by n), the server 110 broadcasts a column wn of the weight matrix to the client 130 via the optical link 120. The server 110 modulates the weight matrix elements, which are stored in the RAM 113, on the frequency comb to produce a weight signal using the broadband modulator (e.g., micro-ring resonators 112). Then the server 110 transmits this weight signal to the client 130 via the optical link 120. The MZM 133 in the client 130 multiplies the weight signal with the input to the corresponding DNN layer, which is stored in the client RAM 137. The pair of 1-to-M WDMs (e.g., M ring resonators 134) and M difference photodetectors 135 (one set per wavelength) in the client 130 demultiplex the outputs of the MZM 133. These outputs are the products of the weights with the input vector stored in the client's RAM 137, wmnxn. Integrating over all N time steps, the total charge accumulated on each difference detector 135 is
$$\gamma_m = \sum_n w_{mn} x_n \tag{1}$$
performing the desired matrix-vector product.
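The time-multiplexed, wavelength-multiplexed accumulation of Eq. (1) can be illustrated numerically. The following is a minimal sketch, not the disclosed hardware: the function names `server_columns` and `client_matvec` are hypothetical, the optics are idealized (no loss, noise, or crosstalk), and each detector is modeled as a simple running sum.

```python
def server_columns(weights):
    """Yield one column of the M x N weight matrix per time step.

    Each yielded list holds M entries, one per WDM channel.
    """
    n_steps = len(weights[0])
    for n in range(n_steps):
        yield [row[n] for row in weights]


def client_matvec(column_stream, activations):
    """Accumulate the per-channel difference charge over all time steps."""
    charge = None
    for n, column in enumerate(column_stream):
        if charge is None:
            charge = [0.0] * len(column)  # one integrator per wavelength channel
        x_n = activations[n]  # one broadband modulator setting per time step
        for m, w_mn in enumerate(column):
            charge[m] += w_mn * x_n  # detector m integrates w_mn * x_n
    return charge


# Illustrative values only (M = 2 channels, N = 3 time steps).
weights = [[0.5, -1.0, 0.25],
           [1.0, 0.0, -0.5]]
activations = [1.0, 0.5, -1.0]
y = client_matvec(server_columns(weights), activations)
```

Note that the client never holds more than one column of weights at a time; only the activations and the M running sums are stored locally.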
where Δmn is the cavity detuning of the mth ring modulator 112 (couples to ωm) at time step n.
The PBS 115 combines the through- and drop-port outputs of the ring modulators 112 to orthogonal polarizations of a polarization-maintaining fiber (PMF) link 121, which transmits the combined through- and drop-port outputs to the client 130 as a weight signal. If the through and drop beams have the same polarization (e.g., transverse electric (TE)), there may also be a polarization rotator coupled to one input port of the PBS 115 to rotate the polarization of one input to the PBS 115 (e.g., from TE to transverse magnetic (TM)), so that the inputs are coupled to the same output port of the PBS 115 as orthogonal modes (e.g., TE and TM modes propagating in the same waveguide 121). The optical link 120 may be over fiber or free space and may include optical fan-out to multiple clients as explained above. If the link loss or fan-out ratio is large, the server output can be pre-amplified by an erbium-doped fiber amplifier (EDFA) or another suitable optical amplifier (not shown).
At the end of the link 120, the weight signal enters the client 130, where the second PBS 131 separates the polarizations and the phase shifter 132 (
Finally, the WDM channels are demultiplexed using the ring resonators 134 and the power in each channel is read out on a corresponding photodetector 135. In this case, with a ring-based WDM transmitter, the difference current between the MZM outputs evaluates to:
The first term in Eq. (4) is a product between a DNN weight (encoded as $|t_{mn}|^2 - |r_{mn}|^2$) and an activation (encoded as $\cos(2\theta_n)$). The second term $\mathrm{Re}[t^*_{mn} r_{mn}] \sin(2\theta_n)$ is unwanted: it comes from interference between the through- and drop-port outputs on the MZM 133. This interference can be suppressed or eliminated by ensuring the fields are ±π/2 out of phase (true in the critically coupled case Eq. (2)), by offsetting them with a time delay (though this reduces the throughput by a factor of two), or by using two MZMs rather than one (at the cost of extra complexity).
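The role of the two terms can be checked with a small numerical sketch. The function below only mirrors the structure described in the text (a weight term times cos(2θ) plus an interference term proportional to Re[t*r] sin(2θ)); the prefactor `cross_gain` on the interference term and the sample field values are assumptions for illustration, not values taken from Eq. (4).

```python
import math


def difference_current(t, r, theta, cross_gain=2.0):
    """Two-term difference current with the structure described for Eq. (4).

    t, r       : through- and drop-port field amplitudes (may be complex)
    theta      : MZM phase encoding the activation as cos(2*theta)
    cross_gain : ASSUMED prefactor of the interference term
    """
    t = complex(t)
    r = complex(r)
    weight_term = (abs(t) ** 2 - abs(r) ** 2) * math.cos(2 * theta)
    cross_term = cross_gain * (t.conjugate() * r).real * math.sin(2 * theta)
    return weight_term + cross_term


# Fields in quadrature (pi/2 apart): Re[t* r] = 0, so only the weight term survives.
quadrature = difference_current(0.8, 0.6j, 0.3)

# Fields in phase: the interference term contaminates the product.
in_phase = difference_current(0.8, 0.6, 0.3)
```

With the fields in quadrature the result reduces to the weight-activation product alone, which is the first suppression strategy listed above.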
NetCast uses time multiplexing, and the matrix-vector product is derived by integrating over multiple time steps. For clarity, label the wavelength channels with index m and time steps with index n. In each time step n, the weight server 110 outputs a column of this matrix w:,n, where the weights are related to the modulator transmission coefficients (and hence the detuning) and the activation xn is encoded in the MZM phase:
For lossless modulators ($k_1 = k_2 = k/2$), the range of accessible weights is $w_{mn} \in [-1, +1]$; for lossy modulators, the lower bound is stricter: $w_{mn} \in [-1 + 2k_{\mathrm{abs}}/k, +1]$. To reach all activations in the full range $x_n \in [-1, 1]$, the modulation should hit all points in $\theta \in [-\pi/2, \pi/2]$; this condition can be achieved using a driver with $V_{pp} = V_\pi$.
After integrating Eq. (4) over the time steps, the difference charge for detector pair m is:
$$\gamma_m = \sum_n \Delta I_{mn} = \sum_n w_{mn} x_n \tag{7}$$
which is the desired matrix-vector product.
At a high level, the NetCast architecture encodes the neural network (the weights) into optical pulses and broadcasts it to lightweight clients 130 for processing, hence the name NetCast.
The NetCast concept is very flexible. For example, if one has a stable local oscillator, one can use homodyne detection rather than differential power detection to create a coherent version. While NetCast does not rely on coherent detection or interference, coherent detection can improve performance. In addition, one can replace the fast MZM with an array of slow ring modulators to integrate the signal over frequency rather than time (computing xTw instead of wx). Finally, there are a number of ways to reduce the noise incurred in differential detection if many of the signals are small.
This architecture 300 is called a coherent architecture because the weight data is encoded in coherent amplitudes, and the client 330 performs coherent homodyne detection using a local oscillator (LO) 340. A tap coupler (e.g., a 90:10 beam splitter) 341 couples a small fraction of the output of the LO 340 to one port of a differential detector 342 and the remainder to the input of an MZM 333. Likewise, the other port of the differential detector 342 receives a fraction of the weight signal from the server 310 via another tap coupler 332. The output of the differential detector 342 drives a phase-locking circuit 343 that stabilizes the carrier frequency and repetition rate of the LO 340 in a phase-locked loop (PLL). The second tap coupler 332 couples the remainder of the weight signal to a 50:50 beam splitter 344 whose other input port is coupled to the output of the MZM 333. The output ports of this 50:50 beam splitter 344 are fed to respective input ports of a WDM homodyne detector 334.
For concreteness,
As in
One advantage of coherent detection at the client 330 is increased data rate. The coherent scheme shown in
Another advantage of the coherent scheme is increased signal-to-noise ratio (SNR), especially at low signal powers. This is especially relevant for long-distance free-space links where the transmission efficiency is very low. Homodyne detection with a sufficiently strong LO allows this signal to be measured down to the quantum limit, rather than being swamped by Johnson noise.
Assume that inputs and weights are scaled to lie in the range $x_n, w_{mn} \in [-1, 1]$. The comb line amplitudes input to the homodyne detector, normalized to photon number, are $\alpha^{(w)}_{mn} = \alpha_w w_{mn}$ and $\alpha^{(x)}_{mn} = \alpha_x x_n$. In the weak-signal limit $\alpha_w \ll \alpha_x$, the difference charge accumulated on each photodetector, per time step, is:
$$Q/e = 2\alpha_w \alpha_x w_{mn} x_n, \qquad (Q/e)_{\mathrm{rms}} \equiv \alpha_x |x_n| \tag{8}$$
The mean and standard deviation of the output signal are therefore:
As expected, the SNR improves with the energy per weight pulse (before modulation) |αw|2: in Eq. (8) the signal scales with αw while the noise does not, so the normalized noise amplitude scales inversely with αw. The ONN's performance may be impaired if the SNR is too low; this sets a lower bound to the optical received power, analogous to the ONN standard quantum limit.
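To make the scaling concrete, the per-time-step expressions in Eq. (8) can be summed over the N time steps of one output channel. This is a rough sketch under a stated assumption: the shot noise in different time steps is taken to be independent, so the variances add. The function name `homodyne_output` and the sample amplitudes are hypothetical.

```python
import math


def homodyne_output(weights_row, activations, alpha_w, alpha_x):
    """Normalized output and noise amplitude for one wavelength channel.

    Signal per step : 2 * alpha_w * alpha_x * w_mn * x_n   (Eq. (8))
    Noise per step  : alpha_x * |x_n| (rms), ASSUMED independent across steps.
    Both are divided by 2 * alpha_w * alpha_x to express them in logical units.
    """
    scale = 2.0 * alpha_w * alpha_x
    signal = sum(scale * w * x for w, x in zip(weights_row, activations))
    noise_rms = alpha_x * math.sqrt(sum(x * x for x in activations))
    return signal / scale, noise_rms / scale


# Illustrative values: 100 photons per weight pulse (alpha_w = 10), strong LO.
y_m, sigma_m = homodyne_output([0.5, -0.5], [1.0, 0.0], alpha_w=10.0, alpha_x=1000.0)

# Quadrupling the photons per weight pulse (doubling alpha_w) halves the noise.
_, sigma_m_bright = homodyne_output([0.5, -0.5], [1.0, 0.0], alpha_w=20.0, alpha_x=1000.0)
```

Under these assumptions the normalized noise amplitude is the 2-norm of the activations divided by 2αw, independent of the LO strength αx.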
The same protocol can also work if the weight data is sent over an RF link; in this case a mixer is used in place of an optical homodyne detector. An advantage of using an optical link is the much higher data capacity, driven by the $10^4$-$10^5\times$ higher carrier frequency.
NetCast is very extensible: it can detect coherently or incoherently, integrate over frequency or time, and in the case of incoherent detection, additional complexity can lower the receiver noise.
In the TIFS client 130, the optical signal is modulated by a broadband MZM 133, which modulates all wavelength channels simultaneously. This weights the columns of the weight matrix $w_{mn}$ by activations $x_n$. The resulting wavelength channels are demultiplexed 134′ and the product is detected on the difference detector 135′ after time integration (sum over the columns of the weighted matrix, $\sum_n w_{mn} x_n$).
In the FITS client 130′, the optical signal is sent through a weight bank 134, which independently modulates each wavelength channel. This weights the rows of the weight matrix $w_{mn}$ by activations $x_m$. The resulting signal is detected on a difference detector; at time step n, the difference current is the sum of all contributing wavelength channels (sum over the rows of the weighted matrix, $\sum_m w_{mn} x_m$).
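The difference between the two integration schemes is easiest to see as index bookkeeping. The sketch below assumes ideal modulators and detectors and uses hypothetical function names `tifs` and `fits`: the first integrates over time steps n (computing wx), while the second sums over wavelength channels m at each time step (computing xTw).

```python
def tifs(weights, activations):
    """Time integration, frequency separation: y_m = sum_n w_mn * x_n.

    One broadband modulator applies x_n to every channel at step n;
    each of the M detectors integrates over the N time steps.
    """
    return [sum(w_mn * x_n for w_mn, x_n in zip(row, activations))
            for row in weights]


def fits(weights, activations):
    """Frequency integration, time separation: y_n = sum_m w_mn * x_m.

    A weight bank applies x_m to channel m; a single detector sums all
    channels, producing one output per time step.
    """
    n_steps = len(weights[0])
    return [sum(weights[m][n] * activations[m] for m in range(len(weights)))
            for n in range(n_steps)]


# Illustrative M = 2, N = 3 weight matrix.
w = [[0.5, -0.5, 1.0],
     [0.25, 0.5, -1.0]]
y_tifs = tifs(w, [1.0, 1.0, 0.5])   # activations indexed by time step n
y_fits = fits(w, [1.0, -1.0])       # activations indexed by channel m
```

TIFS needs M detectors and one fast modulator; FITS needs M slow modulators and as few as one detector, which matches the trade-off described in the text.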
The low-noise incoherent servers 410 and clients 430 and 430′, shown in the bottom row of
Simple and low-noise incoherent servers and clients can be mixed and matched depending on the desired neural network performance and system complexity. To show the advantage of the low-noise configurations, consider the following four cases, named S/S, S/LN, LN/S, LN/LN (simple server/simple client, simple server/low-noise client, etc.). In each case, start with an unweighted frequency comb with amplitudes αw, where Nwt=|αw|2 is the number of photons per weight (at the source), and normalize variables so that w, x ∈[−1,1].
†Weight and PD input powers for case wmn > 0, xn > 0 shown. The other cases are analogous and Qtot and Qdet do not change.
These cases are enumerated in Table 1. While they collect the same differential charge Qdet=wmnxnNwt, the total PD charge, which sets the shot-noise limit, varies considerably if many of the inputs or weights are small (or zero). This is generally true, especially for DNN weights which are often pruned to save memory.
From the PD charge, it is possible to calculate the shot noise on the logical output γm. In general, we will have:
$$\gamma_m = \sum_n w_{mn} x_n + N(0, \sigma_m^2) \tag{10}$$
The right column of Table 1 compares the noise amplitudes $\sigma_m$ for the four incoherent schemes (as well as the coherent scheme, Eq. (9)). As expected, the low-noise and coherent schemes have lower noise amplitudes than the simple scheme. Also, because $(\|x\|_2)^2 \le \|x\|_1$ (application of Hölder's inequality, using $|x_n| \le 1$), the coherent scheme is superior to S/LN. But whether LN/LN or Coherent is best may depend on the weights.
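The norm inequality used in this comparison holds whenever every activation satisfies |xn| ≤ 1, since each squared entry is then no larger than its absolute value. A quick randomized check, offered only as a sanity test (the sample size and vector length are arbitrary):

```python
import random


def l1_norm(x):
    """Sum of absolute values."""
    return sum(abs(v) for v in x)


def l2_norm_squared(x):
    """Sum of squares (the squared Euclidean norm)."""
    return sum(v * v for v in x)


random.seed(0)  # fixed seed so the check is repeatable
samples = [[random.uniform(-1.0, 1.0) for _ in range(16)] for _ in range(1000)]

# For activations normalized to [-1, 1], ||x||_2^2 <= ||x||_1 in every sample.
inequality_holds = all(l2_norm_squared(x) <= l1_norm(x) + 1e-12 for x in samples)
```

The inequality is tight only when every entry is 0 or ±1, so sparse or small activations widen the gap between the two norms.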
Because time and frequency are Fourier conjugates, the noise analysis is the same for the FITS and TIFS integration schemes, with the replacements w→wT and N→M (swap time bins with frequency channels). In addition, a side benefit of the low-noise schemes is robustness to phase errors: because the MZMs are always in a BAR or CROSS configuration, there is no interference between α+ and α− and the relative phase no longer matters.
If the client runs as a matrix-vector multiplier, e.g., as shown in
Fundamentally, the channel capacity of the optical link between the server and client is usually limited by crosstalk. In this architecture, crosstalk takes two forms: (1) temporal crosstalk and (2) frequency crosstalk. Temporal crosstalk arises from the finite photon lifetime in the ring modulators and their finite RC time constant. Lumping these together gives an approximate modulator response time $\tau = \sqrt{1/k^2 + (RC)^2}$. For efficient modulators, $RC \approx 1/k$, so $\tau \approx \sqrt{2}/k$. Temporal crosstalk can have the form $X_t = e^{-T/\tau}$, where T is the time between weights. This sets an upper limit on the symbol rate R=1/T of the modulators:
where ƒ0 is the optical carrier frequency and Q is the ring's quality factor.
Frequency crosstalk occurs among channels of the WDM receiver (even for a perfect WDM, the transmitter rings have frequency crosstalk). This is set by the Lorentzian lineshape $X_\omega = (k/2)^2 / (\Delta\omega^2 + (k/2)^2)$, where $\Delta\omega$ is the spacing between neighboring WDM channels. In the low-crosstalk case $\Delta\omega \gg k$, this gives a minimum channel spacing:
Analog crosstalk should be sufficiently low for the DNN to function. An analog crosstalk of Xt≲0.05 is usually sufficient. Assuming frequency crosstalk has a similar threshold (Xt=Xω=X), the channel capacity is bounded by:
Here B is the bandwidth (in Hz) and C0 is the normalized symbol rate (units 1/Hz-s).
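The way the two crosstalk limits combine can be sketched numerically. The code below is an illustration derived from the expressions stated in the text, not a reproduction of the elided bound: it assumes τ ≈ √2/k, temporal crosstalk X = exp(−T/τ), the low-crosstalk Lorentzian approximation X ≈ (k/2)²/Δω², a linewidth k given in rad/s, and a capacity equal to the number of channels times the symbol rate. All function names and the sample numbers are hypothetical.

```python
import math


def max_symbol_rate(kappa, crosstalk):
    """Largest R = 1/T with exp(-T/tau) <= crosstalk, taking tau = sqrt(2)/kappa."""
    tau = math.sqrt(2.0) / kappa
    return 1.0 / (tau * math.log(1.0 / crosstalk))


def min_channel_spacing_hz(kappa, crosstalk):
    """Smallest spacing with (kappa/2)^2 / d_omega^2 <= crosstalk, in Hz."""
    d_omega = kappa / (2.0 * math.sqrt(crosstalk))
    return d_omega / (2.0 * math.pi)


def capacity(bandwidth_hz, kappa, crosstalk):
    """Weights per second: (number of WDM channels) x (symbol rate per channel)."""
    channels = bandwidth_hz / min_channel_spacing_hz(kappa, crosstalk)
    return channels * max_symbol_rate(kappa, crosstalk)


def normalized_capacity(crosstalk):
    """Capacity per unit bandwidth under the assumptions above; kappa cancels."""
    return (2.0 * math.sqrt(2.0) * math.pi * math.sqrt(crosstalk)
            / math.log(1.0 / crosstalk))


# Illustrative numbers: 4 THz of optical bandwidth, kappa = 2*pi*10 GHz, X = 0.05.
kappa = 2.0 * math.pi * 10e9
c_total = capacity(4e12, kappa, 0.05)
c_norm = normalized_capacity(0.05)
```

Under these assumptions the ring linewidth cancels out: narrower rings allow more channels but proportionally slower symbols, so the capacity depends only on the bandwidth and the tolerated crosstalk.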
Table 2 shows the capacity as a function of crosstalk. These values are in the same ballpark as the HBM memory bandwidth of high-end GPUs (e.g., 6-12 Tbps). In the matrix-vector case of 1 MAC/wt, it may not be possible to reach GPU- or TPU-level arithmetic performance (>50 TMAC/s). Reaching that level could involve optical fan-out in the client to reuse weights (as mentioned above; GPUs and TPUs do this anyway) or operating beyond the C-band.
There may also be practical bandwidth limits set by dispersion in the MZM, long fiber links, PBS, or free-space optics. Many of these bandwidth limits can be circumvented with appropriate engineering.
The server should emit enough laser power to maintain a reasonable SNR at the detector. The noise can be modeled as a Gaussian term in the matrix-vector product of each DNN layer. Following Eq. (10), one writes:
$$y_m = \sum_n w_{mn} x_n + N(0, \tau^2), \qquad \tau = \sqrt{\tau_j^2 + \tau_s^2} \tag{14}$$
Here, τj and τs are the Johnson- and shot-noise contributions, respectively. Johnson noise gives rise to so-called kTC noise fluctuations on the charge of a capacitor; these fluctuations scale as $(\Delta Q)_{\mathrm{rms}} = \sqrt{kTC}$ and can dominate for readout circuits (detector and transimpedance amplifier (TIA)) with large capacitance. Shot noise, due to the quantization of light into photons, may dominate in the case of high optical powers or coherent detection (with a strong LO).
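For a sense of scale, the kTC charge fluctuation can be converted to an equivalent number of electrons. This is a small sketch of the standard formula; the 300 K temperature and the sample capacitances are illustrative assumptions, not values taken from the tables in this disclosure.

```python
import math

BOLTZMANN_J_PER_K = 1.380649e-23       # exact SI value
ELECTRON_CHARGE_C = 1.602176634e-19    # exact SI value


def ktc_noise_electrons(capacitance_f, temperature_k=300.0):
    """RMS charge fluctuation sqrt(kTC) on a capacitor, in units of electrons."""
    charge_rms = math.sqrt(BOLTZMANN_J_PER_K * temperature_k * capacitance_f)
    return charge_rms / ELECTRON_CHARGE_C


# Illustrative readout capacitances: 1 fF and 100 fF at room temperature.
noise_1ff = ktc_noise_electrons(1e-15)
noise_100ff = ktc_noise_electrons(100e-15)
```

Because the fluctuation grows as the square root of the capacitance, a 100× larger readout capacitance gives 10× more noise charge, which is why compact detector and TIA circuits matter here.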
There are at least two ways to define the basis for benchmarking laser power. First, the basis can be defined based on the source power in the frequency comb at the weight server before the WDM-MZM. Denote this as Nsrc. This is the same as Nwt used elsewhere in this specification. Second, the basis can be defined based on the transmitted power (averaged) at the weight server's output, denoted Ntr. This may be much lower than Nsrc if many weights are zero and a low-noise or coherent detection scheme is used. Received power (at the client) is just Ntr times the link efficiency. Source power is a convenient basis without practical amplifiers, but as long as it is possible to amplify the signal efficiently without too much dispersion, nonlinearity, or crosstalk, transmitted power may be a more convenient basis. Plus using transmitted power leads to more favorable results in many cases.
To calculate the energy bound imposed by noise in the ONN, consider running the neural network with additive Gaussian noise in each layer (Eq. (14)) and computing the noise limit, the largest tolerable noise amplitude τmax. This depends on the DNN and the tolerance to error.
The largest tolerable noise amplitude τmax can be used to obtain a conservative estimate for the energy metric (either Nsrc or Ntr) since $\tau = \sqrt{\tau_j^2 + \tau_s^2}$ depends on the optical energy. First, the Johnson noise scales inversely with Nsrc and sets a lower bound on it:
Table 3 lists the kTC noise, the corresponding minimum energy per MAC Emin, and the minimum power (at a rate of 1 TMAC/s).
[Table 3: kTC noise ΔQ/e (rms), minimum energy per MAC Emin, and minimum power Pmin. Table body not recoverable. †Power Pmin calculated at 1 TMAC/s.]
The shot-noise term τs scales inversely with the square root of power. This sets a lower bound on the optical power called the Standard Quantum Limit (SQL) because it arises from fundamental quantum fluctuations in coherent states (rather than thermal fluctuations, which can be avoided with a sufficiently small capacitance, or using avalanching or on-chip gain before the detector). The SQL may be relevant here for two reasons: (1) optical power budgets are much lower owing to laser efficiency, free-carrier effects, and nonlinear effects: while chips can tolerate 100 W of heating, most silicon-on-insulator (SOI) waveguides take at most 100 mW; and (2) links can be very low efficiency in many applications (e.g., long-distance free-space). Therefore, unlike the HD-ONN, a NetCast system may operate near the SQL.
Define coefficients Fsrc and Ftr by:
The power bound set by shot noise is therefore:
Thus, the energy bound is closely related to the coefficients Fsrc, Ftr. These coefficients can be obtained from the form of τ (Table 1); Table 4 lists the coefficients for each scheme. As mentioned above, by reducing the noise in the case of sparse or nearly-sparse weights or activations (|xn|,|wmn|«1), low-noise designs can reduce the required laser power by a large factor. These factors Fsrc and Ftr, shown in Table 5 for the same MNIST neural networks, allow for a $10^3\times$ reduction in optical power consumption compared to the “simple” design.
At first glance, such a reduction seems unimportant because, even with the simple design, the noise-limited energy is Emin=1.4 fJ/MAC, sufficiently low that on-chip electronics, e.g., DACs, ADCs, and memory, are likely to dominate. However, this noise-limited energy means that even at a modest throughput of 1 TMAC/s there should be 1.4 mW of optical power at the receiver. Given that lasers and EDFAs support at most 10-100 mW, this places a limit on the allowed optical fan-out, to say nothing of link loss or eye safety. For especially lossy links (e.g., drones connected at long distance over free space), there is a strong incentive to reduce Emin as much as possible, even if it doesn't affect the client-side power budget.
Fortunately, both the coherent scheme and the LN/LN incoherent schemes can operate at very low transmitted energies of a few photons/MAC, enabling Pmin<1 μW even at 1 TMAC/s. With such a client, a 10 mW source can tolerate link losses (or fan-out ratios) of up to $10^4$. Alternatively, a lower-loss link could deliver enough power for 100 TMAC/s of computation, beating the TPU with a sub-mW (optical) power budget.
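The link-budget arithmetic in the last two paragraphs can be reproduced in a few lines. This is a back-of-the-envelope sketch: the 1550 nm wavelength and the figure of 5 photons/MAC are assumptions chosen for illustration (the text says only "a few photons/MAC"), and the helper names are hypothetical.

```python
PLANCK_J_S = 6.62607015e-34     # exact SI value
LIGHT_SPEED_M_S = 2.99792458e8  # exact SI value


def photon_energy_j(wavelength_m=1550e-9):
    """Energy of one photon; 1550 nm is an ASSUMED telecom-band wavelength."""
    return PLANCK_J_S * LIGHT_SPEED_M_S / wavelength_m


def min_received_power_w(energy_per_mac_j, macs_per_s):
    """Optical power needed at the receiver for a given energy per MAC."""
    return energy_per_mac_j * macs_per_s


def tolerable_loss(source_power_w, received_power_w):
    """Largest link loss or fan-out ratio the source can support."""
    return source_power_w / received_power_w


# Simple scheme from the text: 1.4 fJ/MAC at 1 TMAC/s.
p_simple = min_received_power_w(1.4e-15, 1e12)

# Low-noise or coherent client, ASSUMING 5 photons/MAC at 1550 nm.
p_low_noise = min_received_power_w(5 * photon_energy_j(), 1e12)

# A 10 mW source feeding a client that needs 1 microwatt.
loss_budget = tolerable_loss(10e-3, 1e-6)
```

The same helpers can be used to trade throughput against fan-out: halving the MAC rate doubles the loss the link can tolerate for a fixed source power.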
For the low-noise incoherent schemes, Johnson noise may dominate over shot noise because the shot-noise bound is so low. To suppress Johnson noise, signal pre-amplification (e.g., with an EDFA or a semiconductor optical amplifier) or avalanching detectors can be used.
†Power Pmin calculated at 1 TMAC/s.
Electrical power consumption at the client depends on: (1) fetching activations (the inputs to the DNN layer) from client memory, (2) driving the MZM, and (3) reading and digitizing the detector outputs.
By broadcasting the weights from the server to the client(s), NetCast eliminates the need to retrieve weights from client memory. In general, the weights of a DNN take up much more memory than the activations. For a fully connected layer, weights take up O(N²) memory while activations only take up O(N) (batching evens this out a bit, but the size of the mini-batch is usually smaller than N). Moreover, unlike the weights, all of which should be stored somewhere, during inference only the current layer's activations need to be stored at any time (excepting branch points and residual layers). Thus, the ratio of weights to activations should increase with the depth of the network and the size of its layers.
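The scaling argument above can be made concrete by counting values for a stack of fully connected layers. This is a minimal sketch with a hypothetical helper name; it assumes no branch points or residual connections and takes the widest layer (times the batch size) as the activation storage that must be held at any one time.

```python
# Counts stored values (not bytes) for a fully connected network.
# Helper name and the peak-activation model are assumptions for illustration.
def fc_memory_counts(layer_widths, batch_size=1):
    """Return (total weights, peak activations) for a fully connected stack."""
    weights = sum(n_in * n_out
                  for n_in, n_out in zip(layer_widths[:-1], layer_widths[1:]))
    # Only the current layer's activations are kept during inference.
    activations = batch_size * max(layer_widths)
    return weights, activations


for depth in (2, 4, 8):
    widths = [1000] * (depth + 1)        # `depth` layers of size 1000 x 1000
    n_weights, n_acts = fc_memory_counts(widths)
    print(f"{depth} layers: {n_weights} weights, {n_acts} activations, "
          f"ratio {n_weights // n_acts}")
```

The ratio grows linearly with depth and with layer width, which is why removing the weights from client memory matters more as the network grows.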
Without the weights, the client may be able to store the entire DNN's state in on-chip memory, eliminating dynamic random-access memory (DRAM) reads on the client side. Moreover, even when reading from on-chip memory, there is a data reuse factor of M from wavelength multiplexing in the MZM as shown in
Driving the MZM at the client does not consume much electrical power either. A free carrier-based uni-traveling-carrier (UTC) MZM transmitter uses O(1) pJ/bit. As with the memory reads, WDM amortizes the driver cost over M channels, so the energy per MAC is O(1/M) pJ. With many channels, the driver energy can be reduced below tens of femtojoules/MAC. (This assumes the MZM is UTC over the whole bandwidth and neglects dispersion.) More exotic modulators (e.g., based on LiNbO₃, organic polymers, BaTiO₃, or photonic crystals) could reduce the modulation cost to femtojoules, which would again be amortized by the 1/M factor from WDM. However, few-fJ/MAC performance is already possible with modulators available in foundries today.
Reading and digitizing the detector outputs at the client also consumes little electrical power. Readout and digitization power consumption is usually dominated by the analog-to-digital conversion (ADC), which is O(1) pJ/sample at 8 bits of precision. It may be possible to scale ADC energies down to 100 fJ or less by sacrificing a bit or two without harming performance. In any event, each sample covers the N MACs integrated on the detector, so after dividing by N > 100, the ADC cost is at most tens of femtojoules/MAC.
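The three client-side contributions can be combined into a rough per-MAC estimate. The sketch below is not a measured budget: the function name is hypothetical, and the 1 pJ figures are placeholders standing in for the O(1) pJ costs quoted above (the text gives no specific figure for an on-chip memory read). Memory reads and modulator drive are amortized over the M wavelength channels, and the ADC over the N integrated inputs.

```python
# Rough client-side electrical energy per MAC. All numeric inputs below are
# placeholder assumptions standing in for the "O(1) pJ" figures in the text.
def client_energy_per_mac_j(e_mem_read_j, e_mod_j, e_adc_j, m_channels, n_inputs):
    """Energy per MAC with WDM reuse (factor M) and analog integration (factor N)."""
    per_activation = e_mem_read_j + e_mod_j   # paid once per activation
    per_output = e_adc_j                      # paid once per output sample
    return per_activation / m_channels + per_output / n_inputs


energy = client_energy_per_mac_j(
    e_mem_read_j=1e-12,  # assumed on-chip memory read per activation
    e_mod_j=1e-12,       # assumed modulator drive energy per activation
    e_adc_j=1e-12,       # assumed ADC energy per sample
    m_channels=100,
    n_inputs=100,
)
print(f"{energy * 1e15:.0f} fJ/MAC")
```

With these placeholder values the estimate is tens of femtojoules per MAC, in line with the bounds stated for the individual contributions.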
The client may consume power for other operations, including tuning and controlling the ring resonators used as filters. Thermal ring tuning can raise the system-level power consumption figure for ring modulators from fJ/bit to pJ/bit. If the receiver WDM (designed with ring arrays as in
In the highest power consumption scenario, the weight server stores all of its weights in DRAM and achieves zero local data reuse, so the power budget is dominated by DRAM reads (about 20 pJ/wt at 8-bit precision). At a target bandwidth of 1 Twt/s, this is approximately 20 W. The transmitter may add a few watts (assuming O(1) pJ/wt as before), and then there is the optical power considered earlier.
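The worst-case server estimate is a product of an energy per weight and a weight rate. The short sketch below reproduces that arithmetic; the function name is hypothetical, the 1 pJ/wt transmitter figure is one assumed point within the "O(1) pJ/wt" range, and dividing by the number of clients is an illustration of how broadcast amortizes the cost, not a figure from the text.

```python
# Worst-case weight-server electrical power (no local data reuse).
# Defaults follow the figures quoted above; the transmitter value is assumed.
def server_power_w(weight_rate_per_s, e_dram_j_per_wt=20e-12, e_tx_j_per_wt=1e-12):
    """Electrical power to read weights from DRAM and modulate them."""
    return weight_rate_per_s * (e_dram_j_per_wt + e_tx_j_per_wt)


total_w = server_power_w(1e12)                      # 1 Twt/s
dram_only_w = server_power_w(1e12, e_tx_j_per_wt=0.0)
print(f"DRAM reads: {dram_only_w:.0f} W, DRAM plus transmitter: {total_w:.0f} W")

# Because the same weight stream is broadcast, the cost is shared by all clients.
for n_clients in (1, 10, 100):
    print(f"{n_clients:3d} clients: {total_w / n_clients:.2f} W per client")
```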
The NetCast server-client architecture can lead to entirely new dataflows because the server is freed from the tasks of computation and memory writes. For example, the weight server may be constructed as a wafer-scale weight server that stores the weights in static random-access memory (SRAM). With commensurate modulator improvements, the energy consumption can be reduced by orders of magnitude. In a wafer-scale server, the data should be stored locally to avoid both off- and on-chip interconnect costs.
At first glance, a switching tree may seem energy-intensive if each leaf on the tree contains one weight and the switches are toggled every clock cycle. But in this case, each leaf can contain many weights and can wait for many clock cycles before switching. This greatly reduces the burden on the switching network. Even in the case where weights are stored in DRAM, however, NetCast should operate at reasonable powers with existing technology.
There are many edge computing scenarios where smart sensors have a direct line of sight or a fiber-optic connection to a server but are power-starved. For example, complex machinery like aircraft contain hundreds of sensors that can be linked through fibers inside the airframe, as shown in
NetCast offers several advantages over other schemes of edge processing with DNNs. To start, it integrates the optical power in the analog domain and reads it out at the end, so the energy consumption is O(1/N) times smaller than digital optical neural networks. It can be used to implement large DNNs (e.g., with more than 10⁸ weights), which is not possible with today's integrated circuits. It can operate without phase coherence, which relaxes requirements on the stability of the links connecting the server to the clients. In addition, the links are not imaging links; they can be fiber-optic links or single-mode free-space links with simple Gaussian optics. Finally, the chip area scales as O(M), not O(MN) or O(N²), because NetCast is output-stationary, unlike schemes that are weight-stationary.
Another exciting possibility is to perform distributed training using two-way optical links between the server and the client. Training allows the server to update its weights in real time from data being processed on the clients. The following method for training is compatible with NetCast and runs on similar hardware.
DNN training is a two-step process. First, the gradients of the loss function J with respect to the activations, Xn = ∂J/∂xn and ψm = ∂J/∂ym, are computed by backpropagation. Within each layer, the backpropagation relation is:
and between layers it is:
In vectorized form, Eq. (18) can be written as the matrix product X = wᵀψ, while Eq. (19) is an elementwise weighting of the vector elements, ψ = g′(x)X.
Second, compute the weight update δmn=∂J/∂wmn, i.e., the gradient of J with respect to the weights:
which is just the vector outer product δ = ψxᵀ. These relations are summarized in Table 6 and illustrated in
Backpropagation relies on a matrix-vector product. In terms of optics, this is straightforward to perform in NetCast: simply swap w for wᵀ and everything runs the same as for inference. For the weight update, given the activation x and gradient ψ, compute the outer product δ = ψxᵀ, and transmit the result (encoded optically in a compatible format) to the server.
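The two linear-algebra operations used in training can be sketched numerically. The code below is a plain NumPy illustration of the vectorized relations X = wᵀψ and δ = ψxᵀ, not a model of the optical hardware. The function name, the variable name `chi` (standing for X), and the toy quadratic loss used to generate ψ are all assumptions introduced for the example.

```python
import numpy as np


def backprop_layer(w, psi, x):
    """Backpropagate through one linear layer y = w @ x.

    w:   (M, N) weight matrix
    psi: (M,)   gradient of the loss with respect to the outputs y
    x:   (N,)   input activations of the layer
    Returns (chi, delta): the gradient with respect to x and the weight update.
    """
    chi = w.T @ psi            # X = w^T psi, a matrix-vector product
    delta = np.outer(psi, x)   # delta = psi x^T, a vector outer product
    return chi, delta


rng = np.random.default_rng(0)
w = rng.normal(size=(3, 4))
x = rng.normal(size=4)
target = rng.normal(size=3)

# Toy loss (an assumption for this example): J = 0.5 * ||w @ x - target||^2,
# for which the output gradient is psi = w @ x - target.
psi = w @ x - target
chi, delta = backprop_layer(w, psi, x)
print(chi.shape, delta.shape)
```

The backward pass reuses the same matrix-vector primitive as inference with the weight matrix transposed, while the weight update has the same M×N shape as the weight matrix itself.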
Since the weight update is a matrix, it can be encoded in the same time-frequency format as the weight matrix as shown in
In the simple client 730a of
$$Q_{\mathrm{det}} = \left|\alpha_{mn}^{(+)}\right|^2 - \left|\alpha_{mn}^{(-)}\right|^2 \propto \psi_m x_n = \delta_{mn} \tag{21}$$
If many of the activations or weights are very small, it can be difficult to resolve the signal Qdet because of the large shot noise. The low-noise client 730a′ in
The coherent server 710b and client 730b share a common LO and so can encode the weights coherently. This involves cascading a frequency comb from a comb source 731 through a slow WDM-MZM 732b into a fast broadband MZM 733 on the client side and beating the resulting training signal against an LO comb from an LO 711 in a WDM homodyne detector 712b at the server 710b. In this case, the signal field (rather than power) scales as ψmxn. With an LO amplitude α, the charge in each detector is Q± = (1/2)(α ± √Nsrc ψmxn)² and the difference charge scales as ψmxn.
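Expanding the squares shows why the difference charge is linear in the signal: Q₊ − Q₋ = 2α√Nsrc ψmxn, so the LO terms cancel and the sign of ψmxn is preserved. The sketch below checks this numerically; the function name and the example values are hypothetical, and noise is ignored.

```python
import math


def homodyne_charges(lo_amplitude, n_src, psi_m, x_n):
    """Noise-free charges on the two homodyne detectors for one matrix element."""
    signal = math.sqrt(n_src) * psi_m * x_n
    q_plus = 0.5 * (lo_amplitude + signal) ** 2
    q_minus = 0.5 * (lo_amplitude - signal) ** 2
    return q_plus, q_minus


# Example values chosen only for illustration.
q_plus, q_minus = homodyne_charges(lo_amplitude=100.0, n_src=4.0, psi_m=0.5, x_n=-0.2)
difference = q_plus - q_minus
expected = 2 * 100.0 * math.sqrt(4.0) * 0.5 * (-0.2)
print(difference, expected)
```

A stronger LO therefore amplifies the measured difference without changing its proportionality to ψmxn.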
(Table fragment: |ψm| |xn|; |ψm| |xn|; |ψm|² |xn|²; |ψm|² |xn|².)
Like inference, the accuracy of training in NetCast is limited by detector noise, which is a function of the optical power. In the large-signal limit, this noise leads to a Gaussian term in the calculated outer product:
$$\delta_{mn} = \psi_m x_n + N(0, \sigma_{mn}^2) \tag{22}$$
While σmn often depends on the specific matrix element, it can be more convenient to look at the average σ² = ⟨σmn²⟩. This noise variance is a sum of Johnson and shot-noise terms, σ² = σJ² + σS², which scale as σJ ∝ Nsrc⁻¹ and σS ∝ Nsrc^(−1/2). Table 7 compares the noise amplitudes for the three training schemes in
If training is really distributed, the server may receive weight updates from multiple clients. While the client-side power budget for weight transmission is quite low (O(M)+O(N) for an M×N matrix), on the server side, it is O(MN) since every weight is read to memory. If the server processes the weight updates of the clients independently, it may run into severe bandwidth and energy bottlenecks. Therefore, it can be highly advantageous to combine these updates optically before the server reads them out.
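For the coherent scheme, compare measuring each client's field αk separately by homodyne detection and summing the results:
$$\hat{\alpha}_k = \alpha_k + N(0, 1/4) \;\Rightarrow\; \sum_k \hat{\alpha}_k = \sum_k \alpha_k + N(0, K/4) \tag{23}$$
to first combining the fields optically (α = K^(−1/2) Σk αk) and then performing homodyne detection:
$$\hat{\alpha} = K^{-1/2} \sum_k \alpha_k + N(0, 1/4) = K^{-1/2}\Big[\sum_k \alpha_k + N(0, K/4)\Big] \tag{24}$$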
The results in Eqs. (23) and (24) differ by a scaling factor; the SNR is the same. Therefore, in the coherent scheme, the weight updates can be combined without loss of signal. Beyond this, another advantage of the coherent scheme is speed: without interleaving, it is much faster in the case of many clients. In the incoherent case, interleaving can limit the weight update rate to the bounds derived above. By contrast, with coherent optics, these weight updates are optically batched and the bound no longer applies. This could be a major advantage in systems that have many clients and are (optical) throughput-limited.
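The equal-SNR claim for Eqs. (23) and (24) can be checked both analytically and with a small Monte Carlo simulation. The sketch below is a toy model under stated assumptions: every homodyne measurement adds independent Gaussian noise of variance 1/4, the optical combiner is lossless apart from the K^(−1/2) field scaling, and the function names and client amplitudes are hypothetical.

```python
import numpy as np

NOISE_VAR = 0.25  # homodyne noise variance of 1/4 per measurement


def snr_detect_then_sum(alphas):
    """SNR when each client's field is measured separately, then summed (Eq. 23)."""
    alphas = np.asarray(alphas, dtype=float)
    return alphas.sum() ** 2 / (alphas.size * NOISE_VAR)


def snr_combine_then_detect(alphas):
    """SNR when fields are combined optically, then measured once (Eq. 24)."""
    alphas = np.asarray(alphas, dtype=float)
    signal = alphas.sum() / np.sqrt(alphas.size)
    return signal ** 2 / NOISE_VAR


def simulated_noise_variances(alphas, trials=200_000, seed=0):
    """Monte Carlo estimate of the noise variance of each readout strategy."""
    rng = np.random.default_rng(seed)
    alphas = np.asarray(alphas, dtype=float)
    k = alphas.size
    sigma = np.sqrt(NOISE_VAR)
    separate = (alphas + rng.normal(0.0, sigma, size=(trials, k))).sum(axis=1)
    combined = alphas.sum() / np.sqrt(k) + rng.normal(0.0, sigma, size=trials)
    return separate.var(), combined.var()


client_fields = [0.3, -0.1, 0.7, 0.2]   # example weight-update amplitudes
print(snr_detect_then_sum(client_fields), snr_combine_then_detect(client_fields))
print(simulated_noise_variances(client_fields))
```

The two SNR values agree, while the simulated variances come out near K/4 and 1/4 respectively, which is the scaling-factor difference noted above. The combined readout also needs only one detection per matrix element regardless of the number of clients.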
While various inventive embodiments have been described and illustrated herein, those of ordinary skill in the art will readily envision a variety of other means and/or structures for performing the function and/or obtaining the results and/or one or more of the advantages described herein, and each of such variations and/or modifications is deemed to be within the scope of the inventive embodiments described herein. More generally, those skilled in the art will readily appreciate that all parameters, dimensions, materials, and configurations described herein are meant to be exemplary and that the actual parameters, dimensions, materials, and/or configurations will depend upon the specific application or applications for which the inventive teachings is/are used. Those skilled in the art will recognize or be able to ascertain, using no more than routine experimentation, many equivalents to the specific inventive embodiments described herein. It is, therefore, to be understood that the foregoing embodiments are presented by way of example only and that, within the scope of the appended claims and equivalents thereto, inventive embodiments may be practiced otherwise than as specifically described and claimed. Inventive embodiments of the present disclosure are directed to each individual feature, system, article, material, kit, and/or method described herein. In addition, any combination of two or more such features, systems, articles, materials, kits, and/or methods, if such features, systems, articles, materials, kits, and/or methods are not mutually inconsistent, is included within the inventive scope of the present disclosure.
Also, various inventive concepts may be embodied as one or more methods, of which an example has been provided. The acts performed as part of the method may be ordered in any suitable way. Accordingly, embodiments may be constructed in which acts are performed in an order different than illustrated, which may include performing some acts simultaneously, even though shown as sequential acts in illustrative embodiments.
All definitions, as defined and used herein, should be understood to control over dictionary definitions, definitions in documents incorporated by reference, and/or ordinary meanings of the defined terms.
The indefinite articles “a” and “an,” as used herein in the specification and in the claims, unless clearly indicated to the contrary, should be understood to mean “at least one.”
The phrase “and/or,” as used herein in the specification and in the claims, should be understood to mean “either or both” of the elements so conjoined, i.e., elements that are conjunctively present in some cases and disjunctively present in other cases. Multiple elements listed with “and/or” should be construed in the same fashion, i.e., “one or more” of the elements so conjoined. Other elements may optionally be present other than the elements specifically identified by the “and/or” clause, whether related or unrelated to those elements specifically identified. Thus, as a non-limiting example, a reference to “A and/or B”, when used in conjunction with open-ended language such as “comprising” can refer, in one embodiment, to A only (optionally including elements other than B); in another embodiment, to B only (optionally including elements other than A); in yet another embodiment, to both A and B (optionally including other elements); etc.
As used herein in the specification and in the claims, “or” should be understood to have the same meaning as “and/or” as defined above. For example, when separating items in a list, “or” or “and/or” shall be interpreted as being inclusive, i.e., the inclusion of at least one, but also including more than one, of a number or list of elements, and, optionally, additional unlisted items. Only terms clearly indicated to the contrary, such as “only one of” or “exactly one of,” or, when used in the claims, “consisting of,” will refer to the inclusion of exactly one element of a number or list of elements. In general, the term “or” as used herein shall only be interpreted as indicating exclusive alternatives (i.e., “one or the other but not both”) when preceded by terms of exclusivity, such as “either,” “one of,” “only one of,” or “exactly one of.” “Consisting essentially of,” when used in the claims, shall have its ordinary meaning as used in the field of patent law.
As used herein in the specification and in the claims, the phrase “at least one,” in reference to a list of one or more elements, should be understood to mean at least one element selected from any one or more of the elements in the list of elements, but not necessarily including at least one of each and every element specifically listed within the list of elements and not excluding any combinations of elements in the list of elements. This definition also allows that elements may optionally be present other than the elements specifically identified within the list of elements to which the phrase “at least one” refers, whether related or unrelated to those elements specifically identified. Thus, as a non-limiting example, “at least one of A and B” (or, equivalently, “at least one of A or B,” or, equivalently “at least one of A and/or B”) can refer, in one embodiment, to at least one, optionally including more than one, A, with no B present (and optionally including elements other than B); in another embodiment, to at least one, optionally including more than one, B, with no A present (and optionally including elements other than A); in yet another embodiment, to at least one, optionally including more than one, A, and at least one, optionally including more than one, B (and optionally including other elements); etc.
As used herein in the specification and in the claims, when a numerical range is expressed in terms of two values connected by the word “between,” it should be understood that the range includes the two values as part of the range.
In the claims, as well as in the specification above, all transitional phrases such as “comprising,” “including,” “carrying,” “having,” “containing,” “involving,” “holding,” “composed of,” and the like are to be understood to be open-ended, i.e., to mean including but not limited to. Only the transitional phrases “consisting of” and “consisting essentially of” shall be closed or semi-closed transitional phrases, respectively, as set forth in the United States Patent Office Manual of Patent Examining Procedures, Section 2111.03.
This application claims the priority benefit, under 35 U.S.C. 119(e), of U.S. application Ser. No. 63/084,600, filed Sep. 29, 2020, which is incorporated herein by reference in its entirety for all purposes.
This invention was made with Government support under Grant No. ECCS1344005 awarded by the National Science Foundation (NSF), and under Grant No. W911NF-18-2-0048 awarded by the Army Research Office (ARO). The Government has certain rights in the invention.
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/US2021/043593 | 7/29/2021 | WO |
Number | Date | Country | |
---|---|---|---|
63084600 | Sep 2020 | US |