This invention relates to training methods for optical neural networks.
Training a neural network often involves adjustments of hardware parameters of the network in the presence of training data in order to improve performance (e.g., classifying the training data into categories). Training optical neural networks presents special challenges, since the internal signals within the network are optical, not electrical. As a result, it can be more difficult to obtain required information for training in optical neural networks than in other kinds of neural networks. One kind of information that is frequently needed in such training is derivatives of a cost function being optimized vs. control inputs. Often such derivatives are referred to as gradients, since there are typically a large number of control inputs that need to be optimally set, so derivatives are taken with respect to many different control inputs.
One approach that has been considered is to physically implement the adjoint variable method in photonic hardware in order to provide photonic signals that provide the necessary gradient information. This approach requires intensity measurements inside the mesh combined with some further computations and optical processes. Accordingly, it would be an advance in the art to provide improved training of optical neural networks.
We provide an efficient and fast way of measuring such gradients simultaneously and in parallel. This approach may allow optical neural networks to implement fast and low power learning.
An exemplary embodiment is a method of training a photonic neural network, where the method includes:
Two or more predetermined input training patterns can be provided to the optical inputs of the optical network at various times, whereby the two or more derivatives of the cost function are analog time averages over the two or more predetermined input training patterns.
Two or more predetermined input training patterns can be provided to the optical inputs of the optical network at two or more distinct wavelengths, whereby the two or more derivatives of the cost function are analog wavelength averages over the two or more predetermined input training patterns.
The one or more predetermined input training patterns can be provided as modulated input training patterns, whereby frequency components in the cost function output resulting from the two or more distinct dither frequencies are heterodyne shifted away from the two or more distinct dither frequencies.
The method can further include adjusting the control signals to optimize the cost function with an optimization method that makes use of the two or more derivatives of the cost function with respect to the control signals, whereby the photonic neural network is trained according to the one or more predetermined input training patterns.
The optical network can include two or more meshes of linear optical components connected in alternating series via one or more nonlinearity units, where the control inputs include at least inputs to each of the two or more meshes of linear optical components. E.g., meshes W1, W2, . . . alternating with nonlinearity units f1, f2, . . . on
The optical network can include at least one optical element having a compound control input, where the compound control input includes a first input and a second input, where the first input has a lower bandwidth than the second input, and where a dither of the compound control input is delivered via the second input. An example of this is shown on
Applications include optical neural networks and machine learning, optical communications systems, and self-training sensing systems.
In some exemplary embodiments:
Significant advantages are provided. This approach would substantially reduce the time required for optimizing arbitrary optical networks so they solve particular problems, including learning in neural network and machine learning applications. Generally, optical approaches have the major potential advantage of performing matrix multiplications quickly and at no marginal cost, in strong contrast to electronic systems.
Further advantages include:
It is now possible to make complicated optical networks of many components. These have applications in many areas, including in separating mixed modes in optical communications, for quantum computing, and for optical neural networks. Silicon photonics technology is a particularly useful approach to making such networks. Typical networks involve meshes of Mach-Zehnder waveguide interferometers. The Mach-Zehnder interferometers can function as 2×2 adjustable blocks, which can set the relative phase and amplitude relation between two optical inputs and two optical outputs, with such inputs and outputs typically being provided in the form of optical waves in waveguides. Other forms of equivalent 2×2 blocks are possible, as in U.S. Pat. No. 10,338,319, which provides an alternative approach including a controllable coupling element. Generally, with controllable phase shifter elements and, optionally, other controllable coupling elements, complex mathematical operations are possible in such meshes of 2×2 optical elements, such as arbitrary unitary and non-unitary matrix operations. See also U.S. Pat. No. 10,534,189, hereby incorporated by reference in its entirety. Below, we will refer to all such elements that can be controlled so as to affect the operation of the optical network, such as phase shifters or controllable couplers or possibly other controllable optical elements, as “elements” or “controllable elements”.
Because such networks can be very complex and because precise settings of network elements are very important in systems involving interference of optical waves, it is important to have ways of setting up such networks so they perform their desired function. For networks that are to perform linear functions, such as the equivalent of matrix multiplication, progressive methods are known for setting controllable elements that allow the mesh to be set up to perform a given such linear operation. The literature has many explicit examples of networks with such controllable elements, so we will not explicitly show the details of such networks further here.
Another approach to setting up such networks can involve more global optimization algorithms that can be intended to “train” the network to perform some function. One general approach to such training is to define some cost function that can be measured at or from the output of the network, such as the measured power in some photodetector or photodetectors or some quantity calculated from such measurements, and then to adjust the controllable elements in the network so as to optimize (e.g., maximize, or, alternatively, minimize) such a cost function. In such optimizations, it is desirable to be able to measure changes in the cost function as each parameter or controllable element (e.g., phase shifter or controllable coupler) is varied. We can refer to the amount of change in the cost function resulting from a small change in a controllable element or in the drive, such as a control voltage, that leads to such a change in the controllable element as a “gradient” (with respect to that change in the element setting or that drive). Such measured gradients or changes can then be used as inputs to some optimization algorithm, such as one based on gradient descent, to then make related changes to the parameter settings as part of the overall optimization process.
Such optimization processes are particularly important for artificial neural networks. In particular, they can allow the optical neural network to be trained directly, rather than requiring a separate external training process, allowing the system to implement approaches like machine learning directly. Key to any such process is the ability to conveniently and rapidly make the necessary measurements of such gradients or changes.
Previous approaches required intensity measurements inside the mesh combined with some further computations and optical processes. Here, we provide a method that not only avoids such intensity measurements but allows the direct and simultaneous measurement of multiple gradients corresponding to changes in multiple network elements, based only on measurements made outside the mesh. This approach can simplify and speed up the process of setting up such meshes by optimization, and can be used directly in training optical neural networks.
An optical architecture based on interferometer meshes that could in principle implement a layered neural network is known from the literature. This architecture alternates linear transforms performed by the interferometric meshes (“Optical interference units”—OIU) with “columns” of nonlinear elements (“Optical nonlinearity units”—ONU). A key point in implementing a training method for such a mesh is that we need to know how the cost function varies as we change the linear transform Wl corresponding to the lth mesh or OIU.
In our method, we directly measure the appropriate gradients, which we can equivalently view as the derivatives of the cost function. In a simple version of this approach, we directly measure the cost function, denoted C here, at the output of the whole system as an optical power. We can then directly vary the drive of each controllable element, such as a phase shifter or controllable coupler, in the meshes. For the sake of definiteness we will refer to such drive of an element as the drive voltage vlq for element q in mesh l, though we understand that such drive could be some other quantity, such as an electrical current or the displacement of part of an element in space. We then measure the resulting change in the optical power that corresponds to C. That directly gives the desired gradient ∂C/∂vlq, and we can then use those gradients in a learning strategy to update the drive voltage of element q in mesh l.
Now, such an approach, based on measuring the cost function before and after some such change in the drive or setting of a controllable element, is known for the case of varying one element at a time, and can be called a finite difference method for gradient measurement. But such an approach would be time-consuming to implement because we would need successive such measurements for the case of changes in each such element in the mesh. Our approach circumvents this limitation, allowing simultaneous (or “parallel”) measurements of multiple such gradients.
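As a point of comparison, the one-element-at-a-time finite difference measurement described above can be sketched as follows. This is a minimal illustration only: the function `cost` is a hypothetical stand-in for the measured cost signal, and the drive values and step size are assumptions chosen for the example.

```python
import numpy as np

def cost(v):
    """Hypothetical stand-in for the measured cost as a function of drive voltages."""
    return float(np.sum(np.sin(v) ** 2))

def finite_difference_gradient(cost_fn, v, dv=1e-4):
    """Estimate d(cost)/dv_q by stepping one drive at a time (sequential measurements)."""
    base = cost_fn(v)                 # one measurement before any change
    grad = np.zeros_like(v)
    for q in range(v.size):           # one further measurement per element
        stepped = v.copy()
        stepped[q] += dv
        grad[q] = (cost_fn(stepped) - base) / dv
    return grad

v = np.array([0.3, -0.7, 1.1])        # assumed drive settings
grad_fd = finite_difference_gradient(cost, v)
grad_exact = np.sin(2 * v)            # analytic derivative of sin^2, for comparison only
```

For a mesh with many elements, the loop requires one measurement per element plus a baseline, which is the time cost that a parallel measurement avoids.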
For the purposes of discussion and as an example, we presume that the architecture of the network itself is as in network 102 as shown on
For an example training approach, we then combine network 102 with two further objects, as shown in
More specifically, we add some apparatus at the input of the system, the “input vector generator” 202, that takes the mathematical input vectors |X0⟩, which might be the training vectors, for example, and turns them into corresponding vectors |X0⟩ of actual mode amplitudes in the waveguide inputs to the mesh. Then, after the mesh 102, we need some “measuring unit” 204 that measures a cost function constructed from the comparison of the output vector |XL⟩ and some mathematical target vector |T⟩.
This measuring unit will allow us to measure the cost function C directly as an optical power or a corresponding photocurrent (or voltage) or other measurable physical quantity. Then, as we change a drive vlq of one of the controllable elements in the mesh, we can directly see the resulting change in C or, equivalently, measure the gradient ∂C/∂vlq.
It might seem that we therefore have to step through all the elements in the network, one after the other, to evaluate all of the gradients ∂C/∂vlq of the cost function C. However, now we come to a key step in our approach.
The amplitude of this oscillation constitutes the measure of the corresponding gradient (including the sign of the gradient, which appears as an “antiphase” signal for negative signs). Such a gradient signal (including the sign) is easily extracted by standard electronic circuits; one approach would use a “lock-in” amplifier for example to measure the component of the electrical signal at some such frequency ωlq. Other approaches could include digital signal processing techniques.
Now, if we modulate a number of drives for different devices at different frequencies, then we will be able simultaneously to detect multiple different gradients at once by detecting the amplitudes of oscillation of the photodetector signal at each of these different frequencies. Again, the electronic circuits to extract all such frequency components at once from these different frequencies of modulations are straightforward in principle and known to those skilled in the art. One approach would be to connect the output voltage from the photodetector amplifier circuit simultaneously and in parallel to the inputs of multiple lock-in amplifiers, each looking for the component at a different frequency of interest, and hence each measuring a different gradient. Another approach would use digital signal processing techniques to extract components at the necessary different frequencies from a set of samples of the signal amplitude at multiple different times, as known to those skilled in the art.
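The parallel measurement can be illustrated with a small numerical sketch of the digital signal processing option. Everything specific here is an assumption made for illustration: `cost` is a hypothetical stand-in for the detected cost signal, and the dither frequencies, dither amplitude, and sample rate are arbitrary choices. Each drive is dithered at its own frequency, and the detected signal is multiplied by each reference sinusoid and averaged, in the manner of a lock-in measurement.

```python
import numpy as np

def cost(v):
    """Hypothetical stand-in for the detected cost signal; v has shape (..., 3)."""
    return np.sin(v[..., 0] + 0.5 * v[..., 1]) ** 2 + 0.3 * np.cos(v[..., 1] - v[..., 2])

fs = 10_000                                  # sample rate in Hz (assumed)
t = np.arange(fs) / fs                       # one second of samples
freqs = np.array([100.0, 130.0, 170.0])      # distinct dither frequencies (assumed)
amp = 1e-2                                   # small dither amplitude (assumed)
v0 = np.array([0.3, -0.7, 1.1])              # current drive settings (assumed)

# All three drives are dithered at the same time, each at its own frequency.
reference = np.sin(2 * np.pi * freqs[None, :] * t[:, None])   # shape (samples, 3)
signal = cost(v0 + amp * reference)                            # shape (samples,)

# Lock-in style demodulation: correlate with each reference and average.
# The factor 2 / amp converts the in-phase amplitude into a derivative estimate.
grad_lockin = 2.0 * (signal[:, None] * reference).mean(axis=0) / amp
```

A negative entry in `grad_lockin` corresponds to the antiphase response described above. All of the estimates come from a single record of the detected signal.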
An important point is that, if we keep all the different frequencies within one octave (i.e., between a lower frequency ωlow and an upper frequency ωhigh<2ωlow), then any intermodulation signals (from sums and differences of the different frequencies) and any higher order derivative signals (e.g., second derivatives would show up at twice the modulation frequency) will lie outside the octave, and the measurements are then essentially non-interacting.
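The one-octave condition can be checked for a candidate set of dither frequencies with a short sketch such as the following. The helper names and the example frequency sets are hypothetical. The reasoning it encodes is that sums and second harmonics are at least 2ωlow, which exceeds ωhigh, while differences are at most ωhigh − ωlow, which is below ωlow.

```python
from itertools import combinations

def spurious_tones(freqs):
    """Second harmonics plus pairwise sum and difference frequencies."""
    tones = {2 * f for f in freqs}
    for f1, f2 in combinations(freqs, 2):
        tones.add(f1 + f2)
        tones.add(abs(f1 - f2))
    return tones

def within_one_octave(freqs):
    """True if the highest frequency is below twice the lowest."""
    return max(freqs) < 2 * min(freqs)

def second_order_clean(freqs):
    """True if no second-order product falls inside [min(freqs), max(freqs)]."""
    lo, hi = min(freqs), max(freqs)
    return all(not (lo <= s <= hi) for s in spurious_tones(freqs))

good = [100.0, 130.0, 170.0]   # spans less than one octave
bad = [100.0, 130.0, 230.0]    # spans more than one octave; 100 + 130 = 230
```

With these definitions, `good` passes both checks, while `bad` fails because several second-order products land on or between its dither frequencies.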
Just how many different controllable elements we would drive at once with such different frequencies is a matter of engineering choice. We do not have to drive all the controllable elements at the same time, and can work in groups (e.g., one mesh at a time) if that is more convenient, but we see that we can work with two or more controllable elements being varied simultaneously, and we can measure simultaneously the required gradients associated with those two or more controllable elements.
So the net result of this approach is that, in one “forward” process, we can simultaneously yet independently measure the gradient of the cost function with respect to variations in multiple ones of the drives of the elements in the mesh. Hence, we can greatly speed up a key process required for optimizing such networks.
In using such a network that works by interference of light inside the mesh units, the input vector preferably includes mutually coherent fields at the same polarization. We can generate such vectors using an “optical setup machine” (OSM). Such an OSM 502 is shown in
Technically, a method like that in
So, with this OSM, we can mathematically choose some vector |X0⟩, then simply calculate the necessary phases in the Mach-Zehnder interferometers in the OSM, program those in by choosing the corresponding required drive voltages for the phase shifters in the OSM, turn on the input light source power Ps, and thereby generate the required corresponding vector |X0⟩ of modal amplitudes in the input guides to the network.
It can be desirable in making measurements to be able to look at modulations at relatively high frequencies, such as megahertz, or even gigahertz, frequencies. By increasing the modulation frequency range, we widen the available frequency band and so allow a larger set of modulated signals to be applied and detected simultaneously. Another equally important advantage is that at higher frequencies, low-frequency noise sources, such as “1/f” (“one over f”) noise, well known to those skilled in the art, are relatively unimportant.
However, some controllable elements, such as those based on thermal changes in properties, such as phase shifters whose phase shift is controlled by temperature, or some micromechanical approaches whose phase shifts or coupling depend on physical displacements, may have restrictions on how fast they can make controllable changes. Other ways of making high-speed elements are known, including electro-optical effects and materials, though making large changes in necessary properties, such as refractive index, can be difficult with such effects. It is necessary to be able to make large changes in phase shifts or coupling strengths in setting up the elements in the network so it performs its desired function, in this approach, but the changes we want to make at the modulation frequencies to measure gradients need not be large.
So, we can consider a “two stage” controllable phase shifter element, with two phase shifter elements optically in series, as illustrated as phase shifter A and phase shifter B in
Conveniently, one of these two phase shifting elements can be a high-speed phase shifting element that is only required to make small changes in phase delay (phase shifter B in the example in
The high-speed, low-amplitude phase shifters can be implemented in a variety of ways. One possible basic structure is shown in
The electrooptical material shown in the cross section can be a polymer, an organic crystal (e.g., a liquid crystal (LCD)), a semiconductor or an inorganic crystal or any other electrooptic material (i.e., a material that changes its index of refraction in response to an electric field). If an electrooptic polymer is used, it can be “poled” to give it the electrooptic effect by either using corona poling or by using the electrode shown in the cross section to apply the poling field. If a semiconductor is used, the electrooptic effect can be created in many ways including the plasma effect, bandgap effects and quantum well effects. Electrooptic organic crystals come in many different forms, including Liquid Crystals (LCDs). Many inorganic crystalline materials, including Lithium Niobate (LiNbO3), exhibit the electrooptic effect and can be used in the configuration.
In an alternative implementation, we use an “optooptic” material (i.e., a material that changes its refractive index as a function of an applied optical field). The optooptic phase shifter could be addressed by laser beams pointed at the phase shifters.
The fast phase shifters can be placed in a multitude of ways in each MZI (Mach Zehnder Interferometer) as shown in
The phase shift induced by the fast phase shifter will typically not depend on the voltage V applied to the slow (or “static”) phase shifter. However, the voltage change δvθ required to make a given small phase change δθ in the slow phase shifter may vary as we change V. To use such a series phase shifter approach, we should preferably know this required voltage change δvθ as a function of V so that we know how much voltage change δvθ we should apply to get a phase change equal to (or proportional to) the actual phase change being made in the fast phase shifter. We can calibrate this δvθ as a function of V once for a phase shifter, and use this information in calculating the required change in V to implement a given change in phase shift in setting the network.
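The calibration step can be sketched as a lookup of the local slope dθ/dV of the slow phase shifter. This is an illustration under stated assumptions: the quadratic phase response `measured_phase`, its coefficient, and the calibration range are hypothetical (a thermal phase shifter with heating power proportional to V² is assumed), and a real device would be characterized by measurement.

```python
import numpy as np

K_RAD_PER_V2 = 0.8   # hypothetical coefficient of the assumed quadratic response

def measured_phase(V):
    """Hypothetical calibration measurement: phase of the slow shifter at voltage V."""
    return K_RAD_PER_V2 * V ** 2

# One-time calibration sweep over the usable voltage range (assumed range).
V_cal = np.linspace(0.5, 3.0, 251)
theta_cal = measured_phase(V_cal)
slope_cal = np.gradient(theta_cal, V_cal)    # numerical d(theta)/dV at each point

def delta_v_for(delta_theta, V):
    """Voltage change giving a small phase change delta_theta at operating point V."""
    slope = np.interp(V, V_cal, slope_cal)   # interpolate the calibrated slope
    return delta_theta / slope

dv = delta_v_for(1e-3, 2.0)                  # voltage step for a 1 mrad phase step at 2 V
```

The stored slope table is computed once per phase shifter and reused whenever a phase update has to be converted into a change in V.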
For this scheme, we need a measuring unit at the end to be able to measure the cost function relatively easily, and to be able to detect changes in that function as we modulate elements in the network. Just what apparatus we need for this depends on the cost function. One option is an apparatus for the mean-squared cost function. We also discuss schemes for measuring another cost function, which we call here an “orthogonal” cost function, in more detail.
One possible cost function is a “mean squared” cost function
In using this cost function, generally in optimizing we are trying to minimize its value in some way.
A vector of waveguide amplitudes of the form (|XL⟩ − |T⟩) could be directly generated optically from the output vector |XL⟩ of modal amplitudes from the mesh. We would use an OSM to generate a vector of waveguide amplitudes |T⟩ in a set of waveguides, using the same light source as for the input vector generator OSM so that |XL⟩ and |T⟩ are mutually coherent. Then we could use sets of 50:50 beam splitters to interfere the two vectors, element by element. One output of each beam splitter would generate the corresponding component of (|XL⟩ − |T⟩) and the other would generate the corresponding component of (|XL⟩ + |T⟩); which output carries which sign depends on the relative phase of the light beams at the power inputs of the two OSMs. We could put photodetectors on each of the (|XL⟩ − |T⟩) waveguides and add the resulting electrical signals to generate an electrical signal proportional to the mean-squared cost function, giving us our desired measured result.
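The beam splitter arrangement can be checked numerically with a short sketch. The random vectors and the particular 50:50 phase convention (difference and sum ports, each scaled by 1/√2) are assumptions made for illustration; the proportionality constant in an actual measurement depends on the cost function's definition and the detector responsivity.

```python
import numpy as np

rng = np.random.default_rng(0)
x_out = rng.normal(size=4) + 1j * rng.normal(size=4)    # stand-in for |XL> amplitudes
target = rng.normal(size=4) + 1j * rng.normal(size=4)   # stand-in for |T> amplitudes

# Ideal 50:50 beam splitters acting element by element (assumed phase convention).
diff_port = (x_out - target) / np.sqrt(2)
sum_port = (x_out + target) / np.sqrt(2)

# Photodetectors on the difference ports measure power; the signals are then added.
detected = np.sum(np.abs(diff_port) ** 2)

# Squared distance between the output and target vectors, for comparison.
squared_distance = np.vdot(x_out - target, x_out - target).real
```

In this convention the summed detector signal is half the squared distance between the two vectors, and the power in both sets of ports together equals the total input power.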
Another cost function would be what we could call an “orthogonal” cost function. We take the power in some convenient units in a given vector of mode amplitudes |Q⟩ to be
PQ = ⟨Q||Q⟩ ≡ ⟨Q|Q⟩ (2)
where ⟨Q| (a row vector with complex-conjugated elements) is the Hermitian adjoint of |Q⟩. In general, we can presume in optics that the power is proportional to the modulus squared of the electric field amplitude, at least when comparing fields in the same cross-section of waveguides made in the same material, so with |Q⟩ representing the electric field mode amplitudes in the guides, Eq. (2) gives the power in some convenient units.
For the same mathematical vectors |XL⟩ and |T⟩ as above, with our understanding of the power in a given vector as in Eq. (2), let us first formally define normalized versions of them (so “unit power” versions).
We can now define our orthogonal cost function as
This kind of cost function has a straightforward meaning. Essentially we are projecting out only the component of |XL⟩ that is orthogonal to |T⟩ and measuring its power P⊥. Possibly, a normalized version of this cost function is more useful, i.e., dividing by the total output power, which we take to be
PXL = ⟨XL|XL⟩ (6)
we have
This cost function is the fraction of the output power from the mesh that is in a vector orthogonal to |T⟩. Reducing the normalized orthogonal cost function C⊥N moves the vector |XL⟩ towards being in the same “direction” as |T⟩, independent of the power of either of these. In using either of these cost functions, C⊥ or C⊥N, generally in optimization we are trying to minimize them.
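The verbal description of the orthogonal cost function (the power in the component of the output that is orthogonal to the target, optionally divided by the total output power) can be sketched numerically as follows. The function names and example vectors are hypothetical, and the code follows the description in the text rather than any specific equation form.

```python
import numpy as np

def power(q):
    """Power <Q|Q> of a vector of mode amplitudes, as in Eq. (2)."""
    return np.vdot(q, q).real

def orthogonal_cost(x_out, target):
    """Power in the component of x_out that is orthogonal to target."""
    t_n = target / np.sqrt(power(target))   # unit-power version of the target
    overlap = np.vdot(t_n, x_out)           # <T_N|X_L>
    return power(x_out - overlap * t_n)

def normalized_orthogonal_cost(x_out, target):
    """Fraction of the output power that is orthogonal to target."""
    return orthogonal_cost(x_out, target) / power(x_out)

x_out = np.array([1.0, 2.0j, 0.0, 1.0])     # assumed output amplitudes
target = np.array([0.0, 1.0, 0.0, 0.0])     # assumed target: second waveguide only
p_perp = orthogonal_cost(x_out, target)
frac_perp = normalized_orthogonal_cost(x_out, target)
```

Because of the normalization, rescaling either the output or the target leaves `normalized_orthogonal_cost` unchanged, which matches the statement that the normalized cost is independent of the power in either vector.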
If the direct goal of the output of the system is to categorize into “probabilities” that the input vector corresponds to one type of object or another, then the normalized orthogonal cost function C⊥N is a reasonable cost function. Indeed, that gives a very simple way to understand how to work with this cost function. Suppose in our simple 4-waveguide example that for the normalized target vectors we have the following simple mapping
So, the power in the top output waveguide gives the network's relative judgement that the input represents a cat, that in the second that the input represents a dog, that in the third that it represents an apple, and that in the fourth that it represents a flower. Then, the construction of the apparatus to measure C⊥N in each case is very straightforward, as is illustrated in
In general, detectors DT1 to DT4 give electrical signals that are proportional to power; for the sake of definiteness, let us say these electrical signals proportional to power are voltages (we could also use currents, which are actually somewhat simpler to collect from reverse biased photodiodes as signals proportional to power). For simplicity, we will just refer to these electrical signals as “powers” (though they represent optical powers, not necessarily electrical ones).
Here, our choice of target vector is limited to just the four possibilities, labelled cat, dog, apple or flower. The “insertion” of the mathematical target vector into the measuring unit just corresponds to depressing the corresponding selection switch. In
Hence, for a network and measuring unit like this that is intended to perform the actual characterization, this output signal C⊥N is just the fractional weight the system gives to categorizing the original input signal as whatever “target” option was chosen (here “dog”).
Electronic circuits to implement the additions are very straightforward (e.g., with simple operational amplifier circuits). Division is also relatively straightforward with some analog electronic circuits. The “divisor” power PXL only needs to be present as a time-averaged number—it does not need to follow the modulations of the signal by the various possible drive frequencies, so it can be essentially an averaged and relatively constant voltage that is fed into an appropriate divider circuit. As an alternative to performing an actual division, we could simply adjust the input source power to the mesh to keep the time-averaged total power signal PXL essentially constant in magnitude, and such an overall feedback control of power could be straightforward to implement.
If we want to have an unrestricted choice of target vectors |T⟩ (so, not just cat, dog, apple and flower) we can use a system as in
In this case, we use what we can call an “optical vector extractor” (OVE) 904. This OVE 904 is similar in many ways to the OSM 502 of
Having calibrated all the phase shifters, we can now turn off the power PC. We calculate the settings for the phase shifters so that, if the input optical vector was some specific one of interest, say a vector |T⟩, then all of the power associated with that vector would pass into the “top” output port (here with the detector DT). (Note that in this case we will end up setting the lowest Mach-Zehnder interferometer so that it functions only as a phase shifter; if there were still any power from the calibration light source, it would just be dumped into detector DC.) Any remaining light would pass into the detectors DP1, DP2 and DP3 giving electrical signals that can be summed to give P⊥. Similar electrical summation and division as in
Machine learning training protocols typically involve optimizing the cost function over a large set of training examples. In ‘supervised learning’ protocols, these may be represented by input and target output pairs. When training the optical hardware on a machine learning task using gradient-based optimization, one therefore must compute the average gradient over a subset or “batch” of training examples, and use this “batch gradient” to update the phase shifters representing the weight matrices of the model. The backpropagation procedure in the literature and the scheme discussed above for parallelization of the measurement of gradients by use of different modulation frequencies both offer ways of computing and/or measuring the gradients of the cost function with respect to multiple controllable elements, but, as discussed so far, they perform this computation for just one such training example at a given time.
However, the same procedure may be extended to a set of training examples by using the measuring apparatus or devices to generate the desired averages over such training examples. Preferably, all the training examples in a given such set are intended to optimize the same cost function for the same target vector |T⟩. Two basic approaches to such averaging include time-division multiplexing (TDM) and wavelength-division multiplexing (WDM). In such averaging approaches, when averaging over N training examples or vectors, the required number of measurements can be reduced by a factor of order N.
In a TDM approach, each training example or vector is sent sequentially into the network. The resulting detected signal from gradient measurements from the cost function and measurement apparatus as described above, or from measurements made inside the network in back-propagation schemes, is then averaged over multiple such training vectors by the measurement process. Hence the averaging is performed as part of the overall measurement process, and the results of that averaging measurement process are then used to calculate or deduce the corresponding averaged gradient or gradients of the cost function for the controllable element or elements in the network.
In a preferred embodiment of this TDM approach to averaging, the training examples or vectors are fed in at a much faster rate than the response time of the measurement system. For example, the rate at which different training examples or vectors are fed in may be chosen preferably to be much faster than any of the modulation frequencies ωlq. Then the resulting measurements of gradients deduced from measurements of the modulation of cost function at such modulation frequencies ωlq will directly give a useful measure of the corresponding gradient averaged over such a set of training examples.
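The effect of fast time-division multiplexing on a lock-in style measurement can be illustrated with a toy simulation. All specifics are assumptions: `example_cost` is a hypothetical per-example cost for a single drive, and the sample rate, dither frequency, and switching rate are arbitrary. Training examples are cycled at 5 kHz, much faster than the 100 Hz dither, and a single demodulation then yields the gradient averaged over the examples.

```python
import numpy as np

offsets = np.array([0.0, 0.4, 0.9, 1.5])   # hypothetical: one parameter per training example

def example_cost(v, n):
    """Hypothetical cost for training example n at drive value v."""
    return np.sin(v + offsets[n]) ** 2

fs = 100_000                       # sample rate in Hz (assumed)
t = np.arange(fs) / fs             # one second of samples
f_dither = 100.0                   # dither frequency in Hz (assumed)
amp = 1e-2                         # small dither amplitude (assumed)
v0 = 0.3                           # current drive setting (assumed)

hold = 5                           # samples per example, so the 4-example cycle repeats at 5 kHz
example_index = (np.arange(fs) // hold) % offsets.size

reference = np.sin(2 * np.pi * f_dither * t)
signal = example_cost(v0 + amp * reference, example_index)   # detector sees examples in turn

# A single demodulation at the dither frequency averages over all examples.
grad_tdm = 2.0 * np.mean(signal * reference) / amp

# Average of the per-example analytic gradients, for comparison only.
grad_mean = np.mean(np.sin(2 * (v0 + offsets)))
```

The averaging happens inside the measurement itself, so no per-example gradients need to be recorded or summed afterwards.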
In a WDM approach, we can exploit the fact that, if networks of MZIs are fabricated with substantially equal optical path lengths for all different optical paths from the inputs to a given point in the network, then the network may have substantially similar response and behavior for optical wavelengths within a substantially wide range. We could expect such similarity of behavior over wavelength ranges of tens of nanometers around some underlying wavelength of the order of 1550 nm, for example, in a network designed with such substantial equality of optical path lengths. Then we can send in multiple training vectors at the same time, each on a different wavelength within such a substantially wide range. Such a WDM set of training vectors can be combined substantially without loss using WDM combining techniques known to those skilled in the art, including waveguide grating routers, for example. When such signals at multiple different wavelengths arrive at optical detectors, either within the network as in back-propagation schemes, or outside the network at some cost function or measuring unit, as long as the wavelengths are sufficiently different, the detector will simply add the signals corresponding to each of the different wavelengths, hence averaging the signals over the set of training vectors. A criterion for the wavelengths being sufficiently different is that the beat frequency that would result from interfering different wavelengths is much higher than the electrical measurement bandwidth of the photodetector.
Since wavelengths separated by nanometer scales in, say, the vicinity of 1550 nm free-space wavelength have carrier frequencies that differ by 100 gigahertz or more, the resulting beat frequencies are similarly 100 gigahertz or more. Such beat frequencies generally lie well above the electrical bandwidth of photodetectors, which can be readily engineered using circuit design or other means to lie in ranges of megahertz or low gigahertz frequencies, for example.
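The beat frequency estimate can be reproduced with a few lines of arithmetic. The 1 nm channel spacing and the 1 GHz detector bandwidth below are assumed example values, not figures from the text.

```python
C_M_PER_S = 299_792_458.0   # speed of light in vacuum

def beat_frequency_hz(wavelength_1_m, wavelength_2_m):
    """Difference between the optical carrier frequencies of two free-space wavelengths."""
    return abs(C_M_PER_S / wavelength_1_m - C_M_PER_S / wavelength_2_m)

# Two channels 1 nm apart near 1550 nm (assumed spacing).
beat = beat_frequency_hz(1550e-9, 1551e-9)

detector_bandwidth_hz = 1e9                     # hypothetical detector electrical bandwidth
well_separated = beat > 100 * detector_bandwidth_hz
```

For this spacing the beat frequency comes out somewhat above 100 GHz, so a detector with a bandwidth in the megahertz to low gigahertz range responds only to the sum of the per-wavelength powers.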
We can optionally usefully modulate the power or other attribute, such as phase, of the optical input vector at some other frequency ωs. This can be accomplished by modulating the input light source power PS as in
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/US2020/049913 | 9/9/2020 | WO |
Number | Date | Country | |
---|---|---|---|
62897657 | Sep 2019 | US |