This disclosure generally relates to self-sufficient and standalone artificial intelligence (AI) devices requiring no support from any backend servers, and particularly relates to self-sufficient AI devices including a multilayer convolutional neural network.
Deep learning models that are trained and deployed with convolutional neural networks (CNNs) may include many convolutional layers, pooling layers, rectification layers, and fully connected layers, and generally require millions of trained model parameters for processing complex input data such as images, speech, and natural language. Deployment of such a model thus requires a massive amount of memory for storing the model parameters and intermediate calculation results, and further relies on large-scale parallel processing along multiple computation paths through layers of neurons. As such, trained CNN models are traditionally deployed in powerful backend servers equipped with a combination of processors and coprocessors, such as Graphics Processing Units (GPUs) with large graphics memories. An edge device, such as a mobile phone or any other special-purpose device (e.g., an Internet-of-Things (IoT) device), seeking AI services may only need to transmit the necessary input data to, and receive processing outcomes from, the backend servers. Placing core AI functions completely within an edge device has been difficult without new processing and memory architectures.
This disclosure is directed to AI edge devices that do not need support from any backend servers. Further objects, features, and advantages of this invention will become readily apparent to persons of ordinary skill in the art after a review of the following description, with reference to the drawings and claims that are appended to and form a part of this specification.
In one implementation, a self-contained device is disclosed. The device may include a convolutional neural network (CNN) logic circuit; a plurality of non-volatile random access memory cells embedded with the CNN logic circuit on a same semiconductor substrate and storing a full set of trained parameters for a CNN model containing multiple neurons; a sensor; an actuator circuitry; a program memory storing instructions; and a microcontroller unit in communication with the program memory, the sensor, the CNN logic circuit, the plurality of non-volatile random access memory cells, and the actuator circuitry. The microcontroller unit, when executing the instructions in the program memory, may be configured to: cause the sensor to detect a signal according to an external stimulus; process the detected signal to obtain a processed data set and communicate the processed data set to the CNN logic circuit; instruct the CNN logic circuit to read trained parameters from the plurality of non-volatile random access memory cells and to forward propagate the processed data set via multiple propagation paths through the multiple neurons in parallel to obtain output label data for the processed data set; process the output label data into a control signal; and control the actuator circuitry according to the control signal.
In the devices above, the plurality of non-volatile random access memory cells may include magnetic random access memory cells (MRAM cells).
In any of the devices above, at least one of the MRAM cells includes a spin torque transfer type of MRAM cell. In any of the devices above, the MRAM cells may be of at least two different cell sizes. In any of the devices above, the MRAM cells may be arranged with at least two different pitches. In any of the devices above, a read access time for the plurality of non-volatile random access memory cells by the CNN logic circuit may be faster than 5 nanoseconds.
In any of the devices above, the plurality of non-volatile random access memory cells may be programmed with the full set of trained parameters for the CNN model at one of a wafer level, a chip level, or a printed circuit board level.
In any of the devices above, the sensor may include an image sensor and the processed data set may include at least one two dimensional array of pixel values.
Any of the devices above may further include an optical lens assembly for imaging an object field external to the device onto the image sensor. In any of the devices above, the image sensor may include a CMOS active sensor matrix. In any of the devices above, the image sensor may be integrated on the same semiconductor substrate for the CNN logic circuit with the plurality of non-volatile random access memory cells.
In any of the devices above, the CMOS active sensor matrix may be fabricated over the plurality of non-volatile random access memory cells. In any of the devices above the plurality of non-volatile random access memory cells may be fabricated over the CNN logic circuit and the CNN logic circuit may be fabricated over the same semiconductor substrate.
In any of the devices above, the CNN logic circuits and the plurality of non-volatile random access memory cells may be fabricated on different areas of the same semiconductor substrate, and the CMOS active sensor matrix may be fabricated over the CNN logic circuits and the plurality of non-volatile random access memory cells.
In any of the devices above, the plurality of non-volatile random access memory cells may include MRAM cells.
In any of the devices above, the plurality of non-volatile random access memory cells comprises MRAM cells and static random access memory (SRAM) cells.
In any of the devices above, the plurality of non-volatile random access memory cells comprises MRAM cells and resistive random access memory (RRAM) cells.
In any of the devices above, the plurality of non-volatile random access memory cells comprises MRAM cells and phase change random access memory (PCRAM) cells.
In any of the devices above, the plurality of non-volatile random access memory cells comprises MRAM cells and at least a plurality of one time programmable (OTP) memory cells.
In another implementation, another self-contained AI device is disclosed. The device may include a convolutional neural network (CNN) logic circuit; a memory comprising a plurality of non-volatile MRAM cells, the memory storing a set of instructions and a full set of trained parameters for a CNN model containing multiple neurons; a sensor; an actuator circuitry; and a microcontroller unit in communication with the memory, the sensor, the CNN logic circuit, and the actuator circuitry. The microcontroller unit, when executing the set of instructions in the memory, may be configured to: cause the sensor to detect a signal according to an external stimulus; process the detected signal to obtain a processed data set and communicate the processed data set to the CNN logic circuit; instruct the CNN logic circuit to read trained parameters from the plurality of non-volatile MRAM cells and to forward propagate the processed data set via multiple propagation paths through the multiple neurons in parallel to obtain output label data for the processed data set; process the output label data into a control signal; and control the actuator circuitry according to the control signal.
In the device above, the plurality of non-volatile MRAM cells may be programmed with the full set of trained parameters for the CNN model at one of a wafer level, a chip level, or a printed circuit board level.
In any of the devices above, the sensor may include an image sensor and the processed data set may include at least one two dimensional array of pixel values. In any of the devices above, the image sensor may include a CMOS active sensor matrix. In any of the devices above, the image sensor may be integrated on a same semiconductor substrate for the CNN logic circuit.
Artificial intelligence techniques have been widely used for processing large amounts of input data to extract categorical information. These techniques, in turn, may then be incorporated into a wide range of applications to perform various intelligent tasks. For example, deep learning techniques based on convolutional neural networks (CNNs) may provide trained CNN models for processing particular types of input data. For example, a CNN model trained for classifying images may be used to analyze an input image and determine a category of the input image among a predetermined set of image categories. As another example, a CNN model may be trained to produce a segmentation of an input image in the form of, e.g., an output segmentation mask. Such a segmentation mask, for example, may be designed to indicate where human faces are, and the CNN model may be further trained to recognize a segmented human face among a known set of human faces.
A deep learning CNN model may typically contain multiple cascading convolutional, pooling, rectifying, and fully connected layers of neurons, with millions of weight and bias parameters. These parameters may be determined by training the model using a sufficient collection of labeled input data. Once a CNN model is trained and the model parameters are determined, it may be used to process unknown input data and to predict labels for the unknown input data. These labels may be classifications, segmentation masks, or any other type of label for the input data.
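For illustration only, the cascade of convolutional, rectifying, pooling, and fully connected layers described above may be sketched in software as follows. This is a minimal single-channel sketch and not the disclosed logic circuit; the function names (conv2d_valid, relu, max_pool_2x2, fully_connected, predict_label), the 2x2 pooling window, and the unit stride are assumptions made for the example.

```python
from typing import List

Matrix = List[List[float]]


def conv2d_valid(image: Matrix, kernel: Matrix, bias: float = 0.0) -> Matrix:
    """Stride-1 convolution without padding (cross-correlation, as in most CNNs)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            bias + sum(
                image[r + i][c + j] * kernel[i][j]
                for i in range(kh)
                for j in range(kw)
            )
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def relu(fmap: Matrix) -> Matrix:
    """Rectification layer: negative activations are clamped to zero."""
    return [[max(0.0, v) for v in row] for row in fmap]


def max_pool_2x2(fmap: Matrix) -> Matrix:
    """2x2 max pooling with stride 2; a trailing odd row or column is dropped."""
    return [
        [
            max(fmap[r][c], fmap[r][c + 1], fmap[r + 1][c], fmap[r + 1][c + 1])
            for c in range(0, len(fmap[0]) - 1, 2)
        ]
        for r in range(0, len(fmap) - 1, 2)
    ]


def fully_connected(features: List[float], weights: Matrix, biases: List[float]) -> List[float]:
    """One output score per weight row: dot(row, features) + bias."""
    return [sum(w * x for w, x in zip(row, features)) + b for row, b in zip(weights, biases)]


def predict_label(image: Matrix, kernel: Matrix, fc_weights: Matrix, fc_biases: List[float]) -> int:
    """Forward propagate one image and return the index of the highest score."""
    fmap = max_pool_2x2(relu(conv2d_valid(image, kernel)))
    flat = [v for row in fmap for v in row]
    scores = fully_connected(flat, fc_weights, fc_biases)
    return max(range(len(scores)), key=scores.__getitem__)
```

In this sketch the kernel, the fully connected weights, and the biases play the role of the trained model parameters, and the returned index plays the role of a classification label.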
In a training process of a CNN model, each of a large number of labeled training data sets is forward propagated through layers of neurons of the CNN network embedded with the training parameters to calculate an end labeling loss. Back propagation is then performed through the layers of neurons to adjust the training parameters to reduce the labeling loss based on gradient descent. The forward/back propagation training process for all training input data sets iterates until the neural network produces a set of training parameters that provide converging minimal overall loss for the labels predicted by the neural network over labels given to the training data sets. A converged model then includes a final set of training parameters and may then be tested and used to process unlabeled input data sets via forward propagation. Such a CNN model typically must be of sufficient size in terms of number of layers and number of neurons/features in each layer for achieving acceptable predictive accuracy. The number of training parameters is directly correlated with the size of the neural network, and is typically extraordinarily large even for a simple AI model (on the order of millions, tens of millions, hundreds of millions, or thousands of millions of parameters). The forward and back propagations thus require a massive amount of memory to hold these parameters and extensive computation power for iteratively calculating states of a massive number of neurons.
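As a rough illustration of how the parameter count grows with network size, the following sketch counts weights and biases for convolutional and fully connected layers and converts the total into a storage requirement. The layer dimensions and the 4-byte parameter width are hypothetical values chosen for the example and are not taken from this disclosure.

```python
def conv_params(kernel_size: int, in_channels: int, out_channels: int) -> int:
    """Weights plus one bias per output channel for a square-kernel convolutional layer."""
    return kernel_size * kernel_size * in_channels * out_channels + out_channels


def fc_params(in_features: int, out_features: int) -> int:
    """Weights plus one bias per output neuron for a fully connected layer."""
    return in_features * out_features + out_features


def parameter_bytes(param_count: int, bytes_per_param: int = 4) -> int:
    """Storage needed for the parameters (4 bytes per parameter is an assumption)."""
    return param_count * bytes_per_param


# Hypothetical three-layer network, chosen only for illustration.
total = (
    conv_params(3, 3, 64)              # 3x3 kernels, 3 -> 64 channels
    + conv_params(3, 64, 128)          # 3x3 kernels, 64 -> 128 channels
    + fc_params(128 * 7 * 7, 1000)     # flattened 7x7x128 features -> 1000 outputs
)
print(total, parameter_bytes(total))
```

Even in this small hypothetical network the single fully connected layer accounts for more than six million parameters, which is consistent with the orders of magnitude discussed above.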
In addition, a large working memory may also be needed during training or deployment of a CNN model for holding a large amount of intermediate calculation results, such as feature maps at various convolutional layers. This working memory may be reusable and shared by non-parallel neurons or layers during forward and back propagations, and thus may be frequently written and read.
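The benefit of reusing working memory across non-parallel layers may be estimated with the following sketch. It assumes, for the example only, a strictly sequential chain of layers with no skip connections, so that only the input and output feature maps of the layer currently being computed must coexist. The function names and feature map shapes are hypothetical.

```python
from typing import List, Tuple

Shape = Tuple[int, int, int]  # (height, width, channels)


def feature_map_bytes(shape: Shape, bytes_per_value: int = 1) -> int:
    """Size of one feature map; 1 byte per value is an assumption."""
    h, w, c = shape
    return h * w * c * bytes_per_value


def peak_working_memory(shapes: List[Shape], bytes_per_value: int = 1) -> int:
    """Peak working memory when buffers are reused between sequential layers.

    shapes lists the network input followed by each layer's output, in order.
    """
    sizes = [feature_map_bytes(s, bytes_per_value) for s in shapes]
    return max(a + b for a, b in zip(sizes, sizes[1:]))


def working_memory_without_reuse(shapes: List[Shape], bytes_per_value: int = 1) -> int:
    """Working memory if every feature map were kept in its own buffer."""
    return sum(feature_map_bytes(s, bytes_per_value) for s in shapes)
```

For example, with shapes (32, 32, 3), (32, 32, 16), (16, 16, 16), and (16, 16, 32), the reused working memory peaks at 20480 bytes, compared with 31744 bytes when no buffer is reused. The trade-off is that the reused region is written and read at every layer.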
The training process for a CNN model is thus typically handled by centralized or distributed backend servers having sufficient memory and computing power in order to train the CNN model in a reasonable amount of time. These calculations may be performed by special co-processors included in the backend servers that are based on parallel data processing. For example, a Graphics Processing Unit (GPU) with large embedded memory or with external memory connected to the GPU core via high speed data buses may be included in the backend servers and used to accelerate the forward/back propagations in neural networks, thanks to similarity in parallel data manipulation between graphics data and neural networks.
Once trained, a CNN model may be deployed in the backend servers and provided as a service, taking advantage of the memory capacity and the parallel computing power of the backend servers. The service would include forward propagating an input data set through the layers of neurons of the trained CNN model to obtain an output label for the input data set. Such a service may be provided to edge devices. Edge devices may include but are not limited to mobile phones and any other devices, such as Internet-of-Things (IoT) devices. These devices may be designed to handle limited tasks with limited computing power and memory capacity, and are thus incapable of efficiently performing forward propagation locally. As such, these edge devices may communicate with the backend servers via communication network interfaces to provide input data sets to the backend servers and obtain labels for the input data sets from the backend servers after the input data sets are processed by the CNN model in the backend servers.
In many applications, local processing of the input data may be desired. For example, when an input data set is large (e.g., high-resolution 2D or 3D images), transmission of the input data set from the edge device to the backend servers may consume an unacceptable or unsupported level of communication bandwidth and/or power. Further, some edge devices may have only intermittent communication network connection or no communication network connection at all.
One implementation of an edge device capable of storing a CNN model and locally processing input data via forward propagation through a locally stored neural network is illustrated in
The MCU 120 acts as a central control unit of the edge device 100. Specifically, the MCU 120 may execute instructions stored in the program memory 130 to control other components in the edge device 100 to perform the functions of the entire edge device 100. The program memory 130, for example, may be a non-volatile Read-Only Memory (ROM) and may be programmed when the program memory circuitry is fabricated at the wafer level or at the chip level. Alternatively, the instructions may be loaded into the program memory 130 after the edge device 100 is manufactured via the optional programming interface 140. The instructions may be loaded into the program memory 130 as, e.g., firmware via the programming interface 140. In addition, the instructions loaded into the program memory 130 may be upgradable by erasing and rewriting the content of the program memory 130 via the programming interface 140. As such, the program memory may be implemented as an erasable and reprogrammable memory, such as EPROM.
The sensor and sensor circuitry 112 may be used to detect and monitor external contextual data in real-time or under the control of the MCU 120. Depending on the application of the edge device 100, the detected external environmental data may include but are not limited to images, voices, environmental temperature, humidity, barometric pressure, latitude, device orientation, device motion, and lighting level. As such, the sensor may be implemented, for example, as an image sensor, a microphone, a thermometer, a hygrometer, a barometer, a GPS sensor, a gyroscope, or an optical detector. Peripheral components for the sensor 112 may be further included in the edge device 100. For example, when the sensor 112 includes an image sensor (e.g., a CCD or CMOS sensor), peripheral optics including imaging lenses may be included for creating optical images on the image sensor for detection. Signals detected by the sensor/sensor circuitry 112, if analog, may be further converted into digital form and processed by the MCU 120 into a form compatible with a data set that may be processed by the AI engine 110.
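The conversion of an analog sensor signal into a data set suitable for the AI engine may be sketched as follows. This is an illustrative software model only: the 8-bit resolution, the reference voltage, the normalization to the range 0 to 1, and the function names adc_quantize and to_frame are assumptions made for the example.

```python
from typing import List


def adc_quantize(voltage: float, v_ref: float = 3.3, bits: int = 8) -> int:
    """Map an analog voltage in [0, v_ref] to an unsigned integer code.

    Out-of-range inputs are clamped; v_ref and bits are assumed values.
    """
    clamped = min(max(voltage, 0.0), v_ref)
    levels = 1 << bits
    return min(int(clamped / v_ref * levels), levels - 1)


def to_frame(codes: List[int], width: int, bits: int = 8) -> List[List[float]]:
    """Arrange a flat stream of codes into a 2D array of values in [0, 1]."""
    if width <= 0 or len(codes) % width != 0:
        raise ValueError("code count must be a positive multiple of the frame width")
    full_scale = (1 << bits) - 1
    return [
        [code / full_scale for code in codes[start:start + width]]
        for start in range(0, len(codes), width)
    ]
```

A stream of four sampled voltages may thus be turned into a 2x2 array of pixel values by calling to_frame([adc_quantize(v) for v in samples], width=2).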
The processed sensor data may then be provided to the AI engine 110 for forward propagation under the general control of the MCU 120. The AI engine 110 may be embedded with memory for storing trained model parameters and any intermediate data that may need to be stored during the forward propagation process. The memory for storing model parameters (such as parameters for the convolutional layers and fully connected layers with hidden layers) and the working memory for storing intermediate calculation results such as feature maps may be the same type or different types of memories, as will be described in more detail below.
The trained model parameters may be loaded into the embedded memory of the AI engine 110 at the time of manufacturing of the AI engine chip, at the time of manufacturing the edge device 100, or via the programming interface 140 and the MCU 120. If needed, the trained model parameters may be updated by reloading a new version of training parameters into the embedded memory of the AI engine 110 via the programming interface 140 and the MCU 120. The output of the AI engine may be a predictive label for the input data set. Such an output may be provided to the actuator/actuation logic 116 to provide actuation of a desired control. Alternatively, the output label of the AI engine 110 may be converted into a control signal 190 or into a signal that is transmitted into the communication network via the network interface 160. Alternatively, the output of the AI engine may be processed by the MCU 120 and the processed data may then be communicated to the actuator and actuation logic circuit 116. The actuator 116 may be used for producing a desired action according to the outcome of the AI engine 110. The actuation performed by the actuator 116 may be of any type including but not limited to electric, mechanical, thermal, magnetic, and hydraulic. In some implementations, the actuator may be external to the edge device 100. In that case, the actuator circuitry 116 may provide a signal 190, and the edge device may transmit the actuator signal 190 to the external actuator via the network interface 160.
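The overall flow coordinated by the MCU, from sensing through forward propagation to actuation, may be summarized by the following sketch. It is a software analogy under stated assumptions and not the firmware of the edge device 100: the function run_inference_cycle and its five callable arguments are hypothetical stand-ins for the sensor, the MCU preprocessing, the AI engine, the label-to-control conversion, and the actuator circuitry.

```python
from typing import Callable, List

Frame = List[List[float]]


def run_inference_cycle(
    read_sensor: Callable[[], Frame],
    preprocess: Callable[[Frame], Frame],
    forward_propagate: Callable[[Frame], int],
    label_to_control: Callable[[int], str],
    actuate: Callable[[str], None],
) -> str:
    """Run one sense -> infer -> actuate cycle and return the control signal."""
    raw = read_sensor()                        # sensor detects an external stimulus
    data_set = preprocess(raw)                 # MCU conditions the detected signal
    label = forward_propagate(data_set)        # AI engine produces an output label
    control_signal = label_to_control(label)   # label is mapped to a control signal
    actuate(control_signal)                    # actuator circuitry acts on the signal
    return control_signal
```

Because every stage is passed in as a callable, the same loop may be exercised with stub functions on a workstation before the stages are bound to actual hardware drivers.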
The arrows and lines in
The edge device 100 of
The edge device 100 may include more than one AI model. As such, the embedded memory of the edge device may store parameters for multiple AI models and may function as working memory for forward propagation of the multiple AI models. For example, the edge device may be used to detect both images and voices and control the actuator based on both image and voice recognition. As such, the edge device 100 may include at least two different AI models (e.g., different CNN models) including at least one model for image analysis and recognition and another model for speech analysis and voice recognition. The embedded memory for the edge device 100 thus would be configured to hold parameters for both models. The embedded memory may further function as working memory for both models.
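One way to reason about holding several models in a single embedded memory is to plan a simple memory map, as in the sketch below. It assumes, for the example only, that each model receives its own contiguous parameter region and that the models run one at a time, so a single working region sized for the most demanding model can be shared. The function plan_memory, the region name "working", and the byte counts used with it are hypothetical.

```python
from typing import Dict, Tuple

Region = Tuple[int, int]  # (offset_bytes, size_bytes)


def plan_memory(models: Dict[str, Tuple[int, int]], capacity: int) -> Dict[str, Region]:
    """Lay out per-model parameter regions followed by one shared working region.

    models maps a model name to (parameter_bytes, working_bytes).
    Raises ValueError if the layout does not fit within capacity.
    """
    regions: Dict[str, Region] = {}
    offset = 0
    for name, (param_bytes, _working_bytes) in models.items():
        regions[name] = (offset, param_bytes)
        offset += param_bytes
    # Shared working region: large enough for the most demanding model.
    shared_working = max(working for _params, working in models.values())
    if offset + shared_working > capacity:
        raise ValueError("models do not fit in the embedded memory")
    regions["working"] = (offset, shared_working)
    return regions
```

For instance, plan_memory({"image": (1000, 300), "voice": (500, 200)}, capacity=2000) places the image parameters at offset 0, the voice parameters at offset 1000, and a 300-byte shared working region at offset 1500. If both models had to run concurrently, the working region would instead need to hold the sum of their working requirements.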
As discussed above, processing of input data by a CNN model usually requires a large amount of memory (for model parameters and for intermediate calculation results) and parallel processing capability for forward propagation. In real-time applications, there may be further processing speed requirement that places stringent limitation on the communication speed between the logic circuits of the CNN and the memory for storing the training parameters and intermediate calculation results. Further, the memory for storing the training parameters of the CNN model is preferably non-volatile, as it serves as the only repository for these parameters in the edge device 100. The working memory for storing intermediate results may be preferably fast and durable as the working memory may be frequently written and read. In the implementation for the edge device 100 of
As illustrated in
In some implementations, as shown by 310 of
In some other implementations, as shown in 320 of
In some other implementations, as shown in 330 of
In yet some other implementations, as shown in 340 of
Implementations of RRAM in OTP configuration or any other configurations are described in U.S. patent application Ser. No. 15/989,515 by the same applicant as the current application, which is herein incorporated by reference in its entirety.
Embedding memory cells with the CNN logic circuits 220 may be implemented as shown in
Alternatively, as shown in
In
The description and accompanying drawings above provide specific example embodiments and implementations. Drawings containing circuit and system layouts, cross-sectional views, and other structural schematics, for example, are not necessarily drawn to scale unless specifically indicated. Subject matter may, however, be embodied in a variety of different forms and, therefore, covered or claimed subject matter is intended to be construed as not being limited to any example embodiments set forth herein. A reasonably broad scope for claimed or covered subject matter is intended. Among other things, for example, subject matter may be embodied as methods, devices, components, or systems. Accordingly, embodiments may, for example, take the form of hardware, software, firmware or any combination thereof.
Throughout the specification and claims, terms may have nuanced meanings suggested or implied in context beyond an explicitly stated meaning. Likewise, the phrase “in one embodiment/implementation” as used herein does not necessarily refer to the same embodiment and the phrase “in another embodiment/implementation” as used herein does not necessarily refer to a different embodiment. It is intended, for example, that claimed subject matter includes combinations of example embodiments in whole or in part.
In general, terminology may be understood at least in part from usage in context. For example, terms, such as “and”, “or”, or “and/or,” as used herein may include a variety of meanings that may depend at least in part on the context in which such terms are used. Typically, “or” if used to associate a list, such as A, B or C, is intended to mean A, B, and C, here used in the inclusive sense, as well as A, B or C, here used in the exclusive sense. In addition, the term “one or more” as used herein, depending at least in part upon context, may be used to describe any feature, structure, or characteristic in a singular sense or may be used to describe combinations of features, structures or characteristics in a plural sense. Similarly, terms, such as “a,” “an,” or “the,” may be understood to convey a singular usage or to convey a plural usage, depending at least in part upon context. In addition, the term “based on” may be understood as not necessarily intended to convey an exclusive set of factors and may, instead, allow for existence of additional factors not necessarily expressly described, again, depending at least in part on context.
Reference throughout this specification to features, advantages, or similar language does not imply that all of the features and advantages that may be realized with the present solution should be or are included in any single implementation thereof. Rather, language referring to the features and advantages is understood to mean that a specific feature, advantage, or characteristic described in connection with an embodiment is included in at least one embodiment of the present solution. Thus, discussions of the features and advantages, and similar language, throughout the specification may, but do not necessarily, refer to the same embodiment.
Furthermore, the described features, advantages and characteristics of the present solution may be combined in any suitable manner in one or more embodiments. One of ordinary skill in the relevant art will recognize, in light of the description herein, that the present solution can be practiced without one or more of the specific features or advantages of a particular embodiment. In other instances, additional features and advantages may be recognized in certain embodiments that may not be present in all embodiments of the present solution.
From the foregoing, it can be seen that this disclosure provides a semiconductor chip architecture including logic circuits embedded with various types of memories for improving memory access speed and reducing power consumption. In particular, memories of distinct types embedded with logic circuits on a same semiconductor substrate are disclosed. These memories may include static random access memory, magnetoresistive random access memory, and various types of resistive random access memory. These different types of memories may be combined to form an embedded memory subsystem that provides distinct memory persistency, programmability, and access characteristics tailored for storing different types of data in, e.g., applications involving convolutional neural networks.