This patent document relates generally to systems and methods for providing artificial intelligence solutions. Examples of training a convolution neural network model in an artificial intelligence semiconductor solution are provided.
Artificial intelligence (AI) semiconductor solutions include using embedded hardware in an AI integrated circuit (IC) to perform AI tasks. Hardware-based solutions, as well as software solutions, still encounter the challenges of obtaining an optimal AI model, such as a convolutional neural network (CNN) for the hardware. For example, if the weights of a CNN model are trained outside the chip, they are usually stored in floating point. When the weights of a CNN model in floating point are loaded into an AI chip they usually lose data bits from quantization, for example, from 16- or 32-bits to 1- to 8-bits. The loss of data bits in an AI chip compromises the performance of the AI chip due to lost information and data precision. Further, existing training methods are often performed in a high performance computing environment, such as in a central processing unit (CPU) or graphics processing unit (GPU), without accounting for hardware constraints in a physical AI chip. This often causes performance degradation when an AI model trained in a CPU/GPU is loaded into an AI chip.
The present solution will be described with reference to the following figures, in which like numerals represent like items throughout the figures.
As used in this document, the singular forms “a”, “an”, and “the” include plural references unless the context clearly dictates otherwise. Unless defined otherwise, all technical and scientific terms used herein have the same meanings as commonly understood by one of ordinary skill in the art. As used in this document, the term “comprising” means “including, but not limited to.”
An example of “artificial intelligence logic circuit” and “AI logic circuit” includes a logic circuit that is configured to execute certain AI functions such as a neural network in AI or machine learning tasks. An AI logic circuit can be a processor. An AI logic circuit can also be a logic circuit that is controlled by an external processor and executes certain AI functions.
Examples of “integrated circuit,” “semiconductor chip,” “chip,” and “semiconductor device” include integrated circuits (ICs) that contain electronic circuits on semiconductor materials, such as silicon, for performing certain functions. For example, an integrated circuit be a microprocessor, a memory, a programmable array logic (PAL) device, an application-specific integrated circuit (ASIC), or others. An AI integrated circuit may include an integrated circuit that contains an AI logic circuit.
Examples of “AI chip” include hardware- or software-based device that is capable of performing functions of an AI logic circuit. An AI chip may be a physical IC. For example, a physical AI chip may include an embedded cellular neural network (CeNN), which may contain weights and/or parameters of a CNN. The AI chip may also be a virtual chip, i.e., software-based. For example, a virtual AI chip may include one or more processor simulators to implement functions of a desired AI logic circuit.
Examples of “AI model” include data containing one or more parameters that, when loaded inside an AI chip, are used for executing the AI chip. For example, an AI model for a given CNN may include the weights, biases, and other parameters for one or more convolutional layers of the CNN. Here, the weights and parameters of an AI model are interchangeable.
In
In some examples, the communication system 100 may be a centralized system. System 100 may also be a distributed or decentralized system, such as a peer-to-peer (P2P) system. For example, a host device, e.g., 110, 112, 114, and 116, may be a node in a P2P system. In a non-limiting example, a client devices, e.g., 120a, 120b, 120c, and 120d may include a processor and an AI physical chip. In another non-limiting example, multiple AI chips may be installed in a host device. For example, host device 116 may have multiple AI chips installed on one or more PCI boards in the host device or in a USB cradle that may communicate with the host device. Host device 116 may have access to dataset 156 and may communicate with one or more AI chips via PCI board(s), internal data buses, or other communication protocols such as universal serial bus (USB).
In some scenarios, the AI chip may contain an AI model for performing certain AI tasks. Executing an AI chip or an AI model may include causing the AI chip (hardware- or software-based) to perform an AI task based on the AI model inside the AI chip and generate an output. Examples of an AI task may include image recognition, voice recognition, object recognition, data processing and analyzing, or any recognition, classification, processing tasks that employ artificial intelligence technologies. In some examples, an AI training system may be configured to include a forward propagation neural network, in which information may flow from the input layer to one or more hidden layers of the network to the output layer. An AI training system may also be configured to include a backward propagation network to fine tune the weights of the AI model based on the output of the AI chip. An AI model may include a CNN that is trained to perform voice or image recognition tasks. A CNN may include multiple convolutional layers, each of which may include multiple parameters, such as weights and/or other parameters. In such case, an AI model may include parameters of the CNN model.
In some examples, a CNN model may include weights, such as masks and scalars for one or more convolution layers of the CNN model. In a non-limiting example, in a CNN model, a computation in a given layer in the CNN may be expressed by Y=w*X+b, where X is input data, Y is output data, w is a kernel, and h is a bias; all variables are relative to the given layer. Both the input data and the output data may have a number of channels. Operation “*” is a convolution. Kernel w may include binary values. For example, a kernel may include 9 cells in a 3×3 mask, where each cell may have a binary value, such as “1” and “−1.” In such case, a kernel may be expressed by multiple binary values in the 3×3 mask multiplied by a scalar. In other examples, for some or all kernels, each cell may be a signed 2 or 8 bit integer. Other bit length or values may also be possible.
The scalar may include a value having a bit width, such as 12-bit or 16-bit. Other bit length may also be possible. Alternatively, and/or additionally, a kernel may contain data with non-binary values, such as 7-value. The bias b may contain a value having multiple bits, such as 18 bits. Other bit length or values may also be possible. For example, an output channel of a CNN layer may include one or more bias values that, when added to the output of the output channel, adjust the output values to a desired range. In a non-limiting example, the output Y may be further discretized into a signed 5-bit or 10-bit integer constrained by the activation layer of the CNN. Other bit length or values may also be possible. In some examples, a kernel in a CNN layer may be represented by a mask that has multiple values in lower precision multiplied by a scalar in higher precision. In some examples, a CNN model may include other parameters.
In the case of physical AI chip, the AI chip may include an embedded cellular neural network that has memory containing the multiple parameters in the CNN. In some scenarios, the memory in a physical AI chip may be a one-time-programmable (OTP) memory that allows a user to load a CNN model into the physical AI chip once. Alternatively, a physical AI chip may have a random access memory (RAM), magnetoresistive random access memory (MRAM), or other types of memory that allows a user o update and load a CNN model into the physical AI chip multiple times.
In the case of virtual AI chip, the AI chip may include a data structure that simulates the cellular neural network in a physical AI chip. In other examples, a virtual AI chip may directly execute an AI logic circuit without needing to simulate a physical AI chip. A virtual AI chip may be particularly advantageous when higher precision is needed, or when there is a need to compute layers that cannot be accommodated by a physical AI chip.
In the case of a hybrid AI chip, part of an AI logic circuit may be computed using a physical AI chip, while the remainder can be computed with a virtual chip. In a non-limiting example, the physical AI chip may include all convolutional, MaxPool, and some of the ReLU layers in a CNN model, while the virtual AI chip may include the remainder of the CNN model. In some examples, a host device may compute one or more layers of a CNN before sending the output to a physical AI chip. In some examples, the host device may use the output of a physical AI chip to generate output of an AI task. For example, a host device may receive the output of convolution layers of a CNN from a physical AI chip and perform the operations of fully connected layers.
With further reference to
In some examples, quantizing the weights at 204 may include converting the weights from floating points to fixed points for uploading to an AI chip. Quantizing the weights at 204 may include quantizing the weights according to the one or more quantization levels. In some examples, the number of quantization levels may correspond to the hardware constraint of the AI chip so that the quantized weights can be uploaded to the AI chip for execution. For example, the AI chip may include an embedded CeNN. In the embedded CeNN, the weights may include 1-bit (binary value), 2-bit, or other suitable bits, such as 5-bit. In case of 1-bit, the number of quantization levels will be two. In some scenarios, quantizing the weights to 1-bit may include determining a threshold to properly separate the weights into two groups: one below the threshold and one above the threshold, where each group takes one value, such as {1, −1}. In some examples, quantizing the weights into two quantization levels may include a uniform quantization in which a threshold may be determined at the middle of the range of the weight values, such as zero. In such case, the weights having positive values may be quantized to a value of 1 and weights having a negative or zero value may be quantized to a value of −1. In some examples, determining the threshold for quantization may be based on the values of the weights to be quantized or the distribution of the weights.
In
Returning to
The process 200 may further include determining a change of weights at 210 based on the output of the CNN model. In some examples, the output of the CNN model may be the output of the activation layer of the CNN. The process 200 may further update the weights of the CNN model at 212 based on the change of weights. The process may repeat updating the weights of the CNN model in one or more iterations. In some examples, blocks 208-212 may be implemented using a gradient descent method. For example, a loss function may be defined as:
where Yi′ is the ground truth of ith training instance, and Yi is the prediction of the network, e.g., the output of the CNN based on the ith training instance. In other words, the loss function H( ) may be defined based on a sum of loss values over a plurality of training instances in the training data set, wherein the loss value of each of the plurality of training instances is a difference between an output of the CNN model for the training instance and a ground truth of the training instance. In some examples, the prediction Yi in the cost function may be calculated by a softmax classifier in the CNN model.
In some examples, the gradient descent may be used to determine a change of weight
ΔW=f(WQt)
by minimizing the loss function H( ), where WQt stands for the quantized weights at time t. The process may update the weights from a previous iteration based on the change of weight, e.g., Wt+1=Wt+ΔW, where Wt and Wt+1 stand for the weights in a preceding iteration and the weights in the current iteration, respectively. In some examples, the weights (or updated weights) in each iteration, such as Wt and Wt+1, may be stored in floating point. The quantized weights WQt at each iteration t may be stored in fixed point. In some examples, the gradient descent may include known methods, such as a stochastic gradient descent method.
With further reference to
In some examples, blocks 204-214 may be repeated iteratively in a layer-by-layer fashion for multiple convolution layers in a CNN model. In such case, the weights are updated in each of the multiple convolution layers of the CNN model. Upon the completion of iterations for all of the multiple convolution layers, the process 200 may proceed with uploading the quantized weights of the multiple convolution layers of the CNN model at 216.
Once the trained quantized weights are uploaded to the AI chip, the process 200 may further include executing the AI chip to perform an AI task at 218 in a real-time application, and outputting the result of the AI task at 220. In training the weights of the CNN model, the AI chip may be a physical AI chip, a virtual AI chip or a hybrid AI chip, and the AI chip may be configured to execute the CNN model based on the trained weights. The AI chip may be residing in any suitable computing device, such as a host or a client shown in
In a non-limiting example, the trained weights of a CNN model may be uploaded to the AI chip. For example, the quantized weights may be uploaded to an embedded CeNN of the AI chip so that the AI chip may be capable of performing an AI task, such as recognizing one or more classes of object from an input image, e.g., a cry and a smiley face. In an example application, an AI chip may be installed in a camera and store the trained weights and/or other parameters of the CNN model, such as those quantized weights generated in the process 200. The AI chip may be configured to receive a captured image from the camera, perform an image recognition task based on the captured image and the stored CNN model, and transmit the recognition result to an output display. For example, the camera may display, via a user interface, the recognition result. In a face recognition application, the CNN model may be trained for face recognition. A captured image may include one or more facial images associated with one or more persons. The recognition result may include the names associated with each input facial image. The AI chip may transmit the recognition result to the camera, which may present the output of the recognition result on a display. For example, the user interface may display a person's name next to or overlaid on each of the input facial image associated with the person.
It is appreciated that the disclosures of various embodiments in
An optional display interface 530 may permit information from the bus 500 to be displayed on a display device 535 in visual, graphic, or alphanumeric format. An audio interface and audio output (such as a speaker) also may be provided. Communication with external devices may occur using various communication ports 540 such as a transmitter and/or receiver, antenna, an RFID tag and/or short-range, or near-field communication circuitry. A communication port 540 may be attached to a communications network, such as the Internet, a local area network, or a cellular telephone data network.
The hardware may also include a user interface sensor 545 that allows for receipt of data from input devices 550 such as a keyboard, a mouse, a joystick, a touchscreen, a remote control, a pointing device, a video input device, and/or an audio input device, such as a microphone. Digital image frames may also be received from an imaging capturing device 555 such as a video or camera that can either be built-in or external to the system. Other environmental sensors 560, such as a GPS system and/or a temperature sensor, may be installed on system and communicatively accessible by the processor 505, either directly or via the communication ports 540. The communication ports 540 may also communicate with the AI chip to upload or retrieve data to/from the chip. For example, a trained AI model with updated quantized weights obtained from process 200 may be shared by one or more processing devices on the network running other training processes or AI applications. A device on the network may receive the trained AI model from the network and upload the trained weights, to an AI chip for performing an AI task via the communication port 540 and an SDK (software development kit). The communication port 540 may also communicate with any other interface circuit or device that is designed for communicating with an integrated circuit.
Optionally, the hardware may not need to include a memory, but instead programming instructions are run on one or more virtual machines or one or more containers on a cloud. For example, the various methods illustrated above may be implemented by a server on a cloud that includes multiple virtual machines, each virtual machine having an operating system, a virtual disk, virtual network and applications, and the programming instructions for implementing various functions in the robotic system may be stored on one or more of those virtual machines on the cloud.
Various embodiments described above may be implemented and adapted to various applications. For example, the AI chip having a CeNN architecture may be residing in an electronic mobile device. The electronic mobile device may use the built-in AI chip to produce recognition results and generate performance values. In some scenarios, training the CNN model can be performed in the mobile device itself, where the mobile device retrieves training data from a dataset and uses the built-in AI chip to perform the training. In other scenarios, the processing device may be a server device in the communication network (e.g., 102 in
The various systems and methods disclosed in this patent document provide advantages over the prior art, whether implemented standalone or combined. For example, using the systems and methods described in
It will be readily understood that the components of the present solution as generally described herein and illustrated in the appended figures could be arranged and designed in a wide variety of different configurations. Thus, the detailed description of various implementations, as represented herein and in the figures, is not intended to limit scope of the present disclosure, but is merely representative of various implementations. While the various aspects of the present solution are presented in drawings, the drawings are not necessarily drawn to scale unless specifically indicated.
The present solution may be embodied in other specific forms without departing from its spirit or essential characteristics. The described embodiments are to be considered in all respects only as illustrative and not restrictive. The scope of the present solution is, therefore, indicated by the appended claims rather than by this detailed description. All changes which come within the meaning and range of equivalency of the claims are to be embraced within their scope.
Reference throughout this specification to features, advantages, or similar language does not imply that all of the features and advantages that may be realized with the present solution should be or are in any single embodiment thereof. Rather, language referring to the features and advantages is understood to mean that a specific feature, advantage, or characteristic described in connection with an embodiment is included in at least one embodiment of the present solution. Thus, discussions of the features and advantages, and similar language, throughout the specification may, but do not necessarily, refer to the same embodiment.
Furthermore, the described features, advantages, and characteristics of the present solution may be combined in any suitable manner in one or more embodiments. One ordinarily skilled in the relevant art will recognize, in light of the description herein, that the present solution can be practiced without one or more of the specific features or advantages of a particular embodiment. In other instances, additional features and advantages may be recognized in certain embodiments that may not be present in all embodiments of the present solution.
Other advantages can be apparent to those skilled in the art from the foregoing specification. Accordingly, it will be recognized by those skilled in the art that changes, modifications, or combinations may be made to the above-described embodiments without departing from the broad inventive concepts of the invention. It should therefore be understood that the present solution is not limited to the particular embodiments described herein, but is intended to include all changes, modifications, and all combinations of various embodiments that are within the scope and spirit of the invention as defined in the claims.
This application claims the benefit under 35 U.S.C. 119 of the earlier filing date of U.S. Provisional Application 62/830,269 entitled “TRAINING AN ARTIFICIAL INTELLIGENCE MODEL IN A SEMICONDUCTOR SOLUTION”, filed Apr. 5, 2019. The aforementioned provisional application is hereby incorporated by reference in its entirety, for any purpose.
Number | Date | Country | |
---|---|---|---|
62830269 | Apr 2019 | US |